Creating site maps
Clicking around a live site won't reveal all the nooks and crannies of a legacy website. Often we find directories in the root of an installation serving up other files, or entirely separate systems. Redirects may be in place, or higher-level traffic direction via reverse proxy configurations. The brass tacks of all these configurations can, for the most part, be understood by spidering the website and creating a sitemap: a report of all the "pages" and files (PDFs, movies, images) that are accessible as unique links on a website. These details are especially important in the context of any site migration.
Spidering a website with wget
wget is a command-line tool. If you're on a Mac, install it with Homebrew.
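Assuming Homebrew is already set up, the install is a one-liner:

$ brew install wget

We can then use wget to traverse the URLs on a site and write them to a file: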
# spider the site and write to a file
wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://www.mysite.org

# filter out the actual URLs and create a new file
sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&amp;@\&@" > sedlog.txt

# sort the list and keep only unique lines as another file
sort sedlog.txt | uniq > urls.txt
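The resulting urls.txt is the raw material for the sitemap. A couple of quick, optional sanity checks (the file names are the ones produced above) give a feel for what the crawl picked up:

# how many unique URLs the spider found
$ wc -l urls.txt
# and, for example, how many of them are PDFs
$ grep -ci '\.pdf$' urls.txt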
Be careful when using wget not to tank a website that can't handle the load. Consider passing some flags:
# spider the site politely and write to a file
wget --spider --recursive --no-verbose --wait=1 --limit-rate=128k --output-file=wgetlog.txt http://www.mysite.org
The --wait option introduces a pause of the given number of seconds between download attempts, and --limit-rate caps how much of the server's bandwidth you are sucking up. Both are good ideas if you don't want to be blacklisted by the server's admin.
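wget also has a --random-wait flag that varies the pause between requests, which makes the crawl look a little less like an automated hammering. A sketch combining the throttling options (the wait time and rate here are just illustrative values):

# politer spidering: randomized pauses of roughly 1-3 seconds between requests, bandwidth capped at 128 KB/s
wget --spider --recursive --no-verbose \
     --wait=2 --random-wait --limit-rate=128k \
     --output-file=wgetlog.txt http://www.mysite.org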
Advanced wget
There's an expanded version of this technique at http://www.ashleysheridan.co.uk/coding/bash/Generating+sitemaps+for+large+sites that allows us to:
- generate separate lists for both mobile and desktop versions of the site.
- log in through HTTP authentication set up on the server (both are sketched below).
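Both tricks boil down to a couple of extra wget flags. A rough sketch, where the mobile user-agent string and the credentials are placeholders to swap for your own:

# spider the mobile version of the site through HTTP basic auth
# (the user-agent string, username and password are placeholders)
wget --spider --recursive --no-verbose \
     --user-agent="Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X)" \
     --http-user=myuser --http-password=mypassword \
     --output-file=wgetlog-mobile.txt http://www.mysite.org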
Downloading an entire website with wget
This example shows how to download a full subdirectory. To grab the whole site instead, change the URL to www.mysite.org/ and drop the --no-parent flag.
$ wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --adjust-extension \
    --convert-links \
    --domains mysite.org \
    --no-parent \
    www.mysite.org/path/to/stuff/
- --recursive: download the entire Web site.
- --domains mysite.org: don't follow links outside mysite.org.
- --no-parent: don't follow links above the starting directory path/to/stuff/.
- --page-requisites: get all the elements that compose the page (images, CSS and so on).
- --html-extension: save files with the .html extension.
- --adjust-extension: the newer name for --html-extension; adds a suitable extension (.html or .css) to filenames depending on their content-type.
- --convert-links: convert links so that they work locally, off-line (see the example after this list).
- --restrict-file-names=windows: modify filenames so that they will work in Windows as well.
- --no-clobber: don't overwrite any existing files (used in case the download is interrupted and resumed).
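Once the download finishes, wget leaves the mirror in a directory named after the host, and thanks to --convert-links it can be browsed entirely offline. For example, on a Mac (assuming the crawl produced an index.html at that path):

# open the local copy in the default browser (macOS)
$ open www.mysite.org/path/to/stuff/index.html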
HTTrack
http://www.httrack.com is also a great option, made more specifically for this application.
$ brew install httrack
Then you can use it on the command line:
$ httrack http://mysite.org [any options] -O /path/to/destination/mymirror.org -v
and there are prompts, a wizard, and more; see HTTrack's command-line documentation for the full option list.
This is a great way of building a complete mirror of a website, say for backup purposes if things go down, or to deliver an archived, static version of an antiquated CMS-powered website. Some useful options:
- --priority=7: get HTML files first, then treat the other files
- --robots=0: ignore robots directives
- --cache=C1: prioritize the cache of already downloaded assets
- --extra-log: log more things
- --assume: deliberately map file extensions to a mime type
- -i: restart where we left off
$ httrack http://eecs.berkeley.edu --robots=0 --cache=C1 --extra-log --priority=7 --assume shtml=text/html -O . -v -i
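Once a mirror is on disk (say the mymirror.org destination from the first httrack example above), serving the archived, static copy is easy since it's just a directory of files. One low-effort sketch, assuming Python 3 is installed:

# serve the archived copy locally at http://localhost:8000
$ cd /path/to/destination/mymirror.org && python3 -m http.server 8000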
Other Tools
Scrapy
Scrapy, a Python scraping framework, can do the same kind of crawl with a short spider that prints every internal URL it finds:

import scrapy

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [URL]

    def parse(self, response):
        # pull every link off the page; prepend the domain to relative links
        for url in response.xpath('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print(url)
            yield scrapy.Request(url, callback=self.parse)
Save this in a file called spider.py.
You can then use a shell pipeline to post-process the output:
$ scrapy runspider spider.py > urls.txt
$ cat urls.txt | grep 'example.com' | sort | uniq | grep -v '#' | grep -v 'mailto' > url-report.txt
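If the goal is an actual XML sitemap rather than a flat list, the cleaned-up url-report.txt can be wrapped in the sitemap.org schema with a few more lines of shell. A rough sketch, assuming the URLs contain no characters that need XML escaping (such as &):

# wrap each URL from url-report.txt in a minimal sitemap.xml
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  while read -r url; do
    echo "  <url><loc>$url</loc></url>"
  done < url-report.txt
  echo '</urlset>'
} > sitemap.xml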
Online sitemap generators
xml-sitemaps.com - up to 500 links for free, outputs in HTML, TXT, XML
Another free tool at http://tools.seochat.com/tools/online-crawl-google-sitemap-generator will do 100 links for free.