Creating site maps
Clicking around a live site won't reveal all the nooks and crannies of a legacy website. Often we find directories in the root of an installation serving up other files, or entirely separate systems. Redirects may be in place, or higher-level traffic direction via reverse proxy configurations. The brass tacks of all these configurations can, for the most part, be understood by spidering the website and creating a sitemap: a report of all the "pages" and files (PDFs, movies, images) that are accessible as unique links on a website. These details are especially important in the context of any site migration.
Spidering a website with wget
wget is a command-line tool. If you're on a Mac, install it with Homebrew.
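Assuming Homebrew is already set up, the install is a one-liner:

$ brew install wget

We can then use wget to traverse the URLs on a site and write them to a file: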
# spider the site and write to a file
wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://www.mysite.org

# filter out the actual URLs and create a new file
sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&amp;@\&@" > sedlog.txt

# sort the list and keep only unique lines as another file
sort sedlog.txt | uniq > urls.txt
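The resulting urls.txt is the raw material for the sitemap. A couple of quick, optional sanity checks (the file names are the ones produced above) give a feel for what the crawl picked up:

# how many unique URLs the spider found
$ wc -l urls.txt
# and, for example, how many of them are PDFs
$ grep -ci '\.pdf$' urls.txt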
Be careful when using wget not to tank a website that can't handle the load. Consider passing some flags:
# spider the site politely and write to a file
wget --spider --recursive --no-verbose --wait=1 --limit-rate=128k --output-file=wgetlog.txt http://www.mysite.org
The --wait option introduces a pause of the given number of seconds between download attempts, and --limit-rate caps how much of the server's bandwidth you are sucking up. Both are good ideas if you don't want to be blacklisted by the server's admin.
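wget also has a --random-wait flag that varies the pause between requests, which makes the crawl look a little less like an automated hammering. A sketch combining the throttling options (the wait time and rate here are just illustrative values):

# politer spidering: randomized pauses of roughly 1-3 seconds between requests, bandwidth capped at 128 KB/s
wget --spider --recursive --no-verbose \
     --wait=2 --random-wait --limit-rate=128k \
     --output-file=wgetlog.txt http://www.mysite.org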
Advanced wget
There's an expanded version of this technique at http://www.ashleysheridan.co.uk/coding/bash/Generating+sitemaps+for+large+sites that allows us to:
- generate separate lists for both mobile and desktop versions of the site.
- log in through HTTP authentication set up on the server (both are sketched below).
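Both tricks boil down to a couple of extra wget flags. A rough sketch, where the mobile user-agent string and the credentials are placeholders to swap for your own:

# spider the mobile version of the site through HTTP basic auth
# (the user-agent string, username and password are placeholders)
wget --spider --recursive --no-verbose \
     --user-agent="Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X)" \
     --http-user=myuser --http-password=mypassword \
     --output-file=wgetlog-mobile.txt http://www.mysite.org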
Downloading an entire website with wget
This example shows how to download a full subdirectory. To grab the whole site instead, change the URL to www.mysite.org/ and drop the --no-parent flag.
$ wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --adjust-extension \
    --convert-links \
    --domains mysite.org \
    --no-parent \
    www.mysite.org/path/to/stuff/
- --recursive: download the entire Web site.
- --domains mysite.org: don't follow links outside mysite.org.
- --no-parent: don't follow links above the starting directory path/to/stuff/.
- --page-requisites: get all the elements that compose the page (images, CSS and so on).
- --html-extension: save files with the .html extension.
- --adjust-extension: the newer name for --html-extension; adds a suitable extension (.html or .css) to filenames depending on their content-type.
- --convert-links: convert links so that they work locally, off-line (see the example after this list).
- --restrict-file-names=windows: modify filenames so that they will work in Windows as well.
- --no-clobber: don't overwrite any existing files (used in case the download is interrupted and resumed).
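Once the download finishes, wget leaves the mirror in a directory named after the host, and thanks to --convert-links it can be browsed entirely offline. For example, on a Mac (assuming the crawl produced an index.html at that path):

# open the local copy in the default browser (macOS)
$ open www.mysite.org/path/to/stuff/index.html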
HTTrack
http://www.httrack.com is also a great option, made more specifically for this application.
$ brew install httrack
Then you can use it on the command line:
$ httrack http://mysite.org [any options] -O /path/to/destination/mymirror.org -v
and there are prompts, a wizard, and more; see HTTrack's command-line documentation for the full option list.
This is a great way of building a complete mirror of a website, say for backup purposes if things go down, or to deliver an archived, static version of an antiquated CMS-powered website. Some useful options:
- --priority=7: get HTML files first, then treat the other files
- --robots=0: ignore robots directives
- --cache=C1: prioritize the cache of already downloaded assets
- --extra-log: log more things
- --assume: deliberately map file extensions to a mime type
- -i: restart where we left off
$ httrack http://eecs.berkeley.edu --robots=0 --cache=C1 --extra-log --priority=7 --assume shtml=text/html -O . -v -i
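Once a mirror is on disk (say the mymirror.org destination from the first httrack example above), serving the archived, static copy is easy since it's just a directory of files. One low-effort sketch, assuming Python 3 is installed:

# serve the archived copy locally at http://localhost:8000
$ cd /path/to/destination/mymirror.org && python3 -m http.server 8000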
Other Tools
Scrapy
Scrapy, a Python scraping framework, can do the same kind of crawl with a short spider that prints every internal URL it finds:

import scrapy

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [URL]

    def parse(self, response):
        # pull every link off the page; prepend the domain to relative links
        for url in response.xpath('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print(url)
            yield scrapy.Request(url, callback=self.parse)
Save this in a file called spider.py.
You can then use a shell pipeline to post-process the output:
$ scrapy runspider spider.py > urls.txt
$ cat urls.txt | grep 'example.com' | sort | uniq | grep -v '#' | grep -v 'mailto' > url-report.txt
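If the goal is an actual XML sitemap rather than a flat list, the cleaned-up url-report.txt can be wrapped in the sitemap.org schema with a few more lines of shell. A rough sketch, assuming the URLs contain no characters that need XML escaping (such as &):

# wrap each URL from url-report.txt in a minimal sitemap.xml
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  while read -r url; do
    echo "  <url><loc>$url</loc></url>"
  done < url-report.txt
  echo '</urlset>'
} > sitemap.xml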
Online sitemap generators
xml-sitemaps.com - up to 500 links for free, outputs in HTML, TXT, XML
Another free tool at http://tools.seochat.com/tools/online-crawl-google-sitemap-generator will do 100 links for free.