Clicking around a live site won't reveal all the nooks and crannies of a legacy website. Often we find directories in the root of an installation serving up other files, or entirely separate systems. Redirects may be in place, or higher-level traffic direction via reverse proxy configuration. Getting down to brass tacks, most of these configurations can be understood by spidering the website and creating a sitemap: a report of all the "pages" and files (PDFs, movies, images) that are accessible as unique links on the website. These details are super important in the context of any site migration.
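
Before spidering, it can help to spot-check how individual URLs are being redirected. A quick sketch using curl (not part of the toolchain below; the URL is a placeholder) prints the status line and headers for each hop in the redirect chain:

Code Block
# -s silences progress output, -I requests headers only, -L follows redirects
# each hop's HTTP status and Location header will be printed
curl -sIL http://www.mysite.org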

...

Code Block
# spider the site and write to a file
wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://www.mysite.org


# filter out the actual URLs and create a new file
sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&amp;@\&@g" > sedlog.txt


# sort the list and get only unique lines as another file
sort sedlog.txt | uniq > urls.txt 

(warning) Be careful when using wget not to hammer a website that can't handle the load; consider passing some throttling flags:

Code Block
# spider the site and write to a file
wget --spider --recursive --no-verbose --limit-rate=128k --output-file=wgetlog.txt http://www.mysite.org
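
If the site is fragile, a gentler variant (the flag values here are illustrative) adds a pause between requests on top of the bandwidth cap:

Code Block
# spider the site, throttle bandwidth, and wait ~1 second (randomized) between requests
wget --spider --recursive --no-verbose --limit-rate=128k --wait=1 --random-wait --output-file=wgetlog.txt http://www.mysite.org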

...

  • --recursive: download the entire Web site.
  • --domains website.org: don't follow links outside website.org.
  • --no-parent: don't ascend to the parent directory; stay within the starting directory's hierarchy.
  • --page-requisites: get all the elements that compose the page (images, CSS and so on).
  • --adjust-extension (formerly --html-extension): add suitable extensions to filenames (.html, .css) depending on their Content-Type.
  • --convert-links: convert links so that they work locally, off-line.
  • --restrict-file-names=windows: modify filenames so that they will work in Windows as well.
  • --no-clobber: don't overwrite any existing files (used in case the download is interrupted and resumed).
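
Putting several of these flags together, a full local mirror (as opposed to the spider-only crawl above) might look like the sketch below; the domain is a placeholder:

Code Block
# mirror the site locally using the options described above
wget --recursive --domains mysite.org --no-parent --page-requisites --adjust-extension --convert-links --restrict-file-names=windows --no-clobber http://www.mysite.org/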

...

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler. 

...

Here is some code to extract the unique URLs in a website:
Code Block
import scrapy

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        # collect every link on the page
        for href in response.xpath('//a/@href').getall():
            # resolve relative links against the current page URL
            url = response.urljoin(href)
            print(url)
            yield scrapy.Request(url, callback=self.parse)

Save this in a file called spider.py.

You can then use a shell pipeline to post-process this text:

Code Block
bash$ scrapy runspider spider.py > urls.txt
bash$ cat urls.txt | grep 'example.com' | sort | uniq | grep -v '#' | grep -v 'mailto' > url-report.txt
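
Since the end goal is usually a sitemap rather than a flat list, a minimal follow-on sketch (assuming url-report.txt holds one URL per line) wraps each entry in basic sitemap.xml markup:

Code Block
# build a bare-bones sitemap.xml from the URL report
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  # escape bare ampersands so the XML stays valid
  sed 's/&/\&amp;/g' url-report.txt | while read -r url; do
    echo "  <url><loc>${url}</loc></url>"
  done
  echo '</urlset>'
} > sitemap.xml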


Online sitemap generators

...


Another free tool at http://tools.seochat.com/tools/online-crawl-google-sitemap-generator will crawl on the order of 100 to 1000 links for free.

...