Clicking around a live site won't reveal all the nooks and crannies of a legacy website. Often we find directories in the root of an installation serving up other files, or entire other systems; redirects may be in place, or higher-level traffic direction via reverse-proxy configurations. The brass tacks of all these configurations can, for the most part, be understood by spidering the website and creating a sitemap: a report of all the "pages" and files (PDFs, movies, images) that are accessible as unique links on the website. These details are essential in the context of any site migration.
...
```
# spider the site and write to a file
wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://www.mysite.org

# filter the actual URLs out of the log into a new file
sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&amp;@\&@g" > sedlog.txt

# sort the list and keep only unique lines in another file
sort sedlog.txt | uniq > urls.txt
```
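If sed isn't convenient, the same log-parsing step can be sketched in Python. This assumes wget's default log format, where each fetched line contains a `URL:<address>` token followed by a space; the `extract_urls` helper name is ours, not part of wget:

```python
import re

def extract_urls(log_text):
    """Pull the URL field out of each wget --spider log line.

    Mirrors the sed pipeline above: grab the token after "URL:",
    decode the &amp; entities wget writes into its log, then
    de-duplicate and sort.
    """
    urls = []
    for line in log_text.splitlines():
        match = re.search(r' URL:(\S+) ', line)
        if match:
            urls.append(match.group(1).replace('&amp;', '&'))
    return sorted(set(urls))
```

Feed it the contents of wgetlog.txt, e.g. `extract_urls(open('wgetlog.txt').read())`, and write the result out as your urls.txt.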
Be careful when using wget not to overwhelm a website that can't handle the load; consider passing a rate-limiting flag:
```
# spider the site and write to a file, limiting bandwidth to be polite
wget --spider --recursive --no-verbose --limit-rate=128k --output-file=wgetlog.txt http://www.mysite.org
```
...
- --recursive: download the entire website recursively.
- --domains website.org: don't follow links outside website.org.
- --no-parent: don't ascend to the parent directory; stay within the starting directory.
- --page-requisites: get all the elements that compose the page (images, CSS, and so on).
- --html-extension: save files with the .html extension (deprecated alias of --adjust-extension).
- --adjust-extension: add suitable extensions to filenames (.html or .css) depending on their content type.
- --convert-links: convert links so that they work locally, off-line.
- --restrict-file-names=windows: modify filenames so that they also work on Windows.
- --no-clobber: don't overwrite any existing files (useful if the download is interrupted and resumed).
...
...
```
import scrapy

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [URL]

    def parse(self, response):
        # collect every anchor href on the page
        for href in response.xpath('//a/@href').extract():
            # resolve relative links against the current page
            url = response.urljoin(href)
            print(url)
            yield scrapy.Request(url, callback=self.parse)
```
Save this in a file called spider.py.
You can then use a shell pipeline to post-process the output:
```
bash$ scrapy runspider spider.py > urls.txt
bash$ cat urls.txt | grep 'example.com' | sort | uniq | grep -v '#' | grep -v 'mailto' > url-report.txt
```
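If the shell pipeline grows unwieldy, the same filtering can be sketched in Python. The `clean_urls` helper below is a hypothetical equivalent of the greps above: it keeps only links containing the target domain, drops fragment and mailto links, and de-duplicates:

```python
def clean_urls(lines, domain='example.com'):
    """Replicate the grep/sort/uniq pipeline in Python:
    keep domain-matching URLs, drop '#' fragments and mailto
    links, then de-duplicate and sort."""
    keep = []
    for raw in lines:
        url = raw.strip()
        if not url or domain not in url:
            continue
        if '#' in url or url.startswith('mailto:'):
            continue
        keep.append(url)
    return sorted(set(keep))
```

Run it over the spider's output with, for example, `clean_urls(open('urls.txt'))`.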
Online sitemap generators
...
Another free tool, at http://tools.seochat.com/tools/online-crawl-google-sitemap-generator, will crawl a limited number of links for free.
Related articles
...