Friday, May 20, 2011

Using wget

Full Site
To download a full website, do
wget -c -w 15 --mirror -p -k -P /path/to/dir http://my.website.com -a my.log
This will mirror the entire website (--mirror turns on recursion with no depth limit and timestamping), waiting 15 seconds between retrievals (-w 15). It downloads everything needed to display each page correctly (-p), converts links for local viewing (-k), stores the files under /path/to/dir (-P), and appends messages to my.log instead of the console (-a). If the download is interrupted, run the same command again and it will pick up where it left off: -c resumes partially downloaded files, and timestamping skips files already fetched.

Add --spider at the end to do a dry run first (pages are checked but not saved).
Add --random-wait to vary the wait time (for sites that block automated downloading).
Add --limit-rate=20k to limit download speed to 20 kilobytes per second (an example politeness limit; use your discretion).
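Putting it together, a polite dry run of the mirror might look like this (URL and paths are placeholders, as above):
wget --spider -c -w 15 --random-wait --limit-rate=20k --mirror -p -k -P /path/to/dir http://my.website.com -a my.log
Once the log looks sane, drop --spider and run it again to do the actual download.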


Small part of site
To download a small part (e.g. a given page and its first-level links) to the current directory, do
wget -c -w 3 -a my.log -r -l 1 -p -k http://somesite.com/interesting/link.html
This will download link.html and every page it links to (-r), but will not recurse further to links of links (-l 1).
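To also stop wget from ascending into parent directories, add -np (--no-parent); a sketch, using the same placeholder URL:
wget -c -w 3 -a my.log -r -l 1 -np -p -k http://somesite.com/interesting/link.html
This keeps the download confined to /interesting/ and below.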

A URL list
To download a list of files specified by URLs, do
wget -c -w 15 -i urls.txt
This will download every URL listed in urls.txt, waiting 15 seconds between downloads and resuming any partial files (-c).
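The file is just a plain list, one URL per line. For example, a hypothetical urls.txt might contain:
http://example.com/files/report.pdf
http://example.com/files/data.csv
Adding -P /path/to/dir and -a my.log works here too, exactly as in the full-site command above.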
