Sunday, May 22, 2011

Using tar and gzip

To tar a list of files/directories, do
tar -cf mytar.tar file1 file2 dir3 file4
To add more files to this, do
tar -rf mytar.tar file5 dir6 dir7
To list the contents of the tar, do
tar -tf mytar.tar
To delete files from the tar, first list the contents, then do
tar --delete -f mytar.tar file1 dir2
using the exact paths as shown in the `list contents' step.
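A small hypothetical example: if listing shows an entry dir3/notes.txt, the deletion must use that full path, not just notes.txt.
tar -tf mytar.tar
tar --delete -f mytar.tar dir3/notes.txt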

To compress the tar, do
gzip -c mytar.tar > mytar.tar.gz
To decompress the file, do
gunzip -c mytar.tar.gz > mytar.tar
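With -c, gzip and gunzip write to standard output and leave the original file untouched. If keeping the original is not needed, they can also work in place (this replaces mytar.tar with mytar.tar.gz, and vice versa):
gzip mytar.tar
gunzip mytar.tar.gz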

To extract files from the tar, do
tar -xf mytar.tar file2 dir3
To extract everything, do
tar -xf mytar.tar
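To extract into a different directory instead of the current one (the path here is just an example; the directory must already exist), do
tar -xf mytar.tar -C /tmp/restore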

To simultaneously tar and compress, do
tar -czf mytar.tar.gz file1 dir2 file3
To simultaneously decompress and extract, do
tar -xzf mytar.tar.gz
(Note: additions with tar -rf are not possible on a compressed tar file; see the workaround below.)
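A workaround sketch (file8 is a placeholder name): decompress, append, then recompress.
gunzip mytar.tar.gz
tar -rf mytar.tar file8
gzip mytar.tar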

Friday, May 20, 2011

Using wget

Full Site
To download a full website, do
wget -c -w 15 --mirror -p -k -P /path/to/dir http://my.website.com -a my.log
This will download the entire website (--mirror), waiting 15 seconds between retrievals (-w 15). It will also fetch everything needed to display each page correctly (-p), convert links for local viewing (-k), store the files in /path/to/dir (-P), and append messages to my.log instead of the console (-a). If the download is interrupted, run the same command again and it will resume where it left off (-c).

Add --spider at the end to do a dry run first.
Add --random-wait to vary the wait time (for sites that block automated downloading).
Add --limit-rate=20k to limit download speed to 20 kilobytes per second (an example politeness limit; use your discretion).
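For example, combining these extras with the mirror command above (a sketch; the URL, rate, and paths are placeholders as before):
wget -c -w 15 --random-wait --limit-rate=20k --mirror -p -k -P /path/to/dir http://my.website.com -a my.log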


Small part of site
To download a small part (e.g. a given page and its first-level links) to the current directory, do
wget -c -w 3 -a my.log -r -l 1 -p -k http://somesite.com/interesting/link.html
This will download the page link.html and the pages it links to (-r), but will not recurse further into links of links (-l 1).

A URL list
To download a list of files specified by URLs, do
wget -c -w 15 -i urls.txt
This will download all URLs listed in the file urls.txt.
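The file is a plain-text list with one URL per line, for example (hypothetical URLs):
http://somesite.com/file1.pdf
http://somesite.com/data/archive.zip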