Originally posted by MarillionFan
For example, if you wanted to snaffle a local copy of R F Streater's physics site:
wget -e robots=off --mirror -p -w 2 --convert-links -P C:\streater http://www.mth.kcl.ac.uk/~streater/
In this example the options are:
-e robots=off => ignore robots.txt settings (naughty!)
-m[irror] => mirror the site on the local drive; shorthand for -r -N -l inf --no-remove-listing, i.e. recursive download with timestamping
-p => get all the associated bits needed to display each page, like images and CSS files
-w # => pause # seconds between each download (to avoid snowing the site under with requests)
--convert-links => rewrite the links in the downloaded pages so they work locally (it's this option, not -m, that fixes up the links)
-P C:\streater => put everything under that directory
-r => recursive (grab sub-links on the same site recursively); already implied by --mirror, so not needed here
-U mozilla => present wget as a Mozilla browser; note it's a capital U (or --user-agent=mozilla), and it isn't needed in the command above
I think these options should do what you want, without, for example, trying to mirror the whole Internet. But it is slightly confusing, as different versions of wget use different option combinations to achieve the same end.
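If in doubt, spelling the options out long-hand should be accepted by any reasonably recent GNU wget build (including the Windows ones); this is just the same command as above written that way:
wget -e robots=off --mirror --page-requisites --wait=2 --convert-links --directory-prefix=C:\streater http://www.mth.kcl.ac.uk/~streater/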
But needless to say, it can only fetch a "connected" set of web pages; so if a site happens to comprise more than one link-disjoint set of pages, or a partially ordered set with more than one "maximal" page (for example, downward links that don't each recursively span every page on the site), you'll need to run wget against a set of starting pages that between them span the site, if that makes sense.
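One way round that, assuming you know a couple of entry pages that between them reach everything, is simply to give wget more than one starting URL in the same run; the second URL here is made up purely for illustration:
wget -e robots=off --mirror -p -w 2 --convert-links -P C:\streater http://www.mth.kcl.ac.uk/~streater/ http://www.mth.kcl.ac.uk/~streater/otherindex.html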
edit: Sorry, I see several people have already mentioned wget. But as Bunk pointed out, you can get a Windows version.
There's also cURL, but I haven't used that recently.
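Bear in mind cURL doesn't do recursive mirroring on its own; it's more for grabbing individual URLs, so something like this would only fetch the front page (the output filename is just my choice):
curl -A "Mozilla/5.0" -o streater.html http://www.mth.kcl.ac.uk/~streater/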