I need some documentation about XUL but I do not have Internet access most of the time. So, I've tried to download the Mozilla Tutorial with the following command:

wget --no-parent -r -l 2 -p -k https://developer.mozilla.org/en/XUL_Tutorial

My intention was to download both the https://developer.mozilla.org/en/XUL_Tutorial page and its subpages (for example, https://developer.mozilla.org/en/XUL_Tutorial/Install_Scripts). However, even though I passed the --no-parent flag, it keeps getting pages such as https://developer.mozilla.org/index.php?title=Special:Userlogin&returntotitle=en%2FXUL+Tutorial%2FInstall+Scripts.

I do not understand why it happens. How could I achieve the behavior I intended?

studiohack
brandizzi

3 Answers

You need the trailing slash at the end of the URL.
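The reason this matters: `--no-parent` refuses to ascend above the *directory* of the starting URL. Without the trailing slash, wget treats `XUL_Tutorial` as a file whose directory is `/en/`, so anything under `/en/` (including `index.php` pages) is still fair game. A minimal sketch of that path logic using `dirname` (my illustration of the rule, not wget's actual code):

```shell
# Without the trailing slash, wget treats XUL_Tutorial as a file,
# so the directory it refuses to ascend above is /en/:
dirname "/en/XUL_Tutorial"        # -> /en

# With the trailing slash, the tutorial path itself is the directory,
# so --no-parent confines the crawl to /en/XUL_Tutorial/:
dirname "/en/XUL_Tutorial/page"   # -> /en/XUL_Tutorial
```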

Dyax
  • I tried `wget --no-parent -r -l 2 -p -k https://developer.mozilla.org/en/XUL_Tutorial/` but it only downloads the `index.html` file... – brandizzi Sep 15 '11 at 13:13
  • That was my issue! Probably common among wget beginners. – John Berryman Jan 21 '14 at 23:35
  • `it only downloads the index.html file` That was because you were using `-l 2`. Since you did not change the accepted answer, I guess you never increased the recursion level to realize this is the best answer to the question as it had been asked. – Synetech Dec 13 '14 at 02:42
  • This is the correct answer to the question. Confirmed. – Johannes Overmann Jun 28 '19 at 09:31
  • I can also confirm. I spent about an hour trying different combinations of flags and searching the Internet. So frustrating that it all boiled down to a simple missing slash. I wonder how wget interprets the command without the slash. – bartgol Mar 20 '20 at 04:59

Was having a similar issue:

wget -r -l1 --no-parent -nH "https://www.website.com/parent/directory/"

I believe there was an issue with https vs. http. I updated $HOME/.wgetrc to:

header = Accept-Encoding: none
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
referer = http://www.google.com/
robots = off

Then I changed https to http:

wget -r -l1 --no-parent -nH "http://www.website.com/parent/directory/"

The wget program no longer created folders (or retrieved files) from outside the specified directory hierarchy.
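If you don't want these settings applied globally through `$HOME/.wgetrc`, the same configuration can be passed per-invocation on the command line. This is a sketch of an equivalent one-off command (`www.website.com/parent/directory/` is the placeholder URL from the answer, not a real host, so it is not runnable as-is):

```shell
# Same settings as the .wgetrc above, expressed as one-off flags:
wget -r -l1 --no-parent -nH \
     --header='Accept-Encoding: none' \
     --header='Accept-Language: en-us,en;q=0.5' \
     --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
     --header='Connection: keep-alive' \
     --user-agent='Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2' \
     --referer='http://www.google.com/' \
     -e robots=off \
     "http://www.website.com/parent/directory/"
```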

Dave Jarvis
  • I tried it and it seems to work perfectly. Waiting for the download to finish (which I do not actually need anymore) to be sure :) However, I did not change to HTTP - I mean, I did, but it kept redirecting to HTTPS. Do you know why your `.wgetrc` seems to change the behavior? – brandizzi Aug 24 '12 at 16:59

I had to disable gzip compression to make it work. I also changed the user-agent because some pages forbid wget. So this is what I've put into my .wgetrc:

header = Accept-Encoding: none

user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6

Works great here.

Gaff