I need some documentation about XUL but I do not have Internet access most of the time. So, I've tried to download the Mozilla Tutorial with the following command:

wget --no-parent -r -l 2 -p -k https://developer.mozilla.org/en/XUL_Tutorial

My intention was to download both the https://developer.mozilla.org/en/XUL_Tutorial page and its subpages (for example, https://developer.mozilla.org/en/XUL_Tutorial/Install_Scripts). However, even though I passed the --no-parent flag, it keeps getting pages such as https://developer.mozilla.org/index.php?title=Special:Userlogin&returntotitle=en%2FXUL+Tutorial%2FInstall+Scripts.

I do not understand why it happens. How could I achieve the behavior I intended?

studiohack
brandizzi

3 Answers

You need the trailing slash at the end of the URL.
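The reason this matters: `--no-parent` refuses to ascend above the *directory* of the starting URL. Without the trailing slash, wget treats `XUL_Tutorial` as a file whose directory is `/en/`, so anything under `/en/` (including `index.php` pages) is still fair game. A minimal sketch of that path logic using `dirname` (my illustration of the rule, not wget's actual code):

```shell
# Without the trailing slash, wget treats XUL_Tutorial as a file,
# so the directory it refuses to ascend above is /en/:
dirname "/en/XUL_Tutorial"        # -> /en

# With the trailing slash, the tutorial path itself is the directory,
# so --no-parent confines the crawl to /en/XUL_Tutorial/:
dirname "/en/XUL_Tutorial/page"   # -> /en/XUL_Tutorial
```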

Dyax
  • I tried `wget --no-parent -r -l 2 -p -k https://developer.mozilla.org/en/XUL_Tutorial/` but it only downloads the `index.html` file... – brandizzi Sep 15 '11 at 13:13
  • That was my issue! Probably common among wget beginners. – John Berryman Jan 21 '14 at 23:35
  • `it only downloads the index.html file` That was because you were using `-l 2`. Since you did not change the accepted answer, I guess you never increased the recursion level to realize this is the best answer to the question as it had been asked. – Synetech Dec 13 '14 at 02:42
  • This is the correct answer to the question. Confirmed. – Johannes Overmann Jun 28 '19 at 09:31
  • I can also confirm. I spent about an hour trying different combinations of flags and searching the Internet. So frustrating that it all boiled down to a simple missing slash. I wonder how wget interprets the command without the slash. – bartgol Mar 20 '20 at 04:59

Was having a similar issue:

wget -r -l1 --no-parent -nH "https://www.website.com/parent/directory/"

I believe there was an issue with https vs. http. I updated $HOME/.wgetrc to:

header = Accept-Encoding: none
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
referer = http://www.google.com/
robots = off

Then I changed https to http:

wget -r -l1 --no-parent -nH "http://www.website.com/parent/directory/"

The wget program no longer created folders (or retrieved files) from outside the specified directory hierarchy.
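If you don't want these settings applied globally through `$HOME/.wgetrc`, the same configuration can be passed per-invocation on the command line. This is a sketch of an equivalent one-off command (`www.website.com/parent/directory/` is the placeholder URL from the answer, not a real host, so it is not runnable as-is):

```shell
# Same settings as the .wgetrc above, expressed as one-off flags:
wget -r -l1 --no-parent -nH \
     --header='Accept-Encoding: none' \
     --header='Accept-Language: en-us,en;q=0.5' \
     --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
     --header='Connection: keep-alive' \
     --user-agent='Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2' \
     --referer='http://www.google.com/' \
     -e robots=off \
     "http://www.website.com/parent/directory/"
```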

Dave Jarvis
  • I tried it and it seems to work perfectly. Waiting for the download to finish (which I do not actually need anymore) to be sure :) However, I did not change to HTTP - I mean, I did, but it kept redirecting to HTTPS. Do you know why your `.wgetrc` seems to change the behavior? – brandizzi Aug 24 '12 at 16:59

I had to disable gzip compression to make it work. I also changed the user-agent because some pages forbid wget. So this is what I've put into my .wgetrc:

header = Accept-Encoding: none

user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6

Works great here.

Gaff