difference when downloading PDF via wget and browser

Asked by Rosika Schreck

Hi all,

The following is not so much a bug report (wget works as desired) as a question about an interesting phenomenon we've been discussing for a while now (see: https://itsfoss.community/t/minutely-different-results-when-downloading-pdf-via-wget-and-browser/5128 ).

We've compared the download of a PDF (example: e-book from distrowatch “Linux from Scratch”, source: https://www.tradepub.com/free/w_linu01/prgm.cgi?a=1 )
via browser download (Chromium) and via wget, and found that the PDF file obtained via wget is exactly 16 bytes larger than the PDF we got via the browser.

md5sums:

da65d66d0dfd995d7fd4f7e7327506b3 wget_Linux_from_Scratch.pdf
6ec4ff88e8884c61587e124af2e6181d browser_Linux_from_Scratch.pdf

pdfinfo:

File size: 959330 bytes wget_Linux_from_Scratch.pdf
File size: 959314 bytes browser_Linux_from_Scratch.pdf

Using a hex viewer, we've also established exactly where the two PDFs differ:
the wget download is 16 bytes larger because of these appended bytes:

[diff]:
59958c59958,59959
< 000ea350: 0d0a ..
---
> 000ea350: 0d0a 0a3c 2f62 6f64 793e 0a3c 2f68 746d ...</body>.</htm
> 000ea360: 6c3e l>
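
The same check can be done without a hex viewer. Below is a minimal Python sketch, assuming both downloads are available as byte strings. The helper name appended_tail and the stand-in PDF content are made up for illustration; only the 16-byte tail is taken from the hex dump above.

```python
from typing import Optional


def appended_tail(short: bytes, long: bytes) -> Optional[bytes]:
    """Return the bytes appended to `long` if `short` is a prefix of it, else None."""
    if long.startswith(short):
        return long[len(short):]
    return None


# Stand-in data (assumption): the real files are about 959 KB each.
# Only the ending matters here; the hex dump shows the PDF ending in 0d0a.
browser_copy = b"%PDF-1.4 ... %%EOF\r\n"
# Tail as shown in the hex dump: 0a "</body>" 0a "</html>"
wget_copy = browser_copy + b"\n</body>\n</html>"

tail = appended_tail(browser_copy, wget_copy)
print(len(tail), tail)  # prints: 16 b'\n</body>\n</html>'
```

For the real files, the two arguments could be read with open(name, "rb").read(). A result of None would mean the files differ somewhere other than at the end.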

We are very interested in why that is and what the technical background may be.

Thanks a lot in advance.

Greetings.
Rosika

P.S.:

system: Linux/Lubuntu 18.04.4 LTS, 64 bit
wget: 1.19.4-1ubuntu2.2

Question information

Language: English
Status: Solved
For: Ubuntu wget
Assignee: No assignee
Solved by: Manfred Hampl

Best Manfred Hampl (m-hampl) said :
#1

Interesting problem.

To me it seems that either wget adds that "</body>.</html>" trailer,
or maybe it is even the server.

You have to be aware that the URL is not directly pointing to the file, but the URL has to be interpreted on the server first.
Possibly the server provides slightly different data depending on the requesting application, differing between wget and browsers.

My suggestion:
Enable --debug for the wget command and do some testing with different "User-Agent" values for wget.
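
The User-Agent part of this test can also be scripted. The following is only a sketch in Python using the standard urllib module. The URL and both User-Agent strings are hypothetical placeholders, not values from this thread, and the helper names are made up. The loop only shows which header would be sent; the function that actually contacts the server is defined but not called.

```python
import urllib.request
from typing import Optional, Tuple

# Hypothetical placeholders (assumptions): replace with the real download
# link and with the User-Agent strings of the browsers to imitate.
URL = "https://example.com/download.pdf"
USER_AGENTS = {
    "chromium-like": "Mozilla/5.0 (X11; Linux x86_64) Chrome/84.0 Safari/537.36",
    "firefox-like": "Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0",
}


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that announces the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_summary(request: urllib.request.Request,
                  timeout: float = 30.0) -> Tuple[Optional[str], int]:
    """Send the request; return (Content-Length header or None, body size in bytes)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
        return response.headers.get("Content-Length"), len(body)


for label, agent in USER_AGENTS.items():
    request = build_request(URL, agent)
    # urllib stores header names capitalised as "User-agent".
    print(label, "->", request.get_header("User-agent"))
    # To contact the server, uncomment the next line:
    # print(label, fetch_summary(request))
```

Comparing the two values returned by fetch_summary for each User-Agent would show whether the server declares a length and how many bytes it really sends.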

actionparsnip (andrew-woodhead666) said :
#2

Interesting stuff. Never noticed this to be honest.

What made you check this? Just curious :-)

Rosika Schreck (rosika) said :
#3

@Manfred Hampl (m-hampl):

Thank you very much for this explanation.
It's great to have a solution to our investigation.

> Enable --debug for the wget command and do some testing with different "User-Agent" values for wget.

O.K. That seems interesting. We'll look into that and report back.

Thanks again and many greetings.
Rosika

Rosika Schreck (rosika) said :
#4

@actionparsnip (andrew-woodhead666):

Hi,

actually my comparison happened on the spur of the moment. Out of interest really.

The thing is:

Having used a Linux distro (Lubuntu) for quite a while now, I've become accustomed to doing as many things as possible from the command line.
So for downloads my favourite has become wget.

Yet when I followed the link I had received via e-mail to the tradepub site, the download started automatically from within the browser.
So there I had my first copy of the PDF.

Afterwards I copied the link provided within the browser (for starting the download manually in case the automatic download failed) and started a second download via terminal with wget.

Just out of curiosity I checked the md5sums afterwards and was flabbergasted to learn they were different.

That initiated the investigation (which can be followed here: https://itsfoss.community/t/minutely-different-results-when-downloading-pdf-via-wget-and-browser/5128 ).

Many greetings.
Rosika

actionparsnip (andrew-woodhead666) said :
#5

Same but I install Ubuntu minimal then add LXDE. This is really weird. Never thought about it at all. Just wget and open.... Job done.

Rosika Schreck (rosika) said :
#6

@Manfred Hampl (m-hampl):

Hi again,

Following your suggestion, I can now report the following:

1.) Detailed output of wget --debug can be found here:
      https://gist.github.com/Rosika2/4507557b4fbf12b481f216851dafdf36

     It covers runs with the user-agents chromium, falkon and firefox.

2.) md5sum *.pdf

a.) 6ec4ff88e8884c61587e124af2e6181d browser_Linux_from_Scratch.pdf # from within chromium
b.) 6ec4ff88e8884c61587e124af2e6181d wget_user-agent_chromium_Linux_from_Scratch.pdf # NEW
c.) 6ec4ff88e8884c61587e124af2e6181d wget_user-agent_falkon_Linux_from_Scratch.pdf # NEW
d.) da65d66d0dfd995d7fd4f7e7327506b3 wget_user-agent_firefox_Linux_from_Scratch.pdf # NEW
e.) da65d66d0dfd995d7fd4f7e7327506b3 wget_Linux_from_Scratch.pdf # without explicitly setting user-agent
f.) da65d66d0dfd995d7fd4f7e7327506b3 curl_Linux_from_Scratch.pdf

So that's it.

a.) and b.) shouldn't come as a surprise as the user-agent should be the same.

c.) is the same, i.e. user-agent falkon behaves like a.) and b.).

The rest behave like user-agent firefox.

All this adds up to:

a.), b.) and c.) : File size: 959314 bytes
d.), e.) and f.) : File size: 959330 bytes (including the 16 added bytes discussed above)
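
The grouping above can be reproduced for any set of downloads. The following is a minimal Python sketch, assuming the PDFs sit in the current directory. The helper names md5_of and group_by_md5 are made up for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(paths: Iterable[Path]) -> Dict[str, List[str]]:
    """Map each checksum to the sorted names of the files that share it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(paths):
        groups[md5_of(path)].append(path.name)
    return dict(groups)


if __name__ == "__main__":
    # Assumption: the downloaded PDFs are in the current directory.
    for checksum, names in group_by_md5(Path(".").glob("*.pdf")).items():
        print(checksum, len(names), ", ".join(names))
```

Run on the six files listed above, this should print two checksums with three file names each.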

These are my results.

Thanks so much for your help.

Greetings.
Rosika

Rosika Schreck (rosika) said :
#7

@actionparsnip (andrew-woodhead666):

Thanks for your comment.

OFFTOPIC:

> Same but I install Ubuntu minimal then add LXDE

I would consider the same with BodhiLinux install + LXDE. But I'm not sure about it due to the uncertain future of LXDE.

Greetings.
Rosika

actionparsnip (andrew-woodhead666) said :
#8

Just means it's Ubuntu and not Lubuntu so I get full support. Been playing with just a tiling WM to go even lighter.

Manfred Hampl (m-hampl) said :
#9

What size does downloading with the firefox browser give?

Your wget debug logs show a difference between "user-agent firefox" and the other ones.

For the other ones one line in the response is
Content-length: 959314
and later (I guess it is from wget):
Length: 959314 (937K) [application/pdf]

and for firefox there is no "Content-length" in the header, and only later a
Length: unspecified [application/pdf]
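
The comparison between the declared and the saved size can be written down as a small check. This is only a sketch in Python with a made-up helper name, extra_bytes. The figures are the ones reported in this thread: a declared length of 959314 for the chromium and falkon runs, and no declared length with 959330 bytes saved for the firefox run.

```python
from typing import Optional


def extra_bytes(content_length: Optional[str], bytes_saved: int) -> Optional[int]:
    """Return saved size minus declared size, or None if no length was declared."""
    if content_length is None:
        return None
    return bytes_saved - int(content_length)


# Content-Length present: the saved file matches the declared size.
print(extra_bytes("959314", 959314))
# No Content-Length header: there is nothing to compare the saved size against.
print(extra_bytes(None, 959330))
# Difference between the two saved sizes reported in the thread.
print(959330 - 959314)
```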

Rosika Schreck (rosika) said :
#10

@Manfred Hampl (m-hampl):

Hello Manfred,

thank you so much for taking a look at my wget debug-posts.

> and later (I guess it is from wget): Length: 959314 (937K) [application/pdf]

Yes. All three postings refer to wget with the respective user-agents.

> and for firefox there is no "Content-length" in the header, and only later a Length: unspecified [application/pdf]

Yes, you're right. But further down, in line 176, it says:

"2020-08-03 16:46:51 (151 KB/s) - ‘index.html?p=w_linu01&w=d&<email address hidden>&key=GcIEvQ1el3M7EFMP0ZFg&ts=65248&u=0821130991791596207566&e=M2Jlcm5oYXJkQHRlbXByLmVtYWls&_afn=’ saved [959330]"

> What size does downloading with the firefox browser give?

I hadn't downloaded the PDF with the firefox browser so far, so I just did it.
As a result I got:

pdfinfo browser_firefox_Linux_from_Scratch.pdf | grep 'File size'
File size: 959330 bytes

So wget with user-agent firefox and the firefox-browser seem to yield the same results.

Greetings.
Rosika

Manfred Hampl (m-hampl) said :
#11

My conclusion is that the web server delivers different data for Chromium and for Firefox.

So we can rule out that wget is the culprit.

Rosika Schreck (rosika) said :
#12

@Manfred Hampl (m-hampl):

Hi again,

thanks for your reply.

> So we can rule out that wget is the culprit.

Yes, I'd say that, too.

So I think it's clear now why there is this minute difference in file-size.
As you said in post#1:

> You have to be aware that the URL is not directly pointing to the file, but the URL has to be interpreted on the server first.
> Eventually the server provides slightly different data depending on the requesting application, differing
> between wget and browsers.

Thanks a lot again for clarifying this phenomenon. Your help is much appreciated.

Many greetings.
Rosika

Rosika Schreck (rosika) said :
#13

Thanks Manfred Hampl, that solved my question.

Manfred Hampl (m-hampl) said :
#14

With the latest evidence gathered by you I have to slightly rephrase that statement:

You have to be aware that the URL is not directly pointing to the file, but the URL has to be interpreted on the server first.
Possibly the server provides slightly different data depending on the requesting application, differing by User-Agent value, e.g. between Chromium and Firefox.

Rosika Schreck (rosika) said :
#15

@Manfred Hampl (m-hampl):

Hi,

thanks for pointing that out.
Yes, that seems to be the best definition.

Many greetings.
Rosika

cord sellers (cor3dx) said :
#16

this was definitely an interesting discussion. thank you @Rosika Schreck (rosika) for sharing the curiosity as well as your findings and @Manfred Hampl (m-hampl) for providing an answer to the why behind those mysterious 16 bytes.

Rosika Schreck (rosika) said :
#17

@cord sellers (cor3dx):

You're most welcome.
Thanks a lot for your help as well.

Greetings.
Rosika