
In the past I made a backup of a partially full partition with dd if=/dev/sda1 | gzip -5 > file.gz. Some time later, when the free space on the partition was smaller, I made an image file again with the same command, and the output file came out a little smaller.

In both cases I used the same versions of dd and gzip, the same parameters, the same hardware and the same partition, and dd reported the same output (apart from time and speed): the same number of records in/out and bytes copied.

What could have caused that, and how can it be explained? How can I check which image file is invalid, assuming one of them is? Which is more probable: HDD corruption that caused undetected loss of data, or a difference related to some issue with compression?

To-la120
  • You don't indicate how much smaller. Does every single file that exists in the first image also exist in the second image? – Ramhound May 29 '15 at 11:54
  • Mainly yes - I didn't delete anything, but there may be some changes to system files, and obviously there are some additional files from the creation of the first image file – To-la120 May 29 '15 at 14:15

2 Answers


It's the nature of compression. How effective it is depends on the input data. Since you compressed different data each time, you end up with different compressed sizes, even though the uncompressed size is the same.
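A quick way to see this for yourself (a sketch, assuming GNU coreutils and gzip are available; the 100M size is arbitrary) is to compress two inputs of identical size but different content:

    head -c 100M /dev/zero    | gzip -5 | wc -c   # all zeros: compresses to roughly 100 KB
    head -c 100M /dev/urandom | gzip -5 | wc -c   # random bytes: stays at roughly 100 MB

The amount of input is the same either way; only the content differs, and that alone decides the compressed size.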

psusi
  • Generally yes, but I'm not sure whether it applies in this specific case too. Notice that the partition contains the same data and some more. – To-la120 May 29 '15 at 14:20
  • @To-la120, it does *not* contain the same data. If you added some files, then the space where those files are located changed from whatever was there before to what is there now. Different data is different data. – psusi May 29 '15 at 17:32

You seem to think that free space compresses better. There is no such rule.

Common filesystems only mark free space as free; they don't overwrite it with zeros or anything else. The old data is still there until it is overwritten with something new. (Side note: this is why it's sometimes possible to recover deleted files.)

dd reads everything; it knows nothing about filesystems or what they consider free space. Then gzip compresses everything, including the old data in "free space", which may compress well or poorly. In this context there is no free space; there's only a data stream to process.
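You can convince yourself that deleted data really ends up in the image by reproducing the situation on a throwaway loopback file instead of a real partition (a sketch only; the paths, sizes and the choice of ext4 are examples, and the mount steps need root):

    truncate -s 256M demo.img                  # sparse 256 MiB backing file
    mkfs.ext4 -q -F demo.img                   # create a filesystem on it
    mkdir -p /mnt/demo
    mount -o loop demo.img /mnt/demo
    dd if=/dev/urandom of=/mnt/demo/junk bs=1M count=150   # fill it with incompressible data
    rm /mnt/demo/junk                          # the filesystem now reports the space as free
    umount /mnt/demo
    gzip -5 -c demo.img | wc -c                # still ~150 MB compressed

The filesystem is nearly empty at the end, yet the compressed image stays large, because dd and gzip see the leftover random bytes in the "free" blocks, not empty space.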

It may be that some new, highly compressible files replaced old, poorly compressible data that was marked as free space. If so, the new archive will be smaller than the old one, despite the fact that it contains more data that you consider useful, current or existing. This may well be the main cause of what you experienced.

Please see Clone only space in use from hard disk, and my answer there. The "preparation" step overwrites empty space with zeros, so it compresses extremely well. If you did this before each backup, the sizes of the resulting archives would probably agree with your intuition.
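In shell terms the preparation looks roughly like this (a sketch, assuming the filesystem on /dev/sda1 is mounted at /mnt/backup, which is a made-up mount point; the first dd is expected to stop with a "No space left on device" error once the free space is used up):

    dd if=/dev/zero of=/mnt/backup/zerofile bs=1M   # fill all free space with zeros
    sync
    rm /mnt/backup/zerofile                         # release the space again, now zeroed
    umount /mnt/backup
    dd if=/dev/sda1 | gzip -5 > file.gz             # zeroed free space compresses to almost nothing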

"Probably", because the other answer to your question is right in general: it all depends on the input data. Even after zeroing the free space, a filesystem that is 60% full may compress to a smaller archive than an equally big filesystem that is 50% full, if the files within are different.

Kamil Maciorowski