
I am using Node.js to read a file into byte data (a Uint8Array) and hashing that byte data with SHA-256, which gives a result like this:

ad505ee6067fba3f6fc20506d3379e190e087aeffe5d958ab9f2f3ed3800sa4f
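For reference, a minimal sketch of this kind of hashing, assuming Node's built-in crypto and fs modules are used. The helper names sha256OfBytes and sha256OfFile are hypothetical and only for illustration; they are not necessarily the code that produced the value above.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");

// Hypothetical helper: SHA-256 of bytes already in memory (Uint8Array or Buffer),
// returned as a 64-character lowercase hex string.
function sha256OfBytes(bytes) {
  return crypto.createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical helper: reads the whole file and hashes its bytes.
// fs.readFileSync returns a Buffer, which is a Uint8Array subclass.
function sha256OfFile(filePath) {
  return sha256OfBytes(fs.readFileSync(filePath));
}
```

Only the bytes passed to update() influence the digest; the path given to sha256OfFile is used solely to locate the file.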

I am wondering whether two different files could ever have the same byte data and therefore the same SHA-256 hash. And what about the same file with a very small change, such as a different modification date, suffix, or some other small change?

If so, is there any other practical way to get a unique identifier for a file as a string?

Giacomo1968
Neo Liu
  • If you have two binary-identical files, I assume they'll produce the same hash. I assume the hash is calculated over the contents of the file; the modification date is not part of the file but of the file system. If the file embedded such a date, in an EXIF block for example, the files would not be binary copies. – Joep van Steen Dec 29 '22 at 02:13
  • @JoepvanSteen thanks for the explanation. I am curious that what about the file content will two files with different content by any chance have the same binary data which give the same hash results? – Neo Liu Dec 29 '22 at 02:41
  • See [this answer](https://superuser.com/a/1330700/432690). – Kamil Maciorowski Dec 29 '22 at 05:56
  • FWIW, your concern is based on the concept of a [hash collision](https://en.wikipedia.org/wiki/Hash_collision). In computing, high entropy is desired to avoid collisions, and SHA256 has high entropy. Learn more about this in [this post on the Cryptography site](https://crypto.stackexchange.com/q/47809). – Giacomo1968 Dec 29 '22 at 06:13

1 Answer


I am curious that what about the file content will two files with different content by any chance have the same binary data

Generally, no. "Different content" literally means different "binary data". Those are the same thing – the data you read from a file is its whole content. The file's metadata such as file name, modification time, or extended attributes is not considered its "content".
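As a rough illustration of that distinction, here is a small Node.js sketch. It assumes the hash is computed over the bytes returned by fs.readFileSync; the file names and the function names hashFileContent and metadataDemo are made up for the example.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hashes only the bytes read from the file; name and timestamps never enter the digest.
function hashFileContent(filePath) {
  return crypto.createHash("sha256").update(fs.readFileSync(filePath)).digest("hex");
}

// Hypothetical demo: two files with identical bytes but different metadata,
// followed by a one-byte change to the content of the second file.
function metadataDemo() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "hash-demo-"));
  const first = path.join(dir, "report.txt");
  const second = path.join(dir, "renamed-copy.bin"); // different name and suffix

  fs.writeFileSync(first, "same bytes");
  fs.writeFileSync(second, "same bytes");

  // Change access and modification times of the second file (file-level metadata).
  const oldDate = new Date("2000-01-01T00:00:00Z");
  fs.utimesSync(second, oldDate, oldDate);
  const equalWithDifferentMetadata = hashFileContent(first) === hashFileContent(second);

  // Append a single byte: now the content itself differs.
  fs.appendFileSync(second, "!");
  const equalAfterContentChange = hashFileContent(first) === hashFileContent(second);

  fs.rmSync(dir, { recursive: true, force: true });
  return { equalWithDifferentMetadata, equalAfterContentChange };
}

console.log(metadataDemo());
```

The first comparison should come out equal because only the name, suffix, and timestamps differ; the second should not, because one byte of content was added.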


(Files on NTFS on Windows might contain "alternate data streams" which behave more like whole files than extended attributes, but unlike macOS resource forks in the past, no Windows program spreads file content across multiple streams – non-main streams are only ever used as a way to hold extended attributes, so they can safely be considered metadata.)

u1686_grawity
  • FWIW, in the case of binary files this might all hold true. But in my experience, simple changes in metadata for image files (JPG, PNG, TIFF, etc…) can result in different hashes. This is typically when using MD5 tools and such, but the concept is still the same. – Giacomo1968 Dec 29 '22 at 06:24
  • @Giacomo1968: The EXIF metadata in JPG/TIFF files is stored as part of the file's byte data (I'd call it "content-level metadata" as opposed to "file-level metadata") so that's a normal result. Still, if you only consider the image data as content, it's the opposite of what OP asks – you can easily have two .jpg files with the same image but different byte data, but you cannot have files with different images yet the same byte data. – u1686_grawity Dec 29 '22 at 06:39