
I see a bunch of FOSS projects which have ".sha256" files. They look something like this:

dsdfdfdsffdfsdfdsfdsfdsfdsfdsfds23r2ewrefdefdsfdsgfdsgffgfkgdfgg     *meow.exe
Asdfdfdsffdfsdfdsfdsfdsfdsfdsfds23r2ewrefdefdsfdsgfdsgffgfkgdfg3     cool_stuff.exe
dsdfdfdsfDdfsdfdsfdsfdsfdsfdsfds23r2ewrefdefdsfdsgfdsgffg3kgdfgg     even_more_stuff.exe

I currently get these out with:

#^([A-Za-z0-9]{64})\s+(\S+)$#um

That will match dsdfdfdsffdfsdfdsfdsfdsfdsfdsfds23r2ewrefdefdsfdsgfdsgffgfkgdfgg and *meow.exe, etc. Filenames that mysteriously begin with an asterisk (I've tried countless times to look this up without getting any wiser as to what it means) are stripped of their leading *.

Is there anything more to it than that? What happens if the filenames have spaces in them instead of underscores? Then my regexp breaks down. Can they be quoted? If so, is Linux-style (single quotes) or Windows-style (double quotes) quoting used?

This seemingly simple file format actually has countless questions associated with it, but I don't see it defined anywhere. Nor have I so far encountered filenames which use spaces or quotes of any kind. But they do use asterisks, which apparently can also appear at the end of the filename...

How should this madness be parsed to not break one day?

Iago B.
  • *"What happens if the filenames have spaces in them instead of underscores?"* - Speaking solely for the checksum programs I have personally used on Windows, nothing. These are not treated specially and there is no need to quote paths. Beyond taking into consideration that the binary `*` indicator can sometimes come in at the end, I think all you really need to do is adapt for the possibility of spaces, underscores, hyphens, etc. (but likely not quotes) if you want the broadest solution possible. – Anaksunaman Jul 06 '20 at 06:16

1 Answer

A .sha256 file is a text file generated by the sha256sum program. The purpose of a .sha256 file is to enable one to check the integrity of files using the sha256sum program. Its content is not supposed to be manually interpreted by humans. sha256sum's man page refers to https://www.gnu.org/software/coreutils/sha256sum which in turn refers to https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html:

For each file, md5sum outputs by default, the MD5 checksum, a space, a flag indicating binary or text input mode, and the file name. Binary mode is indicated with *, text mode with (space). Binary mode is the default on systems where it’s significant, otherwise text mode is the default. Without --zero, if file contains a backslash or newline, the line is started with a backslash, and each problematic character in the file name is escaped with a backslash, making the output unambiguous even in the presence of arbitrary file names. If file is omitted or specified as -, standard input is read.
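As a rough illustration of the default layout quoted above, here is a minimal Python sketch. It assumes SHA-256 digests (64 hex characters) and handles only the two escapes the quoted text mentions (backslash and newline). The names `LINE_RE`, `unescape` and `parse_line` are made up for this example and are not part of any tool.

```python
import re

# Optional leading backslash (marks an escaped file name), 64 hex digits,
# one space, a mode flag ('*' = binary, ' ' = text), then the file name,
# which runs to the end of the line and may contain spaces.
LINE_RE = re.compile(r"^(\\?)([0-9A-Fa-f]{64}) ([ *])(.*)$")


def unescape(name):
    # Undo the escapes described in the quoted manual text:
    # "\\" becomes a backslash and "\n" becomes a newline.
    # Any other escaped character is passed through unchanged (an assumption).
    return re.sub(
        r"\\(.)",
        lambda m: "\n" if m.group(1) == "n" else m.group(1),
        name,
    )


def parse_line(line):
    """Return (digest, is_binary, file_name), or None if the line does not match."""
    m = LINE_RE.match(line.rstrip("\r\n"))
    if m is None:
        return None
    escaped, digest, flag, name = m.groups()
    if escaped:
        name = unescape(name)
    return digest.lower(), flag == "*", name
```

Because the flag is a fixed single character right after the separator space, an asterisk there is the mode indicator and not part of the name. Everything after it is the file name, so spaces need no quoting. This sketch is deliberately strict about the single separator space; files written by other tools may need a looser pattern.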

And further:

‘-c’ ‘--check’

Read file names and checksum information (not data) from each file (or from stdin if no file was specified) and report whether the checksums match the contents of the named files. The input to this mode of md5sum is usually the output of a prior, checksum-generating run of md5sum. Three input formats are supported. Either the default output format described above, the --tag output format, or the BSD reversed mode format which is similar to the default mode, but doesn’t use a character to distinguish binary and text modes. Output with --zero enabled is not supported by --check.
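The --tag format mentioned in that quote puts the algorithm name and the file name first, e.g. `SHA256 (meow.exe) = <digest>`. A minimal Python sketch for recognising such lines follows. `TAG_RE` and `parse_tag_line` are hypothetical names, and backslash-escaped names are intentionally not handled here.

```python
import re

# "SHA256 (", the file name, ") = ", then 64 hex digits at the end of the line.
# The greedy ".*" lets the file name itself contain parentheses or spaces.
TAG_RE = re.compile(r"^SHA256 \((.*)\) = ([0-9A-Fa-f]{64})$")


def parse_tag_line(line):
    """Return (file_name, digest), or None if the line is not in --tag form."""
    m = TAG_RE.match(line.rstrip("\r\n"))
    if m is None:
        return None
    name, digest = m.groups()
    return name, digest.lower()
```

A parser that wants to be tolerant could try this pattern first and fall back to the default layout if it does not match.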

dirdi