Extract X bytes at Y offset of input files and add it to the end of one output file, in a loop

Question

I'm looking for a method that would read X bytes at offset Y from each input file in a folder, then append them to a single output file. For instance, given a folder with many fragments of video files, the idea would be to read 16 bytes at offset 1024 of the first file, write them to a new file, add a line break, and then process the whole folder the same way. That file would then be used as a list of keywords for a simultaneous search in WinHex, so as to hopefully solve that issue.

I've done something similar once with ddrescue. Here I need a method that works on Windows. I found out that ddrescue can work on Windows as a Cygwin port, but I can't get it to work with individual files. Then there's a dd port for Windows: it can read a chunk of data from an input file, but I couldn't find a way to get it to write that chunk at the end of the output file. I know two command line tools called dsfo and dsfi (included in dsfok), which can each do one half of the intended task (dsfo can read any defined chunk of data and extract it as a new file, but it can't write it into an existing file ; dsfi can write an input file to any location within an output file, but it doesn't allow defining a specific part of the input file). I tried to make them work together in the same script but it failed.

Could this be done with PowerShell ? How ?

EDIT (as per Keith Miller's suggestion) :

This command works as intended :

foreach ($file in gci *.mts, *.vob) {
$16Bytes = [System.Text.Encoding]::Default.GetString([System.IO.File]::ReadAllBytes("$file"), 1024, 16)
Add-Content -path "G:\PowerShell search terms MTS-VOB.txt" -value $16Bytes
}

It reads 16 bytes at offset 1024 from each input file and writes it to the TXT file, and it automatically adds a line break (0D 0A) after each ASCII encoded string (if that's the correct terminology), so there's no need to add a specific command for that.
But it is too slow, as if each file were being read in its entirety (it's much quicker if I run the command again with different offset parameters, probably because the files were copied to RAM during the first run). Is there a way to speed up the process, to ensure that only the relevant portion of each file is actually parsed ?
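One possible way to avoid reading each file in its entirety is to open it as a stream, seek to the offset, and read only the bytes needed. The following is a minimal sketch, not a tested solution for this exact setup: `Read-BytesAt` is a hypothetical helper name, while `OpenRead`, `Seek`, `Read` and `Close` are standard .NET `File` / `FileStream` members.

```powershell
# Hypothetical helper: read $Count bytes starting at $Offset without loading the whole file.
function Read-BytesAt {
    param([string]$Path, [long]$Offset, [int]$Count)

    $stream = [System.IO.File]::OpenRead($Path)
    try {
        # Jump straight to the requested offset; the data before it is not read.
        [void]$stream.Seek($Offset, [System.IO.SeekOrigin]::Begin)

        $buffer = New-Object byte[] $Count
        $total  = 0
        # Read may return fewer bytes than requested, so loop until done or end of file.
        while ($total -lt $Count) {
            $read = $stream.Read($buffer, $total, $Count - $total)
            if ($read -le 0) { break }
            $total += $read
        }

        # Return only the bytes actually read (fewer if the file is too short).
        if ($total -lt $Count) {
            $short = New-Object byte[] $total
            [System.Array]::Copy($buffer, $short, $total)
            $buffer = $short
        }
        # The leading comma stops PowerShell from unrolling the array on return.
        return ,$buffer
    }
    finally {
        $stream.Close()
    }
}
```

Inside the existing loop this could be called as `$bytes = Read-BytesAt $file.FullName 1024 16`. Passing `$file.FullName` rather than `"$file"` is a deliberate choice: .NET methods resolve relative paths against the process working directory, which is not necessarily the current PowerShell location.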

Another issue is that WinHex seems to import / export lists of search terms as “Unicode BOM” only, whereas this PS script produces an ANSI file. If I import the ANSI file directly, the list stays empty. I can copy the output from that file to WinHex, but it gets truncated if there's a “null” (00) character. It also gets truncated if I first convert the ANSI file to “Unicode BOM” with Notepad2 and then import it in WinHex's search window. But there's an option called “direct byte-wise translation for GREP”, which makes it possible to search for any sequence of bytes regardless of the code page.

Excerpt from WinHex Help document :

You can search the same search terms simultaneously in up to 6 code pages. The default code page, that is active in your Windows system, is marked with an asterisk and initially preselected. E.g. on computers in the US and in Western Europe, the usual default code page is 1252 ANSI Latin I. The code pages named "ANSI" are used in Microsoft Windows. "MAC" indicates an Apple Macintosh code page. "OEM" indicates a code page used in MS-DOS and Windows command prompts. If a search term cannot be converted to the specified code page because of characters unknown in that code page, a warning is issued. Code page independent GREP searches for exact byte values are possible when searching in a "non" code page called "Direct byte-wise translation for GREP", which translates byte values without any mapping for certain code pages or case matching. X-Ways Forensics also allows to search in both little-endian and big-endian UTF-16, and in any regional Windows code page plus UTF16 with the MS Outlook cipher (compressible encryption) applied.

If, in WinHex, I copy a small block as “GREP Hex”, it gets pasted in this form :

\xFC\x70\x28\x4C\x00\xB5\x47\x00\x52\x30\x96\xA3\x17\x51\x4A\x44

And doing a search with a list formatted like this, with “GREP syntax” activated, and searching as “direct byte-wise translation for GREP”, seems to work reliably, even with “00” bytes as the example above. So, based on an article linked in a comment below, I tried this :

foreach ($file in gci *.mts, *.vob) {
$16Bytes = [System.Text.Encoding]::Default.GetString([System.IO.File]::ReadAllBytes("$file"), 100000, 32)
[System.BitConverter]::ToString($16Bytes)
Add-Content -path "G:\HGST_recherche_fragments_PowerShell_test_hex.txt" -value $16Bytes
}

But got multiple errors like this :

Cannot convert argument "0", with value: "Qe▬?'QhNèJ↕?÷ß}÷ûï« → * l ?♠ @", for "ToString" to type "System.Byte[]": "Cannot convert value "Qe▬?'QhNèJ↕?÷ß}÷ûï« → * l ?♠ @" to type "System.Byte[]". Error: "Cannot convert value "Qe▬?'QhNèJ↕?÷ß}÷ûï« → * l ?♠ @" to type "System.Byte". Error: "Input string was not in a correct format."""
At line:3 char:32
+ [System.BitConverter]::ToString <<<< ($16Bytes)
    + CategoryInfo          : NotSpecified: (:) [], MethodException
    + FullyQualifiedErrorId : MethodArgumentConversionInvalidCastArgument

Obviously there's something wrong, but it's getting closer to a method that actually works... So how can I get PS to read X bytes at offset Y and write them as a sequence of hexadecimal values ?
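The error message indicates that `[System.BitConverter]::ToString` expects a `byte[]`, whereas `GetString` had already turned the bytes into text. A possible fix is to skip `GetString` entirely and format the raw bytes. The sketch below assumes a hypothetical helper name, `Get-GrepHexAt`; it only uses `BitConverter.ToString(byte[], int, int)` and `String.Replace`, and it avoids newer features such as the `.ForEach()` method, which needs PowerShell 4 or later.

```powershell
# Hypothetical helper: return $Count bytes at $Offset of a file as "\xNN\xNN..." text.
function Get-GrepHexAt {
    param([string]$Path, [long]$Offset, [int]$Count)

    $stream = [System.IO.File]::OpenRead($Path)
    try {
        # Seek first so that only the requested bytes are read from disk.
        [void]$stream.Seek($Offset, [System.IO.SeekOrigin]::Begin)
        $buffer = New-Object byte[] $Count
        $total  = 0
        # Read may return fewer bytes than requested; loop until done or end of file.
        while ($total -lt $Count) {
            $read = $stream.Read($buffer, $total, $Count - $total)
            if ($read -le 0) { break }
            $total += $read
        }
    }
    finally {
        $stream.Close()
    }

    if ($total -eq 0) { return '' }
    # BitConverter works on the raw bytes and yields text such as "FC-70-28";
    # replacing each dash with "\x" and prefixing the first byte gives "\xFC\x70\x28".
    $hex = [System.BitConverter]::ToString($buffer, 0, $total)
    return '\x' + $hex.Replace('-', '\x')
}
```

In the loop, the two middle lines could then become `Add-Content -Path "G:\HGST_recherche_fragments_PowerShell_test_hex.txt" -Value (Get-GrepHexAt $file.FullName 100000 32)`. Since the output contains only ASCII characters, null bytes in the source data appear as `\x00` text and should no longer truncate the line.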

Now, an extra step to make this as quick and painless as possible would be to perform automated checksum comparisons, in order to avoid doing manual comparisons once I have a list of potentially matching files. I found out that WinHex can do a “logical search” within a whole volume, meaning that for each search hit it can report the absolute offset (relative to the start of the partition) as well as the file offset (where the searched string was found within a file identified through that partition's filesystem, even if it is fragmented or NTFS-compressed). So, once I have the list of search hits, with the path / names of the files, what I would like is to :
– compute the MD5 checksum of file “A” (the one from which the search term was copied) ;
– compute the MD5 checksum of a block in file “B” (the one where a hit for the search term was found) which supposedly coincides with file “A” ;
– print the result to a report file ;
– if the MD5 checksums match, it means that file “A” is totally and exactly included in file “B”, and can therefore be deleted ; if not, either it's a false positive (the search term wasn't specific enough), or the original file was fragmented so the recovered file may contain foreign data ; in both cases it has to be manually checked.
To do that I would have to define, for each pair of files, in a loop, a block within file B starting at [offset where the hit was found in file B] - [offset where the search term was copied from file A], and ending at [starting offset] + [size of file A]. Then calculate the MD5 checksum of this block, the MD5 of file A, and report if both values match or not.
Does it seem like this can be done with a simple PowerShell script ?
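The comparison described above could be sketched roughly as follows. This is an illustration under assumptions, not a finished tool: `Get-BlockMD5` and `Test-FileInsideFile` are hypothetical helper names, the block in file B is taken to start at [hit offset in B] - [search term offset in A] and to span the size of file A, and the MD5 is computed in chunks with the standard .NET `TransformBlock` / `TransformFinalBlock` methods so that large files are never loaded whole.

```powershell
# Hypothetical helper: MD5 (as hex text) of $Length bytes of a file, starting at $Start.
function Get-BlockMD5 {
    param([string]$Path, [long]$Start, [long]$Length)

    $md5    = [System.Security.Cryptography.MD5]::Create()
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        [void]$stream.Seek($Start, [System.IO.SeekOrigin]::Begin)
        $buffer    = New-Object byte[] 1048576   # process the block in 1 MiB chunks
        $remaining = $Length
        while ($remaining -gt 0) {
            $want = [int][Math]::Min([long]$buffer.Length, $remaining)
            $read = $stream.Read($buffer, 0, $want)
            if ($read -le 0) { break }           # the file ends before the block does
            [void]$md5.TransformBlock($buffer, 0, $read, $null, 0)
            $remaining -= $read
        }
        # Finalize the hash; no extra data is added at this step.
        [void]$md5.TransformFinalBlock($buffer, 0, 0)
        return [System.BitConverter]::ToString($md5.Hash)
    }
    finally {
        $stream.Close()
    }
}

# Hypothetical helper: is file A found byte-for-byte inside file B at the implied position?
function Test-FileInsideFile {
    param([string]$PathA, [string]$PathB, [long]$HitOffsetInB, [long]$TermOffsetInA)

    $sizeA = (Get-Item -LiteralPath $PathA).Length
    $start = $HitOffsetInB - $TermOffsetInA      # where A would begin inside B
    if ($start -lt 0) { return $false }

    $hashA = Get-BlockMD5 $PathA 0 $sizeA
    $hashB = Get-BlockMD5 $PathB $start $sizeA
    return ($hashA -eq $hashB)
}
```

For each pair taken from the list of search hits, the result of `Test-FileInsideFile` could be written to a report file with `Add-Content`. Full paths are assumed for both files. A `$false` result only means the block does not match exactly, so such pairs would still need a manual check.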

To get "X bytes at Y offset" the following could be used: `$16Bytes = [System.Text.Encoding]::Default.GetString([System.IO.File]::ReadAllBytes(""),1023, 1039)` by changing it to what is required. I have set this to 16 bytes from an offset of 1024, as you specified in your example. The rest of what you describe should be pretty trivial after reading up on PowerShell loops and reading/writing file contents. — CraftyB, May 01 '20 at 11:38
Thanks for this. What does the "System.Text.Encoding" part mean ? I want it to read and write the bytes as-is, with no text encoding, as the binary content of those strings has to match exactly for WinHex to find hits where there should be. Also, where would be a good place to read about reading/writing file contents in Powershell ? Most websites related to PS assume a prior knowledge which I don't have, or are organized in such a way that each small part of a task is treated in a separate page and one has to know how it's called before reading about it, which is a bit overwhelming. — GabrielB, May 01 '20 at 19:08
Tried this but got immediately stuck : https://www.cjoint.com/c/JEbtK0zpSfA (and with the error messages in french it's even more difficult to search for an explanation / solution as most quality sources are in english). — GabrielB, May 01 '20 at 19:39
"...16 bytes at offset 1024 of the first file, write it to a new file, then add a line break" -- there's some conceptual confusion here. Binary files don't have line breaks --- that's a concept specific to text files. — Keith Miller, May 03 '20 at 05:18
Have you seen this: https://devblogs.microsoft.com/scripting/use-powershell-and-regular-expressions-to-search-binary-data/ and https://stackoverflow.com/questions/57336893/use-powershell-to-find-and-replace-hex-values-in-binary-files — Keith Miller, May 03 '20 at 06:02
@KeithMiller : I want to create a list of keywords for a "simultaneous search" with WinHex. So I want to copy a short string from a binary file, then add a line break as "0D 0A" in hexadecimal, then a short string from the next file, and so on. I'd be glad if there was an automated solution for what I want to achieve, and even more if the software used for that recovery (R-Studio 8.7) had been kind enough to not clutter the "extra found files" directories with hundreds of small files actually belonging to valid files already identified in the main recovery tree, or at least indicate it. — GabrielB, May 05 '20 at 08:12
@CraftyB : Could you please check my attempt and tell me what's wrong with it ? (I forgot to add a notification the other day.) — GabrielB, May 05 '20 at 08:14
If push comes to shove, I found a simple way to add a line break (created a text file containing only a line break, which the dsfi tool mentioned above can append to the end of the output file), but is there a way to add hex characters to a file directly from a PowerShell script ? — GabrielB, May 05 '20 at 08:18
Like this? `[Byte[]](0x10, 0x20, 0xa0) | Set-Content hex.bin -Encoding Byte` and `[Byte[]](0x30, 0x40, 0xb0) | Add-Content hex.bin -Encoding Byte`. or do you want a text file with ascii-encoded bytes? — Keith Miller, May 05 '20 at 09:53
I'm not sure I understood those commands above, and what those values (0x10, 0x20, 0xa0, 0x30, 0x40, 0xb0) are supposed to be ; perhaps it's just a random example ? As for "text file with ascii-encoded bytes", that's probably what I want, not sure of the terminology. Each line has to be, from WinHex's POV, an exact match of the string copied from the input file, so that it gets a hit if that same sequence of bytes is found in another file. Here's an example : https://www.cjoint.com/c/JEfq7SRrCvA. I can create such a list within WinHex, but with hundreds of files it would be far too tedious. — GabrielB, May 05 '20 at 17:03
@KeithMiller Alright, I tested the command above, it does work for adding a sequence of bytes to a given file (I was just confused by the random example chosen). As for my earlier test with a "foreach" loop, silly me, the directory I chose happened to be an empty copy made with Robocopy /CREATE... Tried again with a directory containing actual video files, it did something, although not what I want. I did this : https://www.cjoint.com/c/JEggJtEoAYA ; and got that : https://www.cjoint.com/c/JEggLINs8yA – for each of 7 MKV files it added 1039 null bytes followed by 2 line breaks. What's wrong ? — GabrielB, May 06 '20 at 06:39
Also, the process took much longer than needed, about a minute for a directory containing 7 files, for a total of 3.14GB ; normally reading 16 bytes at the beginning of each file, however large it may be, should be almost instantaneous. — GabrielB, May 06 '20 at 06:49
If I manage to make this work, an extra step that would make that task even more manageable would be, for each small file (A) with a match found in a large file (B), to calculate the MD5 checksum for the chunk of data corresponding to A (if a match for a string copied at offset 1024 of A with size 2048 is found at offset 4813824 of B, then calculate MD5 between 4812800 and 4814848), compare it with the MD5 for the whole small file, so that if the MD5 is equal it means that A is fully included in B and can be discarded with no need to compare them one by one. Any thoughts ? — GabrielB, May 06 '20 at 07:00
Regarding my attempt above : I checked, those MKV files each have a large chunk of empty bytes near the beginning. Tried again with `ReadAllBytes("$file"),23, 39)`, and without the line adding a "0D 0A" line break : this time I get 39 bytes for each file, which upon checking with WinHex are bytes 23 to 61 ; and a line break is added automatically. (Also it was quicker this time, don't know why, writing 1039 bytes should take much longer than 39.) So the second value seems to indicate a length, not an offset, and to get what I want I should use `ReadAllBytes("$file"),1023, 16)` instead. Works. — GabrielB, May 06 '20 at 07:22
Did a test with the list thus generated and those MKV files all open in WinHex : it does find the strings, but only displays one hit in the "Position manager" list, since all the hits are found at the same offset in each file. It would probably work as intended if the search was being performed on the whole partition. I could copy the files of interest (VOB, MTS) to a smaller test partition and run the search there instead of the whole 4TB HDD where that recovery is currently stored. I'm still interested by any trick that would make this faster and more "unattended". Thanks. — GabrielB, May 06 '20 at 07:51
Then I can export the list of hits from WinHex in an easily editable format, but if I want to run MD5 computations as stated above, and I have to run the string search on a whole partition, when there's a hit it will not tell me at which offset of which individual file, so I would have to run computations on the whole partition as well, and it would work only if there's no fragmentation (for file A to be read as a contiguous chunk in the data stream corresponding to file B). Otherwise I'd have to check each file one by one on a rainy day... — GabrielB, May 06 '20 at 08:16
I just tried to catch up reading through these comments & I'm lost. You should edit your original question to reflect your progress. You should include relevant screenshots, actual code, and sample data. You shouldn't expect everyone to be familiar with HexBytes and what data it takes. And since you seem uncertain about the difference between byte values and ascii-encoded hex, screenshots are probably best. But the two links to screenshots you did post would be more useful if you copied the text into a code block in your question. — Keith Miller, May 06 '20 at 10:36
@KeithMiller I edited the original question as per your suggestions. It seems to be getting somewhere... As I did further testing, I found out that 1) WinHex can perform a “logical search” on a whole volume, which circumvents the potential problem exposed in a previous comment above ; 2) a search in ASCII seems to fail if a string contains null bytes, so it's more reliable to use the “direct byte-wise translation for GREP” mode. Therefore, I would need the PS script to convert the sequence of bytes from the form “QhNèJ” to the form “x51\x68\x4E\xE8\x4A”. — GabrielB, May 07 '20 at 00:42
Here's one way: `([byte[]](0x51,0x68,0x4E,0xE8,0x4A)).ForEach({'x{0:x2}' -f $_ }) -join '\'` — Keith Miller, May 07 '20 at 01:00
I don't understand what the command above is supposed to do. If I execute it alone I get an error. And those particular values were just an example in my earlier post, a command processing all files in a folder shouldn't include specific hex values found at a specific offset of a specific file. The `[System.BitConverter]::ToString($StringBytes)` command mentioned in the devblogs.microsoft.com article works and prints “48-65-6C-6C-6F”. I could edit that to get the required output with “\x”. But I can't use it in my current script (see edit in initial question for details). What is missing ? — GabrielB, May 07 '20 at 06:26
Again, a month later, I still haven't found a way to convert bytes read from a short segment at the beginning of a file (without reading the whole file) to hex values. I downloaded several PDF books covering basic and advanced use of PowerShell, but I couldn't find what I'm looking for with keywords like "readallbytes" or "system.bitconverter" -- problem is, I don't even know how to define what I'm looking for, or if it exists. Hence why I asked a question, but apparently it's considered bad practice here to exchange comments as in a conversation, which makes it nearly impossible to find a solution. — GabrielB, Jun 06 '20 at 06:25
