
I have over 10000 files in a folder. I was using an Rscript to preprocess the files. It displayed an error:

Error in read.table(wd, comment.char ="#", header=T, sep='\t'): empty beginning of file

When I opened the file in a text editor, it appeared empty, although its size was around 4 MB. When I then opened it in Notepad++, the content was displayed as NULL NULL NULL ... NULL

File example

I want to move these kind of files from the folder to another folder. How can I accomplish this?

Destroy666
svp
  • (1) "these kind of files" – What kind exactly? Files containing null bytes only? Files containing at least N (how many?) null bytes at the beginning? Files beginning with a null byte? Files containing at least one null byte? (2) What is the OS? (3) By "from the folder", do you mean "from the folder and subfolders"? – Kamil Maciorowski May 24 '23 at 05:52
  • (1+) Or strictly files resulting in the `empty beginning of file` error from `read.table(wd, comment.char ="#", header=T, sep='\t')`? – Kamil Maciorowski May 24 '23 at 05:58
  • @KamilMaciorowski I want to move all non-text files – svp May 24 '23 at 06:00
  • [In terms of POSIX](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403)? I mean: should we pay attention to `{LINE_MAX}`? and incomplete lines? Or to null characters only? What is the OS? – Kamil Maciorowski May 24 '23 at 06:04
  • OS is Ubuntu. The files whichever is not `ASCII` need to move to separate directory. – svp May 24 '23 at 06:22
  • The null character belongs to the ASCII set… OK, I think I get it anyway. – Kamil Maciorowski May 24 '23 at 06:25
  • Please clarify what "keep" means. Just move them once or keep moving them? Please edit your post accordingly, I changed it to "move" for now. Also, did you attempt anything? Keep in mind that this is not a free scripting service and you'll much more likely receive help if you include some script that you tried to write/use. – Destroy666 May 24 '23 at 08:06

1 Answer


Testing a single file

grep in the following command will return exit status 0 if some_file contains at least one null character:

<some_file tr -dc '\0' | tr '\0' '\n' | grep -q ''

Unless the shell option pipefail is set, the exit status of grep will become the exit status of the whole pipeline, if trs exit. pipefail is unset by default and you want it this way (see what may happen otherwise).

I wrote "if trs exit" because after grep exits, the second tr needs to write something in order to get SIGPIPE; then the first tr needs to write something in order to get SIGPIPE; only then is the pipeline considered terminated. The first tr may keep reading even though grep has exited early and the outcome is already known. If some_file is a special file generating a never-ending stream of bytes (similar to e.g. /dev/urandom) and there are not enough null bytes in the stream, then the pipeline will never exit. For a regular file, the worst-case scenario is that the first tr exits only after reading the whole file, so the trs will certainly exit eventually.
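To make the test easier to try out, the pipeline can be wrapped in a small function and run against two throwaway files. This is only a sketch: the function name has_nul and the sample file names are made up for illustration, and GNU tr (as shipped with Ubuntu) is assumed.

```shell
#!/bin/bash
# has_nul FILE: exit status 0 if FILE contains at least one null byte.
# (has_nul is a hypothetical name, not part of any tool.)
has_nul() {
    <"$1" tr -dc '\0' | tr '\0' '\n' | grep -q ''
}

# Demo on two temporary files: one plain TSV, one made of null bytes only.
tmpdir=$(mktemp -d)
printf 'a\tb\n1\t2\n' >"$tmpdir/text.tsv"
printf '\0\0\0\0'     >"$tmpdir/nulls.tsv"

for f in "$tmpdir"/*; do
    if has_nul "$f"; then
        echo "${f##*/}: contains null bytes"
    else
        echo "${f##*/}: no null bytes"
    fi
done
rm -r "$tmpdir"
```

Because pipefail is not set here, the status returned by has_nul is the status of grep.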

This answer of mine explains a trick you can use to speed things up. In your case the trick will leave tr(s) in the background. Since you're going to test many files, piling up trs is not a good idea.

In practice it's often enough to test the very beginning of a file. The following command will read up to 2 KiB of some_file and analyze only this part:

head -c 2048 some_file | tr -dc '\0' | tr '\0' '\n' | grep -q ''

Alternatively, you can use the file command; for a big file it won't read the whole file either. The following command returns exit status 0 if file --mime-type does not print text/whatever:

! file --brief --mime-type some_file | grep -q 'text/'

I expect the two commands to agree in the vast majority of cases; there may be files where they differ, though.
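To see where the two tests disagree on a particular set of files, both can be run side by side. The following is a sketch; the function names nul_in_head and not_text_mime are made up for illustration. One likely disagreement is an empty file: it contains no null bytes, yet file is not expected to report text/* for it.

```shell
#!/bin/bash
# Print, for each file given as an argument, the verdict of both tests.
# Function names are hypothetical.

nul_in_head() {      # exit 0 if a null byte occurs within the first 2 KiB
    head -c 2048 "$1" | tr -dc '\0' | tr '\0' '\n' | grep -q ''
}

not_text_mime() {    # exit 0 if file(1) does not report a text/* MIME type
    ! file --brief --mime-type "$1" | grep -q 'text/'
}

for f in "$@"; do
    if nul_in_head "$f";   then a=yes; else a=no; fi
    if not_text_mime "$f"; then b=yes; else b=no; fi
    printf '%s\tnul_in_head=%s\tnot_text_mime=%s\n' "$f" "$a" "$b"
done
```

Saved under a name of your choice and run as e.g. bash that_script ./*, it prints one line per file; lines where the two columns differ are the ones worth inspecting by hand.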


Testing many files (and moving accordingly)

This snippet will loop over files in the current working directory, test regular files and move them accordingly:

#!/bin/bash
(
shopt -s nullglob
for f in ./*; do
   [ -f "$f" ] \
   && ! [ -L "$f" ] \
   && head -c 2048 "$f" | tr -dc '\0' | tr '\0' '\n' | grep -q '' \
   && mv -v "$f" /target/directory/
done
)
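Before moving anything, it may be worth doing a dry run that only lists the files that would be moved. A minimal sketch, assuming the same head-based test; the function name list_candidates is made up, and the directory argument is an addition for convenience:

```shell
#!/bin/bash
# list_candidates [DIR]: print regular, non-symlink files in DIR (default: .)
# whose first 2 KiB contain a null byte. Nothing is moved.
# The function body is a subshell, so cd and shopt do not leak out.
list_candidates() (
    shopt -s nullglob
    cd -- "${1:-.}" || exit 1
    for f in ./*; do
        [ -f "$f" ] \
        && ! [ -L "$f" ] \
        && head -c 2048 "$f" | tr -dc '\0' | tr '\0' '\n' | grep -q '' \
        && printf '%s\n' "$f"
    done
    true    # do not report failure just because the last file was text
)
```

Running list_candidates /path/to/folder and reviewing the list first costs little compared with moving thousands of files back.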

Notes:

  • Create /target/directory/ beforehand.

  • You can use the other test. The relevant line will be:

       && ! file --brief --mime-type "$f" | grep -q 'text/' \
    
  • The subshell (…) is in case you want to paste the code into an interactive shell. Thanks to the subshell, the code won't change anything in your current shell.

  • Normally * does not match hidden files. Append dotglob to the shopt -s line to make * match hidden files.

  • If you want recursiveness, append globstar to the shopt -s line and use ./** instead of ./*. Be careful: if files in different directories have identical names, you may lose data; consider mv -i.

  • We want to conditionally move regular files. [ -f "$f" ] checks if we're dealing with a regular file; but it also succeeds for a symlink to (a symlink to (a symlink to (…))) a regular file. This is the reason we additionally check if the file is not a symlink (! [ -L "$f" ]). If you want the code to treat symlinks to regular files like regular files then delete the whole line containing [ -L (including the terminating newline character).

  • In general, the commands and code in this answer are not portable. You tagged and said the OS is Ubuntu, so I made use of what Bash and Ubuntu provide.

  • A solution with find is possible. Each of our tests is a pipeline, so find would have to spawn shell(s) anyway.
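For completeness, the find-based variant mentioned in the last note could look roughly like the sketch below. The function name move_with_find and its two arguments (source directory, target directory) are made up for illustration; -maxdepth is assumed to be available, as it is in GNU find on Ubuntu. -type f does not match symlinks, so no separate symlink check is needed, and find spawns sh once per batch of files rather than once per file.

```shell
#!/bin/bash
# move_with_find SRC DEST: move regular files found directly in SRC whose
# first 2 KiB contain a null byte into the existing directory DEST.
# (Hypothetical helper; remove -maxdepth 1 to descend into subdirectories.)
move_with_find() {
    find "$1" -maxdepth 1 -type f -exec sh -c '
        dest=$1; shift
        for f do
            if head -c 2048 "$f" | tr -dc "\0" | tr "\0" "\n" | grep -q ""; then
                mv -v "$f" "$dest/"
            fi
        done
    ' find-sh "$2" {} +
}
```

Usage would be along the lines of move_with_find . /target/directory, with the target directory created beforehand.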

Kamil Maciorowski