
I have a find -exec grep command pair that puts path/filename.ext:ln#:line contents together on a single line. I want to divide each such line into two consecutive lines in a second file. The consecutive lines are:

path/filename.ext:ln#
contents of the line itself

I could write a program to do it, but I wondered if there was already a command that would do this.

αғsнιη
oldefoxx

2 Answers


Your question and my understanding of it

Your question currently lacks concrete examples of input and desired output, hence I will try to answer it as I understand it, and edit accordingly when you provide more info.

The way I understand your question right now is that you are running something along the following lines:

find /path/to/directory -exec grep -H -n 'SomeString' {} \;

Which produces a result that is something like this:

$ find /home/$USER/fortesting -type f -exec grep -H -n 'HelloWorld' {} \;              
/home/serg/fortesting/file3:1:HelloWorld
/home/serg/fortesting/file1:4:HelloWorld

Or in general /path/to/file:lineNumber:String

Possible solutions

Appropriately enough, this is a job for awk: you have 3 fields separated by colons (the field separator), which translates into the awk code awk -F":" '{printf $1 FS $2 FS "\n" $3 "\n" }'. Thus we can do the following:

$ find /home/$USER/fortesting -type f -exec grep -H -n 'HelloWorld' {} \; | awk -F ":" '{printf $1 FS $2 FS "\n" $3 "\n" }'       
/home/xieerqi/fortesting/file3:1:
HelloWorld
/home/xieerqi/fortesting/file1:4:
HelloWorld

Now, awk is a versatile tool; we can mimic the output of find -exec grep with find -exec awk '(awk code here)', whose output comes out already processed, which saves on piping.

Consider the code below:

$ find $PWD -type f -exec awk  '/HelloWorld/ {print FILENAME":"FNR"\n"$0 }' {} \;                                                  
/home/xieerqi/fortesting/file3:1
HelloWorld
/home/xieerqi/fortesting/file1:4
HelloWorld

Less piping, and the contents are processed as they are found. In addition, if a file has a colon in its name, this code will still handle it correctly, since we are not depending on field separators but rather printing the variable FILENAME followed by a colon, followed by FNR (the input record number in the current input file), and the matched line separated by a newline.
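
For instance, here is a quick illustration with a made-up file name that contains a colon (the file my:file below is hypothetical, just for the demonstration):

$ printf 'HelloWorld\n' > 'my:file'
$ find . -type f -name 'my*' -exec awk '/HelloWorld/ {print FILENAME":"FNR"\n"$0 }' {} \;
./my:file:1
HelloWorld

The header line still carries the extra colon from the name, but the split into two lines comes out right, because awk never re-splits the line on colons.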

Efficiency

Now, let's consider efficiency as the number of files grows large. First, I create files file1 through file1000, and then we use /usr/bin/time to test each version of the command.

$ echo 'HelloWorld' | tee file{1..1000}
$ /usr/bin/time find /home/$USER/fortesting -type f -exec grep -H -n 'HelloWorld' {} \; | awk -F ":" '{printf $1 FS $2 FS "\n" $3 "\n" }'  > /dev/null
0.04user 0.34system 0:03.09elapsed 12%CPU (0avgtext+0avgdata 2420maxresident)k
0inputs+0outputs (0major+113358minor)pagefaults 0swaps

$ /usr/bin/time find $PWD -type f -exec awk  '/HelloWorld/ {print FILENAME":"FNR"\n"$0 }' {} \; > /dev/null                        
0.82user 2.03system 0:04.25elapsed 67%CPU (0avgtext+0avgdata 2856maxresident)k
0inputs+0outputs (0major+145292minor)pagefaults 0swaps

So the lengthier pipeline version seems to be more efficient: it takes less time and a lower CPU percentage.

Now, here is a compromise: change \; to +:

/usr/bin/time find $PWD -type f -exec awk '/HelloWorld/ {print FILENAME":"FNR"\n"$0 }' {} +

What does the + operator do? The big difference is that + tells -exec to pass as many found files as possible to a single awk invocation, while \; makes awk be called once for each and every found file.
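
A quick way to see the difference, using echo as a stand-in for awk (and assuming, for the sake of the example, a directory that contains just file1 and file2):

$ find . -type f -name 'file?' -exec echo {} \;
./file1
./file2
$ find . -type f -name 'file?' -exec echo {} +
./file1 ./file2

With \; the command ran twice, once per file; with + it ran once, receiving both files as arguments.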

$ /usr/bin/time find $PWD -type f -exec awk  '/HelloWorld/ {print FILENAME":"FNR"\n"$0 }' {} + > /dev/null                         
0.00user 0.02system 0:00.02elapsed 74%CPU (0avgtext+0avgdata 3036maxresident)k
0inputs+0outputs (0major+398minor)pagefaults 0swaps

Hey, much faster, right? Though still heavy on CPU.

Outputting to another file

As for outputting to another file, use the > operator for redirection.
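
For example (results.txt here is just an arbitrary output file name):

find $PWD -type f -exec awk '/HelloWorld/ {print FILENAME":"FNR"\n"$0 }' {} + > results.txt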

Sergiy Kolodyazhnyy

sed readily does that:

$ echo 'path/filename.ext:ln#:line contents' | sed -r 's/([^:]*:[^:]*):/\1\n/'
path/filename.ext:ln#
line contents

The regex ([^:]*:[^:]*): looks for the first two colon-separated fields and saves them in group 1. The replacement text, \1\n, places a newline after those two fields.
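
Because only the first two fields are captured, any further colons in the line contents are left alone. For example, with a made-up line whose contents also contain a colon:

$ echo 'path/filename.ext:42:key: value' | sed -r 's/([^:]*:[^:]*):/\1\n/'
path/filename.ext:42
key: value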

Improvement

If a file name itself contains a colon, this will, of course, give incorrect results. As steeldriver suggests, this can be avoided using the -Z option to grep, which puts a NUL character, \x00, instead of a colon after the file name. For example:

grep -ZHn 'regex' * | sed -r 's/\x00([^:]*):/:\1\n/'

Or, if the capabilities of find are required:

find . -type f -exec grep -ZHn 'regex' {} + | sed -r 's/\x00([^:]*):/:\1\n/'

This will work even if colons appear in the file name, or the line matched, or both.
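
A quick check, using a made-up file name that contains a colon (odd:name.txt is hypothetical):

$ printf 'HelloWorld\n' > 'odd:name.txt'
$ grep -ZHn 'HelloWorld' 'odd:name.txt' | sed -r 's/\x00([^:]*):/:\1\n/'
odd:name.txt:1
HelloWorld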

John1024
  • To make it work for arbitrary filenames you could add `-Z` to the grep options, which delimits the filename with a NUL byte instead of a colon, and then match everything from the NUL to the first colon e.g. `sed -r 's/\x00([^:]*:)/\1\n/'` – steeldriver Sep 03 '15 at 03:53
  • @steeldriver Excellent! Answer updated to handle file names with colons. – John1024 Sep 03 '15 at 04:07