
What would be the most efficient method (no repeated command execution) to remove items listed in one file from another (unordered) file?

One can easily get the list of non-matching items in the second file by

cat first_file.txt second_file.txt | sort | uniq -u

but that also contains all the unmatched items from the first file. Now what?
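To see the problem concretely (file contents here are illustrative):

```shell
printf 'apple\nbanana\ncherry\n' > first_file.txt
printf 'banana\ndate\n' > second_file.txt

cat first_file.txt second_file.txt | sort | uniq -u
# prints:
# apple
# cherry
# date
```

The desired result is just `date`, but `apple` and `cherry` (lines of the first file with no match in the second) leak into the output.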

dronus

1 Answer


This awk program takes a single pass through each file:

awk '
    # first file: remember each line as an array key
    NR == FNR {f1[$0] = 1; next}
    # second file: print only lines not seen in the first
    !($0 in f1)
' file1 file2
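For example, with throwaway files (names and contents are illustrative):

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\nelderberry\n' > file2

# print lines of file2 that do not appear in file1
awk 'NR == FNR {f1[$0] = 1; next} !($0 in f1)' file1 file2
# prints:
# date
# elderberry
```

Note that this preserves the original order of file2 and needs no sorting.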

comm is also useful for this job. It does require its input files to be sorted:

# output lines unique to file2
comm -13 <(sort file1) <(sort file2)
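A quick sketch on the same kind of data (file names are assumptions; process substitution `<(…)` needs bash or zsh):

```shell
printf 'cherry\napple\nbanana\n' > file1               # deliberately unsorted
printf 'date\nbanana\ncherry\nelderberry\n' > file2    # deliberately unsorted

# -1 suppresses lines unique to file1, -3 suppresses common lines,
# leaving only lines unique to file2
comm -13 <(sort file1) <(sort file2)
# prints:
# date
# elderberry
```

Unlike the awk version, the output comes out in sorted order rather than in file2's original order.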
glenn jackman
  • I have no idea how `!($0 in f1)` works internally, I mean __in__ inside _awk_. If it simply scans the whole array, we'd have O(n!) there. :-| _sort_ seems to be [highly optimized](http://vkundeti.blogspot.ru/2008/03/tech-algorithmic-details-of-unix-sort.html)... Do you have any reference for this? – Hastur Sep 16 '15 at 15:53
  • the `in` operator tests if the left-hand operand is an index of the (associative or indexed) array. It should be an O(1) operation. For gawk, documented here: http://www.gnu.org/software/gawk/manual/html_node/Reference-to-Elements.html#Reference-to-Elements – glenn jackman Sep 16 '15 at 16:52
  • Thanks for the reference. If __`in`__ had to scan the full array `f1` rather than look up one element, we'd get O(n^2) [BTW, errata for the previous comment: O(n^2), not O(n!)]. I ran a test with 10^4 up to 10^6 random strings of 32 bytes and the `awk` solution scales linearly: there must be some ordering inside. (The `comm` solution varies more: 2x at 10^4, ~1x at 10^5 and 2x at 10^6, but I suppose that depends on the memory available.) – Hastur Sep 16 '15 at 17:52
  • Cool, I didn't know about `comm`. – dronus Sep 16 '15 at 20:45
  • the same thing can be done with `grep` like this: `grep -v -f <(command1) <(command2)` – Andy Jan 18 '19 at 01:53
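A caveat on the `grep` approach mentioned above: without extra flags, `-f` treats each line of the pattern file as a regular expression and matches substrings. For exact whole-line removal, `-F` (fixed strings) and `-x` (match whole lines) are safer. A sketch with illustrative file names:

```shell
printf 'apple\nbanana\n' > file1
printf 'banana\npineapple\napple pie\n' > file2

# without -F/-x, the pattern "apple" also filters "pineapple" and "apple pie"
grep -v -f file1 file2
# prints nothing

# with -x and -F, only exact full-line matches are removed
grep -vxFf file1 file2
# prints:
# pineapple
# apple pie
```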