
What would be the most efficient method (no repeated command execution) to remove items listed in one file from another (unordered) file?

One can easily get the list of non-matching items in the second file by

cat first_file.txt second_file.txt | sort | uniq -u

but that also contains all the unmatched items from the first file. Now what?
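To see the problem concretely (file contents here are illustrative):

```shell
printf 'apple\nbanana\ncherry\n' > first_file.txt
printf 'banana\ndate\n' > second_file.txt

cat first_file.txt second_file.txt | sort | uniq -u
# prints:
# apple
# cherry
# date
```

The desired result is just `date`, but `apple` and `cherry` (lines of the first file with no match in the second) leak into the output.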

dronus

1 Answer


This awk program takes a single pass through each file:

awk '
    # first file: remember each line as an array key
    NR == FNR {f1[$0] = 1; next}
    # second file: print only lines not seen in the first
    !($0 in f1)
' file1 file2
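For example, with throwaway files (names and contents are illustrative):

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\nelderberry\n' > file2

# print lines of file2 that do not appear in file1
awk 'NR == FNR {f1[$0] = 1; next} !($0 in f1)' file1 file2
# prints:
# date
# elderberry
```

Note that this preserves the original order of file2 and needs no sorting.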

comm is also useful for this job. It does require its input files to be sorted:

# output lines unique to file2
comm -13 <(sort file1) <(sort file2)
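A quick sketch on the same kind of data (file names are assumptions; process substitution `<(…)` needs bash or zsh):

```shell
printf 'cherry\napple\nbanana\n' > file1               # deliberately unsorted
printf 'date\nbanana\ncherry\nelderberry\n' > file2    # deliberately unsorted

# -1 suppresses lines unique to file1, -3 suppresses common lines,
# leaving only lines unique to file2
comm -13 <(sort file1) <(sort file2)
# prints:
# date
# elderberry
```

Unlike the awk version, the output comes out in sorted order rather than in file2's original order.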
glenn jackman
  • I have no idea how `!($0 in f1)` works internally, I mean __in__ inside _awk_. If it simply scans the whole array, we'd have O(n!) there. :-| _sort_ seems to be [highly optimized](http://vkundeti.blogspot.ru/2008/03/tech-algorithmic-details-of-unix-sort.html)... Do you have any reference for this? – Hastur Sep 16 '15 at 15:53
  • the `in` operator tests if the left-hand operand is an index of the (associative or indexed) array. It should be an O(1) operation. For gawk, documented here: http://www.gnu.org/software/gawk/manual/html_node/Reference-to-Elements.html#Reference-to-Elements – glenn jackman Sep 16 '15 at 16:52
  • Thanks for the reference. If __`in`__ had to scan the full array `f1` rather than look up one element, we'd get O(n^2) [BTW, errata for the previous comment: O(n^2), not O(n!)]. I ran a test with 10^4 up to 10^6 random strings of 32 bytes and the `awk` solution scales linearly: there must be some ordering inside. (The `comm` solution varies more: 2x at 10^4, ~1x at 10^5 and 2x at 10^6, but I suppose that depends on the memory available.) – Hastur Sep 16 '15 at 17:52
  • Cool, I didn't know about `comm`. – dronus Sep 16 '15 at 20:45
  • the same thing can be done with `grep` like this: `grep -v -f <(command1) <(command2)` – Andy Jan 18 '19 at 01:53
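A caveat on the `grep` approach mentioned above: without extra flags, `-f` treats each line of the pattern file as a regular expression and matches substrings. For exact whole-line removal, `-F` (fixed strings) and `-x` (match whole lines) are safer. A sketch with illustrative file names:

```shell
printf 'apple\nbanana\n' > file1
printf 'banana\npineapple\napple pie\n' > file2

# without -F/-x, the pattern "apple" also filters "pineapple" and "apple pie"
grep -v -f file1 file2
# prints nothing

# with -x and -F, only exact full-line matches are removed
grep -vxFf file1 file2
# prints:
# pineapple
# apple pie
```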