7

How can I remove duplicates in each line, for example here?

1 1 1 2 1 2 3
5 5 4 1 2 3 3

I'd like to get this output:

1 2 3 
5 4 1 2 3

There are lots of lines (100,000) and in each line I want unique values. Perl might be the fastest, but how can I do it in Perl or Bash?

slhck
Arash

3 Answers

13

Here is an option using awk:

awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ""; i=split("",a); print ""}' infile > outfile

Edit: Updated with comments:

  1. while (++i<=NF)

    Starts the while loop, pre-incrementing i so that the first field examined is $1 ($0 is the full line in awk).

    The loop therefore starts at $1 (the first field) and runs to the end of the line, that is, while i is less than or equal to NF, awk's built-in "Number of Fields" variable. By default, fields are separated by whitespace; you can change the separator with the -F option or the FS variable.

  2. printf (!a[$i]++) ? $i FS : ""

    This is a ternary operation.

    So, if the value is not yet in the array (the test !a[$i]++), it prints $i followed by the field separator FS; if it is already there, it prints "". (You could remove the ! and swap $i FS and "" if you don't like it this way.)

  3. i=split("",a)

    Normally, that's a null split. In this case, it empties the array a and returns 0, which resets i for the next line.

  4. print ""

    Ends the line of output (I'm not 100% sure why, actually); otherwise you would have an output of:

    1 2 3 5 4 1 2 3 instead of
    1 2 3
    5 4 1 2 3
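The four steps above can also be spelled out as a longer, more readable awk script. This is only a sketch of the same idea, not the answer's exact one-liner: the function name dedup_fields and the variable names seen and out are made up for illustration, and it joins the kept values with OFS so that no trailing space is printed.

```shell
# dedup_fields is a hypothetical helper name, not part of the original answer.
dedup_fields() {
    awk '{
        split("", seen)               # clear the per-line lookup table
        out = ""
        for (i = 1; i <= NF; i++)     # walk the fields $1 .. $NF
            if (!seen[$i]++)          # true only the first time a value appears
                out = out (out == "" ? "" : OFS) $i
        print out                     # print appends the newline itself
    }' "$@"
}

printf '1 1 1 2 1 2 3\n5 5 4 1 2 3 3\n' | dedup_fields
# 1 2 3
# 5 4 1 2 3
```

With a file, the call would be dedup_fields infile > outfile.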

nerdwaller
  • To help current and future readers, please try to document answers to some extent. This is compact and efficient, but it is quite unreadable for someone not very used to `awk` since it relies on test and operation order, the ternary operator, the `split("",a)` quirk to reset an array (and its return value for resetting `i`) and special variables `NF` and `FS`. Such an explanation makes an answer even better! – Daniel Andersson Dec 19 '12 at 17:57
  • @DanielAndersson My apology for being lazy, updated. Thanks! – nerdwaller Dec 19 '12 at 19:01
  • nerdwaller: the reason you get 1 2 3 5 4 1 2 3 w/o step 4 is that all your output is done via printf, w/ no \n ever thrown in ... – tink Dec 19 '12 at 20:24
  • Step 2 works since it increments the array element indexed by the current number. If that element was unset, the test returns `!false`, and the increment is done _after_ the comparison. The next time the loop finds the same number, the comparison will return `!true` since the element was set to a value the last time. The element is incremented again, but this "total count" is not used later (it doesn't hurt, though). – Daniel Andersson Dec 19 '12 at 20:26
  • In step 3, the array `a` is deleted for the next line iteration. `split("",a)` is a shorthand to delete an array `a` (see [the documentation](http://www.gnu.org/software/gawk/manual/html_node/Delete.html#fn-1) for a notice). As a side effect, this operation also returns `0`, and since `i` should be set to `0` for the next iteration, the `split()` call is used for the assignment instead of a separate `i=0` statement, saving some characters (at the expense of readability, perhaps). – Daniel Andersson Dec 19 '12 at 20:31
5

Since Ruby comes with any Linux distribution I know of:

ruby -e 'STDIN.readlines.each { |l| l.split(" ").uniq.each { |e| print "#{e} " }; print "\n" }' < test

Here, test is the file that contains the elements.

To explain what this command does (although Ruby can almost be read from left to right):

  • Read the input (which comes from < test through your shell)
  • Go through each line of the input
  • Split the line on whitespace into an array (split(" ") treats a run of whitespace as a single separator)
  • Get the unique elements from this array (uniq keeps them in order of first appearance)
  • For each unique element, print it, including a space (print "#{e} ")
  • Print a newline once we're done with the unique elements
slhck
2

Not pure bash, but ...:

while read -r line; do
    printf "%s\n" $line | sort -u | tr '\n' ' '
    echo ''
done < file

The values within each line will be sorted as a byproduct, so the original order of first appearance is not preserved.
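
If the order of first appearance matters, as in the output the question asks for, a shell-only loop can keep the first occurrence of each value instead of sorting. The following is a minimal sketch under a few assumptions: the name dedup_lines is hypothetical, values are separated by spaces or tabs, and each line is short enough that re-scanning the kept values for every word is acceptable. It starts no external process per line, but it will probably still be slower than awk on very large files.

```shell
# dedup_lines is a hypothetical helper name, not part of the original answer.
# The body runs in a subshell, so set -f and the variables do not leak out.
dedup_lines() (
    set -f                                    # no filename expansion on unquoted $line
    while IFS= read -r line || [ -n "$line" ]; do
        out=""
        for word in $line; do                 # unquoted on purpose: split on whitespace
            case " $out " in
                *" $word "*) ;;               # value already kept, skip it
                *) out="${out:+$out }$word" ;;
            esac
        done
        printf '%s\n' "$out"
    done
)

printf '1 1 1 2 1 2 3\n5 5 4 1 2 3 3\n' | dedup_lines
# 1 2 3
# 5 4 1 2 3
```

With a file, the call would be dedup_lines < file.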

glenn jackman