How do I make uniq only consider the first field?

Question

I am using FreeBSD 3.2-RELEASE.

If I have some sorted text, like this last output—

zikla13:Oct:20:22:34
zikla13:Oct:5:00:31
zikla14:Oct:17:22:01
zikla14:Oct:12:23:35
zikla14:Oct:12:23:34
zikla14:Oct:12:00:11
zikla14:Oct:11:23:52
zikla14:Oct:5:22:22
zilka13:Oct:13:23:48
zilka13:Oct:11:00:28
zilka13:Oct:9:22:40

—is there a way to get uniq -c to only consider the first field (maybe with -s)? In this case, the output should be this:

2 zikla13:Oct:20:22:34
6 zikla14:Oct:17:22:01
3 zilka13:Oct:13:23:48

Or is there some other way, for example using awk?

Welcome to Super User! I've [edited your question](https://superuser.com/review/suggested-edits/455721) for clarity and tag relevance. Please note that this site (and [the others like it](https://stackexchange.com/sites)) focuses on asking and answering questions; things like “thanks” in posts are discouraged in favor of [upvoting](https://superuser.com/help/why-vote) and [accepting](https://superuser.com/help/accepted-answer) helpful answers. — Blacklight Shining, Oct 27 '15 at 22:26
There are multiple different implementations of `uniq`—in particular, the GNU `uniq` (found on most Linux-based systems) differs from the uniq found on BSDs (including Mac OS X). Please [edit your question](https://superuser.com/posts/992668/edit) to indicate which `uniq` implementation you're asking about. — Blacklight Shining, Oct 27 '15 at 22:29

blm · Accepted Answer · 2015-10-27T22:41:16.477

With GNU uniq, which supports the -w option:

$ cat data
zikla13:Oct:20:22:34
zikla13:Oct:5:00:31
zikla14:Oct:17:22:01
zikla14:Oct:12:23:35
zikla14:Oct:12:23:34
zikla14:Oct:12:00:11
zikla14:Oct:11:23:52
zikla14:Oct:5:22:22
zilka13:Oct:13:23:48
zilka13:Oct:11:00:28
zilka13:Oct:9:22:40
$ uniq -c -w7 data
  2 zikla13:Oct:20:22:34
  6 zikla14:Oct:17:22:01
  3 zilka13:Oct:13:23:48

As pointed out in the comments, this assumes the first field is always seven characters long. That holds in your example, but if it doesn't hold in your real data, I don't think uniq alone can do it (and if you don't have GNU uniq, even -w won't work). So here's a perl solution:

$ perl -ne '/(.*?):(.*)/;unless (exists $x{$1}){$x{$1}=[0,$2];push @x, $1};$x{$1}[0]++;END{printf("%8d %s:%s\n",$x{$_}[0],$_,$x{$_}[1]) foreach @x}' <data
       2 zikla13:Oct:20:22:34
       6 zikla14:Oct:17:22:01
       3 zilka13:Oct:13:23:48

Here's how that works:

$ perl -ne

Run perl with -n, which loops over the input lines without printing them by default, and -e, which uses the next argument as the script.

/(.*?):(.*)/

Split the input line at the first colon, capturing the text before it in $1 and the text after it in $2. The .*? is non-greedy, so it stops at the first colon. split would work here as well.

unless (exists $x{$1}){$x{$1}=[0,$2];push @x, $1}

The hash %x is used to deduplicate the lines, and the array @x keeps the keys in the order they were first seen (you could just use sort keys %x, but that assumes perl's sort orders the keys the same way the input is sorted). So if we've never seen the current "key" (the text before the first colon), initialize a hash entry for the key and push the key onto @x. The hash entry for each key is a two-element array holding the count and the first value seen after the colon, so the output can include that value.

$x{$1}[0]++

Increment the count.

END{

Start a block that will be run after all the input has been read.

printf("%8d %s:%s\n",$x{$_}[0],$_,$x{$_}[1])

Print the count right-aligned in an eight-character field, then a space, the "key", a colon, and the text that followed the colon.

foreach @x}

Do that for each key seen, in order, and close the END block.

<data

Read from the file called data in the current directory to get the input. You could also just pipe into perl if you have some other command or pipeline producing the data.
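If only the per-key counts are needed, and not the rest of the first line of each group, a simpler sketch is to isolate the key with cut before handing it to uniq. The function name count_keys is made up for illustration, and this assumes the input is already grouped by first field, as in the question. It uses only uniq -c, so it should not depend on GNU extensions such as -w.

```shell
# Hypothetical helper (the name is an assumption, not from the answer):
# count runs of the first colon-separated field read from standard input.
count_keys() {
    # cut keeps only the text before the first colon;
    # uniq -c then counts adjacent identical keys.
    cut -d: -f1 | uniq -c
}

# Small demo on three lines in the question's format.
printf '%s\n' \
    'zikla13:Oct:20:22:34' \
    'zikla13:Oct:5:00:31' \
    'zikla14:Oct:17:22:01' | count_keys
```

With the sample file this would be run as count_keys <data. The amount of padding in front of each count differs between uniq implementations.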

This will cause `uniq` to consider only the first seven _characters._ It will work for the asker's example, but it will likely break if the first field is not exactly seven characters long. — Blacklight Shining, Oct 27 '15 at 22:32
@BlacklightShining Good point. I'll add a perl solution that treats the characters up through the : as the field to uniq on, regardless of their length. — blm, Oct 27 '15 at 22:33
I get `uniq: illegal option -- w`. Sorry, my mistake: the `uniq` on FreeBSD 3.2-RELEASE doesn't support `-w`. — Da No, Oct 27 '15 at 22:36
Yeah, when you added that you were using FreeBSD, I figured `-w` wouldn't work. I've added a perl version, though, that should work anywhere and doesn't rely on the "key" being 7 characters. — blm, Oct 27 '15 at 22:42

roaima · Answer 2 · 2015-10-28T00:46:10.820

I'd use awk. Count the occurrences of the first colon-separated field; when the field changes or we hit EOF, print the saved first line of the previous group along with its count:

awk -F: '!seen[$1]++ { line[$1]=$0; if(prev){printf "%d\t%s\n",seen[prev],line[prev]}; prev=$1} END {if(prev){printf "%d\t%s\n",seen[prev],line[prev]}}' data

The awk script can be expanded out like this:

# Count the occurrences of the first field. If first time then...
!seen[$1]++ {
    # save the line
    line[$1]=$0;
    # maybe print the previous line
    if (prev) {
        printf "%d\t%s\n", seen[prev], line[prev]
    };
    prev=$1
}

# End of file, so print any previous line we have got saved
END {
    if (prev) {
        printf "%d\t%s\n", seen[prev], line[prev]
    }
}

If you can alter the data supplied to awk by adding a trailing blank line, you can dispense with the entire END {...} block, which simplifies the awk code and removes the duplication:

( cat data; echo ) | awk ...
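For comparison, the same run-counting idea can be sketched without the !seen[...]++ idiom. This is only a sketch under a few assumptions: the input is already grouped by first field (as in the question), the function name count_first_field is made up for illustration, and the tab-separated output follows the one-liner above. It compares each line's first field with the previous one, so it does not depend on prev being a non-empty, non-zero string. It also contains no ! character, which may help in shells that apply history expansion to it.

```shell
# Hypothetical helper (the name is an assumption): for each run of lines
# sharing the same first colon-separated field, print the run length and
# the first line of the run, separated by a tab.
count_first_field() {
    awk -F: '
        # The key changed: report the run that just ended, reset the count.
        NR > 1 && $1 != key { printf "%d\t%s\n", n, first; n = 0 }
        # Start of a new run: remember its key and its first line.
        n == 0 { key = $1; first = $0 }
        # Count every line in the current run.
        { n++ }
        # Report the final run, unless there was no input at all.
        END { if (NR > 0) printf "%d\t%s\n", n, first }
    ' "$@"
}

# Small demo on three lines in the question's format.
printf '%s\n' \
    'zikla13:Oct:20:22:34' \
    'zikla13:Oct:5:00:31' \
    'zikla14:Oct:17:22:01' | count_first_field
```

With the sample file this would be run as count_first_field data.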

Sorry, but I get `seen[: Event not found`. This is a really old BSD. I use bash2. — Da No, Oct 28 '15 at 08:29
@DaNo Did you use single quotes around the `awk` expression, as shown in the one-liner? — roaima, Oct 28 '15 at 09:17
