
I have a very large text file (> 50 GB), but most lines are duplicates, so I want to remove them. Is there any way to remove duplicate lines from a file, and handle files larger than 2 GB? Every method I have found so far only works on small files.

Oliver Salzburg
Maestro
  • Better to write a Python script, which can do it. What OS? Python can do this on any of them. – RProgram Nov 25 '13 at 13:09
  • Please *always* include your OS. Solutions very often depend on the Operating System being used. Are you using Windows, Linux, Unix, OSX, BSD? Which version? – terdon Nov 25 '13 at 14:02
  • Did you try sort -u on the huge file? It may work, you know... otherwise, you can also patch it instead of starting a C program from scratch. – user2987828 Nov 25 '13 at 14:43
  • Are the duplicate lines consecutive? If so, `uniq` is your friend, because it doesn't (need to) sort. If the duplicates are *mostly* consecutive, you can still use `uniq` to preprocess the file for sorting. – David Foerster Nov 25 '13 at 14:57
  • @terdon I was looking for a Windows solution, I should have mentioned that. – Maestro Nov 25 '13 at 15:14
  • @techie007 Not a dupe, I don't want to manually edit them, I need an automated process. – Maestro Dec 01 '13 at 12:55
  • `The best tool for this job is the tool you write yourself.` I cannot disagree with this. I actually did exactly that. Years ago, I wrote a Pascal program to do all kinds of advanced text processing, including removing duplicate lines. I am still surprised by its speed and still use it to strip duplicate lines in my large text files. I also wrote a PHP script to do the same thing when I was first learning PHP, and it too works surprisingly fast. – Synetech Dec 18 '13 at 04:23
  • `Did you try sort -u on the huge file?` **WHAT‽** I think it is safe to assume that a *50GB+ text file* will probably have upwards of a *billion lines*. I *highly* doubt that there is any sorting algorithm that can sort that many lines in any reasonably short amount of time, especially since there is no way that the whole file can be stored in memory and would have to constantly re-read random line numbers over and over, which would also destroy any performance benefit from caching. Sorting might be practical *after* removing the (hopefully 99%) currently-consecutive duplicate lines. – Synetech Dec 18 '13 at 04:28

2 Answers


Assuming all lines are shorter than 7 kB, and that you have bash, dd, tail, head, sed and sort installed from Cygwin/Unix:

{
  # Number of 1024000-byte chunks in the file; "last" is the index of the chunk
  # that contains the end of the file.
  size=$(LANG= wc -c < large_text_file)
  last=$(( (size - 1) / 1024000 ))
  i=0
  while [ "$i" -le "$last" ]
  do
    # Read chunk i: 1000 KiB plus 21 KiB of overlap into the next chunk, drop the
    # (possibly cut) first and last lines, then deduplicate what remains.
    LANG= dd 2>/dev/null bs=1024 skip=${i}000 if=large_text_file count=1021 \
    | LANG= sed -e '1d' -e '$d' | LANG= sort -u
    i=$((1+$i))
  done
  # Recover the last line of the file (dropped by '$d' in the last chunk) ...
  LANG= dd 2>/dev/null bs=1024 skip=${last}000 if=large_text_file count=1021 \
  | LANG= tail -n 1
  # ... and the first line of the file (dropped by '1d' in the first chunk).
  LANG= head -n 1 large_text_file
} | LANG= sort -u > your_result

This divides the file into chunks of 1024000 bytes, and also adds 3*7*1024 bytes (the "21" in 1021) from the next chunk. Because the divisions may cut a line in two, the first (1d) and last ($d) lines of each chunk are discarded (sed).
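
As a quick sanity check of those numbers, the arithmetic can be reproduced directly in the shell (nothing here touches the file):

echo $(( 1021 * 1024 ))            # 1045504 bytes read per dd call
echo $(( 1021 * 1024 - 1024000 ))  # 21504 bytes of overlap into the next chunk
echo $(( 3 * 7 * 1024 ))           # 21504 = three times the 7 kB maximum line length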

To compensate, the region containing the last chunk is extracted again and only its final line is kept (tail -n 1), and the first line of the file is also extracted again (head -n 1).

The loop ends once the chunk containing the end of the file has been processed.

sort -u may be viewed as a compressor, but it only sorts its input and then skips duplicates. The first sort -u compresses each chunk. The second sort -u compresses the concatenation of all these chunks (that second sort had been missing from the code above since the third edit, sorry).

You said text file, but I assume binary content anyway, hence the LANG= (which also makes everything faster).
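
A hypothetical way to run it on a Windows machine: save the block above as, say, dedup.sh (a placeholder name) next to large_text_file and run it from Cygwin's bash:

bash dedup.sh                        # writes the deduplicated lines to your_result
wc -l large_text_file your_result    # rough before/after line counts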

user2987828
  • Is this supposed to run on a shell? Which shell? `for i=\`seq 50000\`` won't work on any *nix shell I know, do you mean `for i in $(seq 50000)`? Could you also add some explanation of what you're doing? You're using a couple of nifty tricks here but don't tell the OP what they are or how they work. – terdon Nov 25 '13 at 14:19
  • Just made this on GNU bash, version 4.2.25(1)-release (x86_64-pc-linux-gnu): for i in `/usr/bin/seq 4`; do echo $i ; done – user2987828 Nov 25 '13 at 14:23
  • Yes, that will work, but is not what you posted. `for i=\`seq 4\`` is not equivalent to `for i in \`seq 4\``. I've edited your answer now that I know it's not some weird windows shell feature. This would really be a great answer if you would add an explanation of what it does. Your trick of reading the file in blocks to get rid of some dupes before sorting to get rid of the rest is a great idea but very hard to understand if you're not conversant with the tools you use. – terdon Nov 25 '13 at 14:29
  • This will only remove dupes that end up in the same chunk. – Loren Pechtel Nov 30 '13 at 18:41
  • I have just put back the second `sort`, which is and was documented at the end of my post and which removes dupes from different chunks. Its absence was an error in my previous edit, sorry: without it, only dupes that ended up in the same chunk were removed, as pointed out by Loren Pechtel. – user2987828 Nov 30 '13 at 20:25

Fire up a Linux instance in AWS/GCE and use 'uniq'. OS X has it as well...

Docs here: http://www.thegeekstuff.com/2013/05/uniq-command-examples/
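
Note that uniq only removes adjacent duplicate lines, so the file must be sorted first (or use sort -u, which does both in one step). GNU sort performs an external merge sort, spilling temporary files to disk, so it copes with inputs far larger than RAM. A minimal sketch, assuming GNU coreutils; the file names and temp directory are placeholders:

export LC_ALL=C                                   # plain byte comparison, much faster on huge files
sort -u -S 4G -T /mnt/scratch/tmp -o deduped.txt huge.txt
# two-step equivalent: sort -S 4G -T /mnt/scratch/tmp huge.txt | uniq > deduped.txt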

Clustermagnet
  • /bin/uniq only suppresses duplicate lines that are already adjacent. You should prefer sort -u if it fits in memory. – user2987828 Nov 30 '13 at 20:33