
I'm sure someone has had the need below: what is a quick way of splitting a huge .gz file by line? The underlying text file has 120 million rows. I don't have enough disk space to gunzip the entire file at once, so I was wondering if someone knows of a bash/perl script or tool that could split the file (either the .gz or the inner .txt) into three files of 40 million lines each, i.e. calling it like:

    bash splitter.sh hugefile.txt.gz 40000000 1
would get lines 1 to 40 million,
    bash splitter.sh hugefile.txt.gz 40000000 2
would get lines 40 million to 80 million, and
    bash splitter.sh hugefile.txt.gz 40000000 3
would get lines 80 million to 120 million.

Is doing a series of these perhaps a solution, or would gunzip -c require enough space for the entire file to be unzipped (i.e. the original problem)?

    gunzip -c hugefile.txt.gz | head -n 40000000

Note: I can't get extra disk.

Thanks!

toop
  • Do you want the resulting files to be gzipped again? – Tichodroma Jan 23 '12 at 11:25
  • You can use gunzip in a pipe. The rest can be done with head and tail. – Ingo Jan 23 '12 at 11:25
  • @Tichodroma - no, I don't need them gzipped again. But I could not store all the split text files at once. So I would like to get the first split, do stuff with it, then delete the first split, and then get the 2nd split, etc., finally removing the original gz. – toop Jan 23 '12 at 11:42
  • 1
  • @toop: Thanks for the clarification. Note that it's generally better to edit your question if you want to clarify it, rather than put it into a comment; that way everyone will see it. – sleske Jan 23 '12 at 12:06
  • The accepted answer is good if you only want a fraction of the chunks and do not know them in advance. If you want to generate all the chunks at once, the solutions based on split will be a lot faster: O(N) instead of O(N²). – b0fh Aug 13 '14 at 15:27

7 Answers


Pipe to split, using either gunzip -c or zcat to open the file:

gunzip -c bigfile.gz | split -l 40000000

Add output specifications to the split command.
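For example, a sketch assuming GNU coreutils split (the -d numeric-suffix option is a GNU extension; "-" tells split to read from stdin):

gunzip -c bigfile.gz | split -l 40000000 -d - bigfile_part_

This would write bigfile_part_00, bigfile_part_01 and bigfile_part_02. Note that it creates all the chunks on disk at once, so it still needs space for the full uncompressed file; if that is the constraint, see the --filter variant in siulkilulki's answer below.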

jim mcnamara

How to do this best depends on what you want:

  • Do you want to extract a single part of the large file?
  • Or do you want to create all the parts in one go?

If you want a single part of the file, your idea to use gunzip and head is right. You can use:

gunzip -c hugefile.txt.gz | head -n 40000000

That would output the first 40000000 lines to standard output - you probably want to append another pipe to actually do something with the data.
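For example (the output filename is only illustrative), the chunk can be recompressed on the fly so the uncompressed data never touches the disk:

gunzip -c hugefile.txt.gz | head -n 40000000 | gzip > part1.txt.gz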

To get the other parts, you'd use a combination of head and tail, like:

gunzip -c hugefile.txt.gz | head -n 80000000 | tail -n 40000000

to get the second block.
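More generally, block k of N lines follows the same pattern; as a sketch (not from the original answer), for the third block:

N=40000000; k=3
gunzip -c hugefile.txt.gz | head -n $((k * N)) | tail -n "$N"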

Is doing a series of these perhaps a solution, or would gunzip -c require enough space for the entire file to be unzipped?

No, gunzip -c does not require any disk space - it decompresses the data in memory as it goes and streams the result to stdout.


If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.

sleske
  • From a performance view: does gzip actually unzip the whole file? Or is it able to "magically" know that only 40 million lines are needed? – Alois Mahdal Mar 22 '12 at 12:57
  • 3
  • @AloisMahdal: Actually, that would be a good separate question :-). Short version: `gzip` does not know about the limit (which comes from a different process). If `head` is used, `head` will exit when it has received enough, and this will propagate to `gzip` (via SIGPIPE, see Wikipedia). For `tail` this is not possible, so yes, `gzip` will decompress everything. – sleske Mar 22 '12 at 15:26
  • But if you are interested, you should really ask this as a separate question. – sleske Mar 22 '12 at 15:35

As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.

zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
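If the goal is the process-one-chunk-then-delete workflow the OP described in the comments, a minimal bash sketch along the same lines (do_something is a hypothetical placeholder for the actual processing):

N=40000000
for k in 1 2 3; do
  zcat hugefile.txt.gz | tail -n +$(( (k - 1) * N + 1 )) | head -n "$N" > chunk$k.txt
  do_something chunk$k.txt   # hypothetical processing step
  rm chunk$k.txt
done

This decompresses the whole archive once per chunk - the O(N²) behaviour mentioned in the comments above - but only ever keeps one chunk on disk.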
zgpmax

Directly split .gz file to .gz files:

zcat bigfile.gz | split -l 400000 --filter='gzip > $FILE.gz'

I think this is what the OP wanted, because they don't have much space.
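A variation on the same idea (assuming GNU split's --filter, with do_something as a hypothetical command that reads a chunk from stdin) processes each chunk directly, so no intermediate split files need to be stored at all:

zcat hugefile.txt.gz | split -l 40000000 --filter='do_something > "$FILE.out"' - chunk_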

siulkilulki

I'd consider using split.

split a file into pieces

Tamara Wijsman

Here's a Python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the memory needed to hold the filenames and the current line, plus a little overhead.

#!/usr/bin/env python
import argparse
import gzip, bz2
import os
import fnmatch

# Recursively find files under 'top' whose names match the glob pattern 'filepat'.
def gen_find(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat):
            yield os.path.join(path,name)

# Open each file, transparently decompressing .gz and .bz2 files.
def gen_open(filenames):
    for name in filenames:
        if name.endswith(".gz"):
            yield gzip.open(name)
        elif name.endswith(".bz2"):
            yield bz2.BZ2File(name)
        else:
            yield open(name)

# Chain the open files together into a single stream of lines.
def gen_cat(sources):
    for s in sources:
        for item in s:
            yield item

def main(regex, searchDir):
    fileNames = gen_find(regex,searchDir)
    fileHandles = gen_open(fileNames)
    fileLines = gen_cat(fileHandles)
    for line in fileLines:
        print line

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Search globbed files line by line')
    parser.add_argument('--version', action='version', version='%(prog)s 1.0')
    parser.add_argument('regex', type=str, default='*', help='glob pattern to match file names')
    parser.add_argument('searchDir', type=str, default='.', help='directory to search')
    args = parser.parse_args()
    main(args.regex, args.searchDir)

The print line command will send every line to stdout, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file lying around.
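As a usage sketch (the file name catfiles.py is hypothetical), the script's output can be fed into the same head/tail or split pipelines shown in the other answers:

python catfiles.py '*.gz' /path/to/dir | head -n 40000000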

Spencer Rathbun

Here's a Perl program that reads stdin and splits the lines, piping each clump to a separate command that can use the shell variable $SPLIT to route it to a different destination. For your case, it would be invoked with:

zcat hugefile.txt.gz | perl xsplit.pl 40000000 'cat > tmp$SPLIT.txt; do_something tmp$SPLIT.txt; rm tmp$SPLIT.txt'

Sorry the command-line processing is a little kludgy but you get the idea.

#!/usr/bin/perl -w
#####
# xsplit.pl: like xargs but instead of clumping input into each command's args, clumps it into each command's input.
# Usage: perl xsplit.pl LINES 'COMMAND'
# where: 'COMMAND' can include shell variable expansions and can use $SPLIT, e.g.
#   'cat > tmp$SPLIT.txt'
# or:
#   'gzip > tmp$SPLIT.gz'
#####
use strict;

sub pipeHandler {
    my $sig = shift @_;
    print " Caught SIGPIPE: $sig\n";
    exit(1);
}
$SIG{PIPE} = \&pipeHandler;

my $LINES = shift;
die "LINES must be a positive number\n" if ($LINES <= 0);
my $COMMAND = shift || die "second argument should be COMMAND\n";

my $line_number = 0;

while (<STDIN>) {
    if ($line_number % $LINES == 0) {
        close OUTFILE if $line_number > 0;  # close the previous chunk's pipe, if any
        my $split = $ENV{SPLIT} = sprintf("%05d", $line_number/$LINES+1);
        print "$split\n";
        my $command = $COMMAND;
        open (OUTFILE, "| $command") or die "failed to write to command '$command'\n";
    }
    print OUTFILE $_;
    $line_number++;
}

exit 0;
Liudvikas Bukys