34

I have several .txt files with >30 million lines each and anywhere from 20 to 40 "columns" (some comma-separated, some space-separated, all ASCII, with lines separated by a newline). I don't need all (or even most) of the columns, and some of the empty fields are filled with NULL.

My goals are to:

  1. Remove the columns I don't need
  2. Re-order the columns as I see fit (for readability)
  3. Pipe output to another .txt file, with columns delimited by colons

I just did this with one large text file by splitting it into ~40 .txt files of 1,000,000 lines each, importing them one by one into Excel, and using CONCATENATE, but that approach has yielded no fruit with my next target. The file is comma-separated, but it still needs to be converted from .txt into .csv, and Excel chokes on it during import; even if I got it into Excel, the main file would break down into 200+ smaller files to stay under Excel's row limit, and doing the same thing more than 200 times is not efficient.

I'm working on a late-2020 MacBook Pro and am not versed enough in any coding languages to even know where to begin, but I'm semi-comfortable scripting in shell and always down to learn new tricks, just don't know where to start.

JW0914
  • 7,052
  • 7
  • 27
  • 48
hadrian4909
  • 457
  • 1
  • 4
  • 4
  • After you copy the data to Excel, better to use Text to Columns; while doing this you may skip the columns you don't need! – Rajesh Sinha Dec 20 '20 at 07:38
  • 13
    I'd be thinking this is a job for a database, not a spreadsheet - not that I'd know how to get one to the other in any simple way, but at least it wouldn't fall over on a 'mere' 70GB. You might need to polish up your php/sql chops, but it's not the hardest structure to learn. – Tetsujin Dec 20 '20 at 08:50
  • 7
    Use some of the standard text streaming applications, e.g. `awk` (you'll need to read up on the "language" used, but `awk` is also standard to use inside shell scripts, just like `grep`). As your text files have a different format each, you probably need a custom solution for each. As how to get Excel not to choke on large files, I have no idea, but I'd second that for large sets of data maybe Excel is the wrong tool. – dirkt Dec 20 '20 at 09:44
  • 4
    To convert a .txt file which contains comma separated columns and newline separated rows to .csv format... you rename the file to .csv. Done. – Nobody Dec 20 '20 at 15:42
  • 1
    Are you assigned some database dumps by your company because they didn't want to pay a DBA to do this? At this size, it really should be handled with better tools (import into a DB, parse, done). I wouldn't do this if your job is on the line. – Nelson Dec 20 '20 at 15:59
  • 9
    @Nelson as a DBA, it's pretty obvious that this is a job for Python, Perl or awk, not a database. – RonJohn Dec 20 '20 at 16:51
  • @RonJohn: Or FTM, with a small C/Flex program. WRT Excel or database, you DON'T want something that reads the whole file at once, you want to process it as a stream. Awk &c would do that. – jamesqf Dec 20 '20 at 17:22
  • Would you be able to process each line individually (or said in other words, could the lines come in any order)? If so this is a prime example for one of the small text processing languages from the Unix world. – Thorbjørn Ravn Andersen Dec 21 '20 at 16:19
  • if you need to work with this data a lot and do more than one operation, i recommend writing a script to put the data into a database (mysql, mariadb, postgres, mongo, etc) then do your processing. putting the data into a database will make processing easier because you will not need to store all the data in memory. then write another script to write out the result to file from the database when you are done. (HOWEVER. if you do have a machine with 80GB memory then you can be lazy and just do everything in memory.) – Trevor Boyd Smith Dec 21 '20 at 16:54
  • @RonJohn I'm curious, why? – Hashim Aziz Dec 21 '20 at 19:32
  • 1
    @Prometheus an RDBMS has a lot more overhead than a text-processing program (especially ones that have been optimized for processing text files). – RonJohn Dec 21 '20 at 19:56
  • @Prometheus: Also note the OP's comment about having to divide the large text files into ~40-200 files (manually, if I understood correctly), so that Excel could handle them. A suitable text processing program, whether it's a utility like Awk or something purpose-written using Flex or Perl, would process the file as a stream, never reading in more than a file block or so at a time. Which might make it faster, as all operations could be carried out in cache memory. – jamesqf Dec 22 '20 at 04:02
  • @jamesqf True for Excel vs UNIX tools, but for UNIX tools vs RDBMS, the latter seems like it would be more suited to querying and handling the dataset. I've actually considered this a lot myself lately as someone who also has large text file datasets and uses Unix tools to handle them. Unix text processing tools are fast, but I'm guessing still much slower than the same dataset would be in an RDMBS. – Hashim Aziz Dec 22 '20 at 04:26
  • @Prometheus: If I understand the OP, what he wants is to strip out a lot of extraneous data & garbage from the big files. Do that in a single quick pass, and it becomes much easier/faster to import the much smaller result into your database. – jamesqf Dec 22 '20 at 17:46
  • @RonJohn Not every database is an RDBMS. – Asteroids With Wings Dec 22 '20 at 22:51
  • @AsteroidsWithWings Ye Olde hierarchical DBMSs aren't readily available to PC users, and NoSQL "DBMSs" tuned for things like JSON, key-value and graph data aren't particularly suited to OP's task. Besides, I'm a DBA, and unstructured NoSQL is anathema. – RonJohn Dec 22 '20 at 23:19
  • @RonJohn You keep repeating "I'm a DBA", as if that makes you automatically right despite saying silly things. – Asteroids With Wings Dec 22 '20 at 23:27
  • @AsteroidsWithWings you need to reread what I wrote, because I most certainly didn't say that I'm automatically right. My 35 years of experience as a programmer and then DBA *does*, though, indicate that hierarchical DBMSs are damned rare on PCs, and that DBMSs have a lot of overhead. – RonJohn Dec 22 '20 at 23:31
  • https://www.visidata.org/ might also help here. – AKX Dec 23 '20 at 08:25
  • You must parse it as a STREAM. It doesn't matter what environment you do it in. You only need a database if you need to index and query it. There is absolutely no way to hold the result of this parse, or the original file in memory. – Rob Dec 28 '20 at 15:50
  • Once you are parsing it as a stream, you have options. You could feed the records into a database if you want to. But if you want summary statistics, then you could just as well bump counters as the data streams through. That would generally mean that you need filter criteria, and criteria for what you want to summarize. The size of the file won't matter all that much if this is what you are doing. It will just take a while to parse the stream. – Rob Dec 28 '20 at 15:53

10 Answers

62

Choose your tools

It seems Excel is not a suitable tool for what you want to do.

One approach would be to use a different tool to consolidate or summarise the data. awk, sed, grep or perl might be better suited to this initial processing, creating a smaller CSV file that can then be processed in Excel or other tools.

There are other tools that may be better suited to doing the whole job. Maybe something like R or a DBMS. It depends what you want to do with the data.

For simply taking a bunch of dissimilar text files and reordering and selecting columns I'd jump immediately to perl. Others would use awk.

Since tools like awk and perl can process files line by line and don't need to store everything in memory, they can process huge files that would choke other tools. They can also be surprisingly fast.


Just for fun, an example

With this data

Apples,27,500,10.2,fruit,100,200,300
Chairs  1   501 123.78  furniture 101   201 301
Europe, 655, 502,0.0001,continent,   102, 202,302 

we can produce this output

fruit:Apples:10.2
furniture:Chairs:123.78
continent:Europe:0.0001

using this command

perl -l -n -e "print join(':',(split(/[, \t] */))[4,0,3])" *.txt

explanation

element                what it does
-l                     add a newline after every print
-n                     process line by line but don't implicitly print
-e                     what follows is a program to execute
print                  print the result of the following expression
join(":", list)        create one string from a list, using ":" between each element
split(/expr/)          use the expression to divide the line into fields
[, \t]                 either a comma, a space or a tab, followed by
 * (space asterisk)    0, 1 or more spaces
(list)[4,0,3]          select the 4th, 0th and 3rd items from a list

That one line program is equivalent to the following, which may be easier to follow

#!perl
use warnings;
use strict;

while(<>) {  # read line by line all input or all listed files
    my @columns = split(/[, \t] */);   # split on whitespace or commas
    my @chosen = @columns[4,0,3];      # choose what to keep
    my $new_line = join(":", @chosen); # join items using colons between
    print "$new_line\n";               # print with line-separator
}

invoked as perl data.pl *.txt > newdata.txt

I like perl and am moderately familiar with a subset of it, although it is waning in popularity partly because it is easy to write perl programs that are very hard to read. However it was designed for exactly your use-case. Anyone familiar with awk, python, ruby or any of a large number of tools would solve this just as easily.
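
For comparison, a rough awk sketch of the same transformation (awk numbers fields from 1, so Perl's columns 4, 0 and 3 become $5, $1 and $4; like the Perl version, it assumes the fields themselves never contain commas or spaces):

awk -F '[[:space:],]+' -v OFS=':' '{ print $5, $1, $4 }' *.txt > newdata.txt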

RedGrittyBrick
  • 81,981
  • 20
  • 135
  • 205
  • 3
    I'd add another vote for Perl, with the caveat that regular expressions should be used carefully. I think the approach depends in part on whether this is a "one off" or whether the same-format files will be encountered again, but in any event I'd expect to take several iterations getting each type of file processed correctly and I'd COMMENT WHAT I WAS DOING PROFUSELY since while Perl etc. are stunningly powerful you might have problems understanding how you've tackled the job if you have to do some maintenance in a few months time. – Mark Morgan Lloyd Dec 20 '20 at 18:43
  • Success! Thank you so much for your help! I wound up using the longer-format example you gave, which helped me understand what I was doing better. It chewed through my dataset blazing fast. This will be a recurring problem, and often with varying filetypes. Will perl also be able to tackle .json files? .sql? – hadrian4909 Dec 20 '20 at 19:04
  • 2
    @hadrian4909 most file formats require correct parsing. CSV can be done with simple splitting by separators as long as the fields do not contain special characters. JSON requires a real json parser, you might want to learn `jq` to handle it from the command line. SQL files are meant to be parsed by a database importer, which is another beast yet. Dealing with different file types is best done if you have a good toolbox and know which tool to use. – Hans-Martin Mosner Dec 20 '20 at 20:20
  • 2
    @hadrian4909: Perl can properly parse JSON. See [JSON::Parse](https://metacpan.org/pod/distribution/JSON-Parse/lib/JSON/Parse.pod) but you'd need to learn a bit more about Perl data structures such as the hash of hashes (HoH) and so on. – RedGrittyBrick Dec 20 '20 at 22:38
  • 5
    The reason you found Perl fast processing CSV data is that it simply "reads a line", "processes that line", and "outputs the result". You stream lines in, and you get a stream of lines out. The entire file is processed in one pass and never loaded into memory (as it would be with Excel). Once you get into something like XML or JSON, the structure is more complex than a simple line of data. The process is necessarily more complex and less likely to be a *stream-in/stream-out* effort. – Flydog57 Dec 21 '20 at 05:00
  • @Flydog57: processing on the fly mixed with IO is certainly helpful, but is not sufficient for good performance. e.g. a bash `while IFS= read varname < giant_file.txt; do ... ; done` loop can be pretty slow, compared to a language like perl that has a faster interpreter. Especially if your bash code invokes `sed` or something separately for each line, but even using bash string expansion and matching stuff is not particularly fast. Pure bash is good for small files, where a bash loop can be done before fork/exec and startup overhead for perl or even awk would complete, but not for large. – Peter Cordes Dec 21 '20 at 18:06
33

OS-agnostic answer:

Learn just a tiny bit of Python and you will have a tool to do similar conversions in any way you wish.

Type this into a file, save it as e.g. cvt.py (originally based on code from here)

import sys

exportcolumns = [3, 4, 5]
with open(sys.argv[1], 'r') as fi:
    for line in fi:
        # strip the trailing newline before splitting, so the last field stays clean
        columns = line.rstrip('\n').split(',')
        print('\t'.join(columns[col] for col in exportcolumns))

After installing Python (version 3, nothing else!) you should be able to run the above by
python3 cvt.py filename > newfile
where filename is one of your datafiles, and newfile is where you want the result.

As written, the code looks for , as the column separator and outputs columns 3, 4, 5 (in that order) with a tab (\t) between them.


If you have more complex (inconsistent) column separation you may well do

import re

... as shown here: https://stackoverflow.com/a/4998688/3720510


Short explanation for the above

  • The first line makes the sys module available. This allows the use of sys.argv here, making the command-line arguments available to the script as a simple list.
  • The second line creates a list with the indices of the columns to extract from the input data.
  • The with line opens the file and makes it available during the following indented block; the file is closed once the block has been executed.
  • The for loop runs once for every line that can be read from the file.
  • The next line strips the trailing newline and splits the line at every , to create a list of its fields.
  • The print uses a generator expression (written much like a list comprehension) to pick the chosen columns from that list, join them with \t (a tab character) between them, and print them to sys.stdout (implicit with print()), which may be a file if you redirected using > on the command line.

Hannu
  • 8,740
  • 3
  • 21
  • 39
  • Updated the code (arg -> argv). Also: it assumes that ALL lines have the columns that are to be exported. – Hannu Dec 20 '20 at 11:40
  • Well, untested code, running tests: one more error spotted. ;-) fixed. – Hannu Dec 20 '20 at 11:49
  • It would probably be faster (and datasets of OP's size need speed) to execute one `print()` statement instead of dozens. – RonJohn Dec 20 '20 at 16:49
  • The difference in speed will vary, yes... the intention was to write code as simple as possible. Now I might have gone a bit past that. Updated the script above for a single print. Can't tell if the change will help much. – Hannu Dec 20 '20 at 17:35
  • @RonJohn The time to execute the `print()` statement will be completely overshadowed by the time needed to read/write the data. So in this case the criterion should be which form is more readable. – AndyB Dec 20 '20 at 18:49
  • 2
    You don't need `iter(fi.readline, '')`. It's much simpler (and faster) to just use `for line in fi:`. – user2357112 Dec 20 '20 at 18:53
  • nitpick: `print( '\t'.join(columns[col] for col in exportcolumns))` to avoid the dangling \t and string-addition (with that many lines every ms counts ;) . Outputting (appending) directly to a new file will be much faster as well as it does not rely on printing to stdout and capturing that printed output to a file :o) – Patrick Artner Dec 21 '20 at 07:20
  • Included suggested edits. Simplifies the reading. But might add some complexity / pythonism, - may be be bit harder to digest for a beginner. – Hannu Dec 21 '20 at 19:26
  • Hmm... considering opening a file; not hard to do, but I really like how simple this is currently. Maybe leave it as something for the reader to ponder on and explore? ;-) – Hannu Dec 21 '20 at 19:35
  • This does not work when a field contains an escaped comma. (How do you escape a comma, you ask? Well, let's just say CSV files are not standardized and injection vulnerabilities are here to stay) – Navin Dec 23 '20 at 11:53
  • The script is INTENTIONALLY written in a somewhat simple manner, to some extent reflecting something you write for single-time use. So it doesn't "support" anything beyond simple SINGLE-CHARACTER DELIMITERS. You need e.g. regular expressions for that - read above. – Hannu Dec 23 '20 at 15:36
15

Disclaimer: I have not actually tried this with a 70 GB file, but I have done a couple of GB and above 4 million rows.

The intended workflow for huge files is not to load the whole file into a sheet, but to connect to the file.

Open the Data tab, select "From Text/CSV", and select your file. When the preview dialog appears, click the caret beside the "Load" button and choose "Only Create Connection". That's it. Here's a more elaborate tutorial: https://excel.officetuts.net/en/examples/open-large-csv

There might be some quirks and further challenges to solve with the column transformations, but it's worth giving it a try if you feel much more at home in Excel than with command-line tools.

Another option — if you have access to Access, you can import and handle the data there as well. That software is THE database system for power users.

That being said, I would choose awk for the particular task. But you should be at least somewhat comfortable with shell then.

Džuris
  • 533
  • 3
  • 11
  • The "Get & Transform" inside Excel will be able to handle huge files, but it is the wrong tool for this job because: A) It's Windows only (at least it was in 2017 when I left the dev team), and B) you can only output into the sheet / Data Model and exporting to Excel is the only way to create TXT files: that won't work to produce millions of lines. – Carl Walsh Dec 22 '20 at 18:33
9

If your data format is well-known (CSV or other character delimited file, JSON, etc.) you can usually find a general-purpose command-line tool to help query it.

  • xsv is a popular one for comma-separated/space-separated data
  • jq is popular for JSON data (download available here)

xsv can chew through hundreds of MB per second depending on your hardware and the type of query.
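
As a sketch of how that can look for this question (the exact flags may vary between xsv versions, so check xsv select --help and xsv fmt --help): select picks and re-orders columns by 1-based index, and fmt rewrites the delimiter.

# keep columns 5, 1 and 4 of a comma-separated file, in that order,
# then re-emit them with ':' as the field delimiter
xsv select -n 5,1,4 input.txt | xsv fmt -t ':' > output.txt

The -n flag tells xsv not to treat the first row as a header; drop it if your files do have one.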

4

Lots of good advice from elsewhere about the mechanics of data extraction; however, you are going to need some dirty coding skills to do anything useful with it.

Large data sets often contain corrupt lines, loopy data, strange characters, ohs instead of zeroes and every manner of formatting glitch. You need to validate and filter what you've got. (An example: split a file into two, then join the halves. There may well be the most subtle of flaws at the join: possibly all normal lines end with CRLF, but at the join the line ends with just CR. This can go unnoticed or even cause the read-in to assume end of file!) As a minimum, I would make sure that you're outputting exactly the same number of lines as you read.

Still on line-by-line processing, it's very simple, and worthwhile, to add very basic sanity checking to the data. Even if a field isn't being output, if it's easy to check then do it, because it could indicate some more subtle trouble. Be aware that actual data may not conform to the official specs. Why does a price of -1 sometimes appear? An especially useful field to check is one that should always have something in it, or simply the last field in each row.

Log processing somewhere. That way you can set the process running and go to lunch, and you have a record of which version of your program was used to create which outputs. Of course you're looking for '...lines rejected: 0' all the time.

Bad source lines should be output to a failure file (but quit after 15 lines). You can visually examine a small amount of data to see what sort of weirdness you've got.
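
To make that concrete, here is a minimal awk sketch of the idea, assuming comma-separated input with 8 expected fields (both of which you would adjust to your own format): every line is counted, lines with the wrong field count go to a failure file (capped at 15, as above), and a one-line summary goes to a log file so that 'rejected: 0' can be checked after each run.

awk -F',' '
    { seen++ }                                  # count every line read
    NF != 8 { rejected++; if (rejected <= 15) print > "rejects.txt"; next }
    { print $5 ":" $1 ":" $4; kept++ }          # re-order and re-delimit the good lines
    END { printf "seen=%d kept=%d rejected=%d\n", seen, kept, rejected > "run.log" }
' source.txt > dest.txt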

It may well be that, inside the loop that processes each line, you have to apply filters. This may not happen on the first pass, but as downstream analysis progresses you may be asked to produce a more select data-set, e.g. excluding lines for products with 'test' in the name or a product code starting with 9.

An often missed validation issue is missing or duplicated data. For example somehow Friday's raw data has been added to the end of Thursday's and Friday's is from the week before. How will anybody know? The network failed from 3pm to 5pm so nothing was recorded. Monday was a bank holiday where there shouldn't be any transactions but somebody has supplied data from the previous Monday. You are in a good position to do some simple sums, for example daily turnover or maximum period of no activity etc. These are bulk sanity checks used to give a human pause for thought and prompt for checking before poisoned data is passed further down the chain. It's probably not your job to decide what to do with a loopy batch, but you can highlight it and probably tweak your code to give a better data-set.
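
A per-day line count is about the simplest such check. A hedged sketch, again assuming a comma-separated file whose first column is a date: a day that suddenly has twice its usual count, or is missing entirely, stands out at a glance.

awk -F',' '{ count[$1]++ } END { for (day in count) print day, count[day] }' source.txt | sort > daily_counts.txt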

All the above is 'easy', one step at a time programming. You'll learn about automation, tidy workflows, loopy formatting and basic data anomalies. You'll also be a bit of an expert on spotting unusual data and what the fields are supposed to mean. That will be useful for...

Doing something useful with the data. You should be involved with the downstream analysis. This is not to suggest you should build analysis into your translation program, but you've got a framework ready to do it. Totals, averages, max and min, hourly, daily, weekly are all possible easy (NB automated) outputs. You might think a database is a better tool, but for fiddly things simple coding may be better. Let me give an example: smoothing a set of data points. An easy moving average is nextPoint = (lastPoint * 0.8) + (rawValue * 0.2) [adjust 0.8 and 0.2 to suit]. That's fine for continuous data, but what about the start of business each day? That's a special case where nextPoint = rawValue. Something to code, perhaps.
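
As a sketch of how little code that smoothing rule needs, in awk, assuming a colon-delimited file where column 1 is a date and column 3 is the value being smoothed (both made up for the example); the reset line makes the blend come out as exactly the raw value at the start of each day, which is the special case above.

awk -F':' -v OFS=':' '
    $1 != prevday { ema = $3; prevday = $1 }       # new day: restart from the raw value
    { ema = 0.8 * ema + 0.2 * $3; print $0, ema }  # blend and append the smoothed value
' newdata.txt > smoothed.txt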

Spurious data values are a good example of the cross-over between raw data crunching and analysis. When somebody punched in £175 when they meant £1.75, do we really want to include that in our analysis? It's a bit of an art, or fudge, but the raw data processor can easily calculate a mean and standard deviation for a couple of thousand data points, or an actual distribution for all rows of data. You /might/ want to throw out, mark, highlight or otherwise draw attention to unexpected values at the data-crunching stage, or use them to inform the analysis stage. Perhaps add another column with a blank for OK and 'H' for higher than expected, and so on.

You will become a skilled craftsman, able to turn a huge tree into useful planks from start to finish. You'll learn who wants what sort of planks for what purpose and be able to saw up the raw wood in the right way to avoid splits and shakes. Moreover if you spot a diseased tree you can raise the alarm.

Peter Fox
  • 251
  • 1
  • 1
2

I don't use Macs, but assuming the standard UNIX tools are available, then

  1. Open a terminal
  2. type in cd and drag in the folder containing the files
  3. type in awk -v FS=, -v OFS=, '{ print $3, $2, $5 }' < source.txt > dest.txt

Note the space after cd and the use of single quotes with the awk command. For both commands, the case (upper/lower) is important.

This will read in all of the CSV file source.txt, and print the third, second and fifth columns as a CSV file, dest.txt

AWK is primarily designed to operate on text files that contain data in columns. The -v options set the input and output column separators to a comma.
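
The question asks for colon-delimited output, so the same command with a different output separator should do it; the column numbers here are only the example's and would need adjusting to your data:

awk -v FS=, -v OFS=: '{ print $5, $1, $4 }' < source.txt > dest.txt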

Should you get stuck, we have a Mac-specific sister site, Ask Different, which may be better suited to your question.

CSM
  • 1,164
  • 8
  • 8
1

As always, you have two ways to choose between.

A. Make a fairly complex program (multiple choices of separators, multiple choices of data handling etc.)

B. Use a "Unix toolbox" approach. Learn some of the tools you have available in every Unix or Unix-like system (macOS was once a certified Unix system, so you should already have everything at your disposal). There is some effort needed, but it's worth it.

If you would like to follow way A, there are plenty of languages to choose from. For this purpose perl would be great, with python being the trendy option now, but there are many more...

If you would like to follow way B, I would suggest a calm start and some patience. I love a book called Classic Shell Scripting; I even use it in courses I teach...

I guess you need one step backwards: before your step 1 you need a preliminary step:

  1. Avoid unnecessary mess. You can convert files to a common delimiter, if it's possible. Yes, it can be the first command in a pipe.

Then you can use more commands in the pipe (cut might be a good option, or awk, which could even select the columns, change their order and set the proper delimiter in one step).
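
For instance, a sketch of such a pipe for the space-separated files, assuming the fields themselves never contain spaces or commas: tr squeezes runs of spaces and tabs into single commas so every file looks the same, and awk then selects, re-orders and re-delimits in one step (plain cut could do the selection, but it cannot re-order fields).

tr -s ' \t' ',' < input.txt | awk -F',' -v OFS=':' '{ print $5, $1, $4 }' > output.txt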

d.c.
  • 176
  • 1
  • 1
  • 7
0

If you want to use the built-in commands in Windows, rather than downloading something else like awk, Python or Perl (which are much better), then you could always just use the Windows Command Prompt:

https://ss64.com/nt/for_f.html

Again, I'm not suggesting this as the "best" tool but if you have to distribute your solution or don't have the ability to download third party software, it may be your best bet.

throx
  • 101
  • 1
  • 4
    Even on Windows, I would suggest PowerShell instead of the Command Prompt these days. Although they mentioned in the question that they're using a MacBook Pro, PowerShell is also available for macOS (and Linux too). – Booga Roo Dec 21 '20 at 05:09
  • Yep, missed the MacBook bit. My bad. – throx Dec 21 '20 at 05:40
0

In case you don't mind throwing a few bucks at the solution, EmEditor can handle extremely large text files in a Notepad++-like interface. I used to use it at an eDiscovery firm that had to work with very large datasets, and it was user-friendly enough for the paralegals to use.

0

All of this can be done easily in GS-Base. GS-Base is a database that handles up to 256 million rows and should be very efficient with very large CSV files (e.g. 10 GB in about 2 minutes). Data can be displayed in tables or forms. You can re-order columns by simple drag-and-drop, and for further analysis, field calculations, pivot tables, filtering (and much more) GS-Base can use up to 100 processor cores. The installation requires around 7 MB and it can be installed on a portable USB storage device. Not free, but the price is rather hard to beat… ( https://citadel5.com/gs-base.htm )
