61

I have around 100000 rows of data. If I store this data in a text file format, it takes more space than if I store it in an Excel file format. Why is that?

stkent
  • 105
  • 3
user734178
  • 499
  • 1
  • 4
  • 4

3 Answers3

118

The xlsx format used by modern Excel is actually a compressed format. It's a ZIP archive that contains text (XML) files in a certain structure.

If you compress your plain text file with a similar ZIP compression tool, you should achieve similar file sizes.

Additionally, as mentioned by Bradley Uffner and Morgen in the comments, Excel will deduplicate identical strings and only store one copy of them. I'm not sure about the exact gains of such a method, and it will depend on your data set, but simple zip compression will probably get you most of the way there.1


9.1.3 Physical Packages

Each Office Open XML document is implemented as a ZIP archive.

ECMA-376-1:2016


1 My guess is that this deduplication is most effective when you have multiple worksheets, since zip compression applies independently to each file in an archive and only over limited sections of the data at a time - by storing all strings together in a single file, there should be some benefit to the later compression. More practically, if your plain text format is in a single file anyway then there'll probably be little difference.

Bob
  • 60,938
  • 25
  • 191
  • 216
  • 2
    This is great! It gives very good insight on how Excel files are handled. Thanks! – Dominique Jun 01 '17 at 09:34
  • 3
    @Dominique If you would like to know more, you can play around with one by unzipping it (e.g. with 7zip, or by renaming it so it ends in `.zip`). The spec is also available, but it makes for rather dry reading. – Bob Jun 01 '17 at 09:40
  • Thanks for the info. This is exactly what I meant with my comment: the XML format shows the way that the files are treated internally by the application (which shows, e.g. that string values get a value 0, outside of the normal scope, ...). – Dominique Jun 01 '17 at 09:49
  • 3
    I think excel also uses a string dictionary, where bits of text that are identical between cells can be reused while only being stored once. – Bradley Uffner Jun 01 '17 at 13:47
  • BTW, this is also the case for OpenOffice/LibreOffice formats. – leonbloy Jun 01 '17 at 13:51
  • @BradleyUffner I don't have time to search right now - but such a dictionary would not be much different from what the normal compression does anyway, so I doubt it. – Bob Jun 01 '17 at 13:57
  • 1
    Based on the API exposed by the Apache ooxml library, a shared strings dictionary is at least probable. – Morgen Jun 01 '17 at 15:28
  • @BradleyUffner, @Morgen — You're right, it does create a `sharedStrings.xml`. And it's described briefly in §12.3.15 (and probably more elsewhere). Interesting. I suppose, as far as compression goes - this is probably more efficient if there's strings shared across multiple worksheets (since ZIP compression isn't [solid](https://en.wikipedia.org/wiki/Solid_compression)). – Bob Jun 01 '17 at 16:13
  • 2
    Shared Strings were a feature of Excel even back when it wrote XLS (no x) files, which were a record-oriented format called BIFF within an OLE container. If you compare old XLS specs with XLSX, you'll see XLSX is just a asciified xmlified zipped version of BIFF. So bascially MS didn't invent the shared string feature for XLSX, where it doesn't make much sense due to compression; they just went the easiest route to turn XLS into XLSX. – Guntram Blohm Jun 01 '17 at 19:39
  • "...you should achieve similar file sizes." I'd imagine it would be *smaller*, since you don't have all the XML and formatting overhead (assuming a simple csv or fixed width file). Plus you can use better compression algorithms like 7z. – jpmc26 Jun 02 '17 at 22:26
3

The answer given is correct, it is due to Excel storing your data as xml. It is also due to this, that sorting your data efficiently will also reduce the file size. Test it yourself - say you have data like

A            B                              C
John         Smith-Johnson-Williamson       12345
Sally        Smith-Johnson-Williamson       67890
John         Williams                       34567

If you sort by C (a column with all or almost all unique values) only, then the identical values of B will not be adjacent. In Excel's xml it looks like this:

<12345><John><Smith-Johnson-Williamson>
<34567><John><Williams>
<67890><Sally><Smith-Johnson-Williamson>

If you sort by B (a column with common values), then the identical values are adjacent. In Excel's xml it looks like this:

<Smith-Johnson-Williamson><John><12345>
  <Sally><67890>
<Williams><John><34567>

Because that long string is identical and adjacent, Excel knows it can pack them together, similar to when people write lists, and to repeat part of the above line, they type quotes rather than re-write the same thing. I did not find any evidence of a shared-string dictionary in my investigation - just this indentation in place of the repeated field value.

I had mailing lists of 250,000 customers across only 11 states, and on each record there was a field that was one of two strings identifying the offer they get. Our employees were for some reason accustomed to looking people up by their street address as spoken, so it was sorted on the street number column, then the street name, city, etc... when I resorted by the offer first, then state, zip code, city, street name, street number, and finally address-line-2, the file size was reduced incredibly. I examined the unpacked xml on the file sorted each way to see what was going on, and the above is what I deduced. If there are multiple fields with more than say 5 characters, but the values are of a limited set (say, ticket dispositions like 'resolved','rejected','approved',etc), then give some thought to sorting and see if it helps.

-4

If you have a number like 3.14159265359, you need 13 bytes to store this in a text file; if you store this number as a float, you need only 4 bytes.

TRiG
  • 1,310
  • 1
  • 15
  • 33
Mirko Ebert
  • 223
  • 2
  • 3
  • 19
    While this might've been the case with `xls` (BIFF), I don't think `xlsx` does this. `xlsx` stores all data in XML files, which do not perform binary encoding of numerals - they're converted to and stored as text strings. I'm still trying to hunt down the relevant section of the specification, but my empirical testing (i.e. extracting a saved document) shows that `1.123` is literally stored as a 5 character string in XML. – Bob Jun 01 '17 at 07:36
  • 14
    *ECMA-376-1:2016 § 18.3.1.4 `c` (Cell)* shows that cells store their data in `v` tags, as defined in *ECMA-376-1:2016 § 18.3.1.96 `v` (Cell Value)* where they say *The possible values for this element are defined by the ST_Xstring simple type (§22.9.2.19).* — they provide examples such as `28086.3541666667` where the data is clearly stored as a string. – Bob Jun 01 '17 at 07:40
  • Fortunately spreadsheets do not store numbers as floats, as this would lead to a loss of precision which is bad for anyone using spreadsheets to store numbers relating to money. Scaled integers / fixed point would work, though. – gerrit Jun 01 '17 at 14:53
  • Yes, float is not precise enough for money related formats or programming. – Mirko Ebert Jun 01 '17 at 15:22
  • 4
    @gerrit *Unfortunately* Excel *does* interpret numbers as floats within the program, leading to interesting problems like phone numbers dropping the last few digits. It just doesn't save them to disk as floats (anymore?). – Bob Jun 02 '17 at 00:46
  • 2
    @Bob this is also the case for `xlsb`, which is highly recommended for huge sheets – phuclv Jun 02 '17 at 12:16
  • 2
    I don't know why this answer gets so many downvotes. The OP didn't specify which Excel format so technically this is also a valid answer – phuclv Jun 02 '17 at 13:48
  • @LưuVĩnhPhúc Thanks, that's a very good point re: xlsb. And for the record, while I did point out a potential issue, I've had an upvote on this answer from the beginning -_- – Bob Jun 02 '17 at 14:56
  • 2
    @LưuVĩnhPhúc Not a downvoter, but: It also assumes that the rows are mostly numbers. And does a pretty simplistic explanation at it, to say the least. – This company is turning evil. Jun 02 '17 at 15:10
  • 2
    anyway [Excel always uses double precision](https://en.wikipedia.org/wiki/Numeric_precision_in_Microsoft_Excel), not float. https://msdn.microsoft.com/en-us/library/bb687844.aspx – phuclv Jun 03 '17 at 14:08
  • Yeah, maybe not a +2 answer, but certainly not a -4. -1 or 0, maybe. – can-ned_food Jun 04 '17 at 11:13
  • -1 Ignores the more significant point of zip compression. – le3th4x0rbot Jun 04 '17 at 20:28
  • Per [the open group's documentation of the BIFF format (PDF file)](https://www.openoffice.org/sc/excelfileformat.pdf) numeric data may be stored as double precision numbers, but the records would also carry 4 bytes of identifying overhead. This may or may not represent the practice of real files from before the open standard was implemented but represents 12 bytes in any case. – dmckee --- ex-moderator kitten Jun 04 '17 at 20:31