I have a very large .zip file that is 174 GB compressed and 800 GB decompressed. I am trying to work with this using the R programming language; however, it is far too large for me to handle.

I have found that I can split the .zip file using the Terminal on my Mac: How do I split a .zip file into multiple segments?

So I have dropped my .zip file into my Documents folder and run this command in the Terminal:

zip species.zip --out new.zip -s 3000m

This should split it into segments of about 3 GB each, which are easier to work with. However, the files produced are not in .zip format; they are just generic documents:

Image of the file

Then, when I use the following R code to extract it:

> zi.fl <- zip_to_disk.frame2(zipfile = "new.zip", outdir = data_dir)  %>%
+   rbindlist.disk.frame() %>% filter(year > 2019)

I get the following error:

Error: archive.cpp:24 archive_read_open_filename(): Unrecognized archive format

How can I get it to split into usable .zip files?

Stackbeans
  • I haven't the faintest idea how to do this in R, but they are usable [at least according to the file names, new.zip, new.z02, etc.]; you might just need something smarter to read them. Try [Keka](https://www.keka.io/en/) (donationware, free direct download or paid from App Store) – Tetsujin Jul 01 '21 at 15:39
  • I'd be wondering whether the R zip library supports multipart archives. If `zip -T new.zip` successfully tests the *entire* zip, then you need to find a better or updated library in R (see the sketch after these comments). – Mokubai Jul 01 '21 at 15:42
  • The other problem, though, is that whether this is a good method to use in the first place depends on the data *inside* the zip file. If you are loading a single monolithic 800 GB file that has been compressed, then the entire set of compressed zip files will be needed to decompress the original file, resulting in no net gain. The same problem occurs if you have a "solid" zip archive, where all the files are treated as a single data stream: everything that went before is needed to decompress a particular file. – Mokubai Jul 01 '21 at 15:47
  • The only case where this kind of splitting is useful is for small files which need to be handled separately, in an old-style "compress a file, add that to the archive, compress another file separately, add that to the archive" non-solid zip archive; in that case, individually compressing the contents of the original archive into their own "per file" archives is probably more useful. – Mokubai Jul 01 '21 at 15:50
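
Following up on the `zip -T` suggestion above, a quick sanity check from the terminal might look like this (a sketch only; it assumes you run it in the folder that holds all of the segments, and that they are named new.z01, new.z02, ..., new.zip as in the question):

cd ~/Documents

# Ask zip to test the whole split archive, as suggested above.
# If it reports the archive is OK, the segments themselves are intact
# and the problem lies with the R library reading them.
zip -T new.zip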

1 Answer


When you split a file this way (the macOS `split` command behaves the same way), each segment is simply a binary chunk of the original, with no file type of its own. The Zip file header, for example, ends up only in the first segment, and even there it is misleading, since it describes the contents of the entire file.

To reconstitute the original file, use `cat` to concatenate the binary segments back into a single Zip file. From your question, however, it appears the result would still be too large for you to work with.
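
That re-join step might look roughly like the sketch below (the segment names are assumptions based on the question and comments, and the exact list depends on how many pieces were created):

cd ~/Documents

# Join the pieces in order, ending with the .zip segment
cat new.z01 new.z02 new.z03 new.zip > combined.zip

# If the pieces were made with zip -s (rather than split), zip itself can
# recombine the split archive into a single file:
zip -s 0 new.zip --out combined.zip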

If you need to work with smaller pieces that are truly Zip archives, then you'd need to split the original 800 GB of data into separate pieces and then zip each segment. Each piece would be a true Zip file, and each could be extracted and then concatenated to yield the original data file.
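
In practice that could look something like the sketch below. It assumes the archive holds a single large CSV named species.csv (substitute the real name, which unzip -l species.zip will show) and that there is enough free space for the uncompressed chunks, for example on an external drive:

cd /Volumes/External          # hypothetical destination with room for the chunks

# Stream the CSV straight out of the archive and cut it into pieces of
# 10 million lines each (adjust the count to taste). Streaming avoids
# unpacking the whole 800 GB file in one step, though the chunks still
# add up to roughly that size until they are re-compressed below.
unzip -p ~/Documents/species.zip species.csv | split -l 10000000 - species_part_

# Re-compress each chunk into its own self-contained Zip file, then delete
# the raw chunk to reclaim space. Note that only the first chunk keeps the
# CSV header row.
for f in species_part_*; do
    zip "${f}.zip" "$f" && rm "$f"
done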

DrMoishe Pippik
  • Thank you for the explanations. Would you perhaps know of any software or commands that can split the .csv file within the .zip file? I cannot actually decompress the file, because my disk space is 512 GB (max), which is insufficient for a file that size. – Stackbeans Jul 01 '21 at 16:27
  • With only 512 GB, I doubt there's *any* way to manipulate that large file. A two TB HDD costs ~US$50. That said, anything swapping RAM to HDD will be *very* slow. – DrMoishe Pippik Jul 01 '21 at 20:53
  • You are right, I made sure to purchase one many hours ago! Still, a method of splitting the contents within the .zip (if it is a .csv file) would be a pretty neat trick if one were implemented. – Stackbeans Jul 01 '21 at 21:34