Unzipping Large Files in Foundry (100GB+)

Hi,

I have a large zip file (100GB). To upload it to Foundry, I first split it into 100 pieces of 1GB each.

I now want to unzip these files in Foundry. As I understand it, the required steps are:

  1. Merge the split zip files (since there is no native way to read the split pieces individually)
  2. Unzip them

However, due to the size of the data, I am not sure how to approach the problem.

  • For the first step, I would need to collect all the split zip files in driver memory to perform the merge, and from what I have explored, 100GB of driver memory is not an option.
  • For the second step, I would face the same driver-memory issue (before distributing the data to executors).

Please let me know if you have any recommendations for unzipping these files.

Kind regards,
Baris

A couple of follow-up points after iterating on this:

  • I later realised that driver memory != driver local storage, so I would not need 100GB of driver memory; the intermediate results can be kept on local storage.
  • However, the operations will still be network-heavy, and there would be bottlenecks around the following (a rough sketch follows this list):
  1. Downloading 100GB of files from the Foundry FileSystem to the Spark driver
  2. After the zip merge, uploading it back to Foundry and performing the write operation
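
For concreteness, here is the kind of transform sketch I have in mind — assuming the pieces were produced with the Unix split command (so plain byte concatenation restores the original archive). The dataset paths and chunk size are illustrative placeholders; everything is streamed in fixed-size chunks, so driver memory stays small even though driver local disk holds the full 100GB:

```python
import shutil
import tempfile

from transforms.api import transform, Input, Output

CHUNK = 64 * 1024 * 1024  # 64MB buffer; only this much is held in memory at once


@transform(
    parts=Input("/Project/raw/zip_parts"),     # hypothetical input dataset
    merged=Output("/Project/raw/merged_zip"),  # hypothetical output dataset
)
def merge_parts(parts, merged):
    fs_in = parts.filesystem()
    # split names its pieces with ordered suffixes (.aa, .ab, ...), so a
    # lexicographic sort restores the original byte order.
    part_files = sorted(fs_in.ls(), key=lambda f: f.path)

    # Stream every piece into one file on the driver's local disk.
    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as local:
        for part in part_files:
            with fs_in.open(part.path, "rb") as src:
                shutil.copyfileobj(src, local, length=CHUNK)
        merged_path = local.name

    # Upload the merged archive back to Foundry as a single file.
    with open(merged_path, "rb") as src, \
            merged.filesystem().open("merged.zip", "wb") as dst:
        shutil.copyfileobj(src, dst, length=CHUNK)
```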

You can unzip a file that does not fit into memory in Python. A large driver / lightweight transform should be able to handle this pretty easily.
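
For example, a minimal sketch with the standard library's zipfile module, assuming the merged archive already sits on the driver's local disk (the paths are hypothetical). ZipFile.open returns a file-like stream per member, so even members much larger than memory can be copied out chunk by chunk:

```python
import os
import shutil
import zipfile

ARCHIVE = "/tmp/merged.zip"  # hypothetical location of the merged zip
OUT_DIR = "/tmp/unzipped"    # hypothetical local scratch directory

with zipfile.ZipFile(ARCHIVE) as archive:
    for member in archive.infolist():
        if member.is_dir():
            continue
        dest = os.path.join(OUT_DIR, member.filename)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        # archive.open streams the member's bytes; copyfileobj moves them
        # in 64MB chunks, never holding a whole file in memory.
        with archive.open(member) as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst, length=64 * 1024 * 1024)
```

From there, the extracted files can be uploaded to the output dataset's filesystem the same way they were downloaded.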

As you said, downloading/uploading the files might take a little time, but I wouldn't expect it to be prohibitively slow, as transforms regularly download/upload terabytes of data.

From a cursory Google search, some options seem to be:

  • gzip
  • stream-unzip
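
If it helps, an untested sketch of the stream-unzip route (paths again hypothetical). It consumes an ordered iterable of byte chunks rather than a seekable file, so in principle you could even feed it the 1GB pieces in order and skip the explicit merge step entirely:

```python
import os

from stream_unzip import stream_unzip


def zipped_chunks(path="/tmp/merged.zip", chunk_size=64 * 1024 * 1024):
    # Yield the raw bytes of the archive in order; this could just as
    # well read the split pieces one after another.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    dest = os.path.join("/tmp/unzipped", file_name.decode())
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as out:
        # Each member's chunks must be fully consumed before the next
        # (file_name, file_size, chunks) tuple is yielded.
        for chunk in unzipped_chunks:
            out.write(chunk)
```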
