Unzipping Large Files in Foundry (100GB+)

Hi,

I have a large zip file (100GB). To upload it to Foundry, I first split it into 100 pieces of 1GB each.

I now want to unzip these files in Foundry. As I understand it, the required steps are:

  1. Merge the split zip files (since there is no native way to read the split pieces individually)
  2. Unzip them

However, due to the size of the data, I am not sure how to approach the problem.

  • For the first step, I would need to collect all the split zip files in driver memory to perform the merge, and from what I have explored, 100GB of driver memory is not an option.
  • For the second step, I would face the same driver-memory issue (before distributing the data to executors).

Please let me know if you have any recommendations for unzipping these files.

Kind regards,
Baris

A couple of follow-up points after iterating on this:

  • I later realised that driver memory != driver local storage, so I would not need 100GB of driver memory; the intermediate results can be kept on local storage.
  • However, the operations will still be network-heavy, and there would be bottlenecks around the following (a rough sketch follows this list):
  1. Downloading 100GB of files from the Foundry FileSystem to the Spark driver
  2. After the zip merge, uploading it back to Foundry and performing the write operation
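
For concreteness, here is the kind of transform sketch I have in mind — assuming the pieces were produced with the Unix split command (so plain byte concatenation restores the original archive). The dataset paths and chunk size are illustrative placeholders; everything is streamed in fixed-size chunks, so driver memory stays small even though driver local disk holds the full 100GB:

```python
import shutil
import tempfile

from transforms.api import transform, Input, Output

CHUNK = 64 * 1024 * 1024  # 64MB buffer; only this much is held in memory at once


@transform(
    parts=Input("/Project/raw/zip_parts"),     # hypothetical input dataset
    merged=Output("/Project/raw/merged_zip"),  # hypothetical output dataset
)
def merge_parts(parts, merged):
    fs_in = parts.filesystem()
    # split names its pieces with ordered suffixes (.aa, .ab, ...), so a
    # lexicographic sort restores the original byte order.
    part_files = sorted(fs_in.ls(), key=lambda f: f.path)

    # Stream every piece into one file on the driver's local disk.
    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as local:
        for part in part_files:
            with fs_in.open(part.path, "rb") as src:
                shutil.copyfileobj(src, local, length=CHUNK)
        merged_path = local.name

    # Upload the merged archive back to Foundry as a single file.
    with open(merged_path, "rb") as src, \
            merged.filesystem().open("merged.zip", "wb") as dst:
        shutil.copyfileobj(src, dst, length=CHUNK)
```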

You can unzip a file that does not fit into memory in Python. A large driver / lightweight transform should be able to handle this pretty easily.
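
For example, a minimal sketch with the standard library's zipfile module, assuming the merged archive already sits on the driver's local disk (the paths are hypothetical). ZipFile.open returns a file-like stream per member, so even members much larger than memory can be copied out chunk by chunk:

```python
import os
import shutil
import zipfile

ARCHIVE = "/tmp/merged.zip"  # hypothetical location of the merged zip
OUT_DIR = "/tmp/unzipped"    # hypothetical local scratch directory

with zipfile.ZipFile(ARCHIVE) as archive:
    for member in archive.infolist():
        if member.is_dir():
            continue
        dest = os.path.join(OUT_DIR, member.filename)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        # archive.open streams the member's bytes; copyfileobj moves them
        # in 64MB chunks, never holding a whole file in memory.
        with archive.open(member) as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst, length=64 * 1024 * 1024)
```

From there, the extracted files can be uploaded to the output dataset's filesystem the same way they were downloaded.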

As you said, downloading/uploading the files might take a little time, but I wouldn't expect it to be prohibitively slow, as transforms regularly download/upload terabytes of data.

From a cursory Google search, some options seem to be:

  • gzip
  • stream-unzip
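
If it helps, an untested sketch of the stream-unzip route (paths again hypothetical). It consumes an ordered iterable of byte chunks rather than a seekable file, so in principle you could even feed it the 1GB pieces in order and skip the explicit merge step entirely:

```python
import os

from stream_unzip import stream_unzip


def zipped_chunks(path="/tmp/merged.zip", chunk_size=64 * 1024 * 1024):
    # Yield the raw bytes of the archive in order; this could just as
    # well read the split pieces one after another.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    dest = os.path.join("/tmp/unzipped", file_name.decode())
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as out:
        # Each member's chunks must be fully consumed before the next
        # (file_name, file_size, chunks) tuple is yielded.
        for chunk in unzipped_chunks:
            out.write(chunk)
```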
