SFTP ingestion of files - How/where to compress to increase ingestion speed?

I’m setting up an SFTP connection.

I’m wondering about my options for optimizing the ingest - in particular, avoiding overly slow ingests - mainly where and what should be compressed.

Let’s assume I have raw CSV files I want to ingest.

  • I want to compress the files on my source FTP. I can split the CSVs, compress them into tar.gz archives on the FTP server, and ingest those.
    Question: Is there a particular size to aim for? Are there better formats that would help the processing downstream?
    As far as I know, Spark can’t split or stream gzipped CSVs, so the files should be small enough not to require too much memory in my pipeline.

  • I see that when setting up the Data Connection sync, I can add a “transformer”, and there is one that compresses the files.
    Question: Does this compression happen during the ingest?
    For example: does it affect compression as part of the client/server protocol with the SFTP server, or are the files compressed after the network transfer, before being saved in a dataset in Foundry?

Generally, the most common performance issue with SFTP or FTP ingests is excessive time to list files due to overly nested subdirectories (since there is a back-and-forth between the Data Ingestion process and the SFTP server for each cd command). This means that by far the most effective way to optimize performance is to ensure that there is minimal directory nesting on the server.
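
To make the cost concrete, here is a rough illustration (not the actual Data Ingestion implementation) using paramiko: every directory level adds at least one extra request/response round trip with the server. The hostname, credentials, and path below are placeholders.

```python
import stat

import paramiko

# Placeholder connection details, purely illustrative.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="ingest", password="change-me")
sftp = paramiko.SFTPClient.from_transport(transport)


def count_list_round_trips(path="."):
    """Recursively list a directory tree. Each listdir_attr() call is at
    least one round trip to the SFTP server, so a deeply nested tree spends
    most of its time discovering files rather than transferring them."""
    round_trips = 1
    for entry in sftp.listdir_attr(path):
        if stat.S_ISDIR(entry.st_mode):
            round_trips += count_list_round_trips(f"{path}/{entry.filename}")
    return round_trips


print(count_list_round_trips("/data"))
```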

Assuming for the sake of argument that the SFTP server directory hierarchy is entirely flat - you can certainly improve performance further by preemptively combining multiple CSVs into .tar.gz files on the SFTP server, which reduces both the number of files and the total bytes to be transferred. However, doing so introduces additional complexity: by combining multiple CSVs into tar archives, you lose the file path / file size / file modified heuristic that Data Ingestion uses to identify new files to ingest, and you need to track that state some other way. This trade-off is rarely worth the performance gain, unless you are dealing with a pathological case where there are tens of thousands of original CSVs and they are all tiny.

That said, .tar.gz is a reasonably good file format for parsing in a Foundry pipeline: although it is not splittable and you cannot divide a single file across multiple Spark tasks, it can still be parsed efficiently in a byte-streaming fashion in Python or Java transforms.
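
As a minimal sketch of what such a streaming parse could look like in a Python transform - assuming the standard transforms filesystem API with binary file modes; the dataset paths and function name are made up - each archive is read as a forward-only stream so it never has to fit in memory:

```python
import shutil
import tarfile

from transforms.api import transform, Input, Output


@transform(
    extracted=Output("/Example/datasets/extracted_csvs"),   # hypothetical output dataset
    tarballs=Input("/Example/datasets/sftp_tarballs"),      # hypothetical raw ingest dataset
)
def extract_tarballs(extracted, tarballs):
    fs_in = tarballs.filesystem()
    fs_out = extracted.filesystem()

    for status in fs_in.ls(glob="*.tar.gz"):
        with fs_in.open(status.path, "rb") as compressed:
            # mode="r|gz" reads the archive as a forward-only stream, so the
            # whole tarball never has to be held in memory at once.
            with tarfile.open(fileobj=compressed, mode="r|gz") as archive:
                for member in archive:
                    if not member.isfile():
                        continue
                    source = archive.extractfile(member)
                    # Copy each member CSV out as its own file in the output dataset.
                    with fs_out.open(member.name, "wb") as target:
                        shutil.copyfileobj(source, target)
```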

You can also compress the CSVs on the SFTP server without combining multiple files into an archive, using bz2 compression, which is splittable but very slow both to compress and decompress. gz, while not splittable, is also fine as long as the original, uncompressed files aren’t that big (up to a gigabyte or so should be fine).
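
On the read side (the paths below are made up), Spark’s CSV reader picks the decompression codec from the file extension, so both variants can be read directly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark decompresses based on the file extension. A .csv.bz2 file can be
# split across multiple tasks; a .csv.gz file is read by a single task,
# which is why its uncompressed size should stay modest (roughly <= 1 GB).
df_bz2 = spark.read.csv("sftp_ingest/*.csv.bz2", header=True)
df_gz = spark.read.csv("sftp_ingest/*.csv.gz", header=True)
```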

The files get compressed post network transfer - i.e., the agent receives the original, uncompressed bytes from the SFTP server and compresses them before uploading to Foundry.