SFTP ingestion of files - How/where to compress to increase ingestion speed?

I’m setting up an SFTP connection.

I’m wondering about my options for optimizing the ingest - in particular for avoiding overly slow ingests - mainly where and what should be compressed.

Let’s assume I have raw CSV files I want to ingest.

  • I want to compress the files on my source SFTP server. I can split the CSVs there, compress the chunks as tar.gz, and ingest those (see the first sketch below).
    Question: Is there a particular chunk size to aim for? Are there better formats that would help the processing downstream?
    As far as I know, Spark can’t split or stream gzipped CSVs, so each file needs to be small enough not to require too much memory in my pipeline (see the second sketch below).

  • I see that when setting up the Data Connection Sync, I can add a “transformer”, and there is one that compresses the files.
    Question: Does this compression happen during the ingest?
    In other words: does it enable compression as part of the client/server protocol with the SFTP server (see the last sketch below), or are the files compressed after the network transfer, before being saved into a dataset on Foundry?
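
For context on the first bullet, here is roughly the splitting/compression step I have in mind, as a minimal sketch (the file names and rows-per-chunk value are placeholders I’d still tune). I’ve sketched one .csv.gz per chunk rather than a single tar.gz, since Spark reads gzipped CSVs directly but can’t look inside tar archives:

```python
import csv
import gzip
from pathlib import Path

def split_and_gzip(src_csv: str, out_dir: str, rows_per_chunk: int = 1_000_000) -> None:
    """Split a large CSV into row-based chunks and gzip each chunk.

    Every chunk keeps the header row so it can be parsed
    independently downstream.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(src_csv, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        gz, writer, chunk_idx = None, None, 0
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if gz is not None:
                    gz.close()
                gz = gzip.open(out / f"part-{chunk_idx:04d}.csv.gz", "wt", newline="")
                writer = csv.writer(gz)
                writer.writerow(header)
                chunk_idx += 1
            writer.writerow(row)
        if gz is not None:
            gz.close()

# Placeholder names; I'd tune rows_per_chunk so each compressed chunk
# stays modest in size, since Spark reads a .gz file in a single task.
split_and_gzip("big_export.csv", "staging/", rows_per_chunk=1_000_000)
```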
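
And here is the quick check behind my claim about splittability - each gzipped CSV lands in a single Spark task no matter its size (paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# gzip is not a splittable codec: Spark decompresses .csv.gz files
# transparently, but a given file is never divided across tasks,
# so one huge .gz file means one long-running task.
df = spark.read.option("header", True).csv("staging/*.csv.gz")
print(df.rdd.getNumPartitions())  # bounded by the number of .gz files
```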
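
To make the second bullet concrete: by “compression as part of the client/server protocol” I mean something like SSH transport-level compression, which outside Foundry looks like this with paramiko (hostname and credentials are placeholders; I don’t know whether the transformer does anything like this):

```python
import paramiko

# SSH can negotiate zlib compression on the wire, independent of
# whether the files themselves are compressed at rest.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.use_compression(True)  # must be requested before negotiation
transport.connect(username="user", password="...")

sftp = paramiko.SFTPClient.from_transport(transport)
sftp.get("/outbound/part-0000.csv.gz", "part-0000.csv.gz")
sftp.close()
transport.close()
```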
