I’m setting up an SFTP connection.
I’m wondering about my options for optimizing the ingest, in particular to avoid overly slow ingests: mainly, where and what should be compressed.
Let’s assume I have raw CSV files I want to ingest.
- I want to compress the files on my source FTP server. I can split the CSVs, compress them into `tar.gz` archives on the server, and ingest those. Question: Is there a particular size to aim for? Are there better formats that would help the processing downstream? As far as I know, Spark can’t split or stream gzipped CSVs, so the file sizes should be low enough not to require too much memory in my pipeline (see the sketches after this list).
- I see that when setting up the Data Connection Sync, I can add a “transformer”, and there is one that compresses the files. Question: Does this compression happen during the ingest? For example, does it affect compression as part of the client/server protocol with the SFTP server, or are the files compressed after the network transfer, before being saved in a dataset on Foundry?
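
For reference, here’s roughly the splitting step I have in mind (just a sketch: the ~100 MB target, paths, and file names are placeholders, and I’m gzipping each chunk individually rather than building a `tar.gz`):

```python
import gzip
from pathlib import Path

# Placeholder target: ~100 MB of uncompressed data per chunk.
TARGET_UNCOMPRESSED_BYTES = 100 * 1024 * 1024

def split_and_gzip(source: Path, out_dir: Path) -> None:
    """Split one large CSV into gzipped chunks, repeating the header in each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with source.open(newline="") as f:
        header = f.readline()  # keep the header for every chunk
        part, written, out = 0, 0, None
        for line in f:
            if out is None:
                out = gzip.open(out_dir / f"{source.stem}_part{part:04d}.csv.gz", "wt")
                out.write(header)
                written = len(header)
            out.write(line)
            written += len(line)  # character count as a rough byte proxy
            if written >= TARGET_UNCOMPRESSED_BYTES:
                out.close()
                out, part = None, part + 1
        if out is not None:
            out.close()

split_and_gzip(Path("raw/big_export.csv"), Path("staging/"))
```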
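
And here’s how I’ve been checking the non-splittable gzip behavior I mentioned (paths are placeholders): Spark reads a gzipped CSV as a single partition, so one executor has to decompress the whole file.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-split-check").getOrCreate()

plain = spark.read.option("header", True).csv("staging/big_export.csv")
gzipped = spark.read.option("header", True).csv("staging/big_export_part0000.csv.gz")

# The plain file can be split across many input partitions; the gzipped
# one comes in as exactly one partition regardless of its size.
print(plain.rdd.getNumPartitions())
print(gzipped.rdd.getNumPartitions())
```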