Workaround to skip "replay on all input data" for first deployment of incremental pipeline

Hello,

We have a dataset with 100 million append-only files (not kidding) that we want to use as an input to an incremental pipeline.

We have been unable to do the initial deployment of this pipeline: since the output is empty, there is no previous transaction to compare logic versions against, which triggers the bad path of "could not run incrementally" → replay from the start of the input data → Spark OOM.

We really don’t care about the existing 100 million files for this pipeline (although that data is important for other workflows, so we can’t nuke the input dataset). Is there any way to manually commit an empty snapshot transaction to the output dataset, and set it to the current logic version, so that we can circumvent the forced replay?
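
Something like this is what I have in mind, if it would even work (untested sketch; the dataset paths and schema are placeholders, and I’m assuming the `semantic_version` argument on `@incremental` is what the logic-version comparison uses):

```python
# Untested sketch: paths and the output schema below are placeholders.
from pyspark.sql import types as T
from transforms.api import transform, incremental, Input, Output


@incremental(semantic_version=2)  # bump this when the transform logic changes
@transform(
    out=Output("/Project/datasets/derived_output"),
    source=Input("/Project/datasets/append_only_input"),
)
def compute(ctx, out, source):
    if not ctx.is_incremental:
        # First deployment: never read the 100M historical files at all.
        # Instead, commit an empty snapshot with the (placeholder) output
        # schema so a later build has a baseline transaction to start from.
        schema = T.StructType([T.StructField("id", T.StringType())])
        out.set_mode("replace")
        out.write_dataframe(ctx.spark_session.createDataFrame([], schema))
        return

    # Subsequent builds only see rows appended since the previous build.
    out.set_mode("modify")
    out.write_dataframe(source.dataframe())
```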

Hey, you might be able to try the filter files transform. If you don’t care about the old data, could you make a new dataset?

Is there meant to be a “Filter Files” transform? If so, I’m not seeing it; am I not looking in the right place? That sounds like exactly what we’d want!

As for creating a new dataset: the input is written to by a Data Ingest associated with a “pointer” to an external system. Setting up a new Ingest would necessarily involve creating a new “pointer” that might be out of line with the existing one, so I don’t think that is an option for us.

Ah yes, you’d need to do it on a raw files dataset - if you have this coming from an ingest it might not be an option for you.
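For what it’s worth, a rough manual equivalent on a raw files dataset might look something like this (untested; paths are placeholders, and I’m assuming the usual `path`/`size`/`modified` columns on the file listing, with `modified` in epoch millis):

```python
# Paths are placeholders; column names and units on files() are assumptions.
from transforms.api import transform, Input, Output

CUTOFF_MS = 1704067200000  # e.g. 2024-01-01 UTC in epoch millis; pick your own


@transform(
    out=Output("/Project/datasets/recent_file_index"),
    raw=Input("/Project/datasets/raw_files_input"),
)
def filter_old_files(out, raw):
    # files() lists file metadata only (no contents), so this stays relatively
    # cheap even with ~100M entries; downstream logic can open just these paths.
    recent = raw.filesystem().files().where(f"modified >= {CUTOFF_MS}")
    out.write_dataframe(recent)
```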

Hm, is scaling your compute to have more memory an option?
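
If so, something like the following with Spark profiles might get you through the one-off replay (the profile names are just common examples and would need to be allow-listed for your repository):

```python
# Profile names below are illustrative; available profiles vary by enrollment.
from transforms.api import configure, transform, Input, Output


@configure(profile=["DRIVER_MEMORY_LARGE", "EXECUTOR_MEMORY_LARGE"])
@transform(
    out=Output("/Project/datasets/derived_output"),
    source=Input("/Project/datasets/append_only_input"),
)
def compute(out, source):
    out.write_dataframe(source.dataframe())
```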