We have a dataset with 100 million append-only files (not kidding) that we want to use as an input to an incremental pipeline.
We have been unable to deploy this pipeline so far: since the output is initially empty, there is no prior transaction to compare logic versions against, which triggers the bad path of "could not run incrementally" → replay from the start of the input data → Spark OOM.
We really don’t care about the existing 100 million files for this pipeline (although that data is important for other workflows, so we can’t nuke the input dataset). Is there any way to manually commit an empty snapshot transaction to the output dataset, and set it to the current eddie logic version so that we can circumvent the forced replay?
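For reference, the workaround we've been imagining (if a manual empty commit isn't possible) is to handle the bootstrap inside the transform itself: on the first, non-incremental run, write an empty output with the right schema instead of touching the 100M-file backlog, so that later runs can proceed incrementally from that point. A rough sketch of what I mean is below; paths, the schema, and the semantic version are placeholders, not our real pipeline.

```python
from pyspark.sql import types as T
from transforms.api import transform, incremental, Input, Output


@incremental(semantic_version=1)
@transform(
    out=Output("/Project/output_dataset"),          # placeholder path
    source=Input("/Project/huge_appendonly_input"),  # placeholder path
)
def compute(ctx, source, out):
    if not ctx.is_incremental:
        # First (non-incremental) run: skip the backlog entirely and commit an
        # empty snapshot with the expected schema, so subsequent runs only see
        # files appended after this transaction.
        schema = T.StructType([
            T.StructField("id", T.StringType()),     # placeholder columns;
            T.StructField("value", T.DoubleType()),  # match the real output schema
        ])
        out.write_dataframe(ctx.spark_session.createDataFrame([], schema))
        return

    # Incremental runs: only rows/files added since the last transaction are read.
    out.write_dataframe(source.dataframe())
```

Is that a reasonable way to sidestep the forced replay, or is there a cleaner/manual way to seed that first transaction?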
Is there meant to be a “Filter Files” transform? If so, I’m not seeing it; am I looking in the wrong place? That sounds like exactly what we’d want!
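In case it helps clarify what I'm after, here's roughly what I'd expect a "Filter Files"-style step to do, approximated in a plain transform over the input's file metadata. The paths and cutoff are made up, and I'm assuming the files() listing exposes a modified timestamp to filter on.

```python
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/recent_files_metadata"),    # placeholder path
    source=Input("/Project/huge_appendonly_input"),  # placeholder path
)
def filter_files(ctx, source, out):
    # File-level metadata for the input dataset (path, size, modified, ...).
    files_df = source.filesystem().files()

    # Keep only files appended after some cutover point; the actual cutoff and
    # the exact format of the "modified" column are assumptions on my part.
    cutover_ms = 1735689600000  # e.g. 2025-01-01T00:00:00Z in epoch millis
    recent = files_df.filter(files_df.modified >= cutover_ms)

    out.write_dataframe(recent)
```

A downstream step would then read only the files listed there rather than the full 100M-file history.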
As for creating a new dataset: the input is written to by a Data Ingest associated with a “pointer” to an external system. If we set up a new Ingest, it would necessarily involve creating a new “pointer” that might be out of line with the existing one, so I don’t think that is an option for us.