Compaction of incremental pipelines without breaking the provenance

I was wondering what the best practice is for compacting the Parquet files of long-running, incremental pipelines that accumulate many small files - without breaking the provenance records, so that downstream transforms can still consume the results incrementally.

The issue with long-running (multi-year) pipelines is that the number of input files keeps growing, which makes reads in downstream transforms progressively slower.

We have a first-class solution for this problem called “projections.” See https://www.palantir.com/docs/foundry/building-pipelines/maintaining-incremental-performance/#dataset-projections and the additional documentation linked from that section for details.