I was wondering what the best practice is for compacting Parquet files in long-running, incremental pipelines that accumulate many small files, without breaking the provenance records, so that downstream transforms can still consume the results incrementally.
The issue with pipelines that run for years is that the number of input files keeps growing, which makes reads in downstream transforms slower and slower.
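For illustration, here is a minimal sketch of what I mean by compaction, written in plain PySpark (the paths and the target partition count are hypothetical placeholders, not my actual setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

# Read the accumulated small files of one logical dataset.
df = spark.read.parquet("/data/events/")  # hypothetical input path

# Rewrite the same rows into a small, fixed number of files.
(df.repartition(16)                       # target file count is just an example
   .write.mode("overwrite")
   .parquet("/data/events_compacted/"))   # hypothetical output path
```

The catch is that a rewrite like this shows up as a full snapshot of the dataset, so downstream transforms lose the ability to process only the newly arrived rows, which is exactly what I'd like to avoid.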