Imagine we have an incremental pipeline that is receiving many update/append transactions in the input dataset, say a data ingest.
Our pipeline does some operation and then appends to the output.
Over time, this output dataset accumulates many files, each containing only a few rows of data.
I believe this large number of small files can negatively affect performance in most other workflows/apps, like downstream transforms or Contour.
Normally, in a PySpark or code-based transform, you could incorporate a trigger that runs, say, on weekends or when the number of files in the output dataset reaches a certain threshold, and repartitions the entire output dataset into a smaller, optimal number of files by snapshotting the output.
Is this sort of functionality available in Pipeline Builder? If not, does this become a problem, and what's the best way to deal with it?
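For example, here's a rough sketch of that pattern using the Foundry transforms Python API (the paths and thresholds are made up, and the exact filesystem/previous-output calls may need adjusting for your setup):

```python
from transforms.api import transform, incremental, Input, Output

# Illustrative knobs -- tune for your data volumes.
FILE_THRESHOLD = 500      # compact once the output exceeds this many files
TARGET_PARTITIONS = 16    # number of files to rewrite the snapshot into

@incremental()
@transform(
    out=Output("/Project/datasets/ingest_output"),   # hypothetical path
    source=Input("/Project/datasets/raw_ingest"),    # hypothetical path
)
def compute(out, source):
    new_rows = source.dataframe()  # incremental mode: only unprocessed rows

    # Count the files backing the previously committed output view.
    n_files = sum(1 for _ in out.filesystem("previous").ls())

    if n_files > FILE_THRESHOLD:
        # Too many small files: merge the previous output with the new rows,
        # switch the output to snapshot ("replace") mode, and rewrite the
        # whole dataset as a small number of larger files.
        full = out.dataframe("previous").unionByName(new_rows)
        out.set_mode("replace")
        out.write_dataframe(full.repartition(TARGET_PARTITIONS))
    else:
        # Normal incremental run: just append the new rows.
        out.write_dataframe(new_rows)
```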
Hey, I would say using the repartition expression in Pipeline Builder is your best bet here, but we don’t currently have a way to run only a subset of logic on a particular schedule cadence. You would have to go in and snapshot the repartitioned output manually. We will track this as an FR though!
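(For comparison, in a code-based transform the usual way to force that one-off snapshot is bumping the semantic_version on the @incremental decorator. A hedged sketch, with hypothetical paths:)

```python
from transforms.api import transform, incremental, Input, Output

@incremental(semantic_version=2)  # bumped from 1: the next build runs as a snapshot
@transform(
    out=Output("/Project/datasets/ingest_output"),  # hypothetical path
    source=Input("/Project/datasets/raw_ingest"),   # hypothetical path
)
def compute(out, source):
    # On the forced snapshot run, source.dataframe() contains the full input,
    # so this rewrites the entire output as a small number of larger files.
    out.write_dataframe(source.dataframe().repartition(16))
```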
I think this is what dataset projections are for: https://www.palantir.com/docs/foundry/optimizing-pipelines/projections-overview/
If you have a dataset that is created by an incremental transform, so that the dataset rows are append-only, a dataset projection will let you break out of the strict “append only” paradigm while keeping incrementality. On a regular schedule, the projection will compact and rearrange your rows into a smaller, more manageable number of files, and it will allow that projection dataset to count as “incremental” (even though the rows are constantly being rearranged, not strictly appended).
I haven’t used Pipeline Builder, but I assume your pipeline would need a “dataset output” for this to work, and then you could define a projection on that output dataset.