Hi everyone,
I’m working on a project in Pipeline Builder where I’ve applied several transformations (filtering, joining, aggregations, pivoting, etc.) and then deployed the result as an output dataset.
The challenge I’m facing is with incremental input data.
- Since the dataset is very large, I only want to build the pipeline once a week (instead of daily).
- However, when I do this, the pipeline only applies my transformations to the new incremental rows from that week.
- This leads to misleading results. For example:
  - Let’s say I initially calculate a yearly average of a column (using snapshot mode).
  - On the next weekly build, the average is computed only on the last 7 days of data, completely ignoring the previous year.
  - As a result, the average is no longer representative.
I could potentially solve this for averages by using rolling calculations, but this problem extends to other transformations (like joins, pivots, and aggregations) where rolling approaches don’t work.
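To make the average example concrete, here is a minimal plain-Python sketch. It is not Pipeline Builder code: the numbers and the `RunningMean` helper are hypothetical and only illustrate the idea. It compares an average computed on the weekly increment alone with one that carries a small state (sum and count) forward between builds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningMean:
    """Hypothetical helper: keeps just enough state (sum, count) to update a mean."""
    total: float = 0.0
    count: int = 0

    def update(self, rows: Iterable[float]) -> "RunningMean":
        # Fold only the new rows into the stored state.
        for value in rows:
            self.total += value
            self.count += 1
        return self

    @property
    def mean(self) -> float:
        if self.count == 0:
            raise ValueError("no rows seen yet")
        return self.total / self.count


# Made-up data: 358 days of history plus one new week of 7 rows.
history = [10.0] * 358
new_week = [20.0] * 7

# What the weekly incremental build effectively computes: only the new rows.
increment_only_mean = sum(new_week) / len(new_week)

# Carrying sum and count forward gives the same result as a full recompute.
state = RunningMean().update(history)   # first (snapshot) build
state.update(new_week)                  # next weekly build, new rows only
full_recompute_mean = sum(history + new_week) / len(history + new_week)

print(increment_only_mean, state.mean, full_recompute_mean)
```

This only works because a mean can be rebuilt from a sum and a count. I don't see an equivalent small state for joins or pivots, which is why the rolling approach doesn't cover my other transformations.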
What I need:
I want to build the pipeline only once a week and, ideally, process only the new rows, but my transformations should still take the entire historical dataset into account, not just the weekly increment.
Has anyone solved a similar issue, or is there a recommended approach in Palantir for handling this kind of incremental + historical transformation scenario?
Thanks in advance!