I’d like to confirm my understanding of the limitations of incremental pipelines in code repos & pipeline builder.
In Pipeline Builder you cannot output an incremental dataset that adds new rows and replaces old rows.

Ex: Previous transaction
PK = 1 | val = A
PK = 2 | val = B

New append transaction
PK = 1 | val = C

We could not create an incremental dataset in Pipeline Builder that looks like:
PK = 1 | val = C ← replaced
PK = 2 | val = B
We could create this as a snapshot dataset using Snapshot replace.
We also cannot accomplish any replace functionality using incremental transforms in code repos. Is that correct? Other data processing tools (for example dbt) have incremental transforms that can replace old rows with new rows, so I am surprised Foundry does not support this.
Hey! The above about Pipeline Builder is correct; you would need to use the Snapshot replace output write mode.
As for code repos, you can replace rows using an incremental transform!
One way to do this is to read in the previous output, join it against the new input rows, and apply your own logic to decide how the rows should be replaced:
1. Read the previous output into a DataFrame: previous = out.dataframe('previous', schema)
2. Join/anti-join this DataFrame with the new input rows.
3. Implement your custom logic to decide how to replace the rows. In your case it sounds like you just want the latest version of each row, so you can anti-join the previous output against the new rows on the primary key (step 2) and then union the result with the new data; see the sketch below.
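Here’s a minimal sketch of that pattern in a Code Repositories incremental transform. The dataset paths, the pk/val schema, and the transform name are all made up for illustration; the general shape follows the standard transforms.api incremental API.

```python
from pyspark.sql import types as T
from transforms.api import transform, Input, Output, incremental

# Hypothetical schema matching the example above (pk, val).
SCHEMA = T.StructType([
    T.StructField("pk", T.IntegerType()),
    T.StructField("val", T.StringType()),
])


@incremental(semantic_version=1)
@transform(
    out=Output("/project/datasets/transactions_latest"),  # hypothetical path
    source=Input("/project/datasets/transactions_raw"),   # hypothetical path
)
def replace_by_pk(out, source):
    # In an incremental run, source.dataframe() returns only the rows
    # appended since the last build.
    new_rows = source.dataframe()

    # Step 1: read everything previously written to the output.
    previous = out.dataframe('previous', SCHEMA)

    # Step 2: anti-join to keep only previous rows whose pk does NOT
    # appear in the new batch.
    kept = previous.join(new_rows.select("pk"), on="pk", how="left_anti")

    # Step 3: union the surviving old rows with the new rows, so the
    # newest value per pk wins.
    result = kept.unionByName(new_rows)

    # Rewriting the full output means this transaction is a snapshot,
    # not an append.
    out.set_mode('replace')
    out.write_dataframe(result)
```

Note that out.set_mode('replace') is what makes the output transaction a snapshot rather than an append, which is relevant to the follow-up question below.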
How does this affect downstream transforms when you overwrite the entire output?
I.e. what do downstream transforms process when their entire input is overwritten? Does it essentially make the rest of the pipeline function as a batch pipeline, so you lose the performance benefits of an incremental one?
The transaction on the output you’re working with would be a snapshot rather than an append. In general, for this use case you can’t really get away with a purely incremental, append-only transaction, since you have to compare the latest rows with your previous output to decide what to keep or drop. So in that sense, yes, you wouldn’t get the same performance benefits as a typical append-only incremental pipeline, because you always have to read in your previous output. Downstream incremental transforms will see the snapshot transaction and re-process their entire input rather than just the new rows; a sketch of what that looks like is below.
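As a rough illustration of what a downstream transform sees (assuming the standard ctx.is_incremental flag on the incremental transform context; the dataset paths are again hypothetical):

```python
from transforms.api import transform, Input, Output, incremental


@incremental()
@transform(
    out=Output("/project/datasets/downstream"),             # hypothetical path
    source=Input("/project/datasets/transactions_latest"),  # the replaced output above
)
def downstream(ctx, out, source):
    # ctx.is_incremental tells you which mode this run is in:
    # - True:  the upstream transaction was an append, so source.dataframe()
    #          returns only the newly added rows.
    # - False: the upstream transaction was a snapshot (e.g. after
    #          set_mode('replace')), so the transform falls back to a full
    #          batch run and source.dataframe() returns the entire input.
    rows = source.dataframe()
    out.write_dataframe(rows)
```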