I have an S3 source that delivers a full snapshot of data on every run, containing historical records, new records, and records that may have been updated. I want to configure Pipeline Builder so that I only ingest new records and also update existing records without rewriting or duplicating the entire dataset each time. What is the best practice in Foundry for handling this type of full‑load‑every‑time pattern?
If you were only going to update new data, I would’ve recommended using an incremental build:
The incremental computation is an efficient method of performing transforms on an input dataset to generate an output dataset. By leveraging the build history of a transform, incremental computation avoids the need to recompute the entire output dataset every time a transform is run.
- Full documentation here
- example of how to create it in pipeline builder here
- example of how to create it in code repositories here
The tricky part is if you want to update existing records you won’t be able to get away with just a pure incremental build. One idea is you could have a dataset that just takes in the new records and then create another dataset downstream that joins the new records + changes to the default data + update those values. You could also do a full snapshot.
Thanks for your answer! I was thinking to Configure UPDATE transaction type and then do the incremental load in the Pipeline Builder. Will that work?
If you have some columns on which you can filter out data that has not changed I would use that in sync by setting it with incremental mode + append. If you do not, then you’ll have to take full snapshot version and then find changes.
In first scenario you can create pipeline builder with input dataset set to “incremental” mode. In second you’ll have to use snapshot input.
Pipeline builder also lets you to configure write mode for pipeline’s output dataset. Snapshot replace mode is solution to your case. Primary key column must be present in data for this mode to work.
Also note, that if you have many rows and compute-intensive transformations you can write more advanced but also more flexible transformation logic in code repositories.
I’m correcting myself; I want to know if there is way that we could achieve this at the sync. I have a column which specifies the time the row got updated, I guess this column could be used to compare with the existing records. How exactly should I configure the Sync?
