Hi @VincentF, do you want the newest CSV to completely replace your dataset, or could multiple CSVs contain the latest records you care about?
If it’s the former, I would snapshot the input and then just filter out the deleted records based on the deletion column. If it’s the latter, I would also drop the records marked as deleted, but then do a window over the primary key, sorted by the latest date, to keep only the latest record. (You would need the upstream to be incremental, or make the pipeline in PB incremental, to keep all the rows.)
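As a minimal PySpark sketch of that second case, assuming hypothetical columns `id` (primary key), `updated_at` (record timestamp) and `is_deleted` (boolean deletion flag) — adapt to your schema:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("raw_csvs/", header=True, inferSchema=True)

# Keep only the latest row per primary key, then drop keys whose latest
# state is flagged as deleted (doing it in this order means a re-deleted
# key does not resurface through an older non-deleted row).
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
latest = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
      .filter(~F.col("is_deleted"))  # assumes a boolean deletion flag
)
```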
Let me know if that makes sense or if you have any other questions!
It’s the latter: new CSVs keep getting loaded upstream.
Given the scale, I would like to avoid snapshotting my output.
From what I know:
Strategy 1 - Make the whole pipeline handle added/deleted flags
The whole pipeline can be made incremental by handling new rows in a custom way (these would be strict additions only, potentially with a time-based primary key).
The transform would split the “new” rows, the “modified” rows, and the “deleted” rows to handle each case separately (recompute an aggregation when deleted rows come in, etc.; see the sketch below).
In short, in some cases it’s not even necessary, strictly speaking, to filter to the latest record.
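A rough sketch of what I mean, assuming a hypothetical `change_type` column on the incoming rows and an `id` primary key (the `category`/`amount` aggregation at the end is just an illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
incoming = spark.read.parquet("incoming_changes/")   # new rows only (incremental input)
current = spark.read.parquet("current_state/")       # previously materialized state

added    = incoming.filter(F.col("change_type") == "added")
modified = incoming.filter(F.col("change_type") == "modified")
deleted  = incoming.filter(F.col("change_type") == "deleted")

# Apply the changes to the current state: drop modified/deleted keys,
# then append the added and modified rows.
to_remove = modified.select("id").union(deleted.select("id")).distinct()
updated = (
    current.join(to_remove, on="id", how="left_anti")
           .unionByName(added.drop("change_type"))
           .unionByName(modified.drop("change_type"))
)

# Example downstream step: recompute an aggregation because deletions arrived.
totals = updated.groupBy("category").agg(F.sum("amount").alias("total_amount"))
```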
Strategy 2 - Compute the “latest view”
Here we really do need to compute the “latest” view.
We can use Views, which can deduplicate on a timestamp and drop records marked as deleted.
Pro: very fast (it computes the latest view on the fly and, as a background task, precomputes what is needed to make this efficient)
Con: the precomputation in the background will incur some cost (it can still be more efficient than a snapshot)
If the goal is to obtain an Object, you can just rely on the way the Ontology reads datasets and let it deduplicate data for you, by leveraging how the Ontology works with incremental syncs. You basically need to expose a “changelog” dataset that the Ontology will be able to interpret.
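As a rough illustration of what I mean by a “changelog” dataset (the column names `id`, `updated_at`, `is_deleted` are just assumptions): one row per change, appended incrementally and never rewritten.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
changelog = spark.createDataFrame(
    [
        ("A1", "2024-05-01 10:00:00", False),  # A1 created
        ("A1", "2024-05-02 09:30:00", False),  # A1 updated
        ("A1", "2024-05-03 08:15:00", True),   # A1 deleted
        ("B7", "2024-05-02 11:45:00", False),  # B7 created
    ],
    ["id", "updated_at", "is_deleted"],
)
# The consumer (sync or View) resolves each primary key to its latest row
# and drops keys whose latest row is flagged as deleted.
```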
We can use the “snapshot replace and remove” logic in Pipeline Builder, so that Pipeline Builder only updates the new rows instead of rewriting the whole dataset. See Snapshot replace and remove. It’s still technically a snapshot.
If the goal is to still “just update” the relevant rows, you can move off Pipeline Builder to a Code Repository to have more control over the files you write (e.g., if you know that updates arrive “per date”, you can partition by date and rewrite only the relevant file for that date).
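For example, outside the Foundry transforms API, the plain Spark equivalent of that partition-level rewrite would look roughly like this (the `event_date` column and the paths are assumptions; in a Code Repository the read/write would go through the transforms API instead):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Only overwrite the partitions that appear in the written DataFrame,
# leaving every other date's files untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

updated_days = spark.read.parquet("staging/updated_days/")  # rows for a few dates only

(
    updated_days.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("output/table/")
)
```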
Do you have other ideas, or more precise pointers? In particular, natively in Pipeline Builder.
I actually think you could do Snapshot replace (rather than replace and remove) and, in your transform path, remove all the deleted rows.
Snapshot replace should automatically replace old rows with newer rows based on a primary key. If there are duplicate primary keys in the input, you’ll just have to filter down to the latest one. What you linked, Snapshot replace and remove