Best approach for incremental processing when S3 source provides full snapshots

We’re ingesting data from an S3 bucket where an upstream Spark job writes a full snapshot (a single file, fully replaced) on a scheduled basis. Each snapshot contains all rows, both unchanged and new/modified records. Our data has a unique key and a last-modified timestamp column.

Our challenge is that we don’t want to process the entire snapshot through our downstream Pipeline Builder pipelines every time, as they include compute-heavy transforms and LLM calls. Our current approach is to use a Python transform to compare each new snapshot against the previous state, detect only the new and modified rows (the delta), and append only that delta to a master dataset. Downstream Pipeline Builder pipelines can then read the master dataset with the incremental input toggle enabled, processing only the newly appended rows. The Pipeline Builder output is then set to Snapshot Replace for deduplication.

However, since the Snapshot Replace output is a REPLACE transaction, any further downstream pipeline reading from it cannot be set to incremental, meaning the incremental chain stops there. Is this the recommended pattern for handling full S3 snapshots with incremental downstream processing in Foundry? Is there a way to extend the incremental chain beyond the Snapshot Replace output, or is this a known limitation?

You need to write a transform that “converts” the snapshot into an incremental dataset.

This processing is itself somewhat intensive (you essentially do an anti-join of the full historical data against the current snapshot), but it’s an “upstream” investment that lets the rest of your pipeline run incrementally.

You should look at the merge-and-append example here: https://www.palantir.com/docs/foundry/transforms-python-spark/incremental-examples#merge-and-append

You can implement it in a Code Repository or in Pipeline Builder (via the output write modes).
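For illustration, here’s a minimal sketch of such a transform in a Code Repository, loosely following the merge-and-append pattern from the docs. The dataset paths and column names (`id` for the unique key, `last_modified` for the timestamp) are placeholders for your own; `snapshot_inputs` tells the framework that the fully replaced upstream input should not force a full recompute:

```python
from transforms.api import transform, incremental, Input, Output


@incremental(snapshot_inputs=["snapshot"])
@transform(
    snapshot=Input("/path/to/s3_snapshot"),    # full snapshot, replaced each run
    master=Output("/path/to/master_dataset"),  # append-only delta feed
)
def compute(snapshot, master):
    # The upstream replaces the file each run, so this is the full snapshot.
    new_df = snapshot.dataframe()

    # Everything written to the master so far; the schema argument covers
    # the very first run, when no previous output exists yet.
    previous_df = master.dataframe("previous", schema=new_df.schema)

    # Keep only new or modified rows: a left anti-join drops any row whose
    # (id, last_modified) pair has already been written.
    delta_df = new_df.join(previous_df, on=["id", "last_modified"], how="left_anti")

    # Append just the delta; downstream incremental readers see only these rows.
    master.set_mode("modify")
    master.write_dataframe(delta_df)
```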

Thanks for the confirmation! I’ve implemented the anti-join pattern — comparing the full snapshot with the previous output and appending only the delta. This works great and downstream pipelines can read the master incrementally.

However, I’m running into a challenge at the deduplication stage. Since the master dataset accumulates via APPEND, it contains duplicate keys (old and new versions of modified rows). To produce clean, deduplicated data I use Snapshot Replace as the output write mode — which creates a REPLACE transaction. This means any pipeline reading from that deduplicated output cannot be set to incremental anymore, since everything looks “new” after a REPLACE.

In short: the incremental chain breaks at the deduplication step. Is there a recommended pattern for maintaining incremental processing beyond the Snapshot Replace output? Or is the accepted approach to keep a parallel APPEND feed for any further downstream pipeline that needs incremental input?

Well, the big question is: why do you need to deduplicate?

When you have an incremental pipeline, you receive “updates” (a bit like in a CDC fashion) or you receive “whatever is new”. As long as you propagate this, you keep the incremental benefit. As soon as you revert to “what is the latest”, you lose it.

Usually, these are the reasons why you want to deduplicate / get “the latest of”:

  • Because you want to expose a “latest” version of the dataset for analytics (Contour, etc.) and end users, or because you want to export it => You can use a View to expose a single, efficiently deduplicated version of the dataset while keeping its input “duplicated” for further pipeline usage (conceptually the “latest per key” logic sketched after this list). See https://www.palantir.com/docs/foundry/data-integration/views
  • Because you sync the data in the Ontology => That’s actually not needed. The Ontology can sync data incrementally, as long as there are no duplicated primary keys in each transaction. You can have duplicated primary keys at a dataset level. See https://www.palantir.com/docs/foundry/object-indexing/funnel-batch-pipelines
  • For downstream pipelines => In this case I would advocate keeping it incremental and propagating the logic downstream.
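
To make the View bullet concrete: the read-time deduplication is conceptually the classic “latest row per primary key” pattern. A minimal PySpark sketch, with hypothetical column names `id` and `last_modified`:

```python
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.window import Window


def latest_per_key(df: DataFrame, key: str = "id", ts: str = "last_modified") -> DataFrame:
    """Keep only the most recent row per primary key."""
    w = Window.partitionBy(key).orderBy(F.col(ts).desc())
    return (
        df.withColumn("_rn", F.row_number().over(w))
        .filter(F.col("_rn") == 1)
        .drop("_rn")
    )
```

The key point is where this logic lives: at the point of consumption rather than as a REPLACE transaction in the pipeline, so the append-only dataset underneath stays incremental-friendly.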

Note, another post that might be relevant: https://community.palantir.com/t/how-to-perform-an-update-transaction-on-a-single-partition-in-an-incremental-output-dataset/4548/2

Thank you for the detailed reply! Here’s my understanding based on your suggestions:

  1. Don’t deduplicate in the pipeline. The incremental benefit is lost the moment you revert to “what is the latest” (i.e., Snapshot Replace creates a REPLACE transaction, and downstream pipelines can no longer read it incrementally).

  2. Keep everything APPEND throughout the pipeline. Delta detection → business transforms → ML inference: every output should be APPEND. This keeps the incremental chain alive from start to finish (see the downstream consumer sketch after this list).

  3. Duplicate primary keys across transactions are OK. As long as each individual APPEND transaction has unique primary keys within itself, duplicates across transactions are acceptable and expected in an incremental pipeline.

  4. The Ontology handles deduplication automatically. Object Storage V2 uses a “most recent transaction wins” strategy — if the same primary key appears in multiple APPEND transactions, the Ontology keeps the row from the latest transaction. No need to deduplicate before syncing to the Ontology.

  5. For analytics and dashboards, use Views instead of materialized datasets. A View can deduplicate at read time (e.g., keep latest per primary key) without creating a new dataset or a REPLACE transaction. The underlying APPEND dataset stays incremental-friendly.

  6. Snapshot Replace is only needed for specific cases like exporting clean data to external systems that can’t handle duplicates — not for Ontology or internal Foundry consumption.

  7. The final architecture is: APPEND everywhere, deduplicate only at the point of consumption (Ontology / Views).
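
To illustrate point 2, here’s a minimal sketch of a downstream incremental transform in a Code Repository that consumes only the newly appended rows and appends its own output. The paths are placeholders, and the timestamp column stands in for the actual compute-heavy transforms / LLM calls:

```python
from pyspark.sql import functions as F
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    master=Input("/path/to/master_dataset"),
    enriched=Output("/path/to/enriched_dataset"),
)
def compute(master, enriched):
    # In incremental mode the default read mode is "added": only the rows
    # appended to the master since the last successful run.
    new_rows = master.dataframe()

    # Placeholder for the business transforms / ML inference.
    result = new_rows.withColumn("processed_at", F.current_timestamp())

    # Append (modify) rather than replace, so the chain stays incremental.
    enriched.set_mode("modify")
    enriched.write_dataframe(result)
```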

Does this align with what you meant? Anything I’m missing?

Thanks again!

Correct.

One note on:

Snapshot Replace is only needed for specific cases like exporting clean data to external systems that can’t handle duplicates — not for Ontology or internal Foundry consumption.

You should be able to export from a View, too. So technically, this might not even be required if the data is already in an exportable shape and format.