How to edit or remove rows from a dataset in incremental pipelines without breaking incrementality

1. Architecture Overview

  • Incremental Pipelines: A set of pipelines that run incrementally whenever an action is taken on the source media set.
  • Ontology Dataset Population: These incremental pipelines populate datasets that underlie ontology objects.

2. Issue

  • Some null values have been introduced into the dataset backing the ontology, causing the indexing process to fail.

3. Strict Requirements for the Fix

| No. | Requirement | Comments |
|-----|-------------|----------|
| 1 | Incrementality must be preserved | If the dataset is changed outside the pipeline, the next run might fall back to snapshot mode, which must be avoided. |
| 2 | No invocation of snapshot mode | Snapshot processing would reprocess/modify all data, including data that is already correctly extracted and must remain untouched. |
| 3 | No new build/release | The fix must be applied directly in production; deploying new code or a release is not an option for this fix. |
| 4 | Direct removal of problematic rows only | The solution should target and remove only the rows containing the null values responsible for the indexing failure, without impacting otherwise valid data. |
| 5 | Data already extracted must NOT be modified | The primary concern is to avoid any process (such as a rebuild or snapshot) that could alter already-extracted data, as this may cause data corruption, loss, or inconsistencies. |

4. Main Goal

Do not modify or disturb data that has already been correctly extracted through the pipeline. The solution must surgically remove only the faulty rows (those with null values) from the production dataset, preventing any data reprocessing or overwriting of existing, valid extracted data.


5. Desired Solution

  • Direct Production Data Fix: Precisely and safely remove only the problematic (null-containing) rows from the production dataset—without:
    • Breaking or altering the incremental nature of the pipeline,
    • Triggering a full snapshot rebuild,
    • Or deploying/releasing new code.
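Before anything can be deleted, the faulty rows have to be mapped back to the physical files that hold them, since Foundry transactions operate on files, not rows. The sketch below illustrates that identification step in plain Python; the row schema, the `_source_file` tag, and the file names are all hypothetical stand-ins (in practice you would inspect the dataset's underlying Parquet files, e.g. with Spark):

```python
# Sketch of the identification step (hypothetical schema and file layout).
# Plain dicts stand in for rows, each tagged with the file it came from.

def find_corrupt_files(rows, required_fields):
    """Return the set of source files containing at least one row where
    any required field is None (the rows that break indexing)."""
    corrupt = set()
    for row in rows:
        if any(row.get(field) is None for field in required_fields):
            corrupt.add(row["_source_file"])
    return corrupt

rows = [
    {"_source_file": "part-0001.parquet", "id": 1, "title": "ok"},
    {"_source_file": "part-0002.parquet", "id": 2, "title": None},  # faulty row
    {"_source_file": "part-0003.parquet", "id": 3, "title": "ok"},
]

print(find_corrupt_files(rows, ["id", "title"]))  # {'part-0002.parquet'}
```

Note that deletion is file-granular: if a file mixes valid and null rows, removing it drops the valid rows too, so check file contents before deciding which files to target.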

If a bypass is possible

If there is any safe and supported method or workaround to surgically fix the dataset (by removing only the problematic rows) without putting the pipeline into snapshot mode and without any rebuild—that is the preferred solution.

Check out this thread:
https://community.palantir.com/t/deleting-files-in-historic-transactions-without-breaking-the-incremental/824/4

You will have to identify the transaction, and the files in that transaction, that contain your invalid data, then issue a DELETE transaction with special metadata that tricks Foundry into treating it as coming from the retention service.

In that DELETE transaction, delete only the files containing the invalid data.
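The shape of that DELETE transaction can be sketched as below. This is a hypothetical illustration only: the payload field names, the metadata key, and the retention-service marker are assumptions, since the "special metadata" the thread refers to is not publicly documented. Confirm the exact values with Palantir support or the linked thread before touching a production dataset.

```python
# Hypothetical sketch of the DELETE transaction body. The field names and
# especially the retention-service metadata are assumptions, not a
# documented API -- do not run this against production without confirming
# the real values.

def build_delete_transaction(dataset_rid, file_paths):
    """Build a request body for a DELETE transaction that removes only
    the named files; files in earlier transactions stay untouched."""
    return {
        "transactionType": "DELETE",
        # Placeholder: the thread mentions "special metadata" that makes
        # Foundry treat the deletion as retention-service-originated.
        # The real key/value is not public; do not guess it in production.
        "metadata": {"provenance": "<retention-service-marker>"},
        "filesToDelete": list(file_paths),
    }

body = build_delete_transaction(
    "ri.foundry.main.dataset.example",   # hypothetical dataset RID
    ["spark/part-0002.parquet"],         # files identified as holding invalid data
)
# The body would then be POSTed to the Catalog transaction endpoint with a
# valid auth token, e.g.:
# requests.post(f"{stack}/foundry-catalog/api/...", json=body, headers=auth)
```

Because a DELETE transaction only masks the listed files from the current dataset view, the already-extracted data in prior transactions is left physically untouched and the pipeline's incremental state is preserved.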
