1. Architecture Overview
- Incremental Pipelines: Sets of pipelines run incrementally whenever an action is taken on the source mediaset.
- Ontology Dataset Population: These incremental pipelines populate datasets that underlie ontology objects.
2. Issue
- Some null values have been introduced into the dataset backing the ontology, causing the indexing process to fail.
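To make the failure mode concrete, here is a minimal sketch (not the actual indexing service) of why a null key breaks index construction. The row shape, the `media_id` field, and `build_index` are assumptions for illustration only:

```python
def build_index(rows, key="media_id"):
    """Index rows by key; raises if the key value is null."""
    index = {}
    for row in rows:
        value = row.get(key)
        if value is None:
            # A null key has no valid index entry, so indexing aborts here.
            raise ValueError(f"cannot index row with null {key!r}: {row}")
        index[value] = row
    return index

rows = [
    {"media_id": "m-001", "title": "clip-a"},
    {"media_id": None,    "title": "clip-b"},  # faulty row introduced upstream
    {"media_id": "m-002", "title": "clip-c"},
]

try:
    build_index(rows)
except ValueError as err:
    print(err)  # the null key is what makes the indexing process fail
```

One bad row is enough: the index build is all-or-nothing, which is why removing only the null rows unblocks indexing without touching valid data.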
3. Strict Requirements for the Fix
| No. | Requirement | Comments |
|-----|-------------|----------|
| 1 | Incrementality must be preserved | If you change the dataset outside the pipeline, the next run might trigger snapshot mode, which must be avoided. |
| 2 | No invocation of snapshot mode | Snapshot processing would reprocess/modify all data, including data that is already correctly extracted and must remain untouched. |
| 3 | No new build/release | The fix must be applied directly in production; deploying new code or a release is not an option for this fix. |
| 4 | Direct removal of problematic rows only | The solution should target and remove only the rows containing the null values responsible for the indexing failure. Do not impact otherwise valid data. |
| 5 | Data already extracted must NOT be modified | The primary concern is to avoid any process (like a rebuild or snapshot) that could alter already-extracted data, as this may cause data corruption, loss, or inconsistencies. |
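Requirements 4 and 5 together describe a pure filter: drop only rows whose required key is null, and pass every valid row through unmodified. A minimal sketch, assuming a `media_id` key and a simple row shape (both hypothetical):

```python
def remove_null_rows(rows, required_keys=("media_id",)):
    """Return only the rows whose required keys are all non-null."""
    return [
        row for row in rows
        if all(row.get(key) is not None for key in required_keys)
    ]

rows = [
    {"media_id": "m-001", "title": "clip-a"},
    {"media_id": None,    "title": "clip-b"},  # the faulty row
    {"media_id": "m-002", "title": "clip-c"},
]

cleaned = remove_null_rows(rows)
assert len(cleaned) == 2
# Valid rows pass through untouched (requirement 5).
assert cleaned == [rows[0], rows[2]]
```

Because the operation is a filter rather than a rewrite, it cannot alter the content of any surviving row, which is exactly the guarantee requirement 5 demands.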
4. Main Goal
Do not modify or disturb data that has already been correctly extracted through the pipeline. The solution must surgically remove only the faulty rows (those containing null values) from the production dataset, without any reprocessing or overwriting of existing, valid extracted data.
5. Desired Solution
- Direct Production Data Fix: Precisely and safely remove only the problematic (null-containing) rows from the production dataset, without:
  - Breaking or altering the incremental nature of the pipeline,
  - Triggering a full snapshot rebuild,
  - Or deploying/releasing new code.
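For a direct production fix, a dry-run guard is a prudent safety step: count the faulty rows first and abort unless the count matches what inspection found, so the delete can never touch more than the known-bad rows. The function and field names below are illustrative assumptions, not a real platform API:

```python
def plan_surgical_delete(rows, required_key, expected_faulty):
    """Split rows into (kept, faulty); abort on an unexpected faulty count."""
    faulty = [row for row in rows if row.get(required_key) is None]
    if len(faulty) != expected_faulty:
        # Refuse to proceed: the dataset differs from what was inspected.
        raise RuntimeError(
            f"expected {expected_faulty} faulty rows, "
            f"found {len(faulty)}; aborting"
        )
    kept = [row for row in rows if row.get(required_key) is not None]
    return kept, faulty

rows = [
    {"media_id": "m-001"},
    {"media_id": None},     # the one known-bad row
    {"media_id": "m-002"},
]

kept, faulty = plan_surgical_delete(rows, "media_id", expected_faulty=1)
assert kept == [rows[0], rows[2]]   # valid data untouched
assert faulty == [rows[1]]          # only the null row is removed
```

Running the count as a read-only step before the delete keeps the fix auditable: the exact rows to be removed are known, and any drift in the dataset since inspection aborts the operation instead of deleting blindly.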
6. If a Bypass Is Possible
If there is a safe, supported method or workaround to surgically fix the dataset (removing only the problematic rows) without putting the pipeline into snapshot mode and without any rebuild, that is the preferred solution.