Indexing object types backed by large incremental datasets with duplicate primary keys

Hello,

We are having trouble indexing large Object Types that are backed by frequently updating incremental datasets.

The incremental datasets contain duplicate primary keys because updated rows may be re-ingested from our data sources, which makes backing our object types directly with the incremental datasets impossible.

Currently our only working solution is to deduplicate the incremental datasets, which ultimately forces us to back our object types with snapshot datasets that take multiple hours to index.

We also tried converting the pipeline to streaming and indexing as a Streaming object type, which allows duplicate PKs in the backing datasource, but in that case the first index job takes multiple weeks, which is not acceptable.

Has anyone found a solution to this type of issue?

Thanks!

It is possible to incrementally sync a dataset to an Object. See
https://www.palantir.com/docs/foundry/object-indexing/funnel-batch-pipelines/#incremental-indexing-of-incremental-datasets

In short, the primary keys need to be unique per transaction. So you will need to deduplicate within each transaction (for each run of the build), not across all the existing data.

A new row (in the current transaction) that has the same key as an existing row (created by a previous transaction) will update that existing row.
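
For reference, here is a minimal sketch of per-transaction deduplication in a PySpark incremental transform, assuming Foundry's transforms API. The dataset paths and the `primary_key` / `updated_at` columns are hypothetical placeholders for your own schema.

```python
from pyspark.sql import functions as F, Window
from transforms.api import transform_df, incremental, Input, Output


@incremental()
@transform_df(
    Output("/Project/datasets/orders_deduped"),   # hypothetical output path
    source_df=Input("/Project/datasets/orders"),  # hypothetical incremental input
)
def deduplicate_per_transaction(source_df):
    # With @incremental, source_df contains only the rows added since the
    # previous build, so the deduplication below applies within the current
    # transaction only. Keep the most recent row per primary key, assuming
    # an `updated_at` column orders the updates.
    w = Window.partitionBy("primary_key").orderBy(F.col("updated_at").desc())
    return (
        source_df
        .withColumn("_rn", F.row_number().over(w))
        .filter(F.col("_rn") == 1)
        .drop("_rn")
    )
```

Each run then appends a batch of rows whose keys are unique within that transaction, which is what the incremental indexing path expects.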

Hello,

That looks feasible in our case. We will try this.

Thank you!