Update rows using SQL in Code Repository

Dear community,

I’m working with customer order data in Foundry. New orders generate new rows, while changes in order status update existing rows. Both cases modify the updated_at column, which I’m using as a reference for incremental extraction.

The incremental extraction runs every 15 minutes, bringing both new and updated orders to the Code Repository. The Code Repository is marked with @incremental, so both types of orders are appended to the output dataset.

To deduplicate them, I need to load the output dataset, union both the new and the existing datasets, sort it by updated_at, and use dropDuplicates on the id column. This process is taking almost 5 minutes to complete.

Given that the number of orders is relatively small compared to the full dataset, I believe it could be much faster if I could just update the necessary rows, using operations similar to SQL UPDATE or INSERT. Is it possible to do this within the Code Repository? If not, is there a workaround, such as saving these records in a separate dataframe and using some kind of API calls to perform SQL-like updates?

I appreciate any guidance or suggestions on this matter.

Thank you for your assistance.

Best regards,

The best way to achieve performant “keep latest record for each ID” behavior in cases like this is to create a View with a primary key configured on the incrementally-appended dataset.

1 Like

Thank you, Sandpiper. I’ll give it a try and update the thread with the results for future reference.

It worked perfectly. This is the deduplication solution we’ve been looking for.

One small tip for those implementing it: since it deduplicates using descending order on the reference column, we encountered a specific case where we needed ascending order instead. To achieve this, we created a new column with the negative value of the reference column (in case it’s a date, convert it to a Unix timestamp first)