Update rows using SQL in Code Repository

TieCrowdfarming · October 14, 2024, 12:28pm

Dear community,

I’m working with customer order data in Foundry. New orders generate new rows, while changes in order status update existing rows. Both cases modify the updated_at column, which I’m using as a reference for incremental extraction.

The incremental extraction runs every 15 minutes, bringing both new and updated orders to the Code Repository. The Code Repository is marked with @incremental, so both types of orders are appended to the output dataset.

To deduplicate them, I need to load the output dataset, union both the new and the existing datasets, sort it by updated_at, and use dropDuplicates on the id column. This process is taking almost 5 minutes to complete.

Given that the number of orders is relatively small compared to the full dataset, I believe it could be much faster if I could just update the necessary rows, using operations similar to SQL UPDATE or INSERT. Is it possible to do this within the Code Repository? If not, is there a workaround, such as saving these records in a separate dataframe and using some kind of API calls to perform SQL-like updates?

I appreciate any guidance or suggestions on this matter.

Thank you for your assistance.

Best regards,

sandpiper · October 15, 2024, 4:29am

The best way to achieve performant “keep latest record for each ID” behavior in cases like this is to create a View with a primary key configured on the incrementally-appended dataset.

TieCrowdfarming · October 15, 2024, 8:49am

Thank you, Sandpiper. I’ll give it a try and update the thread with the results for future reference.

TieCrowdfarming · October 30, 2024, 9:28am

It worked perfectly. This is the deduplication solution we’ve been looking for.

One small tip for those implementing it: since it deduplicates using descending order on the reference column, we encountered a specific case where we needed ascending order instead. To achieve this, we created a new column with the negative value of the reference column (in case it’s a date, convert it to a Unix timestamp first)