Dear community,
I’m working with customer order data in Foundry. New orders generate new rows, while changes in order status update existing rows. Both cases modify the updated_at column, which I’m using as a reference for incremental extraction.
The incremental extraction runs every 15 minutes and brings both new and updated orders into a transform in the Code Repository. The transform is decorated with @incremental, so both kinds of rows are appended to the output dataset, which means an updated order ends up with more than one row for the same id.
To deduplicate them, I load the existing output dataset, union it with the newly extracted rows, sort the result by updated_at, and call dropDuplicates on the id column so that only the latest version of each order remains. This step takes almost 5 minutes per run.
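To make the intended result concrete, here is a minimal plain-Python sketch of the "latest row per id" semantics I'm after. It is not my actual PySpark transform; the function name upsert_latest, the status field, and the sample rows are made up for illustration.

```python
from datetime import datetime


def upsert_latest(existing, incoming):
    """Merge rows keyed by id, keeping the row with the greatest updated_at."""
    latest = {}
    # Treat existing and incoming rows the same way: the newest row per id wins.
    for row in list(existing) + list(incoming):
        current = latest.get(row["id"])
        if current is None or row["updated_at"] >= current["updated_at"]:
            latest[row["id"]] = row
    # Sort by id only to make the result easy to inspect.
    return sorted(latest.values(), key=lambda r: r["id"])


# Hypothetical sample data: order 2 changes status, order 3 is new.
existing = [
    {"id": 1, "status": "created", "updated_at": datetime(2024, 1, 1, 10, 0)},
    {"id": 2, "status": "created", "updated_at": datetime(2024, 1, 1, 10, 5)},
]
incoming = [
    {"id": 2, "status": "shipped", "updated_at": datetime(2024, 1, 1, 10, 20)},
    {"id": 3, "status": "created", "updated_at": datetime(2024, 1, 1, 10, 25)},
]

merged = upsert_latest(existing, incoming)
```

In this sketch only the two incoming rows need any real work, which is why I would expect an in-place update to be much cheaper than reprocessing the whole dataset on every run.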
Since the number of new or changed orders in each run is small compared to the full dataset, I believe it would be much faster to update only the affected rows, using something similar to SQL UPDATE or INSERT. Is this possible within the Code Repository? If not, is there a workaround, such as saving these records in a separate dataframe and using some kind of API call to perform SQL-like updates?
I appreciate any guidance or suggestions on this matter.
Thank you for your assistance.
Best regards,