We have a very large incremental input that builds every 10 minutes (adding ~100K rows per build).
The implementation, as it can be done in Code Repositories, is as follows:
Read past_output with output.dataframe("previous"). Filter out rows whose Timestamp is more than 30 days old. Append the incremental rows from the input to what remains. Set the output mode to replace. Done.
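To make the logic concrete, here is a minimal plain-Python sketch of the rolling-window step described above. It is not Foundry code: roll_window, the sample rows, and the id column are hypothetical names used only for illustration, and plain lists stand in for the previous output and the incremental input.

```python
from datetime import datetime, timedelta


def roll_window(previous_rows, new_rows, now, window_days=30):
    """Return the rows that would replace the output on this build."""
    cutoff = now - timedelta(days=window_days)
    # Keep only previous output rows that are still inside the window.
    kept = [row for row in previous_rows if row["Timestamp"] >= cutoff]
    # Append the rows that arrived since the last build.
    return kept + list(new_rows)


# Hypothetical sample data standing in for the two datasets.
now = datetime(2024, 3, 1, 12, 0)
previous_output = [
    {"id": 1, "Timestamp": datetime(2024, 1, 15)},  # older than 30 days
    {"id": 2, "Timestamp": datetime(2024, 2, 20)},  # still in the window
]
incremental_rows = [{"id": 3, "Timestamp": datetime(2024, 3, 1, 11, 55)}]

result = roll_window(previous_output, incremental_rows, now)
print([row["id"] for row in result])
```

The returned list corresponds to what gets written with the output mode set to replace, so the output never holds more than roughly one month of rows.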
We want to do this in Pipeline Builder.
The suggestion I received was to use an intermediate incremental output dataset with a retention policy (to keep its size minimal), and then combine the incremental input dataset with that intermediate output to get the past month of output.
Let me know if anyone has any other ideas!