Low-latency (<10 min) method of keeping only the latest month of data (by Timestamp column) on an extremely large incremental input dataset using Pipeline Builder

weilinh · August 27, 2024, 7:05pm

We have a very large incremental input that builds every 10 minutes (adding ~100K rows).

The implementation, as it can be done in Code Repositories, is as follows:
Read the past output with output.dataframe("previous"). Filter out rows with Timestamp older than 30 days ago. Append what remains to the incremental rows from the input. Set the output mode to replace. Done.
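
To make the steps above concrete, here is a minimal plain-Python sketch of just the rolling-window logic. It is not Foundry code: `roll_window`, the list-of-dicts row format, and the `id` field are hypothetical names for illustration only. In an actual transform, the previous rows would come from output.dataframe("previous") and the result would be written with the output mode set to replace, as described above.

```python
from datetime import datetime, timedelta


def roll_window(previous_rows, new_rows, now, days=30):
    """Keep previous rows no older than `days`, then append the new rows.

    Hypothetical helper: rows are dicts with a "Timestamp" datetime value.
    """
    cutoff = now - timedelta(days=days)
    # Drop previously written rows whose Timestamp is before the cutoff.
    kept = [row for row in previous_rows if row["Timestamp"] >= cutoff]
    # Append the newly arrived incremental rows. The combined result is what
    # would replace the old output on each build.
    return kept + list(new_rows)


if __name__ == "__main__":
    now = datetime(2024, 8, 27, 19, 5)
    previous = [
        {"id": 1, "Timestamp": now - timedelta(days=45)},  # too old, dropped
        {"id": 2, "Timestamp": now - timedelta(days=3)},   # still in window
    ]
    incoming = [{"id": 3, "Timestamp": now}]
    print([row["id"] for row in roll_window(previous, incoming, now)])  # [2, 3]
```

Note that the whole retained month is rewritten on every build, which is why the size of the previous output matters for hitting the 10-minute latency target.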

We want to do this in Pipeline Builder.

The suggestion I received was to use an intermediate incremental output dataset with a retention policy (to keep its size minimal), then union the incremental input dataset with that intermediate output to get the past month of output.

Let me know if anyone has any other ideas!