What options are there for creating a rolling window dataset?
In my scenario,
I have an incremental pipeline that reads from a snapshot dataset updated daily; only the current day's data is written to the incremental dataset. After 365 days I will have one year's worth of data. Rows with a timestamp older than one year should no longer be in the incremental dataset.
The easiest option would likely be a retention policy that deletes transactions older than 365 days.
If you prefer not to use retention, you could instead read the output back in, filter out data older than 365 days, and write it back via a replace transaction. This Stack Overflow post should have some details.
As a follow-up question, how is this data used? Are there downstream output datasets that would also need to respect the window? If so, you could use data lifetime to clear old transactions downstream as well.
from pyspark.sql import functions as F

history_df = history.dataframe('previous', schema)
# Keep only the last 365 days of history (assumes a 'timestamp' column)
history_df = history_df.filter(F.col('timestamp') >= F.date_sub(F.current_date(), 365))
final_df = input_df.unionByName(history_df)
history.set_mode('replace')
history.write_dataframe(final_df)
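Outside of Foundry, the cutoff logic itself can be sketched with plain Python datetime arithmetic. This is just a minimal illustration of the rolling-window filter; the `ts` column name and `within_window` helper are hypothetical, not part of any API:

```python
from datetime import datetime, timedelta, timezone

def within_window(rows, now=None, days=365):
    # Keep only rows whose 'ts' timestamp (hypothetical column) falls
    # within the last `days` days relative to `now`.
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [r for r in rows if r["ts"] >= cutoff]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "ts": datetime(2024, 5, 31, tzinfo=timezone.utc)},  # recent, kept
    {"id": 2, "ts": datetime(2022, 1, 1, tzinfo=timezone.utc)},   # older than a year, dropped
]
kept = within_window(rows, now=now)
```

The same comparison against a computed cutoff date is what the Spark filter above performs, just expressed column-wise instead of row-wise.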