How can I start an incremental pipeline from a specific transaction?

Hello, I have a very large historical dataset. I want to apply some transformations and create a new dataset derived from the first one. However, since the dataset is really large, I only want to consider the data from a specific date onwards and make the build incremental. I know how to make it incremental starting from today, but I want it to start from June 1st.

Is there a way to make an incremental build starting from a specific transaction of the input dataset? Alternatively, is there another way I can achieve what I want?

If you have a column in your dataset that holds the timestamp, can you filter your output to keep only the data from June 1st onwards, and then make all future transactions starting from today incremental?
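If it helps, here is a minimal sketch of that approach as a Foundry Python incremental transform. The dataset paths and the `event_ts` column name are hypothetical, and the June 1st cutoff assumes a year for illustration; the filter only runs on the initial snapshot, while later incremental runs just process the newly appended rows.

```python
from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform_df

CUTOFF = "2024-06-01"  # June 1st cutoff; year assumed for illustration


@incremental(semantic_version=1)
@transform_df(
    Output("/Project/datasets/derived_dataset"),      # hypothetical output path
    source=Input("/Project/datasets/large_history"),  # hypothetical input path
)
def compute(ctx, source):
    if not ctx.is_incremental:
        # Initial snapshot: drop everything before the cutoff date.
        # `event_ts` is a placeholder for your timestamp column.
        source = source.filter(F.col("event_ts") >= F.lit(CUTOFF))
    # On incremental runs the input only contains newly appended rows,
    # so they are processed as-is.
    return source
```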

What would be the benefit of making a dataset incremental starting from a past date, since you already computed those rows in your snapshot transaction?

In the initial snapshot run of your new incremental transform, you can use the token provided in the context to list the dataset files. The Catalog API will return the transaction in which each file was added, along with its timestamp, so you can filter the files by timestamp and keep only those from transactions on or after your cutoff date. You then need to pass this list of files to a Spark read command, with each path prefixed by the hadoop_path, which you can also get from the filesystem call on the input.
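I can't spell out the exact Catalog API calls here, but here is a rough sketch of the file-based variant. It uses the `modified` timestamp returned by `filesystem().files()` as a stand-in for the transaction timestamp (an assumption, as is the millisecond epoch unit); the dataset paths are again hypothetical and the input is assumed to be Parquet.

```python
from pyspark.sql import functions as F
from transforms.api import Input, Output, incremental, transform

CUTOFF_MS = 1717200000000  # 2024-06-01 00:00:00 UTC in epoch millis (year assumed)


@incremental(semantic_version=1)
@transform(
    out=Output("/Project/datasets/derived_dataset"),     # hypothetical output path
    source=Input("/Project/datasets/large_history"),     # hypothetical input path
)
def compute(ctx, out, source):
    fs = source.filesystem()

    if not ctx.is_incremental:
        # Initial snapshot: list the input files and keep only those whose
        # `modified` timestamp falls on or after the cutoff. This uses the
        # file modification time as an approximation of the transaction time.
        listing = fs.files().filter(F.col("modified") >= CUTOFF_MS)
        paths = [
            "{}/{}".format(fs.hadoop_path, row.path)  # prefix with the dataset's hadoop_path
            for row in listing.select("path").collect()
        ]
        df = ctx.spark_session.read.parquet(*paths)
    else:
        # Incremental runs: read only the rows appended since the last build.
        df = source.dataframe()

    out.write_dataframe(df)
```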

If your input has a timestamp column but is not partitioned by it, I wouldn’t recommend filtering on that column, as the operation can be extremely expensive.