I’d like to have a cleaning build that only reads in the newest transaction’s worth of data, but I’m struggling to figure out how to do this. If I look at the Spark performance of my current job, it spends an awfully long time reading in the files:
Where are you transforming this data? In Pipeline Builder, or in transforms?
While the setup is a bit more involved, using Code Repositories / transforms will give you more control. The best way to grab only the new rows is to use the “added” read mode on the input (docs).
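As a rough sketch (dataset paths here are placeholders, and this assumes the Python transforms API with the `@incremental` decorator, where `dataframe("added")` returns only the rows appended since the last successful build of the output):

```python
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    out=Output("/Project/clean/my_clean_dataset"),  # placeholder path
    source=Input("/Project/raw/my_raw_dataset"),    # placeholder path
)
def compute(source, out):
    # "added" reads only the rows appended to the input since the last
    # time this output was successfully built.
    new_rows = source.dataframe("added")

    # ... cleaning logic goes here ...

    # Under @incremental, this appends to the existing output rather
    # than replacing it with a snapshot.
    out.write_dataframe(new_rows)
```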
Pipeline Builder is a bit easier to set up but does not allow you to read the previous output (docs). Make sure you have “incremental” selected on the input:
One debugging step, if you’re already using transforms, would be to log the files present in your transaction to confirm you’re only reading the latest files.
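Something like the sketch below (placeholder paths again; this assumes that in an incremental transform the input’s `filesystem()` is scoped to the files not yet processed):

```python
import logging

from transforms.api import transform, incremental, Input, Output

log = logging.getLogger(__name__)


@incremental()
@transform(
    out=Output("/Project/clean/my_clean_dataset"),  # placeholder path
    source=Input("/Project/raw/my_raw_dataset"),    # placeholder path
)
def compute(source, out):
    # List the files this build will actually see; in incremental mode
    # this should only cover files added since the last successful build.
    for f in source.filesystem().ls():
        log.info("input file: %s (%d bytes)", f.path, f.size)

    out.write_dataframe(source.dataframe("added"))
```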
Do you have a lot of files per transaction? In the ingest settings, compacting the input into a single file per transaction might help (if you don’t care about the file names).
Are there multiple transactions on the input dataset since your last build? “Added” will include everything since the last transaction on the output, not just the most recent transaction on the input.
Nope, typically only 2 files per txn (a data ingestion log and the parquet file with the data)
Hmm, there may be multiple transactions, but the downstream builds roughly right after the upstream finishes (we build the entire clean schedule right after the raw schedule, so sometimes the raw runs a few times before the whole clean finishes). At most there’d be four transactions, which still doesn’t account for the ~30k files it’s reading in right now.
Are you by any chance referring to the following UI element in the history tab for the transaction/build?
If so, you may be misinterpreting what this means: it simply tells you the latest dataset view that the build used, not the data that was actually “read in.”
A lot goes on in the gap between the start of the timeline and the first Spark task execution stages (query plan generation, etc.), so you can’t assume that time is spent just reading in the files. In fact, the actual reading of files should happen in the subsequent Spark tasks, unless you’re doing something custom with the filesystem API.