Why does my incremental Pipeline Builder pipeline take ~3 min to run with no new files?

I have a Pipeline Builder pipeline with one incremental input dataset and only incremental output datasets. When the incremental input contains no new files and I run/deploy the pipeline, the build still spends ~3 minutes in the “Running” state before aborting with “Job succeeded but no transaction was committed. The transaction may have been aborted because there were no new files to add.” The aborted transaction is expected since there are no new files to process, but why does the pipeline run for 3 minutes before aborting?

Hey, the extra ~3 minutes is likely overhead: waiting in the resource queue and initializing the Spark application before the build can determine whether new data is present in your inputs and whether a new output transaction is required.

Thanks @achung! The 3 min is after “Waiting for resources” / “Initializing” has completed, though.

Got it. There is also some Spark work being performed here: the build essentially runs the computations and then checks whether the output dataframe is empty. If it is, the output transaction is aborted.
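
To make that concrete, here’s a minimal, hypothetical PySpark sketch (not Pipeline Builder’s actual implementation, and the dataset names/schema are made up): the transform logic still has to be evaluated, and checking whether the output is empty is itself a Spark action, so some compute happens even when the input has no new files.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.master("local[*]").appName("empty-output-check").getOrCreate()

# Hypothetical stand-in for the incremental input: no new rows this run.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("value", StringType()),
])
new_rows = spark.createDataFrame([], schema)

# The transform logic still has to be evaluated before the build can decide
# whether there is anything to write.
output_df = new_rows.filter("value IS NOT NULL").select("id", "value")

# Checking emptiness is a Spark action: the plan has to execute (at least
# partially) before the answer is known. Only then can the output transaction
# be committed or aborted.
if len(output_df.take(1)) == 0:
    print("No rows produced -- the output transaction would be aborted")
else:
    output_df.show()

spark.stop()
```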

Usually this check is quick when there isn’t any new data, but if the pipeline also reads static datasets and/or the input has a large number of transactions, the build can run longer. The work being done here isn’t trivial, so my guess is that’s where the time is going.
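
For illustration, here’s a hedged sketch (dataset names, sizes, and the local SparkSession are all made up) of why a static dataset in the pipeline can add time even when the incremental side is empty: depending on the join strategy, Spark may still scan or broadcast the static side before it can conclude that the output is empty.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.master("local[*]").appName("static-join-overhead").getOrCreate()

# Hypothetical incremental input: nothing new this run.
incremental_schema = StructType([
    StructField("id", IntegerType()),
    StructField("event", StringType()),
])
new_events = spark.createDataFrame([], incremental_schema)

# Hypothetical static lookup dataset the pipeline also reads. In a real
# build this could be a large snapshot dataset with many files/transactions.
lookup = spark.createDataFrame(
    [(i, "category-{}".format(i % 10)) for i in range(100_000)],
    ["id", "category"],
)

# Even though the incremental side is empty, Spark still has to plan the
# join, and depending on the join strategy it may read or broadcast the
# static side before it knows the joined output is empty.
joined = new_events.join(lookup, on="id", how="inner")
print("rows produced:", joined.count())

spark.stop()
```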
