Build run time for pipeline builder has increased from 2 minutes to 2 hours with minimal changes

jojo · October 30, 2024, 7:44pm

Hi,

We are trying to run a build in Pipeline Builder, but it is taking far longer than expected (the run time went from 2 minutes to 2 hours). It looks like it is still running.

I see that the pipeline has changed since the last successful run, but the change is not drastic: it adds one more dataset of only around 16MB. This dataset is only being unioned, with some simple filtering applied.

Any ideas on why the build time would change so much with so few changes?

helenq · October 30, 2024, 7:47pm

Hey @jojo, is this a snapshot or incremental pipeline? (Any chance it accidentally snapshotted when it was incremental previously?)

And just to confirm, there are no joins or aggregates or anything else that should be changing the row count other than the union? What build profile are you on?

jojo · October 31, 2024, 8:36am

Snapshot
Medium Build Profile
There are no joins or aggregates. The biggest thing is the union. The new dataset that is being added to the union is 16.6MB. It is being unioned to 2 other datasets that are approximately 2-3MB.
The transforms include: normalize column names, renaming a column, 2 case statements, 2 extract date parts, casting to a double, apply to multiple columns which casts columns to a string, regex replaces, and a filter.
After 15 hours it failed with the following message:

Module (i.e. driver) ran out of memory

The driver running the job ran out of memory while running your job. Common reasons include
-Broadcasting large datasets. If query plan for this job contains broadcast joins, consider removing them from your code (if manually applied) and disabling automatic broadcast join by applying the AUTO_BROADCAST_JOIN_DISABLED profile; or increasing driver memory.
-Using .collect() or other Spark actions that retrieve data to the driver.
-Doing computations locally on the driver, using for example Pandas
-Having a large number of tasks

jojo · October 31, 2024, 9:13am

I just tried two things to try and debug:

- Removed the new dataset from the union, but it is still building
- Switched to the Small build profile, but it was still running after 10 minutes

From these tests, it doesn’t seem like there is something wrong with the dataset or the build profile.

jojo · October 31, 2024, 10:30am

OK, after going through the pipeline one transform node at a time, I found that the filtering step seems to be the culprit. Not sure why it is so intensive.

helenq · October 31, 2024, 11:55am

Are the build profiles natively accelerated, or are you using the default small and medium ones?

jojo · October 31, 2024, 1:00pm

Not using native acceleration

cpottiez · November 5, 2024, 5:43am

Could you please share the physical plans of the 2-minute and the 2-hour runs?
Please remove any sensitive information.
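
For readers comparing plans in a code-based Spark environment rather than in Pipeline Builder itself, here is a minimal sketch of capturing a physical plan as text so two runs can be diffed. It assumes a PySpark DataFrame (Spark 3.0 or later, where `DataFrame.explain` accepts a `mode` argument); the helper name `capture_physical_plan` is hypothetical, and how Pipeline Builder exposes its plans is not covered here.

```python
import contextlib
import io


def capture_physical_plan(df, mode="formatted"):
    """Return the text that df.explain(mode=...) prints, as a string.

    `df` is assumed to be a PySpark DataFrame. explain() prints the plan
    to stdout, so stdout is redirected into an in-memory buffer.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain(mode=mode)
    return buffer.getvalue()


# Usage (hypothetical DataFrames for the fast and the slow pipeline version):
# fast_plan = capture_physical_plan(fast_df)
# slow_plan = capture_physical_plan(slow_df)
```

Saving both strings to files and diffing them is usually enough to spot an unexpected operator or a much larger plan.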

helenq · November 5, 2024, 6:15pm

Iterated with @jojo, and from the stack trace it looked like the build was doing constraint propagation. Turning off constraint propagation fixed the build time.
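
In Pipeline Builder the fix was applied through the pipeline's settings. For anyone hitting the same symptom in a code-based Spark job, a rough equivalent is sketched below, assuming you have a `SparkSession` you are allowed to configure. `spark.sql.constraintPropagation.enabled` is an internal Spark SQL setting; the helper name `disable_constraint_propagation` is hypothetical, and whether your platform lets you set this directly is an assumption.

```python
# Internal Spark SQL setting that controls whether the optimizer infers and
# propagates constraints (e.g. from filters) through the query plan.
CONSTRAINT_PROPAGATION_KEY = "spark.sql.constraintPropagation.enabled"


def disable_constraint_propagation(spark):
    """Turn off constraint propagation on a SparkSession-like object.

    `spark` is assumed to expose the usual `spark.conf.set` / `spark.conf.get`
    runtime configuration interface. Returns the value read back afterwards.
    """
    spark.conf.set(CONSTRAINT_PROPAGATION_KEY, "false")
    return spark.conf.get(CONSTRAINT_PROPAGATION_KEY)


# Usage (assumes an existing SparkSession named `spark`):
# disable_constraint_propagation(spark)
```

Disabling it trades away some optimizer inference for a shorter planning phase, so it is worth re-enabling if it turns out not to be the bottleneck.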