We are trying to run a build in Pipeline Builder, but it is taking unexpectedly long (the build time went from 2 minutes to 2 hours). It looks like it is still running.
I see that the pipeline has changed since the last successful run, but the change is not drastic: it adds another dataset that is only around 16MB. This dataset is being unioned, and only some simple filtering is applied.
Any ideas on why the build time would change so much with so few changes?
Hey @jojo, is this a snapshot or an incremental pipeline? Any chance it accidentally ran as a snapshot when it was incremental previously?
And just to confirm, there are no joins or aggregates or anything else that should be changing the row count other than the union? What build profile are you on?
There are no joins or aggregates. The biggest thing is the union. The new dataset that is being added to the union is 16.6MB. It is being unioned to 2 other datasets that are approximately 2-3MB.
The transforms include: normalize column names, renaming a column, 2 case statements, 2 extract date parts, casting to a double, apply to multiple columns which casts columns to a string, regex replaces, and a filter.
After 15 hours it failed with the following message:
Module (i.e. driver) ran out of memory
The driver running the job ran out of memory while running your job. Common reasons include
-Broadcasting large datasets. If query plan for this job contains broadcast joins, consider removing them from your code (if manually applied) and disabling automatic broadcast join by applying the AUTO_BROADCAST_JOIN_DISABLED profile; or increasing driver memory.
-Using .collect() or other Spark actions that retrieve data to the driver.
-Doing computations locally on the driver, using for example Pandas
-Having a large number of tasks
OK, after going through the pipeline one transform node at a time, I found that the filtering step seems to be the culprit. Not sure why it is so intensive.
Iterated with @jojo, and from the stack trace it looked like the job was spending its time on constraint propagation. Turning off constraint propagation (screenshot below) fixed the build time.
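For anyone who hits the same symptom in plain Spark code rather than Pipeline Builder, here is a minimal sketch of the equivalent switch. It assumes the internal Spark SQL setting `spark.sql.constraintPropagation.enabled`, which as far as I know controls the optimizer step that infers extra filter constraints and can become very expensive on the driver for some plans. The helper name `disable_constraint_propagation` is made up for illustration, and this is not necessarily what the Pipeline Builder toggle does under the hood.

```python
# Assumption: plain Apache Spark, outside Pipeline Builder. The key below is an
# internal Spark SQL setting, so check that your Spark version still honors it.
CONSTRAINT_PROPAGATION_KEY = "spark.sql.constraintPropagation.enabled"


def disable_constraint_propagation(spark):
    """Turn off constraint propagation for this session; return the stored value.

    `spark` is expected to expose `conf.set` / `conf.get` the way a
    pyspark.sql.SparkSession does. (Hypothetical helper, not a Foundry API.)
    """
    spark.conf.set(CONSTRAINT_PROPAGATION_KEY, "false")
    return spark.conf.get(CONSTRAINT_PROPAGATION_KEY)


# Usage with a real session (requires pyspark to be installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   disable_constraint_propagation(spark)
```

Set it before building the DataFrame with the case statements and the filter, since the setting only affects plans that are optimized afterwards.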