I noticed a large increase in build times (from ~7 min to >30 min) in a Pipeline Builder pipeline after moving some logic to a UDF. Does the UDF inherit the Spark profile from the Pipeline Builder pipeline? What else could be causing the increased build times?
There’s significant overhead to using UDFs in your Builder pipeline as compared to using the built-in functions. An increase from 7 to 30 minutes seems drastic but may currently be expected at scale.
I’d strongly recommend trying to leverage the existing set of functions. We’re continuously growing this set of capabilities and would be interested in identifying the gaps that are forcing you to use a UDF for this.
It isn’t a gap in functionality but in reusability. There was a request to move the logic into a UDF so that when it is edited or maintained, the changes apply to every Pipeline Builder pipeline that uses it.
Just to clarify, given the original question: does the UDF effectively “inherit” the Spark profile specified in the PB pipeline? Just sanity checking that these don’t run as a separate process with separate resources.
These UDFs actually do run in a separate process with entirely separate resources. No Spark profiles are being inherited and there’s currently not a way for users to control the resources in that separate environment.
The UDF runs in a sidecar on each executor, so as Spark scales the number of executors, each one is given its own sidecar to run the UDF. For Batch pipelines, we just released a feature that lets you tune the batch size sent to the UDF container when you deploy. This will likely improve performance quite a lot, since it minimizes network costs.
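To give a rough feel for why batch size matters here, below is a minimal back-of-the-envelope sketch in Python. It is not Pipeline Builder code and does not reflect how the sidecar is actually implemented. It simply assumes each batch sent to the UDF container pays a fixed round-trip cost and each row pays a fixed compute cost. The function name `estimated_udf_seconds` and all of the numbers are hypothetical, chosen only for illustration.

```python
import math


def estimated_udf_seconds(rows, batch_size, per_call_overhead_s, per_row_s):
    """Toy cost model (an assumption, not a measurement):
    every batch pays a fixed round-trip overhead, every row pays a compute cost."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    calls = math.ceil(rows / batch_size)  # number of round trips to the UDF container
    return calls * per_call_overhead_s + rows * per_row_s


# Hypothetical inputs: 1M rows, 2 ms overhead per round trip, 10 microseconds per row.
ROWS = 1_000_000
for batch_size in (1, 100, 10_000):
    seconds = estimated_udf_seconds(
        ROWS, batch_size, per_call_overhead_s=0.002, per_row_s=0.00001
    )
    print(f"batch_size={batch_size:>6}: ~{seconds:,.1f} s")
```

Under these assumptions the per-row compute cost is the same in every case, and the difference comes entirely from the number of round trips, which is the cost a larger batch size amortizes. Real numbers will depend on your data, the UDF itself, and the environment, so treat this only as intuition for which direction to tune.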