What's a good way to compare the performance of Pipeline Builder logic?

I'm working on a pipeline that involves an aggregation and a UDF. I'm curious which order performs better, and how someone would go about testing that. Is it better to aggregate first so fewer rows pass through the UDF, or to run the UDF first so there are fewer aggregations to compute afterwards?
One idea that comes to mind is to translate it into a Jupyter notebook and do more traditional benchmarking, where the time to compute each of the two patterns is measured and compared. Just curious if there's a best practice that people are using with success.
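
To make it concrete, here's roughly the shape of the two patterns in PySpark. The column names, toy data, and the trivial normalize UDF are just placeholders for illustration, not my actual logic:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real input dataset
df = spark.createDataFrame(
    [("east", 120.0), ("east", 80.0), ("west", 50.0)],
    ["region", "amount"],
)

# Placeholder UDF -- the real one is more involved
normalize = F.udf(lambda x: x / 100.0 if x is not None else None, DoubleType())

# Pattern A: aggregate first, so fewer rows pass through the UDF
agg_then_udf = (
    df.groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
      .withColumn("total_normalized", normalize(F.col("total_amount")))
)

# Pattern B: UDF first, then aggregate the transformed values
udf_then_agg = (
    df.withColumn("amount_normalized", normalize(F.col("amount")))
      .groupBy("region")
      .agg(F.sum("amount_normalized").alias("total_normalized"))
)
```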

Hey! One thing to note is that UDFs run in a Jupyter notebook won't behave the same architecturally as UDFs defined in the Foundry platform, so you can't really expect to get accurate results from that test. Spark also optimizes your jobs, meaning it might re-order your operations based on what it deems most efficient.

The most accurate thing to do would be to create different branches in your pipeline, construct the two approaches you want to test, and deploy/build each branch on the same compute profile. You can compare their query plans to see whether they're actually doing the same thing under the hood. If they differ, you can then compare execution times and other metrics shown in the build dashboard to see which path is better suited to your use case.
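
For reference, one quick way to look at the plans is Spark's explain output. A minimal sketch, assuming the two approaches are held in two PySpark DataFrames named like in your example:

```python
# Print the parsed, analyzed, optimized, and physical plans for each approach.
# If Catalyst has reordered or collapsed the operations, it will show up here.
agg_then_udf.explain(True)
udf_then_agg.explain(True)
```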
