I have a question about the “UDF batch size” input in the Advanced configuration. Is this number independent of the # of executors, or is it on a per-executor basis? I’m hitting a Compute Module in my UDF and want to get a better idea of the load the pipeline is placing on the Compute Module.
The UDF batch size is independent of the number of executors. When a UDF is running in a pipeline, each executor has its own copy of a container that runs the UDF, so throughput (and therefore the load on your compute module) should scale roughly linearly with the number of executors, assuming your data is partitioned evenly.
The UDF batch size controls how many rows are sent to each container at a time when it is executing the UDF, and there is some performance overhead associated with each batch regardless of the number of rows. From that perspective, a larger batch size will increase both throughput and the load on your compute module, but the effect doesn’t scale linearly the way it does with the number of executors.
In general, you will get significant performance and throughput improvements with a batch size larger than 1 (assuming your compute module can handle the load), but batch sizes larger than the current default of 64 tend not to produce significant improvements. Within that range, it can take some experimentation to find the setting that is optimal for your specific use case.
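To build intuition for why batch size has diminishing returns while executors scale linearly, here is a rough back-of-the-envelope model in Python. The overhead and per-row cost numbers are made-up placeholders (your real values depend on your UDF and compute module), but the shape of the result holds: going from batch size 1 to 64 removes most of the per-batch overhead, and going further buys much less.

```python
import math

def batches_per_executor(total_rows, num_executors, batch_size):
    # Rows are split evenly across executors; each executor sends
    # ceil(rows_per_executor / batch_size) batches to its UDF container.
    rows_per_executor = math.ceil(total_rows / num_executors)
    return math.ceil(rows_per_executor / batch_size)

def estimated_wall_time_s(total_rows, num_executors, batch_size,
                          per_batch_overhead_s=0.05, per_row_s=0.001):
    # Executors run in parallel, so wall-clock time is driven by a
    # single executor: fixed overhead per batch + cost per row.
    # per_batch_overhead_s and per_row_s are illustrative guesses.
    rows_per_executor = math.ceil(total_rows / num_executors)
    n_batches = batches_per_executor(total_rows, num_executors, batch_size)
    return n_batches * per_batch_overhead_s + rows_per_executor * per_row_s

# 1M rows on 4 executors: overhead dominates at batch size 1,
# and returns diminish well before very large batch sizes.
for bs in (1, 64, 512):
    print(bs, round(estimated_wall_time_s(1_000_000, 4, bs), 1))
```

Doubling the executors halves `rows_per_executor` (and thus roughly halves the estimated time), which is the linear scaling described above; increasing the batch size only shrinks the overhead term, which is why it flattens out.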