Found that Spark pipelines in Foundry can be sped up with Velox (ref: https://www.palantir.com/docs/foundry/transforms-python-spark/velox-overview). However, when I tried it myself, it only increased the build duration.
Does anyone know how to properly estimate which pipeline is a good target for Velox acceleration?
We tested it on a few transforms last year, and didn't notice much of a change in most cases.
In the end we only moved one transform to this profile (a very large transform doing quite a lot of regex work), but even then the improvement was not dramatic.
Changing the configuration is pretty quick and easy (it's not like re-writing it in one of the Lightweight languages), so I'd suggest just trying it and seeing the result. Do test with more than one build and take an average, and don't forget to set the accompanying EXECUTOR_MEMORY_OFFHEAP_FRACTION_* option.
Thanks for the response! (Happy that somebody else has also tried Velox.)
Yeah, that's what I was doing, but the problem is that I have to wait for each build to finish, so the time spent analyzing a single dataset for Velox suitability was huge.
Generally, I tried counting the conversion operations (VeloxColumnarToRow and RowToVeloxColumnar) in the Spark physical plan, since they are computationally expensive, and then estimating the effect of applying Velox from that. It shows some results, but it's clear that a dataset's suitability doesn't depend only on this parameter. So I want to find something like a "rule of thumb" in the Velox-Spark world.
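For what it's worth, here's a minimal sketch of that counting heuristic. It assumes you've captured the physical plan as text (e.g. from `df.explain()` or the Spark UI); the helper name `count_conversions` is my own, not anything from Foundry or Velox:

```python
import re

# Operator names at the row/columnar boundary, which the heuristic treats
# as the expensive transitions worth counting.
CONVERSION_OPS = ("VeloxColumnarToRow", "RowToVeloxColumnar")

def count_conversions(plan_text: str) -> dict:
    """Count occurrences of each conversion operator in a physical-plan string."""
    return {op: len(re.findall(re.escape(op), plan_text)) for op in CONVERSION_OPS}

# Toy plan fragment; a real one would come from df.explain() output.
plan = """
VeloxColumnarToRow
+- ProjectExecTransformer
   +- RowToVeloxColumnar
      +- Scan parquet
"""
print(count_conversions(plan))  # {'VeloxColumnarToRow': 1, 'RowToVeloxColumnar': 1}
```

A plan dominated by native operators with few of these boundaries is presumably a better candidate than one that bounces between row and columnar form repeatedly, though as you say, the count alone doesn't decide it.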
Yes, sadly I can't give you a "rule of thumb" for when Velox makes sense or not.
At the moment we don't really consider it a major option in our pipelines. It's nice that in theory it's just a "drop-in" replacement, but we haven't found situations where it reliably makes a positive impact.