Great question. Short answer: I don’t think Pipeline Builder is ideal for this (although you could pull off some of it manually with UDFs and a transparent enough partition strategy). It’s case by case, and without knowing the details of the transform you want to run it’s hard to give a definitive answer. BUT when you’re dealing with datasets large enough to justify more control over partitioning and job execution in Code Repositories, there are two approaches worth considering:
1. Transforms Generator — split your build into multiple jobs from a single code repo
Rather than manually creating 20 Pipeline Builders or 20 separate transforms, you can use the Transforms Generator pattern in a Python code repository. This lets you programmatically define N transforms from a single piece of logic — for example, one transform per partition key, per date range, or per business segment. Each generated transform runs as its own independent job with its own Spark profile, so you get parallelism and independent retries without manually duplicating anything.
This is the cleanest approach when the split logic is well-defined (e.g., by region, by product line, by time window). You write one function, parameterize it, and the generator creates the individual output datasets that you then union downstream.
Docs: https://www.palantir.com/docs/foundry/transforms-python/pipelines#transform-generation
On memory and cost: each generated transform gets its own executor profile, so you can right-size memory per partition rather than over-provisioning a single massive job. You do pay the overhead of instantiating 20 separate jobs, but this is usually still more cost-efficient than one huge transform that needs enormous driver/executor memory to avoid OOMs.
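To make the pattern concrete, here’s a minimal sketch of the generator shape: one parameterized factory producing N transforms. The partition keys, names, and the plain-Python body are illustrative assumptions; in a real repo the factory would wrap the function with @transform_df(Output(...), source=Input(...)) from transforms.api and return a Spark DataFrame.

```python
# Sketch of the Transforms Generator pattern (illustrative, not a real
# Foundry transform). In a code repo, each generated function would be
# decorated with @transform_df / @configure and write its own output
# dataset, so each runs as an independent job with its own profile.

REGIONS = ["emea", "amer", "apac"]  # hypothetical partition keys

def make_transform(region):
    """Factory: one piece of logic, parameterized per partition key."""
    def compute(source_rows):
        # Stands in for the per-region Spark logic, e.g.
        # source.filter(F.col("region") == region)
        return [row for row in source_rows if row["region"] == region]
    # Each generated transform needs a unique name / output path.
    compute.__name__ = f"compute_{region}"
    return compute

# The generator: N independent transforms from a single function.
TRANSFORMS = [make_transform(r) for r in REGIONS]
```

Downstream you would union the N generated outputs back into one dataset; the split key just has to be stable so each row lands in exactly one output.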
2. Dynamic repartitioning — fix the root cause before splitting
Before splitting into 20 outputs, it’s worth asking: is the OOM happening because of the data volume itself, or because of a wide shuffle during joins/aggregations (wide operations in general)? In many cases the issue isn’t the row count but a skewed or poorly partitioned shuffle that concentrates data on a few executors.
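If skew is the culprit, one standard remedy is key salting: split a hot join/group key into N synthetic sub-keys so its rows spread across executors instead of piling onto one. The sketch below shows the idea with a deterministic row id; the bucket count and helper names are my own assumptions, and in Spark you would typically do this with column expressions (or lean on AQE’s skew handling) rather than per-row Python.

```python
# Hedged sketch of key salting for a skewed shuffle (illustrative names).
# A hot key like "US" becomes "US#0".."US#15", so aggregation work for
# that key is spread over up to 16 tasks; you aggregate per salted key
# first, then merge the partial results per original key.

def salt_key(key, row_id, buckets=16):
    """Append a deterministic bucket suffix derived from the row id."""
    return f"{key}#{row_id % buckets}"

def unsalt(salted_key):
    """Recover the original key before the final merge step."""
    return salted_key.rsplit("#", 1)[0]
```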
Things to try in a code repo:
- Repartition your data before heavy transformations using df.repartition(N, "partition_key") to distribute the load evenly across executors. Pick a key with good cardinality.
- Right-size your Spark profile — increase executor memory and number of executors rather than splitting the logical dataset. You can configure this per-transform in a code repo via the @configure decorator.
- Use incremental transforms if your data is append-heavy (and you don’t have to reprocess past transactions to compute the latest dataset view). Processing only new rows per build drastically reduces the volume per execution. Docs: https://www.palantir.com/docs/foundry/transforms-python/incremental-reference/
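On picking N for repartition: a common Spark rule of thumb is to aim for partitions around 128 MB, rounded to a multiple of your executor-core count so no core sits idle. Here’s a small hypothetical helper along those lines (the target size and core count are assumptions you’d tune per profile):

```python
import math

# Hypothetical helper for choosing a repartition count: target ~128 MB
# per partition, rounded up to a multiple of the available cores.

def partition_count(dataset_bytes, target_bytes=128 * 1024 * 1024, cores=8):
    n = max(1, math.ceil(dataset_bytes / target_bytes))
    return math.ceil(n / cores) * cores  # round up to a core multiple

# In the transform you would then call something like:
#   df.repartition(partition_count(estimated_size), "partition_key")
```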
About cost, I guess this could even beat the generator approach, in the case where a partitioning trick lets you compute all of your data in a single transform.
I hope this helps,
Nico from Sibyl