Single Pipeline Builder vs 20 Pipeline Builders

Hi!

I’m dealing with an extremely large dataset. Because the volume is too large to process into a single output with one transform, I plan to split it into multiple parts and union them later.

In this case, would it be better to generate 20 datasets within one single pipeline builder, or to create 20 separate pipeline builders? Please provide advice regarding memory usage, computing speed, and cost efficiency.

Great question. Short answer: I don’t think Pipeline Builder is ideal for this (although you could pull off something manual with UDFs if you have a transparent enough partition strategy). It’s case by case, and without knowing the details of the transform you want to run it’s hard to give a definitive answer. But when you’re dealing with datasets large enough to justify more control over partitioning and job execution in Code Repositories, there are two approaches worth considering:

1. Transforms Generator — split your build into multiple jobs from a single code repo

Rather than manually creating 20 Pipeline Builders or 20 separate transforms, you can use the Transforms Generator pattern in a Python code repository. This lets you programmatically define N transforms from a single piece of logic — for example, one transform per partition key, per date range, or per business segment. Each generated transform runs as its own independent job with its own Spark profile, so you get parallelism and independent retries without manually duplicating anything.

This is the cleanest approach when the split logic is well-defined (e.g., by region, by product line, by time window). You write one function, parameterize it, and the generator creates the individual output datasets that you then union downstream.

Docs: https://www.palantir.com/docs/foundry/transforms-python/pipelines#transform-generation
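As a rough sketch of the pattern (plain Python functions stand in for the transforms.api decorators here, and the segment names are hypothetical; see the docs above for the real decorator syntax), the key trick is binding the loop variable per generated transform:

```python
# Minimal sketch of the transforms-generator pattern. Plain functions stand in
# for @transform_df-decorated transforms; segment names and keys are made up.

SEGMENTS = ["emea", "amer", "apac"]  # one generated transform per segment

def generate_transforms(segments):
    transforms = []
    for segment in segments:
        # Bind `segment` as a default argument: Python closures are
        # late-binding, so without this every generated transform would
        # see the final value of the loop variable.
        def compute(rows, segment=segment):
            return [r for r in rows if r["segment"] == segment]
        compute.__name__ = f"compute_{segment}"  # unique name per output
        transforms.append(compute)
    return transforms

TRANSFORMS = generate_transforms(SEGMENTS)

rows = [{"segment": "emea", "v": 1}, {"segment": "apac", "v": 2}]
print([t.__name__ for t in TRANSFORMS])
print(TRANSFORMS[0](rows))  # only the "emea" rows
```

In a real repo each generated transform would declare its own Output path (e.g. parameterized by the segment) and get registered with the pipeline, and each one runs as an independent job.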

On memory and cost: each generated transform gets its own executor profile, so you can right-size memory per partition rather than over-provisioning a single massive job. You do pay the overhead of instantiating 20 separate jobs, but this is usually still more cost-efficient than running one huge transform that needs enormous driver/executor memory to avoid OOMs.

2. Dynamic repartitioning — fix the root cause before splitting

Before splitting into 20 outputs, it’s worth asking: is the OOM happening because of the data volume itself, or because of a wide shuffle during joins/aggregations (or wide-operations in general)? In many cases the issue isn’t the row count but a skewed or un-partitioned shuffle that concentrates data on a few executors.

Things to try in a code repo:

  • Repartition your data before heavy transformations using df.repartition(N, "partition_key") to distribute the load evenly across executors. Pick a key with good cardinality.

  • Right-size your Spark profile — increase executor memory and number of executors rather than splitting the logical dataset. You can configure this per-transform in a code repo via the @configure decorator.

  • Use incremental transforms if your data is append-heavy (and you don’t need to reprocess past transactions to compute the latest dataset view). Processing only new rows per build drastically reduces the volume per execution. Docs: https://www.palantir.com/docs/foundry/transforms-python/incremental-reference/
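To make the repartitioning point concrete, here is a small pure-Python illustration (not Spark itself, just the hash-partitioning idea behind df.repartition(N, "key"), with toy data) of why key cardinality matters:

```python
# Toy illustration of why key cardinality matters when repartitioning.
# Spark assigns rows to partitions roughly by hash(key) % N; a skewed or
# low-cardinality key concentrates rows on a few partitions (hot executors).
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count rows landing in each partition under hash partitioning."""
    return Counter(hash(k) % num_partitions for k in keys)

# Skewed key: 90% of rows share one value, so they all hash together.
skewed = ["US"] * 900 + ["FR"] * 50 + ["DE"] * 50
# High-cardinality key: unique ids spread evenly across partitions.
high_card = [f"order-{i}" for i in range(1000)]

print(max(partition_sizes(skewed, 8).values()))     # one hot partition, >= 900 rows
print(max(partition_sizes(high_card, 8).values()))  # roughly 1000/8 rows per partition
```

The same intuition applies at scale: repartitioning on a high-cardinality key before the wide operation spreads the shuffle evenly, which is often enough to make the single-transform version fit in memory.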

On cost, this could end up even better than the generator approach, if a partitioning trick lets you compute all of your data in a single transform.

I hope this helps,

Nico from Sibyl


Hey! How large is your dataset and what are you roughly trying to do with it? Depending on size + what transforms you’re trying, you could also try a larger compute profile (if no LLMs) before splitting up the datasets into multiple parts and union-ing them later.

To answer your original question: if you want to do this in Pipeline Builder, I would generally recommend your first idea, generating 20 datasets within one single pipeline, over 20 separate Pipeline Builder pipelines — especially if you share logic across these 20 paths, expect the outputs to all build at the same time, and expect all outputs to be snapshots. Also, have you tried this on faster pipeline types, or just batch?
