Single Pipeline Builder vs 20 Pipeline Builders

Hi!

I’m dealing with an extremely large dataset. Because the volume is too large to process in a single transform to create one output, I plan to split it into multiple parts and union them later.

In this case, would it be better to generate 20 datasets within one single pipeline builder, or to create 20 separate pipeline builders? Please provide advice regarding memory usage, computing speed, and cost efficiency.

Great question. Short answer: I don’t think Pipeline Builder is ideal for this (although you could pull off some manual control with UDFs if you have a transparent enough partition strategy). It’s case by case, and without knowing the details of the transform you want to run it’s tough to give a definitive answer. BUT when you’re dealing with datasets large enough to justify having more control over partitioning and job execution in Code Repositories, there are two approaches worth considering:

1. Transforms Generator — split your build into multiple jobs from a single code repo

Rather than manually creating 20 Pipeline Builders or 20 separate transforms, you can use the Transforms Generator pattern in a Python code repository. This lets you programmatically define N transforms from a single piece of logic — for example, one transform per partition key, per date range, or per business segment. Each generated transform runs as its own independent job with its own Spark profile, so you get parallelism and independent retries without manually duplicating anything.

This is the cleanest approach when the split logic is well-defined (e.g., by region, by product line, by time window). You write one function, parameterize it, and the generator creates the individual output datasets that you then union downstream.

Docs: https://www.palantir.com/docs/foundry/transforms-python/pipelines#transform-generation
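As a rough sketch of the generator pattern in plain Python (the segment names and output paths here are hypothetical; in a real Foundry repo the loop would build `transforms.api` Transform objects via `@transform_df` with `Input`/`Output` instead of these stand-ins):

```python
# Hypothetical sketch: one parameterized function, N generated transforms.
SEGMENTS = ["emea", "amer", "apac"]  # hypothetical split keys

def make_transform(segment):
    def compute(rows):
        # Stand-in for the per-segment Spark logic: keep only this segment.
        return [r for r in rows if r["segment"] == segment]
    # Each generated transform writes to its own output dataset.
    compute.output = f"/Project/split/out_{segment}"  # hypothetical path
    return compute

# Module-level generation: one independent job definition per segment key.
TRANSFORMS = [make_transform(s) for s in SEGMENTS]
```

The key point is that the loop runs at module import time, so each generated transform is registered as its own job with its own output and (in a real repo) its own Spark profile.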

On memory and cost: each generated transform gets its own executor profile, so you can right-size memory per partition rather than over-provisioning a single massive job. This is significantly more cost-efficient than running one huge transform that needs enormous driver/executor memory to avoid OOMs, although you do pay the overhead of instantiating 20 separate jobs.

2. Dynamic repartitioning — fix the root cause before splitting

Before splitting into 20 outputs, it’s worth asking: is the OOM happening because of the data volume itself, or because of a wide shuffle during joins/aggregations (or wide operations in general)? In many cases the issue isn’t the row count but a skewed or un-partitioned shuffle that concentrates data on a few executors.

Things to try in a code repo:

  • Repartition your data before heavy transformations using df.repartition(N, "partition_key") to distribute the load evenly across executors. Pick a key with good cardinality.

  • Right-size your Spark profile — increase executor memory and number of executors rather than splitting the logical dataset. You can configure this per-transform in a code repo via the @configure decorator.

  • Use incremental transforms if your data is append-heavy (and you don’t have to reprocess past transactions to compute the latest dataset view). Processing only new rows per build drastically reduces the volume per execution. Docs: https://www.palantir.com/docs/foundry/transforms-python/incremental-reference/
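To make the repartitioning point concrete: conceptually, `df.repartition(N, "partition_key")` hashes the key and assigns each row to one of N partitions, so a key with good cardinality spreads rows evenly while a skewed key concentrates them. A minimal pure-Python sketch of that assignment (not Spark itself, just the idea):

```python
from collections import Counter

def assign_partition(key, num_partitions):
    # What hash-partitioning does under the hood: same key -> same partition.
    return hash(key) % num_partitions

# A high-cardinality key spreads 10,000 rows across all 20 partitions...
keys = [f"order-{i}" for i in range(10_000)]
spread = Counter(assign_partition(k, 20) for k in keys)

# ...while a skewed key (one dominant value) lands every row on one executor.
skewed = Counter(assign_partition("same-customer", 20) for _ in range(10_000))
```

With the skewed key, one partition holds all 10,000 rows, which is exactly the pattern that OOMs a single executor even when the cluster as a whole has plenty of memory.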

On cost, this could work out even better than generators if there is a partitioning trick that lets you compute all of your data in a single transform.

I hope this helps,

Nico from Sibyl


Hey! How large is your dataset and what are you roughly trying to do with it? Depending on size + what transforms you’re trying, you could also try a larger compute profile (if no LLMs) before splitting up the datasets into multiple parts and union-ing them later.

To answer your original question: if you want to do this in Pipeline Builder, I would generally advise your first idea (generating 20 datasets within a single Pipeline Builder) over 20 separate Pipeline Builder files, especially if you are sharing logic across these 20 paths, expect the outputs to all build at the same time, and expect all outputs to be snapshots. Also, have you tried this on faster pipeline types, or just batch?


Hey!

I had a similar problem with a disgusting JSON workflow. Not sure if this is what you mean.

Background

The JSON was incredibly nested, and we were extracting our data into multiple tables from individual JSON files. I noticed that the individual pipelines would each go back and run against the same source dataset directly; they all had a common feed from the single JSON.

Solution

I agree with the others: repartitioning strategies, coalescing, filtering, etc. It’s all PySpark, so those strategies apply. In my case, I took advantage of the caching mechanism so all the downstream pipelines shared the common entry point, which sped up the rest of the downstream processing.
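In PySpark that shared entry point would typically be `df.cache()` or a materialized intermediate dataset; the same idea in plain Python, with a hypothetical payload and table names, is to parse the nested JSON once and fan out to several outputs from the one in-memory result:

```python
import json

# Hypothetical nested payload standing in for the raw JSON feed.
RAW = '{"orders": [{"id": 1, "total": 40}], "customers": [{"id": 9, "name": "Ada"}]}'

# Parse once: this is the cached common entry point every branch shares.
common = json.loads(RAW)

# Fan out to multiple output tables from the same in-memory result,
# instead of each downstream pipeline re-reading and re-parsing the raw file.
orders_table = common["orders"]
customers_table = common["customers"]
```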

Hope that helps if you have a similar problem where it might be one input data and many outputs.