What are the most important performance guidelines and anti-patterns to avoid when building PB pipelines for large datasets (between 100M-200M rows)?
Are there specific operations or patterns that are particularly problematic at scale?
Are there any rules of thumb for how much a pipeline might cost based on a few different inputs?
Any concrete examples of “never do this” scenarios would be especially helpful.
Assuming you’re talking about batch pipelines (not streaming): under the hood, Pipeline Builder’s backend runs Spark, so many of the best practices and common pitfalls from the broader Spark and data engineering world apply directly and will help you avoid performance issues. That said, Pipeline Builder handles a lot of this automatically, so you don’t need to worry about everything you would if you were writing Spark pipelines from scratch.
Some top-of-mind tips:
Transformations
Push Down Filters: Apply filters as early as possible to minimize data processed downstream.
Select Only Needed Columns: Use select to limit columns early.
Beware of Exploding Data: Operations like explode or cross joins can increase data size dramatically—use with caution and only when necessary.
Leverage Window Functions Carefully: Window operations can be expensive; partition your data appropriately and avoid unnecessary use.
Avoid Using distinct and dropDuplicates Unnecessarily: These operations trigger expensive shuffles. Use them only when truly required.
Repartition Wisely: Use repartition to increase or decrease the number of partitions for parallelism before wide transformations (like joins), and use coalesce to decrease partitions before writing to disk, so you don’t end up with too many small output files.
LLM Best Practices
Use caching for LLMs: Enable skip recompute rows on Use LLM nodes so rows aren’t recomputed each time a build runs.
Use Small or Medium profiles (rather than XLarge) for LLM builds to avoid rate limits
Test with a small subset of data: Start with a small subset of your data if you’re worried about scale. Note that the cache does not retain anything from a failing build.
Tip: If you’re running into OOMs with Use LLM builds, avoid large reshuffles after the Use LLM node where possible (e.g. aggregations, joins) and just output the LLM results. You can then do the aggregations/joins in a separate pipeline.
Happy to give more tips if there are things in particular you want to know or operations you think your users will definitely be doing.
For expensive operations, we sometimes provide an option to skip recomputing rows. If you expect some rows to remain largely unchanged between builds, you can skip re-running the expensive operation on those rows and instead read from a cache. This is particularly useful for LLM-based workflows.
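Conceptually, skip-recompute behaves like a cache keyed on row content: if a row’s bytes haven’t changed, the expensive operation isn’t re-run. A minimal plain-Python sketch of the idea (not Pipeline Builder’s actual implementation; `expensive_llm_call` is a hypothetical stand-in):

```python
import hashlib

cache = {}

def expensive_llm_call(text):
    # Hypothetical stand-in for a costly operation, e.g. an LLM request.
    return text.upper()

def process_row(text):
    # Key the cache on a hash of the row's content.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:  # only compute new or changed rows
        cache[key] = expensive_llm_call(text)
    return cache[key]

rows = ["hello", "world", "hello"]  # "hello" repeats, so it is computed once
results = [process_row(r) for r in rows]
print(results)  # ['HELLO', 'WORLD', 'HELLO']
```

On a rebuild, only rows whose content changed trigger the expensive call; everything else is a cache read.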
Previews also consume some compute. You can turn off automatic previews on pipelines with expensive operations; note, however, that this has no effect on build time.