What would be the most efficient way to handle conditional paths in pipeline builder?

Hi all – there are instances where I’d rather not have Pipeline Builder run through unnecessary transforms if I can determine up front whether a condition is met. So far, I filter to the rows where condition x is met and the rows where it isn’t, let each set branch off into a separate path of transforms, and then union (or join, where applicable) at the end so that I still end up with the most comprehensive set of data. But often, as in the case above (where, if condition x is met, there will be no rows for the ‘is condition y met’ branch), only one path actually needs to be computed: the one for which the condition is met.

I haven’t noticed a way to achieve this in Pipeline Builder, although it’s fairly simple to set up in PySpark code (I’d be glad for this to be pointed out to me if it’s there!). Does Pipeline Builder automatically handle this case with optimal efficiency (e.g. stop running a path once the source dataset has zero rows), or is there a best practice here, or some upcoming feature I should keep an eye on?
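For reference, the split/union workaround I described looks roughly like this in PySpark (the condition column and the two transform chains are just illustrative placeholders); the conditional skip I’m after would simply be an `if` around one of the branches:

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy source with a boolean condition column (stand-in for the real dataset).
source = spark.createDataFrame([(1, True), (2, False)], ["id", "x_condition"])

def transforms_when_met(df: DataFrame) -> DataFrame:
    # Stand-in for the chain of transforms applied when condition x holds.
    return df.withColumn("path", F.lit("x_met"))

def transforms_when_not_met(df: DataFrame) -> DataFrame:
    # Stand-in for the chain of transforms applied when condition x fails.
    return df.withColumn("path", F.lit("x_not_met"))

met = source.filter(F.col("x_condition"))
not_met = source.filter(~F.col("x_condition"))

# Union (or join, where applicable) the two paths back together at the end.
result = transforms_when_met(met).unionByName(transforms_when_not_met(not_met))
```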


The optimal solution right now is what you’re already doing.
It can be beneficial in some cases to add a checkpoint before the split (right click on the node and select checkpoint), as this ensures that everything upstream is only computed once. Otherwise Spark may recompute the full upstream work for each path.
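To illustrate why the checkpoint helps, a rough PySpark analogue (just an analogy, not how Pipeline Builder is implemented) is persisting the shared node before fanning out, so both branches read the materialised result instead of each recomputing the upstream lineage:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the expensive upstream part of the pipeline.
source = spark.createDataFrame([(1, True), (2, False)], ["id", "x_condition"])
upstream = source.withColumn("expensive", F.length(F.col("id").cast("string")))

# Persisting here (the rough code analogue of a checkpoint before the split)
# means the upstream work is evaluated once; without it, each branch below
# could recompute the full upstream lineage independently.
upstream = upstream.persist()

met = upstream.filter(F.col("x_condition"))
not_met = upstream.filter(~F.col("x_condition"))

result = met.unionByName(not_met)
result.count()       # both branches reuse the persisted upstream result
upstream.unpersist()
```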

As for deciding whether a path is worth running at all, I would probably just let Spark handle it. If Spark gets zero data for a query, it’s probably not going to be very expensive. My guess is that it’s better to let Spark do its own thing than to manually check whether there are zero rows.
That might not be fully true, and you’d have to run some benchmarks on the specific scenario, but I wouldn’t worry too much about it.
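If you do want to benchmark it, keep in mind that a driver-side emptiness check is itself a Spark action, so skipping an empty branch isn’t free. A minimal sketch of what that manual check would look like in PySpark (illustrative only, not Pipeline Builder behaviour):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
source = spark.createDataFrame([(1, True), (2, True)], ["id", "x_condition"])

not_met = source.filter(~F.col("x_condition"))

# The emptiness check triggers its own Spark job before any branch runs, so
# you pay something up front either way. Whether it beats simply running the
# branch over zero rows is exactly what you'd need to benchmark.
if not_met.isEmpty():        # DataFrame.isEmpty() needs Spark 3.3+
    print("skipping the 'condition not met' path")
else:
    not_met.count()          # placeholder for the real branch of transforms
```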

Can I ask why you’re asking this question? Have you seen instances of this being slower than expected?


Thanks @mtelling! I haven’t seen instances but was more curious – and wanted to adopt best practices.