Pipeline Builder unexpected row count behavior

Hey!
I’m seeing strange behavior in one of my pipelines. I have a fork in my transform path where I separate “valid” and “invalid” rows based on a boolean column. Before the fork, there are 1K rows. After the fork, “valid” has 952 rows and “invalid” has 56 rows (952 + 56 = 1008, not 1000).

Additional notes:

  • I first noticed this because the final output of the pipeline (the union of these forks) has more rows than expected. The numbers above were checked from the intermediate output datasets.
  • The is_valid column is produced by a Python UDF in the “parse response struct” transform.

Attaching screenshots showing the forked path and the filter block.



Hey @jshelby , thanks for your post!

Have you tried checkpointing the LLM node or the “parse response struct” transform node? Uncheckpointed, there’s no guarantee that each of your outputs is getting the same LLM results because each may run its own upstream. By checkpointing, you guarantee everything downstream (in the same job group) will get the same set of computed results.

Checkpointing is especially important when dealing with nondeterminism. In your case, if any part of the LLM result is being used to partition your rows, that could explain the unexpected row counts.
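To make this failure mode concrete, here’s a toy, plain-Python sketch (no Foundry or Spark APIs; all names, row counts, and the “nondeterminism” rule are made up) of what can happen when each output branch independently re-evaluates a nondeterministic is_valid answer:

```python
rows = list(range(1000))  # pretend these are 1K primary keys

def is_valid(row, run):
    # Stand-in for the LLM-backed UDF. To keep the demo deterministic,
    # a handful of rows "change their answer" between runs:
    if row % 100 == 0:        # 10 rows: valid in run 1, invalid in run 2
        return run == 1
    if row % 200 == 50:       # 5 rows: invalid in run 1, valid in run 2
        return run == 2
    return row % 20 != 0      # all other rows answer consistently

# Each output dataset may trigger its own upstream computation,
# so the two filters can see *different* evaluations of is_valid:
valid   = [r for r in rows if is_valid(r, run=1)]
invalid = [r for r in rows if not is_valid(r, run=2)]

duplicated = set(valid) & set(invalid)              # land in both branches
missing    = set(rows) - set(valid) - set(invalid)  # land in neither

print(len(valid), len(invalid))       # 955 50 -> sums to 1005, not 1000
print(len(duplicated), len(missing))  # 10 5
```

Because the two filters see different answers for some rows, those rows end up in both branches (duplicate PKs after a union) or in neither (missing PKs), and the branch counts no longer sum to the input count.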

Hi @drew, thanks for the quick response. I will try checkpointing and update here. Is checkpointing still relevant in the case where I first observed the issue, where there is only one output? I ran the pipeline that unions together the forked logic paths (to the right in the screenshot), and the output produced >1K rows, with duplicate PKs appearing. The added outputs are solely for my debugging.

@drew My pipeline is incremental, so it seems I can’t use checkpoints. Understanding that the individual outputs might rerun the nondeterministic parts of the pipeline, this still wouldn’t explain why I am seeing duplicate and missing PKs in the final output, would it?
Is it correct that the final output would not recompute intermediate nodes? For example, below, the LLM block would only be computed once.

Hi @jshelby, despite the LLM node being a single parent to multiple children on the graph, it may be recomputed for each output “child”. This is because each output dataset may be built independently of the others.

However, if you only have (or care about) one output, this shouldn’t matter. You may just see discrepancies between other output datasets.

In your case, it seems the number of rows before and after the filter block should be equal. However, if you’re judging that those row counts are not equal based on the intermediate output datasets, your analysis may itself be influenced by non-determinism, since each intermediate output may come from an independent run of the upstream transforms.

I hope that makes sense! Let me know which (if any) of the above options apply.

Hey @sperchanok
The discrepancy was noticed based on the output shown in my diagram. The convergence of the forks is a union (not included in the diagram). In that output, I am seeing strange behavior, including duplicate PKs and missing PKs. Could this be a result of the partitioning? Let me know what you think about this theory. I’m re-running a test now with enforced partitions on the filter column to see if that helps.

Hey. To be honest, I’m not quite certain how partitioning could affect this.

One other thing, just to check it off: do you have any null values in your boolean column? If you’re filtering for “true” or “false”, you may be inadvertently filtering out null rows.
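For illustration, here’s a plain-Python sketch of that null pitfall (the rows are hypothetical; in SQL-style engines both `is_valid = true` and `is_valid = false` evaluate to unknown for a null, so a null row passes neither filter):

```python
# Hypothetical rows; is_valid=None models a null, e.g. from a UDF
# that failed to parse an LLM response.
rows = [
    {"pk": 1, "is_valid": True},
    {"pk": 2, "is_valid": False},
    {"pk": 3, "is_valid": None},
]

# Mimic SQL three-valued logic: a null matches neither comparison.
valid   = [r for r in rows if r["is_valid"] is True]
invalid = [r for r in rows if r["is_valid"] is False]

union = valid + invalid
print([r["pk"] for r in union])  # [1, 2] -- pk 3 silently disappears
```

Note that this particular pitfall only drops rows; it can’t produce duplicate PKs on its own.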

All values in the filter column are either true or false.