Does Pipeline Builder Streaming Preview Run Across All Data?

sandpiper · February 7, 2025, 7:27am

My understanding is that for Pipeline Builder batch pipelines, unless a sampling strategy is configured, the full input dataset is always in scope for the calculation of output previews.

Do output previews for streaming pipelines work the same way? I’m asking because we have a streaming pipeline that simply filters data to rows where a string column starts with a particular substring. The Pipeline Builder output preview shows zero rows, but actually running the stream produces a nonzero number of rows. This seems inconsistent with the documented behavior for batch pipelines, and we’re trying to determine whether it is expected behavior for streaming pipelines.

sperchanok · February 7, 2025, 3:14pm

Hey! Can you tag in streaming on your post?

svercillo · February 7, 2025, 3:45pm

Hey there! Streaming preview works by running only some number of sampled records through the pipeline. Depending on the shape of the data or the filtering applied, it is possible that the preview output is not representative of what the actual output will be.
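As a rough illustration of why a selective filter can produce an empty preview, here is a plain-Python sketch. It is not how Pipeline Builder actually samples: the `preview_sample` helper, the "first N records" strategy, the sample size of 50, and the `ERR-` prefix are all assumptions made up for the example.

```python
def starts_with_filter(records, column, prefix):
    """Keep records whose string column starts with the given prefix."""
    return [r for r in records if r[column].startswith(prefix)]


def preview_sample(records, sample_size):
    """Hypothetical preview: only the first `sample_size` records are used.
    (Assumption for illustration; the real sampling strategy may differ.)"""
    return records[:sample_size]


# 10,000 synthetic records; only 1 in 100 has the "ERR-" prefix,
# and the first match appears at id 99.
records = [
    {"id": i, "code": ("ERR-" if i % 100 == 99 else "OK-") + str(i)}
    for i in range(10_000)
]

# Filter applied to the sampled records only (what a preview would show).
preview_rows = starts_with_filter(preview_sample(records, 50), "code", "ERR-")

# Filter applied to every record (what the running stream would produce).
full_rows = starts_with_filter(records, "code", "ERR-")

print(len(preview_rows), len(full_rows))
```

In this sketch the sample contains no matching records, so the preview is empty even though the full run emits rows.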

Pipeline Builder also has unit tests, where you provide the specific input you want to run and validate the output against expected results. That may be of interest to you.
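Conceptually, a test like that pins down the filter's behavior on input you control rather than on whatever records happen to be sampled. The sketch below shows the idea in plain Python; it is not the Pipeline Builder unit test interface, and the `keep_prefix` function, column name, and rows are hypothetical.

```python
def keep_prefix(rows, column, prefix):
    """Stand-in for the pipeline's filter: keep rows whose column starts with prefix."""
    return [row for row in rows if row[column].startswith(prefix)]


# Hand-written input that deliberately includes matching and non-matching rows.
test_input = [
    {"id": 1, "code": "ERR-disk"},
    {"id": 2, "code": "OK-disk"},
    {"id": 3, "code": "ERR-net"},
    {"id": 4, "code": "err-lowercase"},  # startswith is case-sensitive, so no match
]

# The output we expect the filter to produce for that input.
expected_output = [
    {"id": 1, "code": "ERR-disk"},
    {"id": 3, "code": "ERR-net"},
]

actual_output = keep_prefix(test_input, "code", "ERR-")
assert actual_output == expected_output, "filter output did not match expected rows"
```

Because the input is fixed, the result does not depend on sampling, which makes it a more reliable check than the preview for a highly selective filter.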

system · February 21, 2025, 3:46pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.