Is it expected that Pipeline Builder (PB) does not fail when the input schema changes and the pipeline has not been redeployed?

Summary of Issue:

  • We have two input columns of type string. For the sample, we have two rows of data.
  • In PB, we then add a constant column to aggregate across, and create two arrays.
  • We then use the multi-column expression to convert both arrays into comma-separated strings (see the sketch after this list).
  • When sample_column changes from type string to type decimal, we would expect the build to fail.
  • PB does show a schema error, but the build succeeds.
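For reference, here is a minimal PySpark sketch of the pipeline described above. The column name sample_column comes from the report; the second column name and the sample values are assumptions for illustration.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Two string input columns, two rows of sample data.
df = spark.createDataFrame(
    [("a1", "b1"), ("a2", "b2")],
    ["sample_column", "other_column"],
)

# Add a constant column to aggregate across, collecting each column into an array.
arrays = df.withColumn("group_key", F.lit(1)).groupBy("group_key").agg(
    F.collect_list("sample_column").alias("sample_array"),
    F.collect_list("other_column").alias("other_array"),
)

# Convert both arrays into comma-separated strings (the multi-column expression step).
result = arrays.select(
    F.concat_ws(",", "sample_array").alias("sample_csv"),
    F.concat_ws(",", "other_array").alias("other_csv"),
)
result.show()
```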

Question: is this behavior expected?

Hey @jrobison! Thanks for reaching out. The builds that are succeeding are being kicked off outside of Builder (specifically via build schedules), which means they're using the latest pipeline logic that successfully deployed. I think raw builds and build schedules will run on whatever input data is passed in from upstream, and I would not be surprised at all if type coercion is happening behind the scenes. Pipeline Builder itself only verifies that new deployments of the pipeline are type-compatible with the upstream data, so an already-deployed pipeline keeps running after the schema changes.
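To see why coercion could let the build succeed, consider the same expression applied after the column becomes decimal. In plain Spark, concat_ws implicitly casts numeric values to strings rather than failing, so the deployed logic keeps evaluating. Whether Foundry's runtime coerces in exactly this way is an assumption; this is just a plausible mechanism.

```python
from decimal import Decimal
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# sample_column is now decimal, as in the reported schema change.
schema = StructType([StructField("sample_column", DecimalType(10, 2))])
decimal_df = spark.createDataFrame([(Decimal("1.50"),), (Decimal("2.75"),)], schema)

# The string expression still evaluates: Spark implicitly casts the collected
# decimal array to an array of strings instead of raising a type error,
# mirroring a build that "succeeds" despite the schema change.
decimal_df.agg(
    F.concat_ws(",", F.collect_list("sample_column")).alias("sample_csv")
).show()
```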

We also have a set of features for configuring data expectations on the outputs of a Pipeline Builder pipeline: https://www.palantir.com/docs/foundry/pipeline-builder/dataexpectations-overview/

One quick workaround you can use until we enable full data expectations on inputs is to create a dummy pipeline that outputs a dataset with data expectations on it, and then use that output as the input to your pipeline!
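A minimal sketch of that guard pipeline, written as a Foundry Python transform. The dataset paths are hypothetical, and the manual schema assertion here stands in for a configured data expectation; see the docs linked above for the expectations API itself.

```python
from pyspark.sql.types import StringType
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/guarded/raw_input_validated"),  # hypothetical path
    source_df=Input("/Project/raw/raw_input"),       # hypothetical path
)
def validate_schema(source_df):
    # Fail the build loudly if the column's type has drifted from string.
    field = source_df.schema["sample_column"]
    if not isinstance(field.dataType, StringType):
        raise TypeError(
            f"sample_column expected string, got {field.dataType.simpleString()}"
        )
    # Otherwise pass the data through unchanged; point the PB pipeline at this
    # output instead of the raw dataset.
    return source_df
```

Because the guard transform raises on a type mismatch, schema drift stops the build at the dummy pipeline rather than flowing silently into the PB pipeline downstream.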