I have a streaming pipeline that applies a filter on a stream and produces a time series sync. I want to remove the filter and redeploy the pipeline, but I would like to avoid replaying the whole stream, since that would take a while and I am happy for the filter removal to apply to future data only. Is this the expected behaviour if I don’t select “replay on deploy”, or will not replaying the stream cause the existing output data to be lost?
Hi! If you do not press replay, the output dataset is not reset, so no existing output data is lost. The behaviour you describe is what you will get: the removal of the filter will apply to future data only.
We’re having a similar issue, where we removed some properties / columns from a dataset in an incremental pipeline in Pipeline Builder. This is considered a breaking change.
These properties in the ontology would be dropped.
We would like this change to impact only new data, and we can’t reprocess the pipeline as a one-off batch.
Any ideas?
If you’re unable to replay, then keeping the columns that would otherwise be dropped and populating them with null values going forward is one way to tackle this problem. Nulls are very cheap to store at rest, so very little additional cost is added. The alternative is to replay and build as a snapshot, since this is a breaking change to the schema.
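To make the idea concrete, here is a minimal PySpark-style sketch of that workaround as it might look in a code-based transform rather than Pipeline Builder; the column name `deprecated_col`, its type, and the sample data are placeholders, not anything from your pipeline.

```python
# Minimal sketch: keep the deprecated column in the schema but stop populating
# it, so the change is not schema-breaking for downstream consumers.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder data standing in for the new increment of the pipeline.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "value"])

# Instead of dropping the column, carry it forward as an explicit null of the
# original type; null values add almost no storage cost at rest.
out = df.withColumn("deprecated_col", F.lit(None).cast(StringType()))
out.printSchema()
```

The same effect can be achieved in Pipeline Builder by casting a null literal to the original column type instead of removing the column outright.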
In our case, the simplest solution was indeed to keep the columns to avoid a breaking change and simply set them to null moving forward.