We are developing a streaming pipeline with a peculiar workflow. Incoming rows contain an incomplete picture of some observation: some rows carry one property (call it X) and some carry a second property (call it Y). When a row contains the X property, the Y column is null (and vice versa). Each of our observations has a unique id. As an observation changes, the X and Y properties change, and we only want the latest observation (by timestamp; note that rows may arrive out of order).

Additionally, we want the window to reset every time we get a row for a given partition. For example, if our window size is 10 minutes and we get an initial observation at 2:00 PM and then another one at 2:02 PM, an observation at 2:11 PM should still be included in the same window, since it arrives within 10 minutes of the 2:02 PM row even though it is more than 10 minutes after the first.
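To make the shape of the data and the timing we're describing concrete, here is a toy example (the id, timestamps, and field names are made up purely for illustration):

```python
# One observation id; each row carries only one of the two sparse properties.
rows = [
    # (id,      event time,  X,    Y)
    ("obs-1", "2:00 PM",    1.0, None),   # carries only X
    ("obs-1", "2:02 PM",   None,  5.0),   # carries only Y
    ("obs-1", "2:11 PM",    2.0, None),   # within 10 min of 2:02 PM, so same window
]

# What we want for obs-1 once the window closes: the latest non-null value of
# each property by event timestamp, i.e. X from the 2:11 PM row and Y from 2:02 PM.
desired = {"id": "obs-1", "x": 2.0, "y": 5.0}
print(desired)
```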
Our initial idea was to use the Window transform (we're unsure which type of window to use), partitioning on our unique id, and use the last aggregation (ignoring nulls) to aggregate all of the fields. We have two concerns. First, we want the last row by event timestamp, not by arrival order at the transform (rows can arrive out of order). Second, there is the warning about non-deterministic behavior on aggregate or unordered windows, so we want to confirm that whatever window type we use would actually work here.
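For concreteness, here is roughly what we had in mind, sketched in PySpark terms purely to pin down the idea (we're not committed to this API; the column names are illustrative, and this plain partition-by-id window spec doesn't yet handle the session-style reset described above):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame with the sparse X/Y layout described above (schema is illustrative).
df = spark.createDataFrame(
    [
        ("obs-1", "2024-01-01 14:00:00", 1.0, None),
        ("obs-1", "2024-01-01 14:02:00", None, 5.0),
        ("obs-1", "2024-01-01 14:11:00", 2.0, None),
    ],
    "id string, event_ts string, x double, y double",
).withColumn("event_ts", F.to_timestamp("event_ts"))

# Partition on the unique id and order by the event timestamp, so that "last"
# means last-by-timestamp rather than last-by-arrival-order.
w = (
    Window.partitionBy("id")
    .orderBy("event_ts")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

# Take the latest non-null value of each sparse property.
merged = (
    df.withColumn("x_latest", F.last("x", ignorenulls=True).over(w))
      .withColumn("y_latest", F.last("y", ignorenulls=True).over(w))
)
merged.show(truncate=False)
```

This is a batch-style window function, though; the open question for us is which streaming window type gives the same deterministic last-by-timestamp behavior plus the gap-based reset.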
Do y’all have any ideas on how we could accomplish this workflow in the cleanest way?