Hey!
I have a stream of events, keyed by a column idColumnA, which I want to use to back an object type with idColumnA as the primary key.
The stream also contains a column idColumnB, which references the ID of a row in another dataset generated via a batch sync.
I want to do a left join of the stream with the batch dataset (following the docs here) to populate some additional columns on the streamed events from the batch-synced data.
What’s the expected behavior on existing streaming records when new matching rows are added to the batch dataset in subsequent transactions?
e.g.
- A new record arrives in the stream for idColumnA=123, idColumnB=456, and there is no matching row in the batch dataset. The output of the join has nulls for the columns from the batch dataset.
- The batch dataset is updated to add a new row for idColumnB=456.
- Assuming the join in the streaming pipeline runs when the next record arrives, does it include the latest record for each streamed key in the join against the new batch dataset, or does it only join newly received records?
The desired behavior I'm looking for here is that the object for a given idColumnA gets populated with the latest view of the columns from the associated idColumnB row.
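To make the distinction concrete, here is a toy plain-Python simulation of the scenario above under the "only newly received records are joined" semantics (this is an illustration of the question, not a claim about the platform's actual behavior; the `batchValue` column and dict-based datasets are hypothetical stand-ins):

```python
# Toy simulation: each "micro-batch" left-joins ONLY the newly arrived stream
# records against the current snapshot of the batch dataset. Previously
# emitted output rows are not revisited.

def left_join_new_records(new_records, batch_snapshot):
    """Left-join newly arrived stream records against a batch-dataset snapshot.

    batch_snapshot maps idColumnB -> the batch row's value; a missing key
    behaves like a non-matching left join (null batch columns).
    """
    out = []
    for rec in new_records:
        match = batch_snapshot.get(rec["idColumnB"])  # None when no match
        out.append({**rec, "batchValue": match})
    return out

# Transaction 1: no batch row for idColumnB=456 yet -> nulls in the output.
batch = {}
first = left_join_new_records([{"idColumnA": 123, "idColumnB": 456}], batch)

# Transaction 2: batch dataset gains a row for idColumnB=456. Under these
# semantics, only the NEW record picks up the value; the earlier output row
# for idColumnA=123 still carries nulls unless something re-emits it.
batch[456] = "some value"
second = left_join_new_records([{"idColumnA": 789, "idColumnB": 456}], batch)
```

The desired behavior in the question corresponds to the other semantics: re-joining the latest record per streamed key against each new batch snapshot, so `idColumnA=123` would eventually pick up the value too.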