Hey!
I have a stream of events, keyed by a column idColumnA, which I want to use to back an object type with idColumnA as the primary key.
The stream also contains a column idColumnB, which references the ID of a row in another dataset generated via a batch sync.
I want to do a left join of the stream with the batch dataset (following the docs here) to populate some additional columns on the streamed events from the batch-synced data.
What’s the expected behavior on existing streaming records when new matching rows are added to the batch dataset in subsequent transactions?
e.g.
- A new record arrives in the stream for idColumnA=123, idColumnB=456, and there is no matching row in the batch dataset. The output of the join has nulls for the columns from the batch dataset.
- The batch dataset is updated to add a new row for idColumnB=456.
- Assuming the join in the streaming pipeline runs when the next record arrives, does it include the latest record for each streamed key in the join against the new batch dataset, or does it only join newly received records?
The desired behavior I'm looking for here is that the object for a given idColumnA gets populated with the latest view of the columns from the associated idColumnB row.
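To make the distinction concrete, here is a toy plain-Python simulation of the scenario above under the "only newly received records are joined" semantics (this is an illustration of the question, not a claim about the platform's actual behavior; the `batchValue` column and dict-based datasets are hypothetical stand-ins):

```python
# Toy simulation: each "micro-batch" left-joins ONLY the newly arrived stream
# records against the current snapshot of the batch dataset. Previously
# emitted output rows are not revisited.

def left_join_new_records(new_records, batch_snapshot):
    """Left-join newly arrived stream records against a batch-dataset snapshot.

    batch_snapshot maps idColumnB -> the batch row's value; a missing key
    behaves like a non-matching left join (null batch columns).
    """
    out = []
    for rec in new_records:
        match = batch_snapshot.get(rec["idColumnB"])  # None when no match
        out.append({**rec, "batchValue": match})
    return out

# Transaction 1: no batch row for idColumnB=456 yet -> nulls in the output.
batch = {}
first = left_join_new_records([{"idColumnA": 123, "idColumnB": 456}], batch)

# Transaction 2: batch dataset gains a row for idColumnB=456. Under these
# semantics, only the NEW record picks up the value; the earlier output row
# for idColumnA=123 still carries nulls unless something re-emits it.
batch[456] = "some value"
second = left_join_new_records([{"idColumnA": 789, "idColumnB": 456}], batch)
```

The desired behavior in the question corresponds to the other semantics: re-joining the latest record per streamed key against each new batch snapshot, so `idColumnA=123` would eventually pick up the value too.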