question about aggregating on a stream
what we are trying to do:
we have two datasets: a stream dataset streamA
and static dataset staticB
streamA
has a column containing string arrays keys
and staticB
has a column containing strings key
we want to join streamA
with staticB
when key (staticB)
exists in keys (streamA)
then we want to create a new column in streamA
containing all values in key (staticB)
that exists in keys (streamA)
called joinedKeys (streamA)
how we are doing it:
-
we applied an
assign timestamps and watermarks
transform onstreamA
-
we are exploding
keys (streamA)
intoexploded_key (streamA)
then joining withstaticB
whenexploded_key (streamA) = key (staticB)
-
then we are using
aggregate over window
with paritionparitionCol (streamA)
, window typesession event time window
and window gap1 millisecond
and no lateness defined -
then we collect distinct array of
key (staticB)
asjoinedKeys (streamA)
what’s the problem:
we did not see any data when defining the window this way - should it be defined differently?