How do I add row number partitioned by a given set of columnsto my existing streaming pipeline builder.
Using window function, presents with below options
Bounded out-of-order event time tumbling window
LRU cumulative partitioned
static gap event time bounded lateness session window
Hey @sgariki, could you describe what you’re looking to do in a little more detail? Specifically, what do you mean by “row number”? Are you looking to get the total number of rows in each window?
Within each dataset in streaming pipeline, I would like assign row number by keys similar to oracle row_number() over(partition by)
Hey, you can do this in batch by applying the row number expression in a sorted window. I do not believe there is a streaming equivalent of this as of current, since we only support aggregations over windows.
I think there are two approachs here that could work:
-
In pipeline builder you could use an aggregate over window transform with a tumbling window to group by on a key over a timeframe, you could then collect the records in an array of structs, sort that array, and then explode it with position to create all the rows again with the row number coming from the position
-
You could also write a streaming UDF and implement a custom window function to get a row number in a given partition