Can I make the following transform directly on PB, or do I need to use a UDF? This is what I want to do:
I want to sort on two columns, A and B, and apply a sliding window that holds 2 rows at a time. I’m pretty sure I know how to do this on PB so far. However, within this window, I want to perform some casing. For example, I want to check whether the value of column C in the second row is 1 plus the value of column C in the first row. If that check is true, I want to “melt” the two rows so that all the rows are the same, except that we take the earlier value of variable D from the two rows in the window.
It looks like what you want to do is not currently possible directly without using a streaming-UDF. However, you can get a semantically equivalent result another way.
Here’s how you can do it:
1. Set up a sliding window with a count of 2: it sounds like you have already done this.
2. Use an aggregate expression on the window: apply the aggregate function collect_array on the window. This gathers the specified column's value from each row in the window into an array. If you need the whole row rather than a single column, you can use the CreateStruct expression beforehand and collect the resulting struct.
3. Process the array: once both elements are together in a single array column, you can perform any necessary operations or transformations downstream.
This approach allows you to work with both elements within the window in a single column value, making it easier to apply further logic or casing as needed.
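To make the logic of the steps above concrete, here is a minimal sketch in plain Python. It is not PB code and does not use PB's expressions. The helper names collect_windows and melt are hypothetical, and it assumes one reading of the "melt" rule: when the check on C passes, the current row is kept but its D is replaced by the D of the earlier row in the window.

```python
from typing import Any

Row = dict[str, Any]

def collect_windows(rows: list[Row], size: int = 2) -> list[list[Row]]:
    """Sort by (A, B), then collect for each row the window that ends at it.

    This stands in for "struct per row, then collect into an array over a
    sliding window of 2". The first window holds only one row.
    """
    ordered = sorted(rows, key=lambda r: (r["A"], r["B"]))
    return [ordered[max(0, i - size + 1): i + 1] for i in range(len(ordered))]

def melt(window: list[Row]) -> Row:
    """Apply the casing to one collected window (assumed interpretation)."""
    current = window[-1]
    # Check: is C in the second row equal to 1 plus C in the first row?
    if len(window) == 2 and current["C"] == window[0]["C"] + 1:
        # Keep the current row, but take the earlier value of D.
        return {**current, "D": window[0]["D"]}
    return dict(current)

# Made-up sample data, deliberately out of order.
rows = [
    {"A": 1, "B": 2, "C": 11, "D": "y"},
    {"A": 1, "B": 1, "C": 10, "D": "x"},
    {"A": 2, "B": 1, "C": 20, "D": "z"},
]
result = [melt(w) for w in collect_windows(rows)]
```

Here the second sorted row has C = 11, which is 1 plus the first row's C = 10, so it picks up the earlier D. The third row fails the check and keeps its own D. If you instead want the two rows collapsed into a single output row, the same collected array gives you everything you need to do that downstream.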