Hi, I have a dataset with a time column and I want to effectively use the pandas bfill() method to backfill null dates with the most recent non-null value after them. I don’t think I can easily use a sorted window function with the last() expression since you may have multiple nulls in between 2 values. Has anyone done this before? do you know how to do it?
Yes, this is achievable entirely in Pipeline Builder.
You noted that your dataset contains a “time” column. I’m going to assume that there is only one date per row in this “time” column (i.e., the time column does not contain arrays of dates).
Start by using a Window transform to generate a new column containing an array of all the non-null dates within whatever partitions your solution requires (you’d filter the rows to non-null dates and then use the Collect Distinct Array expression on the “time” column to generate this output, most likely). Now each row of your dataset will contain a single date in the “time” column and an array of non-null dates in the newly generated column.
Next, use the Filter Array Elements transform to keep only the non-null dates that are greater than the date from the “time” column. Do this by selecting the non-null date array in the “Array” field, choosing “Greater Than” as the “Expression to Filter”, choosing the green “element” button for the “Left” field, and choosing “time” (or whatever the name of your single date column is) as the “Right” field.
This will leave you with an unsorted array of all the non-null dates that come after the date from the “time” column. Use an ascending Array Sort transform to order these non-null dates, and then use the Array Element transform to take the first element from the array. This element will be the first non-null date that comes after the date from your “time” column. Let’s call this the “replacement_date.”
You didn’t mention this in your question, but you’ll probably want to use one more transform to coalesce (first non-null value) the original “time” column and your newly generated “replacement_date” column, so that if the original “time” column did contain a non-null value, you’re not overwriting it. Set “time” as the first value in the coalesce and “replacement_date” as the second value in the coalesce. The resulting output will be equal to “time” if “time” is not null, and “replacement_date” if “time” is null.
This should get you to the desired solution.