Joining to the closest row

I’ve got outputs (output1 and output2) from two transforms (transform1 and transform2). For each “output1” I receive I need to find the most recent “output2” and then do some math. Each row has a timestamp. Is there any easy way to do this in Pipeline Builder? The data set is streaming for what it’s worth.

1 Like

Hey @justinp

This would probably work:

  1. Cross join to find all possible combinations of the 2 timestamps
  2. Find the difference between the 2 timestamps in each row
  3. Window: Partition by key and rank the difference to get the least difference for each key (rank 1) and filter the whole thing by rank = 1.

Building on the above, if I understand correctly, this sounds like something that you can do with a stateful Outer Caching Join. It sounds like your have two streams and want to join all the frames with the waypoint stream for part of the answer above.

Hi @justinp

The crossjoin solution is not going to scale well and may take a while to compute. Instead, you can use the lead window function for each row to define the timeframe(docs). Then you can use a join condition like this: output1.timestamp >= output2.timestamp AND output1.timestamp < output2.timestamp_lead. If this were to be a batch transform, this would certainly be significantly faster. With streaming, I can’t be so certain

Cheers,
Yurii

Sounds like a merge_asof in Pandas. For situations where you want to match rows in one DataFrame to the closest row in another DataFrame (in terms of time). I’m also interested in how this would be done using transforms.