Does anyone know the best way to compute this kind of aggregation/windowing "throughput" metric? Pipeline Builder would be best; Contour and TwoDimensionalAggregation functions are less ideal but could work.

Basically, we have data structured as (patient_id, entry_time, exit_time). We're looking to bucket across weeks/months/days (any arbitrary time unit) and get the count of patients per bucket where the range [patient.entry_time, patient.exit_time] overlaps with [bucket.start_time, bucket.end_time]. For now, we're okay with doing this once (so it doesn't need to be heavily performance-optimized), but we're keeping in mind the possibility that this will need to populate a dashboard that gets updated daily.
Hey @davidyu, so assuming you have your patient dataset that looks like (patient_id, entry_time, exit_time) and a dataset that has your day/week/month time buckets, I would just cross join and filter for overlaps in Pipeline Builder.
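If you don't already have the bucket dataset, here is one way to generate it. This is just a sketch assuming weekly buckets and the bucket_start / bucket_end column names used in the snippet below; swap the date range and interval for whatever time unit you need:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One row per week between two (assumed) boundary dates
buckets = (
    spark.sql(
        "SELECT explode(sequence(to_date('2024-01-01'), to_date('2024-12-31'), interval 1 week)) AS bucket_start"
    )
    .withColumn("bucket_end", F.date_add("bucket_start", 7))
)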
PySpark logic for the cross join and overlap filter, if it helps:
# Cross join every patient row against every time bucket
df_cross = df.crossJoin(buckets)

# Keep rows where [entry_time, exit_time] overlaps the bucket.
# This treats buckets as half-open intervals [bucket_start, bucket_end);
# use <= on bucket_end if your buckets are inclusive on both ends.
df_overlap = df_cross.filter(
    (df_cross.entry_time < df_cross.bucket_end)
    & (df_cross.exit_time >= df_cross.bucket_start)
).select(
    "patient_id", "entry_time", "exit_time", "bucket_start", "bucket_end"
)
After that, you can aggregate and count the patients per bucket.
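For completeness, a minimal sketch of that last step, assuming the df_overlap columns from the snippet above (countDistinct guards against double-counting a patient within the same bucket):

from pyspark.sql import functions as F

# Distinct patients whose stay overlaps each bucket
throughput = (
    df_overlap
    .groupBy("bucket_start", "bucket_end")
    .agg(F.countDistinct("patient_id").alias("patient_count"))
    .orderBy("bucket_start")
)

That throughput dataset can then back the daily-refreshed dashboard directly.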