"Throughput" Calculation

Does anyone know the best way to compute this kind of aggregated/windowed "throughput" metric? Pipeline Builder would be best; Contour and TwoDimensionalAggregation functions are less ideal but could work.

Basically, we have data structured in the form (patient_id, entry_time, exit_time). We're looking to bucket across weeks/months/days (any arbitrary time unit) and get the count of patients per bucket where the range [patient.entry_time, patient.exit_time] overlaps with [bucket.start_time, bucket.end_time]. For now, we're okay with doing this once (so it doesn't need to be heavily performance-optimized), but we're keeping in mind the possibility that this will need to populate a dashboard that gets updated daily.

Hey @davidyu, so assuming you have your patient dataset that looks like patient_id, entry_time, exit_time, and a dataset that has your day/week/month time buckets, I would just cross join and filter for overlaps in Pipeline builder.
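
If you don't already have the bucket dataset, here's a minimal sketch of one way to generate it in PySpark. The date range, weekly granularity, and column names (range_start, bucket_start, etc.) are all my assumptions, so swap in whatever fits your setup:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical reporting window; replace with your own range and step.
buckets = (
    spark.createDataFrame(
        [("2024-01-01", "2024-07-01")], ["range_start", "range_end"]
    )
    # One timestamp per week between range_start and range_end.
    # Note: sequence() is inclusive of the end point, so you may want
    # to trim the last bucket.
    .select(
        F.explode(
            F.sequence(
                F.to_timestamp("range_start"),
                F.to_timestamp("range_end"),
                F.expr("INTERVAL 1 WEEK"),
            )
        ).alias("bucket_start")
    )
    # Each bucket ends where the next one starts (half-open interval).
    .withColumn("bucket_end", F.col("bucket_start") + F.expr("INTERVAL 1 WEEK"))
)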

PySpark logic, if it helps:

# Pair every patient row with every bucket, then keep the pairs whose
# intervals overlap. This treats the patient stay as a closed interval
# [entry_time, exit_time] and each bucket as half-open [bucket_start, bucket_end).
df_cross = df.crossJoin(buckets)
df_overlap = df_cross.filter(
    (df_cross.entry_time < df_cross.bucket_end) &
    (df_cross.exit_time >= df_cross.bucket_start)
)
df_overlap = df_overlap.select(
    "patient_id", "entry_time", "exit_time", "bucket_start", "bucket_end"
)

After that, you can aggregate and count the patients per bucket.
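
Something like this for the final count (again a sketch; countDistinct is my assumption, to guard against duplicate patient rows):

import pyspark.sql.functions as F

# One row per bucket with the number of patients whose stay overlaps it.
throughput = (
    df_overlap
    .groupBy("bucket_start", "bucket_end")
    .agg(F.countDistinct("patient_id").alias("patient_count"))
    .orderBy("bucket_start")
)
throughput.show()

If this eventually backs a daily-updated dashboard, the cross join is the part to watch; restricting buckets to the relevant date range before joining keeps the row count manageable.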