A situation I often encounter when building a metric and alerting pipeline is creating a calculated column that is semantically “meaningless” but captures some combined measure of “badness”, and then mapping the results into a 0 → 1 range. I find this makes the result easier to understand as a pure distribution, and it enables reusable logic for bucketing the range, assigning ratings, etc.
As a trivial example, consider a case I’m working on where I have one row per page of documentation, with columns for “page_views_prev_30_days” and “days_since_last_update”. To get a metric for which docs to prioritize updating, a naive approach is simply to multiply these columns together. The result has no semantic meaning, but the distribution surfaces docs that have both a lot of views and a long gap since their last update.
Ideally there would be a one-shot board in Pipeline Builder where I could give it a numeric column, specify the min and max of the new range, and have it automatically apply min-max scaling for this normalization.
It’s not too annoying to do with a reusable custom expression, but I was wondering if there’s something I’m missing for doing this? Or maybe some smarter way to treat this class of transformation?
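For concreteness, here’s a sketch of what I mean in plain Python (the column names are from my example above; the helper name `min_max_scale` is just illustrative, not anything built in):

```python
# Illustrative sketch of the min-max scaling I'd want the board to do.
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a list of numbers into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: map everything to the bottom of the range
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# "badness" = page_views_prev_30_days * days_since_last_update, per page
page_views = [1200, 40, 300, 5000]
days_stale = [10, 400, 90, 2]
badness = [v * d for v, d in zip(page_views, days_stale)]
scaled = min_max_scale(badness)  # every value now lies in [0, 1]
```

The bucketing/rating logic downstream then only ever sees values in [0, 1], regardless of which raw columns were combined upstream.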