Hello,
This is a feature request for Data Health checks on datasets. I would like a new Data Health check that verifies global uniqueness of the values inside an array-typed column across all rows. Conceptually, it’s similar to a primary key constraint, but applied to each element of the array after exploding, and validated across the entire dataset/object type (not just within a single row).
Simple example:
- Row 1: tags = [A, B]
- Row 2: tags = [C, D]
- Row 3: tags = [B, E] ← violation, because “B” already appears in Row 1
Current problem:
Today we can only guarantee uniqueness at the primary key level. We do not have a built-in Data Health rule that asserts “no element contained in this array column appears in any other row’s array.” To monitor this, we must introduce an intermediate build (explode + deduplicate + check), which adds latency, cost, and operational complexity just to validate uniqueness.
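To make the requested behavior concrete, here is a rough sketch in plain Python of the explode + check logic we currently have to build ourselves. It is only an illustration, not how the platform would implement it: the function name, the `id` and `tags` keys, and the in-memory list of rows are all made up for this example. It also assumes that a value repeated inside the same row is not a violation; only values shared between different rows are flagged.

```python
from collections import defaultdict


def find_cross_row_duplicates(rows, id_key="id", array_key="tags"):
    """Return {element: sorted row ids} for elements found in more than one row.

    Hypothetical helper for illustration; `id_key` and `array_key` are
    assumed column names, not part of any real Data Health API.
    """
    rows_by_element = defaultdict(set)
    for row in rows:
        # "Explode" the array: visit each element with its owning row id.
        # A null/missing array is treated as empty.
        for element in row[array_key] or []:
            rows_by_element[element].add(row[id_key])
    # Keep only elements owned by two or more distinct rows.
    return {
        element: sorted(ids)
        for element, ids in rows_by_element.items()
        if len(ids) > 1
    }


# Same data as the simple example above.
rows = [
    {"id": 1, "tags": ["A", "B"]},
    {"id": 2, "tags": ["C", "D"]},
    {"id": 3, "tags": ["B", "E"]},
]

violations = find_cross_row_duplicates(rows)
print(violations)  # {'B': [1, 3]}
```

A built-in check would ideally pass when this result is empty and otherwise report the offending values together with the rows that share them.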
Thanks for reviewing this feature request.
Regards,