How would you join on nulls in Pipeline Builder?

Joel · December 19, 2024, 5:40pm

Is there a recommended way to perform a null-safe equality check in a pipeline builder join similar to PySpark’s null safe equality check( pyspark.sql.Column.eqNullSafe — PySpark 3.5.3 documentation)? For example, suppose I have two tables for customers and transactions where both tables identify customers by “ID” (never null), and “sub ID” (not null only for non-unique IDs).

I’d like to join on “ID” = “ID” and “sub ID” <=> “sub ID” preferably without coalescing literal “NULL” strings into null sub IDs. When I try setting this up to join when sub IDs are equal or both null (shown below), Pipeline Builder warns to not do that because “Using non-equality conditions within a join can be computationally expensive and may result in slower previews and builds.”

dherls · December 19, 2024, 6:33pm

Hello! In this case I think you can ignore pipeline builder’s warning about performance. It’s being a little over-zealous about what it considers an equality operation.

If you implement this and do notice the performance is particularly bad, reach back out and we can debug further.