Filter on regex match column in Pipeline Builder

dherls · October 14, 2024, 6:40pm

Is there a way to use either the Regex Match filter or the Regex Match join expression with a column instead of a constant? I have a dataset of regular expression rules that I want to join with a different dataset of values and filter to just rows with matching regexes.

Image contains notional data

MattP · October 15, 2024, 11:34am

You’re right that can’t use the “regex match” condition on a Join or Filter Rows operation because they don’t let you select a column to get the regex from.

However, you can use the Regex Match or Regex Find operations as they allow you to select the column that contains the regex.

First, you’ll need to do a Cross-Join of the dataset containing your data and the dataset containing your regexes.

Then, transform this Cross-Joined dataset with either the Regex Match or Regex Find operation, depending on whether you want the regex to match the whole string or just a substring. For the operation you’ll need to select the field that contains the data to match against, and the field that contains the regex. This will write a Boolean value to a column called regex_match or regex_find by default.

You can then add a Filter Rows operation to only keep the rows where this new column is True.

Please be aware though that a Cross-Join will duplicate your rows of data for every regex that you have, so it can easily get quite large!

A more efficient way to do this would be to write a transform in a Code Repository that reads a list of regexes from one dataset and uses them to filter another dataset.

The Pipeline Builder way would still be a great way to test the concept, even just with a sample of your data and/or regexes, before going to the trouble of writing the code to do it more efficiently.

Hope this helps!