KNN with condition 2.0

Hi Community,

Building upon this post: https://community.palantir.com/t/knn-join-with-conditions/2780

Question 1: When a custom condition is used to apply a KNN join with a condition, will it:

  1. filter out the results not matching the condition after the KNN join has happened?
  2. apply the KNN search only within the rows matching the condition?
  3. or something else?

Question 2: If I set the “K nearest” parameter to a value strictly greater than 1, do I get multiple rows, as I would using the regular KNN join?

I have a use case where performance is important, which is part of the reason I want to do KNN with a condition: it would ensure the “search” is performed only within the rows where the condition is fulfilled.

Thanks a lot,
Thor

Hi @thorsb! Thanks for the post.

  1. In the post you referenced, the solution uses a custom expression to compute the distance between the two columns. That expression is a case statement on the condition: if the condition is satisfied, it returns a “reasonable” distance between the columns (the Levenshtein distance); otherwise, it returns a very large distance. So a row that fails the condition can still appear in the output dataset - it will just sit at the “bottom” of the K results for its group, since it has the largest possible distance.
  2. Yes, if you choose K > 1, you will get more than one row in the output for each row in the left input dataset. You will also get all tied results (so if k=2 and three rows are tied for shortest distance, you will get all three rows in the output).
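In plain Python, the case-expression trick might look like this sketch (a hedged illustration only: `conditional_distance`, `BIG`, and the argument names are hypothetical, not Builder's actual expression language):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

BIG = 10**9  # sentinel "very large" distance

def conditional_distance(left_cond, right_cond, left_text, right_text):
    # Mirrors the case statement described above: a real distance when
    # the condition matches, otherwise a distance so large the row sinks
    # to the bottom of the K results for its group.
    if left_cond == right_cond:
        return levenshtein(left_text, right_text)
    return BIG
```

Note that the non-matching rows still participate in the join here; they are only pushed to the end of the ranking, which is why they can still show up in the output when K is large.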

If performance is important, I would run the filter prior to the KNN board. We have an open feature request to introduce approximate KNN, which would allow more efficient algorithms, and to introduce better algorithms for known distance measures.

@david , thanks!

Re performance: I believe I am not able to filter beforehand, as I need to conduct this operation for a large number of values of the condition. I’d be curious to hear if you have other suggestions on this.

For context: if I only had one value for the condition it would be OK, but given I can have 50+, it becomes cumbersome to filter the two datasets per value, perform the KNN join, and then union the results.

Hm, that’s a tougher one - at the end of the day, just like with code, if you have a lot of conditions you might just have to eat that dev cost. One thought: you could write a reusable filter board so you only have to write it once, if the filters on both datasets are the same. Another thing to try: can you use Builder’s AIP “generate” feature to write the filter board for you?

@david , I have only one condition column. And although the brute-force method could work right now, my fear is that new values will appear in that column.

Would it be speedier to do:

  1. 2×50 filter transforms plus 50 pre-filtered KNN joins, vs.
  2. the custom-expression approach discussed above?

What might happen that would add a new condition to the filter?

If I understand right, there’s some math here: the KNN join is O(n²), which almost certainly dominates the cost of the filters (even 50 of them). If each filter shrinks its dataset to 1/√50 of the original size or smaller, the pre-filtered approach is faster; otherwise probably just doing the raw join with the filter condition is better. But really, if build performance is paramount, I’d do it both ways and see which one wins.
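That break-even point can be checked with a bit of arithmetic (illustrative numbers, assuming a brute-force O(n²) join where cost = comparisons):

```python
import math

n = 100_000  # rows per side (illustrative, not from the thread)
g = 50       # distinct condition values

# One big KNN join compares every left row to every right row.
full_cost = n * n

# If each filtered dataset shrinks to n / sqrt(50) rows, the 50
# pre-filtered joins cost exactly the same as the single big join.
break_even_rows = n / math.sqrt(g)
split_cost = g * break_even_rows ** 2   # ~= full_cost

# If the rows split evenly across the 50 values (n / 50 each),
# pre-filtering is 50x cheaper than the single join.
even_cost = g * (n / g) ** 2            # == full_cost / 50
```

So evenly distributed condition values favor pre-filtering heavily; one dominant value that keeps most rows in a single group erodes the advantage.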

To make sure it’s clear: a new value might be added to the existing condition column.

Example: Imagine you are a company producing a certain widget. You have production sites in various countries, and want to source a component closest to your description within that country.

If you open/close a production site, then the number of values would change - which in turn would affect the filters needed.

FYI: someone else suggested that I do a regular left join on the condition, followed by a transform that computes a similarity score between the embeddings, then take the top rows to pick the relevant number of results.

I was about to say the same thing - if your set of conditions is changing, it sounds like you want a dataset to populate the values, so you can just do another join to get the filter behavior. You can even make this a writeback dataset so you can build out a nice Workshop flow for opening/closing a plant.

Also yes, if you have an embeddings column and you’re trying to do nearest neighbors on the data the embeddings represent, it’s almost always going to be best to compute the distance measure on the embeddings themselves.
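A minimal sketch of that join-then-score pattern, assuming hypothetical `cond` and `emb` columns and cosine distance on the embeddings (not any specific Builder board):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def join_then_top_k(left_rows, right_rows, k):
    """Left join on the condition column, score each matched pair by
    embedding distance, then keep the k closest right rows per left row."""
    results = []
    for l in left_rows:
        # The "regular left join on the condition" step.
        matches = [r for r in right_rows if r["cond"] == l["cond"]]
        # The "compute similarity, take top rows" step.
        scored = sorted(matches, key=lambda r: cosine_distance(l["emb"], r["emb"]))
        results.append((l, scored[:k]))
    return results
```

In the widget-sourcing example above, `cond` would be the country of the production site and `emb` the embedding of the component description.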