KNN with condition 2.0

Hi Community,

Building upon this post: https://community.palantir.com/t/knn-join-with-conditions/2780

Question 1: When a custom condition is used to apply a KNN join with a condition, will it:

  1. filter out the results not matching the condition after the KNN join has happened?
  2. apply the KNN search only within the rows matching the condition?
  3. or something else?

Question 2: If I set the “K nearest” parameter to a value strictly greater than 1, do I get multiple rows, as I would using the regular KNN join?

I have a use case where performance is important, which is part of the reason I want to do KNN with a condition: it would ensure the “search” is performed only within the rows where the condition is fulfilled.

Thanks a lot,
Thor

Hi @thorsb! Thanks for the post.

  1. In the post you referenced, the solution uses a custom expression to compute the distance between the two columns. That expression is a case statement on the condition: if the condition is satisfied, it returns a “reasonable” distance between the columns (the Levenshtein distance); otherwise, it returns a very large distance. So a row that fails the condition can still appear in the output dataset - it will just sit at the “bottom” of the K results for its group, since it has the largest possible distance.
  2. Yes, if you choose K > 1, you will get more than one row in the output for each row in the left input dataset. You will also get all tied results (so if k=2 and three rows are tied for shortest distance, you will get all three rows in the output).
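In plain Python, the case-expression trick might look like this sketch (a hedged illustration only: `conditional_distance`, `BIG`, and the argument names are hypothetical, not Builder's actual expression language):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

BIG = 10**9  # sentinel "very large" distance

def conditional_distance(left_cond, right_cond, left_text, right_text):
    # Mirrors the case statement described above: a real distance when
    # the condition matches, otherwise a distance so large the row sinks
    # to the bottom of the K results for its group.
    if left_cond == right_cond:
        return levenshtein(left_text, right_text)
    return BIG
```

Note that the non-matching rows still participate in the join here; they are only pushed to the end of the ranking, which is why they can still show up in the output when K is large.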

If performance is important, I would run the filter prior to the KNN board. We have an open feature request to introduce approximate KNN, which would allow more efficient algorithms, and to introduce better algorithms for known distance measures.

@david , thanks!

Re performance: I believe I am not able to filter beforehand, as I need to conduct this operation for a large number of values of the condition. I’d be curious to hear if you have other suggestions on this.

For context: if I only had one value for the condition it would be OK, but given I can have 50+, it becomes cumbersome to filter the two datasets per value, perform the KNN join, and then union the results.

Hm, that’s a tougher one - at the end of the day, just like with code, if you have a lot of conditions you might just have to eat that dev cost. One thought: you could write a reusable filter board so you only have to write it once, if the filters on both datasets are the same. Another thing to try: can you use Builder’s AIP “generate” feature to write the filter board for you?

@david , I have only one condition column. And although the brute-force method could work right now, my fear is that new values will appear in that column.

Would it be speedier to do:

  1. 2×50 filter transforms plus 50 pre-filtered KNN joins, vs.
  2. the custom-expression approach discussed above?

What might happen that would add a new condition to the filter?

If I understand right, there’s some math here: the KNN join is O(n²), which almost certainly dominates the cost of the filters (even 50 of them). If each filter shrinks its dataset to 1/√50 of the original size or smaller, the pre-filtered approach is faster; otherwise probably just doing the raw join with the filter condition is better. But really, if build performance is paramount, I’d do it both ways and see which one wins.
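That break-even point can be checked with a bit of arithmetic (illustrative numbers, assuming a brute-force O(n²) join where cost = comparisons):

```python
import math

n = 100_000  # rows per side (illustrative, not from the thread)
g = 50       # distinct condition values

# One big KNN join compares every left row to every right row.
full_cost = n * n

# If each filtered dataset shrinks to n / sqrt(50) rows, the 50
# pre-filtered joins cost exactly the same as the single big join.
break_even_rows = n / math.sqrt(g)
split_cost = g * break_even_rows ** 2   # ~= full_cost

# If the rows split evenly across the 50 values (n / 50 each),
# pre-filtering is 50x cheaper than the single join.
even_cost = g * (n / g) ** 2            # == full_cost / 50
```

So evenly distributed condition values favor pre-filtering heavily; one dominant value that keeps most rows in a single group erodes the advantage.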

To make sure it’s clear: a new value might be added to the existing condition column.

Example: Imagine you are a company producing a certain widget. You have production sites in various countries, and want to source a component closest to your description within that country.

If you open/close a production site, then the number of values would change - which in turn would affect the filters needed.

FYI: someone else suggested that I do a regular left join on the condition, followed by a transform that computes a similarity score between the embeddings, then take the top rows to pick the relevant number of results.

I was about to say the same thing - if your set of conditions is changing, it sounds like you want a dataset to populate the values, so you can just do another join to get the filter behavior. You can even make this a writeback dataset so you can build out a nice Workshop flow for opening/closing a plant.

Also yes, if you have an embeddings column and you’re trying to do nearest neighbors on the data the embeddings represent, it’s almost always going to be best to compute the distance measure on the embeddings themselves.
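A minimal sketch of that join-then-score pattern, assuming hypothetical `cond` and `emb` columns and cosine distance on the embeddings (not any specific Builder board):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def join_then_top_k(left_rows, right_rows, k):
    """Left join on the condition column, score each matched pair by
    embedding distance, then keep the k closest right rows per left row."""
    results = []
    for l in left_rows:
        # The "regular left join on the condition" step.
        matches = [r for r in right_rows if r["cond"] == l["cond"]]
        # The "compute similarity, take top rows" step.
        scored = sorted(matches, key=lambda r: cosine_distance(l["emb"], r["emb"]))
        results.append((l, scored[:k]))
    return results
```

In the widget-sourcing example above, `cond` would be the country of the production site and `emb` the embedding of the component description.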