Ability to search strings and symbols using keyword search in workshop

Lee · November 11, 2024, 1:42pm

I’m having trouble with keyword searches when using the filter list in Workshop. I’m trying to find properties containing specific words, but the results are inconsistent.

For example, searching for “test” doesn’t find the property “2024_test_新新新”, but searching for “新新新” does. I read something suggesting that parts of a name after an underscore “_” might be ignored, but that doesn’t seem to be the case here since “新新新” is found even though it comes after an underscore.

I referred to this documentation:
https://community.palantir.com/t/ability-to-search-strings-within-filter-selections/1373

Also, searching for “Sk” does find “2024_○○線Sk○_○○巡”. So, it’s not the language or the number of underscores that’s the problem.

Could you explain why this happens, and how I can find the property “2024_test_新新新” by searching for “test”?

sandpiper · November 12, 2024, 8:41am

Keyword search functionality in Object Storage V2 is backed by Apache Lucene and uses the standard tokenizer (see https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html), which in turn is based upon the Unicode Text Segmentation algorithm described at https://unicode.org/reports/tr29/. This algorithm breaks around all ideographs (including hanzi/kanji), but does not break around underscores, so an underscore stays attached to the letters and digits on either side of it. The example strings that you shared would therefore be tokenized in the following ways (assuming for the sake of argument that the ○ in the second example are ASCII characters).

2024_test_新新新 → [2024_test_, 新, 新, 新]

2024_○○線Sk○_○○巡 → [2024_○○, 線, Sk○_○○, 巡]
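
To make the two tokenizations above easier to reproduce, here is a rough Python approximation. It is not Lucene's StandardTokenizer: the hypothetical `approx_tokens` helper only handles ASCII letters, digits, underscores and CJK unified ideographs, and ignores the rest of the Unicode segmentation rules. The second sample string substitutes ASCII letters for the ○ placeholders, as assumed above.

```python
import re

# Rough approximation only (an assumption for illustration, not Lucene itself):
# - each CJK unified ideograph becomes its own token
# - a run of ASCII letters, digits and underscores stays together as one token
_TOKEN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]|[A-Za-z0-9_]+")


def approx_tokens(text):
    """Return the approximate token list for `text`."""
    return _TOKEN.findall(text)


print(approx_tokens("2024_test_新新新"))
# ['2024_test_', '新', '新', '新']

print(approx_tokens("2024_AB線SkC_DE巡"))
# ['2024_AB', '線', 'SkC_DE', '巡']
```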

The above tokenization behavior, combined with the fact that keyword search only matches on token prefixes (but does not match on strings within tokens), explains the behavior that you are observing. “test” sits in the middle of the token “2024_test_”, so it is not a prefix of any token, whereas “Sk” is the start of the token “Sk○_○○” and “新” is a token in its own right. There is no way to find “2024_test_新新新” by searching for “test” unless you preprocess the string in the pipeline (possibly adding another, separate property dedicated to keyword search) to replace the underscores with spaces.
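
A minimal Python sketch of the workaround, under stated assumptions: `approx_tokens` is a simplified stand-in for the real tokenizer (ASCII alphanumerics, underscores and CJK unified ideographs only), `prefix_match` models a single-word query with no case normalization, and `to_search_text` is a hypothetical helper for building the separate search property. None of these names come from the platform.

```python
import re

# Simplified stand-in for the tokenizer (assumption, not Lucene itself).
_TOKEN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]|[A-Za-z0-9_]+")


def approx_tokens(text):
    """Split into single ideographs and runs of ASCII letters/digits/underscores."""
    return _TOKEN.findall(text)


def prefix_match(query, value):
    """True if a single-word query is a prefix of any token in `value`."""
    return any(token.startswith(query) for token in approx_tokens(value))


def to_search_text(name):
    """Build the value for a dedicated search property: underscores become spaces."""
    return name.replace("_", " ")


name = "2024_test_新新新"

print(prefix_match("test", name))                  # False: "test" is inside "2024_test_"
print(prefix_match("test", to_search_text(name)))  # True: "test" is now its own token
print(prefix_match("Sk", "2024_AB線SkC_DE巡"))      # True: "Sk" starts the token "SkC_DE"
```

In a pipeline, the same underscore-to-space replacement would be applied in whichever transform produces the object's backing dataset, writing the result to the extra property while leaving the original name untouched.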