I’m working on an entity resolution pipeline to match parts across a business. I have ~5 million parts and their descriptions. My plan is to use embeddings + cosine similarity to group parts, but I’m concerned about the compute/time cost of embedding all 5 million descriptions. I’m already planning to use approximate nearest neighbor (ANN) search to minimize pairwise comparisons after embedding, but the initial embedding step itself seems quite computationally expensive.
Does anyone have ideas for making this embedding process more efficient/less computationally expensive? Much appreciated!
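For context, the embedding step I have in mind looks roughly like this (a minimal sketch; the model name, batch size, and sample descriptions are placeholders, not final choices):

```python
# Sketch of the embedding step (model and batch size are illustrative).
from sentence_transformers import SentenceTransformer

# A compact model keeps the cost down; "all-MiniLM-L6-v2" is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "M6 x 20mm hex bolt, stainless steel",
    "Hex bolt M6x20 SS",
    "1/4in rubber gasket",
]

# Batching amortizes per-call overhead; normalizing the embeddings makes
# cosine similarity a plain dot product downstream.
embeddings = model.encode(
    descriptions,
    batch_size=256,
    normalize_embeddings=True,
    show_progress_bar=True,
)
print(embeddings.shape)  # (3, 384) for this particular model
```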
I don’t have a solution that is purely native to Foundry, but if you’re OK with doing a data pass to DataLinks and then back to Foundry, we can help you solve that problem. DataLinks specializes in hydrating the ontology, and a lot of what we do is exactly in the realm of entity resolution and automated modeling.
What I would recommend is exporting your parts table, letting DataLinks automatically detect and perform entity resolution to create compatible links, and then either exporting the data back to Foundry or consuming it directly from Workshop via Functions.
Let me know if this would work and I can put together a tutorial and drop it here. Alternatively, this should work with any external service specialized in entity resolution.
For now, I am trying to run the embedding process once in Pipeline Builder, and I wrote a transform that uses ANN (Spark LSH) to identify which materials should be considered identical (rough sketch below). Since there shouldn’t be too many new entries to the dataset, the one-time embedding cost should hopefully be reasonable.
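The transform looks roughly like this (a sketch, not my exact code; the `part_id`/`embedding` column names, `bucketLength`, and the distance threshold are placeholders I still need to tune):

```python
# Candidate-pair generation with Spark LSH (column names are illustrative).
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.functions import array_to_vector
from pyspark.sql import functions as F

def candidate_pairs(df):
    # Spark LSH expects a Vector column, not an array<double> column.
    df = df.withColumn("features", array_to_vector("embedding"))

    # BucketedRandomProjectionLSH hashes by Euclidean distance; on
    # unit-normalized vectors, dist^2 = 2 * (1 - cosine_sim), so
    # cosine_sim >= 0.9 corresponds to dist <= sqrt(0.2) ~= 0.447.
    lsh = BucketedRandomProjectionLSH(
        inputCol="features",
        outputCol="hashes",
        bucketLength=0.5,  # tune: smaller buckets prune more candidates
        numHashTables=3,   # tune: more tables raise recall and cost
    )
    model = lsh.fit(df)

    # Self-join to find near-duplicate pairs within the distance threshold.
    pairs = model.approxSimilarityJoin(df, df, threshold=0.45, distCol="dist")

    # Drop self-matches and keep each unordered pair once.
    return (
        pairs
        .where(F.col("datasetA.part_id") < F.col("datasetB.part_id"))
        .select(
            F.col("datasetA.part_id").alias("part_a"),
            F.col("datasetB.part_id").alias("part_b"),
            "dist",
        )
    )
```

Note that the distance threshold only maps cleanly onto a cosine cutoff because the embeddings are unit-normalized, so that’s worth enforcing in the embedding step.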
If I run into any bottlenecks or issues with this, I will definitely explore these other options further!