Pipeline Builder with LLM for 100K+ rows

I am trying to use an LLM (large language model) to generate categories, keywords, and embeddings for over 100K items in a pipeline, but the workload is too large to run in a single build. What is the most resource-efficient and lowest-risk way to run this?

Hey, can you say more about why it’s “too large”? We’ve seen pipelines run at far greater scale before.

Hey @jchon, if this is still a problem, you can select the “Skip recomputing rows” option and run your pipeline in increments over your input data.

You can read more about it here: https://www.palantir.com/docs/foundry/pipeline-builder/pipeline-builder-llm/#skip-computing-already-processed-rows
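If you’d rather script the increments yourself (e.g., in a code repository rather than the Pipeline Builder UI), one approach is to bucket rows by a stable hash of a key column and process one bucket per build. A minimal PySpark sketch of that idea; `NUM_BUCKETS`, `BATCH`, the `item_id` column, and the file paths are illustrative assumptions, not Foundry settings:

```python
# Hypothetical sketch: process one hash bucket of the input per run, so each
# build only sends a slice of the 100K rows to the LLM. NUM_BUCKETS and BATCH
# are illustrative knobs, not Pipeline Builder settings.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

NUM_BUCKETS = 10   # ~10K rows per increment for a 100K-row input
BATCH = 0          # bump this (0..NUM_BUCKETS-1) on each successive run

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("items.parquet")  # placeholder input path

# Stable bucket assignment: the same row always lands in the same bucket,
# so re-running a bucket pairs naturally with "Skip recomputing rows".
bucketed = df.withColumn("bucket", F.abs(F.xxhash64("item_id")) % NUM_BUCKETS)
increment = bucketed.filter(F.col("bucket") == BATCH).drop("bucket")

# Append this slice to the dataset that feeds the LLM step
increment.write.mode("append").parquet("items_to_enrich.parquet")
```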

Hey @jchon,

What is the nature of the task? A categorization task can often be done faster and at scale with a pre-trained ML model, or with one trained on synthetic data. Pre-trained embeddings models also exist, and you should be able to use those at scale to generate vector embeddings for 100K rows (see the sketch below). Failing that, you could always try changing your build settings in Pipeline Builder to something larger and “natively accelerated.”
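For the embeddings piece specifically, here is a minimal sketch of the pre-trained route using the open-source sentence-transformers library; the model name, batch size, and toy input list are illustrative assumptions, not part of Pipeline Builder:

```python
from sentence_transformers import SentenceTransformer

# Small, widely used pre-trained model; swap in whatever fits your domain.
model = SentenceTransformer("all-MiniLM-L6-v2")

# In practice this would be your 100K item texts; a toy list keeps the sketch runnable.
texts = [
    "red cotton t-shirt",
    "wireless noise-cancelling headphones",
    "stainless steel water bottle",
]

embeddings = model.encode(
    texts,
    batch_size=64,              # encode() batches internally, so 100K rows is routine
    show_progress_bar=True,
    normalize_embeddings=True,  # unit-length vectors, convenient for cosine similarity
)
print(embeddings.shape)  # (3, 384) for this model
```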

Additionally, ask yourself how much of the data actually needs to be processed and how much of it is repeated. If you have a lot of duplicates, you can turn on “Skip recomputing rows,” which caches LLM queries so you aren’t re-running the same calls; the sketch below shows the underlying idea.
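For intuition, here is a minimal pandas sketch of the dedupe-then-join pattern that “Skip recomputing rows” automates; `call_llm` is a hypothetical stand-in for your actual model call:

```python
import pandas as pd

def call_llm(text: str) -> str:
    """Placeholder for a real LLM call that returns a category."""
    return "category-for:" + text

df = pd.DataFrame({"description": ["shoe", "hat", "shoe", "hat", "shoe"]})

# Call the LLM once per distinct value: 2 calls here instead of 5.
unique_inputs = df["description"].drop_duplicates()
results = {text: call_llm(text) for text in unique_inputs}

# Map the cached results back onto every row.
df["category"] = df["description"].map(results)
print(df)
```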

Let me know if that was helpful :slight_smile: