Best Practices for Pipeline Builder Use LLM

bkaplan · November 22, 2024, 4:21am

I am wondering if there are any best practices to help speed up the initial run of a large pipeline that uses a Use LLM transform in Pipeline Builder.

A few initial ideas that I had:

Is there a way to chunk the dataset into X number of row builds so we can capture the output from each section?
Is there an ideal cluster size that we should use? (i.e. more executors? little memory?)

helenq · November 22, 2024, 6:09am

Hey @bkaplan are your builds OOM-ing or are the builds just taking a long time? If it’s the former, generally the suggestions (from the Pipeline Builder side, without bumping enrollment limits) would be to try a smaller profile or split the non-dependent parts of your pipeline into a separate pipeline so there aren’t as many concurrent requests. Note that the smaller profile might be less helpful and can make builds take longer.

If you just want the builds to go faster, it’s a bit tricky to give you a catch-all solution, because depending on your stack/enrollment configs, increasing the profile might hit more rate limits, which would end up slowing down your progress.

I would definitely turn on the Skip recomputing rows option so that all builds after the first one will at least only run the LLM on new rows it hasn’t seen. You could even split your input dataset into a few chunks to cache the initial data (first run) and as you union in the rest of the data, it will just run the LLMs on the new rows.
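
To illustrate the chunk-and-union idea above, here is a minimal Python sketch. It is not Pipeline Builder code: the names `chunk_of` and `run_incrementally`, the use of a hashed row key, and the in-memory `cache` dict are all assumptions made for illustration. The dict stands in for the Skip recomputing rows behaviour, where rows already processed in an earlier build are not sent to the LLM again.

```python
import hashlib


def chunk_of(row_key: str, num_chunks: int) -> int:
    """Deterministically assign a row to a chunk by hashing its key (hypothetical helper)."""
    digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_chunks


def run_incrementally(row_keys, num_chunks, call_llm):
    """Simulate one build per chunk, unioning in one more chunk each time.

    `cache` is a stand-in for row-level caching: a row that was processed
    in an earlier build is skipped in later builds.
    Returns the cache and the number of LLM calls made in each build.
    """
    cache = {}
    calls_per_build = []
    included = []
    for chunk in range(num_chunks):
        # Union the next chunk into the rows seen so far.
        included += [k for k in row_keys if chunk_of(k, num_chunks) == chunk]
        calls = 0
        for key in included:
            if key not in cache:  # only rows not seen before trigger a call
                cache[key] = call_llm(key)
                calls += 1
        calls_per_build.append(calls)
    return cache, calls_per_build


if __name__ == "__main__":
    keys = [f"row-{i}" for i in range(100)]
    # A trivial function stands in for the LLM call.
    results, calls = run_incrementally(keys, 4, lambda k: k.upper())
    print(calls, sum(calls))
```

Across all builds, the total number of calls equals the number of distinct rows, so chunking mainly buys earlier visibility into partial output rather than fewer calls.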

Those are my initial thoughts but I’ll let other folks chime in as well!

bkaplan · November 22, 2024, 1:45pm

They are just taking a really long time and I currently don’t have any insight into the progress of the build until at least one task finishes.

I’d want to better understand what would be the best way of fine-tuning my profile.

For example, take a case where I have 1000 rows and each row uses 1 call to an LLM that uses 10,000 tokens and takes 30 seconds. Further, assume a rate limit of 100,000 tokens a minute. How can I most effectively run as close to 10 calls as possible in that time frame?

To answer this I’d want to understand the following:

How many calls can each executor make?
Does the number of vCPUs on an executor change this?
How much memory should my profile need? Assuming that my pipeline is just using the Use LLM block, should it be as small as possible?
Is there a way to find my enrollment’s rate limits?
Is there a way to approximate the number of tokens and time it takes for the LLM to run a row?

I realize this is not an exact science, but ideally this should help me minimize build time while minimizing compute.

david · November 22, 2024, 4:21pm

Hey! Sounds like you’re wanting to maximize the throughput for your usellm board.

Each executor will kick off 16 parallel calls per vCPU.
See above: since calls are made per vCPU, the number of parallel calls scales with the vCPU count.
Memory should not make a difference for the throughput, unless you have responses or inputs so large that the executors are OOMing.
You can find enrollment limits in control panel.
You can use the test row feature at the bottom of the UseLLM board to figure out how many tokens a particular representative row might use. Unfortunately, these will not be representative of the time taken, since the time taken for a completion depends on how many retries need to happen before it succeeds. If your enrollment is being rate limited, or even if the build itself is large (and therefore you’re running into rate limits in the context of just the build), the number of retries for a row could be quite large.
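
As a rough back-of-the-envelope check, the numbers from the example in this thread (1000 rows, 10,000 tokens and 30 seconds per call, 100,000 tokens per minute) can be combined with the 16 parallel calls per vCPU figure. The Python sketch below is only an estimate under simplifying assumptions: every call uses the same number of tokens and takes the same time, there are no retries, and the rate limit is counted purely in tokens per minute. The function name `throughput_plan` is made up for illustration.

```python
import math


def throughput_plan(rows, tokens_per_call, seconds_per_call,
                    tokens_per_minute_limit, calls_per_vcpu=16):
    """Estimate the rate-limit-bound throughput (idealized: no retries, uniform calls)."""
    # Calls per minute that fit under the token rate limit.
    max_calls_per_minute = tokens_per_minute_limit // tokens_per_call
    # Calls that must be in flight at once to sustain that rate
    # (rate per second multiplied by the duration of one call).
    concurrency_needed = max_calls_per_minute * seconds_per_call / 60
    # Best-case build time if the rate limit is the only bottleneck.
    min_build_minutes = rows / max_calls_per_minute
    # vCPUs needed to supply that concurrency at calls_per_vcpu each.
    vcpus_needed = math.ceil(concurrency_needed / calls_per_vcpu)
    # Tokens per minute a single vCPU would request if never throttled.
    tokens_per_minute_one_vcpu = calls_per_vcpu * (60 / seconds_per_call) * tokens_per_call
    return {
        "max_calls_per_minute": max_calls_per_minute,
        "concurrency_needed": concurrency_needed,
        "min_build_minutes": min_build_minutes,
        "vcpus_needed": vcpus_needed,
        "tokens_per_minute_one_vcpu": tokens_per_minute_one_vcpu,
    }


if __name__ == "__main__":
    plan = throughput_plan(rows=1000, tokens_per_call=10_000,
                           seconds_per_call=30, tokens_per_minute_limit=100_000)
    for name, value in plan.items():
        print(f"{name}: {value}")
```

Under these assumptions the limit allows 10 calls per minute, which needs only about 5 calls in flight and puts the best-case build time at roughly 100 minutes. A single vCPU making 16 parallel calls would request about 320,000 tokens per minute, so in this example the rate limit, not the profile size, is the likely bottleneck, and a larger profile would mostly add retries.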

bkaplan · November 22, 2024, 4:37pm

Where in control panel can I find the rate limits?