I have been working with compute modules to expose LLMs in Bedrock via the Bedrock SDK.
Here is my current setup:
- A compute module with a function named InvokeLLM that takes input messages, calls Bedrock, and returns the response
- A webhook that calls the compute module
- A TypeScript function that calls the webhook. I had to do this because, as per a previous post, I cannot call a compute function directly from a TypeScript function
- The function published via TypeScript now shows up as a model to choose from in the list of registered models in AIP Logic
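For concreteness, here is a minimal sketch of what an InvokeLLM handler like the one above might look like in Python. The function and parameter names, the helper `build_request_body`, and the Anthropic-style request shape are my assumptions for illustration, not the actual setup described in this post:

```python
import json

# Hypothetical sketch of an InvokeLLM-style handler (names are assumptions).
# The Bedrock call itself goes through boto3's bedrock-runtime client.

def build_request_body(messages, max_tokens=1024):
    """Build an Anthropic-messages-style request body for Bedrock."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": messages,
    })

def invoke_llm(messages, model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Call Bedrock with the given chat messages and return the parsed response."""
    import boto3  # deferred import: only needed when actually calling Bedrock
    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=model_id,
        body=build_request_body(messages),
    )
    return json.loads(response["body"].read())
```

The point relevant to the question below is that each call to `invoke_llm` is independent and blocks on the Bedrock request, so whether two such calls run at the same time is decided entirely by the compute module runtime, not by this code.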
My question is: if multiple users use the same model at the same time, will the compute module process these requests sequentially?
I guess I can control the RPM at the webhook level; if the webhook is configured with a concurrency limit of 10, I would assume up to 10 of these requests will hit the compute module at once.
Will the compute module process these requests sequentially?
If so, how do I make the compute module process more than one request at a time?
Based on the documentation, I understand that the compute module manages concurrency when using the SDK. This concurrency is set when configuring the container, and the replicas will scale to meet the concurrency target.
So I need to make sure that I have enough scaling configured in both the webhook and the compute module settings.