We successfully built some agents and exposed them to our customers. Our production websites handle quite heavy load, and we need to make sure the agents can cope with that volume of requests as well. We ran some load tests on one of them and the results weren't great: on average, response times degraded drastically once there were more than 5 requests in parallel.
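For context, this is roughly how we drove the load test - a minimal sketch where `fake_agent_call` stands in for the real HTTP call to our agent endpoint (endpoint and auth details omitted here):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latencies(call_agent, n_requests, concurrency):
    """Fire n_requests at call_agent with the given parallelism and
    return the observed per-request latencies in seconds."""
    def timed_call(i):
        start = time.perf_counter()
        call_agent(i)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(n_requests)))

if __name__ == "__main__":
    # Stand-in for the real request to the agent; in our test this was
    # an HTTP call to the Agents external API.
    def fake_agent_call(i):
        time.sleep(0.05)

    for concurrency in (1, 5, 20):
        latencies = measure_latencies(fake_agent_call, 40, concurrency)
        print(f"concurrency={concurrency:2d} "
              f"mean={statistics.mean(latencies) * 1000:.1f}ms "
              f"p95={sorted(latencies)[int(len(latencies) * 0.95)] * 1000:.1f}ms")
```

With the real endpoint, the mean and p95 latencies were roughly flat up to 5 parallel requests and degraded sharply beyond that.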
We therefore need several things that we weren't able to find anything about in the documentation:
Logging and Monitoring
Request codes - are there errors?
Response times - how long do requests take?
Which prompts are sent to the agent
Scaling
How can we scale the agents so that they can handle more requests?
Hi Peter! Would be happy to try and help out here.
In terms of logging, we are actively working on a first-class solution to this and hope to ship this sometime early next year! If an agent is set to indefinite retention and all of its resources are within the same Compass project, we will log the messages (and their metadata) from each session to a Foundry dataset that can then be viewed by the builder of the agent. We are still scoping out the exact details here, but this should be along the lines of what you are describing above. As an aside, request codes are also documented in the Foundry API documentation here!
In order to scale agents for your workload, we could definitely advise based on the details of your workflow. Could you offer some more information on your current setup and how you are using the Agents external API? For example, are you using a third-party service user? Also, what volume of requests do you expect? We can horizontally scale the AIP Agents service for your deployment on a case-by-case basis, so any additional information would be helpful!
For caching, if the model temperature is 0 then we do cache responses for identical inputs, but could you expand on how you would expect this caching to operate? We can't lean on caching too heavily without sacrificing the non-determinism of LLMs, but if you have a list of frequently asked questions from users then you could definitely inject that as context for the agent. For example, you could store FAQs as objects and leverage either ontology context or custom retrieval context to attach this to the agent's prompt as soon as the user's query is sent. This could help the agent avoid spending unnecessary time on tool calls to retrieve the context for its response.
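To illustrate conceptually what the temperature-0 caching amounts to, here is a minimal client-side sketch - `call_llm` is a placeholder function, not an actual Foundry API, and the real caching happens inside the platform:

```python
# Sketch of deterministic-response caching: when temperature is 0, the
# completion for a given prompt is stable, so identical prompts can be
# answered from a cache without a second model call. Non-zero temperatures
# bypass the cache to preserve response variability.
def make_cached_client(call_llm):
    cache = {}

    def cached_call(prompt, temperature=0.0):
        if temperature == 0.0 and prompt in cache:
            return cache[prompt]
        response = call_llm(prompt, temperature)
        if temperature == 0.0:
            cache[prompt] = response
        return response

    return cached_call
```

The key point is that only deterministic (temperature-0) calls are safe to cache this way.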
Thanks so much for your answer and the additional information.
Logging:
It sounds like you're working on exactly what we need. One follow-up question on the link to the request codes - is there already any way to log them in the platform itself, or are they only visible to the external application that is calling Foundry?
Scaling:
We are using the agents on parts of our media websites, and the possible request volume depends heavily on the use case - it can reach up to several thousand concurrent users. If we have a production use case that we need to scale up, what would the process look like - who would we need to reach out to, and how long would it take?
Caching:
Thanks for the information about the temperature - that already helps quite a lot. Ideally there would be functionality that lets us predefine answers for common questions so they never have to hit an LLM at all. Many of our users want to interact with the product as easily as possible (which includes that they don't want to type but click instead, and a cached answer would be perfectly fine for them). In the end we are looking for functionality like OpenAI provides, for example (unfortunately I can't include links, but if you search for prompt caching and OpenAI you'll find it immediately).
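To make it concrete, the behaviour we have in mind is roughly this (the FAQ entries below are just made-up examples):

```python
# Clickable FAQ entries are answered from a predefined table; only
# free-text queries that don't match fall through to the agent/LLM.
FAQ_ANSWERS = {
    "opening hours": "We are open Mon-Fri, 9:00-18:00.",
    "cancel subscription": "You can cancel anytime under Account > Subscription.",
}

def answer(query, ask_llm):
    key = query.strip().lower()
    if key in FAQ_ANSWERS:
        return FAQ_ANSWERS[key]  # predefined answer, no LLM call at all
    return ask_llm(query)        # fall back to the agent
```

For the click-based interactions this would mean zero LLM latency and zero token cost on the most common questions.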
One more factor in scaling the agents is the tokens-per-second threshold of the configured model. Ideally we could configure that limit for just one dedicated agent.
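By a per-agent threshold we mean something along the lines of a token bucket scoped to one agent - a rough sketch with made-up numbers, not how Foundry implements its limits:

```python
import time

class TokenBudget:
    """Token-bucket sketch of a per-agent tokens-per-second limit:
    tokens refill continuously at `tokens_per_second`, up to `burst`."""

    def __init__(self, tokens_per_second, burst):
        self.rate = tokens_per_second
        self.capacity = burst
        self.available = burst
        self.last = time.monotonic()

    def try_consume(self, tokens):
        # Refill based on elapsed time, then consume if enough is available.
        now = time.monotonic()
        self.available = min(self.capacity,
                             self.available + (now - self.last) * self.rate)
        self.last = now
        if tokens <= self.available:
            self.available -= tokens
            return True
        return False
```

A high-traffic agent could then get its own generous budget without raising the limit for every other agent on the same model.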