Why write Evals?
- Adapt at warp speed: Evals let you pivot quickly (i.e., compare and switch between models and model versions) and keep switching costs low without compromising quality. You’ll know exactly what changed and why.
- Spot issues fast: Testing prevents accidental regressions in your AI as you keep iterating.
- Build trust & alignment: Show customers/users your AI meets real-world requirements — and keeps meeting them after every change.
Quick Overview of the Evals Workflow
Here's a worksheet that steps through each of these stages in granular detail.
- Define the Problem & AI Task: Identify exactly what you're using AI for (e.g., summarizing documents, classifying tickets, extracting entities).
- Collect/Create Test Cases: Gather real data and/or create synthetic test cases that represent a variety of inputs.
- Decide on Evaluation Methods:
- Deterministic checks (e.g., regex, exact match, numeric ranges)
- LLM-as-Judge for open-ended/subjective output (e.g., comprehensiveness, groundedness)
- Ideally, keep your evaluators to simple pass/fail outputs (see the sketch below).
- Run, Observe, Fix: Evaluate, see where it fails, fix the system.
- Repeat: Over time, you’ll refine your tests, add new ones, and keep track of improvements.
Tip: Start small; add complexities later if needed. Don’t let perfect be the enemy of good — ship an initial eval suite and learn from it.
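To make the "simple pass/fail" guidance concrete, here is a minimal sketch in TypeScript of what deterministic evaluators can look like. This is not the actual AIP Evals API; the `EvalResult` shape and function names are illustrative assumptions.

```typescript
// Illustrative only -- not the AIP Evals API. A deterministic evaluator takes
// the function's output (and optionally the expected value) and returns a
// simple pass/fail verdict with a short reason.
interface EvalResult {
  pass: boolean;
  reason: string;
}

// Numeric-range check, e.g., for a function that must return a score in [min, max].
function scoreInRange(output: number, min: number, max: number): EvalResult {
  const pass = output >= min && output <= max;
  return {
    pass,
    reason: pass
      ? `score ${output} is within [${min}, ${max}]`
      : `score ${output} falls outside [${min}, ${max}]`,
  };
}

// Regex check, e.g., "the output must contain at least one bullet point".
function matchesPattern(output: string, pattern: RegExp): EvalResult {
  const pass = pattern.test(output);
  return { pass, reason: pass ? "pattern matched" : `no match for ${pattern}` };
}
```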
Example #1: Summarization
Problem Overview
A function takes in a block of text and returns a summary.
What Evaluators We Used
- Out of the box in AIP Evals
- String length check (summary must be < n characters).
- Regex match (summary must be, e.g., bulleted).
- Keyword checker (summary must mention key terms such as "X" or "Y" if they appear in the source).
- Custom No-Code (Evaluator backed by AIP Logic)
- LLM-as-a-Judge: “Does this {summary bullet, sentence, etc.} accurately reflect [the input text]?”
- We've found it effective to force the LLM to return supporting quotes from the input text; these let you validate your validator and are helpful when writing evals on this evaluator.
- Provide 2–3 examples of grounded vs. ungrounded summaries as few-shot examples so the LLM judge sees how a “pass” vs. “fail” looks.
- Custom Pro-Code (Evaluator backed by a Function)
- Word count (summary must be ~150 words); a combined sketch of these checks follows this list.
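Here is a hedged pro-code sketch of the deterministic checks above, plus an example groundedness prompt for the judge. The thresholds, placeholder key terms, and the `checkSummary` / `groundednessPrompt` names are assumptions, not AIP Evals APIs.

```typescript
// Illustrative pro-code sketch of the deterministic summary checks above;
// thresholds and key terms are placeholders for your own requirements.
interface SummaryCheckResult {
  pass: boolean;
  reason: string;
}

function checkSummary(source: string, summary: string): SummaryCheckResult {
  const maxChars = 1000;    // "summary must be < n characters"
  const targetWords = 150;  // "summary must be ~150 words"
  const wordTolerance = 30;

  // String length check.
  if (summary.length >= maxChars) {
    return { pass: false, reason: `summary is ${summary.length} chars (limit ${maxChars})` };
  }

  // Word count check.
  const wordCount = summary.trim().split(/\s+/).length;
  if (Math.abs(wordCount - targetWords) > wordTolerance) {
    return { pass: false, reason: `word count ${wordCount} is not within ${wordTolerance} of ${targetWords}` };
  }

  // Keyword check: if a key term appears in the source, it must appear in the summary.
  const keyTerms = ["X", "Y"]; // placeholders for domain-specific terms
  for (const term of keyTerms) {
    if (source.includes(term) && !summary.includes(term)) {
      return { pass: false, reason: `source mentions "${term}" but summary does not` };
    }
  }

  // Regex check: every non-empty line should be a bullet.
  const bulleted = summary
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .every((line) => /^\s*[-*•]/.test(line));
  if (!bulleted) {
    return { pass: false, reason: "summary is not formatted as a bulleted list" };
  }

  return { pass: true, reason: "all deterministic checks passed" };
}

// Prompt sketch for the LLM-as-a-Judge groundedness check; the wording is
// illustrative, and asking for supporting quotes makes the judge easier to spot-check.
const groundednessPrompt = (sourceText: string, bullet: string) => `
Does the following summary bullet accurately reflect the source text?
Summary bullet: "${bullet}"
Source text: """${sourceText}"""
Answer PASS or FAIL, and quote the sentence(s) from the source that support your answer.
`;
```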
What We Changed As A Result
- Prompt Engineering: For example, "Ensure any mention of X or Z in the source is explicitly reflected in the summary" or "Be concise."
Example #2: Classification
Problem Overview
A function takes in some input and classifies it into two or more categories (e.g., yes/no, pass/fail, a 1–10 score). Optional (but recommended): the LLM is prompted to provide an explanation before the output classification.
What Evaluators We Used
- Out of the box in AIP Evals
- Exact string/integer match (expected vs. actual classification)
- String length (on explanation)
- Keyword checker (on explanation)
- Numeric range (for scores)
- Custom No-Code (Evaluator backed by AIP Logic)
- LLM-as-a-Judge: e.g., "Does this output explanation align with this output classification?" (a sketch of these checks follows this list)
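A hedged sketch of the classification checks above: the `ClassificationOutput` shape, the example labels, and the judge prompt wording are illustrative assumptions, not the AIP Evals API.

```typescript
// Illustrative shapes and checks for a classification evaluator.
interface ClassificationOutput {
  label: string;        // e.g., "yes" | "no"
  explanation: string;  // the rationale the LLM was prompted to give
}

// Exact string match between expected and actual classification.
function exactLabelMatch(expected: string, actual: ClassificationOutput): boolean {
  return expected.trim().toLowerCase() === actual.label.trim().toLowerCase();
}

// Keyword check on the explanation: at least one expected term should appear.
function explanationMentionsKeyword(actual: ClassificationOutput, keywords: string[]): boolean {
  const text = actual.explanation.toLowerCase();
  return keywords.some((k) => text.includes(k.toLowerCase()));
}

// Prompt template for an LLM-as-a-Judge check that the explanation actually
// supports the chosen label (the wording is an example, not prescriptive).
const judgePrompt = (o: ClassificationOutput) => `
You are grading a classifier. Classification: "${o.label}".
Explanation given: "${o.explanation}".
Does the explanation logically support the classification? Answer PASS or FAIL, then give one sentence of reasoning.
`;
```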
What We Changed & Impact of Iterating with Evals
- Prompt Engineering: Refine the classification criteria for each potential output, give further instructions for more concise/relevant explanations, etc.
- OAG (Ontology-Augmented Generation): Even better if you can represent the classification criteria as an object type rather than hardcoding them into the prompt (see the sketch after this list).
- Use of tools and deterministic processes: Augment the LLM with deterministic functions and tools for more consistent results.
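To illustrate the OAG idea of "criteria as data", here is a sketch: the `ClassificationCriterion` shape and `buildClassificationPrompt` helper are hypothetical, and in AIP the criteria array would be loaded from your Ontology object type rather than defined in code.

```typescript
// Hypothetical shape for classification criteria kept as data (e.g., backed by
// an Ontology object type) instead of being hardcoded into the prompt text.
interface ClassificationCriterion {
  label: string;       // e.g., "approve" | "reject"
  definition: string;  // what qualifies an input for this label
  example: string;     // a short illustrative input
}

// Building the prompt from the criteria means domain experts refine the
// criteria objects -- not the prompt string -- as definitions evolve.
function buildClassificationPrompt(criteria: ClassificationCriterion[], input: string): string {
  const rubric = criteria
    .map((c) => `- ${c.label}: ${c.definition} (example: ${c.example})`)
    .join("\n");
  return `Classify the input into exactly one of the following categories:\n${rubric}\n\nInput:\n${input}\n\nFirst explain your reasoning, then output the category label.`;
}
```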
Example #3: Semantic Search
Problem Overview
A function takes in a search query and returns the top K most semantically similar objects.
What Evaluators We Used
- Out of the box in AIP Evals
- Exact Object match (e.g., if you’re returning the top search result)
- Custom No-Code (Evaluator backed by AIP Logic)
- LLM-as-a-Judge: Score/classify each search result on quality/relevance, then average the scores.
- Custom Pro-Code (Evaluator backed by a Function)
- Recall %: the fraction of expected relevant objects that appear in the returned results (i.e., relevant results retrieved vs. total relevant results); a short recall sketch follows this list.
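A minimal sketch of the recall computation; the object IDs and the 0.8 pass threshold are illustrative assumptions.

```typescript
// Recall@K for a top-K semantic search evaluator: what share of the
// expected relevant objects actually came back in the results?
function recallAtK(expectedRelevantIds: string[], returnedIds: string[]): number {
  if (expectedRelevantIds.length === 0) return 1; // nothing expected => trivially recalled
  const returned = new Set(returnedIds);
  const hits = expectedRelevantIds.filter((id) => returned.has(id)).length;
  return hits / expectedRelevantIds.length;
}

// Example: 2 of 3 expected objects appear in the top-5 results => recall ≈ 0.67.
const recall = recallAtK(["obj-1", "obj-2", "obj-3"], ["obj-9", "obj-1", "obj-4", "obj-3", "obj-7"]);
// A simple pass/fail wrapper keeps the evaluator binary, per the advice above.
const pass = recall >= 0.8;
```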
What We Changed & Impact of Iterating with Evals
- Upgrade Embedding Model: We switched from text-embedding-ada-002 to text-embedding-3-small.
- Tune Queries/Embedding Process: Tuned the logic to generate more descriptive search queries and to embed more semantically relevant properties of the embedded objects.
- Impact: With one partner, we saw the quality of search results improve across the board, raising both the floor and the ceiling of result quality.
Key Patterns Across Examples
- Start with a small set of pass/fail checks.
- Don’t overcomplicate scoring scales / rubrics.
- Bring domain experts in to define what “correct” means.
- Your goal is to use them to sniff out hidden rules and corner cases and to codify their tribal knowledge as evals (both for your main function under test and for any LLM-as-a-Judge evaluators).
- Use real data + synthetic data:
- Real data captures actual complexities; synthetic data helps push edge cases systematically.
- LLM-as-Judge is powerful but needs clear prompts and examples of “pass vs. fail.”
- Write Evals for your LLM-as-a-Judge functions to ensure alignment between their outputs and human preferences. You can often reuse the same test cases between your main LLM function and your LLM-as-a-Judge functions (a sketch of this alignment check follows this list).
- Watch out for:
- Cost or latency overhead (as we all know, LLM calls aren't free or instantaneous).
- Deciding whether to fix prompt instructions, update few-shot examples, or add new tests.
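The judge-alignment point lends itself to a small harness. Below is a hedged sketch of comparing an LLM judge's pass/fail verdicts against human labels on a shared set of test cases; the `JudgeCase` shape and the `runJudge` callback are hypothetical stand-ins for however you invoke your judge.

```typescript
// "Evals for your evals": measure how often the LLM judge agrees with
// human pass/fail labels on a shared set of test cases.
interface JudgeCase {
  input: string;
  output: string;
  humanVerdict: "pass" | "fail";
}

async function judgeAgreementRate(
  cases: JudgeCase[],
  runJudge: (input: string, output: string) => Promise<"pass" | "fail">
): Promise<number> {
  let agree = 0;
  for (const c of cases) {
    const verdict = await runJudge(c.input, c.output);
    if (verdict === c.humanVerdict) agree++;
  }
  // Track this agreement rate over time as you tweak the judge's prompt
  // or its few-shot examples.
  return agree / cases.length;
}
```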
Closing Advice
- Build Quick, Update Often: Even a single day of test writing can reveal major failures. Add or refine tests each time you find an error.
- Don't Stress Perfection: It's normal to accept some failing test cases, especially if you're rapidly iterating. Decide on thresholds that fit your and your customers' risk tolerance.
- Start small: A handful of carefully chosen test cases with pass/fail checks can work wonders.
- Evals are not optional if you want to maintain velocity with confidence — they’re a minimal investment that pays off with every iteration.
References & Further Reading
- For basic AIP Evals setup: From Prototype to Production: Testing and Evaluating AI Systems with AIP Evals (PCL Blog)
- For a deep dive into the theory/tradecraft of writing effective evals: Evaluating Generative AI: A Field Manual (PCL Blog)
- Other useful blog posts:
- Your AI Product Needs Evals
- Creating a LLM-as-a-Judge That Drives Business Results
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
- What We’ve Learned From A Year of Building with LLMs
- What AI engineers can learn from qualitative research methods in HCI