[Guide] Writing Effective Evals

// We’re working on better docs for AIP Evals, but in the interim we wanted to share some best practices and learnings from Palantir engineers putting Evals to work in the field. For what Evals are at a high level and why they’re critical to production AIP workflows, check out this introductory blog post and this deep dive. This post pulls together learnings gathered by Palantir's Privacy and Civil Liberties (PCL) dev team while embedded with Palantir engineers in the field. The case studies are representative examples of how Evals were applied to various AIP Logic implementations and how those implementations evolved based on the evaluation results.

Why write Evals?

  • Adapt at warp speed: Evals let you pivot quickly (i.e., compare and switch between models and model versions) and keep switching costs low without compromising quality. You’ll know exactly what changed and why.
  • Spot issues fast: Testing prevents accidental regressions in your AI as you keep iterating.
  • Build trust & alignment: Show customers/users your AI meets real-world requirements — and keeps meeting them after every change.
Even if you only have limited time to invest in writing evals, the payoff in clarity and reliability is huge.

Quick Overview of the Evals Workflow

// Here’s a worksheet that steps through each of these stages in granular detail
  • Define the Problem & AI Task: Identify exactly what you’re using AI for (e.g., summarizing documents, classifying tickets, extracting entities, etc.).
  • Collect/Create Test Cases: Gather real data and/or create synthetic test cases that represent a variety of inputs.
  • Decide on Evaluation Methods:
    • Deterministic checks (e.g., regex, exact match, numeric ranges)
    • LLM-as-Judge for open-ended/subjective output (e.g., comprehensiveness, groundedness)
    • Ideally: keep to simple pass/fail outputs for your evaluators (a minimal sketch of such checks follows below)
  • Run, Observe, Fix: Evaluate, see where it fails, fix the system.
  • Repeat: Over time, you’ll refine your tests, add new ones, and keep track of improvements.
Tip: Start small; add complexities later if needed. Don’t let perfect be the enemy of good — ship an initial eval suite and learn from it.
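
As a rough illustration of what “simple pass/fail” deterministic checks can look like, here is a minimal sketch; the function names and result shape are assumptions for illustration, not the AIP Evals API:

```typescript
// Illustrative deterministic pass/fail checks. The names and result shape
// are hypothetical sketches, not the AIP Evals API.
type EvalResult = { pass: boolean; reason?: string };

// Exact match: expected vs. actual output (e.g., a classification label).
function exactMatch(expected: string, actual: string): EvalResult {
  const pass = expected.trim() === actual.trim();
  return { pass, reason: pass ? undefined : `expected "${expected}", got "${actual}"` };
}

// Regex check: e.g., "output must be a bulleted list".
function matchesRegex(output: string, pattern: RegExp): EvalResult {
  return { pass: pattern.test(output) };
}

// Numeric range: e.g., "score must be between 1 and 10".
function inRange(value: number, min: number, max: number): EvalResult {
  return { pass: value >= min && value <= max };
}
```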

Example #1: Summarization

Problem Overview

A function takes in a block of text and returns a summary.

What Evaluators We Used

  • Out of the box in AIP Evals
    • String length check (summary must be < n characters).
    • Regex match (e.g., the summary must be formatted as a bulleted list)
    • Keyword checker (e.g., the summary must mention “X” or “Y” if they appear in the source).
  • Custom No-Code (Evaluator backed by AIP Logic)
    • LLM-as-a-Judge: “Does this {summary bullet, sentence, etc.} accurately reflect [the input text]?”
      • We’ve found it effective to require the LLM judge to return supporting quotes from the input text; those quotes make it easier to validate your validator and are helpful when writing evals on this evaluator itself.
      • Provide 2–3 grounded vs. ungrounded summaries as few-shot examples so the LLM judge sees what a “pass” vs. a “fail” looks like.
  • Custom Pro-Code (Evaluator backed by a Function)
    • Word count (summary must be ~150 words); a minimal sketch of this kind of check follows this list.
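
For the pro-code word-count check, the underlying logic might look like the following; the function name, signature, and tolerance are illustrative assumptions rather than the actual Function interface AIP Evals expects:

```typescript
// Hypothetical word-count evaluator for a summary; the signature and
// tolerance are illustrative, not the actual AIP Evals Function interface.
function wordCountCheck(
  summary: string,
  targetWords = 150,
  tolerance = 30
): { pass: boolean; wordCount: number } {
  // Split on whitespace and drop empty tokens for a rough word count.
  const wordCount = summary.trim().split(/\s+/).filter(Boolean).length;
  return { pass: Math.abs(wordCount - targetWords) <= tolerance, wordCount };
}
```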

What We Changed As A Result


  • Prompt Engineering: For example, “If the input mentions X or Z, include them explicitly in the summary” or “Be concise.”

Example #2: Classification


Problem Overview


A function takes in some input and classifies it into two or more categories (e.g., yes/no, pass/fail, a score from 1 to 10). Optional (but recommended): the LLM is prompted to provide an explanation prior to the output classification.

What Evaluators We Used


  • Out of the box in AIP Evals
    • Exact string/integer match (expected vs. actual classification)
    • String length (on explanation)
    • Keyword checker (on explanation)
    • Numeric range (for scores)
  • Custom No-Code (Evaluator backed by AIP Logic)
    • LLM-as-a-Judge: e.g., “Does this output explanation align with this output classification?” (a minimal prompt sketch follows below)
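
To make the LLM-as-a-Judge idea concrete, here is a hedged sketch of how such a judge could be prompted and parsed; llmComplete is a stand-in for however your Logic-backed function or model call is invoked, not a real API:

```typescript
// Hypothetical LLM-as-a-Judge check for alignment between an explanation and
// a classification. llmComplete is a placeholder for your model call
// (e.g., an AIP Logic-backed function), not a real API.
const ALIGNMENT_JUDGE_PROMPT = (classification: string, explanation: string) => `
You are grading a classifier's output.
Classification: ${classification}
Explanation: ${explanation}

Does the explanation actually support this classification?
Answer with exactly one word: PASS or FAIL.
`;

async function judgeAlignment(
  classification: string,
  explanation: string,
  llmComplete: (prompt: string) => Promise<string>
): Promise<boolean> {
  const verdict = await llmComplete(ALIGNMENT_JUDGE_PROMPT(classification, explanation));
  // Parse the verdict deterministically; anything other than PASS counts as a fail.
  return verdict.trim().toUpperCase().startsWith("PASS");
}
```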

What We Changed & Impact of Iterating with Evals


  • Prompt Engineering: Refine the classification criteria for each potential output, give further instructions for more concise/relevant explanations, etc.
  • OAG (Ontology-assisted generation): Even better if you can represent the classification criteria as an object type, rather than hardcoding it into the prompt
  • Use of tools and deterministic processes: Augment LLM use with deterministic functions and tools for more consistent results.


Example #3: Semantic Search


Problem Overview


A function takes in a search query and returns the top K most semantically similar objects.

What Evaluators We Used


  • Out of the box in AIP Evals
    • Exact Object match (e.g., if you’re returning the top search result)
  • Custom No-Code (Evaluator backed by AIP Logic)
    • LLM-as-a-Judge: Score/classify each search result based on quality/relevance, then average the scores across results.
  • Custom Pro-Code (Evaluator backed by a Function)
    • Recall %: of the objects known to be relevant for a query, count how many appear in the returned search results (a sketch of this check follows below)
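
A minimal sketch of a recall-style check, assuming each test case carries the set of object IDs a domain expert marked as relevant; the names and threshold here are illustrative:

```typescript
// Hypothetical recall check for a semantic search test case.
// relevantIds: object IDs a domain expert marked as relevant for the query.
// returnedIds: the top-K object IDs the search function actually returned.
function recallAtK(relevantIds: string[], returnedIds: string[]): number {
  if (relevantIds.length === 0) return 1; // nothing to find; trivially satisfied
  const returned = new Set(returnedIds);
  const found = relevantIds.filter((id) => returned.has(id)).length;
  return found / relevantIds.length;
}

// Pass/fail wrapper: require at least 80% of relevant objects in the top K.
function recallCheck(relevantIds: string[], returnedIds: string[], threshold = 0.8): boolean {
  return recallAtK(relevantIds, returnedIds) >= threshold;
}
```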

What We Changed & Impact of Iterating with Evals


  • Upgrade Embedding Model: We switched from text-embedding-ada-002 to text-embedding-3-small.
  • Tune Queries/Embedding Process: Tuned logic to generate more descriptive search queries / embed more semantically relevant properties of embedded objects.
  • Impact: With one partner, we saw the quality of search results improve across the board. Raised both the floor + ceiling of result quality.


Key Patterns Across Examples


  • Start with a small set of pass/fail checks.
    • Don’t overcomplicate scoring scales / rubrics.
  • Bring domain experts in to define what “correct” means.
    • Your goal is to use them to sniff out hidden rules or corner cases + codify their tribal knowledge as evals (both for your main function under test + also for any LLM-as-a-Judge evaluators).
  • Use real data + synthetic data:
    • Real data captures actual complexities, synthetic helps push edge cases systematically.
  • LLM-as-Judge is powerful but needs clear prompts and examples of “pass vs. fail” (see the prompt sketch after this list).
    • Write Evals for your LLM-as-Judge functions to ensure alignment between their outputs + human preferences. You can often use the same test cases between your main LLM function and your LLM-as-a-Judge functions.
  • Watch out for:
    • Cost or latency overhead (as we all know, LLM calls aren’t free / instantaneous).
    • Deciding when to fix prompt instructions vs. update few-shot examples vs. add new tests.
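
To illustrate the “clear prompts and examples of pass vs. fail” point, a groundedness judge prompt might embed a couple of labeled examples directly; the wording and example texts below are invented placeholders, not taken from any real deployment:

```typescript
// Hypothetical groundedness judge prompt with few-shot pass/fail examples.
// The example source/summary pairs are invented placeholders; substitute
// real cases from your own domain.
const GROUNDEDNESS_JUDGE_PROMPT = (sourceText: string, summarySentence: string) => `
You check whether a summary sentence is grounded in the source text.

Example (PASS):
Source: "The contract was signed on 3 March and runs for 12 months."
Summary sentence: "The 12-month contract was signed in early March."
Verdict: PASS

Example (FAIL):
Source: "The contract was signed on 3 March and runs for 12 months."
Summary sentence: "The contract includes an automatic renewal clause."
Verdict: FAIL

Now grade this case. Quote the supporting source text if your verdict is PASS.
Source: ${sourceText}
Summary sentence: ${summarySentence}
Verdict:
`;
```

The same labeled examples can double as test cases for the judge itself, which is one way to check that its verdicts track human preferences.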

Closing Advice


  • Build Quick, Update Often: Even a single day of test writing can reveal major failures. Add or refine tests each time you find an error.
  • Don’t Stress Perfection: It’s normal to accept some failing test cases, especially if you’re rapidly iterating. Decide thresholds that fit your + your customers’ risk tolerance.
  • Start small: A handful of carefully chosen test cases with pass/fail checks can work wonders.
  • Evals are not optional if you want to maintain velocity with confidence — they’re a minimal investment that pays off with every iteration.


This is great, thank you so much for writing it up!

Is there any way to have evals on datasets as opposed to ontology? The use case I have in mind is data extraction from mediasets using LLMs. It’s weird to add an object type for the raw mediaset just for the evals.

At the moment, the best way to do this is the way you identified — by building that dataset into the Ontology, you can pull it into AIP Evals via an Object set-backed evaluation suite. That being said, we’d be curious to know more about your use case to inform our product development!