[Guide] Writing Effective Evals

// We’re working on better docs for AIP Evals, but in the interim we wanted to share some best practices and learnings from Palantir engineers putting Evals to work in the field. For what Evals are at a high level and why they’re critical to production AIP workflows, check out this introductory blog post and this deep dive. This post pulls together learnings gathered by Palantir's Privacy and Civil Liberties (PCL) dev team while embedded with Palantir engineers in the field. The case studies are representative examples of how Evals were applied to various AIP Logic implementations and how those implementations evolved based on the evaluation results.

Why write Evals?

  • Adapt at warp speed: Evals let you pivot quickly (i.e., compare and switch between models and model versions) and keep switching costs low without compromising quality. You’ll know exactly what changed and why.
  • Spot issues fast: Testing prevents accidental regressions in your AI as you keep iterating.
  • Build trust & alignment: Show customers/users your AI meets real-world requirements — and keeps meeting them after every change.
Even if you only have limited time to invest in writing evals, the payoff in clarity and reliability is huge.

Quick Overview of the Evals Workflow

// Here’s a worksheet that steps through each of these stages in granular detail
  • Define the Problem & AI Task: Identify exactly what you’re using AI for (e.g., summarizing documents, classifying tickets, extracting entities, etc.).
  • Collect/Create Test Cases: Gather real data and/or create synthetic test cases that represent a variety of inputs.
  • Decide on Evaluation Methods:
    • Deterministic checks (e.g., regex, exact match, numeric ranges)
    • LLM-as-Judge for open-ended/subjective output (e.g., comprehensiveness, groundedness)
    • Ideally: keep to simple pass/fail outputs for your evaluators (a minimal sketch of such checks follows below)
  • Run, Observe, Fix: Evaluate, see where it fails, fix the system.
  • Repeat: Over time, you’ll refine your tests, add new ones, and keep track of improvements.
Tip: Start small; add complexities later if needed. Don’t let perfect be the enemy of good — ship an initial eval suite and learn from it.
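
As a rough illustration of what “simple pass/fail” deterministic checks can look like, here is a minimal sketch; the function names and result shape are assumptions for illustration, not the AIP Evals API:

```typescript
// Illustrative deterministic pass/fail checks. The names and result shape
// are hypothetical sketches, not the AIP Evals API.
type EvalResult = { pass: boolean; reason?: string };

// Exact match: expected vs. actual output (e.g., a classification label).
function exactMatch(expected: string, actual: string): EvalResult {
  const pass = expected.trim() === actual.trim();
  return { pass, reason: pass ? undefined : `expected "${expected}", got "${actual}"` };
}

// Regex check: e.g., "output must be a bulleted list".
function matchesRegex(output: string, pattern: RegExp): EvalResult {
  return { pass: pattern.test(output) };
}

// Numeric range: e.g., "score must be between 1 and 10".
function inRange(value: number, min: number, max: number): EvalResult {
  return { pass: value >= min && value <= max };
}
```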

Example #1: Summarization

Problem Overview

A function takes in a block of text and returns a summary.

What Evaluators We Used

  • Out of the box in AIP Evals
    • String length check (summary must be < n characters).
    • Regex match (e.g., the summary must be formatted as a bulleted list)
    • Keyword checker (e.g., the summary must mention “X” or “Y” if they appear in the source).
  • Custom No-Code (Evaluator backed by AIP Logic)
    • LLM-as-a-Judge: “Does this {summary bullet, sentence, etc.} accurately reflect [the input text]?”
      • We’ve found it effective to require the LLM judge to return supporting quotes from the input text; those quotes make it easier to validate your validator and are helpful when writing evals on this evaluator itself.
      • Provide 2–3 grounded vs. ungrounded summaries as few-shot examples so the LLM judge sees what a “pass” vs. a “fail” looks like.
  • Custom Pro-Code (Evaluator backed by a Function)
    • Word count (summary must be ~150 words); a minimal sketch of this kind of check follows this list.
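
For the pro-code word-count check, the underlying logic might look like the following; the function name, signature, and tolerance are illustrative assumptions rather than the actual Function interface AIP Evals expects:

```typescript
// Hypothetical word-count evaluator for a summary; the signature and
// tolerance are illustrative, not the actual AIP Evals Function interface.
function wordCountCheck(
  summary: string,
  targetWords = 150,
  tolerance = 30
): { pass: boolean; wordCount: number } {
  // Split on whitespace and drop empty tokens for a rough word count.
  const wordCount = summary.trim().split(/\s+/).filter(Boolean).length;
  return { pass: Math.abs(wordCount - targetWords) <= tolerance, wordCount };
}
```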

What We Changed As A Result


  • Prompt Engineering: For example, “If the input mentions X or Z, include them explicitly in the summary” or “Be concise.”

Example #2: Classification


Problem Overview


A function takes in some input and classifies it into two or more categories (e.g., yes/no, pass/fail, a score from 1 to 10). Optional (but recommended): the LLM is prompted to provide an explanation prior to the output classification.

What Evaluators We Used


  • Out of the box in AIP Evals
    • Exact string/integer match (expected vs. actual classification)
    • String length (on explanation)
    • Keyword checker (on explanation)
    • Numeric range (for scores)
  • Custom No-Code (Evaluator backed by AIP Logic)
    • LLM-as-a-Judge: e.g., “Does this output explanation align with this output classification?” (a minimal prompt sketch follows below)
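
To make the LLM-as-a-Judge idea concrete, here is a hedged sketch of how such a judge could be prompted and parsed; llmComplete is a stand-in for however your Logic-backed function or model call is invoked, not a real API:

```typescript
// Hypothetical LLM-as-a-Judge check for alignment between an explanation and
// a classification. llmComplete is a placeholder for your model call
// (e.g., an AIP Logic-backed function), not a real API.
const ALIGNMENT_JUDGE_PROMPT = (classification: string, explanation: string) => `
You are grading a classifier's output.
Classification: ${classification}
Explanation: ${explanation}

Does the explanation actually support this classification?
Answer with exactly one word: PASS or FAIL.
`;

async function judgeAlignment(
  classification: string,
  explanation: string,
  llmComplete: (prompt: string) => Promise<string>
): Promise<boolean> {
  const verdict = await llmComplete(ALIGNMENT_JUDGE_PROMPT(classification, explanation));
  // Parse the verdict deterministically; anything other than PASS counts as a fail.
  return verdict.trim().toUpperCase().startsWith("PASS");
}
```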

What We Changed & Impact of Iterating with Evals


  • Prompt Engineering: Refine the classification criteria for each potential output, give further instructions for more concise/relevant explanations, etc.
  • OAG (Ontology-assisted generation): Even better if you can represent the classification criteria as an object type, rather than hardcoding it into the prompt
  • Use of tools and deterministic processes: Augment LLM use with deterministic functions and tools for more consistent results.


Example #3: Semantic Search


Problem Overview


A function takes in a search query and returns the top K most semantically similar objects.

What Evaluators We Used


  • Out of the box in AIP Evals
    • Exact Object match (e.g., if you’re returning the top search result)
  • Custom No-Code (Evaluator backed by AIP Logic)
    • LLM-as-a-Judge: Score/classify each search result based on quality/relevance, then average the scores across results.
  • Custom Pro-Code (Evaluator backed by a Function)
    • Recall %: of the objects known to be relevant for a query, count how many appear in the returned search results (a sketch of this check follows below)
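
A minimal sketch of a recall-style check, assuming each test case carries the set of object IDs a domain expert marked as relevant; the names and threshold here are illustrative:

```typescript
// Hypothetical recall check for a semantic search test case.
// relevantIds: object IDs a domain expert marked as relevant for the query.
// returnedIds: the top-K object IDs the search function actually returned.
function recallAtK(relevantIds: string[], returnedIds: string[]): number {
  if (relevantIds.length === 0) return 1; // nothing to find; trivially satisfied
  const returned = new Set(returnedIds);
  const found = relevantIds.filter((id) => returned.has(id)).length;
  return found / relevantIds.length;
}

// Pass/fail wrapper: require at least 80% of relevant objects in the top K.
function recallCheck(relevantIds: string[], returnedIds: string[], threshold = 0.8): boolean {
  return recallAtK(relevantIds, returnedIds) >= threshold;
}
```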

What We Changed & Impact of Iterating with Evals


  • Upgrade Embedding Model: We switched from text-embedding-ada-002 to text-embedding-3-small.
  • Tune Queries/Embedding Process: Tuned logic to generate more descriptive search queries / embed more semantically relevant properties of embedded objects.
  • Impact: With one partner, we saw the quality of search results improve across the board. Raised both the floor + ceiling of result quality.


Key Patterns Across Examples


  • Start with a small set of pass/fail checks.
    • Don’t overcomplicate scoring scales / rubrics.
  • Bring domain experts in to define what “correct” means.
    • Your goal is to use them to sniff out hidden rules or corner cases + codify their tribal knowledge as evals (both for your main function under test + also for any LLM-as-a-Judge evaluators).
  • Use real data + synthetic data:
    • Real data captures actual complexities, synthetic helps push edge cases systematically.
  • LLM-as-Judge is powerful but needs clear prompts and examples of “pass vs. fail” (see the prompt sketch after this list).
    • Write Evals for your LLM-as-Judge functions to ensure alignment between their outputs + human preferences. You can often use the same test cases between your main LLM function and your LLM-as-a-Judge functions.
  • Watch out for:
    • Cost or latency overhead (as we all know, LLM calls aren’t free / instantaneous).
    • Deciding when to fix prompt instructions vs. update few-shot examples vs. add new tests.
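
To illustrate the “clear prompts and examples of pass vs. fail” point, a groundedness judge prompt might embed a couple of labeled examples directly; the wording and example texts below are invented placeholders, not taken from any real deployment:

```typescript
// Hypothetical groundedness judge prompt with few-shot pass/fail examples.
// The example source/summary pairs are invented placeholders; substitute
// real cases from your own domain.
const GROUNDEDNESS_JUDGE_PROMPT = (sourceText: string, summarySentence: string) => `
You check whether a summary sentence is grounded in the source text.

Example (PASS):
Source: "The contract was signed on 3 March and runs for 12 months."
Summary sentence: "The 12-month contract was signed in early March."
Verdict: PASS

Example (FAIL):
Source: "The contract was signed on 3 March and runs for 12 months."
Summary sentence: "The contract includes an automatic renewal clause."
Verdict: FAIL

Now grade this case. Quote the supporting source text if your verdict is PASS.
Source: ${sourceText}
Summary sentence: ${summarySentence}
Verdict:
`;
```

The same labeled examples can double as test cases for the judge itself, which is one way to check that its verdicts track human preferences.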

Closing Advice


  • Build Quick, Update Often: Even a single day of test writing can reveal major failures. Add or refine tests each time you find an error.
  • Don’t Stress Perfection: It’s normal to accept some failing test cases, especially if you’re rapidly iterating. Decide thresholds that fit your + your customers’ risk tolerance.
  • Start small: A handful of carefully chosen test cases with pass/fail checks can work wonders.
  • Evals are not optional if you want to maintain velocity with confidence — they’re a minimal investment that pays off with every iteration.


This is great, thank you so much for writing it up!

Is there any way to have evals on datasets as opposed to ontology? The use case I have in mind is data extraction from mediasets using LLMs. It’s weird to add an object type for the raw mediaset just for the evals.

At the moment, the best way to do this is the way you identified — by building that dataset into the Ontology, you can pull it into AIP Evals via an Object set-backed evaluation suite. That being said, we’d be curious to know more about your use case to inform our product development!