[Worksheet] Evals Field Manual

BLUF: This worksheet walks you through each phase of developing a testing and evaluation (T&E) plan for your AI use case. Based on this guide from Palantir’s Privacy and Civil Liberties (PCL) team, it helps you identify exactly what you want to test, how you’ll define “good,” where your ground-truth or reference data will come from, and which evaluation methods or metrics to use.

Worksheet at a Glance

| Step | What You Do | Output |
| --- | --- | --- |
| Define Use Case | Clarify the workflow and KPI | Clear problem statement & success metrics |
| Break Down Tasks | Separate “macro” vs. “micro” tasks | Defined sub-tasks & priority levels |
| “Good” Criteria | List out syntax & semantic requirements | Shared definition of success at each step |
| Gather/Create Data | Collect or synthesize ground-truth & reference data | Core test sets for your AI evals |
| Pick Evaluators | Choose from exact match, fuzzy match, LLM judge, etc. | Mapped tasks to best-fit evaluators |
| LLM-as-a-Judge | Write judge prompts & pass/fail outputs | Automated checks for open-ended tasks |
| Robustness Checks | Account for non-determinism, bias, perturbations | Additional layers of reliability testing |
| Analyze & Iterate | Review test results, root cause fails, fix & retest | Iterative performance improvements |
| Deploy & Monitor | Monitor real-world behavior, log data, test for drift | Ongoing reliability & alignment |

Define Your Use Case and Success Criteria

Goal

Articulate the real-world scenario in which your AI system will be used and identify how you’ll know it’s successful.

Questions to Answer

  • What is the workflow or business process being augmented or automated by AI?
    • Example: “Routing shipments to specific factories,” “triaging customer service tickets,” “summarizing chat logs,” etc.
  • Who are the end-users (internal team, customers, etc.)?
    • Example: “Customer support agents,” “warehouse managers,” “marketing specialists.”
  • What does ‘success’ look like at a high level?
    • Example: “Reduced time to resolve tickets,” “fewer shipping errors,” “faster summarization with correct attributions.”
  • What are the measurable outcomes or Key Performance Indicators (KPIs)?
    • Example: “% misrouted shipments,” “average time to resolve tickets,” “accuracy of code completions that compile without errors.”

Checklist

  • I can explain, in a few sentences, how the AI solution is used and why it matters.
  • I know the primary business or user metric this AI should improve (e.g., reduced user churn, quicker turnaround, higher conversion, etc.).
  • My entire team shares a consistent understanding of “what does good look like?” at a high level.

Break Down the AI Tasks (“Macro” vs. “Micro”)

Goal

Decompose the larger AI workflow (“macro”) into smaller tasks or “testable” components (“micro”) to pinpoint what, exactly, you’ll evaluate.

Questions to Answer

  • What is the end-to-end workflow, and where does AI fit in?
    • Example: “AI checks inventory, then suggests shipping route. Operator confirms the route and executes the shipment.”
  • Which tasks are performed by AI?
    • Example: “Document retrieval,” “question answering,” “translation,” “code generation,” “classification,” “summarization.”
  • Which tasks must absolutely be correct vs. which tasks are more flexible or creative?
    • Example: “A shipping label must be exact” (low tolerance for error) vs. “Suggestion text can be approximate” (higher tolerance).

Checklist

  • I have mapped the full operational workflow: from user request → AI inference → final action/decision.
  • I have a list of discrete tasks or steps that the AI performs.
  • I can tag each task with its required level of correctness or robustness (e.g., 95% accuracy for classification, 100% valid code syntax, etc.).
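
One lightweight way to capture this breakdown is a small task registry that tags each “micro” task with its tolerance for error and the evaluator you expect to use. The sketch below is illustrative only: the shipment-routing task names, evaluator labels, and thresholds are hypothetical placeholders.

```python
# Illustrative task registry for a hypothetical shipment-routing workflow.
# Task names, evaluator types, and thresholds are placeholders -- replace with your own.
TASKS = [
    {
        "task": "extract_destination_address",   # "micro" task with no room for error
        "tolerance": "strict",
        "evaluator": "exact_match",
        "pass_threshold": 1.00,                   # 100% of test cases must pass
    },
    {
        "task": "classify_shipment_priority",
        "tolerance": "high",
        "evaluator": "exact_match",
        "pass_threshold": 0.95,                   # e.g., a 95% accuracy target
    },
    {
        "task": "draft_routing_rationale",        # free-text explanation shown to the operator
        "tolerance": "flexible",
        "evaluator": "llm_judge",
        "pass_threshold": 0.80,
    },
]

def strict_tasks(tasks: list[dict]) -> list[str]:
    """Return the tasks that must be exactly correct."""
    return [t["task"] for t in tasks if t["tolerance"] == "strict"]
```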

Specify “What Does Good Look Like?” for Each Task

Goal

Translate “success” into more detailed criteria that capture both syntax (format, structure) and semantics (correctness, relevance).

Questions to Answer

  • Syntax Requirements
    • Does the AI output need to follow a specific format (e.g., JSON schema, function call signature, code snippet that compiles)? (See the sketch after this list for an automated check.)
    • Are there constraints like max length or specific language?
  • Semantic Requirements
    • Does the response need to be factually correct, or is creative latitude allowed?
    • Is it relevant and comprehensive for the user’s question?
  • Quality and Style Requirements
    • Does the text need to be friendly, neutral, concise, etc.?
    • Are there domain-specific guidelines (e.g., legal disclaimers, brand voice) that must be followed?
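
Many of the syntax requirements above can be checked automatically with a few lines of code. The sketch below is a minimal example, assuming the AI is supposed to return a JSON object with specific keys and a length cap; the field names and limit are hypothetical.

```python
import json

REQUIRED_KEYS = {"route_id", "factory", "rationale"}   # hypothetical schema
MAX_RATIONALE_CHARS = 500                               # hypothetical length limit

def check_syntax(raw_output: str) -> tuple[bool, str]:
    """Return (passed, reason) for basic format and structure requirements."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(data, dict):
        return False, "output is not a JSON object"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing required keys: {sorted(missing)}"
    if len(str(data.get("rationale", ""))) > MAX_RATIONALE_CHARS:
        return False, "rationale exceeds max length"
    return True, "ok"

# Example:
# check_syntax('{"route_id": "R-12", "factory": "ATL-2", "rationale": "Closest stock."}')
```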

Checklist

  • For each task, I’ve listed 2–3 “semantic” criteria (like correctness, completeness, tone).
  • For each task, I’ve listed any strict “syntax” or format rules.
  • I know how to check if the output meets these criteria (some might be automated; some might need domain-expert input).

Collect or Create Reference Data (“Ground Truth”) Where Possible

Goal

Build a set of test examples with known or expected outputs for straightforward tasks, or well-understood “gold standard” references for creative tasks.

Methods to Consider

  • Open-Source Benchmarks
    • Good for standard tasks (e.g., summarization, QA) to ensure your model’s “baseline” performance is intact.
  • Unit Tests Written by Subject Matter Experts
    • Manually crafted “edge case” or “typical scenario” examples that define a correct output.
  • Historical Data
    • If automating an existing workflow, gather real past inputs and their correct decisions or final outcomes.
  • Synthetic Data
    • Use an LLM to create plausible but labeled examples if real data is limited or missing.
    • Example: “Generate user queries and the expected correct classification label.”
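
If you go the synthetic-data route, the sketch below shows one possible pattern: prompt a model for labeled examples, then keep only the well-formed ones. The `call_llm()` helper is a placeholder for whatever model client you use, and the label set is hypothetical.

```python
import json

LABELS = ["billing", "shipping", "returns", "other"]   # hypothetical label set

PROMPT = (
    "Generate one realistic customer-support query and its correct label.\n"
    f"Allowed labels: {LABELS}\n"
    'Respond as JSON: {"query": "...", "label": "..."}'
)

def call_llm(prompt: str) -> str:
    """Placeholder for your model client (OpenAI, Anthropic, an internal gateway, etc.)."""
    raise NotImplementedError

def generate_synthetic_examples(n: int) -> list[dict]:
    """Collect up to n well-formed, in-label synthetic examples."""
    examples = []
    for _ in range(n):
        try:
            record = json.loads(call_llm(PROMPT))
        except json.JSONDecodeError:
            continue                                    # skip malformed generations
        if isinstance(record, dict) and record.get("label") in LABELS:
            examples.append(record)
    return examples
```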

Checklist

  • I have at least a small set of test examples that reflect real use cases.
  • I know which tasks these examples “cover”—and I have edge cases or adversarial examples.
  • (Optional) I have a plan for using LLMs or other means to expand my dataset if needed.

Design Evaluation Methods

Goal

Match each task (from the “Break Down the AI Tasks” step above) to an appropriate evaluation approach or metric. You’ll likely use a mix of reference-based comparisons (when ground truth is available) and reference-free checks (like heuristics or LLM-as-a-Judge).

Potential Evaluators

  • Deterministic or Reference-Based Evaluators
    • Exact string match, fuzzy matching (Levenshtein distance), numeric tolerance checks, regular expression checks.
    • Great for syntax validation, verifying top-1 classification, or checking that a known field is correct (see the sketch after this list).
  • LLM-as-a-Judge
    • A second LLM “critiques” the main model’s output based on your pass/fail criteria.
    • Especially useful where no single “correct” reference exists (e.g., summarization, code style).
  • Human Judgments
    • Domain experts mark outputs as pass/fail and note a brief critique.
    • This yields the “gold standard” against which to calibrate your automated evaluations.
  • Perturbation Testing
    • Introduce typos, synonyms, or demographic changes to inputs. Evaluate the model’s resilience and check for potential biases.
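
As a starting point, the deterministic checks above are often just small functions. The sketch below sticks to Python’s standard library (`difflib` as a stand-in for a dedicated Levenshtein package, `re` for pattern checks, `math.isclose` for numeric tolerance); the thresholds and the example pattern are arbitrary.

```python
import difflib
import math
import re

def exact_match(output: str, reference: str) -> bool:
    """Strict check after trimming whitespace and normalizing case."""
    return output.strip().lower() == reference.strip().lower()

def fuzzy_match(output: str, reference: str, threshold: float = 0.9) -> bool:
    """Similarity check; difflib's ratio stands in for Levenshtein-style distance here."""
    ratio = difflib.SequenceMatcher(
        None, output.strip().lower(), reference.strip().lower()
    ).ratio()
    return ratio >= threshold          # the 0.9 threshold is an arbitrary example

def numeric_close(output: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Numeric tolerance check: passes within 1% relative error (adjust per task)."""
    return math.isclose(output, reference, rel_tol=rel_tol)

def matches_pattern(output: str, pattern: str = r"^[A-Z]{3}-\d{4}$") -> bool:
    """Regex check; the default pattern is a hypothetical 'ABC-1234' label format."""
    return re.fullmatch(pattern, output.strip()) is not None
```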

Checklist

  • For each criterion, I’ve decided whether I can use exact match or whether I need a more flexible check.
  • I know whether an LLM-as-a-Judge or domain-expert pass/fail step is needed.
  • I have a plan to run multiple trials if the AI has non-deterministic outputs (especially for generative tasks).
  • I’m considering “robustness” tests (typos, synonyms, etc.) if real-world data is messy.

Implement “LLM-as-a-Judge” Carefully (If Relevant)

Goal

Leverage a second LLM to evaluate outputs on more subjective or open-ended tasks.

Tips

  • Prompt Design
    • Provide multi-shot examples of “good vs. bad” outputs and how to critique them.
    • Use a binary pass/fail style to keep the results actionable (see the sketch after these tips).
  • Cost & Performance
    • LLM calls can be expensive and slow. Consider using them selectively or on a representative sample of your data.
    • Start with a small, curated set of examples from real user data or known tough scenarios.
  • Periodically Check the Judge
    • Validate the judge’s alignment with a domain expert’s judgments.
    • Tweak prompts or instructions if you see consistent misjudgments.
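
A judge can be as simple as a prompt template plus a strict parser for the verdict. The sketch below is one possible shape, assuming a hypothetical `call_llm()` helper and a summarization task; the criteria and the pass/fail examples are placeholders to replace with your own.

```python
JUDGE_PROMPT = """You are grading an AI-generated summary against the criteria below.
Criteria:
1. Every claim is supported by the source text (no hallucinated facts).
2. The summary covers the main points.
3. The tone is neutral and concise.

Example of a FAIL: a summary that cites a meeting date not present in the source.
Example of a PASS: a shorter summary that omits minor details but misstates nothing.

Source text:
{source}

Summary to grade:
{summary}

Answer with exactly one line:
PASS: <one-sentence critique>  or  FAIL: <one-sentence critique>
"""

def call_llm(prompt: str) -> str:
    """Placeholder for your model client."""
    raise NotImplementedError

def judge(source: str, summary: str) -> tuple[bool, str]:
    """Return (passed, critique) parsed from the judge model's one-line verdict."""
    verdict = call_llm(JUDGE_PROMPT.format(source=source, summary=summary)).strip()
    passed = verdict.upper().startswith("PASS")
    critique = verdict.split(":", 1)[1].strip() if ":" in verdict else verdict
    return passed, critique
```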

Checklist

  • I have a “judge prompt” with examples that show how to evaluate correctness, completeness, style, etc.
  • The LLM judge output is strictly yes/no (pass/fail) or “pass with critique” vs. “fail with critique” for clarity.
  • I track agreement between the judge and an actual domain expert.
  • I know the cost/latency trade-offs for running LLM-based evals at scale.

Measure Robustness, Non-Determinism, and Bias

Goal

Ensure your AI remains stable across variations of inputs and identify potential biases or fairness issues.

Techniques

  • Multiple Runs / Temperature Sweeps
    • Re-run each test input N times and calculate how often the model meets your pass criteria (see the sketch after this list).
    • Look for outliers or inconsistent performance.
  • Perturbation Testing (again)
    • Expand your test data with controlled changes (typos, synonyms, demographic swaps) to gauge how the model fails or changes answers.
  • Bias & Fairness Checks
    • Evaluate if the AI performance differs significantly across different demographic inputs.
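
The sketch below shows both ideas in miniature: a repeated-run pass rate for non-deterministic outputs and a crude typo perturbation. `run_model` and `passes` are stand-ins for your own generation and evaluation functions.

```python
import random

def pass_rate(run_model, passes, test_input, n: int = 10) -> float:
    """Run the (possibly non-deterministic) model n times; return the fraction of passing outputs.

    run_model(test_input) -> output; passes(output) -> bool. Both are stand-ins for your own code.
    """
    results = [passes(run_model(test_input)) for _ in range(n)]
    return sum(results) / n

def add_typos(text: str, n_typos: int = 2, seed: int = 0) -> str:
    """Crude perturbation: swap a few adjacent characters to simulate typos."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_typos):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Example: compare pass_rate(...) on an original input vs. add_typos(original_input)
# to see how much a little noise degrades performance.
```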

Checklist

  • I run each test input multiple times if the model is non-deterministic.
  • I have at least a small “perturbed” test set that includes random typos or demographic changes.
  • I’ve noted any potential biases or fairness concerns and have a plan to address them (e.g., specialized test sets, domain expert reviews).

Analyze Results & Iterate

Goal

Aggregate your evaluation metrics, see where failures cluster, and fix them systematically.

Steps

  • Create an Evaluation Dashboard
    • Summaries: pass rate across tasks/features, confusion matrices, distribution of error types (see the sketch after these steps).
  • Error Analysis
    • Sample from failed outputs. Work with domain experts to label the root cause (e.g., “lack of user context,” “wrong function call,” “hallucinated references”).
  • Fix & Retest
    • Adjust prompts, add constraints, refine training data, or revise tool-calling logic.
    • Re-run the same tests to confirm improvement.
  • Decide on Next Steps
    • If the model is good enough, proceed to pilot or A/B testing.
    • Otherwise, keep iterating, gathering more data or exploring fine-tuning, specialized guardrails, etc.
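
Even before a full dashboard, a simple aggregation over your eval results goes a long way toward the pass rates and root-cause counts described above. The sketch below assumes each result is a dict with hypothetical `task`, `passed`, and `root_cause` fields.

```python
from collections import Counter, defaultdict

def summarize(results: list[dict]) -> None:
    """results: [{"task": ..., "passed": bool, "root_cause": str or None}, ...] (hypothetical shape)."""
    by_task = defaultdict(list)
    for r in results:
        by_task[r["task"]].append(r["passed"])

    print("Pass rate by task:")
    for task, outcomes in sorted(by_task.items()):
        print(f"  {task}: {sum(outcomes)}/{len(outcomes)} ({sum(outcomes) / len(outcomes):.0%})")

    causes = Counter(r["root_cause"] for r in results if not r["passed"] and r.get("root_cause"))
    print("Top failure root causes:")
    for cause, count in causes.most_common(5):
        print(f"  {cause}: {count}")
```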

Checklist

  • I have a mechanism (dashboard, spreadsheet, logs) to see pass/fail rates.
  • I read through a sample of fails to classify root causes.
  • I made changes (prompt engineering, training, RAG approach, etc.) to address top failure modes.
  • I re-ran my tests to see if I improved performance.

Deployment, Monitoring & Maintenance

Goal

Once your AI is in production, continue monitoring and evaluating it to catch regressions and drift.

Ongoing Steps

  • Live Feedback & Logging
    • Capture real user interactions (traces), especially failures or escalations.
    • Periodically label them and feed them back into your test suite.
  • Drift Monitoring
    • Watch if model performance degrades over time or if user behavior changes.
    • Re-run your test suite on a schedule or after major updates (see the sketch after this list).
  • Expansion
    • As your use cases grow, add new tests or scenarios.
    • Create specialized or more granular LLM judges if certain tasks become critical.
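
One lightweight pattern for the scheduled re-runs above is to store a baseline pass rate per task and flag any run that drops meaningfully below it. The sketch below is a minimal version; the 5-percentage-point threshold and the data shapes are arbitrary examples.

```python
def check_for_regressions(baseline: dict[str, float],
                          current: dict[str, float],
                          max_drop: float = 0.05) -> list[str]:
    """Compare per-task pass rates from a scheduled re-run against a stored baseline.

    baseline/current map task name -> pass rate (0.0-1.0). Flags tasks that dropped
    by more than max_drop (5 percentage points here, an arbitrary example).
    """
    regressions = []
    for task, base_rate in baseline.items():
        new_rate = current.get(task)
        if new_rate is not None and base_rate - new_rate > max_drop:
            regressions.append(f"{task}: {base_rate:.0%} -> {new_rate:.0%}")
    return regressions

# Example: alert (or open a ticket) whenever check_for_regressions(...) returns anything.
```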

Checklist

  • I have a plan for logging real-world usage (including final responses, error codes, user feedback).
  • I set a schedule or triggers (e.g., monthly, new model release) to re-run the entire evaluation suite.
  • I have a strategy for building new tests and fix-forward improvements as the model or data evolves.

Final Tips & Recap

  • Keep It Simple
    • Start with the most critical tasks and basic pass/fail. Don’t chase dozens of metrics before you have the fundamentals.
  • Engage Domain Experts
    • They define “good” in practice. Build your test data and “LLM-as-a-Judge” prompts around their guidance.
  • Look at the Data
    • The real value often comes from systematically inspecting test failures and seeing how the AI actually behaves.
  • Iterate
    • Testing and evaluation is not one-and-done. Build a routine for continuous refinement.

Closing Notes

  • Start small: Even a handful of high-value test cases can surface big issues early.
  • Refine often: As you learn from real user data, keep updating your test sets and evaluation plan.
  • Keep it actionable: Use pass/fail judgments and short critiques rather than complex, subjective numeric scales—this gives you a direct path to improvements.
