[Worksheet] Evals Field Manual

BLUF: This worksheet walks you through each phase of developing a testing and evaluation (T&E) plan for your AI use case. Based on this guide from Palantir’s Privacy and Civil Liberties (PCL) team, it helps you identify exactly what you want to test, how you’ll define “good,” where your ground-truth or reference data will come from, and which evaluation methods or metrics to use.

Worksheet at a Glance

| Step | What You Do | Output |
| --- | --- | --- |
| Define Use Case | Clarify the workflow and KPI | Clear problem statement & success metrics |
| Break Down Tasks | Separate “macro” vs. “micro” tasks | Defined sub-tasks & priority levels |
| “Good” Criteria | List out syntax & semantic requirements | Shared definition of success at each step |
| Gather/Create Data | Collect or synthesize ground-truth & reference data | Core test sets for your AI evals |
| Pick Evaluators | Choose from exact match, fuzzy match, LLM judge, etc. | Mapped tasks to best-fit evaluators |
| LLM-as-a-Judge | Write judge prompts & pass/fail outputs | Automated checks for open-ended tasks |
| Robustness Checks | Account for non-determinism, bias, perturbations | Additional layers of reliability testing |
| Analyze & Iterate | Review test results, root cause fails, fix & retest | Iterative performance improvements |
| Deploy & Monitor | Monitor real-world behavior, log data, test for drift | Ongoing reliability & alignment |

Define Your Use Case and Success Criteria

Goal

Articulate the real-world scenario in which your AI system will be used and identify how you’ll know it’s successful.

Questions to Answer

  • What is the workflow or business process being augmented or automated by AI?
    • Example: “Routing shipments to specific factories,” “triaging customer service tickets,” “summarizing chat logs,” etc.
  • Who are the end-users (internal team, customers, etc.)?
    • Example: “Customer support agents,” “warehouse managers,” “marketing specialists.”
  • What does ‘success’ look like at a high level?
    • Example: “Reduced time to resolve tickets,” “fewer shipping errors,” “faster summarization with correct attributions.”
  • What are the measurable outcomes or Key Performance Indicators (KPIs)?
    • Example: “% misrouted shipments,” “average time to resolve tickets,” “accuracy of code completions that compile without errors.”

Checklist

  • I can explain, in a few sentences, how the AI solution is used and why it matters.
  • I know the primary business or user metric this AI should improve (e.g., reduced user churn, quicker turnaround, higher conversion, etc.).
  • My entire team shares a consistent understanding of “what does good look like?” at a high level.

Break Down the AI Tasks (“Macro” vs. “Micro”)

Goal

Decompose the larger AI workflow (“macro”) into smaller tasks or “testable” components (“micro”) to pinpoint what, exactly, you’ll evaluate.

Questions to Answer

  • What is the end-to-end workflow, and where does AI fit in?
    • Example: “AI checks inventory, then suggests shipping route. Operator confirms the route and executes the shipment.”
  • Which tasks are performed by AI?
    • Example: “Document retrieval,” “question answering,” “translation,” “code generation,” “classification,” “summarization.”
  • Which tasks must absolutely be correct vs. which tasks are more flexible or creative?
    • Example: “A shipping label must be exact” (low tolerance for error) vs. “Suggestion text can be approximate” (higher tolerance).

Checklist

  • I have mapped the full operational workflow: from user request → AI inference → final action/decision.
  • I have a list of discrete tasks or steps that the AI performs.
  • I can tag each task with its required level of correctness or robustness (e.g., 95% accuracy for classification, 100% valid code syntax, etc.).
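
One lightweight way to capture this breakdown is a small task registry that tags each “micro” task with its tolerance for error and the evaluator you expect to use. The sketch below is illustrative only: the shipment-routing task names, evaluator labels, and thresholds are hypothetical placeholders.

```python
# Illustrative task registry for a hypothetical shipment-routing workflow.
# Task names, evaluator types, and thresholds are placeholders -- replace with your own.
TASKS = [
    {
        "task": "extract_destination_address",   # "micro" task with no room for error
        "tolerance": "strict",
        "evaluator": "exact_match",
        "pass_threshold": 1.00,                   # 100% of test cases must pass
    },
    {
        "task": "classify_shipment_priority",
        "tolerance": "high",
        "evaluator": "exact_match",
        "pass_threshold": 0.95,                   # e.g., a 95% accuracy target
    },
    {
        "task": "draft_routing_rationale",        # free-text explanation shown to the operator
        "tolerance": "flexible",
        "evaluator": "llm_judge",
        "pass_threshold": 0.80,
    },
]

def strict_tasks(tasks: list[dict]) -> list[str]:
    """Return the tasks that must be exactly correct."""
    return [t["task"] for t in tasks if t["tolerance"] == "strict"]
```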

Specify “What Does Good Look Like?” for Each Task

Goal

Translate “success” into more detailed criteria that capture both syntax (format, structure) and semantics (correctness, relevance).

Questions to Answer

  • Syntax Requirements
    • Does the AI output need to follow a specific format (e.g., JSON schema, function call signature, code snippet that compiles)? (See the sketch after this list for an automated check.)
    • Are there constraints like max length or specific language?
  • Semantic Requirements
    • Does the response need to be factually correct, or is creative latitude allowed?
    • Is it relevant and comprehensive for the user’s question?
  • Quality and Style Requirements
    • Does the text need to be friendly, neutral, concise, etc.?
    • Are there domain-specific guidelines (e.g., legal disclaimers, brand voice) that must be followed?
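
Many of the syntax requirements above can be checked automatically with a few lines of code. The sketch below is a minimal example, assuming the AI is supposed to return a JSON object with specific keys and a length cap; the field names and limit are hypothetical.

```python
import json

REQUIRED_KEYS = {"route_id", "factory", "rationale"}   # hypothetical schema
MAX_RATIONALE_CHARS = 500                               # hypothetical length limit

def check_syntax(raw_output: str) -> tuple[bool, str]:
    """Return (passed, reason) for basic format and structure requirements."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(data, dict):
        return False, "output is not a JSON object"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing required keys: {sorted(missing)}"
    if len(str(data.get("rationale", ""))) > MAX_RATIONALE_CHARS:
        return False, "rationale exceeds max length"
    return True, "ok"

# Example:
# check_syntax('{"route_id": "R-12", "factory": "ATL-2", "rationale": "Closest stock."}')
```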

Checklist

  • For each task, I’ve listed 2–3 “semantic” criteria (like correctness, completeness, tone).
  • For each task, I’ve listed any strict “syntax” or format rules.
  • I know how to check if the output meets these criteria (some might be automated; some might need domain-expert input).

Collect or Create Reference Data (“Ground Truth”) Where Possible

Goal

Build a set of test examples with known or expected outputs for straightforward tasks, or well-understood “gold standard” references for creative tasks.

Methods to Consider

  • Open-Source Benchmarks
    • Good for standard tasks (e.g., summarization, QA) to ensure your model’s “baseline” performance is intact.
  • Unit Tests Written by Subject Matter Experts
    • Manually crafted “edge case” or “typical scenario” examples that define a correct output.
  • Historical Data
    • If automating an existing workflow, gather real past inputs and their correct decisions or final outcomes.
  • Synthetic Data
    • Use an LLM to create plausible but labeled examples if real data is limited or missing.
    • Example: “Generate user queries and the expected correct classification label.”
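
If you go the synthetic-data route, the sketch below shows one possible pattern: prompt a model for labeled examples, then keep only the well-formed ones. The `call_llm()` helper is a placeholder for whatever model client you use, and the label set is hypothetical.

```python
import json

LABELS = ["billing", "shipping", "returns", "other"]   # hypothetical label set

PROMPT = (
    "Generate one realistic customer-support query and its correct label.\n"
    f"Allowed labels: {LABELS}\n"
    'Respond as JSON: {"query": "...", "label": "..."}'
)

def call_llm(prompt: str) -> str:
    """Placeholder for your model client (OpenAI, Anthropic, an internal gateway, etc.)."""
    raise NotImplementedError

def generate_synthetic_examples(n: int) -> list[dict]:
    """Collect up to n well-formed, in-label synthetic examples."""
    examples = []
    for _ in range(n):
        try:
            record = json.loads(call_llm(PROMPT))
        except json.JSONDecodeError:
            continue                                    # skip malformed generations
        if isinstance(record, dict) and record.get("label") in LABELS:
            examples.append(record)
    return examples
```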

Checklist

  • I have at least a small set of test examples that reflect real use cases.
  • I know which tasks these examples “cover”—and I have edge cases or adversarial examples.
  • (Optional) I have a plan for using LLMs or other means to expand my dataset if needed.

Design Evaluation Methods

Goal

Match each task (from the “Break Down the AI Tasks” step above) to an appropriate evaluation approach or metric. You’ll likely use a mix of reference-based comparisons (when ground truth is available) and reference-free checks (like heuristics or LLM-as-a-Judge).

Potential Evaluators

  • Deterministic or Reference-Based Evaluators
    • Exact string match, fuzzy matching (Levenshtein distance), numeric tolerance checks, regular expression checks.
    • Great for syntax validation, verifying top-1 classification, or checking that a known field is correct (see the sketch after this list).
  • LLM-as-a-Judge
    • A second LLM “critiques” the main model’s output based on your pass/fail criteria.
    • Especially useful where no single “correct” reference exists (e.g., summarization, code style).
  • Human Judgments
    • Domain experts mark outputs as pass/fail and note a brief critique.
    • This yields the “gold standard” against which to calibrate your automated evaluations.
  • Perturbation Testing
    • Introduce typos, synonyms, or demographic changes to inputs. Evaluate the model’s resilience and check for potential biases.
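
As a starting point, the deterministic checks above are often just small functions. The sketch below sticks to Python’s standard library (`difflib` as a stand-in for a dedicated Levenshtein package, `re` for pattern checks, `math.isclose` for numeric tolerance); the thresholds and the example pattern are arbitrary.

```python
import difflib
import math
import re

def exact_match(output: str, reference: str) -> bool:
    """Strict check after trimming whitespace and normalizing case."""
    return output.strip().lower() == reference.strip().lower()

def fuzzy_match(output: str, reference: str, threshold: float = 0.9) -> bool:
    """Similarity check; difflib's ratio stands in for Levenshtein-style distance here."""
    ratio = difflib.SequenceMatcher(
        None, output.strip().lower(), reference.strip().lower()
    ).ratio()
    return ratio >= threshold          # the 0.9 threshold is an arbitrary example

def numeric_close(output: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Numeric tolerance check: passes within 1% relative error (adjust per task)."""
    return math.isclose(output, reference, rel_tol=rel_tol)

def matches_pattern(output: str, pattern: str = r"^[A-Z]{3}-\d{4}$") -> bool:
    """Regex check; the default pattern is a hypothetical 'ABC-1234' label format."""
    return re.fullmatch(pattern, output.strip()) is not None
```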

Checklist

  • For each criterion, I’ve decided whether I can use exact match or whether I need a more flexible check.
  • I know whether an LLM-as-a-Judge or domain-expert pass/fail step is needed.
  • I have a plan to run multiple trials if the AI has non-deterministic outputs (especially for generative tasks).
  • I’m considering “robustness” tests (typos, synonyms, etc.) if real-world data is messy.

Implement “LLM-as-a-Judge” Carefully (If Relevant)

Goal

Leverage a second LLM to evaluate outputs on more subjective or open-ended tasks.

Tips

  • Prompt Design
    • Provide multi-shot examples of “good vs. bad” outputs and how to critique them.
    • Use a binary pass/fail style to keep the results actionable (see the sketch after these tips).
  • Cost & Performance
    • LLM calls can be expensive and slow. Consider using them selectively or on a representative sample of your data.
    • Start with a small, curated set of examples from real user data or known tough scenarios.
  • Periodically Check the Judge
    • Validate the judge’s alignment with a domain expert’s judgments.
    • Tweak prompts or instructions if you see consistent misjudgments.
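
A judge can be as simple as a prompt template plus a strict parser for the verdict. The sketch below is one possible shape, assuming a hypothetical `call_llm()` helper and a summarization task; the criteria and the pass/fail examples are placeholders to replace with your own.

```python
JUDGE_PROMPT = """You are grading an AI-generated summary against the criteria below.
Criteria:
1. Every claim is supported by the source text (no hallucinated facts).
2. The summary covers the main points.
3. The tone is neutral and concise.

Example of a FAIL: a summary that cites a meeting date not present in the source.
Example of a PASS: a shorter summary that omits minor details but misstates nothing.

Source text:
{source}

Summary to grade:
{summary}

Answer with exactly one line:
PASS: <one-sentence critique>  or  FAIL: <one-sentence critique>
"""

def call_llm(prompt: str) -> str:
    """Placeholder for your model client."""
    raise NotImplementedError

def judge(source: str, summary: str) -> tuple[bool, str]:
    """Return (passed, critique) parsed from the judge model's one-line verdict."""
    verdict = call_llm(JUDGE_PROMPT.format(source=source, summary=summary)).strip()
    passed = verdict.upper().startswith("PASS")
    critique = verdict.split(":", 1)[1].strip() if ":" in verdict else verdict
    return passed, critique
```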

Checklist

  • I have a “judge prompt” with examples that show how to evaluate correctness, completeness, style, etc.
  • The LLM judge output is strictly yes/no (pass/fail) or “pass with critique” vs. “fail with critique” for clarity.
  • I track agreement between the judge and an actual domain expert.
  • I know the cost/latency trade-offs for running LLM-based evals at scale.

Measure Robustness, Non-Determinism, and Bias

Goal

Ensure your AI remains stable across variations of inputs and identify potential biases or fairness issues.

Techniques

  • Multiple Runs / Temperature Sweeps
    • Re-run each test input N times and calculate how often the model meets your pass criteria (see the sketch after this list).
    • Look for outliers or inconsistent performance.
  • Perturbation Testing (again)
    • Expand your test data with controlled changes (typos, synonyms, demographic swaps) to gauge how the model fails or changes answers.
  • Bias & Fairness Checks
    • Evaluate if the AI performance differs significantly across different demographic inputs.
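
The sketch below shows both ideas in miniature: a repeated-run pass rate for non-deterministic outputs and a crude typo perturbation. `run_model` and `passes` are stand-ins for your own generation and evaluation functions.

```python
import random

def pass_rate(run_model, passes, test_input, n: int = 10) -> float:
    """Run the (possibly non-deterministic) model n times; return the fraction of passing outputs.

    run_model(test_input) -> output; passes(output) -> bool. Both are stand-ins for your own code.
    """
    results = [passes(run_model(test_input)) for _ in range(n)]
    return sum(results) / n

def add_typos(text: str, n_typos: int = 2, seed: int = 0) -> str:
    """Crude perturbation: swap a few adjacent characters to simulate typos."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_typos):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Example: compare pass_rate(...) on an original input vs. add_typos(original_input)
# to see how much a little noise degrades performance.
```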

Checklist

  • I run each test input multiple times if the model is non-deterministic.
  • I have at least a small “perturbed” test set that includes random typos or demographic changes.
  • I’ve noted any potential biases or fairness concerns and have a plan to address them (e.g., specialized test sets, domain expert reviews).

Analyze Results & Iterate

Goal

Aggregate your evaluation metrics, see where failures cluster, and fix them systematically.

Steps

  • Create an Evaluation Dashboard
    • Summaries: pass rate across tasks/features, confusion matrices, distribution of error types (see the sketch after these steps).
  • Error Analysis
    • Sample from failed outputs. Work with domain experts to label the root cause (e.g., “lack of user context,” “wrong function call,” “hallucinated references”).
  • Fix & Retest
    • Adjust prompts, add constraints, refine training data, or revise tool-calling logic.
    • Re-run the same tests to confirm improvement.
  • Decide on Next Steps
    • If the model is good enough, proceed to pilot or A/B testing.
    • Otherwise, keep iterating, gathering more data or exploring fine-tuning, specialized guardrails, etc.
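
Even before a full dashboard, a simple aggregation over your eval results goes a long way toward the pass rates and root-cause counts described above. The sketch below assumes each result is a dict with hypothetical `task`, `passed`, and `root_cause` fields.

```python
from collections import Counter, defaultdict

def summarize(results: list[dict]) -> None:
    """results: [{"task": ..., "passed": bool, "root_cause": str or None}, ...] (hypothetical shape)."""
    by_task = defaultdict(list)
    for r in results:
        by_task[r["task"]].append(r["passed"])

    print("Pass rate by task:")
    for task, outcomes in sorted(by_task.items()):
        print(f"  {task}: {sum(outcomes)}/{len(outcomes)} ({sum(outcomes) / len(outcomes):.0%})")

    causes = Counter(r["root_cause"] for r in results if not r["passed"] and r.get("root_cause"))
    print("Top failure root causes:")
    for cause, count in causes.most_common(5):
        print(f"  {cause}: {count}")
```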

Checklist

  • I have a mechanism (dashboard, spreadsheet, logs) to see pass/fail rates.
  • I read through a sample of fails to classify root causes.
  • I made changes (prompt engineering, training, RAG approach, etc.) to address top failure modes.
  • I re-ran my tests to see if I improved performance.

Deployment, Monitoring & Maintenance

Goal

Once your AI is in production, continue monitoring and evaluating it to catch regressions and drift.

Ongoing Steps

  • Live Feedback & Logging
    • Capture real user interactions (traces), especially failures or escalations.
    • Periodically label them and feed them back into your test suite.
  • Drift Monitoring
    • Watch if model performance degrades over time or if user behavior changes.
    • Re-run your test suite on a schedule or after major updates (see the sketch after this list).
  • Expansion
    • As your use cases grow, add new tests or scenarios.
    • Create specialized or more granular LLM judges if certain tasks become critical.
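
One lightweight pattern for the scheduled re-runs above is to store a baseline pass rate per task and flag any run that drops meaningfully below it. The sketch below is a minimal version; the 5-percentage-point threshold and the data shapes are arbitrary examples.

```python
def check_for_regressions(baseline: dict[str, float],
                          current: dict[str, float],
                          max_drop: float = 0.05) -> list[str]:
    """Compare per-task pass rates from a scheduled re-run against a stored baseline.

    baseline/current map task name -> pass rate (0.0-1.0). Flags tasks that dropped
    by more than max_drop (5 percentage points here, an arbitrary example).
    """
    regressions = []
    for task, base_rate in baseline.items():
        new_rate = current.get(task)
        if new_rate is not None and base_rate - new_rate > max_drop:
            regressions.append(f"{task}: {base_rate:.0%} -> {new_rate:.0%}")
    return regressions

# Example: alert (or open a ticket) whenever check_for_regressions(...) returns anything.
```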

Checklist

  • I have a plan for logging real-world usage (including final responses, error codes, user feedback).
  • I set a schedule or triggers (e.g., monthly, new model release) to re-run the entire evaluation suite.
  • I have a strategy for building new tests and fix-forward improvements as the model or data evolves.

Final Tips & Recap

  • Keep It Simple
    • Start with the most critical tasks and basic pass/fail. Don’t chase dozens of metrics before you have the fundamentals.
  • Engage Domain Experts
    • They define “good” in practice. Build your test data and “LLM-as-a-Judge” prompts around their guidance.
  • Look at the Data
    • The real value often comes from systematically inspecting test failures and seeing how the AI actually behaves.
  • Iterate
    • Testing and evaluation is not one-and-done. Build a routine for continuous refinement.

Closing Notes

  • Start small: Even a handful of high-value test cases can surface big issues early.
  • Refine often: As you learn from real user data, keep updating your test sets and evaluation plan.
  • Keep it actionable: Use pass/fail judgments and short critiques rather than complex, subjective numeric scales—this gives you a direct path to improvements.
