AI Evaluation & Testing
Unit tests check exact outputs. LLMs are non-deterministic — same input, different output every time. You need evals: automated quality checks using rubrics, LLM-as-judge, and eval datasets. You'll learn eval-driven development — write evals first, then iterate prompts to pass them.
Use this at work tomorrow
Write 10 eval cases for any AI feature your team ships — catch prompt regressions before users do.
Learning Objectives
1. Understand why `expect(output).toBe(exact)` fails for AI
2. Build eval datasets from real user queries and expert labels
3. Implement LLM-as-judge evaluation with scoring rubrics
4. Set up eval-driven development: write evals → iterate prompts → measure
5. Create an eval suite that catches regressions in your Day 3 RAG app
Ship It: Eval suite for your RAG app
By the end of this day, you'll build and deploy an eval suite for your RAG app. This isn't a toy — it's a real project for your portfolio.
I can build an automated eval pipeline using LLM-as-judge and Promptfoo to catch AI regressions before deploying.
What's the #1 reason AI projects fail in production?
Week 2: Production AI — Evals Are Your Test Suite
Week 1 taught you to build AI features. Week 2 teaches you to ship them with confidence. The #1 reason AI projects fail in production isn't bad models — it's no way to measure quality. Evals are your test suite for non-deterministic systems. Without them, every deployment is a coin flip.
Why can't traditional unit tests fully cover AI feature quality?
The Eval Mental Model: Think CI/CD for AI
You already write unit tests and integration tests. AI evals are the same idea: define expected behavior, run the system, check the output. The difference: AI outputs are non-deterministic, so you need fuzzy matching. Instead of assertEquals('hello'), you check 'does the response contain the key information?' or 'is the tone professional?'. This is where LLM-as-judge comes in.
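The shift from exact match to fuzzy checks can be sketched in a few lines. This is a minimal illustration, not part of any testing library — the `containsKeyFacts` helper is an invented name:

```typescript
// Exact string match breaks on non-deterministic output.
// Instead, check that the response contains the key facts, however it's phrased.
// `containsKeyFacts` is an illustrative helper, not a real library function.
function containsKeyFacts(response: string, keyFacts: string[]): boolean {
  const normalized = response.toLowerCase();
  return keyFacts.every((fact) => normalized.includes(fact.toLowerCase()));
}

// Both phrasings pass, even though the strings are not equal:
containsKeyFacts("The refund window is 30 days.", ["refund", "30 days"]); // true
containsKeyFacts("You have 30 days to request a refund.", ["refund", "30 days"]); // true
```

Substring checks like this are the cheapest eval tier; they catch missing facts but not wrong reasoning, which is where LLM-as-judge takes over.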
What replaces assertEquals() in AI evals?
In LLM-as-judge, which model grades the outputs?
LLM-as-Judge: Using AI to Test AI
The most powerful eval technique: use a strong model (GPT-4o) to grade outputs from your production model (GPT-4o-mini). Define rubrics with criteria and scores. The judge model evaluates each criterion. This scales better than manual review and catches regressions automatically. It's not perfect — judge models have biases — but it's 10x better than 'looks good to me' testing.
What's a known limitation of LLM-as-judge?
Types of Evals: What to Test
Core eval types: (1) Factuality — is the answer correct given the context? (2) Relevance — does the answer address the question? (3) Faithfulness — does the answer only use provided context (no hallucination)? (4) Harmfulness — does it contain unsafe content? (5) Style — does it match your tone/format requirements? Start with factuality and faithfulness for RAG systems. Add others as you find failure modes.
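One way to make those eval types concrete is to model the rubric as data and gate on a threshold. A minimal sketch — the type names and threshold are illustrative assumptions, not a standard API:

```typescript
// A scoring rubric modeled as data. Names and shapes are illustrative.
type Criterion = "factuality" | "relevance" | "faithfulness";

interface RubricScore {
  criterion: Criterion;
  score: number; // 1-5, assigned by the judge model
}

// Threshold-based pass/fail: every criterion must clear the bar.
function passes(scores: RubricScore[], threshold = 4): boolean {
  return scores.every((s) => s.score >= threshold);
}

passes([
  { criterion: "factuality", score: 5 },
  { criterion: "faithfulness", score: 3 },
]); // false — faithfulness is below the threshold
```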
If you run the same eval suite twice, will scores be identical?
Building an Eval Pipeline with Promptfoo
Promptfoo is the open-source standard for LLM evals. It runs your prompts against test datasets and scores results. Think of it as Jest/Vitest for AI. You define test cases in YAML, run `promptfoo eval`, and get a report showing pass/fail rates. Integrate it into CI to catch regressions before deployment. Every serious AI team uses something like this.
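A sketch of what a Promptfoo config might look like for a RAG eval. The prompt, provider, and test values here are illustrative; verify field names and assertion types against the current Promptfoo docs:

```yaml
# promptfooconfig.yaml — illustrative sketch
description: "RAG answer quality"
prompts:
  - "Answer using only this context:\n{{context}}\n\nQuestion: {{question}}"
providers:
  - openai:gpt-4o-mini # the production model under test
tests:
  - vars:
      question: "What is the refund window?"
      context: "Refunds are accepted within 30 days of purchase."
    assert:
      - type: contains # cheap deterministic check
        value: "30 days"
      - type: llm-rubric # LLM-as-judge check, graded by a model
        value: "The answer uses only information from the provided context"
```

Running `promptfoo eval` against a file like this produces the pass/fail report; `contains` assertions are free, while `llm-rubric` assertions cost a judge-model call each.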
Where should you integrate Promptfoo eval runs?
The Full Evolution
Watch one function evolve through every concept you just learned.
Production Gotchas
Golden rule: build your eval set from real user queries, not imagined ones. Production failures are always weirder than your test cases. Start with 20-50 eval cases covering happy path + known failure modes. LLM-as-judge costs money — run full evals on PRs, not every commit. Eval scores will fluctuate 2-5% between runs (non-determinism) — set thresholds with margin. Version your prompts alongside eval results so you can correlate changes.
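The "thresholds with margin" advice can be sketched as a simple gate. The target and margin values below are illustrative, not recommendations:

```typescript
// Eval scores fluctuate between runs, so gate on a threshold with margin
// rather than comparing against the last run exactly. Numbers are illustrative.
const TARGET = 4.0;  // quality bar on a 1-5 scale
const MARGIN = 0.15; // absorbs run-to-run non-determinism noise

function isRegression(avgScore: number): boolean {
  return avgScore < TARGET - MARGIN;
}

isRegression(3.9); // false — within the noise margin
isRegression(3.7); // true — flag it and block the deploy
```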
Code Comparison
Unit Tests vs AI Evals
Deterministic testing vs non-deterministic AI evaluation
// Testing a deterministic function
describe("calculateTotal", () => {
  it("sums items correctly", () => {
    const result = calculateTotal([
      { price: 10.00 },
      { price: 5.50 },
    ]);
    // Exact match — always the same output
    expect(result).toBe(15.50);
  });

  it("applies discount", () => {
    const result = calculateTotal(
      [{ price: 100 }],
      { discount: 0.1 }
    );
    expect(result).toBe(90.00);
  });
});

// Deterministic: same input → same output
// Binary: pass or fail, nothing in between

// Testing a non-deterministic AI system
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

async function evalResponse(
  question: string,
  aiAnswer: string,
  groundTruth: string
) {
  const { object } = await generateObject({
    model: openai("gpt-4o"), // Strong judge
    schema: z.object({
      factuality: z.number().min(1).max(5),
      relevance: z.number().min(1).max(5),
      reasoning: z.string(),
    }),
    prompt: `Grade this AI response:
Question: ${question}
Expected: ${groundTruth}
Actual: ${aiAnswer}
Score 1-5 on factuality and relevance.`,
  });
  return object;
}

// Run across eval dataset
const results = await Promise.all(
  evalSet.map(async ({ q, expected }) =>
    evalResponse(q, await getAIAnswer(q), expected)
  )
);
const avgScore = avg(results.map((r) => r.factuality));
assert(avgScore >= 4.0, "Quality regression!");

KEY DIFFERENCES
- Unit tests: exact match, deterministic, binary pass/fail
- AI evals: fuzzy scoring (1-5), non-deterministic, threshold-based
- LLM-as-judge: use GPT-4o to grade GPT-4o-mini outputs
- Run evals as CI checks to catch regressions before deploy
Bridge Map: Unit tests + CI/CD → Evals + eval-driven development
Click any bridge to see the translation
Hands-On Challenges
Build, experiment, and get AI-powered feedback on your code.
AI Eval Dashboard
Build and deploy an evaluation dashboard for your Day 7 capstone (or any AI feature). Create a golden eval dataset, implement LLM-as-judge scoring, and build a dashboard that tracks quality over time. This is how production AI teams prevent regressions.
Acceptance Criteria
- Create a golden eval dataset with 20+ question/expected-answer pairs
- Implement LLM-as-judge scoring on factuality, relevance, and faithfulness
- Run the full eval suite and compute pass rates and average scores
- Build a visual dashboard showing scores, per-question breakdown, and trends
- Support multiple eval runs with timestamp-based comparison
- Identify and highlight the weakest-performing test cases
- Deploy to a public URL (Vercel, Netlify, etc.)
Build Roadmap
Create a new Next.js app with TypeScript and Tailwind CSS. Plan the architecture: eval dataset storage, judge API, results database, and dashboard UI.
npx create-next-app@latest ai-eval-dashboard --typescript --tailwind --app
Create folders: /data/evals (golden datasets), /lib/judge (scoring logic), /app/dashboard
Deploy Tip
Push to GitHub and import into Vercel. Pre-load sample eval results so the dashboard has data on first visit. This project shows hiring managers you build quality infrastructure, not just features.