AI Evaluation & Testing
Unit tests check exact outputs. LLMs are non-deterministic — same input, different output every time. You need evals: automated quality checks using rubrics, LLM-as-judge, and eval datasets. You'll learn eval-driven development — write evals first, then iterate prompts to pass them.
Use this at work tomorrow
Write 10 eval cases for any AI feature your team ships — catch prompt regressions before users do.
Learning Objectives
1. Understand why `expect(output).toBe(exact)` fails for AI
2. Build eval datasets from real user queries and expert labels
3. Implement LLM-as-judge evaluation with scoring rubrics
4. Set up eval-driven development: write evals → iterate prompts → measure
5. Create an eval suite that catches regressions in your Day 3 RAG app
Ship It: Eval suite for your RAG app
By the end of this day, you'll build and deploy an eval suite for your RAG app. This isn't a toy — it's a real project for your portfolio.
I can build an automated eval pipeline using LLM-as-judge and Promptfoo to catch AI regressions before deploying.
What's the #1 reason AI projects fail in production?
Week 2: Production AI — Evals Are Your Test Suite
Week 1 taught you to build AI features. Week 2 teaches you to ship them with confidence. The #1 reason AI projects fail in production isn't bad models — it's no way to measure quality. Evals are your test suite for non-deterministic systems. Without them, every deployment is a coin flip.
Why can't traditional unit tests fully cover AI feature quality?
The Eval Mental Model: Think CI/CD for AI
You already write unit tests and integration tests. AI evals are the same idea: define expected behavior, run the system, check the output. The difference: AI outputs are non-deterministic, so you need fuzzy matching. Instead of assertEquals('hello'), you check 'does the response contain the key information?' or 'is the tone professional?'. This is where LLM-as-judge comes in.
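The shift from exact match to fuzzy checks can be sketched in a few lines. This is a minimal illustration, not part of any testing library — the `containsKeyFacts` helper is an invented name:

```typescript
// Exact string match breaks on non-deterministic output.
// Instead, check that the response contains the key facts, however it's phrased.
// `containsKeyFacts` is an illustrative helper, not a real library function.
function containsKeyFacts(response: string, keyFacts: string[]): boolean {
  const normalized = response.toLowerCase();
  return keyFacts.every((fact) => normalized.includes(fact.toLowerCase()));
}

// Both phrasings pass, even though the strings are not equal:
containsKeyFacts("The refund window is 30 days.", ["refund", "30 days"]); // true
containsKeyFacts("You have 30 days to request a refund.", ["refund", "30 days"]); // true
```

Substring checks like this are the cheapest eval tier; they catch missing facts but not wrong reasoning, which is where LLM-as-judge takes over.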
What replaces assertEquals() in AI evals?
In LLM-as-judge, which model grades the outputs?
LLM-as-Judge: Using AI to Test AI
The most powerful eval technique: use a strong model (GPT-4o) to grade outputs from your production model (GPT-4o-mini). Define rubrics with criteria and scores. The judge model evaluates each criterion. This scales better than manual review and catches regressions automatically. It's not perfect — judge models have biases — but it's 10x better than 'looks good to me' testing.
What's a known limitation of LLM-as-judge?
Types of Evals: What to Test
Core eval types: (1) Factuality — is the answer correct given the context? (2) Relevance — does the answer address the question? (3) Faithfulness — does the answer only use provided context (no hallucination)? (4) Harmfulness — does it contain unsafe content? (5) Style — does it match your tone/format requirements? Start with factuality and faithfulness for RAG systems. Add others as you find failure modes.
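One way to make those eval types concrete is to model the rubric as data and gate on a threshold. A minimal sketch — the type names and threshold are illustrative assumptions, not a standard API:

```typescript
// A scoring rubric modeled as data. Names and shapes are illustrative.
type Criterion = "factuality" | "relevance" | "faithfulness";

interface RubricScore {
  criterion: Criterion;
  score: number; // 1-5, assigned by the judge model
}

// Threshold-based pass/fail: every criterion must clear the bar.
function passes(scores: RubricScore[], threshold = 4): boolean {
  return scores.every((s) => s.score >= threshold);
}

passes([
  { criterion: "factuality", score: 5 },
  { criterion: "faithfulness", score: 3 },
]); // false — faithfulness is below the threshold
```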
If you run the same eval suite twice, will scores be identical?
Building an Eval Pipeline with Promptfoo
Promptfoo is the open-source standard for LLM evals. It runs your prompts against test datasets and scores results. Think of it as Jest/Vitest for AI. You define test cases in YAML, run `promptfoo eval`, and get a report showing pass/fail rates. Integrate it into CI to catch regressions before deployment. Every serious AI team uses something like this.
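A sketch of what a Promptfoo config might look like for a RAG eval. The prompt, provider, and test values here are illustrative; verify field names and assertion types against the current Promptfoo docs:

```yaml
# promptfooconfig.yaml — illustrative sketch
description: "RAG answer quality"
prompts:
  - "Answer using only this context:\n{{context}}\n\nQuestion: {{question}}"
providers:
  - openai:gpt-4o-mini # the production model under test
tests:
  - vars:
      question: "What is the refund window?"
      context: "Refunds are accepted within 30 days of purchase."
    assert:
      - type: contains # cheap deterministic check
        value: "30 days"
      - type: llm-rubric # LLM-as-judge check, graded by a model
        value: "The answer uses only information from the provided context"
```

Running `promptfoo eval` against a file like this produces the pass/fail report; `contains` assertions are free, while `llm-rubric` assertions cost a judge-model call each.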
Where should you integrate Promptfoo eval runs?
The Full Evolution
Watch one function evolve through every concept you just learned.
Production Gotchas
Golden rule: build your eval set from real user queries, not imagined ones. Production failures are always weirder than your test cases. Start with 20-50 eval cases covering happy path + known failure modes. LLM-as-judge costs money — run full evals on PRs, not every commit. Eval scores will fluctuate 2-5% between runs (non-determinism) — set thresholds with margin. Version your prompts alongside eval results so you can correlate changes.
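The "thresholds with margin" advice can be sketched as a simple gate. The target and margin values below are illustrative, not recommendations:

```typescript
// Eval scores fluctuate between runs, so gate on a threshold with margin
// rather than comparing against the last run exactly. Numbers are illustrative.
const TARGET = 4.0;  // quality bar on a 1-5 scale
const MARGIN = 0.15; // absorbs run-to-run non-determinism noise

function isRegression(avgScore: number): boolean {
  return avgScore < TARGET - MARGIN;
}

isRegression(3.9); // false — within the noise margin
isRegression(3.7); // true — flag it and block the deploy
```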
Code Comparison
Unit Tests vs AI Evals
Deterministic testing vs non-deterministic AI evaluation
// Testing a deterministic function
describe("calculateTotal", () => {
  it("sums items correctly", () => {
    const result = calculateTotal([
      { price: 10.00 },
      { price: 5.50 },
    ]);
    // Exact match — always the same output
    expect(result).toBe(15.50);
  });

  it("applies discount", () => {
    const result = calculateTotal(
      [{ price: 100 }],
      { discount: 0.1 }
    );
    expect(result).toBe(90.00);
  });
});

// Deterministic: same input → same output
// Binary: pass or fail, nothing in between

// Testing a non-deterministic AI system
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

async function evalResponse(
  question: string,
  aiAnswer: string,
  groundTruth: string
) {
  const { object } = await generateObject({
    model: openai("gpt-4o"), // Strong judge
    schema: z.object({
      factuality: z.number().min(1).max(5),
      relevance: z.number().min(1).max(5),
      reasoning: z.string(),
    }),
    prompt: `Grade this AI response:
Question: ${question}
Expected: ${groundTruth}
Actual: ${aiAnswer}
Score 1-5 on factuality and relevance.`,
  });
  return object;
}

// Run across eval dataset
const results = await Promise.all(
  evalSet.map(async ({ q, expected }) =>
    evalResponse(q, await getAIAnswer(q), expected)
  )
);
const avgScore = avg(results.map((r) => r.factuality));
assert(avgScore >= 4.0, "Quality regression!");

KEY DIFFERENCES
- Unit tests: exact match, deterministic, binary pass/fail
- AI evals: fuzzy scoring (1-5), non-deterministic, threshold-based
- LLM-as-judge: use GPT-4o to grade GPT-4o-mini outputs
- Run evals as CI checks to catch regressions before deploy
Bridge Map: Unit tests + CI/CD → Evals + eval-driven development
Click any bridge to see the translation
Hands-On Challenges
Build, experiment, and get AI-powered feedback on your code.
AI Eval Dashboard
Build and deploy an evaluation dashboard for your Day 7 capstone (or any AI feature). Create a golden eval dataset, implement LLM-as-judge scoring, and build a dashboard that tracks quality over time. This is how production AI teams prevent regressions.
Acceptance Criteria
- Create a golden eval dataset with 20+ question/expected-answer pairs
- Implement LLM-as-judge scoring on factuality, relevance, and faithfulness
- Run the full eval suite and compute pass rates and average scores
- Build a visual dashboard showing scores, per-question breakdown, and trends
- Support multiple eval runs with timestamp-based comparison
- Identify and highlight the weakest-performing test cases
- Deploy to a public URL (Vercel, Netlify, etc.)
Build Roadmap
Create a new Next.js app with TypeScript and Tailwind CSS. Plan the architecture: eval dataset storage, judge API, results database, and dashboard UI.
npx create-next-app@latest ai-eval-dashboard --typescript --tailwind --app
Create folders: /data/evals (golden datasets), /lib/judge (scoring logic), /app/dashboard
Deploy Tip
Push to GitHub and import into Vercel. Pre-load sample eval results so the dashboard has data on first visit. This project shows hiring managers you build quality infrastructure, not just features.