Cost Optimization & Multi-Model Strategy
AI API costs can explode overnight. You'll learn the same cost controls you use for infrastructure — caching, routing, and optimization — adapted for AI. Semantic caching, model routing (fast/cheap vs slow/smart), multi-model strategy (GPT-4o vs Claude vs open-source), and token optimization.
Use this at work tomorrow
Add semantic caching to your most-called AI endpoint — watch costs drop immediately.
Learning Objectives
1. Implement semantic caching — cache by meaning, not exact string match
2. Build a model router: route simple queries to cheap models, complex ones to expensive models
3. Design a multi-model strategy: GPT-4o vs Claude vs Gemini vs local (Ollama)
4. Optimize token usage: prompt compression, output limiting, context pruning
5. Ship a multi-model router that reduces costs by 60%+
Ship It: Multi-model router with caching
By the end of this day, you'll build and deploy a multi-model router with caching. This isn't a toy — it's a real project for your portfolio.
I can implement semantic caching, model routing, and token optimization to cut AI costs by 50-80%.
If 10,000 users each make 10 GPT-4o calls/day, what's the monthly cost?
Cost Optimization: AI Bills Grow Fast
A single GPT-4o call costs ~$0.01. Multiply by 10,000 daily users making 10 requests each — that's $1,000/day, $30,000/month. Without cost controls, AI features will kill your budget. The good news: most teams can cut AI costs 50-80% with caching, model routing, and smart prompt engineering. This is a skill most AI engineers lack.
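The arithmetic above fits in a small helper. This is an illustrative sketch: the function name and the ~$0.01/call figure are assumptions for the example, not official pricing.

```typescript
// Back-of-envelope monthly AI spend: users × calls/day × cost/call × days.
function monthlyCost(
  users: number,
  callsPerUserPerDay: number,
  costPerCallUsd: number,
  daysPerMonth: number = 30
): number {
  return users * callsPerUserPerDay * costPerCallUsd * daysPerMonth;
}

// 10,000 users × 10 calls/day × ~$0.01/call × 30 days
console.log(monthlyCost(10_000, 10, 0.01)); // → 30000 ($30,000/month)
```

Run this against your own traffic numbers before shipping an AI feature, not after the first invoice arrives.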
What's the simplest way to estimate your monthly AI cost?
What cost reduction can semantic caching alone achieve for repeated-question patterns?
Semantic Caching: Don't Ask Twice
If 100 users ask 'How do I reset my password?', do you really need 100 LLM calls? Semantic caching embeds each query and checks if a similar query was recently answered. If cosine similarity > 0.95, return the cached response. For RAG systems, cache at the retrieval layer too. This single technique can cut costs 30-60% for any app with repeated question patterns.
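A minimal sketch of the idea. The `Embedder` interface is an assumption: in production the vectors would come from an embedding API (e.g. a small embedding model) and the cache would live in Redis or a vector store rather than in memory.

```typescript
type Embedder = (text: string) => number[];

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class SemanticCache {
  private entries: { embedding: number[]; response: string }[] = [];
  constructor(private embed: Embedder, private threshold = 0.95) {}

  // Return a cached response if any stored query is similar enough.
  get(query: string): string | null {
    const e = this.embed(query);
    for (const entry of this.entries) {
      if (cosineSimilarity(e, entry.embedding) > this.threshold) {
        return entry.response; // cache hit — no LLM call needed
      }
    }
    return null;
  }

  set(query: string, response: string): void {
    this.entries.push({ embedding: this.embed(query), response });
  }
}
```

On a hit you skip the LLM call entirely; on a miss you call the model, then `set` the result for the next similar query. Tune the threshold against real traffic: too low and users get someone else's answer, too high and you never hit.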
Why use semantic similarity (not exact match) for caching LLM queries?
Multi-Model Strategy: Right Model for the Right Job
GPT-4o for everything is like using a Ferrari for grocery runs. Route requests to the cheapest model that handles them well: GPT-4o-mini for simple Q&A (~10x cheaper), GPT-4o for complex reasoning, Claude for long documents, local models for classification. Build a router that classifies the request complexity and routes accordingly. This alone cuts costs 40-70%.
What's the cost difference between GPT-4o-mini and GPT-4o?
When a model router can't classify request complexity, which model should it default to?
The Model Router Pattern
A model router uses a fast classifier (even regex or keyword matching) to route requests: simple → GPT-4o-mini, complex → GPT-4o, creative → Claude, code → GPT-4o. You can even use a small LLM to classify request complexity — the classification call costs 0.1% of the main call. Start simple (keyword-based), upgrade to LLM-based routing when you have data on what kinds of queries your users ask.
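One detail worth making explicit: when classification fails, fail toward the capable model. A sketch, where the injected `classify` stands in for whatever keyword or LLM classifier you use (the names here are illustrative):

```typescript
type Tier = "simple" | "complex";

// If the classifier throws (timeout, malformed output, etc.), assume
// "complex": a broken router should degrade cost, never answer quality.
function routeWithFallback(
  message: string,
  classify: (m: string) => Tier
): Tier {
  try {
    return classify(message);
  } catch {
    return "complex"; // fail open to the expensive, capable model
  }
}
```

This makes the failure mode a cost spike you can see on a dashboard, rather than silent bad answers you discover from user complaints.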
Token Optimization: Say More with Less
Every token costs money. Techniques: (1) Trim system prompts — most are too verbose. (2) Use structured output to avoid parsing tokens. (3) Set max_tokens to limit response length. (4) Summarize conversation history instead of sending full context. (5) Use tiered context — only include relevant information. A 2000-token system prompt that could be 500 tokens costs 4x more on every single call.
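Technique (4) can be sketched as follows. `summarize` is an assumed helper (in production, likely a cheap model call), and the message shape mirrors typical chat APIs:

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Keep the system prompt and the last N turns; collapse everything
// older into a single summary message to cap context size.
function pruneHistory(
  messages: Msg[],
  keepLast: number,
  summarize: (older: Msg[]) => string
): Msg[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  if (rest.length <= keepLast) return messages;
  const older = rest.slice(0, rest.length - keepLast);
  const recent = rest.slice(-keepLast);
  return [
    ...system,
    { role: "assistant", content: `Summary of earlier conversation: ${summarize(older)}` },
    ...recent,
  ];
}
```

Without pruning, a long chat session sends the entire history on every turn, so per-call cost grows linearly with conversation length.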
Which token optimization has the biggest per-call impact?
Production Gotchas
Track cost per user, per feature, and per model. Alert on spikes. Set hard spending limits in your OpenAI dashboard. One runaway loop can cost thousands in minutes. Caching has correctness tradeoffs — cached answers may be stale or wrong for slightly different questions. Test cache hit quality. When model routing fails, default to the expensive model, not the cheap one: a broken router that uses GPT-4o for everything is expensive but correct, while a broken router that uses GPT-4o-mini for everything is cheap but may produce bad answers.
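A minimal per-user spend guard to make "hard limits" concrete. This is illustrative: the class name and cents-based accounting are assumptions, and a real deployment would persist spend and combine this with provider-side limits.

```typescript
// Track spend in integer cents to avoid floating-point drift,
// and refuse new calls once a user hits the daily limit.
class CostTracker {
  private spentCents = new Map<string, number>();
  constructor(private dailyLimitCents: number) {}

  record(userId: string, costCents: number): void {
    this.spentCents.set(userId, (this.spentCents.get(userId) ?? 0) + costCents);
  }

  allow(userId: string): boolean {
    return (this.spentCents.get(userId) ?? 0) < this.dailyLimitCents;
  }
}
```

Check `allow()` before each model call and `record()` after it; a runaway loop then stops at the limit instead of at the invoice.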
Code Comparison
Single Model vs Model Router
Using one expensive model for everything vs routing to the right model
```typescript
// ❌ GPT-4o for everything
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { message } = await req.json();
  // Every request → expensive model
  const result = await generateText({
    model: openai("gpt-4o"), // Cost: ~$0.01 per call
    prompt: message,
  });
  return Response.json({ text: result.text });
}

// 10,000 requests/day
// Cost: $100/day = $3,000/month
// Simple "What are your hours?" → $0.01
// Complex analysis → $0.01
// Same cost regardless of complexity!
```

```typescript
// ✅ Route to the cheapest capable model
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

function classifyComplexity(
  message: string
): "simple" | "medium" | "complex" {
  const len = message.length;
  const hasCode = /```|function |class /.test(message);
  const hasAnalysis = /analyze|compare|explain why/i.test(message);
  if (hasCode || hasAnalysis || len > 500) return "complex";
  if (len > 200) return "medium";
  return "simple";
}

const MODEL_MAP = {
  simple: openai("gpt-4o-mini"),  // ~$0.001 per call
  medium: openai("gpt-4o-mini"),  // ~$0.001 per call
  complex: openai("gpt-4o"),      // ~$0.01 per call
};

export async function POST(req: Request) {
  const { message } = await req.json();
  const complexity = classifyComplexity(message);
  const result = await generateText({
    model: MODEL_MAP[complexity],
    prompt: message,
  });
  return Response.json({
    text: result.text,
    model: complexity, // Track for cost analytics
  });
}

// ~80% simple, ~15% medium, ~5% complex
// Weighted cost: 10,000 × (0.95 × $0.001 + 0.05 × $0.01)
// ≈ $14.50/day ≈ $435/month (~85% savings)
```

Key Differences
- 80% of requests are simple — GPT-4o-mini handles them fine at 10x less cost
- Only route complex requests to expensive models
- Track which model was used for cost analytics
- Start with keyword-based routing, upgrade to LLM-based when needed
Bridge Map: Caching + load balancing → Semantic caching + model routing
Hands-On Challenges
Build, experiment, and get AI-powered feedback on your code.
Multi-Model Router with Caching
Build and deploy a cost-optimized AI API with semantic caching, intelligent model routing, and a cost tracking dashboard. This is the infrastructure layer that makes AI features economically viable at scale.
Acceptance Criteria
- Implement semantic caching that returns cached responses for similar queries (similarity > 0.90)
- Build a model router that routes simple queries to cheap models and complex ones to powerful models
- Track cost per request, per model, and per user in real-time
- Show a dashboard with total spend, cache savings, model distribution, and cost trends
- Support at least 2 models (e.g., GPT-4o-mini for simple, GPT-4o for complex)
- Include prompt optimization that reduces token usage by 30%+
- Deploy to a public URL (Vercel, Netlify, etc.)
Build Roadmap
Create a new Next.js app with TypeScript and Tailwind CSS. Plan the architecture: cache layer → router → model call → cost tracking → dashboard.
npx create-next-app@latest ai-cost-optimizer --typescript --tailwind --app
Create /lib/cache.ts, /lib/router.ts, /lib/cost-tracker.ts as separate modules
Deploy Tip
Push to GitHub and import into Vercel. Pre-load the dashboard with sample cost data so it looks impressive on first visit. Set your OPENAI_API_KEY in Vercel environment variables.