Ship AI to Production · Day 10

Cost Optimization & Multi-Model Strategy

AI API costs can explode overnight. You'll learn the same cost controls you use for infrastructure — caching, routing, and optimization — adapted for AI. Semantic caching, model routing (fast/cheap vs slow/smart), multi-model strategy (GPT-4o vs Claude vs open-source), and token optimization.

80 min (+30 min boss) · Difficulty: ★★★☆☆
Bridge: Caching + load balancing → Semantic caching + model routing

Use this at work tomorrow

Add semantic caching to your most-called AI endpoint — watch costs drop immediately.

Learning Objectives

  • 1Implement semantic caching — cache by meaning, not exact string match
  • 2Build a model router: route simple queries to cheap models, complex to expensive
  • 3Design a multi-model strategy: GPT-4o vs Claude vs Gemini vs local (Ollama)
  • 4Optimize token usage: prompt compression, output limiting, context pruning
  • 5Ship a multi-model router that reduces costs by 60%+

Ship It: Multi-model router with caching

By the end of this day, you'll build and deploy a multi-model router with caching. This isn't a toy — it's a real project for your portfolio.

Before You Start — Rate Your Confidence

I can implement semantic caching, model routing, and token optimization to cut AI costs by 50-80%.

1 = no idea · 5 = ship it blindfolded
Predict First — Then Learn

If 10,000 users each make 10 GPT-4o calls/day, what's the monthly cost?

Cost Optimization: AI Bills Grow Fast

A single GPT-4o call costs ~$0.01. Multiply by 10,000 daily users making 10 requests each — that's $1,000/day, $30,000/month. Without cost controls, AI features will kill your budget. The good news: most teams can cut AI costs 50-80% with caching, model routing, and smart prompt engineering. This is a skill most AI engineers lack.

💡AI costs scale linearly with users. Caching + routing + prompt trimming can cut costs 50-80%.
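The back-of-envelope math above fits in a tiny helper. A sketch, using the ~$0.01/call GPT-4o figure this lesson assumes throughout:

```typescript
// Rough monthly AI spend: users × calls/user/day × cost/call × 30 days.
function estimateMonthlyCost(
  users: number,
  callsPerUserPerDay: number,
  costPerCallUsd: number
): number {
  return users * callsPerUserPerDay * costPerCallUsd * 30;
}

// The lesson's scenario: 10,000 users × 10 GPT-4o calls/day at ~$0.01/call.
const monthly = estimateMonthlyCost(10_000, 10, 0.01);
// → 30000 ($30,000/month)
```

Because cost scales linearly in every factor, halving any one of them (calls per user via caching, cost per call via routing) halves the bill.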
Quick Pulse Check

What's the simplest way to estimate your monthly AI cost?

Predict First — Then Learn

What cost reduction can semantic caching alone achieve for repeated-question patterns?

Semantic Caching: Don't Ask Twice

If 100 users ask 'How do I reset my password?', do you really need 100 LLM calls? Semantic caching embeds each query and checks if a similar query was recently answered. If cosine similarity > 0.95, return the cached response. For RAG systems, cache at the retrieval layer too. This single technique can cut costs 30-60% for any app with repeated question patterns.

💡Semantic caching: embed queries, check similarity > 0.95, return cached response. Cuts costs 30-60%.
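The similarity check at the heart of a semantic cache fits in a few lines. A minimal in-memory sketch: the vectors here are toys, and in a real app you'd embed queries with an embeddings model (e.g. `text-embedding-3-small`) and store vectors in Redis or a vector DB rather than a local array.

```typescript
type CacheEntry = { embedding: number[]; response: string };

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.95) {}

  // Return a cached response if any stored query is similar enough.
  get(queryEmbedding: number[]): string | null {
    for (const entry of this.entries) {
      if (cosineSimilarity(queryEmbedding, entry.embedding) >= this.threshold) {
        return entry.response; // cache hit: no LLM call needed
      }
    }
    return null; // miss: call the LLM, then set()
  }

  set(queryEmbedding: number[], response: string): void {
    this.entries.push({ embedding: queryEmbedding, response });
  }
}
```

With toy 3-dimensional vectors, a nearly parallel query hits the cache (similarity ≈ 0.995 > 0.95) while an orthogonal one misses (similarity 0).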
Quick Pulse Check

Why use semantic similarity (not exact match) for caching LLM queries?


Multi-Model Strategy: Right Model for the Right Job

GPT-4o for everything is like using a Ferrari for grocery runs. Route requests to the cheapest model that handles them well: GPT-4o-mini for simple Q&A (~10x cheaper), GPT-4o for complex reasoning, Claude for long documents, local models for classification. Build a router that classifies the request complexity and routes accordingly. This alone cuts costs 40-70%.

💡Route simple → mini, complex → 4o, long docs → Claude. Right model for the right job cuts costs 40-70%.
Quick Pulse Check

What's the cost difference between GPT-4o-mini and GPT-4o?

Predict First — Then Learn

When a model router can't classify request complexity, which model should it default to?

The Model Router Pattern

A model router uses a fast classifier (even regex or keyword matching) to route requests: simple → GPT-4o-mini, complex → GPT-4o, creative → Claude, code → GPT-4o. You can even use a small LLM to classify request complexity — the classification call costs roughly 0.1% of the main call. Start simple (keyword-based), and upgrade to LLM-based routing once you have data on the kinds of queries your users actually ask.

💡Start with keyword-based routing, upgrade to LLM-based when you have query data. Classification costs 0.1% of the main call.

Token Optimization: Say More with Less

Every token costs money. Techniques: (1) Trim system prompts — most are too verbose. (2) Use structured output to avoid parsing tokens. (3) Set max_tokens to limit response length. (4) Summarize conversation history instead of sending full context. (5) Use tiered context — only include relevant information. A 2000-token system prompt that could be 500 tokens costs 4x more on every single call.

💡A 2000-token prompt that could be 500 costs 4x more on every call. Trim prompts, set max_tokens, summarize history.
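Technique (4), summarizing or pruning conversation history, can start as a simple token budget. A sketch using the rough 4-characters-per-token heuristic (an approximation; use a real tokenizer such as tiktoken when accuracy matters):

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Rough heuristic: ~4 characters per token. Good enough for budgeting.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the system prompt plus as many of the most recent messages as fit.
function pruneHistory(messages: Message[], maxTokens: number): Message[] {
  const [system, ...rest] = messages;
  let budget = maxTokens - estimateTokens(system.content);
  const kept: Message[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (cost > budget) break; // oldest messages get dropped first
    budget -= cost;
    kept.unshift(rest[i]);
  }
  return [system, ...kept];
}
```

A fancier version would summarize the dropped messages into one short synthetic message instead of discarding them outright, trading one cheap summarization call for context retention.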
Quick Pulse Check

Which token optimization has the biggest per-call impact?


Production Gotchas

Track cost per user, per feature, and per model. Alert on spikes, and set hard spending limits in your OpenAI dashboard — one runaway loop can cost thousands in minutes. Caching has correctness tradeoffs: cached answers may be stale, or subtly wrong for slightly different questions, so test cache-hit quality. Finally, make router failures default to the expensive model, not the cheap one: a broken router that sends everything to GPT-4o is expensive but correct, while one that sends everything to GPT-4o-mini is cheap but may produce bad answers.
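The "track per-user cost and set hard limits" advice above can be prototyped in-process before you wire up a billing dashboard. A sketch (a real system would persist spend and reset it daily; the class and method names here are illustrative):

```typescript
// Per-user spend tracking with a hard daily cap.
class CostTracker {
  private spendByUser = new Map<string, number>();

  constructor(private dailyCapUsd: number) {}

  // Record a call's cost; returns false once the user exceeds the cap,
  // so the caller can refuse further LLM calls (or downgrade the model).
  record(userId: string, costUsd: number): boolean {
    const total = (this.spendByUser.get(userId) ?? 0) + costUsd;
    this.spendByUser.set(userId, total);
    return total <= this.dailyCapUsd;
  }

  totalSpendUsd(): number {
    let sum = 0;
    for (const spend of this.spendByUser.values()) sum += spend;
    return sum;
  }
}
```

Checking the return value of `record()` before each LLM call turns a runaway loop from a thousand-dollar incident into a capped, logged event.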

Code Comparison

Single Model vs Model Router

Using one expensive model for everything vs routing to the right model

Single Model (expensive) · Traditional
// ❌ GPT-4o for everything
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
export async function POST(req: Request) {
  const { message } = await req.json();

  // Every request → expensive model
  const result = await generateText({
    model: openai("gpt-4o"),
    // Cost: ~$0.01 per call
    prompt: message,
  });

  return Response.json({ text: result.text });
}

// 10,000 requests/day
// Cost: $100/day = $3,000/month
// Simple "What are your hours?" → $0.01
// Complex analysis → $0.01
// Same cost regardless of complexity!
Model Router (cost-optimized) · AI Engineering
// ✅ Route to cheapest capable model
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
function classifyComplexity(
  message: string
): "simple" | "medium" | "complex" {
  const len = message.length;
  const hasCode = /```|function |class /.test(message);
  const hasAnalysis = /analyze|compare|explain why/i
    .test(message);

  if (hasCode || hasAnalysis || len > 500)
    return "complex";
  if (len > 200) return "medium";
  return "simple";
}

const MODEL_MAP = {
  simple: openai("gpt-4o-mini"),   // $0.001
  medium: openai("gpt-4o-mini"),   // $0.001
  complex: openai("gpt-4o"),       // $0.01
};

export async function POST(req: Request) {
  const { message } = await req.json();
  const complexity = classifyComplexity(message);

  const result = await generateText({
    model: MODEL_MAP[complexity],
    prompt: message,
  });

  return Response.json({
    text: result.text,
    model: complexity, // Track for analytics
  });
}
// ~80% simple, ~15% medium, ~5% complex
// Cost: ~$14.50/day ≈ $435/month (~85% savings!)

KEY DIFFERENCES

  • 80% of requests are simple — GPT-4o-mini handles them fine at 10x less cost
  • Only route complex requests to expensive models
  • Track which model was used for cost analytics
  • Start with keyword-based routing, upgrade to LLM-based when needed


Hands-On Challenges

Build, experiment, and get AI-powered feedback on your code.

Real-World Challenge

Multi-Model Router with Caching

Build and deploy a cost-optimized AI API with semantic caching, intelligent model routing, and a cost tracking dashboard. This is the infrastructure layer that makes AI features economically viable at scale.

~4h estimated
Next.js 14+ · Vercel AI SDK · OpenAI GPT-4o-mini + GPT-4o · Recharts (charts) · Tailwind CSS · Vercel (deploy)

Acceptance Criteria

  • Implement semantic caching that returns cached responses for similar queries (similarity > 0.90)
  • Build a model router that routes simple queries to cheap models and complex ones to powerful models
  • Track cost per request, per model, and per user in real-time
  • Show a dashboard with total spend, cache savings, model distribution, and cost trends
  • Support at least 2 models (e.g., GPT-4o-mini for simple, GPT-4o for complex)
  • Include prompt optimization that reduces token usage by 30%+
  • Deploy to a public URL (Vercel, Netlify, etc.)

Build Roadmap


Create a new Next.js app with TypeScript and Tailwind CSS. Plan the architecture: cache layer → router → model call → cost tracking → dashboard.

npx create-next-app@latest ai-cost-optimizer --typescript --tailwind --app
Create /lib/cache.ts, /lib/router.ts, /lib/cost-tracker.ts as separate modules

Deploy Tip

Push to GitHub and import into Vercel. Pre-load the dashboard with sample cost data so it looks impressive on first visit. Set your OPENAI_API_KEY in Vercel environment variables.


After Learning — Rate Your Confidence Again

I can implement semantic caching, model routing, and token optimization to cut AI costs by 50-80%.

1 = no idea · 5 = ship it blindfolded