Multimodal AI & Streaming UX
You handle file uploads. Now you'll make AI understand them — extracting structured JSON from images, analyzing documents, and processing audio. Plus, you'll master the UX patterns that make AI apps feel magical: streaming responses, optimistic UI, and progressive loading.
Use this at work tomorrow
Add a streaming AI response to any chat-like interface in your app — it transforms the UX.
Learning Objectives
1. Extract structured data from images (receipts → JSON, screenshots → code)
2. Process documents with vision models (PDFs, diagrams, handwriting)
3. Build streaming AI responses with real-time token-by-token display
4. Implement AI UX patterns: optimistic updates, progressive loading, error recovery
5. Ship a receipt scanner that extracts structured data from photos
Ship It: Receipt scanner + streaming chat
By the end of this day, you'll build and deploy a receipt scanner + streaming chat. This isn't a toy — it's a real project for your portfolio.
I can build multimodal AI features (image/audio → structured data), implement streaming UX, and choose the right AI UX pattern for each use case.
A 1024×1024 image costs ~765 tokens on GPT-4o (~$0.003). You send 50 product photos. What's the cost?
Multimodal = New Input Types for Your AI APIs
You already handle file uploads — images, PDFs, audio. Multimodal AI processes these same file types but extracts meaning. Instead of just storing a receipt image, you can extract { vendor: 'Starbucks', total: 5.75, items: ['latte'] } as structured JSON. Same upload pipeline, AI-powered extraction.
What does 'multimodal' mean in the context of AI APIs?
Structured Data from Images: The Killer Use Case
The most practical multimodal skill: image → structured JSON. Receipts → expense data. Screenshots → UI code. Diagrams → descriptions. Handwriting → text. Use generateObject() with a Zod schema and a vision model — same structured output pattern from Day 1, but with image input. This replaces months of OCR/computer vision work.
You want to extract receipt data as typed JSON from a photo. What API pattern?
Why does ChatGPT stream tokens instead of waiting for the full response?
Streaming: The UX Pattern That Makes AI Feel Magical
Waiting 3-5 seconds for a full response feels broken. Streaming token-by-token feels alive. The Vercel AI SDK provides streamText() (server) and useChat() (client) for this. It uses the same ReadableStream Web API you know. This is why ChatGPT, Cursor, and every great AI app streams — the perceived latency drops from seconds to milliseconds.
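Under the hood this is the same ReadableStream Web API mentioned above. A dependency-free sketch of the mechanism that `streamText()` and `useChat()` wrap for you (the token array stands in for model output):

```typescript
// Token streaming with plain Web Streams — the mechanism the
// AI SDK wraps. Each pull() emits one token to the consumer.
function streamTokens(tokens: string[]): ReadableStream<string> {
  let i = 0;
  return new ReadableStream<string>({
    pull(controller) {
      if (i < tokens.length) {
        controller.enqueue(tokens[i++]); // emit one token per pull
      } else {
        controller.close();
      }
    },
  });
}

async function readAll(stream: ReadableStream<string>): Promise<string> {
  let text = "";
  const reader = stream.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return text;
    text += value; // in a UI, you'd append to state here, token by token
  }
}

const stream = streamTokens(["Hel", "lo", ", ", "world"]);
readAll(stream).then((t) => console.log(t)); // "Hello, world"
```

The client-side loop is the whole trick: instead of one `setState` with the full response, you append each chunk as it arrives, which is why perceived latency collapses.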
Your AI auto-categorizes uploaded documents. The user uploads and waits for the AI. What UX pattern?
AI UX Patterns: Beyond the Chat Interface
Not everything needs to be a chatbot. AI UX patterns include: streaming text (token by token), optimistic UI (show placeholder while AI generates), progressive enrichment (show basic answer → enrich with details), skeleton loading with AI-specific messaging ('Analyzing your image...'), and graceful degradation (fallback when AI fails). The best AI apps feel fast even when the model is slow.
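A minimal sketch of the progressive-loading pattern from the list above: map elapsed wait time to AI-specific stage messages instead of a generic spinner. The stage text and thresholds here are illustrative, not part of any library:

```typescript
// Progressive loading: rotate AI-specific stage messages while the
// model works. Thresholds and copy are illustrative — tune per feature.
const STAGES: Array<[number, string]> = [
  [0, "Uploading image..."],
  [1000, "Analyzing your image..."],
  [3000, "Extracting details..."],
  [6000, "Almost done..."],
];

function stageMessage(elapsedMs: number): string {
  // Pick the last stage whose time threshold we've passed.
  let message = STAGES[0][1];
  for (const [threshold, text] of STAGES) {
    if (elapsedMs >= threshold) message = text;
  }
  return message;
}

console.log(stageMessage(0));    // "Uploading image..."
console.log(stageMessage(4000)); // "Extracting details..."
```

In a React component you would drive this from a `setInterval` started on submit and render `stageMessage(Date.now() - startedAt)` in place of the spinner.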
Your AI search takes 4 seconds. Which UX pattern makes it feel faster?
The Full Evolution
Watch one function evolve through every concept you just learned.
Production Gotchas
Image tokens are expensive: a 1024×1024 image costs ~765 tokens on GPT-4o (~$0.003). Resize images before sending to reduce costs. Audio transcription (Whisper) is separate from chat models — it's a different API endpoint. For PDFs, convert to images first (each page) or extract text. Rate limit file-heavy endpoints more aggressively — users love uploading 50 images at once. Always validate file type and size server-side.
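The ~765-token figure comes from OpenAI's published high-detail vision formula: scale the image to fit in 2048×2048, shrink the shortest side to at most 768px, then charge 85 base tokens plus 170 per 512-pixel tile. A small sketch you can use to estimate costs before uploading:

```typescript
// Estimate GPT-4o image token cost (high-detail mode) per OpenAI's
// published formula. Shows why resizing before upload saves money.
function imageTokens(width: number, height: number): number {
  // Step 1: fit within a 2048x2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;
  // Step 2: shrink so the shortest side is at most 768px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;
  // Step 3: 85 base tokens + 170 per 512x512 tile.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

// 1024x1024 → scaled to 768x768 → 4 tiles → 765 tokens
console.log(imageTokens(1024, 1024)); // 765
// Resized to 512x512 before upload → 1 tile → 255 tokens
console.log(imageTokens(512, 512)); // 255
```

Resizing a receipt photo down to 512px on its longest side cuts the per-image token cost by roughly two thirds with little loss in extraction quality for typical receipts.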
Code Comparison
File Upload vs Vision AI Understanding
Processing files traditionally vs with AI — from metadata to understanding
// Traditional image processing
import sharp from "sharp";
const file = formData.get("image") as File;
const buffer = Buffer.from(await file.arrayBuffer());
const metadata = await sharp(buffer).metadata();
return {
width: metadata.width,
height: metadata.height,
format: metadata.format,
size: file.size,
};
// Can extract: dimensions, format, size
// CANNOT understand what's IN the image

// AI image understanding + extraction
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
const { object } = await generateObject({
model: openai("gpt-4o-mini"),
schema: z.object({
vendor: z.string(),
total: z.number(),
date: z.string(),
items: z.array(z.object({
name: z.string(),
price: z.number(),
})),
}),
messages: [{
role: "user",
content: [
{ type: "text",
text: "Extract receipt data." },
{ type: "image", image: imageBuffer },
],
}],
});
// Returns typed JSON:
// { vendor: "Starbucks", total: 5.75,
// items: [{ name: "Latte", price: 5.75 }] }

KEY DIFFERENCES
- Traditional: extract metadata (dimensions, format, size)
- Vision AI: understand + extract structured data from content
- Same generateObject() pattern from Day 1 — add image input
- Replaces months of OCR/CV work with a single API call
Loading Spinner vs Streaming Response
Traditional loading vs AI streaming UX
// Wait for full response, show spinner
const [loading, setLoading] = useState(false);
const [data, setData] = useState("");
async function handleSubmit() {
setLoading(true);
// User sees: ⏳ spinner for 3-5 seconds
const res = await fetch("/api/analyze", {
method: "POST",
body: formData,
});
const result = await res.json();
setData(result.text); // All at once
setLoading(false);
}
// UX: Nothing... nothing... WALL OF TEXT
// Feels slow even if only 3 seconds

// Stream response token by token
"use client";
import { useChat } from "ai/react";
export default function Chat() {
const { messages, input, handleInputChange,
handleSubmit, isLoading } = useChat();
return (
<div>
{messages.map(m => (
<div key={m.id}>
<strong>{m.role}:</strong>
{m.content}
{/* Text appears word by word! */}
</div>
))}
<form onSubmit={handleSubmit}>
<input value={input}
onChange={handleInputChange} />
</form>
</div>
);
}
// UX: Words flow in naturally ✨
// Feels instant even if total is 5 seconds

KEY DIFFERENCES
- Spinner → wall of text feels slow (even at 3 seconds)
- Streaming → words flow in naturally (feels instant)
- useChat() handles streaming, state, and conversation history
- Same Web Streams API you'd use for file downloads
Bridge Map: File uploads + loading states → Vision/audio APIs + streaming UX
Hands-On Challenges
Build, experiment, and get AI-powered feedback on your code.
Receipt Scanner + Streaming Chat
Build and deploy a multimodal AI app that extracts structured data from receipt photos and lets users ask follow-up questions about their receipts via a streaming chat. Combine vision AI with real-time UX.
Acceptance Criteria
- Accept image uploads (receipt photos) via drag-and-drop or file picker
- Send images to a vision model and extract structured data (items, prices, total, date, merchant)
- Display extracted receipt data in a clean, editable format
- Add streaming chat where users can ask questions about the receipt data
- Show progressive loading states ('Analyzing receipt...', 'Extracting items...')
- Handle errors: blurry images, non-receipt images, API failures
- Deploy to a public URL (Vercel, Netlify, etc.)
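The error-handling and server-side validation criteria above start before any model call. A small pure helper for the scan route, rejecting bad uploads before you spend tokens (the limits and names here are illustrative, not prescribed by the project):

```typescript
// Validate uploads server-side before the vision call.
// Limits are illustrative — tune for your app.
const ALLOWED_TYPES = new Set(["image/jpeg", "image/png", "image/webp"]);
const MAX_BYTES = 5 * 1024 * 1024; // 5 MB

type Validation = { ok: true } | { ok: false; error: string };

function validateUpload(mimeType: string, sizeBytes: number): Validation {
  if (!ALLOWED_TYPES.has(mimeType)) {
    return { ok: false, error: `Unsupported file type: ${mimeType}` };
  }
  if (sizeBytes > MAX_BYTES) {
    return { ok: false, error: "File too large (max 5 MB)" };
  }
  return { ok: true };
}

console.log(validateUpload("image/png", 1024)); // { ok: true }
console.log(validateUpload("application/pdf", 1024).ok); // false
```

In the route handler, run this against `file.type` and `file.size` from the parsed `FormData` and return a 400 with the error message before constructing the vision request.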
Build Roadmap
Create a new Next.js app with TypeScript and Tailwind CSS. Set up the project with an upload page, a processing API route, and a chat API route.
npx create-next-app@latest receipt-scanner --typescript --tailwind --app

Plan two API routes: /api/scan (image → structured data) and /api/chat (follow-up questions)

Deploy Tip
Push to GitHub and import into Vercel. Include 2-3 sample receipt images users can try without uploading their own. Set your OPENAI_API_KEY in Vercel environment variables.