How to Stream LLM Responses in Next.js

Real-time token streaming with the Vercel AI SDK

What you are building

This tutorial builds a real-time streaming chat interface in Next.js using the Vercel AI SDK. Instead of waiting for the model to finish a long answer and then dumping it on screen, tokens appear as they are generated — the same typewriter effect you see in ChatGPT and Claude. The AI SDK hides the awkward parts (server-sent event framing, client buffering, abort wiring) behind a route handler helper and the useChat hook, so a working streaming chat is a few dozen lines.

How streaming works

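On the server, an App Router route handler at app/api/chat/route.ts calls the model with streamText and returns result.toDataStreamResponse(). That response body is an ongoing stream of structured deltas rather than a single JSON blob, so the connection stays open and tokens flow out as the model produces them. These names follow AI SDK 4.x; newer major versions rename some of them, so check the docs for the version you install.

On the client, a component calls useChat, which gives you messages, the input value with its handleInputChange and handleSubmit handlers, an isLoading flag, a stop function, and an error value. As deltas arrive, the SDK appends them to the in-progress assistant message and React re-renders, producing the live typing effect. Because an AbortController backs the request, calling stop cancels the client request right away. Generation on the server, and the billing that goes with it, stops only once the abort reaches the provider call, for example by passing the request's signal to streamText as abortSignal. Tokens generated before that point are still billed.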
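To make the delta-by-delta flow concrete, here is a minimal sketch that uses only the standard ReadableStream, TextDecoder, and AbortController APIs (available in modern browsers and Node 18+). It is not the AI SDK's implementation: makeTokenStream and readInto are hypothetical helpers that stand in for the model stream and for the append-as-you-go behavior described above.

```typescript
// Hypothetical stand-in for a model: emits one token per pull as encoded text.
function makeTokenStream(
  tokens: string[],
  onCancel?: () => void, // called if the consumer cancels before the end
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  let i = 0;
  return new ReadableStream<Uint8Array>({
    pull(controller) {
      if (i >= tokens.length) {
        controller.close();
        return;
      }
      controller.enqueue(encoder.encode(tokens[i++]));
    },
    cancel() {
      onCancel?.();
    },
  });
}

// Hypothetical client-side reader: appends each delta to the in-progress
// message and reports the partial text, until the stream ends or is aborted.
async function readInto(
  stream: ReadableStream<Uint8Array>,
  onUpdate: (partial: string) => void,
  signal?: AbortSignal,
): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let message = "";
  try {
    while (true) {
      if (signal?.aborted) {
        await reader.cancel(); // propagate the stop upstream to the producer
        break;
      }
      const { done, value } = await reader.read();
      if (done) break;
      message += decoder.decode(value, { stream: true });
      onUpdate(message); // in React this would be a state update
    }
  } finally {
    reader.releaseLock();
  }
  return message; // partial text is kept even when aborted
}
```

In a component, onUpdate would set state so each delta triggers a re-render, and a stop button would call abort() on the AbortController whose signal was passed in. The partial message is returned rather than discarded, which is the behavior you want when a user stops early.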

Tips and the latency estimator below

Always render an explicit stop button: long answers are exactly where users want to bail early. Show the partial text even on error so a mid-stream failure does not erase what already arrived, and pair it with a retry. Consider exporting runtime = 'edge' from the route handler for lower time-to-first-token on chat. And remember that streaming improves perceived latency, not total latency: the model still generates at the same tokens-per-second, you just reveal the output sooner. The estimator below shows the difference (how long until the first word appears versus the full answer) for a given model speed and answer length.
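The arithmetic behind that comparison can be sketched in a few lines. This is a simplified model, not the page's estimator: estimateLatency is a hypothetical helper, it assumes a constant generation speed, and the input numbers are illustrative rather than measurements of any real model.

```typescript
interface LatencyEstimate {
  firstTokenSeconds: number; // when a streaming UI shows its first text
  fullAnswerSeconds: number; // when the answer is complete, streamed or not
}

// Hypothetical helper: assumes a fixed time-to-first-token and a constant
// tokens-per-second rate after that.
function estimateLatency(
  ttftSeconds: number,
  tokensPerSecond: number,
  answerTokens: number,
): LatencyEstimate {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return {
    firstTokenSeconds: ttftSeconds,
    fullAnswerSeconds: ttftSeconds + answerTokens / tokensPerSecond,
  };
}

// Illustrative inputs: 0.5 s to first token, 50 tokens/s, 500-token answer.
const example = estimateLatency(0.5, 50, 500);
console.log(example.firstTokenSeconds); // 0.5
console.log(example.fullAnswerSeconds); // 10.5
```

With these assumed numbers, a streaming UI shows text after half a second, while a non-streaming UI shows nothing for 10.5 seconds. Both finish at the same moment, which is the perceived-versus-total distinction made above.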
