Why stream instead of waiting for the full response?

Streaming shows the first tokens in well under a second instead of making the user stare at a spinner for many seconds while a long answer generates. Perceived latency drops dramatically and the experience feels conversational. It also lets users cancel early once they have read enough, saving output tokens.

What does toDataStreamResponse return?

It returns a Response whose body is a stream of structured events — text deltas, tool calls, and finish reasons — in the format the useChat hook understands. You return it directly from your route handler; the SDK handles the wire protocol so you never hand-roll server-sent events.

How do I cancel an in-flight request?

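Call the stop function returned by useChat. The SDK aborts the underlying fetch with an AbortController, which closes the connection to your route handler. For the cancellation to reach the model provider as well, the route handler should forward the request's abort signal to the model call (streamText accepts an abortSignal option); otherwise the provider may keep generating tokens you will never show. Always expose a stop button for long generations.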
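
The cancellation pattern can be sketched with nothing but the standard AbortController. This is a minimal illustration, not the SDK's internals: fakeTokenStream and readUntilStopped are hypothetical names, and the generator stands in for a model stream that checks the signal before producing each token.

```typescript
// Stand-in for a model stream (hypothetical, not an SDK API): yields one token
// every `delayMs` milliseconds and stops once the signal has been aborted.
async function* fakeTokenStream(
  tokens: string[],
  signal: AbortSignal,
  delayMs = 5,
): AsyncGenerator<string> {
  for (const token of tokens) {
    if (signal.aborted) return; // stop producing after the caller aborts
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield token;
  }
}

// Collects tokens until the stream ends or `stopAfter` tokens have arrived,
// then aborts, much like a user pressing a stop button.
async function readUntilStopped(
  tokens: string[],
  stopAfter: number,
): Promise<string[]> {
  const controller = new AbortController();
  const received: string[] = [];
  for await (const token of fakeTokenStream(tokens, controller.signal)) {
    received.push(token);
    if (received.length >= stopAfter) controller.abort();
  }
  return received;
}
```

The point of the sketch is that the producer only stops because it observes the signal. The same holds on a real server: the model call has to be given the abort signal before a client-side stop can end generation.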

Can I stream from the Edge runtime?

Yes. Add export const runtime = 'edge' to the route handler for lower cold-start latency and global distribution. Streaming works on both the Node and Edge runtimes; choose Edge for latency-sensitive chat and Node when you need libraries that are not Edge-compatible.

How do I handle a partial response that errors mid-stream?

useChat exposes an error value and keeps whatever text already streamed in the message list. Render the partial text and show the error with a retry button so the user can resend. On the server, handle errors around streamText so that a provider failure ends the stream cleanly instead of leaving the connection hanging.
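
The idea of keeping partial text when a stream fails can be illustrated without the SDK. In the sketch below, collectWithPartial and flakySource are hypothetical helpers: the first accumulates chunks and returns both the text received so far and any error, the second simulates a provider that fails after two chunks.

```typescript
interface StreamOutcome {
  text: string; // everything that arrived before the stream ended
  error: Error | null; // set when the stream failed part-way through
}

// Reads text chunks and keeps the partial text if the source throws.
async function collectWithPartial(
  chunks: AsyncIterable<string>,
): Promise<StreamOutcome> {
  let text = "";
  try {
    for await (const chunk of chunks) {
      text += chunk;
    }
    return { text, error: null };
  } catch (cause) {
    // Normalize non-Error throwables so the UI always has a message to show.
    const error = cause instanceof Error ? cause : new Error(String(cause));
    return { text, error };
  }
}

// Hypothetical source that fails after two chunks, standing in for a provider error.
async function* flakySource(): AsyncGenerator<string> {
  yield "Streaming ";
  yield "keeps ";
  throw new Error("provider disconnected");
}
```

A UI built on this shape would render outcome.text unconditionally and show the retry button only when outcome.error is set.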

How to Stream LLM Responses in Next.js

What you are building

This tutorial builds a real-time streaming chat interface in Next.js using the Vercel AI SDK. Instead of waiting for the model to finish a long answer and then dumping it on screen, tokens appear as they are generated — the same typewriter effect you see in ChatGPT and Claude. The AI SDK hides the awkward parts (server-sent event framing, client buffering, abort wiring) behind a route handler helper and the useChat hook, so a working streaming chat is a few dozen lines.

How streaming works

On the server, an App Router route handler at app/api/chat/route.ts calls the model with streamText and returns result.toDataStreamResponse(). That response body is an ongoing stream of structured deltas rather than a single JSON blob, so the connection stays open and tokens flow out as the model produces them. On the client, a component calls useChat, which gives you messages, input and handleSubmit for binding the form, an isLoading flag, a stop function, and an error value. As deltas arrive, the SDK appends them to the in-progress assistant message and React re-renders, producing the live typing effect. Because an AbortController backs the request, calling stop cancels the client's stream right away; generation at the provider also stops, provided the route handler forwards the request's abort signal to streamText.
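
The mechanism underneath can be sketched with web-standard APIs only. This is not the SDK's data stream format: the sketch sends plain text chunks, and streamingResponse and readDeltas are hypothetical names. It shows the two halves that streamText and useChat handle for you, a Response with a streaming body and a reader loop that appends each delta to the in-progress message.

```typescript
// Server side (sketch): a Response whose body hands out one chunk per pull.
function streamingResponse(chunks: string[]): Response {
  const encoder = new TextEncoder();
  let index = 0;
  const body = new ReadableStream<Uint8Array>({
    pull(controller) {
      if (index < chunks.length) {
        controller.enqueue(encoder.encode(chunks[index++]));
      } else {
        controller.close();
      }
    },
  });
  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}

// Client side (sketch): read deltas and append each to the message so far.
async function readDeltas(
  response: Response,
  onUpdate: (messageSoFar: string) => void,
): Promise<string> {
  if (!response.body) throw new Error("response has no body");
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let message = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters split across chunks intact.
    message += decoder.decode(value, { stream: true });
    onUpdate(message); // in React this would be a state update
  }
  message += decoder.decode(); // flush any buffered bytes
  return message;
}
```

In a React component, onUpdate would set state, which is what triggers the re-render that produces the typing effect.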

Tips and the latency estimator below

Always render an explicit stop button — long answers are exactly where users want to bail early. Show the partial text even on error so a mid-stream failure does not erase what already arrived; pair it with a retry. Consider runtime = 'edge' for lower time-to-first-token on chat. And remember that streaming improves perceived latency, not total latency: the model still generates at the same tokens-per-second, you just reveal it sooner. The estimator below shows the difference — how long until the first word appears versus the full answer — for a given model speed and answer length.
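
The arithmetic behind such an estimate is simple enough to sketch. The function below is a hypothetical helper, not the page's estimator, and it assumes a deliberately simple model: a fixed delay before the first token, then a constant generation rate.

```typescript
interface LatencyEstimate {
  firstTokenSeconds: number; // when a streaming UI shows the first word
  fullAnswerSeconds: number; // when the last token has arrived
}

// Hypothetical helper: assumes a fixed time-to-first-token and a constant
// tokens-per-second rate, which real models only approximate.
function estimateLatency(
  answerTokens: number,
  tokensPerSecond: number,
  timeToFirstTokenSeconds: number,
): LatencyEstimate {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  // Generation time is the same whether or not the response is streamed.
  const generationSeconds = answerTokens / tokensPerSecond;
  return {
    firstTokenSeconds: timeToFirstTokenSeconds,
    fullAnswerSeconds: timeToFirstTokenSeconds + generationSeconds,
  };
}

// A 600-token answer at 50 tokens/s, with 0.5 s before the first token:
const estimate = estimateLatency(600, 50, 0.5);
console.log(estimate); // { firstTokenSeconds: 0.5, fullAnswerSeconds: 12.5 }
```

With these example numbers, a non-streaming UI shows nothing for 12.5 seconds, while a streaming UI shows the first word after 0.5 seconds and finishes at the same 12.5-second mark.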