Do I need a backend to build an OpenAI chatbot?

For anything real, yes. Your API key must stay on a server because anyone who sees a client-side key can spend your money. A small serverless function that forwards requests to OpenAI is enough, and keeps the key secret.

How do I make the chatbot remember the conversation?

The API is stateless, so you resend the full message history on every call. Keep an array of role-tagged messages (system, user, assistant) and append each new turn before sending. Trim or summarize old turns when you approach the context limit.

What is the system prompt for?

The system message sets the assistant's persona, rules and tone before any user input. It is the main lever for controlling behaviour, so put instructions like "You are a concise support agent" there rather than repeating them every turn.

How does streaming work?

Set "stream": true in the request and read the response as server-sent events. Each chunk carries a token or two, which you append to the UI as it arrives, giving the familiar typing effect instead of a long wait for the full reply.

How much does running a chatbot cost?

You pay per token for input (your history plus prompt) and output (the reply), priced separately. Since you resend history each turn, long conversations cost more over time. Use a token cost calculator to estimate spend before launching.

Why does a long chatbot conversation get expensive so fast?

Because you resend the whole history every turn, token cost is quadratic in conversation length — each turn pays for all previous turns again. A 20-turn chat can bill ~40,000 input tokens even if the user typed only ~2,000. Cap the history with a sliding window or summarise old turns to control it.

How should I handle 429 rate-limit errors from the API?

Retry with exponential backoff and jitter rather than an immediate loop, which only worsens a rate-limit spike. Show the user a transient "busy, try again" state, and set a spend cap in your provider dashboard so retries can't run up a surprise bill.

What is the "How to Build a Chatbot with the OpenAI API" tutorial?

It is a guide to building a working chatbot with the OpenAI Chat Completions API. It covers authentication, message history, streaming responses, system prompts and deploying to a web app, and includes an interactive code-snippet generator. It runs free in your browser on Gera Tools, with nothing uploaded.

How to Build a Chatbot with the OpenAI API

Name: How to Build a Chatbot with the OpenAI API
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

A chatbot is the fastest way to learn the OpenAI API: one endpoint, a list of messages, and a reply. This tutorial walks through authentication, conversation memory, streaming and deployment, and includes an interactive snippet generator so you can copy starter code in your language.

Step 1 — Authenticate

Everything goes through the Chat Completions endpoint at https://api.openai.com/v1/chat/completions. You authenticate with an API key sent as a Bearer token:

Authorization: Bearer YOUR_API_KEY

The single most important rule: keep the key on a server. A key in browser JavaScript can be read by anyone and used to drain your account. In practice you build a tiny backend route (a serverless function is plenty) that holds the key and forwards requests.
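As a minimal sketch of the server-side half of that rule, the function below builds an authenticated request using only the Python standard library. The environment variable name `OPENAI_API_KEY`, the helper name `build_chat_request` and the default model name are assumptions for illustration; substitute whatever your deployment and the current OpenAI documentation specify.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"


def build_chat_request(messages, api_key=None, model="gpt-4o-mini"):
    """Build (but do not send) an authenticated Chat Completions request.

    The model name is a placeholder assumption; check the current model list.
    """
    # Read the key from the server's environment, never from client-side code.
    key = api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The sketch stops short of that so it can be inspected without touching the network.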

Step 2 — Send messages with history

The API is stateless — it remembers nothing between calls. You provide the whole conversation each time as a messages array of role-tagged objects:

[
  { "role": "system", "content": "You are a concise, friendly assistant." },
  { "role": "user", "content": "What is the OpenAI API?" },
  { "role": "assistant", "content": "It lets you call models like GPT-4 over HTTP." },
  { "role": "user", "content": "How do I add memory?" }
]

The system message defines the persona and rules. Each new user turn is appended, and you append the model’s reply before the next call. That growing array is the chatbot’s memory. As it nears the model’s context limit, trim or summarize the oldest turns.
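The append-then-send cycle described above can be sketched as a small container. The class and method names here are hypothetical, and the actual API call is left out; the point is only the order in which turns are recorded.

```python
class Conversation:
    """Holds the role-tagged message list that is resent on every call."""

    def __init__(self, system_prompt):
        # The system message goes first and stays for the whole conversation.
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user(self, text):
        """Append the user's turn and return the full list to send."""
        self.messages.append({"role": "user", "content": text})
        return self.messages

    def add_assistant(self, text):
        """Append the model's reply so the next call includes it."""
        self.messages.append({"role": "assistant", "content": text})


chat = Conversation("You are a concise, friendly assistant.")
payload = chat.add_user("What is the OpenAI API?")
# ...send `payload`, then record whatever reply came back:
chat.add_assistant("It lets you call models like GPT-4 over HTTP.")
chat.add_user("How do I add memory?")
```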

Step 3 — Stream responses and deploy

A reply that arrives all at once feels slow. Set "stream": true and read the response as server-sent events; each chunk holds a fragment of text you append to the UI for the familiar typing effect.
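To make the streaming step concrete, here is a rough sketch of reading those events. It assumes each event is a line of the form `data: {json}`, that the text fragment sits at `choices[0].delta.content`, and that the stream ends with `data: [DONE]`. That is my understanding of the Chat Completions stream format, so confirm it against the API reference. The function name `iter_stream_text` is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

In a real handler, `lines` would be the HTTP response body read line by line, and each yielded fragment would be forwarded to the browser as it arrives.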

To deploy, put your forwarding route on a host like Vercel or any Node server, set the API key as an environment variable, and point a minimal frontend at it. You now have a working, hosted chatbot.

Use the generator below to produce a ready-to-paste starter for Node, Python, or plain fetch. Before you launch, make sure you understand what a token is, and estimate spend with the LLM API Cost Calculator.

Why cost grows faster than the conversation

Because you resend the entire history every turn, a chatbot’s token cost is quadratic in conversation length, not linear. Each turn pays for all prior turns again. A rough model, assuming ~100 tokens per message:

Turn	History resent (input tokens)	Cumulative input tokens
1	100	100
5	~900	~2,500
10	~1,900	~10,000
20	~3,900	~40,000
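The table's figures follow from simple arithmetic, sketched below under the same assumption of ~100 tokens per message and ignoring the system prompt. Turn n resends n user messages and n − 1 assistant replies, so it costs (2n − 1) × 100 input tokens, and the running total is 100 × n². The function names are illustrative only.

```python
def input_tokens_at_turn(turn, tokens_per_message=100):
    """Input tokens sent on a given turn: all prior messages plus the new one."""
    # Turn n carries n user messages and n - 1 assistant replies.
    return (2 * turn - 1) * tokens_per_message


def cumulative_input_tokens(turns, tokens_per_message=100):
    """Total input tokens billed across turns 1..turns."""
    return sum(
        input_tokens_at_turn(t, tokens_per_message) for t in range(1, turns + 1)
    )


for turn in (1, 5, 10, 20):
    print(turn, input_tokens_at_turn(turn), cumulative_input_tokens(turn))
```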

By turn 20 you have paid for roughly 40,000 input tokens even though the user typed only ~2,000. This is the single biggest surprise in a first production bill, and it is why history management is not optional:

Sliding window: keep the system prompt plus the last N turns; drop the rest. Simple, cheap, forgets old context.
Summarisation: when history nears a threshold, ask the model to compress the old turns into a short summary and replace them with it. Preserves gist at a fraction of the tokens.
Retrieval: store past turns externally and re-inject only the relevant ones. Best for long-lived assistants, more moving parts.
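The first of these strategies, the sliding window, can be sketched in a few lines. The function name and the choice to count a "turn" as one user message plus whatever follows it are assumptions of this example, not something the API prescribes.

```python
def sliding_window(messages, max_turns):
    """Keep system messages plus the last `max_turns` user turns and their replies."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Positions of user messages, so the window always starts on a user turn.
    user_positions = [i for i, m in enumerate(rest) if m["role"] == "user"]
    if len(user_positions) <= max_turns:
        return system + rest
    start = user_positions[-max_turns]
    return system + rest[start:]
```

Starting the window on a user message avoids sending a dangling assistant reply whose question has been dropped.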

Four ways to give a chatbot “state”

Approach	How it works	Trade-off
Full replay (this tutorial)	Resend every message each call	Simple; cost and latency grow with length
Windowed history	Resend only recent turns + system	Cheap; loses long-range memory
Summarise + window	Compress old turns to a summary	Keeps gist; risks losing specifics
External memory / RAG	Retrieve relevant past context on demand	Scales to long histories; more infra

Most production chatbots start with full replay, add a window once bills climb, then reach for summarisation or retrieval only when a genuine long-memory requirement appears. Don’t build retrieval on day one for a support bot that never needs to recall last week.
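As a rough illustration of the "Summarise + window" row, the sketch below replaces older turns with a single summary message once the history passes a threshold. The `summarize` argument is a hypothetical callable (in practice, another model call) supplied by the caller, and storing the summary as a second system message is one possible choice among several. It assumes the first message is the system prompt and that `keep_last` is at least 1.

```python
def compact_history(messages, summarize, keep_last=4, threshold=12):
    """Replace older turns with a summary once history exceeds `threshold`."""
    if len(messages) <= threshold:
        return messages
    # Assumes messages[0] is the system prompt.
    system, rest = messages[:1], messages[1:]
    old, recent = rest[:-keep_last], rest[-keep_last:]
    # `summarize` is caller-supplied and returns a short string.
    note = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    return system + [note] + recent
```

The thresholds here count messages for simplicity. A production version would more likely count tokens, since that is what the context limit and the bill are measured in.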

Errors you will hit and how to handle them

429 rate-limit / quota errors. Retry with exponential backoff and jitter; a bare retry loop makes a rate-limit spike worse. Surface a “busy, try again” state to the user rather than hanging.
Context-length exceeded. The request’s total tokens (history + prompt + requested max output) exceeded the model’s context window. Trim history before sending, and reserve headroom for the reply.
Streamed connection drops. Server-sent event streams can break mid-reply; buffer what arrived, and let the user regenerate. Persist the partial turn so history stays consistent.
Malformed or unexpected output. If you ask for JSON, validate it — models occasionally wrap it in prose or truncate. Use the provider’s structured-output / JSON mode where available and re-request on a parse failure.
Prompt injection from user text. Treat everything in a user message as untrusted. Don’t let user text silently override the system prompt’s safety rules; keep authoritative instructions in the system role and validate any actions the model requests before executing them.
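For the first item in the list above, one way to sketch exponential backoff with jitter is shown below. `RateLimited` is a hypothetical exception that your own request code would raise on an HTTP 429; the delay values are illustrative, and the "full jitter" variant used here (a random wait between zero and the current cap) is one common choice rather than the only one.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by the caller's code on an HTTP 429."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call `send()`, retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller show the "busy" state
            # The cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random fraction of the cap.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting.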

Common pitfalls when going to production

The generated starter works, but three things bite most first chatbots:

Leaking the API key. Never ship the key in client-side JavaScript — proxy every call through your own server route and read the key from an environment variable. A leaked key can be used to drain your account within hours.
Unbounded context growth. Because you resend the whole history each turn, the input for each call grows linearly with conversation length, cumulative token cost grows roughly quadratically, and the request eventually hits the model’s context window. Cap the history, summarise older turns, or drop the earliest exchanges.
No rate limiting or spend cap. Add a per-user rate limit and set a usage/budget limit in your provider dashboard so a runaway loop or abuse can’t produce a surprise bill.
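A per-user rate limit can start out very simple. The sketch below is an in-memory sliding-window counter with hypothetical names; it assumes a single server process, so a multi-instance or serverless deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque


class PerUserRateLimiter:
    """Allow at most `max_requests` per user within `window_seconds`."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # user_id -> timestamps of recent requests

    def allow(self, user_id):
        """Return True and record the request if the user is under the limit."""
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

The forwarding route would call `allow(user_id)` before contacting the API and return a "slow down" response when it comes back False.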

Sources and references

OpenAI — Chat Completions API reference — the endpoint, messages array, and stream parameter used here
OpenAI — Text generation and prompting guide — system prompts, roles, and conversation state
OpenAI — Production best practices — key security, rate limits, and cost control
OpenAI — Streaming responses — server-sent events for the typing effect

Maintained by the Gera Tools editorial team. API surfaces, model names, and pricing change frequently — confirm against the current OpenAI documentation above. The generated snippets are starters, not production-hardened code. Last reviewed 2026-07-02.