What you are building
This tutorial walks through adding real AI features to a React Native app — the kind of streaming chat and voice experience users now expect. The stack is Expo (which bundles the audio, storage, and networking pieces you need), a small backend proxy that holds your API key, and the OpenAI API for chat and transcription. The single most important architectural decision is that the phone never talks to the model provider directly: a key shipped inside an app binary can be extracted by anyone, so all requests go through your own backend. Get that right and the rest — streaming, voice, caching — is ordinary app code. Use the planner below to sequence the build and estimate effort for each piece.
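To make the key-handling split concrete, here is a minimal sketch of the two requests involved. It is an illustration under assumptions, not the tutorial's actual implementation: the function names, the `/chat` route on the proxy, and the model name are all hypothetical choices. The only provider details relied on are the OpenAI chat completions endpoint and its `Authorization: Bearer` header.

```typescript
// Hypothetical shapes and names, chosen for illustration only.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface OutgoingRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Runs on the backend only. The key comes from server configuration
// (for example an environment variable), never from the app.
function buildUpstreamRequest(messages: ChatMessage[], apiKey: string): OutgoingRequest {
  if (!apiKey) {
    throw new Error("Missing server-side API key");
  }
  return {
    url: "https://api.openai.com/v1/chat/completions",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    // The model name is an assumption; substitute whichever model you use.
    body: JSON.stringify({ model: "gpt-4o-mini", messages, stream: true }),
  };
}

// Runs in the app. It only knows the proxy URL, so there is no key to extract.
// The "/chat" route is a hypothetical path on your own backend.
function buildAppRequest(proxyBaseUrl: string, messages: ChatMessage[]): OutgoingRequest {
  return {
    url: `${proxyBaseUrl}/chat`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  };
}
```

The point of separating the two builders is that the app-side one has no parameter through which a key could arrive, so it cannot leak one by accident.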
How it works
The architecture is three layers. The app renders a chat UI and records audio. A backend proxy holds your secret key, forwards chat and transcription requests to OpenAI, and streams responses back. The model provider does the actual generation. For chat, your backend requests a streaming completion and pipes the chunks to the app, which appends the text of each chunk to the current message so replies appear word by word. For voice, Expo AV records audio, your backend sends it to a transcription model like Whisper, and the returned text flows into the same chat path, so voice and text share one code path. A small on-device cache stores recent responses so repeated questions return instantly without another paid call.
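The append-as-it-arrives step for chat can be sketched as a small accumulator. This sketch assumes the proxy forwards OpenAI-style server-sent events unchanged (lines of the form `data: {...}`, terminated by `data: [DONE]`); if your proxy reshapes the stream, the parsing would differ. The name `createStreamAccumulator` and its callback are hypothetical.

```typescript
// Assumption: the proxy relays OpenAI-style SSE lines as-is.
// createStreamAccumulator is an illustrative helper, not a library API.
function createStreamAccumulator(onText: (fullText: string) => void) {
  let buffer = ""; // holds a partial line until its newline arrives
  let text = ""; // the assistant message built so far
  let done = false;

  return {
    // Feed each network chunk here; chunks may split an event mid-line.
    push(chunk: string): void {
      if (done) return;
      buffer += chunk;
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // the last piece may be incomplete
      for (const raw of lines) {
        const line = raw.trim();
        if (!line.startsWith("data:")) continue;
        const payload = line.slice(5).trim();
        if (payload === "[DONE]") {
          done = true;
          return;
        }
        try {
          const parsed = JSON.parse(payload);
          const delta = parsed?.choices?.[0]?.delta?.content;
          if (typeof delta === "string" && delta.length > 0) {
            text += delta;
            onText(text); // e.g. update the last message in component state
          }
        } catch (_err) {
          // Skip a malformed event rather than crash the chat screen.
        }
      }
    },
    getText(): string {
      return text;
    },
    isDone(): boolean {
      return done;
    },
  };
}
```

In a component, `onText` would typically set state for the in-progress message, so each call re-renders the bubble with the longer text. Buffering the trailing partial line matters because network chunk boundaries do not line up with event boundaries.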
Tips and pitfalls
The proxy is non-negotiable; never embed the key. Make the app feel fast with streaming even when total latency is unchanged — perceived speed comes from seeing the first words quickly. Reuse one code path for voice and text by transcribing audio to text early, so you do not maintain two flows. Cache aggressively on device with a shape-guarded read (validate the stored JSON before using it) so a corrupted cache never crashes the app. Default to a smaller model for routine turns, cap output length, and set a hard spending limit in your provider dashboard so a runaway loop on a user’s phone cannot drain your account. Build the chat slice end to end first, then layer in voice and caching once the core loop works.
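The shape-guarded cache read mentioned above might look like the following sketch. The `CachedReply` fields, the expiry check, and the function names are assumptions made for illustration; the `KeyValueStore` interface is a hypothetical stand-in that mirrors the promise-returning `getItem` of AsyncStorage-style stores, so the logic can be exercised without a device.

```typescript
// Hypothetical cache entry shape; adjust the fields to what your app stores.
interface CachedReply {
  prompt: string;
  reply: string;
  savedAt: number; // epoch milliseconds
}

// Type guard: true only when the value has exactly the fields we rely on.
function isCachedReply(value: unknown): value is CachedReply {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const prompt = record["prompt"];
  const reply = record["reply"];
  const savedAt = record["savedAt"];
  return (
    typeof prompt === "string" &&
    typeof reply === "string" &&
    typeof savedAt === "number" &&
    isFinite(savedAt)
  );
}

// Pure parsing step: any corrupt, mis-shaped, or stale entry is a cache miss.
// The maxAgeMs expiry is an added assumption, not something the text requires.
function parseCachedReply(raw: string | null, now: number, maxAgeMs: number): CachedReply | null {
  if (raw === null) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (_err) {
    return null; // invalid JSON
  }
  if (!isCachedReply(parsed)) return null;
  if (now - parsed.savedAt > maxAgeMs) return null;
  return parsed;
}

// Minimal storage interface with an AsyncStorage-like getItem.
interface KeyValueStore {
  getItem(key: string): Promise<string | null>;
}

async function readCachedReply(
  store: KeyValueStore,
  key: string,
  maxAgeMs: number = 24 * 60 * 60 * 1000
): Promise<CachedReply | null> {
  try {
    return parseCachedReply(await store.getItem(key), Date.now(), maxAgeMs);
  } catch (_err) {
    return null; // a storage failure is treated as a cache miss
  }
}
```

Because every failure mode returns `null`, the caller needs only one branch: use the cached reply if present, otherwise make the network call.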