Why Cohere
Cohere is an enterprise-focused NLP provider whose models are built around retrieval and search rather than open-ended chat. Its three workhorse endpoints map cleanly onto the parts of a production RAG system: Embed turns text into vectors, Rerank scores how relevant documents are to a query, and Chat (Command R and R+) generates grounded answers with citations and tool calls. If you are building search, knowledge assistants, or document processing at scale, Cohere’s separation of these concerns is a genuine advantage over a single chat-only API.
Use the builder above to assemble a real request for any of the three endpoints, then copy the curl and run it with your own key.
How the three endpoints fit together
A typical Cohere RAG flow looks like this. At ingestion time you call Embed
with input_type: "search_document" to vectorise every chunk of your corpus and
store the vectors. At query time you embed the user’s question with
input_type: "search_query", do an approximate-nearest-neighbour lookup to get
a noisy top-k, then pass that candidate set to Rerank to reorder by true
relevance. Finally you hand the best few chunks to Chat as documents so
Command R+ produces an answer grounded in your data, with inline citations
that it generates automatically whenever you pass a documents field.
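The query-time half of that flow can be sketched as plain request bodies. This is a minimal illustration, not a definitive client: it assumes the v1 JSON shapes for Rerank and Chat, the model name rerank-english-v3.0 is an assumption, and the helper names (rerank_request, select_reranked, chat_request) are hypothetical. Nothing here calls the network; the Rerank response is a hand-written stand-in.

```python
from __future__ import annotations

RERANK_MODEL = "rerank-english-v3.0"  # assumed model name; check your account
CHAT_MODEL = "command-r-plus"         # model family named in the text


def rerank_request(query: str, candidates: list[str], top_n: int = 3) -> dict:
    """Build a Rerank body for the noisy top-k returned by the vector lookup."""
    return {
        "model": RERANK_MODEL,
        "query": query,
        "documents": candidates,
        "top_n": top_n,
    }


def select_reranked(candidates: list[str], rerank_response: dict) -> list[str]:
    """Map Rerank results (index + relevance_score) back to the chunk texts."""
    ranked = sorted(
        rerank_response["results"],
        key=lambda r: r["relevance_score"],
        reverse=True,
    )
    return [candidates[r["index"]] for r in ranked]


def chat_request(question: str, chunks: list[str]) -> dict:
    """Build a Chat body that passes the best chunks in the documents field."""
    return {
        "model": CHAT_MODEL,
        "message": question,
        "documents": [
            {"title": f"chunk-{i}", "snippet": text}
            for i, text in enumerate(chunks)
        ],
    }


# Hand-written stand-ins for the ANN candidates and the Rerank response.
candidates = [
    "Rerank scores documents against a query.",
    "Bananas are yellow.",
    "Embed turns text into vectors.",
]
fake_response = {
    "results": [
        {"index": 2, "relevance_score": 0.41},
        {"index": 0, "relevance_score": 0.93},
    ]
}
best_chunks = select_reranked(candidates, fake_response)
chat_body = chat_request("What does Rerank do?", best_chunks)
```

Keeping the payload builders separate from the HTTP call makes each stage easy to unit-test and to swap for the SDK later.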
The key detail beginners miss is that v3 embeddings are asymmetric: documents
and queries must be embedded with different input_type values or retrieval
quality drops noticeably. Get that right and Cohere’s retrieval stack is hard to
beat.
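One way to make the asymmetry hard to get wrong is to give ingestion and querying their own helpers, so the input_type is never typed by hand at the call site. The sketch below only builds Embed request bodies; the model name embed-english-v3.0 and the helper names are assumptions for illustration, and it deliberately accepts only the two input_type values discussed here.

```python
from __future__ import annotations

EMBED_MODEL = "embed-english-v3.0"  # assumed model name
_RETRIEVAL_INPUT_TYPES = {"search_document", "search_query"}


def embed_request(texts: list[str], input_type: str) -> dict:
    """Build an Embed body, rejecting anything but the two retrieval types."""
    if input_type not in _RETRIEVAL_INPUT_TYPES:
        raise ValueError(f"unexpected input_type for retrieval: {input_type!r}")
    return {"model": EMBED_MODEL, "texts": list(texts), "input_type": input_type}


def document_embed_request(chunks: list[str]) -> dict:
    """Ingestion time: corpus chunks are embedded as search_document."""
    return embed_request(chunks, "search_document")


def query_embed_request(question: str) -> dict:
    """Query time: the user's question is embedded as search_query."""
    return embed_request([question], "search_query")
```

With this split, a query can only be embedded as a document by calling the wrong helper, which is far easier to spot in review than a mistyped string.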
Tips and gotchas
- Authenticate with a Bearer header. Cohere uses Authorization: Bearer $CO_API_KEY, not a custom key header.
- Start with Rerank for cheap wins. If your RAG answers are mediocre, adding a Rerank pass over your existing retrieval is the highest-leverage change you can make and costs far less than the chat call.
- Pin the model string. Always specify a dated model string (e.g. command-r-plus-08-2024) rather than an undated alias such as command-r-plus, so a model update never silently changes your behaviour.
- Move to the SDK once it works. Prototype with the curl above, then switch to the official cohere Python or TypeScript SDK for retries, streaming, and typed responses in production.
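As a rough Python equivalent of the curl prototype, the sketch below builds an authenticated request with only the standard library. It assumes the base URL https://api.cohere.com/v1 and reads the key from the CO_API_KEY environment variable; build_request is a hypothetical helper, and the request is constructed but not sent.

```python
from __future__ import annotations

import json
import os
import urllib.request

API_BASE = "https://api.cohere.com/v1"  # assumed base URL


def build_request(endpoint: str, body: dict, api_key: str | None = None) -> urllib.request.Request:
    """Build a POST request carrying the Bearer header; does not send it."""
    key = api_key or os.environ.get("CO_API_KEY")
    if not key:
        raise RuntimeError("Set CO_API_KEY or pass api_key explicitly")
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",  # Bearer token, not a custom key header
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it. Once the prototype works, the SDK takes over the retries and streaming that this sketch leaves out.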