How to Run LLMs Locally with Ollama

Llama 3, Mistral, and Gemma — private AI on your laptop

Ad placeholder (leaderboard)

Why run a model locally

Running an LLM locally with Ollama gives you three things a hosted API can’t: privacy (nothing leaves your machine), zero marginal cost (no per-token billing), and offline availability. The trade-off is that you run smaller, open-weight models than the largest hosted ones, and speed depends on your hardware. For coding help, drafting, summarisation, and RAG over private documents, a 7B–9B model on a laptop is genuinely useful. Use the builder above to match a model to your RAM and generate the exact commands.

How it works

Ollama installs a small background service that manages model weights and exposes a local HTTP API on port 11434. You interact with it three ways:

  • ollama run <model> for an interactive terminal chat — the fastest way to try a model.
  • The native REST API at localhost:11434/api/chat, which returns Ollama’s own JSON schema and supports streaming token-by-token.
  • The OpenAI-compatible API at localhost:11434/v1/chat/completions, which lets you reuse code written for OpenAI by simply changing the base URL — the API key can be any non-empty string.

The first ollama pull downloads quantised weights (compressed to run in far less memory than full precision) once; subsequent runs are instant. Models are stored locally and shared across all three access methods.

Tips and notes

  • Quantisation is your friend. Default Ollama models are 4-bit quantised, which is why a 7B model fits in ~5 GB on disk and runs in ~8–16 GB of RAM with little quality loss for most tasks.
  • Use the OpenAI shim to migrate gradually. Develop against the local endpoint, then flip the base URL back to a hosted provider for production if you need a larger model — no other code changes.
  • Keep one model loaded. Ollama unloads idle models after a few minutes; set OLLAMA_KEEP_ALIVE if you want a server to keep a model hot for low latency.
  • Watch the RAM check. If the tool above says your model won’t fit, it really won’t run well — swapping to disk makes generation painfully slow. Drop to the next smaller model instead.
Ad placeholder (rectangle)