How much RAM do I actually need?

A rough rule is that a model needs slightly more RAM than its file size. A 3B model runs comfortably in 8 GB, 7B–9B models want around 16 GB, and 70B models need a workstation with 64 GB or more. The RAM checker above flags when your chosen model will swap and crawl.

No. Ollama runs on CPU and uses Apple Silicon's unified memory or a discrete GPU automatically when present. A GPU makes generation much faster, but small models are perfectly usable on a modern laptop CPU.

Is it really private?

Yes. Once the weights are downloaded, inference happens entirely on your machine with no network calls. You can pull a model, disconnect from the internet, and keep chatting — which is why Ollama is popular for sensitive or offline work.

Can I reuse my existing OpenAI code?

Mostly yes. Ollama exposes an OpenAI-compatible endpoint at localhost:11434/v1, so you can point an OpenAI SDK at it by changing the base URL and using any string as the API key. A few advanced parameters differ, but chat completions work out of the box.

How do I customise a model's behaviour?

Create a Modelfile that sets a base model, a system prompt, and parameters like temperature, then run ollama create. This bakes your settings into a named local model you can run like any other, which is handy for consistent assistants.

How to Run LLMs Locally with Ollama

Why run a model locally

Running an LLM locally with Ollama gives you three things a hosted API can’t: privacy (nothing leaves your machine), zero marginal cost (no per-token billing), and offline availability. The trade-off is that you run smaller, open-weight models than the largest hosted ones, and speed depends on your hardware. For coding help, drafting, summarisation, and RAG over private documents, a 7B–9B model on a laptop is genuinely useful. Use the builder above to match a model to your RAM and generate the exact commands.

How it works

Ollama installs a small background service that manages model weights and exposes a local HTTP API on port 11434. You interact with it three ways:

ollama run <model> for an interactive terminal chat — the fastest way to try a model.
The native REST API at localhost:11434/api/chat, which returns Ollama’s own JSON schema and supports streaming token-by-token.
The OpenAI-compatible API at localhost:11434/v1/chat/completions, which lets you reuse code written for OpenAI by simply changing the base URL — the API key can be any non-empty string.

The first ollama pull downloads quantised weights (compressed to run in far less memory than full precision) once; subsequent runs are instant. Models are stored locally and shared across all three access methods.

Tips and notes

Quantisation is your friend. Default Ollama models are 4-bit quantised, which is why a 7B model fits in ~5 GB on disk and runs in ~8–16 GB of RAM with little quality loss for most tasks.
Use the OpenAI shim to migrate gradually. Develop against the local endpoint, then flip the base URL back to a hosted provider for production if you need a larger model — no other code changes.
Keep one model loaded. Ollama unloads idle models after a few minutes; set OLLAMA_KEEP_ALIVE if you want a server to keep a model hot for low latency.
Watch the RAM check. If the tool above says your model won’t fit, it really won’t run well — swapping to disk makes generation painfully slow. Drop to the next smaller model instead.