Why small models matter
Frontier models are impressive, but they need data-centre accelerators, so every request becomes a call to a hosted API. Small language models (SLMs), roughly 2B to 8B parameters, flip the trade-off: they run locally on a laptop, a single consumer GPU, or even a phone. That unlocks privacy (data never leaves the device), offline use, low latency, and near-zero marginal cost per inference. The three most prominent open SLM options today are Microsoft’s Phi-3, Google’s Gemma, and Meta’s Llama 3 in its 8B size.
The contenders
- Phi-3 Mini (~3.8B). Microsoft’s bet is “textbook-quality” curated training data over raw size. Phi-3 Mini consistently scores above what its parameter count would suggest on reasoning and coding benchmarks, making it a standout for capability-per-byte on constrained hardware.
- Gemma 2B and 7B. Google’s open family derived from Gemini research. The 2B model is one of the best options for the tightest devices; the 7B is a solid general performer. Despite the names, the checkpoints total roughly 2.5B and 8.5B parameters once embeddings are counted. Gemma ships with good tooling and terms that allow commercial use (Google’s own Gemma Terms of Use rather than a standard open-source licence).
- Llama 3 8B. Meta’s 8B is the strongest general-purpose all-rounder of the three, with broad knowledge, good instruction following, and by far the largest ecosystem of fine-tunes, quantizations, and tooling support.
Quality per parameter and quantization
Benchmarks (MMLU, HumanEval, GSM8K) tell a consistent story: Phi-3 Mini delivers strong reasoning and coding scores for its size; Llama 3 8B leads overall, helped by its size advantage over Phi-3 Mini and Gemma 2B and by broad training data; and Gemma is competitive in the middle while offering the smallest viable option in its 2B model.
The number that actually decides deployability is memory after quantization. At 4 bits a weight takes about half a byte, so the weights alone need roughly 0.5 GB per billion parameters, plus some overhead for quantization scales and runtime buffers. At 4-bit, Gemma 2B fits in roughly 1.5-2 GB, Phi-3 Mini in about 2.5 GB, and Llama 3 8B in around 5-6 GB. 4-bit quantization typically costs only a few benchmark points, which is why most on-device deployments use it. Below 4-bit, quality degrades faster, so 4-bit is the usual sweet spot.
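The 4-bit figures above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: parameter counts are approximate, the effective 4.5 bits per weight stands in for the extra space that quantization scales take in common 4-bit formats, and the 20% overhead for runtime buffers and a modest KV cache is an illustrative guess rather than a measured value. The names `Model` and `estimate_memory_gb` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    params_billion: float  # approximate total parameter count, in billions


# Approximate sizes; Gemma "2B" is closer to 2.5B once embeddings are counted.
MODELS = [
    Model("Gemma 2B", 2.5),
    Model("Phi-3 Mini", 3.8),
    Model("Llama 3 8B", 8.0),
]


def estimate_memory_gb(params_billion: float,
                       bits_per_weight: float = 4.5,
                       overhead: float = 0.20) -> float:
    """Estimate resident memory in GB (1 GB = 1e9 bytes).

    bits_per_weight: assumed effective bits per weight, including
        quantization scales (assumption, not a property of any one format).
    overhead: assumed fraction added for buffers and a small KV cache.
    """
    # billions of parameters * bytes per parameter = gigabytes of weights
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * (1 + overhead)


if __name__ == "__main__":
    for m in MODELS:
        gb = estimate_memory_gb(m.params_billion)
        print(f"{m.name}: ~{gb:.1f} GB at 4-bit")
```

With these assumptions the estimates land at about 1.7 GB, 2.6 GB, and 5.4 GB, in line with the ranges quoted above. Long contexts grow the KV cache well beyond the flat overhead used here, so treat the output as a lower-end planning number.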
Licensing and ecosystem
Licensing differs and matters for products. Phi-3 is released under the MIT licence. Gemma ships under Google’s Gemma Terms of Use, which allow commercial use but attach a prohibited-use policy. Llama 3 uses Meta’s community licence, which is permissive for the vast majority of companies but requires a separate licence from Meta for products with more than 700 million monthly active users; read it if you are at hyperscaler size. Of the three, the Llama ecosystem is the richest: the most community fine-tunes, the broadest support in runtimes like llama.cpp, Ollama, and vLLM, and the most quantized variants ready to download.
Picking one
Choose by your binding constraint. For the most constrained device (a phone or tiny edge board), pick Gemma 2B or a quantized Phi-3 Mini. For the best reasoning and coding per parameter, reach for Phi-3 Mini. For the strongest general capability and the deepest tooling, Llama 3 8B is the safe default if your hardware can hold ~6 GB. In practice many teams ship Phi-3 or Gemma 2B on-device and fall back to a larger hosted model for hard queries.
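The guidance above can be sketched as a small selection rule plus a fallback router. This is a minimal illustration, not a recommended implementation: the footprints are the upper ends of the 4-bit ranges quoted earlier, the priority labels are invented for the example, and `pick_model`, `route`, and the `is_hard` check are hypothetical names. How to detect a "hard" query is left to the caller.

```python
from typing import Callable, Optional

# Approximate 4-bit footprints in GB (upper end of the ranges quoted earlier).
FOOTPRINT_GB = {
    "Gemma 2B": 2.0,
    "Phi-3 Mini": 2.5,
    "Llama 3 8B": 6.0,
}

# Preference order per priority. The labels are hypothetical, chosen to
# mirror the three constraints discussed in the text.
PREFERENCE = {
    "general": ["Llama 3 8B", "Phi-3 Mini", "Gemma 2B"],
    "reasoning_per_parameter": ["Phi-3 Mini", "Llama 3 8B", "Gemma 2B"],
    "smallest": ["Gemma 2B", "Phi-3 Mini", "Llama 3 8B"],
}


def pick_model(memory_budget_gb: float, priority: str = "general") -> Optional[str]:
    """Return the first preferred model that fits the budget, else None."""
    for name in PREFERENCE[priority]:
        if FOOTPRINT_GB[name] <= memory_budget_gb:
            return name
    return None


def route(query: str,
          local_model: Callable[[str], str],
          hosted_model: Callable[[str], str],
          is_hard: Callable[[str], bool]) -> str:
    """Answer locally unless the caller-supplied check flags the query as hard."""
    if is_hard(query):
        return hosted_model(query)
    return local_model(query)


if __name__ == "__main__":
    print(pick_model(8.0))              # laptop with memory to spare
    print(pick_model(4.0))              # mid-range device
    print(pick_model(2.0, "smallest"))  # phone or edge board
```

Keeping the hardness check as a plain callable lets a team start with a simple heuristic and swap in something better later without touching the routing code.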