Safety is a balance, not a single score
It is tempting to ask which AI model is “safest,” but the question hides a tradeoff. Safety has two opposing dimensions: a model should refuse genuinely harmful requests (weapons, malware, exploitation) while not over-refusing benign ones (explaining how a medication works, writing fiction with conflict, discussing security concepts for defence). A model that refuses everything is perfectly safe and completely useless. The art is calibrating the line, and that is where ChatGPT, Claude, and Gemini differ most.
How the three leading models compare
On clearly harmful requests, all three frontier models perform well — refusing instructions for serious physical, cyber, or biological harm is solved territory, and the remaining gaps are edge cases that red-teaming and jailbreaking probe. The interesting differences live in ambiguous and benign-but-sensitive territory:
- ChatGPT (OpenAI) tends to provide caveated, educational answers to sensitive-but-legitimate questions, leaning toward usefulness with disclaimers. Its refusal style is generally concise.
- Claude (Anthropic) is trained with an explicit Constitutional AI approach and is often noted for nuanced reasoning about why it will or will not help, and for handling dual-use topics by addressing the legitimate framing while declining the harmful one.
- Gemini (Google) has historically applied relatively conservative safety filters, which at times produced more over-refusals on benign queries; Google has tuned this down across versions.
These are tendencies, not fixed laws — and crucially they shift with every model update.
The over-refusal problem
Over-refusal is the quieter failure mode. When a model declines a legitimate request — a nurse asking about drug interactions, a developer asking how an exploit works so they can defend against it, a novelist writing a violent scene — users get frustrated and route to less-safe tools or jailbreaks. Reducing over-refusal without weakening protection on real harms is an active research target, and recent model releases have generally moved toward explaining and contextualising sensitive topics rather than flatly refusing them, while keeping hard lines on genuine harm.
How to read safety claims
Because behaviour changes with each version, treat any specific verdict as a snapshot. What endures is the methodology: labs test with standardized prompt batteries spanning harmful, benign-sensitive, and ambiguous categories, scoring for correct refusal, over-refusal, and consistency, and publish the results in model and system cards. When you evaluate a model for your own product, do the same — assemble a small set of prompts representative of your real users, including the awkward edge cases, and measure how the model reasons, not just whether it said yes or no once. The best choice for you depends on whether your domain leans toward needing maximum caution or maximum helpfulness on sensitive topics.