Socratic dialogue evaluator
Multi-turn conversations are where LLMs most often go wrong: a model that answers turn one beautifully can contradict itself by turn four, lose the thread, or give shallower answers as the context fills. The Socratic dialogue evaluator scores each assistant turn in a chat log on reasoning depth, internal consistency, and alignment with earlier turns — using heuristic rubrics that run instantly in your browser, no API key required.
How it works
Paste your conversation as a JSON array of messages ([{ "role": "...", "content": "..." }]) or as plain text in which each turn starts with a user: or assistant: prefix. The evaluator parses the input, then scores each assistant turn across three dimensions:
- Reasoning depth — rewards explicit reasoning markers (“because”, “therefore”, “if… then”), structured explanation, and worked steps; penalizes one-line non-answers.
- Consistency — flags contradiction cues (“actually, no”, “I was wrong”, “ignore that”) and reversals against what the model said earlier.
- Alignment — measures how much each turn connects back to the conversation so far, as a proxy for staying on-thread rather than drifting.
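Before any of these dimensions can be scored, the pasted input has to become a list of turns. The following is a minimal sketch of that parsing step, not the tool's actual code. The function name parse_conversation, the TAG pattern, and the rule that untagged lines continue the previous turn are all assumptions made for illustration.

```python
import json
import re

# Assumed plain-text convention: a turn starts with "user:" or "assistant:".
TAG = re.compile(r"^\s*(user|assistant)\s*:\s*(.*)$", re.IGNORECASE)


def parse_conversation(raw: str) -> list[dict]:
    """Return [{"role": ..., "content": ...}, ...] from JSON or tagged text."""
    text = raw.strip()

    # Try JSON first when the input looks like an array of messages.
    if text.startswith("["):
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            data = None  # not valid JSON: fall through to the plain-text parser
        if isinstance(data, list):
            return [
                {"role": str(m["role"]).lower(), "content": str(m["content"])}
                for m in data
                if isinstance(m, dict) and "role" in m and "content" in m
            ]

    # Plain text: one tagged line opens a turn, untagged lines extend it.
    turns: list[dict] = []
    for line in text.splitlines():
        match = TAG.match(line)
        if match:
            turns.append({"role": match.group(1).lower(), "content": match.group(2)})
        elif turns and line.strip():
            joined = turns[-1]["content"] + "\n" + line.strip()
            turns[-1]["content"] = joined.strip()
    return turns
```

Normalizing both formats to the same list of role/content dictionaries means the scoring step only has to handle one shape of input.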
Each turn gets a composite score and notes, and you get a conversation-level average so you can compare runs.
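To make the scoring concrete, here is one way the three rubrics and the composite could fit together. It is a rough sketch under stated assumptions, not the evaluator's real implementation: the weights, the 0 to 1 scale, the penalty per contradiction cue, and the word-overlap measure for alignment are all hypothetical choices. It also covers only the cue-phrase half of the consistency check and omits the per-turn notes.

```python
import re
from statistics import mean

# Marker and cue lists taken from the dimension descriptions above.
REASONING = re.compile(
    r"\b(?:because|therefore)\b|\bif\b.*?\bthen\b", re.IGNORECASE | re.DOTALL
)
CONTRADICTION_CUES = ("actually, no", "i was wrong", "ignore that")


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))


def depth(reply: str) -> float:
    """0..1: reasoning markers plus length, so one-line answers score low."""
    markers = len(REASONING.findall(reply))
    length = len(reply.split())
    return min(1.0, 0.25 * markers + min(length, 60) / 120)


def consistency(reply: str) -> float:
    """1.0 with no contradiction cue; assumed penalty of 0.5 per cue found."""
    lowered = reply.lower()
    hits = sum(lowered.count(cue) for cue in CONTRADICTION_CUES)
    return max(0.0, 1.0 - 0.5 * hits)


def alignment(reply: str, history: list[str]) -> float:
    """Share of the reply's words that already appeared earlier in the chat."""
    reply_words = words(reply)
    seen = set().union(*(words(h) for h in history))
    if not reply_words or not seen:
        return 0.0
    return len(reply_words & seen) / len(reply_words)


def score_conversation(turns: list[dict], weights=(0.4, 0.3, 0.3)):
    """Score every assistant turn; return (per-turn scores, average composite)."""
    per_turn, history = [], []
    for index, turn in enumerate(turns):
        if turn["role"] == "assistant":
            d = depth(turn["content"])
            c = consistency(turn["content"])
            a = alignment(turn["content"], history)
            composite = weights[0] * d + weights[1] * c + weights[2] * a
            per_turn.append({"turn": index, "depth": d, "consistency": c,
                             "alignment": a, "composite": composite})
        history.append(turn["content"])  # every turn becomes context for later ones
    average = mean(t["composite"] for t in per_turn) if per_turn else 0.0
    return per_turn, average
```

Raw word overlap is a crude alignment proxy because common words inflate it, so a stop-word filter would be a natural refinement. Keeping the weights as a parameter makes it easy to compare runs under different emphases.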
Tips and examples
Use the per-turn view to find the exact point where a conversation degrades — often you’ll see depth fall off a cliff once the context gets long, which is a cue to summarize and restart. When building evaluation datasets, a low consistency score is a good filter for turns worth a human’s attention. Remember the scores are heuristic: they reliably catch shallow, drifting, or self-reversing turns, but they can’t judge whether a claim is factually true — pair the tool with spot-checks for anything high-stakes.
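If you export the per-turn scores, both tips can be automated. The sketch below assumes each turn is a dictionary with "turn" and "consistency" keys and that depth scores are listed in assistant-turn order. The function names, the two-turn window, and the thresholds are hypothetical values to tune on your own data.

```python
from typing import Optional


def first_drop(depth_scores: list[float], window: int = 2,
               drop: float = 0.3) -> Optional[int]:
    """Position (in assistant-turn order) of the first turn whose depth falls
    at least `drop` below the mean of the preceding `window` turns."""
    for i in range(window, len(depth_scores)):
        baseline = sum(depth_scores[i - window:i]) / window
        if baseline - depth_scores[i] >= drop:
            return i
    return None  # no sharp drop found


def needs_review(per_turn: list[dict], threshold: float = 0.5) -> list[int]:
    """Turn indices whose consistency score is below the review threshold."""
    return [t["turn"] for t in per_turn if t["consistency"] < threshold]
```

For depth scores of 0.7, 0.8, 0.75, 0.3, 0.2, first_drop returns 3: the fourth assistant turn is where you would summarize and restart.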