How does it score without an AI judge?

It applies heuristic rubrics to the text, checking for reasoning markers, explanation length, hedging, and lexical overlap with prior turns (a proxy for consistency). It is a fast, deterministic first pass, not a substitute for human or LLM-judge review.

What input format does it accept?

Either a JSON array of objects with role and content fields, or plain text with lines tagged user: and assistant:. The parser detects which one you pasted.
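
As an illustration of that detection step, here is a minimal Python sketch. The tool's actual parser is not shown on this page, so the function name parse_conversation, the regular expression, and the handling of untagged continuation lines are all assumptions.

```python
import json
import re

# Hypothetical helper: matches lines such as "user: hello" or "Assistant: hi".
TAG_RE = re.compile(r"^\s*(user|assistant)\s*:\s*(.*)$", re.IGNORECASE)


def parse_conversation(raw: str) -> list[dict]:
    """Return a list of {"role", "content"} dicts from JSON or tagged text."""
    text = raw.strip()

    # Try JSON first: an array of objects that each carry role and content.
    if text.startswith("["):
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, list) and all(
            isinstance(t, dict) and "role" in t and "content" in t for t in data
        ):
            return [
                {"role": str(t["role"]).lower(), "content": str(t["content"])}
                for t in data
            ]

    # Fall back to plain text with user: / assistant: tags.
    turns: list[dict] = []
    for line in text.splitlines():
        match = TAG_RE.match(line)
        if match:
            turns.append(
                {"role": match.group(1).lower(), "content": match.group(2).strip()}
            )
        elif turns and line.strip():
            # Assumption: an untagged line continues the previous turn.
            joined = turns[-1]["content"] + "\n" + line.strip()
            turns[-1]["content"] = joined.strip()
    return turns
```

Trying JSON first and falling back to tagged text means a malformed JSON paste still gets a best-effort read instead of an error.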

What does the consistency score mean?

It estimates how well each assistant turn stays grounded in the conversation by measuring how much of its content connects to earlier turns. Sudden topic jumps or self-contradiction phrases lower the score.
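
One way to picture the "connects to earlier turns" part is as word overlap. The Python sketch below is an assumption about how such a proxy could be computed, not the tool's actual formula; the stopword list and the name grounding_score are hypothetical, and it leaves out the contradiction-phrase penalty.

```python
import re

# Hypothetical, deliberately small stopword list.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that",
    "this", "for", "on", "with", "as", "be", "are", "was", "you", "i",
}


def content_words(text: str) -> set[str]:
    """Lowercased words from the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def grounding_score(turn: str, earlier_turns: list[str]) -> float:
    """Fraction of the turn's content words that already appeared earlier."""
    words = content_words(turn)
    if not words or not earlier_turns:
        return 0.0
    seen = set().union(*(content_words(t) for t in earlier_turns))
    return len(words & seen) / len(words)
```

With the earlier turn "Why does ice float on water?", the reply "Ice floats because its density is lower than water." shares 2 of its 8 content words and scores 0.25, while "Let's talk about football instead." scores 0.0. Exact-word matching misses "float" versus "floats", which is one reason a score like this is only a rough proxy.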

Can it detect contradictions?

It flags surface-level contradiction cues like reversing earlier statements or saying "actually, no" after a confident claim. It cannot verify factual truth, so deep logical contradictions still need human review.
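
A surface-level cue check can be as simple as phrase matching. The Python sketch below assumes a short cue list taken from the phrases this page mentions; the tool's real cue list and the function name contradiction_cues are not documented here and are hypothetical.

```python
import re

# Assumed cue phrases; a real list would likely be longer.
CUE_PATTERNS = [r"\bactually,? no\b", r"\bi was wrong\b", r"\bignore that\b"]
CUE_RE = re.compile("|".join(CUE_PATTERNS), re.IGNORECASE)


def contradiction_cues(turn: str) -> list[str]:
    """Return every cue phrase found in the turn, lowercased, in order."""
    return [m.group(0).lower() for m in CUE_RE.finditer(turn)]
```

For "The answer is 12. Actually, no, it is 14." this returns ["actually, no"]. The check only sees wording, so a turn that quietly changes its claim without any of these phrases passes unflagged.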

Is my conversation uploaded anywhere?

No. Parsing and scoring happen entirely in your browser. Nothing leaves the page.

What is the Socratic Dialogue Evaluator?

Paste a multi-turn chat log and the tool scores each model turn on logical consistency, reasoning depth, and alignment with prior turns using heuristic rubrics, so you can spot where a conversation drifts or contradicts itself. It runs free in your browser on Gera Tools, with nothing uploaded.

Name: Socratic Dialogue Evaluator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Socratic dialogue evaluator

Multi-turn conversations are a common failure point for LLMs: a model that answers turn one beautifully can contradict itself by turn four, lose the thread, or give shallower answers as the context fills. The Socratic dialogue evaluator scores each assistant turn in a chat log on reasoning depth, internal consistency, and alignment with earlier turns, using heuristic rubrics that run instantly in your browser with no API key required.

How it works

Paste your conversation as JSON ([{ "role": "...", "content": "..." }]) or as plain text with user: / assistant: tags. The evaluator parses it, then scores each assistant turn across three dimensions:

Reasoning depth — rewards explicit reasoning markers (“because”, “therefore”, “if… then”), structured explanation, and worked steps; penalizes one-line non-answers.
Consistency — flags contradiction cues (“actually, no”, “I was wrong”, “ignore that”) and reversals against what the model said earlier.
Alignment — measures how much each turn connects back to the conversation so far, as a proxy for staying on-thread rather than drifting.

Each turn gets a composite score and notes, and you get a conversation-level average so you can compare runs.
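
To make the scoring flow concrete, here is a rough Python sketch of a reasoning-depth rubric and the composite. The marker list, the word-count cutoff, and the equal weighting of the three dimensions are assumptions for illustration; the tool's actual weights are not stated on this page.

```python
import re
from statistics import mean

# Assumed reasoning markers: "because", "therefore", and "if ... then".
MARKER_RE = re.compile(r"\b(because|therefore|if\b.*?\bthen)\b", re.IGNORECASE)


def depth_score(turn: str, max_markers: int = 3, min_words: int = 8) -> float:
    """0.0 for very short turns; otherwise rises with the marker count, capped at 1.0."""
    if len(turn.split()) < min_words:
        return 0.0
    markers = len(MARKER_RE.findall(turn))
    return min(markers, max_markers) / max_markers


def composite(depth: float, consistency: float, alignment: float) -> float:
    """Equal-weight average of the three dimensions (the weighting is an assumption)."""
    return round((depth + consistency + alignment) / 3, 3)


def conversation_average(turn_scores: list[float]) -> float:
    """Conversation-level average of per-turn composite scores."""
    return round(mean(turn_scores), 3) if turn_scores else 0.0
```

Here consistency and alignment are passed in as numbers between 0 and 1, however they were computed. Keeping the composite as a plain average makes two runs easy to compare, at the cost of hiding which dimension moved, so the per-turn notes still matter.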

Why evaluate Socratic dialogue specifically?

The Socratic method is demanding for LLMs because it requires the model to hold a goal in mind (a target insight), advance the learner with questions, and do so without either stating the answer directly or drifting off the thread. A generic conversation evaluator misses the structure-specific failure modes: a model that starts Socratic but slips into declarative answers by turn six, or one that asks varied-seeming questions that all come back to the same prompt.

This evaluator scores against that structure by weighting:

Question density — how often the assistant’s turns end with a genuine question (critical for Socratic mode)
Topic coherence — whether follow-up questions build on the learner’s previous answer or reset to a pre-scripted sequence
Depth progression — whether the reasoning complexity escalates through the conversation rather than staying flat
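
Two of these signals are simple enough to sketch in Python. The function names and the half-versus-half comparison below are assumptions, not the tool's documented method, and a trailing question mark is only a proxy for a genuine question. Topic coherence is left out because it depends on comparing each question with the learner's previous answer.

```python
from statistics import mean


def question_density(assistant_turns: list[str]) -> float:
    """Share of assistant turns that end with a question mark."""
    if not assistant_turns:
        return 0.0
    asked = sum(1 for t in assistant_turns if t.rstrip().endswith("?"))
    return asked / len(assistant_turns)


def depth_progression(depth_scores: list[float]) -> float:
    """Positive when the later half of the conversation is deeper than the earlier half."""
    if len(depth_scores) < 2:
        return 0.0
    mid = len(depth_scores) // 2
    return mean(depth_scores[mid:]) - mean(depth_scores[:mid])
```

A model that starts Socratic and slips into declarative answers would show question density falling when it is computed over successive windows of turns, and a flat or negative depth progression.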

Interpreting the scores

Score pattern	What it means
High depth, high consistency	Turn is well-reasoned and grounded in prior turns
High depth, low consistency	Rich reasoning but contradicts or ignores earlier context
Low depth, high consistency	Stays on topic but gives thin, surface-level responses
Low depth, low consistency	Shallow and drifting — a clear weak point in the dialogue
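
The four patterns can be read as a simple two-way split. In the Python sketch below, the 0.5 threshold and the function name interpret are assumptions chosen for illustration; the page does not say where the tool draws the line between high and low.

```python
def interpret(depth: float, consistency: float, threshold: float = 0.5) -> str:
    """Map a turn's depth and consistency scores (0 to 1) to one of the four patterns."""
    deep = depth >= threshold
    consistent = consistency >= threshold
    if deep and consistent:
        return "well-reasoned and grounded in prior turns"
    if deep:
        return "rich reasoning, but contradicts or ignores earlier context"
    if consistent:
        return "on topic, but thin"
    return "shallow and drifting"
```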

Low consistency in early turns is more damaging than low consistency later, because self-contradictions in the foundational turns undermine the entire reasoning chain.

Practical uses

Prompt testing — compare two system prompts’ dialogue quality across the same user turns, without needing LLM-as-judge API calls.
Dataset curation — filter a large batch of synthetic dialogues to find the turns worth human review before fine-tuning.
Teaching tool audits — evaluate AI tutoring sessions where Socratic consistency is the pedagogical goal, checking that the model never gives away the answer early.

Use the per-turn view to find the exact point where a conversation degrades. Often you will see depth fall off once the context gets long, which is a cue to summarize and restart. The scores are heuristic: they are good at catching shallow, drifting, or self-reversing turns, but they cannot verify factual truth, so pair the tool with spot-checks for anything high-stakes.
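
If you export the per-turn depth scores, locating the degradation point can be automated. The Python sketch below assumes a simple rule, "depth drops below half the average of all earlier turns"; both the rule and the function name first_degraded_turn are hypothetical, not part of the tool.

```python
from statistics import mean
from typing import Optional


def first_degraded_turn(depth_scores: list[float], drop: float = 0.5) -> Optional[int]:
    """Index of the first turn whose depth falls below `drop` times the mean of earlier turns."""
    for i in range(1, len(depth_scores)):
        baseline = mean(depth_scores[:i])
        if baseline > 0 and depth_scores[i] < drop * baseline:
            return i
    return None
```

For the scores [0.8, 0.7, 0.9, 0.3, 0.2] this returns 3, the turn where depth first falls well below what came before, which would be a natural place to summarize and restart.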