Benchmark runtime restructured
Canonical script layout enforced. All studies now run from unified pipeline entry points with consistent CLI interfaces.
Research Project
Evaluating clinical reasoning reliability in mental health LLMs
A black-box, scalable benchmark harness evaluating three critical failure modes in large language models destined for mental health support: unfaithful reasoning, sycophantic agreement, and longitudinal drift.
Purpose & Scope
This benchmark evaluates clinical reasoning reliability in large language models designed for mental health support. It tests three critical failure modes that directly impact patient safety.
Unlike general-purpose LLM benchmarks, this framework targets the specific ways models can fail in clinical contexts: producing correct answers with fabricated rationales, agreeing with incorrect user assumptions under social pressure, and losing consistency across extended conversations.
Note: This page showcases the research project. The live benchmark platform is a separate artefact (coming soon).
Do step-by-step rationales line up with gold reasoning traces, or are models producing correct answers with fabricated explanations?
Can models maintain clinical accuracy while refusing to agree with user errors under social pressure?
Does the model maintain consistency and recall critical patient details over multi-turn therapeutic conversations?
Identifies when models produce unsafe clinical outputs, including unfaithful reasoning and silent bias in mental health contexts.
Measures whether model explanations genuinely reflect their decision-making process, or are post-hoc rationalisations.
Tests multiple models across three studies with reproducible scoring pipelines, enabling fair comparison of clinical reasoning capabilities.
Grounded in verified failure modes from safety literature with clinically meaningful evaluation scenarios.
Surveyed clinical LLM failure modes and designed three-study evaluation protocol.
Implemented Study A (faithfulness), Study B (sycophancy), and Study C (drift) pipelines.
Scaled to 14,416 prompts per model across all studies — Study A: 4,000 CoT/Early + 2,016 adversarial bias; Study B: 4,000 single-turn + 2,400 multi-turn; Study C: 2,000 longitudinal turns.
Running all studies across 8 models totalling 115,328 prompts with full metric computation and result archival.
Producing final analysis notebooks, charts, and written report.
Release benchmark platform, dataset splits, and reproducibility package.
Canonical script layout enforced. All studies now run from unified pipeline entry points with consistent CLI interfaces.
All 8 models evaluated across Studies A, B, and C with 115,328 total prompts. Results archived with full metric computation including faithfulness gap, sycophancy probability, and entity recall at turn 10.
40+ patient personas created with safety plans and multi-turn conversation histories for Study C longitudinal drift evaluation.
Full written report covering all three studies, metrics, model comparisons, and findings.
DownloadSlide deck summarising benchmark design, key results, and implications for clinical AI safety.
DownloadDetailed description of metrics, data pipelines, gold label construction, and reproducibility steps.
Open-source benchmark runtime with all study pipelines, model runners, and metric calculators.
View on GitHubInteractive analysis from the benchmark's initial small-scale evaluation run (not the full scaled-dataset evaluation). Each notebook contains charts, tables, and detailed metric breakdowns.
Key academic sources supporting this benchmark
The benchmark platform is a separate, interactive tool where you can explore model evaluations, compare results, and run your own assessments.
This showcase page provides project context. For the live evaluation tool, visit the platform below.