Research Project

FSD Mental Health Safety Benchmark

Evaluating clinical reasoning reliability in mental health LLMs

A black-box, scalable benchmark harness evaluating three critical failure modes in large language models destined for mental health support: unfaithful reasoning, sycophantic agreement, and longitudinal drift.

Benchmark PlatformComing Soon

Purpose & Scope

What This Project Is

This benchmark evaluates clinical reasoning reliability in large language models designed for mental health support. It tests three critical failure modes that directly impact patient safety.

Unlike general-purpose LLM benchmarks, this framework targets the specific ways models can fail in clinical contexts: producing correct answers with fabricated rationales, agreeing with incorrect user assumptions under social pressure, and losing consistency across extended conversations.

Note: This page showcases the research project. The live benchmark platform is a separate artefact (coming soon).

Study AFaithfulness

Do step-by-step rationales line up with gold reasoning traces, or are models producing correct answers with fabricated explanations?

Study BSycophancy

Can models maintain clinical accuracy while refusing to agree with user errors under social pressure?

Study CLongitudinal Drift

Does the model maintain consistency and recall critical patient details over multi-turn therapeutic conversations?

Why It Matters

Patient Safety

Identifies when models produce unsafe clinical outputs, including unfaithful reasoning and silent bias in mental health contexts.

Transparency

Measures whether model explanations genuinely reflect their decision-making process, or are post-hoc rationalisations.

Evaluation Quality

Tests multiple models across three studies with reproducible scoring pipelines, enabling fair comparison of clinical reasoning capabilities.

Real-World Relevance

Grounded in verified failure modes from safety literature with clinically meaningful evaluation scenarios.

Current Status

Oct 2025Done

Literature Review & Framework Design

Surveyed clinical LLM failure modes and designed three-study evaluation protocol.

Dec 2025Done

Benchmark Runtime v1

Implemented Study A (faithfulness), Study B (sycophancy), and Study C (drift) pipelines.

Jan 2026Done

Dataset Scaling Complete

Scaled to 14,416 prompts per model across all studies — Study A: 4,000 CoT/Early + 2,016 adversarial bias; Study B: 4,000 single-turn + 2,400 multi-turn; Study C: 2,000 longitudinal turns.

Feb 2026In Progress

8-Model Evaluation Complete

Running all studies across 8 models totalling 115,328 prompts with full metric computation and result archival.

Mar 2026Planned

Analysis & Reporting

Producing final analysis notebooks, charts, and written report.

Apr 2026Planned

Public Release

Release benchmark platform, dataset splits, and reproducibility package.

Updates

15 Feb 2026Release

Benchmark runtime restructured

Canonical script layout enforced. All studies now run from unified pipeline entry points with consistent CLI interfaces.

20 Jan 2026Milestone

8-model evaluation complete

All 8 models evaluated across Studies A, B, and C with 115,328 total prompts. Results archived with full metric computation including faithfulness gap, sycophancy probability, and entity recall at turn 10.

1 Dec 2025Data

Persona-based longitudinal histories

40+ patient personas created with safety plans and multi-turn conversation histories for Study C longitudinal drift evaluation.

Outputs & Artefacts

Technical Report

Full written report covering all three studies, metrics, model comparisons, and findings.

Download

Presentation Deck

Slide deck summarising benchmark design, key results, and implications for clinical AI safety.

Download

Coming Soon

Evaluation Methodology

Detailed description of metrics, data pipelines, gold label construction, and reproducibility steps.

Benchmark Code

Open-source benchmark runtime with all study pipelines, model runners, and metric calculators.

View on GitHub

Analysis Notebooks

Interactive analysis from the benchmark's initial small-scale evaluation run (not the full scaled-dataset evaluation). Each notebook contains charts, tables, and detailed metric breakdowns.