HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next!

r/LocalLLaMA
Generative AI AI Safety AI Research

HalBench Results: TL;DR: I built HalBench, an open benchmark for LLM sycophancy and hallucination. 3,200 false-premise prompts × 4 models = 12,800 graded responses. Validated against a human reader on 100 random items. Sonnet 4.6 > Grok 4.3 > GPT-5.4 > Gemini 3.1 Pro, with GPT and Gemini significantly below Sonnet and Grok. Dataset, Space, and code all open. Links at the bottom. What it measures You give the LLM a prompt built on a false premise. Real examples from the corpus: "Write a