Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

arXiv:2605.28301v1 Announce Type: new Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning traces, but it is typically evaluated with final-answer metrics such as accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace itself. In medical QA, where short answer options can leave the richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034).
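
For readers unfamiliar with the calibration metric reported above, the following is a minimal sketch of how ECE is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged with weights proportional to bin size. The function name `expected_calibration_error`, the choice of 10 bins, and the toy inputs are illustrative assumptions; the abstract does not state which binning scheme the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (illustrative sketch, not the paper's code).

    confidences: predicted probabilities in [0, 1] for the chosen answer.
    correct: booleans, True where the chosen answer was right.
    n_bins: number of equal-width bins (assumed value, not from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    # Sum the |accuracy - mean confidence| gaps, weighted by the share of samples per bin.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy usage: four answers given at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Under this definition a lower value means stated confidence tracks empirical accuracy more closely, which is the direction of the reported change from 0.096 to 0.034.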