Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

arXiv:2605.20241v1

Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and.