Hidden Latent-State Shifts in LLMs: Why Current Alignment Is Blind to Real Internal Dangers — Especially With Agents

r/artificial
Generative AI AI Safety

For years, the alignment community has focused almost entirely on the model’s output - making sure the final tokens are safe, helpful, and honest. RLHF, DPO, constitutional AI, output filters - all of it operates at the surface level. But what if the model can enter a completely different internal regime inside the residual stream, while its external behavior remains perfectly aligned? We just measured exactly that.