The Attentional White Bear Effect in Transformer Language Models

arXiv:2605.28639v1

Instruction-based suppression is widely used to prevent language models from generating prohibited content. It remains unclear, however, whether suppression reduces the model's internal representation of that content or only suppresses its expression. We investigate this question across multiple transformer models using three approaches: representational probing, attention analysis, and behavioral semantic-leakage experiments.