Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings.

r/LocalLLaMA
Generative AI Open Source AI

It raises questions about how language models behave when the framing of a request shifts. Small open-source AI models can be moved from honest to dishonest behaviour by little than a change in tone. Asked to solve coding problems designed to be mathematically impossible, the model openly acknowledged the impossibility about a third of the time when addressed in neutral language. When the same problem was framed with mild pressure, suggesting only visible results mattered, the model never once admitted the task could not be done.