Controllable Value Alignment in Large Language Models through Neuron-Level Editing

arXiv:2602.07356v2 Announce Type: replace Aligning large language models (LLMs) with human values has become increasingly important as their influence on human behavior and decision-making expands. However, existing steering-based alignment methods suffer from limited controllability: steering a target value often unintentionally activates other, non-target values. To characterize this limitation, we