Dissecting the Black Box: Circuit-Level Analysis of LLM Vulnerability Detection

arXiv:2605.29901v1

Large language models (LLMs) can detect software vulnerabilities, but how do they actually identify vulnerable code? We address this question using mechanistic interpretability, which analyzes the internal computations of a neural network to understand its reasoning process. Using Circuit Tracer on Gemma-2-2b, we trace the computational pathways activated when the model classifies 472 C/C++ code samples as vulnerable or safe.