AI RESEARCH

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

arXiv CS.CL

ArXi:2605.30521v1 Announce Type: new Large language models must frequently process untrusted inputs, such as judging an answer from another model or running tasks like spam and harm classifiers while under adversarial pressure. These inputs are often string-formatted directly into a prompt template, leaving systems fragile to manipulation. Current LLM specs from major providers like OpenAI distinguish trustworthiness along an Instruction Hierarchy, from System messages (most trusted) to Tool Results (least trusted.