AI RESEARCH
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
arXiv CS.AI
arXiv:2605.23170v1 Announce Type: cross
Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack (NIAH) and RULER, but mainstream reasoning benchmarks do not control where the target task is placed within a long context. We audit 11 long-context benchmarks and find that none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no entry for NIAH, RULER, or LongBench-family benchmarks in any main results table, while agentic and coding benchmarks appear in the main results tables of all four.
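To make "jointly controlling task position, filler content, and context length" concrete, here is a minimal Python sketch of how such a prompt grid might be built. It is an illustration under stated assumptions, not the paper's protocol: the function names `build_prompt` and `position_grid` are hypothetical, context length is approximated by a whitespace word count rather than model tokens, and the task is inserted only at filler sentence boundaries.

```python
import itertools


def build_prompt(task: str, filler_sentences: list[str],
                 context_words: int, position: float) -> str:
    """Embed `task` at relative depth `position` inside repeated filler.

    Hypothetical helper: `position` is 0.0 for the start of the context and
    1.0 for the end. Length is a word budget, used as a rough stand-in for
    a token budget (an assumption of this sketch).
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0.0, 1.0]")
    usable = [s for s in filler_sentences if s.split()]
    if not usable:
        raise ValueError("need at least one non-empty filler sentence")

    # Cycle through the same filler pool until the word budget is reached,
    # so filler content is identical for every position at a given length.
    filler: list[str] = []
    words = 0
    for sentence in itertools.cycle(usable):
        n = len(sentence.split())
        if words + n > context_words:
            break
        filler.append(sentence)
        words += n

    # Insert the task at a sentence boundary nearest the requested depth.
    cut = round(position * len(filler))
    return " ".join(filler[:cut] + [task] + filler[cut:])


def position_grid(task: str, filler_sentences: list[str],
                  lengths: list[int], positions: list[float]) -> dict:
    """Build one prompt per (context length, position) pair."""
    return {
        (n, p): build_prompt(task, filler_sentences, n, p)
        for n in lengths
        for p in positions
    }
```

Because the filler pool and word budget are fixed while only `position` varies, any accuracy difference between cells of the same length can be attributed to placement rather than to a change in surrounding content. A real harness would likely measure length with the evaluated model's own tokenizer.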