Can LLMs Adhere to Strict 2D Spatial Constraints? (Testing with Sokoban)

I recently ran a benchmark to test how well modern Large Language Models (LLMs) handle spatial geometry and logical reasoning under zero-shot conditions. To rule out lucky guessing, I used a custom Sokoban (box-pushing) map with strict output-format constraints: no Chain-of-Thought, only raw directional outputs. The results showed a wide gap between top-tier closed-source models and the rest of the field.
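
To make the "raw directional outputs only" constraint concrete, here is a minimal sketch of how a reply could be checked. It is not the harness used for this benchmark: the tiny level, the U/D/L/R move alphabet, and the function names `is_raw_move_string` and `solves` are all assumptions made for illustration. The level uses the common Sokoban text notation (`#` wall, `@` player, `$` box, `.` goal).

```python
import re

# Hypothetical level, NOT the map used in the benchmark.
# '#' wall, '@' player, '$' box, '.' goal, ' ' floor.
LEVEL = [
    "#####",
    "#@$.#",
    "#####",
]

# Assumed move alphabet: (row delta, column delta) per direction.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def is_raw_move_string(reply):
    """Accept only a non-empty string of U/D/L/R with no other text."""
    return re.fullmatch(r"[UDLR]+", reply) is not None


def solves(level, reply):
    """True if the reply is well-formed, every move is legal,
    and every box ends on a goal square."""
    if not is_raw_move_string(reply):
        return False  # any explanation or reasoning text fails the format check

    walls, boxes, goals = set(), set(), set()
    player = None
    for r, row in enumerate(level):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == "@":
                player = (r, c)
            elif ch == "$":
                boxes.add((r, c))
            elif ch == ".":
                goals.add((r, c))

    for move in reply:
        dr, dc = MOVES[move]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return False  # walked into a wall
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return False  # box is blocked and cannot be pushed
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt

    return boxes == goals
```

With this sketch, `solves(LEVEL, "R")` accepts the single push that solves the toy level, while a reply such as `"Move right: R"` is rejected before any simulation because it contains text other than directions. Treating an illegal move as an outright failure is one possible scoring choice; a more lenient harness could ignore blocked moves instead.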