One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage.

I've seen systems score well internally and then immediately fail under:

- ambiguous user intent
- messy real-world context
- contradictory instructions
- long-running sessions

Feels like evaluation still heavily rewards clean-task optimization instead of behavioral robustness.

What are people using beyond standard eval pipelines?

submitted by /u/Bladerunner_7_