GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

arXiv:2605.27866v1 Announce Type: new Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations spanning five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations.