Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

arXiv:2605.26654v1 Announce Type: cross Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP) and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact.