Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

arXiv:2505.11788v2 Announce Type: replace-cross To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token.
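
To make the draft-and-verify loop and its uplink cost concrete, here is a minimal sketch of one verification step. The abstract does not state the acceptance rule, so the sketch assumes the standard speculative-sampling test (accept the draft token with probability min(1, q/p), otherwise resample from the renormalized residual max(q - p, 0)), which is one reason the LLM would need the SLM's full distribution. The function names, the 32,000-token vocabulary, and the 2 bytes per probability are illustrative assumptions, not values from the paper.

```python
import random


def verify_draft_token(draft_token, slm_probs, llm_probs, rng):
    """Accept or correct one draft token.

    Assumed rule (standard speculative sampling, not confirmed by the source):
    accept with probability min(1, q/p), else resample from max(q - p, 0).
    Returns (token, accepted).
    """
    p = slm_probs[draft_token]  # SLM probability of the draft token
    q = llm_probs[draft_token]  # LLM probability of the same token
    if p > 0 and rng.random() < min(1.0, q / p):
        return draft_token, True

    # Rejected: the correction needs the whole SLM distribution,
    # which is why the full vocabulary vector is uploaded per token.
    residual = [max(qi - pi, 0.0) for pi, qi in zip(slm_probs, llm_probs)]
    weights = residual if sum(residual) > 0.0 else llm_probs
    corrected = rng.choices(range(len(llm_probs)), weights=weights, k=1)[0]
    return corrected, False


def uplink_bytes_per_token(vocab_size, bytes_per_prob=2):
    """Payload size when the full vocabulary distribution is uploaded."""
    return vocab_size * bytes_per_prob


if __name__ == "__main__":
    rng = random.Random(0)
    print(verify_draft_token(0, [0.9, 0.1], [0.0, 1.0], rng))
    print(uplink_bytes_per_token(32000), "bytes per token (assumed sizes)")
```

Under these assumed sizes, every draft token costs 64,000 bytes of uplink before any compression or skipping, which is the overhead the abstract refers to.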