BigMac: Breaking the Pareto Frontier of Compute and Memory in Multimodal LLM Training

Training multimodal large language models (MLLMs) is challenged by both model and data heterogeneity. Existing systems redesign the training pipeline to address these challenges, but remain bound by a Pareto frontier between compute and memory efficiency, improving one only at the expense of the other. The core idea of BigMac is to nest the encoder and generator computation into the original LLM pipeline, forming a depende