Demystifying Pipeline Parallelism: First Theory for PipeDream

arXiv cs.LG

Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. Our first contribution is theoretical: we introduce Rando