Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes

I've been working on MoE inference and wrote a fused dispatch kernel entirely in Triton, no CUDA. At inference batch sizes (up to 512 tokens) it reaches 89-131% of Megablocks(Stanford's CUDA-optimized MoE lib), and the same kernel runs on AMD MI300X with no changes. Mixtral-8x7B on A100. The biggest win was fusing the gate+up projections so the SwiGLU intermediate never leaves registers, cutting 35% of global memory traffic. Fewer kernellaunches (5 vs 24+) helped but mattered less.