Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

arXiv:2606.00746v1 Announce Type: cross Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pre