Inverse Depth Scaling From Most Layers Being Similar

arXiv:2602.05970v2 Announce Type: replace-cross

Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, which calls for detailed study. Here, we quantify how depth affects loss by analyzing LLMs and toy residual networks. We find that loss scales inversely with depth in LLMs, probably because functionally similar layers reduce error through ensemble averaging rather than through compositional learning or the discretization of smooth dynamics.
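The ensemble-averaging intuition can be illustrated with a minimal sketch, which is not the toy residual network analyzed in the paper. It assumes each "layer" contributes an independent, zero-mean, unit-variance noisy estimate of the same target, and that the output is the plain average of those estimates. Under those assumptions the mean squared error of the average falls as 1/L in the number of layers L. The function name `ensemble_mse` and all parameter values are hypothetical choices made for illustration.

```python
import random


def ensemble_mse(num_layers, trials=20000, noise_std=1.0, seed=0):
    """Estimate the MSE of averaging `num_layers` independent noisy estimates.

    Assumption (not from the paper): the target is 0 and each layer's
    estimate is the target plus independent Gaussian noise.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Average the independent per-layer estimates of the target (0).
        avg = sum(rng.gauss(0.0, noise_std) for _ in range(num_layers)) / num_layers
        total += avg * avg  # squared error against the target 0
    return total / trials


depths = [1, 2, 4, 8, 16]
mse_by_depth = {L: ensemble_mse(L) for L in depths}

for L in depths:
    # If MSE scales as 1/L, then MSE * L stays close to noise_std**2.
    print(f"L={L:2d}  mse={mse_by_depth[L]:.4f}  mse*L={mse_by_depth[L] * L:.3f}")
```

In this sketch the product `mse * L` stays roughly constant across depths, which is the signature of inverse scaling. The independence assumption does the work here: strongly correlated per-layer errors would average out more slowly.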