Dissecting ThunderKittens, anatomy of a compact DSL for high-performance AI kernels

Introduction

Modern ML workloads depend heavily on custom GPU kernels. Even when a model is expressed as clean tensor operations, the performance almost a...