Dissecting ThunderKittens, anatomy of a compact DSL for high-performance AI kernels
Lobste.rs AI
Introduction

Modern ML workloads depend heavily on custom GPU kernels. Even when a model is expressed as clean tensor operations, the performance almost a...