FeatherOps: Fast fp8 matmul on RDNA3 without native fp8, now supports more models

The kernel itself has not changed much since March; most of my work has gone into ComfyUI integration. The models tested so far are Anima, LTX 2.3, Qwen-Image, and Wan, and other models may also work out of the box. Some workloads may see a 30 to 50% speedup, but your mileage may vary.

submitted by /u/woct0rdho