DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

arXiv:2511.15503v2

High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by exploiting the large memory bandwidth available at the PIM cores. However, the two sides need conflicting data layouts: the Host processor needs consecutive elements distributed across DRAM banks, while each PIM core needs consecutive elements placed within its local bank.
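
The layout conflict described above can be illustrated with a toy address mapping. This is a minimal sketch, not the paper's method: the bank count, the per-bank capacity, the element-granularity interleaving, and the function names `host_layout`, `pim_layout`, and `banks_touched` are all assumptions chosen for illustration.

```python
# Toy model of two ways to place a 1-D array in banked DRAM.
# NUM_BANKS and ELEMS_PER_BANK are hypothetical values, not from the paper.
NUM_BANKS = 4
ELEMS_PER_BANK = 4


def host_layout(i):
    """Host-friendly placement: consecutive elements are striped across banks.

    Returns (bank, offset_within_bank) for logical element index i.
    """
    return (i % NUM_BANKS, i // NUM_BANKS)


def pim_layout(i):
    """PIM-friendly placement: consecutive elements stay inside one bank.

    Returns (bank, offset_within_bank) for logical element index i.
    """
    return (i // ELEMS_PER_BANK, i % ELEMS_PER_BANK)


def banks_touched(layout, start, count):
    """Set of banks accessed when reading `count` consecutive elements."""
    return {layout(i)[0] for i in range(start, start + count)}


if __name__ == "__main__":
    # Reading four consecutive elements under each placement.
    print("host layout banks:", sorted(banks_touched(host_layout, 0, 4)))
    print("pim layout banks: ", sorted(banks_touched(pim_layout, 0, 4)))
```

Under these assumptions, a run of four consecutive elements spans every bank in the Host-friendly placement, which suits a Host that draws bandwidth from many banks at once, but sits in a single bank in the PIM-friendly placement, which suits a PIM core that can only reach its local bank.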