Using Gemma 4 E4B with the LiteRT engine - ~2.4x speedup over Q4 GGUF in text generation, image processing roughly the same

r/LocalLLaMA
Generative AI Computer Vision Open Source AI

I know there is a PR in llama.cpp to MTP for the 26b and 31b versions of Gemma 4, but as far as I can tell there is nothing yet for the E2B and E4B models. Using Hermes Agent, I had it set up Gemma 4 E4B in Google's Lite RT format, and then write a Python wrapper around it to create an OpenAI compatible endpoint, and ran some speed tests, comparing the LiteRT model with the Unsloth/AtomicChat Q4M quant of E4B. The tests were conducted by giving each model identical prompts and measuring the output speed.