High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]
r/MachineLearning
Recently fine-tuned a Gemma 4 26B model, and I’m seeing surprisingly high end-to-end latency despite the effective inference footprint being much smaller (~4B-ish behavior during serving).

Current setup:

Model: Gemma 4 26B (fine-tuned)
Engine: vLLM
Quantization: FP8
Hardware: H100

Observed latency:

TTFT: ~100-300 ms
E2E latency: ~3-5 seconds

The TTFT seems reasonable, but the overall generation latency feels disproportionately high for the effective serving size.

I already experimented with vLLM’s n-gram speculative decoding, but honestly didn’t see meaningful gains.
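
For a rough sanity check, the gap between TTFT and E2E can be turned into an implied per-token decode time. This is only a minimal sketch, assuming E2E is roughly TTFT + (output_tokens - 1) × time-per-output-token for a single, non-batched request. The function name `implied_tpot_ms` and the 256-token output length are hypothetical and not taken from the measurements above; the real output length should be substituted.

```python
def implied_tpot_ms(ttft_ms: float, e2e_ms: float, output_tokens: int) -> float:
    """Estimate time per output token (ms) from TTFT and end-to-end latency.

    Assumes E2E ~= TTFT + (output_tokens - 1) * TPOT, i.e. the first token
    arrives at TTFT and every later token costs one decode step.
    """
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to estimate TPOT")
    return (e2e_ms - ttft_ms) / (output_tokens - 1)


if __name__ == "__main__":
    # Midpoints of the observed ranges; output length is an ASSUMED value.
    ttft_ms = 200.0
    e2e_ms = 4000.0
    assumed_output_tokens = 256  # hypothetical, replace with the real count

    tpot = implied_tpot_ms(ttft_ms, e2e_ms, assumed_output_tokens)
    print(f"implied TPOT: {tpot:.2f} ms/token")
    print(f"implied decode speed: {1000.0 / tpot:.1f} tokens/s")
```

If the implied per-token time turns out to be in a normal range for the hardware, the E2E figure may mostly reflect output length rather than a decode-side problem, which would also be consistent with speculative decoding not helping much.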