Gemma 4 E2B quality degrades after ~30-40 continuous inferences on 4 GB VRAM?

r/LocalLLaMA

Running Gemma E2B via llama-server for continuous background tasks on a GTX 1650 (4 GB). It works great initially, but after maybe 30-40 calls the outputs get noticeably worse: shorter responses, missing fields in the JSON output, sometimes just empty. Restarting llama-server fixes it immediately.

Using: flash-attn on, single slot, 6144 context, ngl 15.

Anyone seen this? Is this a K cache thing or just VRAM fragmentation over time? Is there a way to handle it without restarting the whole server?

submitted by /u/Top_Speaker_7785
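To make the symptoms concrete, here is a rough sketch of how the degradation described above (empty output, missing JSON fields) could be detected automatically, with a simple policy for deciding when to recycle the server. It does not explain the root cause and is only a stopgap. The names `is_degraded`, `RestartPolicy`, the field names, and the thresholds are all hypothetical; the thresholds are set below the ~30-40 call range mentioned in the post and are not tuned values.

```python
import json
from dataclasses import dataclass


def is_degraded(response_text, required_fields):
    """Return True if the output is empty, not a JSON object, or missing fields."""
    if not response_text or not response_text.strip():
        return True
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return True
    if not isinstance(data, dict):
        return True
    return any(field not in data for field in required_fields)


@dataclass
class RestartPolicy:
    """Hypothetical policy: recycle after a call budget or repeated bad outputs."""
    max_calls: int = 25          # assumption: stay under the ~30-40 calls from the post
    max_bad_in_a_row: int = 2    # assumption: two bad outputs in a row means trouble
    calls: int = 0
    bad_in_a_row: int = 0

    def record(self, degraded):
        """Record one inference result; return True when a restart is due."""
        self.calls += 1
        self.bad_in_a_row = self.bad_in_a_row + 1 if degraded else 0
        return (self.calls >= self.max_calls
                or self.bad_in_a_row >= self.max_bad_in_a_row)

    def reset(self):
        """Call this after the server has been restarted."""
        self.calls = 0
        self.bad_in_a_row = 0
```

In use, the background task would call `policy.record(is_degraded(text, ["title", "summary"]))` after each response (field names made up for illustration) and, when it returns True, restart the llama-server process and call `policy.reset()`. How the process is restarted depends on how it was launched, so that part is left out.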