llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

r/LocalLLaMA

Overview: continuing earlier work, this PR reserves logits space only for `n_seqs` when possible. With `-ub 2048` and MTP, this saves another 1.2 GB of VRAM for me. I've also tested llama-perplexity and it seems to work fine. There may be a better API, though, so I'm putting this up as a draft for now. In my view, an API in llama-context is a good solution for this: by default it reserves space for all tokens, but in server-context specifically we can set it to 1 whenever possible. - u/am17an

submitted by /u/pmttyji
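
To get a feel for why limiting the reserved outputs can matter, here is a rough back-of-the-envelope sketch. It is not code from the PR: `logits_bytes` is a hypothetical helper, and the vocabulary size (151,936), float32 logits, and a single sequence are assumptions chosen purely for illustration. Only the micro-batch size of 2048 comes from the post (`-ub 2048`).

```cpp
#include <cstdint>

// Hypothetical helper (not part of llama.cpp): bytes needed to hold
// float32 logits for n_outputs rows, each with n_vocab entries.
constexpr std::uint64_t logits_bytes(std::uint64_t n_outputs, std::uint64_t n_vocab) {
    return n_outputs * n_vocab * sizeof(float);
}

// Assumed values for illustration only.
constexpr std::uint64_t N_VOCAB  = 151936; // assumed vocabulary size
constexpr std::uint64_t N_UBATCH = 2048;   // from `-ub 2048` in the post
constexpr std::uint64_t N_SEQS   = 1;      // assumed single sequence

// Reserving one logits row per token in the micro-batch.
constexpr std::uint64_t FULL_BYTES = logits_bytes(N_UBATCH, N_VOCAB);

// Reserving one logits row per sequence instead.
constexpr std::uint64_t LIMITED_BYTES = logits_bytes(N_SEQS, N_VOCAB);

// Difference between the two reservations.
constexpr std::uint64_t SAVED_BYTES = FULL_BYTES - LIMITED_BYTES;
```

Under these assumptions the full reservation comes to about 1.24 GB and the per-sequence reservation to well under 1 MB, which is in the same ballpark as the saving reported above. The real figure depends on the model's vocabulary, the logits type, and how MTP allocates its own outputs.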