Latest b9274 Addresses MTP VRAM leak

r/LocalLLaMA
AI Research

b9274

I have been having an issue with MTP models unloading after a couple of minutes of use, and I can't figure out why. Anyway, I don't think this release is relevant to that, but I did observe the VRAM creep, so hopefully this helps.

server: free draft/MTP resources on sleep to fix VRAM leak

The destroy function in server_context_impl only cleaned up the main model and context (via llama_init.reset) but did not free the speculative decoder (spec), draft context (ctx_dft), or draft model (model_dft
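
To make the leak pattern described in the commit message easier to picture, here is a minimal C++ sketch. It is not the llama.cpp code. `FakeHandle`, `ServerContextSketch`, `vram_left_after_cycles`, and the MiB sizes are all hypothetical stand-ins; only the member names (`llama_init`, `spec`, `ctx_dft`, `model_dft`) come from the commit message. The release order in the fixed version is also an assumption. The sketch only counts "allocated" MiB, so it runs without a GPU.

```cpp
#include <cassert>

// Hypothetical VRAM accountant (in MiB) so the leak is observable without a GPU.
static int g_vram_mib = 0;

// Hypothetical stand-in for a raw resource handle (model, context, decoder).
struct FakeHandle {
    int mib = 0; // 0 means "nothing allocated"

    // Behaves like overwriting a raw pointer: a previous allocation
    // that was never released stays counted, i.e. it leaks.
    void alloc(int size_mib) { mib = size_mib; g_vram_mib += size_mib; }

    // Returns the tracked allocation; safe to call on an empty handle.
    void release() { g_vram_mib -= mib; mib = 0; }
};

// Simplified stand-in for the server context. Sizes are arbitrary.
struct ServerContextSketch {
    FakeHandle llama_init; // main model + context
    FakeHandle model_dft;  // draft model
    FakeHandle ctx_dft;    // draft context
    FakeHandle spec;       // speculative decoder

    void load() {
        llama_init.alloc(8000);
        model_dft.alloc(1000);
        ctx_dft.alloc(200);
        spec.alloc(50);
    }

    // Behaviour described as the bug: only the main model/context is freed.
    void destroy_leaky() { llama_init.release(); }

    // Behaviour described as the fix: draft/MTP resources are freed as well.
    // Releasing dependents first is an assumption of this sketch.
    void destroy_fixed() {
        spec.release();
        ctx_dft.release();
        model_dft.release();
        llama_init.release();
    }
};

// Simulates repeated sleep/wake cycles and returns the MiB still counted
// as allocated afterwards, relative to where the counter started.
int vram_left_after_cycles(int cycles, bool use_fix) {
    const int baseline = g_vram_mib;
    ServerContextSketch server;
    for (int i = 0; i < cycles; ++i) {
        server.load();
        if (use_fix) {
            server.destroy_fixed();
        } else {
            server.destroy_leaky();
        }
    }
    return g_vram_mib - baseline;
}

// Small self-check: the fixed path returns to baseline, the leaky path creeps.
void check_sketch() {
    assert(vram_left_after_cycles(3, true) == 0);
    assert(vram_left_after_cycles(3, false) == 3 * (1000 + 200 + 50));
}
```

In this toy model, every sleep/wake cycle on the leaky path leaves the draft model, draft context, and speculative decoder allocations behind, which matches the gradual VRAM creep I was seeing. Calling `check_sketch()` from a `main` exercises both paths.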