Nvidia H100 (94GB VRAM) - should I run llama.cpp or vLLM for inference with 30 users?
r/LocalLLaMA
I was given the great opportunity to borrow an H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system RAM I will get, but I guess they are a bit flexible on this.)

- I want to build an inference endpoint that can handle up to 30 users.
- I want a fairly large context, say 131,072-262,144 tokens.
- Realistically, I think no more than 10-15 users will use it concurrently in most situations.
- The main use will be tools like Pi and OpenCode.
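For a rough sense of whether those context and concurrency targets fit in 94GB, here is a back-of-the-envelope KV cache estimate. It is only a sketch: the model shape (48 layers, 4 KV heads, head dimension 128), the 20 GiB of weights, the 4 GiB overhead, and treating 94GB as 94 GiB are all made-up assumptions for illustration, not figures for any specific model or server.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (fp16/bf16 by default)."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_full_context_sequences(vram_gib, weights_gib, context_len,
                               per_token_bytes, overhead_gib=4.0):
    """Worst case: how many sequences fit if each one fills the whole context."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * 1024**3
    if free_bytes <= 0:
        return 0
    return int(free_bytes // (context_len * per_token_bytes))


# Hypothetical model shape and memory budget (assumptions, not measurements).
per_token = kv_cache_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128)

for ctx in (131_072, 262_144):
    n = max_full_context_sequences(vram_gib=94, weights_gib=20,
                                   context_len=ctx, per_token_bytes=per_token)
    gib_per_seq = ctx * per_token / 1024**3
    print(f"context {ctx}: {gib_per_seq:.0f} GiB per full sequence, fits {n}")
```

This is a worst-case bound where every sequence fills the entire window. Real sessions may use far less, and the numbers change a lot with the actual model architecture and any KV cache quantization, so plug in the real values before drawing conclusions.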