Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

r/LocalLLaMA
Generative AI AI Hardware Open Source AI AI Tools

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with: docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g \ - ~/.cache/huggingface:/root/.cache/huggingface \ -e VLLM_ATTENTION_BACKEND=FLASHINFER \ -e FLASHINFER_DISABLE_VERSION_CHECK=1 \ -e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \ vllm/vllm-openai:cu130-nightly \ --model Qwen/Qwen3.6-35B-A3B-FP8 \ --served-model-name qwen36 \ --host 0.0.0.0 \ --port 8000 \ --tensor-parallel-size 1 \ --gpu-memory-utilization 0.75.