I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python
r/LocalLLaMA
I ported NVIDIA's Parakeet speech-to-text models to pure C++/ggml (the engine behind llama.cpp and whisper.cpp). It runs the FastConformer TDT / CTC / RNNT / hybrid models with no Python and no PyTorch, on CPU and GPU (CUDA, HIP, Vulkan, Metal). The goal was to match NeMo exactly, then make it deployable anywhere.

Where it landed:

Output is byte-for-byte identical to NeMo (WER 0 on the f32/f16 path).

Faster than NeMo's own PyTorch runtime: up to ~5x on the larger TDT/hybrid models on GPU, up to ~1.86x on CPU when quantized, and about 2x less memory.
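For anyone who wants to sanity-check the "WER 0" claim on their own audio, here is a minimal sketch of a word error rate calculation in C++. It is not the evaluation harness used for the numbers above: `split_words` and `word_error_rate` are hypothetical helper names, splitting on whitespace is an assumption, and the sketch does none of the text normalization (casing, punctuation) that ASR benchmarks usually apply.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a transcript into words on whitespace.
static std::vector<std::string> split_words(const std::string &text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// WER = word-level edit distance (substitutions + deletions + insertions)
// divided by the number of words in the reference transcript.
// Convention assumed here: an empty reference gives 0.0 if the hypothesis
// is also empty, otherwise 1.0.
static double word_error_rate(const std::string &reference,
                              const std::string &hypothesis) {
    const std::vector<std::string> ref = split_words(reference);
    const std::vector<std::string> hyp = split_words(hypothesis);
    if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;

    // Two-row Levenshtein table over words instead of characters.
    std::vector<std::size_t> prev(hyp.size() + 1), cur(hyp.size() + 1);
    for (std::size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= ref.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= hyp.size(); ++j) {
            const std::size_t substitution =
                prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            const std::size_t deletion = prev[j] + 1;
            const std::size_t insertion = cur[j - 1] + 1;
            cur[j] = std::min({substitution, deletion, insertion});
        }
        std::swap(prev, cur);  // after the swap, prev holds the latest row
    }
    return static_cast<double>(prev[hyp.size()]) /
           static_cast<double>(ref.size());
}
```

Feeding it the NeMo transcript as the reference and the ggml transcript as the hypothesis should return 0.0 when the two match word for word. Note that WER 0 is the weaker of the two claims: this check ignores whitespace differences, while byte-for-byte identity does not.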