noisekit - CLI for generating realistic degraded speech datasets for ASR benchmarking [P]

r/MachineLearning

If you've ever tried to pick an STT vendor for a voice agent or call center product, you've probably hit this wall: you have plenty of real production audio, but it's unlabeled, so you can't compute WER on it. And the annotated public datasets (FLEURS, Common Voice, LibriSpeech) are clean read-speech recordings that tell you little about how STT models actually handle your noisy, G.711-encoded calls. Annotating production audio is slow, expensive, and usually a privacy headache.
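For anyone newer to ASR evaluation, here is a rough sketch of why missing labels are the blocker: WER is the word-level edit distance between a reference transcript and the model output, divided by the reference length, so without a reference there is nothing to divide by. The function name `wer` and the whitespace tokenization are my own assumptions for illustration, not part of noisekit; real evaluations usually also normalize casing and punctuation first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Illustrative helper, not part of noisekit. Tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

The `reference` argument is exactly what unlabeled production audio doesn't give you.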