Of the roughly 7,000 languages spoken worldwide, only a tiny fraction are supported by modern speech models. NVIDIA is tackling that gap with Granary, a massive open dataset, and two models, Canary-1b-v2 and Parakeet-tdt-0.6b-v3, designed to make high-quality speech recognition and translation practical across 25 European languages, including lower-resource tongues such as Croatian, Estonian and Maltese. The package targets production use cases: multilingual chatbots, customer service voice agents and near-real-time translation services.
Granary aggregates about a million hours of audio, with nearly 650,000 hours earmarked for speech recognition and over 350,000 hours for speech translation. NVIDIA built the corpus in collaboration with researchers from Carnegie Mellon University and Fondazione Bruno Kessler, using the NeMo Speech Data Processor to convert unlabeled audio into structured training data. That pipeline reduces reliance on human annotation and filters synthetic or low-quality samples through NeMo Curator, leaving cleaner, ready-to-use examples. The dataset and processing tools are open source on GitHub and available on Hugging Face, enabling other teams to reproduce the workflow or extend it to new languages.
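Because the corpus is published on Hugging Face, it can be explored with the standard `datasets` library. The sketch below streams a small slice rather than downloading the full corpus; the repository ID, per-language config name and field names are assumptions for illustration, so check the Granary dataset card for the exact identifiers and splits.

```python
# Minimal sketch: stream a few Granary examples from Hugging Face.
# Repo ID, config name and column names are assumed, not confirmed.
from datasets import load_dataset

granary = load_dataset(
    "nvidia/Granary",   # assumed repo ID; verify on the dataset card
    "hr",               # assumed per-language config (e.g. Croatian)
    split="train",
    streaming=True,     # avoid downloading ~1M hours of audio up front
)

for example in granary.take(3):
    # Inspect the schema; ASR-style datasets typically expose audio + text fields.
    print(example.keys())
```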
The two released models illustrate different production tradeoffs. Canary-1b-v2 is a billion-parameter model optimized for accuracy and supports transcription plus translation between English and two dozen languages; it tops Hugging Face's leaderboard for multilingual speech recognition accuracy and is released under a permissive license (CC BY 4.0). Parakeet-tdt-0.6b-v3 is a 600-million-parameter model tuned for throughput and low latency; it can transcribe long segments in a single inference pass, automatically detect the input language and produce punctuation, capitalization and word-level timestamps. NVIDIA reports that Canary delivers accuracy similar to models three times its size while running inference up to ten times faster, and that Granary requires about half as much training data to reach target accuracy compared with other popular datasets.
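Both checkpoints are distributed through the NeMo toolkit's standard `from_pretrained` interface. The sketch below shows how the two models might be loaded and run; the `transcribe()` keyword arguments and the audio filename are assumptions (they vary between NeMo releases), so treat this as illustrative and consult each model card for the supported options.

```python
# Minimal sketch, assuming both models load through NeMo's ASRModel interface.
import nemo.collections.asr as nemo_asr

# Parakeet: throughput-oriented transcription with punctuation, capitalization
# and timestamps on long audio segments.
parakeet = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
transcripts = parakeet.transcribe(["meeting_recording.wav"])  # hypothetical file
print(transcripts[0])

# Canary: accuracy-oriented transcription and translation between English and
# the other supported languages. The language keywords below are assumptions;
# the model card documents the exact task configuration.
canary = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-1b-v2")
translations = canary.transcribe(
    ["meeting_recording.wav"],
    source_lang="hr",   # assumed: transcribe Croatian speech...
    target_lang="en",   # ...and translate it into English
)
print(translations[0])
```

The tradeoff in practice: Parakeet-tdt-0.6b-v3 is the natural choice when latency and batch throughput dominate, while Canary-1b-v2 is suited to workloads where translation quality and accuracy on lower-resource languages matter most.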
The research paper behind Granary will be presented at Interspeech in the Netherlands, Aug. 17-21. Both the dataset and the models are available now on Hugging Face, with code and documentation provided on GitHub. By sharing data, models and tooling, NVIDIA aims to lower the barrier for developers building more inclusive, multilingual speech technology across Europe and beyond.