BharatVox: Open-Source Speech Corpus for Indian Languages Including Odia

@openodia
speech, datasets, models

What happened

The odisha-ml organization launched BharatVox, a community-driven, open-source speech corpus for Indian languages. Odia was a first-class language in the dataset from day one, alongside Hindi, Bengali, and others.

Why it matters

Speech technology is one of the most impactful AI applications for Indian languages. Voice interfaces can serve populations with limited literacy, and speech-to-text enables content creation in Odia. BharatVox provided the training data needed to build these systems.

Community effort

BharatVox was built by volunteers who recorded and validated speech samples. Because of this community-driven approach, the corpus reflected real, diverse Odia speech rather than only studio-quality recordings from a handful of speakers.
