Google DeepMind’s India unit is taking on an ambitious challenge with a project called Morni, short for Multimodal Representation for India. It aims to develop an AI model that understands and represents 125 Indian languages and dialects, so that the diverse languages spoken across India are included in the digital world and their speakers can be heard.
India has a staggering number of languages. Only 22 of them are officially recognized, but well over 100 are in everyday use. Google DeepMind’s team found that around 60 Indian languages each have over a million speakers, and more than 125 languages each have over 100,000 speakers.
The challenge, however, is that many of these languages, especially the lesser-known ones, don’t have much digital presence. For example, Hindi, spoken by nearly 10 percent of the world’s population, makes up only 0.1 percent of the content on the internet. Even more concerning is that 73 of these 125 languages have no digital data available.
To address this, Google DeepMind has launched Project Vaani in collaboration with the Indian Institute of Science (IISc) and ARTPARK (Artificial Intelligence & Robotics Technology Park). The project gathers speech data from across the country and releases it as open-source material. In its first phase, Project Vaani collected over 14,000 hours of speech data across 58 languages from 80,000 people in 80 districts.
Project Vaani was first announced in December 2022. Its overall goal is to collect and transcribe 154,000 hours of speech data from all 773 districts in India. The project is now in its second phase, which extends the collection to 160 districts across all states. This massive data-gathering effort is crucial for developing AI that reflects India’s linguistic diversity.
Google DeepMind’s work on Project Morni and Project Vaani isn’t just about technology. It is about making sure that every language, no matter how small, has a place in the digital age. By covering such a wide range of languages, the projects help preserve India’s rich linguistic heritage while making technology more accessible to the people who speak these languages every day. This work is a significant step toward a more inclusive digital world.