Training AI models is an expensive affair for one simple reason: collecting and generating data is usually costly. An Indian startup, though, has cracked the code to generate data for Big Tech companies such as Google, Meta, and Instagram. The fun part? The startup was founded by a 27-year-old.
Karya, founded in 2021, ahead of the ChatGPT wave, has drawn the attention of tech giants who are hungry for data. In India, the number of data annotation workers is projected to reach nearly one million by 2030, according to Nasscom, the country’s tech industry trade association.
Karya distinguishes itself by paying its contractors, predominantly women in rural areas, up to 20 times the minimum wage, aiming to produce high-quality Indian-language data sought after by tech companies.
“Big tech companies spend billions on collecting training data for their AI and machine learning models,” says Manu Chopra, Karya’s 27-year-old founder and a Stanford-educated computer engineer. “Low pay for such work is an industry failure.”
Several leading tech companies are collaborating with Karya to address one of the critical challenges in AI development: acquiring high-quality data to serve non-English speaking users. These partnerships signal a potential shift in the data industry’s economics and Silicon Valley’s relationship with data providers.
Microsoft has employed Karya to source local speech data for its AI products. The Bill & Melinda Gates Foundation works with Karya to reduce gender biases in the data that feeds language models, and Google relies on Karya and other local partners to gather speech data across 85 Indian districts.
In India alone, nearly one billion potential users are eager to access AI-powered solutions across various sectors, from healthcare to education and finance.
“India is the first non-Western country we are doing this in, and we are testing Bard in nine Indian languages,” notes Manish Gupta, head of Google Research in India, referring to the company’s AI chatbot. “Over 70 Indian languages spoken by over a million people each had zero digital corpus. The problem is so stark.”
In India, over 32,000 crowdsourced workers have completed 40 million paid digital tasks, such as image recognition, contour alignments, video annotation, and speech annotation. Chopra seeks not only to improve data supply but also to combat poverty. Having experienced poverty in his early years, he is dedicated to leveraging technology to address it.
Karya also works with over 30,000 educated young women to create “gender-intentional” datasets for the Bill & Melinda Gates Foundation. This extensive effort aims to reduce gender-related biases in large language models, setting a crucial milestone for Indian languages.
Karya’s impact is not limited to India. The company is in talks to offer its platform as a service to organizations in Africa and South America for similar data collection efforts.