Google and Microsoft Invest in Stanford Alum to Utilize AI for a Billion People
Preethi P., residing in a small village called Agara, located three hours southwest of Bangalore, occupies her one-room dwelling on a peaceful street surrounded by rice paddies and groundnut fields. Seated on a stool beside a sewing machine, she typically spends her time repairing or sewing garments, earning less than $1 per day for her efforts. However, today is different as she is engrossed in reading sentences in her native Kannada language into a mobile application. Pausing momentarily, she proceeds to read another sentence.
Preethi, who goes by one name, as is common in the region, is among 70 workers hired by a startup called Karya in Agara and neighboring villages to collect text, audio and image data in Indian vernacular languages. She’s part of a massive, unprecedented global workforce, operating in places like India, Kenya and the Philippines, that collects and tags the data AI chatbots and virtual assistants rely on to generate relevant responses. However, unlike many other data contractors, Preethi is well paid for her efforts, at least by local standards.
In three days of work with Karya, Preethi earned 4,500 rupees ($54), more than four times what the 22-year-old high school graduate usually earns as a tailor in an entire month. The money, she says, is enough to pay one month’s installment on a loan taken out to partially repair the crumbling mud walls of her home, which are carefully patched with colorful sarees. “All I need is a phone and internet.”
Karya was founded in 2021, before the rise of ChatGPT, but this year’s frenzy around generative AI has only fueled tech companies’ insatiable demand for data. India alone is expected to have nearly a million data entry workers by 2030, according to Nasscom, the country’s technology industry trade body. Karya differentiates itself from other data providers by offering its contractors, mostly women and mostly from rural communities, up to 20 times the prevailing minimum wage, and by promising to produce higher-quality Indian data that tech companies will pay more to acquire.
“Every year, big tech companies spend billions of dollars collecting training data for their AI and machine learning models,” Manu Chopra, the 27-year-old Stanford-educated computer engineer behind the startup, told Bloomberg in an interview. “Bad pay for that kind of work is a failure of the industry.”
If low wages are an industry failure, it is one of Silicon Valley’s own making. For years, technology companies have outsourced tasks such as data labeling and content moderation to cheaper contractors abroad. But now some of Silicon Valley’s biggest names are turning to Karya to address one of the biggest challenges for their AI products: finding high-quality data to build tools that better serve billions of potential non-English-speaking users. These partnerships could represent a powerful shift in the economics of the data industry and in Silicon Valley’s relationship with data providers.
Microsoft Corp. has used Karya to acquire local voice data for its AI products. The Bill & Melinda Gates Foundation is working with Karya to reduce gender bias in data fed into large language models, the technology that powers AI chatbots. And Alphabet Inc.’s Google relies on Karya and other local partners to collect voice data in 85 Indian districts. Google plans to expand to every district, covering the majority language or dialect spoken in each, and to build a generative AI model for 125 Indian languages.
Many AI services have been developed disproportionately with English-language internet data such as articles, books, and social media posts. As a result, these AI models poorly represent the languages of internet users in other countries, who are adopting AI-powered smartphones and apps faster than they are learning English. India alone is home to nearly a billion such potential users, and the government is calling for the adoption of AI tools in all sectors, from healthcare to education and financial services.
“India is the first non-Western country where we’re doing this, and we’re testing Bard in nine Indian languages,” said Manish Gupta, head of Google Research in India, referring to the company’s AI chatbot. “More than 70 Indian languages, each spoken by more than a million people, had no digital corpus. The problem is that serious.”
Gupta ticked off a list of issues AI companies need to address to serve India’s internet users: non-English datasets are woefully low-quality; hardly any conversational data exists in Hindi and other Indian languages; and digitized content from Indian-language books and newspapers is very limited.
When applied to South Asian languages, some large language models have been found to make up words and struggle with basic grammar. There are also concerns that these AI services may reflect a distorted view of other cultures. It’s critical that training data is broadly representative, including non-English language data, so that AI systems “don’t perpetuate harmful stereotypes, generate hate speech, or generate misinformation,” said Mehran Sahami, a professor in Stanford University’s Department of Computer Science.
Karya, a social impact startup headquartered in Bangalore and supported by grants, is able to expand the range of languages represented in part by targeting workers in rural areas who would otherwise not be contracted for such positions. Karya’s app can work without an internet connection and offers voice support for those with limited reading skills. In India, more than 32,000 crowdsourced workers have logged into the app and completed 40 million paid digital tasks such as image recognition, contour alignment, video annotation and voice annotation.
Chopra’s goal is not only to improve the supply of information, but also to fight poverty. The founder of Karya grew up in a poor area called Shakur Basti in West Delhi. He won a scholarship to study at an elite school where he was bullied because his classmates said he “smelled bad”. Chopra landed at Stanford to study computer science, but realized he hated the “how to make a billion dollars” mindset he encountered there.
After graduating in 2017, he began working on his longtime interest: using technology to fight poverty. “It only takes $1,500 in savings for an Indian to enter the middle class,” Chopra said. “But it could take the poor 200 years to reach that level of savings.”
He discovered that Microsoft had paid a hefty sum to collect voice data, albeit of poor quality, to feed its AI systems and research. For example, in 2017, while 1 million hours of digitized voice data were available in Marathi, the language spoken in Mumbai and the surrounding region of western India, only 165 hours were available for purchase. His startup has since collected 10,000 hours of Marathi speech data for Microsoft’s AI services, read by men and women from five different regions.
“Tech companies want the data, accents and all,” Chopra said. “If you cough, they want that in the speech, too. It represents natural language.” Microsoft Research India researcher Saikat Guha, who focuses on the ethics of data collection, said he has also used Karya’s content in a project to help people with visual impairments in their job search. “The quality of the information is much better than any other source I’ve used,” Guha said. “If you pay employees fairly, they are more invested in their work and the end result is better data.”
Meanwhile, more than 30,000 young, school-educated women are working with Karya to help collect “gender-based” datasets (for example, ones that do not assume the doctor or the boss is always a he) in six Indian languages for the Bill & Melinda Gates Foundation. It is the largest such effort in Indian languages and will serve as a resource for creating datasets to reduce gender bias in LLMs. Karya doesn’t stop at India. The company said it is negotiating to sell its platform as a service to organizations in Africa and South America that do similar work.
For now, the women in Yelandur, another village southwest of Bangalore, are eagerly awaiting Karya’s next project: transcribing Kannada audio recordings. Among them is Shambhavi S., 25, who earned a few thousand rupees from a previous job, working in her quiet home after feeding her in-laws dinner and putting her children to bed.
“I don’t know what artificial intelligence is, I’ve never heard of it,” said Shambhavi. “I want to earn and train my children so they can learn to use it.”