How this grassroots effort could make AI voices more diverse

15 Nov 2024, 10:24 by Melissa Heikkilä · MIT Technology Review

We are on the cusp of a voice AI boom, with tech companies such as Apple and OpenAI rolling out the next generation of artificial-intelligence-powered assistants. But the default voices for these assistants are often white American—British, if you’re lucky—and most definitely speak English. They represent only a tiny proportion of the many dialects and accents in the English language, which spans many regions and cultures. And if you’re one of the billions of people who don’t speak English, bad luck: These tools don’t sound nearly as good in other languages.

This is because the data that has gone into training these models is limited. In AI research, most data used to train models is extracted from the English-language internet, which reflects Anglo-American culture. But there is a massive grassroots effort underway to change this status quo and bring more transparency and diversity to what AI sounds like: Mozilla’s Common Voice initiative.

The data set Common Voice has created over the past seven years is one of the most useful resources for people wanting to build voice AI. It has seen a massive spike in downloads, partly thanks to the current AI boom; it recently hit the 5 million mark, up from 38,500 in 2020. Creating this data set has not been easy, mainly because the data collection relies on an army of volunteers. Their numbers have also jumped, from just under 500,000 in 2020 to over 900,000 in 2024. But by giving its data away, some members of this community argue, Mozilla is encouraging volunteers to effectively do free labor for Big Tech.

Since 2017, volunteers for the Common Voice project have collected a total of 31,000 hours of voice data in around 180 languages as diverse as Russian, Catalan, and Marathi. If you’ve used a service that uses audio AI, it’s likely been trained at least partly on Common Voice.

Mozilla’s cause is a noble one. As AI is integrated increasingly into our lives and the ways we communicate, it becomes more important that the tools we interact with sound like us. The technology could break down communication barriers and help convey information in a compelling way to, for example, people who can’t read. But instead, an intense focus on English risks entrenching a new colonial world order and wiping out languages entirely.

“It would be such an own goal if, rather than finally creating truly multimodal, multilingual, high-performance translation models and making a more multilingual world, we actually ended up forcing everybody to operate in, like, English or French,” says EM Lewis-Jong, the product lead for Common Voice.

Common Voice is open source, which means anyone can see what has gone into the data set, and users can do whatever they want with it for free. This kind of transparency is unusual in AI data governance. Most large audio data sets simply aren’t publicly available, and many consist of data that has been scraped from sites like YouTube, according to research conducted by a team from the University of Washington, and Carnegie Mellon andNorthwestern universities.

The vast majority of language data is collected by volunteers such as Bülent Özden, a researcher from Turkey. Since 2020, he has been not only donating his voice but also raising awareness around the project to get more people to donate. He recently spent two full-time months correcting data and checking for typos in Turkish. For him, improving AI models is not the only motivation to do this work.

“I’m doing it to preserve cultures, especially low-resource [languages],” Özden says. He tells me he has recently started collecting samples of Turkey’s smaller languages, such as Circassian and Zaza.

However, as I dug into the data set, I noticed that the coverage of languages and accents is very uneven. There are only 22 hours of Finnish voices from 231 people. In comparison, the data set contains 3,554 hours of English from 94,665 speakers. Some languages, such as Korean and Punjabi, are even less well represented. Even though they have tens of millions of speakers, they account for only a couple of hours of recorded data.

This imbalance has emerged because data collection efforts are started from the bottom up by language communities themselves, says Lewis-Jong.

“We’re trying to give communities what they need to create their own AI training data sets. We have a particular focus on doing this for language communities where there isn’t any data, or where maybe larger tech organizations might not be that interested in creating those data sets,” she says. She hopes that with the help of volunteers and various bits of grant funding, the Common Voice data set will have close to 200 languages by the end of the year.

Common Voice’s permissive license means that many companies rely on it—for example, the Swedish startup Mabel AI, which builds translation tools for health-care providers. One of the first languages the company used was Ukrainian; it built a translation tool to help Ukrainian refugees interact with Swedish social services, says Karolina Sjöberg, Mabel AI’s founder and CEO. The team has since expanded to other languages, such as Arabic and Russian.

The problem with a lot of other audio data is that it consists of people reading from books or texts. The result is very different from how people really speak, especially when they are distressed or in pain, Sjöberg says. Because anyone can submit sentences to Common Voice for others to read aloud, Mozilla’s data set also includes sentences that are more colloquial and feel more natural, she says.

Not that it is perfectly representative. The Mabel AI team soon found out that most voice data in the languages it needed was donated by younger men, which is fairly typical for the data set.

“The refugees that we intended to use the app with were really anything but younger men,” Sjöberg says. “So that meant that the voice data that we needed did not quite match the voice data that we had.” The team started collecting its own voice data from Ukrainian women, as well as from elderly people.

Unlike other data sets, Common Voice asks participants to share their gender and details about their accent. Making sure different genders are represented is important to fight bias in AI models, says Rebecca Ryakitimbo, a Common Voice fellow who created the project's gender action plan. More diversity leads not only to better representation but also to better models. Systems that are trained on narrow and homogenous data tend to spew stereotyped and harmful results.

“We don’t want a case where we have a chatbot that is named after a woman but does not give the same response to a woman as it would a man,” she says.

Ryakitimbo has collected voice data in Kiswahili in Tanzania, Kenya, and the Democratic Republic of Congo. She tells me she wanted to collect voices from a socioeconomically diverse set of Kiswahili speakers and has reached out to women young and old living in rural areas, who might not always be literate or even have access to devices.

This kind of data collection is challenging. The importance of collecting AI voice data can feel abstract to many people, especially if they aren’t familiar with the technologies. Ryakitimbo and volunteers would approach women in settings where they felt safe to begin with, such as presentations on menstrual hygiene, and explain how the technology could, for example, help disseminate information about menstruation. For women who did not know how to read, the team read out sentences that they would repeat for the recording.

The Common Voice project is bolstered by the belief that languages form a really important part of identity. “We think it’s not just about language, but about transmitting culture and heritage and treasuring people’s particular cultural context,” says Lewis-Jong. “There are all kinds of idioms and cultural catchphrases that just don’t translate,” she adds.

Common Voice is the only audio data set where English doesn’t dominate, says Willie Agnew, a researcher at Carnegie Mellon University who has studied audio data sets. “I’m very impressed with how well they've done that and how well they've made this data set that is actually pretty diverse,” Agnew says. “It feels like they’re way far ahead of almost all the other projects we looked at.”

I spent some time verifying the recordings of other Finnish speakers on the Common Voice platform. As their voices echoed in my study, I felt surprisingly touched. We had all gathered around the same cause: making AI data more inclusive, and making sure our culture and language was properly represented in the next generation of AI tools.

But I had some big questions about what would happen to my voice if I donated it. Once it was in the data set, I would have no control about how it might be used afterwards. The tech sector isn’t exactly known for giving people proper credit, and the data is available for anyone’s use.

“As much as we want it to benefit the local communities, there’s a possibility that also Big Tech could make use of the same data and build something that then comes out as the commercial product,” says Ryakitimbo. Though Mozilla does not share who has downloaded Common Voice, Lewis-Jong tells me Meta and Nvidia have said that they have used it.

Open access to this hard-won and rare language data is not something all minority groups want, says Harry H. Jiang, a researcher at Carnegie Mellon University, who was part of the team doing audit research. For example, Indigenous groups have raised concerns.

“Extractivism” is something that Mozilla has been thinking about a lot over the past 18 months, says Lewis-Jong. Later this year the company will pilot the Nwulite Obodo Open Data License, which was created by researchers at the University of Pretoria for sharing African data sets more equitably. For example, people who want to download the data might be asked to write a request with details on how they plan to use it, and they might be allowed to license it only for certain products or for a limited time. Users might also be asked to contribute to community projects that support poverty reduction, says Lewis-Jong.

She says the pilot is a learning exercise to explore whether people will want data with alternative licenses, and whether they are sustainable for communities managing them. The hope is that it could lead to something resembling “open source 2.0.”

In the end, I decided to donate my voice. I received a list of phrases to say, sat in front of my computer, and hit Record. One day, I hope, my effort will help a company or researcher build voice AI that sounds less generic, and more like me.