Meta’s new AI model can translate speech from more than 100 languages
by Scott J Mulligan · MIT Technology Review
Meta has released a new AI model that can translate speech from 101 different languages. It represents a step toward real-time, simultaneous interpretation, where words are translated as soon as they come out of someone’s mouth.
Typically, translation models for speech use a multistep approach. First they translate speech into text. Then they translate that text into text in another language. Finally, that translated text is turned into speech in the new language. This method can be inefficient, and at each step, errors and mistranslations can creep in. But Meta’s new model, called SeamlessM4T, enables more direct translation from speech in one language to speech in another. The model is described in a paper published today in Nature.
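A rough back-of-the-envelope sketch in Python shows why errors can pile up across the steps of a multistep pipeline. The per-step accuracy figures below are illustrative assumptions, not measurements from the paper, and the sketch assumes that each step fails independently of the others.

```python
from math import prod

def cascade_accuracy(stage_accuracies):
    """Chance that every stage succeeds, assuming independent errors."""
    return prod(stage_accuracies)

# Hypothetical per-stage accuracies (assumed for illustration only):
# speech-to-text, text-to-text translation, text-to-speech.
stages = [0.95, 0.90, 0.95]
print(f"cascaded accuracy: {cascade_accuracy(stages):.3f}")
```

Under these assumed numbers, the chained system gets everything right only about 81% of the time, even though no single step is below 90%.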
Seamless can translate text with 23% more accuracy than the top existing models. And although another model, Google’s AudioPaLM, can technically translate more languages—113 of them, versus 101 for Seamless—it can translate them only into English. SeamlessM4T can translate into 36 other languages.
The key is a process called parallel data mining, which searches crawled web data for instances where the sound in a video or audio clip matches a subtitle in another language. The model learned to associate those sounds in one language with the matching pieces of text in another. This opened up a whole new trove of translation examples for the model.
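The matching idea can be sketched in a few lines of Python. This is a toy illustration, not Meta's implementation: it assumes that audio clips and subtitles have already been converted into vectors in a shared space, and the function names, threshold, and tiny made-up vectors are all hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mine_pairs(audio_embs, text_embs, threshold=0.9):
    """For each audio clip, keep its best-matching subtitle if similar enough."""
    pairs = []
    for audio_id, a_vec in audio_embs.items():
        best_id, best_score = None, threshold
        for text_id, t_vec in text_embs.items():
            score = cosine(a_vec, t_vec)
            if score >= best_score:
                best_id, best_score = text_id, score
        if best_id is not None:
            pairs.append((audio_id, best_id))
    return pairs

# Made-up 3-dimensional vectors; a real system would use learned encoders.
audio = {"clip_en_1": [0.9, 0.1, 0.0], "clip_en_2": [0.0, 0.2, 0.9]}
subtitles = {"sub_fr_a": [0.88, 0.12, 0.01], "sub_fr_b": [0.5, 0.5, 0.5]}
print(mine_pairs(audio, subtitles))
```

Here only the first clip finds a subtitle close enough to count as a match. The second clip is discarded, which is how a threshold of this kind keeps weak or accidental matches out of the training data.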
“Meta has done a great job having a breadth of different things they support, like text-to-speech, speech-to-text, even automatic speech recognition,” says Chetan Jaiswal, a professor of computer science at Quinnipiac University, who was not involved in the research. “The mere number of languages they are supporting is a tremendous achievement.”
Human translators are still a vital part of the translation process, the researchers say in the paper, because they can grapple with diverse cultural contexts and make sure the same meaning is conveyed from one language into another. This step is important, says Lynne Bowker of the University of Ottawa’s School of Translation & Interpretation, who didn’t work on Seamless. “Languages are a reflection of cultures, and cultures have their own ways of knowing things,” she says.
When it comes to applications like medicine or law, machine translations need to be thoroughly checked by a human, she says. If not, misunderstandings can result. For example, when Google Translate was used to translate public health information about the covid-19 vaccine from the Virginia Department of Health in January 2021, it translated “not mandatory” in English into “not necessary” in Spanish, changing the whole meaning of the message.
AI models have many more examples to train on in some languages than in others. This means current speech-to-speech models may be able to translate a language like Greek into English, a pairing for which there may be many examples, but cannot translate from Swahili to Greek. The team behind Seamless aimed to solve this problem by pre-training the model on millions of hours of spoken audio in different languages. This pre-training allowed it to recognize general patterns in language, making it easier to process less widely spoken languages because it already had some baseline for what spoken language is supposed to sound like.
The system is open-source, which the researchers hope will encourage others to build upon its current capabilities. But some are skeptical of how useful it may be compared with available alternatives. “Google’s translation model is not as open-source as Seamless, but it’s way more responsive and fast, and it doesn’t cost anything as an academic,” says Jaiswal.
The most exciting thing about Meta’s system is that it points to the possibility of instant interpretation across languages in the not-too-distant future, like the Babel fish in Douglas Adams’ cult novel The Hitchhiker’s Guide to the Galaxy. SeamlessM4T is faster than existing models but still not instant. That said, Meta claims to have a newer version of Seamless that’s as fast as human interpreters.
“While having this kind of delayed translation is okay and useful, I think simultaneous translation will be even more useful,” says Kenny Zhu, director of the Arlington Computational Linguistics Lab at the University of Texas at Arlington, who is not affiliated with the new research.