Meta’s Multimodal Translator Uses a Single Model to Translate Across Nearly 100 Languages
While it may not be ready to bring about the long-promised Doctor Dolittle future, modern AI translation techniques are proving more than sufficient to translate among the roughly 6,500 spoken and written languages humanity uses. The trouble is that each of these models excels at only one or two tasks, such as translating, converting text to speech, or converting speech to text. As a result, multiple models have to be stacked together to achieve the end-to-end performance seen in platforms like Google Translate or Facebook’s many language services.
It’s a computationally intensive process, so Meta developed a single model that can do it all. SeamlessM4T is “a foundational multilingual and multitask model that seamlessly translates and transcribes across speech and text,” Meta announced in a blog post on Tuesday. It can translate between nearly 100 languages for speech-to-text and text-to-text tasks; for speech-to-speech and text-to-speech, it supports the same input languages and can output in any of 36 languages, including English.
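To give a sense of what that single-model interface looks like in practice, here is a minimal sketch using the Hugging Face transformers port of SeamlessM4T. The checkpoint name, processor, and generate() arguments follow the transformers documentation rather than Meta’s announcement, so treat them as assumptions to verify against current releases.

```python
# Minimal sketch: one model, multiple translation tasks.
# Assumes the Hugging Face `transformers` port of SeamlessM4T;
# checkpoint name and API come from the transformers docs, not
# Meta's blog post, so verify against the current release.
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# Text-to-text: English in, French text out.
inputs = processor(text="Hello, how are you?", src_lang="eng", return_tensors="pt")
tokens = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))

# Text-to-speech: the same input, spoken French out (a waveform array).
audio = model.generate(**inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()
```

The point the example makes is the one Meta is selling: the same model object handles text-to-text, text-to-speech, speech-to-text, and speech-to-speech, with the task selected by the inputs and a target-language flag rather than by swapping models.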
In the blog post, Meta’s research team notes that SeamlessM4T “significantly improves performance for the low- and mid-resource languages we support” while maintaining “strong performance on high-resource languages, such as English, Spanish, and German.” Meta built SeamlessM4T on its existing PyTorch-based multitask UnitY model architecture, which already natively performs translation across modalities as well as automatic speech recognition. It uses the w2v-BERT 2.0 speech encoder to process incoming audio, breaking it down into smaller tokens for analysis, and a HiFi-GAN unit vocoder to generate spoken responses.
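To make that two-pass UnitY design concrete, the sketch below mocks the data flow just described: a speech encoder produces a shared representation, a first pass decodes translated text, a second pass decodes discrete acoustic units, and a vocoder turns those units into audio. Every module here is a hypothetical stand-in shaped only to show the flow, not Meta’s actual w2v-BERT 2.0 encoder, text decoder, or HiFi-GAN vocoder.

```python
import torch
import torch.nn as nn

# Conceptual sketch of UnitY's two-pass speech-to-speech flow.
# All modules are toy stand-ins (hypothetical), sized to run on
# a laptop; the real system uses a w2v-BERT 2.0 speech encoder,
# transformer decoders, and a HiFi-GAN unit vocoder.
class TwoPassS2ST(nn.Module):
    def __init__(self, dim=512, vocab=32_000, n_units=10_000):
        super().__init__()
        self.speech_encoder = nn.GRU(80, dim, batch_first=True)  # stands in for w2v-BERT 2.0
        self.text_decoder = nn.Linear(dim, vocab)                 # pass 1: speech -> text tokens
        self.unit_decoder = nn.Linear(dim, n_units)               # pass 2: hidden -> discrete units
        self.vocoder = nn.Linear(1, 320)                          # stands in for HiFi-GAN vocoder

    def forward(self, mel):                      # mel: (batch, frames, 80) filterbank features
        hidden, _ = self.speech_encoder(mel)     # shared acoustic representation
        text_logits = self.text_decoder(hidden)  # first pass: translated text (also the S2TT output)
        unit_logits = self.unit_decoder(hidden)  # second pass: discrete acoustic units
        units = unit_logits.argmax(-1, keepdim=True).float()
        wave = self.vocoder(units).flatten(1)    # units -> waveform samples
        return text_logits, wave

model = TwoPassS2ST()
text_logits, wave = model(torch.randn(1, 120, 80))  # ~1 second of dummy audio features
```

The design choice worth noticing is that the text pass and the unit pass share one encoder, which is why a single model can serve speech-to-text and speech-to-speech requests at once.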
Meta has also curated a massive open source speech-to-speech and speech-to-text parallel corpus called SeamlessAlign. The company mined “tens of billions of sentences” and “four million hours” of speech from publicly available archives to “automatically align more than 443,000 hours of speech with texts and create approximately 29,000 hours of speech-to-speech alignments,” according to the blog post. In robustness tests, SeamlessM4T outperformed the current state-of-the-art model against background noise by 37 percent and against speaker style variation by 48 percent.
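That alignment step is, at heart, a nearest-neighbor search: speech segments and sentences are embedded into one shared multilingual space, and pairs that land close together are treated as parallel. The toy sketch below illustrates the idea with plain cosine similarity over random stand-in embeddings; the embedding model and the fixed threshold are simplified assumptions, not Meta’s production mining pipeline, which uses a tuned margin criterion over its multilingual embedding space.

```python
import numpy as np

# Toy illustration of embedding-space mining for SeamlessAlign-style
# alignment. Real mining embeds speech and text into one shared
# multilingual space and applies a margin criterion; here we use
# random stand-in embeddings and plain cosine similarity.
rng = np.random.default_rng(0)
speech_emb = rng.normal(size=(1000, 1024))  # stand-in speech segment embeddings
text_emb = rng.normal(size=(5000, 1024))    # stand-in sentence embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

speech_emb, text_emb = normalize(speech_emb), normalize(text_emb)

# Cosine similarity of every speech segment against every sentence.
sim = speech_emb @ text_emb.T               # shape (1000, 5000)
best = sim.argmax(axis=1)                   # closest sentence per segment
score = sim[np.arange(len(best)), best]

# Low cutoff so the random stand-ins yield matches; a real pipeline
# tunes this (or a margin score) to trade recall against precision.
THRESHOLD = 0.1
pairs = [(i, int(j)) for i, (j, s) in enumerate(zip(best, score)) if s >= THRESHOLD]
print(f"kept {len(pairs)} speech-text pairs above threshold")
```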
Like most of Meta’s previous machine learning projects, be it Llama 2, Massively Multilingual Speech (MMS), Universal Speech Translator (UST), or the ambitious No Language Left Behind (NLLB) project, SeamlessM4T is open source. “We believe SeamlessM4T is an important breakthrough in the AI community’s efforts to create universal multitask systems,” the team wrote. “In keeping with our open science approach, we are excited to share our model publicly so that researchers and developers can build on this technology.” If you’re interested in working with SeamlessM4T yourself, head over to GitHub to download the model, training data, and documentation.