Meta’s Voicebox AI Is a Dall-E for Text-to-Speech
Meta has brought us closer to the promised immortal celebrity future with the unveiling of Voicebox, a generative text-to-speech model that aims to revolutionize spoken word generation, similar to how ChatGPT and Dall-E transformed text and image generation.
It’s basically a text-to-output generator like GPT or Dall-E, except that instead of generating prose or pretty pictures, it spits out audio clips. Meta defines the system as “a non-autoregressive flow matching model trained to complete speech based on audio context and text”. It has been trained on over 50,000 hours of unfiltered audio. In particular, Meta used recorded speech and transcriptions from a range of freely available audiobooks written in English, French, Spanish, German, Polish and Portuguese.
According to the researchers, this diverse set of data allows the system to generate more conversational-sounding speech, regardless of the languages spoken by each party. “Our results show that speech recognition models trained on synthetic speech created by Voicebox perform almost as well as models trained on real speech,” they said. Furthermore, recognition models trained on Voicebox’s computer-generated speech showed only a 1 percent degradation in error rate, compared to a 45 to 70 percent degradation with speech from existing TTS models.
The system was first taught to predict speech segments based on the segments around them as well as the transcription of the passage. “After learning to complete speech from context, the model can then apply this to speech generation tasks, including creating segments in the middle of an audio recording without having to recreate the entire input,” the Meta researchers explained.
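The fill-in-the-blank setup described above can be pictured with a small sketch. This is not Meta's code: the function name `make_infilling_example`, the list-of-floats frame format, and the choice to blank hidden frames with zeros are all assumptions made purely for illustration. It only shows how a training example might be split into visible context and a hidden target span.

```python
from typing import List, Tuple

Frame = List[float]  # one frame of audio features (hypothetical format)


def make_infilling_example(
    frames: List[Frame], start: int, end: int
) -> Tuple[List[Frame], List[bool], List[Frame]]:
    """Hide frames[start:end] and return (context, mask, target).

    context: the frames with the hidden span zeroed out (an assumption;
             the article does not say how the hidden span is represented)
    mask:    True for each hidden frame
    target:  the original hidden frames the model would have to recreate
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("need 0 <= start < end <= len(frames)")
    mask = [start <= i < end for i in range(len(frames))]
    context = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, mask)]
    target = [list(f) for f, m in zip(frames, mask) if m]
    return context, mask, target


# Six toy frames with two features each; hide the two middle frames.
frames = [[float(i), float(i) + 0.5] for i in range(6)]
context, mask, target = make_infilling_example(frames, 2, 4)
```

In this picture, a model would receive `context`, `mask` and the passage's transcription, and would be scored on how well it recreates `target`. Editing a clip or removing noise then amounts to choosing which span to hide.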
Voicebox is also said to be able to actively edit audio clips, remove noise from speech and even replace incorrectly spoken words. “A human can identify what raw part of speech is corrupted by noise (such as a dog barking), crop it, and instruct the model to recreate it,” the researchers said, much like using image processing software to clean up photographs.
Text-to-speech generators have been in use for a minute – that’s how your parents’ TomToms were able to give sly driving directions in Morgan Freeman’s voice. Modern iterations like Speechify or ElevenLabs’ Prime Voice AI are much more powerful, but still largely require mountains of source material to properly mimic a subject, and then another mountain of different data for each. single. other. subject you want to train.
Voicebox doesn’t, thanks to a new zero-shot text-to-speech training method Meta calls Flow Matching. The benchmark results aren’t even close, as Meta’s AI outperformed the current state of the art in terms of both intelligibility (a 1.9 percent word error rate vs. 5.9 percent) and “audio similarity” (a composite score of 0.681 vs. the state of the art’s 0.580), all while running up to 20 times faster than today’s best TTS systems.
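For readers unfamiliar with the intelligibility metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a transcript of the generated audio into the reference text, divided by the number of reference words. The sketch below shows that standard calculation. The function name `word_error_rate` and the simple lowercase, whitespace-split tokenization are illustrative assumptions, not a description of how Meta ran its benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word against six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

On this scale, a 1.9 percent word error rate means roughly two word-level mistakes per hundred reference words.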
But don’t get your celebrity navigators ready just yet: neither the Voicebox app nor its source code is currently being released publicly, Meta confirmed on Friday, citing “potential risks of abuse” despite “many exciting use cases for generative speech models.” Instead, the company released a series of audio examples as well as the program’s original research paper. In the future, the research team hopes the technology will find its way into prosthetics for patients with vocal cord injuries, in-game NPCs, and digital assistants.