Voicebox can generalize to speech-generation tasks (Meta AI)

Meta Launches Voicebox, Pioneers Generative AI Speech

Jim Hendler, June 18, 2023
Generative AI, Meta, Voicebox

Meta AI researchers have advanced generative AI for speech with Voicebox, a model that can perform speech-generation tasks it was not specifically trained for and that surpasses previous models on several benchmarks.

Voicebox is a versatile generative speech system that can produce high-quality audio clips in a variety of styles. It can create outputs from scratch or modify existing samples. The model supports speech synthesis in six languages, as well as noise removal, content editing, style conversion and diverse sample generation.

Traditionally, generative AI models for speech have required specific training for each task, using carefully prepared training data. Voicebox instead builds on a newer approach called Flow Matching, which has been shown to outperform diffusion models. On English zero-shot text-to-speech, Voicebox outperforms the current state-of-the-art model, VALL-E, achieving a lower word error rate (1.9% vs. 5.9% for VALL-E) and higher audio similarity (0.681 vs. 0.580), while being up to 20 times faster. In cross-language style transfer, Voicebox outperforms YourTTS, reducing the word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
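For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and a transcript of the generated audio, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation pipeline the researchers used; the function names `word_error_rate` and `relative_reduction` are hypothetical, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (lower is better)."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Made-up example: one substituted word out of six.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # Relative reductions implied by the figures quoted in the article.
    print(round(relative_reduction(5.9, 1.9), 3))
    print(round(relative_reduction(10.9, 5.2), 3))
```

Read this way, the quoted figures correspond to roughly a two-thirds relative reduction in word errors on the English task and roughly a halving on cross-language style transfer.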

One of the main limitations of existing speech synthesizers is that they rely on monotonic, clean data, which is hard to produce and limited in quantity. Voicebox overcomes this limitation by building on the Flow Matching model, which can learn a highly non-deterministic mapping between text and speech. This allows Voicebox to learn from varied speech data without careful labeling. The model was trained on over 50,000 hours of recorded speech and transcripts from public domain audiobooks in multiple languages.
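The article does not describe how Voicebox is trained in detail. As a rough illustration of the Flow Matching idea only, the sketch below follows the optimal-transport conditional flow matching formulation from the Flow Matching literature: a noise sample is moved along a straight line toward a data sample, and a model is trained to predict the velocity of that path. It operates on toy vectors rather than speech features. The names `cfm_training_pair`, `cfm_loss` and `predict_velocity`, and the value of `SIGMA_MIN`, are assumptions made for this example.

```python
import random

SIGMA_MIN = 1e-5  # assumed small constant; not given in the article


def cfm_training_pair(x1, t, sigma_min=SIGMA_MIN, rng=random):
    """Return (x_t, target_velocity) for one data sample x1 at time t in [0, 1].

    x_t interpolates between Gaussian noise x0 (at t = 0) and a point very
    close to the data sample x1 (at t = 1). The target is d(x_t)/dt.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    target = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, target


def cfm_loss(predict_velocity, x1, t, rng=random):
    """Mean squared error between predicted and target velocity."""
    x_t, target = cfm_training_pair(x1, t, rng=rng)
    pred = predict_velocity(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x1)


if __name__ == "__main__":
    toy_sample = [0.5, -1.0, 2.0]  # stand-in for a frame of audio features
    zero_model = lambda x_t, t: [0.0] * len(x_t)  # placeholder "network"
    print(cfm_loss(zero_model, toy_sample, t=0.5, rng=random.Random(0)))
```

In a real system the placeholder `zero_model` would be a neural network conditioned on text and audio context, and generation would integrate the learned velocity field from noise to speech features with an ODE solver.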

Voicebox can perform a variety of tasks, including:

1- In-context text-to-speech synthesis: Given a short input audio sample, Voicebox can match the sample's audio style and use it to generate speech from text. This capability has potential applications in helping people who cannot speak, and in customizing the voices of non-player characters and virtual assistants.

2- Cross-language style transfer: Given a sample of speech and a passage of text in one of the supported languages (English, French, German, Spanish, Polish or Portuguese), Voicebox can produce a reading of the text in that language in the style of the sample. This capability could facilitate natural and authentic communication between people who speak different languages.

3- Speech denoising and editing: Voicebox also excels at denoising and editing speech. Using in-context learning, the model can generate speech to seamlessly edit segments within audio recordings. It can replace misspoken words or resynthesize parts corrupted by short bursts of noise without the entire speech having to be re-recorded. This capability simplifies cleaning and editing audio recordings, much as popular image-editing tools do for photos.

4- Diverse speech sampling: Voicebox’s ability to learn from a variety of real-world data enables it to produce speech that better represents natural human communication in the six supported languages. This capability can be used to generate synthetic data for training speech assistant models. Models trained on synthetic speech generated by Voicebox perform almost as well as models trained on real speech, with only a 1% error rate degradation, compared with the significant degradation seen with synthetic speech from previous text-to-speech models.
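The editing task in item 3 can be pictured as marking a time span in a recording for regeneration while leaving the surrounding audio as context. The sketch below shows only that bookkeeping step, on a toy list of frames. It is an assumption-laden illustration, not Voicebox's actual interface: the names `span_to_frame_mask` and `apply_mask` and the 100 frames-per-second rate are hypothetical, and the article does not state a frame rate.

```python
FRAME_RATE_HZ = 100  # assumed feature frame rate; not stated in the article


def span_to_frame_mask(n_frames, start_s, end_s, frame_rate=FRAME_RATE_HZ):
    """Return a list of booleans, True for frames to be regenerated."""
    if not 0 <= start_s < end_s:
        raise ValueError("need 0 <= start_s < end_s")
    # Round to the nearest frame and clip to the length of the recording.
    first = min(n_frames, round(start_s * frame_rate))
    last = min(n_frames, round(end_s * frame_rate))
    return [first <= i < last for i in range(n_frames)]


def apply_mask(frames, mask, fill=0.0):
    """Blank out masked frames; unmasked frames are kept as context."""
    return [fill if masked else frame for frame, masked in zip(frames, mask)]


if __name__ == "__main__":
    frames = [1.0] * 10  # toy one-second "recording" at 10 frames per second
    mask = span_to_frame_mask(len(frames), 0.2, 0.5, frame_rate=10)
    print(apply_mask(frames, mask))
```

A generative model would then fill the blanked frames, conditioned on the untouched context and on the corrected transcript.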

While the researchers acknowledge the exciting use cases for generative speech models, they have chosen not to make the Voicebox model or code publicly available at this time because of the potential for misuse. They stress that responsible development and use of AI is paramount, and that openness must be balanced with responsibility. Instead, the researchers have shared audio samples and a research paper detailing the approach, the results, and the creation of a highly effective classifier that distinguishes real speech from Voicebox-generated audio.