Meta’s Latest Dataset to Train Speech Recognition Engines on ‘Groups’ of Speakers
In 2023, Siri still fails to understand certain commands, despite the significant progress made in generative AI systems. The voice assistants on our smartphones can seem little better at listening than they were in 2011. Fortunately, Meta AI has created a new dataset that could enhance the capabilities of automatic speech recognition (ASR) tools by grouping speech at the “utterance level.”
Meta has long sought to improve the performance of ASR systems by teaching them to train without transcripts, recognize more than 4,000 spoken languages, and even read lips more accurately than human experts. However, many of the datasets used to train ASR models are organized by demographic categories (age group, gender, nationality, English accent), which limits the variety of pronunciations the models are exposed to and ultimately prevents them from understanding a broad cross-section of users.
To get around this, Meta AI has developed a dataset that is instead built around an utterance clustering method. “Instead of partitioning the dataset based on speaker demographics… our proposed algorithm clusters speech at the utterance level,” the Meta AI team explained in a Wednesday blog post. “One cluster contains similar utterances from a diverse group of speakers. We can then train our model using the different clusters and use fairness datasets to measure how the model affects outcomes across different demographic groups.”
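For intuition, here is a minimal sketch of what utterance-level clustering could look like in practice, assuming utterance embeddings produced by some pretrained speech encoder and a standard k-means step. This is not Meta’s actual code; the embedding source, cluster count, and array shapes are placeholders chosen for illustration.

```python
# Illustrative sketch (not Meta's implementation): cluster utterance embeddings
# with k-means so each training partition groups acoustically similar utterances
# rather than speakers of a single demographic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for learned utterance embeddings (e.g., from a
# pretrained speech encoder); shape = (num_utterances, embedding_dim).
utterance_embeddings = rng.normal(size=(2_000, 256))

# Group utterances into k clusters of similar speech, independent of who spoke them.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(utterance_embeddings)

# Each cluster can then serve as a training partition that mixes many speakers.
clusters = {c: np.flatnonzero(cluster_ids == c) for c in range(kmeans.n_clusters)}
print({c: len(idx) for c, idx in list(clusters.items())[:5]})
```

The key design choice this mimics is that a partition is defined by how an utterance sounds, not by who said it, so each cluster naturally mixes speakers from many demographic groups.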
Meta’s resulting dataset contains just over 27,000 command phrases collected from 595 paid US volunteers. The utterances revolve around seven main themes (music, capture, utilities, notification management, messaging, calls, and dictation), and other researchers can use them to train their own models and digital assistants. Prompts included asking speakers how they would voice-search for a song or make plans with friends, including deciding on a place to meet.
To evaluate this new system, Meta first trained a model on publicly available English Facebook videos. The researchers then evaluated the model using two other datasets: Casual Conversations v1, published by Meta in 2021, and an “anonymized dataset collected from ASR’s data provider,” which contains 48,000 conversations from 867 individuals.
Initial results proved promising, with model performance improving “for all demographics across our evaluation datasets,” according to the blog, with by far the biggest gains coming on accented speech. Overall, ASR performance improved by 10 percent with the clustering method, and large gains also came from the 66 to 85 age group, which is traditionally underrepresented in the voice command space.
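As a rough illustration of how per-demographic performance can be compared, the sketch below computes word error rate (WER) separately for each group of test utterances. The group labels, transcripts, and workflow here are hypothetical stand-ins for the fairness datasets the researchers describe, not their evaluation code.

```python
# Illustrative sketch (assumed workflow, not Meta's evaluation code): compute
# word error rate (WER) per demographic group to check whether an ASR model
# performs evenly across populations.
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical evaluation records: (demographic label, reference transcript, model output).
records = [
    ("age_66_85", "call my daughter tonight", "call my daughter tonight"),
    ("age_66_85", "play some jazz music", "play sam jazz music"),
    ("age_18_30", "set a timer for ten minutes", "set a timer for ten minutes"),
]

group_scores = defaultdict(list)
for group, reference, hypothesis in records:
    group_scores[group].append(wer(reference, hypothesis))

for group, scores in group_scores.items():
    print(f"{group}: mean WER = {sum(scores) / len(scores):.2f}")
```

Comparing mean WER across groups in this way is one simple proxy for the kind of fairness measurement described above: a model that serves all users well should show similarly low error rates for every group.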
“The algorithm we propose is part of Meta’s long-term focus on responsible artificial intelligence and just one part of our holistic approach to solving fairness problems,” the researchers wrote. Going forward, the team plans to explore adapting the system to other languages.