In early November 2022, Meta released SpeechMatrix, a large-scale multilingual corpus of aligned speech for speech-to-speech translation. According to Meta, the goal is to facilitate the development of speech-to-speech translation (S2ST) systems.
SpeechMatrix was mined from real-world speech: recordings of the European Parliament. It contains speech alignments for 136 language directions, averaging 1,537 hours of source speech per direction, for a total of more than 418,000 hours of speech.
“To our knowledge, SpeechMatrix is by far the largest speech-to-speech translation corpus freely available,” the Meta researchers wrote in their paper.
Data scarcity
As mentioned in the Slator report on interpreting services and technology, large technology companies and universities are driving rapid progress in the field of speech-to-speech translation.
Speech-to-speech translation models can be indirect (cascading through speech recognition, text-based machine translation, and speech synthesis) or direct, training machine learning models on paired audio recordings of speech in the source and target languages.
Direct models are attracting growing research interest and have several advantages. For example, because they do not rely on any intermediate text, they can translate languages without a standardized writing system. However, training such models runs up against a major obstacle: data scarcity.
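The two approaches can be contrasted in a minimal sketch. All function names below are hypothetical placeholders standing in for real components, not any actual API:

```python
# Illustrative contrast between cascaded (indirect) and direct S2ST.
# Every function here is a hypothetical placeholder for a real component.

def cascaded_s2st(audio, asr, mt, tts):
    """Indirect route: source speech -> source text -> target text -> target speech."""
    source_text = asr(audio)        # speech recognition
    target_text = mt(source_text)   # text-based machine translation
    return tts(target_text)         # speech synthesis

def direct_s2st(audio, model):
    """Direct route: a single model maps source speech to target speech,
    with no intermediate text, which is why it can also serve languages
    without a standardized writing system."""
    return model(audio)

# Toy stand-ins to show the data flow:
audio_in = [0.1, -0.2, 0.3]
fake_asr = lambda a: "hello"
fake_mt = lambda t: "bonjour"
fake_tts = lambda t: [0.5, 0.4]
fake_direct = lambda a: [0.5, 0.4]

cascaded_out = cascaded_s2st(audio_in, fake_asr, fake_mt, fake_tts)
direct_out = direct_s2st(audio_in, fake_direct)
```

Both routes end at target-language speech; the difference is that the cascade passes through text twice, while the direct model never does.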
As the researchers explained, “Human-labeled speech data is expensive to create; there are very few data resources providing parallel speech, and the amount of data is quite limited.”
Quality of extracted data and multilingual S2ST
To assess the quality of the extracted data, Meta researchers trained bilingual speech-to-speech translation models on the SpeechMatrix data and reported the translation performance.
Because SpeechMatrix is multilingual, they also explored multilingual speech-to-speech translation.
According to the same paper, “There are very few studies on multilingual speech-to-speech translation, partly due to the lack of multilingual speech-to-speech resources. Thanks to the massively multilingual data that we have mined, we are able to explore multilingual S2ST training.”
As we look to the future of translation, we look forward to seeing other researchers use the techniques we have developed with Hokkien to create their own speech-to-speech translation systems for other written and unwritten languages.
— Meta AI (@MetaAI) October 19, 2022
The researchers found that strong S2ST models can be trained with extracted data and validated the good quality of speech alignments in all languages.
Furthermore, they demonstrated that model pre-training, sparse scaling with Mixture-of-Experts (a technique in which the number of model parameters grows substantially without a proportional increase in compute per input), and multilingual training can “bring significant gains to translation performance.”
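The idea behind sparse Mixture-of-Experts scaling can be sketched in a few lines. The sketch below is illustrative only, with assumed names and sizes, and is not Meta's actual architecture: each token is routed to a single expert, so total parameters grow with the number of experts while per-token compute stays roughly constant.

```python
import numpy as np

# Minimal sketch of sparse Mixture-of-Experts routing (illustrative only;
# dimensions, weights, and top-1 routing are assumptions for demonstration).
rng = np.random.default_rng(0)

d_model, n_experts, d_ff = 8, 4, 16

# Expert weights: n_experts independent feed-forward layers.
W_in = rng.standard_normal((n_experts, d_model, d_ff)) * 0.1
W_out = rng.standard_normal((n_experts, d_ff, d_model)) * 0.1
# Gating network: scores each token against every expert.
W_gate = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x):
    """x: (n_tokens, d_model) -> (n_tokens, d_model), top-1 routing."""
    scores = x @ W_gate                                  # (n_tokens, n_experts)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)           # softmax gate
    chosen = probs.argmax(axis=-1)                       # top-1 expert per token
    y = np.zeros_like(x)
    for e in range(n_experts):
        mask = chosen == e
        if mask.any():                                   # each expert sees only its tokens
            h = np.maximum(x[mask] @ W_in[e], 0.0)       # ReLU feed-forward
            y[mask] = (h @ W_out[e]) * probs[mask, e:e + 1]
    return y

tokens = rng.standard_normal((5, d_model))
out = moe_layer(tokens)
```

Adding experts multiplies the parameter count, but each token still passes through only one expert's feed-forward weights, which is the sense in which capacity scales without sacrificing computational efficiency.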
The researchers hope this work can help others develop textless speech-to-speech translation systems for other written and unwritten languages.
Everything about SpeechMatrix is open source and accessible for download via the GitHub repository.