Meta releases large dataset for multilingual speech-to-speech translation


In early November 2022, Meta released SpeechMatrix, a large-scale multilingual corpus of speech-to-speech translations. The goal, according to Meta, is to facilitate the development of speech-to-speech translation (S2ST) systems.

SpeechMatrix was mined from real speech, namely recordings of the European Parliament. It contains speech alignments in 136 language pairs, with an average of 1,537 hours of source speech in each direction, for a total of over 418,000 hours of speech.
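As a quick sanity check, the quoted figures can be multiplied out. The sketch below assumes that the 136 figure counts language pairs and that each pair is counted in both translation directions; the variable names are illustrative only.

```python
# Figures quoted in the article.
language_pairs = 136            # assumed to count pairs, not individual languages
directions_per_pair = 2         # assumption: each pair is counted in both directions
avg_hours_per_direction = 1537  # average hours of source speech per direction

total_directions = language_pairs * directions_per_pair
total_hours = total_directions * avg_hours_per_direction

print(total_directions)  # 272
print(total_hours)       # 418064
```

Under those assumptions the product comes to 418,064 hours, which matches the "over 418,000 hours" total.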

“To our knowledge, SpeechMatrix is by far the largest speech-to-speech translation corpus freely available,” the Meta researchers wrote in their paper.

Data scarcity

As mentioned in the Slator report on interpreting services and technology, large technology companies and universities are driving rapid progress in the field of speech-to-speech translation.

Speech-to-speech translation models can be indirect (cascaded), passing through text transcription and machine translation, or direct, using machine learning models trained on audio recordings of speech in the source and target languages.
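The difference between the two routes can be sketched in a few lines. This is a minimal illustration, not any real system: the function names are hypothetical, the final speech-synthesis stage of the indirect route is the usual design rather than something the article spells out, and the toy stand-ins at the bottom exist only so the sketch runs.

```python
from typing import Callable

Audio = bytes  # placeholder for a waveform; real systems use arrays of samples


def cascaded_s2st(audio: Audio,
                  transcribe: Callable[[Audio], str],
                  translate: Callable[[str], str],
                  synthesize: Callable[[str], Audio]) -> Audio:
    """Indirect route: speech -> source text -> target text -> speech."""
    source_text = transcribe(audio)
    target_text = translate(source_text)
    return synthesize(target_text)


def direct_s2st(audio: Audio, model: Callable[[Audio], Audio]) -> Audio:
    """Direct route: one model maps source speech to target speech, with no text."""
    return model(audio)


# Toy stand-ins (not real speech components): "audio" is just encoded text.
toy_transcribe = lambda audio: audio.decode("utf-8")
toy_translate = lambda text: {"bonjour": "hello"}.get(text, text)
toy_synthesize = lambda text: text.encode("utf-8")
toy_direct_model = lambda audio: {b"bonjour": b"hello"}.get(audio, audio)

cascaded_out = cascaded_s2st(b"bonjour", toy_transcribe, toy_translate, toy_synthesize)
direct_out = direct_s2st(b"bonjour", toy_direct_model)
print(cascaded_out, direct_out)
```

The indirect route depends on a text form existing for both languages, while the direct route needs only paired audio, which is exactly the kind of data a corpus of aligned speech provides.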

Direct models are attracting growing research interest and have several advantages. For example, they can translate languages without a standardized writing system, because direct models do not rely on any intermediate text. However, training them faces a major obstacle: data scarcity.

As the researchers explained, “Human-labeled speech data is expensive to create, there are very few data resources providing parallel speech, and the amount of data is quite limited.”

Quality of extracted data and multilingual S2ST

To assess the quality of the extracted data, Meta researchers trained bilingual speech-to-speech translation models on the SpeechMatrix data and reported the translation performance.

Thanks to the multilingualism of SpeechMatrix, they also explored multilingual speech-to-speech translation.

According to the paper, “There are very few studies on multilingual speech-to-speech translation, partly due to the lack of multilingual speech-to-speech resources. Thanks to the massively multilingual data that we have extracted, we are able to explore multilingual S2ST training.”

The researchers found that strong S2ST models can be trained on the extracted data, which validates the quality of the speech alignments across all languages.

Furthermore, they demonstrated that model pre-training, sparse scaling with Mixture-of-Experts (an ensemble machine learning technique in which the number of model parameters can grow substantially without sacrificing computational efficiency), and multilingualism can “bring significant gains to translation performance.”
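The idea behind sparse scaling can be illustrated with a toy example. The sketch below assumes top-1 routing over scalar inputs, with each "expert" reduced to a single weight; the class and its names are hypothetical, and this is not the architecture used in the paper.

```python
import random


class TinyMoE:
    """Toy mixture of experts with top-1 routing over scalar inputs."""

    def __init__(self, num_experts: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Each expert is a single weight; the gate has one scoring weight per expert.
        self.expert_weights = [rng.uniform(-1.0, 1.0) for _ in range(num_experts)]
        self.gate_weights = [rng.uniform(-1.0, 1.0) for _ in range(num_experts)]
        self.expert_calls = 0  # how many expert computations have been run

    def num_parameters(self) -> int:
        return len(self.expert_weights) + len(self.gate_weights)

    def forward(self, x: float) -> float:
        # The gate scores every expert (cheap), but only the best one is run.
        scores = [g * x for g in self.gate_weights]
        chosen = max(range(len(scores)), key=scores.__getitem__)
        self.expert_calls += 1
        return self.expert_weights[chosen] * x


small = TinyMoE(num_experts=4)
large = TinyMoE(num_experts=64)
inputs = [0.5, -1.0, 2.0]
for x in inputs:
    small.forward(x)
    large.forward(x)

print(small.num_parameters(), large.num_parameters())  # 8 128
print(small.expert_calls, large.expert_calls)          # 3 3
```

The larger model holds sixteen times as many parameters, yet both models run exactly one expert per input. In a real network the experts are full sub-layers, so the gate's scoring cost is small by comparison.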

The researchers hope this work can help others develop textless speech-to-speech translation systems for other written and unwritten languages.

Everything about SpeechMatrix is open source and accessible for download via the GitHub repository.

