Facebook has unveiled a machine learning model that translates between languages without relying on English. According to a Facebook blog post, M2M-100 is the first multilingual machine translation (MMT) model that can translate between any pair of 100 languages without relying on English data. Stating that breaking language barriers through machine translation (MT) is one of the most important ways to bring people together and provide information on COVID-19, Facebook said that the single multilingual model performs as well as traditional bilingual models and achieves a 10-point BLEU improvement over English-centric multilingual models.
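BLEU, the metric behind the 10-point claim, scores a candidate translation against a reference by combining n-gram precision with a brevity penalty. A minimal, smoothed sentence-level sketch (illustrative only; reported scores are computed corpus-level with tools such as sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions, scaled by a brevity penalty, on a 0-100 scale."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Crude smoothing so a zero match does not zero the whole score.
        log_precisions.append(math.log(max(overlap, 1) / total))
    # Penalize candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 100; a 10-point gain on this scale is a large quality jump for an MT system.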
According to the post, Facebook used novel mining strategies to create translation data and built the first truly ‘many-to-many’ data set, with 7.5 billion sentence pairs across 100 languages.
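The scale of a many-to-many data set follows from simple combinatorics: with 100 languages, an English-centric model only needs data to and from English, while direct translation needs every ordered pair. A back-of-the-envelope count (illustrative, not Facebook's exact data layout):

```python
# Translation directions needed for N languages.
languages = 100

# English-centric: each non-English language paired with English, both ways.
english_centric = 2 * (languages - 1)

# Many-to-many: every ordered pair of distinct languages.
many_to_many = languages * (languages - 1)

print(english_centric, many_to_many)  # 198 vs 9900 directions
```

That fifty-fold jump in translation directions is why mining quality sentence pairs for the non-English directions was the central challenge.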
As per the post, Facebook used several scaling techniques to build a universal model with 15 billion parameters, which captures information from related languages and reflects a more diverse range of scripts and morphologies.
The post revealed that one of the biggest issues in creating a many-to-many MMT model is gathering massive volumes of quality sentence pairs for arbitrary translation directions not involving English. Facebook took on the challenge by combining complementary data mining resources years in the making, including ccAligned, ccMatrix, and LASER.
Facebook also created a new LASER 2.0 with improved fastText language identification, which improves mining quality and includes open-sourced training and evaluation scripts.
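LASER works by embedding sentences from different languages into a shared vector space, so candidate translation pairs can be mined by similarity. A toy sketch of the idea, pairing each source sentence with its nearest target by cosine similarity (the real ccMatrix pipeline uses a margin-based criterion over LASER embeddings at far larger scale):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mine_pairs(src_embs, tgt_embs, threshold=0.9):
    """Pair each source embedding with its most similar target
    embedding, keeping pairs that clear the similarity threshold."""
    pairs = []
    for i, u in enumerate(src_embs):
        j, score = max(((j, cosine(u, v)) for j, v in enumerate(tgt_embs)),
                       key=lambda t: t[1])
        if score >= threshold:
            pairs.append((i, j))
    return pairs
```

With embeddings that genuinely align translations across languages, this kind of nearest-neighbour search is what lets parallel data be mined from monolingual web text rather than hand-built corpora.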
According to Facebook, deploying M2M-100 will improve the quality of translations for billions of people, especially those who speak low-resource languages.