M2M100 sucks at Irish
So, do massively multilingual MT models trained on massively crawled datasets lead to great output? No.
Hugging Face Transformers added the M2M100 model. I tried it out and tweeted screenshots of the appalling output, so I thought I'd recreate the translations here to show they were very real.
!pip install sentencepiece transformers
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
def translate(text, src_lang="pl", trg_lang="ga"):
    # M2M100 reads the source language from the tokenizer
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the first generated token to be the target-language tag
    generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id(trg_lang))
    print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
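The `translate` helper prints its result, which is fine for a notebook but awkward if you want to keep the outputs around for comparison. Here is a minimal sketch of a variant that returns the decoded string instead. The name `translate_to_string` is hypothetical, and passing `model` and `tokenizer` in as arguments is my assumption for illustration; the calls themselves are the same ones used in `translate`.

```python
def translate_to_string(text, model, tokenizer, src_lang="pl", trg_lang="ga"):
    """Translate `text` and return the decoded string instead of printing it.

    `model` and `tokenizer` are assumed to behave like the M2M100 objects
    loaded earlier; this helper is illustrative, not part of the library.
    """
    # M2M100 reads the source language from the tokenizer
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    generated_tokens = model.generate(
        **encoded,
        # Force the first generated token to be the target-language tag
        forced_bos_token_id=tokenizer.get_lang_id(trg_lang),
    )
    # batch_decode returns a list with one string per input sentence
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
```

With the real objects loaded, `translate_to_string("O co Ci chodzi?", model, tokenizer)` should give back the same text that `translate` prints, minus the surrounding list brackets.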
So, do massively multilingual MT models trained on massively crawled datasets lead to great output?
No pic.twitter.com/SckNGTq09B
— Jim O'Regan (@jimregan) April 12, 2021
“One must love one's wife”
translate("Trzeba kochać swoją żonę")
“What are you on about?” or “What are you getting at?”
translate("O co Ci chodzi?")
It's almost poetic pic.twitter.com/IbJi1zvlrX
— Jim O'Regan (@jimregan) April 12, 2021
Let's try English:
translate("Hello, how are you?", src_lang='en')
How poetic. How about some actual poetry? (Pan Tadeusz)
translate("Litwo, Ojczyzno moja! ty jesteś jak zdrowie; Ile cię trzeba cenić, ten tylko się dowie, Kto cię stracił.")
Switching to English output, it at least gives a decent-looking sentence. (It only looks decent, it's wrong) pic.twitter.com/4HyBQvTAux
— Jim O'Regan (@jimregan) April 12, 2021
“It seems to me that you are not sober”
translate("Mi się wydaje, że nie jesteś trzeźwy", trg_lang='en')
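To rerun every example in one go and keep the outputs next to the intended meanings, something like the sketch below could work. It assumes a `translate_fn` callable that returns a string and accepts `src_lang` and `trg_lang` keyword arguments; the post's own `translate` prints rather than returns, so it would need a `return` first. The names `EXAMPLES` and `collect_outputs` are hypothetical, and the glosses are only the ones given in this post (`None` where there is none).

```python
# Sentences used in this post: (source text, source language, target language, gloss).
EXAMPLES = [
    ("Trzeba kochać swoją żonę", "pl", "ga", "One must love one's wife"),
    ("O co Ci chodzi?", "pl", "ga", "What are you on about?"),
    ("Hello, how are you?", "en", "ga", None),
    (
        "Litwo, Ojczyzno moja! ty jesteś jak zdrowie; Ile cię trzeba cenić, "
        "ten tylko się dowie, Kto cię stracił.",
        "pl",
        "ga",
        None,
    ),
    ("Mi się wydaje, że nie jesteś trzeźwy", "pl", "en", "It seems to me that you are not sober"),
]


def collect_outputs(translate_fn, examples=EXAMPLES):
    """Run `translate_fn` over each example and return one dict per sentence."""
    rows = []
    for text, src_lang, trg_lang, gloss in examples:
        rows.append(
            {
                "source": text,
                "src_lang": src_lang,
                "trg_lang": trg_lang,
                "gloss": gloss,
                # The model output, to be read against the gloss by a human
                "output": translate_fn(text, src_lang=src_lang, trg_lang=trg_lang),
            }
        )
    return rows
```

The rows still need a human who reads Irish to judge them; nothing here scores the output automatically.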