G2P with MFA
Grapheme-to-phoneme conversion with Montreal Forced Aligner (on Kaggle)
Original on Kaggle
%%capture
import os
os.chdir('/tmp')
!wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
!tar zxvf montreal-forced-aligner_linux.tar.gz
!ln -s /tmp/montreal-forced-aligner/lib/libpython3.6m.so.1.0 /tmp/montreal-forced-aligner/lib/libpython3.6m.so
os.chdir('/kaggle/working')
os.environ['LD_LIBRARY_PATH'] = f'{os.environ["LD_LIBRARY_PATH"]}:/tmp/montreal-forced-aligner/lib/'
os.environ['PATH'] = f'{os.environ["PATH"]}:/tmp/montreal-forced-aligner/bin/'
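
Appending a directory to a colon-separated variable such as `PATH` or `LD_LIBRARY_PATH` can be wrapped in a small helper that also copes with the variable being unset. This is only a sketch: `append_to_path_var` is a hypothetical name, not part of MFA or Kaggle.

```python
import os


def append_to_path_var(env, name, directory):
    """Append `directory` to the colon-separated variable `name` in `env`.

    `env` is any mutable mapping (os.environ or a plain dict).
    An unset or empty variable is treated as having no entries,
    and empty entries in the existing value are dropped.
    """
    current = env.get(name, "")
    entries = [entry for entry in current.split(":") if entry]
    if directory not in entries:  # avoid duplicates when a cell is re-run
        entries.append(directory)
    env[name] = ":".join(entries)
    return env[name]
```

In the notebook this would be called as `append_to_path_var(os.environ, "PATH", "/tmp/montreal-forced-aligner/bin/")`, and likewise for `LD_LIBRARY_PATH`.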
%%capture
!apt-get -y install libgfortran3
!mkdir -p /tmp/example
The example below is from section 488 (p. 239) of Gaeilge Chorca Dhuibhne by Diarmuid Ó Sé.
The provided transcription is: əs kiːn′ l′əm nə ˈheːn′ɪ v′eh ə bɪn′t′ vuːn er′
%%writefile /tmp/example/test1.lab
is cuimhin liom na haoinne a bheith ag baint mhóna air
Ibid., section 488, p. 238.
The provided transcription is: xuːərˈdiːs ˈgax ɑːt′
%%writefile /tmp/example/test2.lab
chuardaíos gach áit
MFA insists on having .wav files, which it reads even though it makes no use of them for G2P. Silent placeholder files generated with sox are enough.
%%capture
!apt-get -y install sox
!sox -n -r 16000 -b 16 -c 1 -L /tmp/example/test1.wav trim 0.0 6.000
!sox -n -r 16000 -b 16 -c 1 -L /tmp/example/test2.wav trim 0.0 6.000
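
If installing sox is not an option, the same kind of placeholder can probably be written with Python's standard `wave` module. The sketch below mirrors the sox settings used above (16 kHz, 16-bit, mono, 6 seconds of silence); `write_silent_wav` is a hypothetical helper name, and whether MFA accepts the result exactly as it accepts the sox output has not been checked here.

```python
import wave


def write_silent_wav(path, seconds=6.0, sample_rate=16000):
    """Write a mono, 16-bit PCM WAV file containing only silence."""
    n_frames = int(seconds * sample_rate)
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)            # mono, like sox -c 1
        wav.setsampwidth(2)            # 2 bytes = 16 bits, like sox -b 16
        wav.setframerate(sample_rate)  # like sox -r 16000
        wav.writeframes(b"\x00\x00" * n_frames)  # all-zero samples = silence
```

For the two examples here, that would be `write_silent_wav("/tmp/example/test1.wav")` and `write_silent_wav("/tmp/example/test2.wav")`.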
!mfa_generate_dictionary ../input/train-irish-mfa-model-fuaimeanna/g2p-munster.zip /tmp/example/ output
!cat output
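
To work with the generated dictionary in Python rather than just printing it, the file can be read into a mapping. This sketch assumes each line of `output` is a word followed by whitespace and then space-separated phones, which is the usual MFA dictionary layout; `read_pronunciations` is a hypothetical helper, not an MFA function.

```python
from collections import defaultdict


def read_pronunciations(path):
    """Map each word to a list of pronunciations (each a list of phones)."""
    entries = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(maxsplit=1)
            if len(parts) < 2:
                continue  # skip blank lines and words with no pronunciation
            word, phones = parts
            entries[word].append(phones.split())
    return dict(entries)
```

A call such as `read_pronunciations("output")["baint"]` would then return the list of phone sequences generated for that word.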
Word | Pronunciation | Alt. Transcript | Generated | Correct? | In context? | Rule/Reason |
---|---|---|---|---|---|---|
is | əs | əsˠ (~ ɪʃ) | ɪ ʃ | ✔️ | ❌ | Exception: ios but correct before a slender consonant |
cuimhin | kiːn′ | kiːnʲ | k ɪ vˠ nʲ | ❌ | ❌ | Missing grapheme: uimhi |
liom | l′əm | lʲəmˠ (~ lʲʌmˠ) | lʲ ʌ mˠ | ✔️ | ✔️ | (See, e.g., section 291: l′um) |
na | nə | n̪ˠə | n̪ˠ ə | ✔️ | ✔️ | |
haoinne | ˈheːn′ɪ | heːnʲɪ | ɪ nʲ ɛ | ❌ | ❌ | |
a | ə | ə | | ✔️ | ✔️ | |
bheith | v′eh | vʲɛ(h) | vʲ ɛ | ✔️ | ❌ | Section 9: h → ∅ / _ # -V |
ag | ə | ə (~ ɪɟ) | a ɡ | ❌ | ❌ | ɪg′, section 60 |
baint | bɪn′t′ | bˠɪnʲtʲ | bˠ ɪ nʲ tʲ | ✔️ | ✔️ | |
mhóna | vuːn | vˠuːn̪ˠ(ə) | vˠ oː n̪ˠ ə | ✔️ | ❌ | ó → oː ~ uː / _ [+nasal], ə → ∅ / _ # |
air | er′ | eɾʲ | a ɾʲ | ❌ | ❌ | Exception: eir |
chuardaíos | xuːərˈdiːs | xuəɾˠd̪ˠiːsˠ | x uə ɾˠ d̪ˠ iː ʌ sˠ | ❌ | ❌ | Missing grapheme: aío |
gach | ˈgax | ɡax (~ ɡəx) | ɡ ə x | ✔️ | ✔️ | See section 810 |
áit | ɑːt′ | ɑːtʲ | ɑː tʲ | ✔️ | ✔️ | |
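
The two judgement columns can be tallied directly. The sketch below copies the ✔️/❌ marks from the "Correct?" and "In context?" columns by hand (`True` for ✔️); the variable and function names are illustrative only.

```python
# (word, "Correct?" mark, "In context?" mark), copied from the table
judgements = [
    ("is", True, False),
    ("cuimhin", False, False),
    ("liom", True, True),
    ("na", True, True),
    ("haoinne", False, False),
    ("a", True, True),
    ("bheith", True, False),
    ("ag", False, False),
    ("baint", True, True),
    ("mhóna", True, False),
    ("air", False, False),
    ("chuardaíos", False, False),
    ("gach", True, True),
    ("áit", True, True),
]


def tally(rows):
    """Return (total, correct, correct_in_context) for the given rows."""
    total = len(rows)
    correct = sum(1 for _, ok, _ in rows if ok)
    in_context = sum(1 for _, _, ok in rows if ok)
    return total, correct, in_context


total, correct, in_context = tally(judgements)
print(f"{correct}/{total} correct, {in_context}/{total} correct in context")
# prints "9/14 correct, 6/14 correct in context"
```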