wav2vec-u Common Voice Swedish - prepare ltr/phn/wrd
ltr/phn/wrd preparation for wav2vec-u on Common Voice Swedish
Original here
In the section Preparation of speech and text data of the readme, it says:
Similar to wav2vec 2.0, data folders contain {train,valid,test}.{tsv,wrd,phn} files, where audio paths are stored in tsv files, and word, letter or phoneme transcriptions are stored in .{wrd,ltr,phn}. The .wrd and .ltr files are outputs of libri_labels.py
%%capture
!pip install phonemizer
%%capture
!apt-get -y install espeak
%%capture
!apt-get -y install zsh
This is just my best guess at what the .wrd files contain. It seems to match what libri_labels.py does: given input like
1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
it does " ".join(items[1:]), which drops the utterance ID and keeps only the transcript. Taking the sentence column of the Common Voice tsv files is basically the same.
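As a rough illustration of the " ".join(items[1:]) step quoted above, the sketch below splits a LibriSpeech-style transcript line and drops the utterance ID. The function name transcript_to_wrd is made up for this example and is not part of libri_labels.py.

```python
def transcript_to_wrd(line):
    """Drop the leading utterance ID and keep the words (hypothetical helper)."""
    items = line.strip().split()
    return " ".join(items[1:])


example = (
    "1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES "
    "AND WE ARE GLAD TO WELCOME HIS GOSPEL"
)
print(transcript_to_wrd(example))
```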
!cat /kaggle/input/download-common-voice-swedish/cv-corpus-6.1-2020-12-11/sv-SE/test.tsv | awk -F'\t' '{print $3}'|grep -v '^sentence$' | perl -C7 -ane 'chomp;$_=lc($_);s/[^\p{L}\p{N}\p{M}'"\'"' \-]/ /g;s/ +/ /g;s/ $//;s/^ //;print "$_\n";' > test.wrd
!cat /kaggle/input/download-common-voice-swedish/cv-corpus-6.1-2020-12-11/sv-SE/dev.tsv | awk -F'\t' '{print $3}'|grep -v '^sentence$' | perl -C7 -ane 'chomp;$_=lc($_);s/[^\p{L}\p{N}\p{M}'"\'"' \-]/ /g;s/ +/ /g;s/ $//;s/^ //;print "$_\n";' > valid.wrd
!cat /kaggle/input/download-common-voice-swedish/cv-corpus-6.1-2020-12-11/sv-SE/train.tsv | awk -F'\t' '{print $3}'|grep -v '^sentence$' | perl -C7 -ane 'chomp;$_=lc($_);s/[^\p{L}\p{N}\p{M}'"\'"' \-]/ /g;s/ +/ /g;s/ $//;s/^ //;print "$_\n";' > train.wrd
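For readers who find the Perl one-liner hard to parse, here is a minimal Python sketch of what I understand it to do: lowercase the sentence, replace everything that is not a letter, number, combining mark, apostrophe, space or hyphen with a space, then collapse runs of spaces and trim the ends. The name normalize_sentence is hypothetical, and Python's lower() may differ from Perl's lc on unusual characters, so treat this as an approximation rather than a drop-in replacement.

```python
import unicodedata


def normalize_sentence(text):
    """Approximate the Perl clean-up used for the .wrd files (assumed equivalent)."""
    kept = []
    for ch in text.lower():
        # Unicode general categories starting with L, N, M correspond to
        # \p{L}, \p{N}, \p{M} in the Perl character class.
        if unicodedata.category(ch)[0] in ("L", "N", "M") or ch in "' -":
            kept.append(ch)
        else:
            kept.append(" ")
    # Collapse repeated spaces and strip leading/trailing ones.
    return " ".join("".join(kept).split())


print(normalize_sentence("Hej, världen!"))  # hej världen
```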
for i in ['train', 'test', 'valid']:
    with open(f'/kaggle/working/{i}.wrd', 'r') as inf, open(f'/kaggle/working/{i}.ltr', 'w') as out:
        for line in inf.readlines():
            # Spell out each word letter by letter, with "|" as the word boundary
            print(" ".join(list(line.strip().replace(" ", "|"))) + " |", file=out)
!head train.ltr
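To make the .ltr format concrete, the expression used in the loop above can be pulled out into a small function and run on a single sentence. The function name words_to_letters and the sample sentence are only for illustration.

```python
def words_to_letters(line):
    """Turn a space-separated sentence into letters, using "|" as the word boundary."""
    return " ".join(list(line.strip().replace(" ", "|"))) + " |"


print(words_to_letters("hej då"))  # h e j | d å |
```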
phonemize prints some warnings about language switching, so echo the filename first to know which file each warning belongs to
!for i in train test valid; do echo $i.wrd; cat $i.wrd | PHONEMIZER_ESPEAK_PATH=$(which espeak) phonemize -o $i.phn -p ' ' -w '' -l sv -j 70 --language-switch remove-flags ;done
!cat test.wrd|awk 'BEGIN{ln=1}{if(ln==81){print $0};ln++}'
!cat train.wrd|awk 'BEGIN{ln=1}{if(ln==254||ln==1457){print $0};ln++}'
!cat valid.wrd|awk 'BEGIN{ln=1}{if(ln==1831){print $0};ln++}'
!cat test.phn|awk 'BEGIN{ln=1}{if(ln==81){print $0};ln++}'
!cat train.phn|awk 'BEGIN{ln=1}{if(ln==254||ln==1457){print $0};ln++}'
!cat valid.phn|awk 'BEGIN{ln=1}{if(ln==1831){print $0};ln++}'
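The awk commands above print the lines flagged by the warnings, first from the .wrd files and then from the .phn files. A small Python helper can do the same lookup; the sketch below is one way to write it. The name pick_lines is hypothetical, and the demo runs on a throwaway file so that it works outside the notebook (in the notebook you would pass a path such as train.wrd and the line numbers from the warnings).

```python
import os
import tempfile


def pick_lines(path, line_numbers):
    """Return {line_number: text} for the requested 1-based line numbers."""
    wanted = set(line_numbers)
    picked = {}
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if number in wanted:
                picked[number] = line.rstrip("\n")
    return picked


# Demo on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wrd")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("första raden\nandra raden\ntredje raden\n")
    demo = pick_lines(demo_path, [1, 3])

print(demo)
```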
!echo taskigt|espeak -v sv --ipa 2> /dev/null
!cat test.phn|sed -e 's/^ //;s/t a s k ɪ ɡ t/t a s k ɪ t/' > tmp
!mv tmp test.phn
!cat train.phn|sed -e 's/^ //;s/d ɪ z aɪ n/d ɛ s a j n/;s/ɪ n t ə n ɛ t/ɪ n t ɛ r n ɛ t/' > tmp
!mv tmp train.phn
!cat valid.phn|sed -e 's/^ //;s/ɪ n t ə n ɛ t/ɪ n t ɛ r n ɛ t/' > tmp
!mv tmp valid.phn
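The sed commands above strip a leading space and patch three pronunciations by hand. The same idea can be sketched in Python with a table of replacements. Note two assumptions: the helper name fix_phoneme_line is made up, and this version applies all three corrections to every line, whereas the notebook applies only the relevant ones to each file. Like sed without the g flag, it replaces only the first occurrence per line.

```python
# Hand-written corrections taken from the sed commands above.
CORRECTIONS = {
    "t a s k ɪ ɡ t": "t a s k ɪ t",
    "d ɪ z aɪ n": "d ɛ s a j n",
    "ɪ n t ə n ɛ t": "ɪ n t ɛ r n ɛ t",
}


def fix_phoneme_line(line, corrections=CORRECTIONS):
    """Remove one leading space, then apply each correction once."""
    if line.startswith(" "):
        line = line[1:]
    for wrong, right in corrections.items():
        line = line.replace(wrong, right, 1)
    return line


for wrong in CORRECTIONS:
    print(fix_phoneme_line(" " + wrong))
```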
!for i in train test valid; do cat $i.wrd|tr ' ' '\n'|sort|uniq |grep -v '^internet$'|grep -v '^design$'|grep -v '^taskigt$' > /tmp/$i.wl; cat /tmp/$i.wl | PHONEMIZER_ESPEAK_PATH=$(which espeak) phonemize -o /tmp/$i.wl.phn -p ' ' -w '' -l sv -j 70 --language-switch remove-flags;paste /tmp/$i.wl /tmp/$i.wl.phn > dict.$i; done
!printf "taskigt\tt a s k ɪ t\n" >> dict.test
!printf "design\td ɛ s a j n\n" >> dict.train
!printf "internet\tɪ n t ɛ r n ɛ t\n" >> dict.train
!printf "internet\tɪ n t ɛ r n ɛ t\n" >> dict.valid
!for i in dic*;do cat $i |sort > tmp;mv tmp $i;done
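Each dict.* file produced above should hold one word per line, a tab, and then its space-separated phonemes. As a quick sanity check, such a file can be loaded into a Python dictionary. The sketch below assumes that tab-separated layout; the name load_lexicon is hypothetical, and the demo uses a temporary file with two of the manual entries rather than the real dict.train.

```python
import os
import tempfile


def load_lexicon(path):
    """Read word<TAB>phonemes lines into a dict, skipping malformed lines."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, sep, phones = line.rstrip("\n").partition("\t")
            if not sep or not word:
                continue  # no tab or empty word: not a usable entry
            lexicon[word] = phones.strip()
    return lexicon


# Demo on a temporary file containing entries like those appended above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "dict.demo")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("design\td ɛ s a j n\n")
        f.write("taskigt\tt a s k ɪ t\n")
    lexicon = load_lexicon(demo_path)

print(lexicon["taskigt"])
```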