%cd /opt
/opt
%%capture
!tar xvf /kaggle/input/extract-prebuilt-kaldi-from-docker/kaldi.tar
%cd /tmp
/tmp
!git clone https://github.com/pytorch/fairseq/
%%capture
!pip install phonemizer
%%capture
!pip install git+https://github.com/pytorch/fairseq/
%%capture
!apt-get -y install espeak
!git clone https://github.com/kpu/kenlm
%%capture
!apt-get -y install libeigen3-dev liblzma-dev zlib1g-dev libbz2-dev
%%capture
%cd kenlm
!mkdir build
%cd build
!cmake ..
!make -j 4
%cd /tmp
import os
os.environ['PATH'] = f"{os.environ['PATH']}:/tmp/kenlm/build/bin/"
os.environ['FAIRSEQ_ROOT'] = '/tmp/fairseq'
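
A quick sanity check that the KenLM binaries built above are now reachable on the PATH (optional):

!which lmplz build_binary
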
!cat /kaggle/input/wav2vec-u-cv-swedish-audio/*.wrd | grep -v '^$' | sort | uniq > /kaggle/working/sentences.txt
%cd fairseq/examples/wav2vec/unsupervised
/tmp/fairseq/examples/wav2vec/unsupervised
%%capture
!apt-get -y install zsh
!mkdir /kaggle/working/preppedtext
%cd scripts
/tmp/fairseq/examples/wav2vec/unsupervised/scripts

The next part requires a fastText language-ID model. I don't know where the 187-language model the script refers to comes from, but there is a model covering 176 languages here:

!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
!sed -e 's/187/176/' normalize_and_filter_text.py > tmp
!mv tmp normalize_and_filter_text.py
import os

os.environ['HYDRA_FULL_ERROR'] = '1'
os.environ['LD_LIBRARY_PATH'] = '/opt/conda/lib:/opt/kaldi/tools/openfst-1.6.7/lib:/opt/kaldi/src/lib'

There are two lines with missing variables in prepare_text.sh (there is a pull request for this), so replace the file.

While I'm replacing the file anyway: most of the first part of the script is unnecessary, since I already have a phonetic dictionary, so I use that instead.
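
For reference, the format of dict.train (and of the lexicon.lst written below) is one word per line followed by its space-separated phones, along these lines (made-up entries for illustration, not the real Swedish ones):

hej h E j
tack t a k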

In the calls to the preprocess.py script, make sure to check the threshold (--thresholdsrc): if it is set so high that no symbols survive it, the script fails with a divide by zero.
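
Before running it, a quick pre-flight check helps pick a safe value. A minimal sketch, assuming the phones file the script writes to /kaggle/working/preppedtext/phones.txt:

from collections import Counter

# Count phone-symbol frequencies in the prepared phones file, then report
# how many symbol types would survive a given --thresholdsrc.
counts = Counter()
with open('/kaggle/working/preppedtext/phones.txt') as f:
    for line in f:
        counts.update(line.split())

for threshold in (2, 100, 1000):
    kept = sum(1 for c in counts.values() if c >= threshold)
    print(f"--thresholdsrc {threshold}: keeps {kept} of {len(counts)} phone types")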

Config options for kaldi_initializer.py (an example invocation follows the list):

  • in_labels: a naming component for the Kaldi lexicons/FSTs (required)
  • wav2letter_lexicon: path to the wav2letter lexicon
  • out_labels: a naming component for the Kaldi lexicons/FSTs; defaults to in_labels if unset
  • kaldi_root: path to Kaldi (/opt/kaldi for my Kaggle image)
  • fst_dir: path where the generated FSTs will be saved
  • data_dir: path to the phones data
  • lm_arpa: path to the LM in ARPA format
  • blank_symbol: the CTC blank symbol (<s> here)
  • silence_symbol: the Kaldi symbol for silence (<SIL> is set for two of the invocations)
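
Putting those together, an invocation looks like the ones in the script below (paths and symbols copied from my own calls):

lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py \
    fst_dir=$target_dir/fst/phn_to_words_sil \
    lm_arpa=$target_dir/kenlm.wrd.o40003.arpa \
    wav2letter_lexicon=$target_dir/lexicon_filtered.lst \
    data_dir=$target_dir/phones \
    "blank_symbol='<SIL>'" "in_labels='phn'" "kaldi_root='/opt/kaldi'"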

A config file needs to exist for this, even though the options set in it seem to be ignored.

!mkdir /tmp/fairseq/examples/speech_recognition/kaldi/config/
%%writefile /tmp/fairseq/examples/speech_recognition/kaldi/config/config.yaml
kaldi_root: "/opt/kaldi"
Writing /tmp/fairseq/examples/speech_recognition/kaldi/config/config.yaml
%%writefile prepare_text.sh
#!/usr/bin/env zsh
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

lg=$1
text_path=$2
target_dir=$3

#ph_lg=${lg:l}
#if test "$lg" = 'fr'; then
#  ph_lg='fr-fr'
#elif test "$lg" = 'en'; then
#  ph_lg='en-us'
#elif test "$lg" = 'pt'; then
#  ph_lg='pt-br'
#fi
ph_lg="sv"

echo $lg
echo $ph_lg
echo $text_path
echo $target_dir

mkdir -p $target_dir
#python normalize_and_filter_text.py --lang $lg < $text_path | grep -v '\-\-\-' >! $target_dir/lm.upper.lid.txt
#python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/lm.upper.lid.txt --only-source --destdir $target_dir --thresholdsrc 2 --padding-factor 1 --dict-only
#cut -f1 -d' ' $target_dir/dict.txt | grep -v -x '[[:punct:]]*' | grep -Pv '\d\d\d\d\d+' >! $target_dir/words.txt
# Use the prebuilt Common Voice text and word list instead of the
# normalize/filter + dict-building steps commented out above.
cp /kaggle/input/wav2vec-u-cv-swedish-audio/train.wrd $target_dir/lm.upper.lid.txt
cut -f1 -d' ' /kaggle/input/wav2vec-u-cv-swedish-audio/dict.train >! $target_dir/words.txt

#one=$(echo "1" | PHONEMIZER_ESPEAK_PATH=$(which espeak) phonemize -p ' ' -w '' -l $ph_lg --language-switch remove-flags)
#sed 's/$/ 1/' $target_dir/words.txt | PHONEMIZER_ESPEAK_PATH=$(which espeak) phonemize -o $target_dir/phones.txt -p ' ' -w '' -l $ph_lg -j 70 --language-switch remove-flags
# Likewise, take the phone sequences straight from the phonetic dictionary
# instead of running the phonemizer.
cut -f2- -d' ' /kaggle/input/wav2vec-u-cv-swedish-audio/dict.train >! $target_dir/phones.txt

#echo "one is ${one}"

#sed -i "s/${one}$//" $target_dir/phones.txt
#paste $target_dir/words.txt $target_dir/phones.txt >! $target_dir/lexicon.lst
# dict.train is already a word-to-phones lexicon, so use it as-is.
cp /kaggle/input/wav2vec-u-cv-swedish-audio/dict.train $target_dir/lexicon.lst

#python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/phones.txt --only-source --destdir $target_dir/phones --thresholdsrc 1000 --padding-factor 1 --dict-only
# --thresholdsrc lowered from 1000 to 2: a threshold high enough to filter out
# every phone type leaves an empty dict and causes the divide by zero mentioned above.
python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/phones.txt --only-source --destdir $target_dir/phones --thresholdsrc 2 --padding-factor 1 --dict-only

python filter_lexicon.py -d $target_dir/phones/dict.txt < $target_dir/lexicon.lst >! $target_dir/lexicon_filtered.lst
# Re-render the word-level text as phone sequences via the filtered lexicon,
# inserting <SIL> with probability 0.25 and surrounding each utterance with silence.
python phonemize_with_sil.py -s 0.25 --surround --lexicon $target_dir/lexicon_filtered.lst < $target_dir/lm.upper.lid.txt >! $target_dir/phones/lm.phones.filtered.txt
cp $target_dir/phones/dict.txt $target_dir/phones/dict.phn.txt
echo "<SIL> 0" >> $target_dir/phones/dict.phn.txt
python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/phones/lm.phones.filtered.txt --workers 70 --only-source --destdir $target_dir/phones --srcdict $target_dir/phones/dict.phn.txt

# Word-level 4-gram LM (with 4-gram pruning via --prune 0 0 0 3), then convert
# the ARPA file to KenLM's binary format.
lmplz -o 4 < $target_dir/lm.upper.lid.txt --discount_fallback --prune 0 0 0 3 >! $target_dir/kenlm.wrd.o40003.arpa
build_binary $target_dir/kenlm.wrd.o40003.arpa $target_dir/kenlm.wrd.o40003.bin
lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py fst_dir=$target_dir/fst/phn_to_words_sil lm_arpa=$target_dir/kenlm.wrd.o40003.arpa wav2letter_lexicon=$target_dir/lexicon_filtered.lst data_dir=$target_dir/phones "blank_symbol='<SIL>'" "in_labels='phn'" "kaldi_root='/opt/kaldi'"
lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py fst_dir=$target_dir/fst/phn_to_words lm_arpa=$target_dir/kenlm.wrd.o40003.arpa wav2letter_lexicon=$target_dir/lexicon_filtered.lst data_dir=$target_dir/phones  "in_labels='phn'" "kaldi_root='/opt/kaldi'"

# Phone-level 4-gram and 6-gram LMs; the 6-gram one feeds the phn_to_phn_sil FST below.
lmplz -o 4 < $target_dir/phones/lm.phones.filtered.txt --discount_fallback >! $target_dir/phones/lm.phones.filtered.04.arpa
build_binary -s $target_dir/phones/lm.phones.filtered.04.arpa $target_dir/phones/lm.phones.filtered.04.bin
lmplz -o 6 < $target_dir/phones/lm.phones.filtered.txt --discount_fallback >! $target_dir/phones/lm.phones.filtered.06.arpa
build_binary -s $target_dir/phones/lm.phones.filtered.06.arpa $target_dir/phones/lm.phones.filtered.06.bin

lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py fst_dir=$target_dir/fst/phn_to_phn_sil lm_arpa=$target_dir/phones/lm.phones.filtered.06.arpa data_dir=$target_dir/phones "blank_symbol='<SIL>'" "in_labels='phn'" "kaldi_root='/opt/kaldi'"
Overwriting prepare_text.sh

add-self-loop-simple.cc attempts to stream std::endl into KALDI_LOG, which doesn't work (KALDI_LOG's templated operator<< can't deduce a type for std::endl, so compilation fails there), so rewrite it without the manipulator. I'm not sure whether this actually prevents anything from working, but it is really distracting.

%%writefile /tmp/fairseq/examples/speech_recognition/kaldi/add-self-loop-simple.cc
/*
* Copyright (c) Facebook, Inc. and its affiliates.
*
* This source code is licensed under the MIT license found in the
* LICENSE file in the root directory of this source tree.
*/

#include <iostream>
#include "fstext/fstext-lib.h" // @manual
#include "util/common-utils.h" // @manual

/*
 * This program is to modify a FST without self-loop by:
 *   for each incoming arc with non-eps input symbol, add a self-loop arc
 *   with that non-eps symbol as input and eps as output.
 *
 * This is to make sure the resultant FST can do deduplication for repeated
 * symbols, which is very common in acoustic model
 *
 */
namespace {
int32 AddSelfLoopsSimple(fst::StdVectorFst* fst) {
  typedef fst::MutableArcIterator<fst::StdVectorFst> IterType;

  int32 num_states_before = fst->NumStates();
  fst::MakePrecedingInputSymbolsSame(false, fst);
  int32 num_states_after = fst->NumStates();
  KALDI_LOG << "There are " << num_states_before
            << " states in the original FST; "
            << " after MakePrecedingInputSymbolsSame, there are "
            << num_states_after << " states ";

  auto weight_one = fst::StdArc::Weight::One();

  int32 num_arc_added = 0;

  fst::StdArc self_loop_arc;
  self_loop_arc.weight = weight_one;

  int32 num_states = fst->NumStates();
  std::vector<std::set<int32>> incoming_non_eps_label_per_state(num_states);

  for (int32 state = 0; state < num_states; state++) {
    for (IterType aiter(fst, state); !aiter.Done(); aiter.Next()) {
      fst::StdArc arc(aiter.Value());
      if (arc.ilabel != 0) {
        incoming_non_eps_label_per_state[arc.nextstate].insert(arc.ilabel);
      }
    }
  }

  for (int32 state = 0; state < num_states; state++) {
    if (!incoming_non_eps_label_per_state[state].empty()) {
      auto& ilabel_set = incoming_non_eps_label_per_state[state];
      for (auto it = ilabel_set.begin(); it != ilabel_set.end(); it++) {
        self_loop_arc.ilabel = *it;
        self_loop_arc.olabel = 0;
        self_loop_arc.nextstate = state;
        fst->AddArc(state, self_loop_arc);
        num_arc_added++;
      }
    }
  }
  return num_arc_added;
}

void print_usage() {
  std::cout << "add-self-loop-simple usage:\n"
               "\tadd-self-loop-simple <in-fst> <out-fst> \n";
}
} // namespace

int main(int argc, char** argv) {
  if (argc != 3) {
    print_usage();
    exit(1);
  }

  auto input = argv[1];
  auto output = argv[2];

  auto fst = fst::ReadFstKaldi(input);
  auto num_states = fst->NumStates();
  KALDI_LOG << "Loading FST from " << input << " with " << num_states
            << " states.";

  int32 num_arc_added = AddSelfLoopsSimple(fst);
  KALDI_LOG << "Adding " << num_arc_added << " self-loop arcs ";

  fst::WriteFstKaldi(*fst, std::string(output));
  KALDI_LOG << "Writing FST to " << output;

  delete fst;
}
Overwriting /tmp/fairseq/examples/speech_recognition/kaldi/add-self-loop-simple.cc
!zsh prepare_text.sh sv /kaggle/working/sentences.txt /kaggle/working/preppedtext
sv
sv
/kaggle/working/sentences.txt
/kaggle/working/preppedtext
=== 1/5 Counting and sorting n-grams ===
Reading /kaggle/working/preppedtext/lm.upper.lid.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 14359 types 3160
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:37920 2:2571431424 3:4821433856 4:7714294272
Statistics:
1 3160 D1=0.722623 D2=1.14413 D3+=1.45956
2 10285 D1=0.848104 D2=1.2466 D3+=1.46191
3 12632 D1=0.943362 D2=1.24166 D3+=1.32723
4 11699 D1=0.970399 D2=1.4843 D3+=2.12351
Memory estimate for binary LM:
type     kB
probing 617 assuming -p 1.5
probing 764 assuming -r models -p 1.5
trie    309 without quantization
trie    182 assuming -q 8 -b 8 quantization 
trie    293 assuming -a 22 array pointer compression
trie    166 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:37920 2:164560 3:252640 4:456
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:37920 2:164560 3:252640 4:456
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:14925024 kB	VmRSS:6488 kB	RSSMax:2975268 kB	user:0.194576	sys:0.839708	CPU:1.03431	real:1.03864
Reading /kaggle/working/preppedtext/kenlm.wrd.o40003.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
[2021-05-30 15:50:13,771][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words_sil/kaldi_dict.phn.txt
[2021-05-30 15:50:13,771][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words_sil/G_kenlm.wrd.o40003.fst
/opt/kaldi/src/lmbin/arpa2fst --disambig-symbol=#0 --write-symbol-table=/kaggle/working/preppedtext/fst/phn_to_words_sil/kaldi_dict.kenlm.wrd.o40003.txt /kaggle/working/preppedtext/kenlm.wrd.o40003.arpa /kaggle/working/preppedtext/fst/phn_to_words_sil/G_kenlm.wrd.o40003.fst 
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 22665 to 12144
[2021-05-30 15:50:13,918][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words_sil/kaldi_lexicon.phn.kenlm.wrd.o40003.txt (in units file: /kaggle/working/preppedtext/fst/phn_to_words_sil/kaldi_dict.phn.txt)
[2021-05-30 15:50:14,005][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words_sil/H.phn.fst
[2021-05-30 15:50:14,045][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words_sil/L.phn.kenlm.wrd.o40003.fst (in units: /kaggle/working/preppedtext/fst/phn_to_words_sil/kaldi_dict.phn_disambig.txt)
[2021-05-30 15:50:14,244][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words_sil/LG.phn.kenlm.wrd.o40003.fst
[2021-05-30 15:50:15,269][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words_sil/HLGa.phn.kenlm.wrd.o40003.fst
[2021-05-30 15:50:17,600][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words_sil/HLG.phn.kenlm.wrd.o40003.fst
[2021-05-30 15:50:26,782][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words/kaldi_dict.phn.txt
[2021-05-30 15:50:26,783][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words/G_kenlm.wrd.o40003.fst
/opt/kaldi/src/lmbin/arpa2fst --disambig-symbol=#0 --write-symbol-table=/kaggle/working/preppedtext/fst/phn_to_words/kaldi_dict.kenlm.wrd.o40003.txt /kaggle/working/preppedtext/kenlm.wrd.o40003.arpa /kaggle/working/preppedtext/fst/phn_to_words/G_kenlm.wrd.o40003.fst 
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 22665 to 12144
[2021-05-30 15:50:26,992][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words/kaldi_lexicon.phn.kenlm.wrd.o40003.txt (in units file: /kaggle/working/preppedtext/fst/phn_to_words/kaldi_dict.phn.txt)
[2021-05-30 15:50:27,047][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words/H.phn.fst
[2021-05-30 15:50:27,088][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words/L.phn.kenlm.wrd.o40003.fst (in units: /kaggle/working/preppedtext/fst/phn_to_words/kaldi_dict.phn_disambig.txt)
[2021-05-30 15:50:27,281][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words/LG.phn.kenlm.wrd.o40003.fst
[2021-05-30 15:50:28,293][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words/HLGa.phn.kenlm.wrd.o40003.fst
[2021-05-30 15:50:31,245][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_words/HLG.phn.kenlm.wrd.o40003.fst
=== 1/5 Counting and sorting n-grams ===
Reading /kaggle/working/preppedtext/phones/lm.phones.filtered.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 63676 types 44
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:528 2:2571437824 3:4821446144 4:7714313728
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Statistics:
1 44 D1=0.5 D2=1 D3+=1.5
2 1053 D1=0.421189 D2=1.06361 D3+=1.49793
3 8534 D1=0.558099 D2=1.17765 D3+=1.45173
4 23058 D1=0.643934 D2=1.15876 D3+=1.53884
Memory estimate for binary LM:
type     kB
probing 631 assuming -p 1.5
probing 687 assuming -r models -p 1.5
trie    203 without quantization
trie     88 assuming -q 8 -b 8 quantization 
trie    196 assuming -a 22 array pointer compression
trie     81 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:528 2:16848 3:170680 4:553392
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:528 2:16848 3:170680 4:553392
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:14916784 kB	VmRSS:7056 kB	RSSMax:2973864 kB	user:0.209899	sys:0.716931	CPU:0.926881	real:0.936705
Reading /kaggle/working/preppedtext/phones/lm.phones.filtered.04.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
=== 1/5 Counting and sorting n-grams ===
Reading /kaggle/working/preppedtext/phones/lm.phones.filtered.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 63676 types 44
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:528 2:929673728 3:1743138176 4:2789021184 5:4067322624 6:5578042368
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Statistics:
1 44 D1=0.5 D2=1 D3+=1.5
2 1053 D1=0.421189 D2=1.06361 D3+=1.49793
3 8534 D1=0.558099 D2=1.17765 D3+=1.45173
4 23058 D1=0.704256 D2=1.25425 D3+=1.63465
5 35879 D1=0.821218 D2=1.34714 D3+=1.61281
6 43593 D1=0.834579 D2=1.24241 D3+=1.56972
Memory estimate for binary LM:
type      kB
probing 2373 assuming -p 1.5
probing 2775 assuming -r models -p 1.5
trie     907 without quantization
trie     401 assuming -q 8 -b 8 quantization 
trie     838 assuming -a 22 array pointer compression
trie     331 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:528 2:16848 3:170680 4:553392 5:1004612 6:1394976
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:528 2:16848 3:170680 4:553392 5:1004612 6:1394976
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:14949572 kB	VmRSS:6500 kB	RSSMax:2354520 kB	user:0.256512	sys:0.588288	CPU:0.84484	real:0.81585
Reading /kaggle/working/preppedtext/phones/lm.phones.filtered.06.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
[2021-05-30 15:50:35,812][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_phn_sil/kaldi_dict.phn.txt
[2021-05-30 15:50:35,812][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_phn_sil/G_lm.phones.filtered.06.fst
/opt/kaldi/src/lmbin/arpa2fst --disambig-symbol=#0 --write-symbol-table=/kaggle/working/preppedtext/fst/phn_to_phn_sil/kaldi_dict.lm.phones.filtered.06.txt /kaggle/working/preppedtext/phones/lm.phones.filtered.06.arpa /kaggle/working/preppedtext/fst/phn_to_phn_sil/G_lm.phones.filtered.06.fst 
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.0~1-2b62]:HeaderAvailable():arpa-lm-compiler.cc:300) Reverting to slower state tracking because model is large: 6-gram with symbols up to 47
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \4-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \5-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:Read():arpa-file-parser.cc:149) Reading \6-grams: section.
LOG (arpa2fst[5.5.0~1-2b62]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 67529 to 67528
[2021-05-30 15:50:36,696][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_phn_sil/kaldi_lexicon.phn.lm.phones.filtered.06.txt (in units file: /kaggle/working/preppedtext/fst/phn_to_phn_sil/kaldi_dict.phn.txt)
[2021-05-30 15:50:36,713][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_phn_sil/H.phn.fst
[2021-05-30 15:50:36,754][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_phn_sil/L.phn.lm.phones.filtered.06.fst (in units: /kaggle/working/preppedtext/fst/phn_to_phn_sil/kaldi_dict.phn_disambig.txt)
[2021-05-30 15:50:36,802][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_phn_sil/LG.phn.lm.phones.filtered.06.fst
[2021-05-30 15:50:37,700][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_phn_sil/HLGa.phn.lm.phones.filtered.06.fst
[2021-05-30 15:50:40,759][__main__][INFO] - Creating /kaggle/working/preppedtext/fst/phn_to_phn_sil/HLG.phn.lm.phones.filtered.06.fst