Interspeech 2021 papers
Notes on interesting papers from Interspeech 2021.
End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Xuefei Liu, Zhengqi Wen
@inproceedings{zhang21d_interspeech,
author={Shuai Zhang and Jiangyan Yi and Zhengkun Tian and Ye Bai and Jianhua Tao and Xuefei Liu and Zhengqi Wen},
title={{End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={266--270},
doi={10.21437/Interspeech.2021-1242}
}
Spelling correction as a language model conditioned on acoustic features, for code-switching ASR.
Dataset:
- ASRU 2019 Mandarin-English code-switching Challenge dataset
- 500 hours Mandarin
- 200 hours code-switching
- only used code-switching
Augmentation:
- ASR text: hypotheses for training the corrector are generated by decoding the training data
- 10-fold cross validation (each fold decoded by a model trained on the remaining folds)
- beam size 10
- audio:
- SpecAugment (see the masking sketch below)
- dropout
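A minimal sketch of SpecAugment-style masking on filter-bank features, assuming torchaudio's FrequencyMasking/TimeMasking transforms; the mask widths and feature shape below are placeholders, not the paper's settings.

```python
import torch
import torchaudio.transforms as T

# Placeholder mask widths; the paper's SpecAugment settings are not given in these notes.
freq_mask = T.FrequencyMasking(freq_mask_param=8)   # mask up to 8 Mel bins
time_mask = T.TimeMasking(time_mask_param=40)       # mask up to 40 frames

def spec_augment(fbank: torch.Tensor) -> torch.Tensor:
    """fbank: (channel, n_mels, time) log-Mel filter-bank features."""
    return time_mask(freq_mask(fbank))

augmented = spec_augment(torch.randn(1, 40, 300))   # dummy 40-dim, 300-frame input
```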
Metric:
- Mix error rate (MER): word-level errors for English, character-level errors for Mandarin (see the sketch below)
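To make the metric concrete, a rough MER sketch under the assumption that English segments are scored as words and Mandarin segments as characters over a single mixed token sequence; the tokenizer is a simplification, not the challenge's scoring script.

```python
from typing import List

def mixed_tokens(text: str) -> List[str]:
    """Split into Mandarin characters and whitespace-delimited English words."""
    tokens, word = [], ""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":       # CJK ideograph becomes its own token
            if word:
                tokens.append(word); word = ""
            tokens.append(ch)
        elif ch.isspace():
            if word:
                tokens.append(word); word = ""
        else:
            word += ch                        # accumulate an English word
    if word:
        tokens.append(word)
    return tokens

def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Standard Levenshtein distance over token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def mer(ref: str, hyp: str) -> float:
    r, h = mixed_tokens(ref), mixed_tokens(hyp)
    return edit_distance(r, h) / len(r)

print(mer("我 想 listen to music", "我 想 listen two music"))   # 1 error / 5 tokens = 0.2
```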
Experimental setup:
- ASR
- Kaldi
- 40-dim Mel filter-bank
- 25ms windowing
- 10ms frame shift
- 3 × 2D CNN downsampling layers w/ stride 2 for acoustic features (see the front-end sketch after this list)
- attention dimension 256 for both encoder and decoder
- 4 attention heads
- position-wise feed-forward networks dim 1024
- 12 encoder blocks, 6 decoder blocks
- LM
- 6-gram, KenLM
- unidirectional LSTM
- Spelling correction
- Encoder/decoder dims 256, num. heads: 4
- position-wise feed-forward networks dim 512
- dimension conversion layer to unify text & acoustic features
- uniform label smoothing, 0.1
- residual dropout: 0.1 applied to each sub-block
- learning rate schedule with warm-up
- average last 5 checkpoints
- wordpiece vocab: 1k English wordpieces, plus Chinese characters appearing more than 5 times in the training set
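A minimal sketch of the acoustic front-end described above (40-dim log-Mel filter-bank, 25 ms window, 10 ms shift, then three stride-2 2D convolutions), assuming torchaudio's Kaldi-compatible fbank; the wav path, channel width, kernel size, and padding are placeholders rather than the paper's exact choices.

```python
import torch
import torch.nn as nn
import torchaudio

# 40-dim log-Mel filter-bank, 25 ms window, 10 ms frame shift (Kaldi-compatible).
waveform, sample_rate = torchaudio.load("utt.wav")       # placeholder path
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=40, frame_length=25.0, frame_shift=10.0,
    sample_frequency=sample_rate)                          # shape: (time, 40)

class ConvFrontEnd(nn.Module):
    """Three 2D conv layers with stride 2, downsampling time by roughly 8x."""
    def __init__(self, out_channels: int = 256):           # placeholder width
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, out_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, n_mels) -> add a channel dim for Conv2d
        x = self.convs(feats.unsqueeze(1))
        b, c, t, f = x.shape
        # flatten channels x reduced Mel dim into one feature vector per frame
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)

frames = ConvFrontEnd()(fbank.unsqueeze(0))                # (1, ~time/8, 256 * 5)
```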
Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties
Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David R. Mortensen, Michael R. Marlo, Graham Neubig
@inproceedings{siminyu21_interspeech,
author={Kathleen Siminyu and Xinjian Li and Antonios Anastasopoulos and David R. Mortensen and Michael R. Marlo and Graham Neubig},
title={{Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={271--275},
doi={10.21437/Interspeech.2021-1434}
}
Fine-tuning of Allosaurus (a “universal phone recognizer”) on three Luhya language varieties.
Data:
- Saamia
- bible.is via CMU Wilderness
- 18.2 hours
- Bukusu
- Dictionary pronunciations
- 3.7 hours
- East Tusom
- Tusom2021
- 55.3 minutes
- G2P with epitran
- Splits:
- Bukusu: 6442 (train), 1001 (dev), 2458 (test)
- Saamia: 7254 (train), 1000 (dev), 1500 (test)
- East Tusom: 1600 (train), 400 (dev), 392 (test)
Experiment:
- fine-tuning set sizes (utterances): 10, 25, 50, 100, 250, 500, and 1000 (approx. doubling progression)
- fine-tuning is done on one model: same encoder (see the usage sketch after this list)
- 6-layer BiLSTM encoder
- hidden size 1024 per layer
- 250 epochs of fine tuning
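For context, a minimal usage sketch of the Allosaurus package's recognizer API, assuming the pip-installed `allosaurus.app.read_recognizer` interface; the wav path and language code are placeholders, and the paper additionally supplies Luhya phone inventories and fine-tunes the shared encoder rather than using the stock inventories shown here.

```python
from allosaurus.app import read_recognizer

# Load the pretrained universal phone recognizer.
model = read_recognizer()

# Universal inventory: decode against the full phone set.
print(model.recognize("utt.wav"))            # placeholder wav path

# Constrained inventory (the "& constraint" rows below): restrict the output
# phones to one language's inventory via its ISO 639-3 code (placeholder code).
print(model.recognize("utt.wav", "swa"))
```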
Results: PER (relative improvement over the constrained-inventory baseline)

| Model | Bukusu | Saamia | East Tusom |
|---|---|---|---|
| Allosaurus | 72.8 | 63.7 | 67.5 |
| & constraint | 52.5 | 37.4 | 56.7 |
| & fine-tuning (100) | 41.2 (21.5%) | 15.5 (58.5%) | 44.8 (20.9%) |
| & fine-tuning (1000) | 17.3 (67.0%) | 11.7 (65.7%) | 34.6 (38.9%) |
| & fine-tuning (all) | 5.2 (90.1%) | 9.2 (75.4%) | 33.1 (41.6%) |
Exploring wav2vec 2.0 on Speaker Verification and Language Identification
Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu
@inproceedings{fan21_interspeech,
author={Zhiyun Fan and Meng Li and Shiyu Zhou and Bo Xu},
title={{Exploring wav2vec 2.0 on Speaker Verification and Language Identification}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1509--1513},
doi={10.21437/Interspeech.2021-1280}
}
Fine-tunes a monolingual English wav2vec 2.0 model for speaker verification and/or language identification.
- Fine tuning
- average pooling layer and fully connected layer
- Loss: cross-entropy (AM-softmax for speaker classification)
- Fine tuning, multi-task (speaker + language)
- average pooling, two parallel fully connected layers
- loss is a weighted sum of the two task losses (see the head sketch after this list)
- Datasets
- VoxCeleb1 (speaker verification)
- AP17-OLR (language ID)
- Metric
- Equal error rate (EER; computation sketch after the results below)
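A minimal sketch of the multi-task head described above, assuming mean pooling over wav2vec 2.0 frame features, an AM-softmax speaker classifier, a plain cross-entropy language classifier, and a weighted sum of the two losses; the feature dimension, class counts, margin/scale, and loss weight are placeholders, and the wav2vec 2.0 encoder itself is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax: logits are scale * (cos(theta) - margin) for the target class."""
    def __init__(self, dim: int, n_classes: int, margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # (B, n_classes)
        onehot = F.one_hot(target, cos.size(1)).float()
        logits = self.scale * (cos - self.margin * onehot)
        return F.cross_entropy(logits, target)

class MultiTaskHead(nn.Module):
    """Mean-pool frame features, then two parallel classifiers with a weighted joint loss."""
    def __init__(self, feat_dim: int = 768, n_speakers: int = 1211, n_langs: int = 10,
                 lang_weight: float = 0.5):                          # placeholder sizes/weight
        super().__init__()
        self.spk_loss = AMSoftmaxLoss(feat_dim, n_speakers)
        self.lang_fc = nn.Linear(feat_dim, n_langs)
        self.lang_weight = lang_weight

    def forward(self, frames: torch.Tensor, spk: torch.Tensor, lang: torch.Tensor):
        pooled = frames.mean(dim=1)                                  # (B, T, D) -> (B, D)
        loss_spk = self.spk_loss(pooled, spk)
        loss_lang = F.cross_entropy(self.lang_fc(pooled), lang)
        return loss_spk + self.lang_weight * loss_lang

# Toy usage with random stand-in "wav2vec 2.0" frame features.
head = MultiTaskHead()
frames = torch.randn(4, 50, 768)
loss = head(frames, torch.randint(0, 1211, (4,)), torch.randint(0, 10, (4,)))
```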
Results (single-task), EER %:
- SV: 3.61
- LID: 3.47
Results (multi-task), EER %:
- SV: 4.18
- LID: 4.88
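As a reference for the metric, a small sketch of how EER can be computed from trial scores by sweeping a threshold until the false-acceptance and false-rejection rates meet; this is a generic recipe, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-acceptance rate equals false-rejection rate.

    scores: similarity scores, higher means more likely a target trial.
    labels: 1 for target trials, 0 for impostor trials.
    """
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy example: a few target (1) and impostor (0) trials.
scores = np.array([0.9, 0.8, 0.75, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,    1,   0,   0])
print(equal_error_rate(scores, labels))   # ~0.33 on this toy data
```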
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration
Shreya Khare, Ashish Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj
@inproceedings{khare21_interspeech,
author={Shreya Khare and Ashish Mittal and Anuj Diwan and Sunita Sarawagi and Preethi Jyothi and Samarth Bharadwaj},
title={{Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1529--1533},
doi={10.21437/Interspeech.2021-2062}
}
Uses text in a second language transliterated to target language to augment training data for ASR.
The authors observe that for languages such as Hindi and Telugu, where the KL distance between phone distributions is small and the transliteration PER is low, they get consistent gains across architectures and training-data sizes (see the phone-distribution sketch at the end of this entry).
- G2P: epitran, g2ps
- Tools: ESPnet, wav2vec2
- Datasets
- Microsoft Speech Corpus (Indian Languages): Gujarati and Telugu
- Hindi ASR Challenge dataset
- OpenSLR Large Bengali dataset
- Zeroth Korean
- ALFFA Amharic
There is a significant improvement across both training settings when using Hindi instead of English during pretraining; using both transliterated English and Hindi data during pretraining for the 10-hour Gujarati task further reduces WER from 55.8% to 32.4%.
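To illustrate the phone-distribution comparison mentioned above, a rough sketch assuming Epitran's `Epitran(code).trans_list(word)` G2P interface: it phonemizes small text samples in two languages and computes a smoothed KL divergence between the resulting phone unigram distributions. The language codes and sample sentences are placeholders, and the paper estimates these distributions on full corpora rather than toy text.

```python
from collections import Counter
import math

import epitran

def phone_distribution(lang_code, sentences):
    """Unigram phone counts from G2P output (placeholder sample text)."""
    epi = epitran.Epitran(lang_code)
    counts = Counter()
    for sent in sentences:
        for word in sent.split():
            counts.update(epi.trans_list(word))   # list of IPA segments
    return counts

def kl_divergence(p, q, eps=1e-6):
    """Smoothed KL(p || q) over the union of observed phones."""
    phones = set(p) | set(q)
    p_tot, q_tot = sum(p.values()), sum(q.values())
    kl = 0.0
    for ph in phones:
        pp = (p[ph] + eps) / (p_tot + eps * len(phones))
        qq = (q[ph] + eps) / (q_tot + eps * len(phones))
        kl += pp * math.log(pp / qq)
    return kl

# Placeholder language codes and sentences; the paper uses full corpora.
hin = phone_distribution("hin-Deva", ["नमस्ते दुनिया"])   # Hindi (Devanagari)
guj = phone_distribution("guj-Gujr", ["નમસ્તે દુનિયા"])    # Gujarati
print(kl_divergence(guj, hin))
```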
Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning
Keqi Deng, Songjun Cao, Long Ma
@inproceedings{deng21b_interspeech,
author={Keqi Deng and Songjun Cao and Long Ma},
title={{Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1504--1508},
doi={10.21437/Interspeech.2021-1186}
}
Fine-tunes wav2vec 2.0 for accented speech recognition and accent identification.
Dataset:
- LibriSpeech (pretrain)
- AESRC2020 (finetune)
TODO
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
AST: Audio Spectrogram Transformer
Y-Vector: Multiscale Waveform Encoder for Speaker Embedding
Speech Acoustic Modelling Using Raw Source and Filter Components
Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition
Rethinking Evaluation in ASR: Are Our Models Robust Enough?
wav2vec-C: A Self-Supervised Model for Speech Representation Learning
Multimodal Speech Summarization Through Semantic Concept Learning
A Noise Robust Method for Word-Level Pronunciation Assessment
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS
Phonetically Motivated Self-Supervised Speech Representation Learning
slimIPL: Language-Model-Free Iterative Pseudo-Labeling
Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation
A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
On the Learning Dynamics of Semi-Supervised Training for ASR
Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning
Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios
Multi-Channel Transformer Transducer for Speech Recognition
Scaling Sparsemax Based Channel Selection for Speech Recognition with ad-hoc Microphone Arrays
IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition
Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification