Interspeech 2021 papers
Notes on interesting papers from Interspeech 2021.
End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Xuefei Liu, Zhengqi Wen
@inproceedings{zhang21d_interspeech,
author={Shuai Zhang and Jiangyan Yi and Zhengkun Tian and Ye Bai and Jianhua Tao and Xuefei Liu and Zhengqi Wen},
title={{End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={266--270},
doi={10.21437/Interspeech.2021-1242}
}
Spelling correction as a language model conditioned on acoustic features, for code-switching ASR.
Dataset:
- ASRU 2019 Mandarin-English code-switching Challenge dataset
- 500 hours Mandarin
- 200 hours code-switching
- only used code-switching
Augmentation:
- ASR text: hypotheses for training the corrector are generated by decoding the training data
- 10-fold cross validation (each fold decoded by a model trained on the remaining folds)
- beam size 10
- audio:
- SpecAugment (see the masking sketch below)
- dropout
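A minimal sketch of SpecAugment-style masking on filter-bank features, assuming torchaudio's FrequencyMasking/TimeMasking transforms; the mask widths and feature shape below are placeholders, not the paper's settings.

```python
import torch
import torchaudio.transforms as T

# Placeholder mask widths; the paper's SpecAugment settings are not given in these notes.
freq_mask = T.FrequencyMasking(freq_mask_param=8)   # mask up to 8 Mel bins
time_mask = T.TimeMasking(time_mask_param=40)       # mask up to 40 frames

def spec_augment(fbank: torch.Tensor) -> torch.Tensor:
    """fbank: (channel, n_mels, time) log-Mel filter-bank features."""
    return time_mask(freq_mask(fbank))

augmented = spec_augment(torch.randn(1, 40, 300))   # dummy 40-dim, 300-frame input
```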
Metric:
- Mix error rate (MER): word-level errors for English, character-level errors for Mandarin (see the sketch below)
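To make the metric concrete, a rough MER sketch under the assumption that English segments are scored as words and Mandarin segments as characters over a single mixed token sequence; the tokenizer is a simplification, not the challenge's scoring script.

```python
from typing import List

def mixed_tokens(text: str) -> List[str]:
    """Split into Mandarin characters and whitespace-delimited English words."""
    tokens, word = [], ""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":       # CJK ideograph becomes its own token
            if word:
                tokens.append(word); word = ""
            tokens.append(ch)
        elif ch.isspace():
            if word:
                tokens.append(word); word = ""
        else:
            word += ch                        # accumulate an English word
    if word:
        tokens.append(word)
    return tokens

def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Standard Levenshtein distance over token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def mer(ref: str, hyp: str) -> float:
    r, h = mixed_tokens(ref), mixed_tokens(hyp)
    return edit_distance(r, h) / len(r)

print(mer("我 想 listen to music", "我 想 listen two music"))   # 1 error / 5 tokens = 0.2
```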
Experimental setup:
- ASR
- Kaldi
- 40-dim Mel filter-bank
- 25ms windowing
- 10ms frame shift
- 3 × 2D CNN downsampling layers w/ stride 2 for acoustic features (see the front-end sketch after this list)
- attention dimension 256 for both encoder and decoder
- 4 attention heads
- position-wise feed-forward networks dim 1024
- 12 encoder blocks, 6 decoder blocks
- LM
- 6-gram, KenLM
- unidirectional LSTM
- Spelling correction
- Encoder/decoder dims 256, num. heads: 4
- position-wise feed-forward networks dim 512
- dimension conversion layer to unify text & acoustic features
- uniform label smoothing, 0.1
- residual dropout: 0.1 applied to each sub-block
- learning rate schedule with warm-up
- average last 5 checkpoints
- wordpiece vocab: 1k English wordpieces, plus Chinese characters appearing more than 5 times in the training set
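A minimal sketch of the acoustic front-end described above (40-dim log-Mel filter-bank, 25 ms window, 10 ms shift, then three stride-2 2D convolutions), assuming torchaudio's Kaldi-compatible fbank; the wav path, channel width, kernel size, and padding are placeholders rather than the paper's exact choices.

```python
import torch
import torch.nn as nn
import torchaudio

# 40-dim log-Mel filter-bank, 25 ms window, 10 ms frame shift (Kaldi-compatible).
waveform, sample_rate = torchaudio.load("utt.wav")       # placeholder path
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=40, frame_length=25.0, frame_shift=10.0,
    sample_frequency=sample_rate)                          # shape: (time, 40)

class ConvFrontEnd(nn.Module):
    """Three 2D conv layers with stride 2, downsampling time by roughly 8x."""
    def __init__(self, out_channels: int = 256):           # placeholder width
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, out_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, n_mels) -> add a channel dim for Conv2d
        x = self.convs(feats.unsqueeze(1))
        b, c, t, f = x.shape
        # flatten channels x reduced Mel dim into one feature vector per frame
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)

frames = ConvFrontEnd()(fbank.unsqueeze(0))                # (1, ~time/8, 256 * 5)
```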
Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties
Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David R. Mortensen, Michael R. Marlo, Graham Neubig
@inproceedings{siminyu21_interspeech,
author={Kathleen Siminyu and Xinjian Li and Antonios Anastasopoulos and David R. Mortensen and Michael R. Marlo and Graham Neubig},
title={{Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={271--275},
doi={10.21437/Interspeech.2021-1434}
}
Fine-tuning of Allosaurus (a “universal phone recognizer”) on three Luhya language varieties.
Data:
- Saamia
- bible.is via CMU Wilderness
- 18.2 hours
- Bukusu
- Dictionary pronunciations
- 3.7 hours
- East Tusom
- Tusom2021
- 55.3 minutes
- G2P with epitran
- Splits:
- Bukusu: 6442 (train), 1001 (dev), 2458 (test)
- Saamia: 7254 (train), 1000 (dev), 1500 (test)
- East Tusom: 1600 (train), 400 (dev), 392 (test)
Experiment:
- fine-tuning set sizes (utterances): 10, 25, 50, 100, 250, 500, and 1000 (approx. doubling progression)
- fine-tuning is done on one model: same encoder (see the usage sketch after this list)
- 6-layer BiLSTM encoder
- hidden size 1024 per layer
- 250 epochs of fine tuning
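For context, a minimal usage sketch of the Allosaurus package's recognizer API, assuming the pip-installed `allosaurus.app.read_recognizer` interface; the wav path and language code are placeholders, and the paper additionally supplies Luhya phone inventories and fine-tunes the shared encoder rather than using the stock inventories shown here.

```python
from allosaurus.app import read_recognizer

# Load the pretrained universal phone recognizer.
model = read_recognizer()

# Universal inventory: decode against the full phone set.
print(model.recognize("utt.wav"))            # placeholder wav path

# Constrained inventory (the "& constraint" rows below): restrict the output
# phones to one language's inventory via its ISO 639-3 code (placeholder code).
print(model.recognize("utt.wav", "swa"))
```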
Results: PER (relative improvement over the constrained-inventory baseline)

| Model | Bukusu | Saamia | East Tusom |
|---|---|---|---|
| Allosaurus | 72.8 | 63.7 | 67.5 |
| & constraint | 52.5 | 37.4 | 56.7 |
| & fine-tuning (100) | 41.2 (21.5%) | 15.5 (58.5%) | 44.8 (20.9%) |
| & fine-tuning (1000) | 17.3 (67.0%) | 11.7 (65.7%) | 34.6 (38.9%) |
| & fine-tuning (all) | 5.2 (90.1%) | 9.2 (75.4%) | 33.1 (41.6%) |
Exploring wav2vec 2.0 on Speaker Verification and Language Identification
Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu
@inproceedings{fan21_interspeech,
author={Zhiyun Fan and Meng Li and Shiyu Zhou and Bo Xu},
title={{Exploring wav2vec 2.0 on Speaker Verification and Language Identification}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1509--1513},
doi={10.21437/Interspeech.2021-1280}
}
Fine-tunes a monolingual English wav2vec 2.0 model for speaker verification and/or language identification.
- Fine tuning
- average pooling layer and fully connected layer
- Loss: cross-entropy (AM-softmax for speaker classification)
- Fine tuning, multi-task (speaker + language)
- average pooling, two parallel fully connected layers
- loss is a weighted sum of the two task losses (see the head sketch after this list)
- Datasets
- VoxCeleb1 (speaker verification)
- AP17-OLR (language ID)
- Metric
- Equal error rate (EER; computation sketch after the results below)
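A minimal sketch of the multi-task head described above, assuming mean pooling over wav2vec 2.0 frame features, an AM-softmax speaker classifier, a plain cross-entropy language classifier, and a weighted sum of the two losses; the feature dimension, class counts, margin/scale, and loss weight are placeholders, and the wav2vec 2.0 encoder itself is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax: logits are scale * (cos(theta) - margin) for the target class."""
    def __init__(self, dim: int, n_classes: int, margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # (B, n_classes)
        onehot = F.one_hot(target, cos.size(1)).float()
        logits = self.scale * (cos - self.margin * onehot)
        return F.cross_entropy(logits, target)

class MultiTaskHead(nn.Module):
    """Mean-pool frame features, then two parallel classifiers with a weighted joint loss."""
    def __init__(self, feat_dim: int = 768, n_speakers: int = 1211, n_langs: int = 10,
                 lang_weight: float = 0.5):                          # placeholder sizes/weight
        super().__init__()
        self.spk_loss = AMSoftmaxLoss(feat_dim, n_speakers)
        self.lang_fc = nn.Linear(feat_dim, n_langs)
        self.lang_weight = lang_weight

    def forward(self, frames: torch.Tensor, spk: torch.Tensor, lang: torch.Tensor):
        pooled = frames.mean(dim=1)                                  # (B, T, D) -> (B, D)
        loss_spk = self.spk_loss(pooled, spk)
        loss_lang = F.cross_entropy(self.lang_fc(pooled), lang)
        return loss_spk + self.lang_weight * loss_lang

# Toy usage with random stand-in "wav2vec 2.0" frame features.
head = MultiTaskHead()
frames = torch.randn(4, 50, 768)
loss = head(frames, torch.randint(0, 1211, (4,)), torch.randint(0, 10, (4,)))
```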
Results (single-task), EER %:
- SV: 3.61
- LID: 3.47
Results (multi-task), EER %:
- SV: 4.18
- LID: 4.88
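As a reference for the metric, a small sketch of how EER can be computed from trial scores by sweeping a threshold until the false-acceptance and false-rejection rates meet; this is a generic recipe, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-acceptance rate equals false-rejection rate.

    scores: similarity scores, higher means more likely a target trial.
    labels: 1 for target trials, 0 for impostor trials.
    """
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy example: a few target (1) and impostor (0) trials.
scores = np.array([0.9, 0.8, 0.75, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,    1,   0,   0])
print(equal_error_rate(scores, labels))   # ~0.33 on this toy data
```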
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration
Shreya Khare, Ashish Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj
@inproceedings{khare21_interspeech,
author={Shreya Khare and Ashish Mittal and Anuj Diwan and Sunita Sarawagi and Preethi Jyothi and Samarth Bharadwaj},
title={{Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1529--1533},
doi={10.21437/Interspeech.2021-2062}
}
Uses text in a second language transliterated to target language to augment training data for ASR.
The authors observe that for languages such as Hindi and Telugu, where the KL distance between phone distributions is small and the transliteration PER is low, they get consistent gains across architectures and training-data sizes (see the phone-distribution sketch at the end of this entry).
- G2P: epitran, g2ps
- Tools: ESPnet, wav2vec2
- Datasets
- Microsoft Speech Corpus (Indian Languages): Gujarati and Telugu
- Hindi ASR Challenge dataset
- OpenSLR Large Bengali dataset
- Zeroth Korean
- ALFFA Amharic
There is a significant improvement across both training settings when using Hindi instead of English during pretraining; using both transliterated English and Hindi data during pretraining for the 10-hour Gujarati task further reduces WER from 55.8% to 32.4%.
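To illustrate the phone-distribution comparison mentioned above, a rough sketch assuming Epitran's `Epitran(code).trans_list(word)` G2P interface: it phonemizes small text samples in two languages and computes a smoothed KL divergence between the resulting phone unigram distributions. The language codes and sample sentences are placeholders, and the paper estimates these distributions on full corpora rather than toy text.

```python
from collections import Counter
import math

import epitran

def phone_distribution(lang_code, sentences):
    """Unigram phone counts from G2P output (placeholder sample text)."""
    epi = epitran.Epitran(lang_code)
    counts = Counter()
    for sent in sentences:
        for word in sent.split():
            counts.update(epi.trans_list(word))   # list of IPA segments
    return counts

def kl_divergence(p, q, eps=1e-6):
    """Smoothed KL(p || q) over the union of observed phones."""
    phones = set(p) | set(q)
    p_tot, q_tot = sum(p.values()), sum(q.values())
    kl = 0.0
    for ph in phones:
        pp = (p[ph] + eps) / (p_tot + eps * len(phones))
        qq = (q[ph] + eps) / (q_tot + eps * len(phones))
        kl += pp * math.log(pp / qq)
    return kl

# Placeholder language codes and sentences; the paper uses full corpora.
hin = phone_distribution("hin-Deva", ["नमस्ते दुनिया"])   # Hindi (Devanagari)
guj = phone_distribution("guj-Gujr", ["નમસ્તે દુનિયા"])    # Gujarati
print(kl_divergence(guj, hin))
```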
Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning
Keqi Deng, Songjun Cao, Long Ma
@inproceedings{deng21b_interspeech,
author={Keqi Deng and Songjun Cao and Long Ma},
title={{Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1504--1508},
doi={10.21437/Interspeech.2021-1186}
}
Fine-tunes wav2vec 2.0 for accented speech recognition and accent identification.
Dataset:
- LibriSpeech (pretrain)
- AESRC2020 (finetune)
TODO
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
AST: Audio Spectrogram Transformer
Y-Vector: Multiscale Waveform Encoder for Speaker Embedding
Speech Acoustic Modelling Using Raw Source and Filter Components
Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition
Rethinking Evaluation in ASR: Are Our Models Robust Enough?
wav2vec-C: A Self-Supervised Model for Speech Representation Learning
Multimodal Speech Summarization Through Semantic Concept Learning
A Noise Robust Method for Word-Level Pronunciation Assessment
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS
Phonetically Motivated Self-Supervised Speech Representation Learning
slimIPL: Language-Model-Free Iterative Pseudo-Labeling
Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation
A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
On the Learning Dynamics of Semi-Supervised Training for ASR
Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning
Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios
Multi-Channel Transformer Transducer for Speech Recognition
Scaling Sparsemax Based Channel Selection for Speech Recognition with ad-hoc Microphone Arrays
IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition
Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification