Interesting links, 07/09/2022
Misc. interesting things.
Towards End-to-end Unsupervised Speech Recognition
@misc{liu2022towards,
  doi = {10.48550/ARXIV.2204.02492},
  url = {https://arxiv.org/abs/2204.02492},
  author = {Liu, Alexander H. and Hsu, Wei-Ning and Auli, Michael and Baevski, Alexei},
  title = {Towards End-to-end Unsupervised Speech Recognition},
  year = {2022},
}
@misc{wang2018segmental,
  doi = {10.48550/ARXIV.1808.02228},
  url = {https://arxiv.org/abs/1808.02228},
  author = {Wang, Yu-Hsuan and Lee, Hung-yi and Lee, Lin-shan},
  title = {Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection},
  year = {2018},
}
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing, microsoft/SpeechT5
@misc{ao2021speecht5,
  doi = {10.48550/ARXIV.2110.07205},
  url = {https://arxiv.org/abs/2110.07205},
  author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
  title = {SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
  year = {2021},
}
How to load pretrained models in PyTorch
Multilingual and Multimodal Learning for Brazilian Portuguese
RoomReader: A Multimodal Corpus of Online Multiparty Conversational Interactions
Investigating Independence vs. Control: Agenda-Setting in Russian News Coverage on Social Media
Diachronic Parsing of Pre-Standard Irish
probabilisticai/probai-2022, videos
Using AI to decode speech from brain activity
data2vec-vision ONNX ready-made configuration
Add a TF in-graph tokenizer for BERT
google/lyra — A Very Low-Bitrate Codec for Speech Compression
MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models
Transflower: probabilistic autoregressive dance generation with multimodal attention, code
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
An investigation of phone-based subword units for end-to-end speech recognition
Sequence-to-sequence learning with Transducers
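Several of the transducer links above (RNN-T, optimized_transducer, Emformer RNN-T) revolve around the same loss: the negative log of the total probability of all monotonic alignments between T frames and U target labels, computed by a forward recursion over the T×U lattice. A toy pure-Python sketch of that recursion, for intuition only — real implementations (e.g. optimized_transducer, torchaudio's `rnnt_loss`) operate on batched GPU tensors:

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_log_likelihood(emit, blank):
    """Forward algorithm over the RNN-T alignment lattice (Graves, 2012).

    emit[t][u]  : log P(emit target label u+1 | frame t, u labels emitted so far)
    blank[t][u] : log P(blank | frame t, u labels emitted so far)
    Shapes: emit is T x U, blank is T x (U+1).
    Returns log P(target | input), summed over all monotonic alignments.
    """
    T, U = len(blank), len(blank[0]) - 1
    NEG_INF = float("-inf")
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive here either by consuming a frame (blank) or a label (emit).
            from_blank = alpha[t - 1][u] + blank[t - 1][u] if t > 0 else NEG_INF
            from_emit = alpha[t][u - 1] + emit[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = logaddexp(from_blank, from_emit)
    # Every alignment ends with a blank that closes the final frame.
    return alpha[T - 1][U] + blank[T - 1][U]
```

With T=2 frames and U=1 label there are exactly two alignments (emit-blank-blank and blank-emit-blank), and the recursion sums their probabilities.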
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
Online ASR with Emformer RNN-T
We published Tuda German model from https://t.co/4xPzWgW6fw https://t.co/7mdkimirTj
— AlphaCephei (@alphacep) August 10, 2022
It is big (4.4 GB) and slightly more accurate than Vosk on audiobooks, and it covers the CV test well:
9.48 (Tuda-de test), 25.82 (podcast), 4.97 (cv-test), 11.01 (mls), 35.20 (mtedx)
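The figures above appear to be word error rates on the named test sets. WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference length; a minimal sketch of the standard computation (not AlphaCephei's own scoring script):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length.

    Assumes a non-empty reference string.
    """
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between first i reference words
    # and first j hypothesis words.
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub,            # substitution (or match)
                           dp[i - 1][j] + 1,  # deletion
                           dp[i][j - 1] + 1)  # insertion
    return dp[len(r)][len(h)] / len(r)
```

Reported WERs like "4.97" are this ratio times 100, aggregated over a whole test set (total errors over total reference words, not a per-utterance average).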
spaces/k2-fsa/automatic-speech-recognition
csukuangfj/optimized_transducer
Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping
Integrating Lattice-Free MMI into End-to-End Speech Recognition
But what is the Fourier Transform? A visual introduction.
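The linked video builds the Fourier transform geometrically; numerically, the DFT is just X[k] = Σₙ x[n]·e^(−2πikn/N). A naive O(N²) sketch in pure Python (real code would use an FFT, e.g. numpy.fft):

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform: X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# A pure cosine at bin 1 concentrates all its energy in bins 1 and N-1.
x = [math.cos(2 * math.pi * n / 8) for n in range(8)]
X = dft(x)
```

For a real-valued cosine at bin 1 the spectrum splits between bins 1 and N−1, each with magnitude N/2 (here 4), which is why real-signal spectra are mirror-symmetric.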
AudioLM: a Language Modeling Approach to Audio Generation
Layer-wise analysis of a self-supervised speech representation