Interesting links, 07/09/2022
Misc. interesting things.
Towards End-to-end Unsupervised Speech Recognition
@misc{liu2022towards,
  doi = {10.48550/ARXIV.2204.02492},
  url = {https://arxiv.org/abs/2204.02492},
  author = {Liu, Alexander H. and Hsu, Wei-Ning and Auli, Michael and Baevski, Alexei},
  title = {Towards End-to-end Unsupervised Speech Recognition},
  year = {2022},
}
@misc{wang2018segmental,
  doi = {10.48550/ARXIV.1808.02228},
  url = {https://arxiv.org/abs/1808.02228},
  author = {Wang, Yu-Hsuan and Lee, Hung-yi and Lee, Lin-shan},
  title = {Segmental Audio Word2Vec: Representing Utterances as Sequences of Vectors with Applications in Spoken Term Detection},
  year = {2018},
}
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing, microsoft/SpeechT5
@misc{ao2021speecht5,
  doi = {10.48550/ARXIV.2110.07205},
  url = {https://arxiv.org/abs/2110.07205},
  author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
  title = {SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
  year = {2021},
}
How to load pretrained models in PyTorch
Multilingual and Multimodal Learning for Brazilian Portuguese
RoomReader: A Multimodal Corpus of Online Multiparty Conversational Interactions
Investigating Independence vs. Control: Agenda-Setting in Russian News Coverage on Social Media
Diachronic Parsing of Pre-Standard Irish
probabilisticai/probai-2022, videos
Using AI to decode speech from brain activity
data2vec-vision ONNX ready-made configuration
Add a TF in-graph tokenizer for BERT
google/lyra — A Very Low-Bitrate Codec for Speech Compression
MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models
Transflower: probabilistic autoregressive dance generation with multimodal attention, code
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
An investigation of phone-based subword units for end-to-end speech recognition
Sequence-to-sequence learning with Transducers
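Several of the transducer links above (RNN-T, optimized_transducer, Emformer RNN-T) revolve around the same loss: the negative log of the total probability of all monotonic alignments between T frames and U target labels, computed by a forward recursion over the T×U lattice. A toy pure-Python sketch of that recursion, for intuition only — real implementations (e.g. optimized_transducer, torchaudio's `rnnt_loss`) operate on batched GPU tensors:

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_log_likelihood(emit, blank):
    """Forward algorithm over the RNN-T alignment lattice (Graves, 2012).

    emit[t][u]  : log P(emit target label u+1 | frame t, u labels emitted so far)
    blank[t][u] : log P(blank | frame t, u labels emitted so far)
    Shapes: emit is T x U, blank is T x (U+1).
    Returns log P(target | input), summed over all monotonic alignments.
    """
    T, U = len(blank), len(blank[0]) - 1
    NEG_INF = float("-inf")
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive here either by consuming a frame (blank) or a label (emit).
            from_blank = alpha[t - 1][u] + blank[t - 1][u] if t > 0 else NEG_INF
            from_emit = alpha[t][u - 1] + emit[t][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = logaddexp(from_blank, from_emit)
    # Every alignment ends with a blank that closes the final frame.
    return alpha[T - 1][U] + blank[T - 1][U]
```

With T=2 frames and U=1 label there are exactly two alignments (emit-blank-blank and blank-emit-blank), and the recursion sums their probabilities.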
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
Online ASR with Emformer RNN-T
We published Tuda German model from https://t.co/4xPzWgW6fw https://t.co/7mdkimirTj
— AlphaCephei (@alphacep) August 10, 2022
It is big (4.4 GB) and slightly more accurate than Vosk on audiobooks, and it covers the CV test well:
9.48 (Tuda-de test), 25.82 (podcast), 4.97 (cv-test), 11.01 (mls), 35.20 (mtedx)
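The figures above appear to be word error rates on the named test sets. WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference length; a minimal sketch of the standard computation (not AlphaCephei's own scoring script):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference length.

    Assumes a non-empty reference string.
    """
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between first i reference words
    # and first j hypothesis words.
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub,            # substitution (or match)
                           dp[i - 1][j] + 1,  # deletion
                           dp[i][j - 1] + 1)  # insertion
    return dp[len(r)][len(h)] / len(r)
```

Reported WERs like "4.97" are this ratio times 100, aggregated over a whole test set (total errors over total reference words, not a per-utterance average).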
spaces/k2-fsa/automatic-speech-recognition
csukuangfj/optimized_transducer
Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping
Integrating Lattice-Free MMI into End-to-End Speech Recognition
But what is the Fourier Transform? A visual introduction.
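The linked video builds the Fourier transform geometrically; numerically, the DFT is just X[k] = Σₙ x[n]·e^(−2πikn/N). A naive O(N²) sketch in pure Python (real code would use an FFT, e.g. numpy.fft):

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform: X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# A pure cosine at bin 1 concentrates all its energy in bins 1 and N-1.
x = [math.cos(2 * math.pi * n / 8) for n in range(8)]
X = dft(x)
```

For a real-valued cosine at bin 1 the spectrum splits between bins 1 and N−1, each with magnitude N/2 (here 4), which is why real-signal spectra are mirror-symmetric.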
AudioLM: a Language Modeling Approach to Audio Generation
Layer-wise analysis of a self-supervised speech representation