Interesting links, 16/08/2023

TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition

@misc{xue2023tranusr,
      title={TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition}, 
      author={Hongfei Xue and Qijie Shao and Peikun Chen and Pengcheng Guo and Lei Xie and Jie Liu},
      year={2023},
      eprint={2305.13629},
}

On the Transferability of Whisper-based Representations for “In-the-Wild” Cross-Task Downstream Speech Applications

@misc{chemudupati2023transferability,
      title={On the Transferability of Whisper-based Representations for ``In-the-Wild'' Cross-Task Downstream Speech Applications},
      author={Vamsikrishna Chemudupati and Marzieh Tahaei and Heitor Guimaraes and Arthur Pimentel and Anderson Avila and Mehdi Rezagholizadeh and Boxing Chen and Tiago Falk},
      year={2023},
      eprint={2305.14546},
}

CASA-ASR: Context-Aware Speaker-Attributed ASR

@misc{shi2023casaasr,
      title={CASA-ASR: Context-Aware Speaker-Attributed ASR}, 
      author={Mohan Shi and Zhihao Du and Qian Chen and Fan Yu and Yangze Li and Shiliang Zhang and Jie Zhang and Li-Rong Dai},
      year={2023},
      eprint={2305.12459},
}

CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training, samples

@misc{ye2023clapspeech,
      title={CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training}, 
      author={Zhenhui Ye and Rongjie Huang and Yi Ren and Ziyue Jiang and Jinglin Liu and Jinzheng He and Xiang Yin and Zhou Zhao},
      year={2023},
      eprint={2305.10763},
}

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

@misc{rekesh2023fast,
      title={Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition}, 
      author={Dima Rekesh and Samuel Kriman and Somshubra Majumdar and Vahid Noroozi and He Huang and Oleksii Hrinchuk and Ankur Kumar and Boris Ginsburg},
      year={2023},
      eprint={2305.05084},
}

The HARPY Speech Recognition System

Multi-Task and Transfer Learning in Low-Resource Speech Recognition

PaLI: Scaling Language-Image Learning in 100+ Languages

google/mt5-small

google/matcha-base

google/matcha-chart2text-pew — This model is the MatCha model, fine-tuned on the Chart2text-pew dataset. This fine-tuned checkpoint might be better suited for chart summarization tasks.

google/matcha-plotqa-v2 — This model is the MatCha model, fine-tuned on the PlotQA-v2 dataset. This fine-tuned checkpoint might be better suited for plot question answering tasks.

FLEURS Irish — all non-native, from what I’ve checked.

budzianowski/multiwoz — Source code for end-to-end dialogue model from the MultiWOZ paper

salesforce/DialogStudio — DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI

Implementation of the Branchformer

facebookresearch/Ego4d — Ego4d dataset repository. Download the dataset, visualize, extract features & example usage of the dataset. (Data has an awful licence).

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

@misc{han2020contextnet,
      title={ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context}, 
      author={Wei Han and Zhengdong Zhang and Yu Zhang and Jiahui Yu and Chung-Cheng Chiu and James Qin and Anmol Gulati and Ruoming Pang and Yonghui Wu},
      year={2020},
      eprint={2005.03191},
}

JOIST: A Joint Speech and Text Streaming Model For ASR

@misc{sainath2022joist,
      title={JOIST: A Joint Speech and Text Streaming Model For ASR}, 
      author={Tara N. Sainath and Rohit Prabhavalkar and Ankur Bapna and Yu Zhang and Zhouyuan Huo and Zhehuai Chen and Bo Li and Weiran Wang and Trevor Strohman},
      year={2022},
      eprint={2210.07353},
}

Improving Joint Speech-Text Representations Without Alignment

@misc{peyser2023improving,
      title={Improving Joint Speech-Text Representations Without Alignment}, 
      author={Cal Peyser and Zhong Meng and Ke Hu and Rohit Prabhavalkar and Andrew Rosenberg and Tara N. Sainath and Michael Picheny and Kyunghyun Cho},
      year={2023},
      eprint={2308.06125},
}