Interesting links, 13/04/2023
Misc. interesting things.
Conjunctive/Imperative/Subjunctive mood in Hungarian
Splitting coverbs from verb root
linto-ai/whisper-timestamped — Multilingual Automatic Speech Recognition with word-level timestamps and confidence
lucidrains/medical-chatgpt — Implementation of ChatGPT, but tailored towards primary care medicine, with the reward being able to collect patient histories in a thorough and efficient manner and come up with a reasonable differential diagnosis
DIVA-DIA/Text-Line-Segmentation-Method-for-Medieval-Manuscripts
personwhofloat/Line-Segmentation-Model — LSM (Line Segmentation Model) is a model for text line segmentation in document images, robust to color, brightness, and page warping.
Tensor Calculus 0: Introduction
Matplotlib Graphs in Research Papers
“The duck pond”: showcase of TikZ-drawn animals/ducks
neonbjb/ocotillo — Performant and accurate speech recognition built on Pytorch
Customer Case Study: Building an end-to-end Speech Recognition model in PyTorch with AssemblyAI
@inproceedings{gao22b_interspeech,
author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},
title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={2063--2067},
doi={10.21437/Interspeech.2022-9996}
}
chroma-core/chroma — the AI-native open-source embedding database
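A minimal usage sketch, written from memory of the project's README (the collection name and documents are my own, and the exact API may have drifted since):

```python
import chromadb  # pip install chromadb

client = chromadb.Client()                      # in-memory client
collection = client.create_collection(name="notes")

# Chroma embeds the documents with its default embedding function
# and indexes them for nearest-neighbour search.
collection.add(
    documents=[
        "RWKV is an RNN with transformer-level language-model performance",
        "SiFiGAN is a fast, pitch-controllable neural vocoder",
    ],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["pitch controllable vocoder"], n_results=1)
print(results["documents"])
```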
Winfredy/SadTalker — Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation, demo
sdatkinson/neural-amp-modeler — Neural network emulator for guitar amplifiers.
facebookresearch/Aria_data_tools — Aria data tools provide an open-source C++ and Python toolkit for interacting with data from Project Aria
bootphon/articulatory_inversion — Inversion-articulatoire is a Python library for training and testing neural network models for acoustic-to-articulatory reconstruction.
@ARTICLE{9640504,
author={Shahrebabaki, Abdolreza Sabzi and Salvi, Giampiero and Svendsen, Torbjørn and Siniscalchi, Sabato Marco},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models},
year={2022},
volume={30},
number={},
pages={135-147},
doi={10.1109/TASLP.2021.3133218}
}
AI is becoming powerful in 2023.
But most people feel left out with million things happening around AI.
Here's a MEGA THREAD🧵 (with resources) to keep you up-to-date:
— Barsee 🐶 (@heyBarsee) February 25, 2023
SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization, code
@misc{kim2022soda,
title={SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization},
author={Hyunwoo Kim and Jack Hessel and Liwei Jiang and Ximing Lu and Youngjae Yu and Pei Zhou and Ronan Le Bras and Malihe Alikhani and Gunhee Kim and Maarten Sap and Yejin Choi},
year={2022},
eprint={2212.10465},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This is a baby GPT with two tokens 0/1 and context length of 3, viewing it as a finite state markov chain. It was trained on the sequence "111101111011110" for 50 iterations. The parameters and the architecture of the Transformer modifies the probabilities on the arrows. E.g. we… pic.twitter.com/vj10nZEXlH
— Andrej Karpathy (@karpathy) April 9, 2023
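As a toy illustration of the state space in that tweet (my own sketch, not Karpathy's code): with two tokens and context length 3 there are 2³ = 8 states, and emitting a token moves you to the context with the oldest bit dropped. The trained GPT puts learned probabilities on the arrows; the counts below just come from the training string itself.

```python
from collections import Counter, defaultdict

seq = "111101111011110"  # training sequence from the tweet
ctx_len = 3              # so the Markov chain has 2**3 = 8 states

# Count which token follows each length-3 context in the training string.
counts = defaultdict(Counter)
for i in range(ctx_len, len(seq)):
    counts[seq[i - ctx_len:i]][seq[i]] += 1

# Each context is a state; emitting token t moves state abc -> bct.
for state in sorted(counts):
    total = sum(counts[state].values())
    for tok, c in sorted(counts[state].items()):
        print(f"{state} --{tok} (p~{c / total:.2f})--> {state[1:] + tok}")
```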
Differentiable Finite State Machines
jerryjliu/llama_index — a project that provides a central interface to connect your LLMs with external data
auspicious3000/SpeechSplit — Unsupervised Speech Decomposition Via Triple Information Bottleneck
bjelkenhed/whisper-large-sv, train
KBLab/rixvox — Finding Speeches in the Riksdag’s Debates; RixVox: A Swedish Speech Corpus with 5500 Hours of Speech from Parliamentary Debates, code
Ultra fast ControlNet with Diffusers
Jzuluaga/wav2vec2-xls-r-300m-en-atc-uwb-atcc-and-atcosim
USC-TIMIT: a database of multimodal speech production data
Haskins_IEEE_Rate_Comparison_DB
lucidrains/lion-pytorch — Lion, new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch
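A tiny sketch of what the Lion update looks like, as I understand it from the paper (interpolate momentum with the gradient, keep only the sign, then refresh the momentum with a second beta); variable names are mine, not the repo's:

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion step on a NumPy parameter array; returns updated (param, m)."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # only the sign is used
    param = param - lr * (update + wd * param)        # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad                # momentum uses a second beta
    return param, m
```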
lucidrains/denoising-diffusion-pytorch — Implementation of Denoising Diffusion Probabilistic Model in Pytorch
lucidrains/audiolm-pytorch — Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
lucidrains/robotic-transformer-pytorch — Implementation of RT1 (Robotic Transformer) in Pytorch
lucidrains/recurrent-interface-network-pytorch — Implementation of Recurrent Interface Network (RIN), for highly efficient generation of images and video without cascading networks, in Pytorch
lucidrains/memory-efficient-attention-pytorch — Implementation of a memory efficient multi-head attention as proposed in the paper, “Self-attention Does Not Need O(n²) Memory”
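The trick behind that paper, as a rough NumPy sketch of my own (not the repo's code): walk over the keys/values in chunks and carry a running max, weighted sum, and normaliser, so the full n×m score matrix is never materialised.

```python
import numpy as np

def chunked_attention(q, k, v, chunk=128):
    """Softmax attention computed over key/value chunks with a streaming softmax."""
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    denom = np.zeros((n, 1))
    running_max = np.full((n, 1), -np.inf)
    for start in range(0, k.shape[0], chunk):
        s = q @ k[start:start + chunk].T / np.sqrt(d)           # (n, chunk) scores
        new_max = np.maximum(running_max, s.max(axis=1, keepdims=True))
        scale = np.exp(running_max - new_max)                   # rescale old sums
        p = np.exp(s - new_max)
        out = out * scale + p @ v[start:start + chunk]
        denom = denom * scale + p.sum(axis=1, keepdims=True)
        running_max = new_max
    return out / denom

# Sanity check against the quadratic-memory reference.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(64, 32)), rng.normal(size=(300, 32)), rng.normal(size=(300, 32))
s = q @ k.T / np.sqrt(32)
w = np.exp(s - s.max(axis=1, keepdims=True))
assert np.allclose(chunked_attention(q, k, v), w @ v / w.sum(axis=1, keepdims=True))
```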
lucidrains/PaLM-rlhf-pytorch — Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM
hazyResearch/flash-attention — Fast and memory-efficient exact attention
lucidrains/make-a-video-pytorch — Implementation of Make-A-Video, new SOTA text to video generator from Meta AI, in Pytorch
lucidrains/rvq-vae-gpt — My attempts at applying Soundstream design on learned tokenization of text and then applying hierarchical attention to text generation
lucidrains/imagen-pytorch — Implementation of Imagen, Google’s Text-to-Image Neural Network, in Pytorch
crowsonkb/v-diffusion-jax — v objective diffusion inference code for JAX.
Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech
@misc{fazelzarandi2023cocktail,
title={Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech},
author={Maryam Fazel-Zarandi and Wei-Ning Hsu},
year={2023},
eprint={2303.11131},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
BlinkDL/RWKV-LM — RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it’s combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, “infinite” ctx_len, and free sentence embedding.
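A numerically naive, per-channel sketch of the weighted-key-value recurrence as I read it from the RWKV write-ups (w is a positive channel decay, u a bonus for the current token; an assumption-laden simplification, not the repo's kernel). The point is that inference only needs a constant-size state (a, b) per channel:

```python
import numpy as np

def wkv_rnn(k, v, w, u):
    """Run an RWKV-style weighted-key-value recurrence over one channel.

    k, v: (T,) per-step keys and values; w: decay > 0; u: current-token bonus.
    Returns the (T,) outputs. Naive version without the usual max-tracking trick.
    """
    a = b = 0.0                                # running weighted sum and normaliser
    out = []
    for kt, vt in zip(k, v):
        out.append((a + np.exp(u + kt) * vt) / (b + np.exp(u + kt)))
        a = np.exp(-w) * a + np.exp(kt) * vt   # decay old contributions
        b = np.exp(-w) * b + np.exp(kt)
    return np.array(out)
```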
cohogain/whisper-large-v2-ga-IE, cohogain/whisper-medium-ga-IE-cv11-fleurs-livaud
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace, code
Efficient Audio Captioning Transformer with Patchout and Text Guidance
@misc{kouzelis2023efficient,
title={Efficient Audio Captioning Transformer with Patchout and Text Guidance},
author={Thodoris Kouzelis and Grigoris Bastas and Athanasios Katsamanis and Alexandros Potamianos},
year={2023},
eprint={2304.02916},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
Continuous Speech Separation with Conformer
@misc{chen2020continuous,
title={Continuous Speech Separation with Conformer},
author={Sanyuan Chen and Yu Wu and Zhuo Chen and Jian Wu and Jinyu Li and Takuya Yoshioka and Chengyi Wang and Shujie Liu and Ming Zhou},
year={2020},
eprint={2008.05773},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition
@inproceedings{handekabil22_interspeech,
author={Selen {Hande Kabil} and Herve Bourlard},
title={From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1061--1065},
doi={10.21437/Interspeech.2022-11390}
}
Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition
@inproceedings{sustek22_interspeech,
author={Martin Sustek and Samik Sadhu and Hynek Hermansky},
title={Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1046--1050},
doi={10.21437/Interspeech.2022-11139}
}
A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery
@inproceedings{vandermerwe22_interspeech,
author={Werner {van der Merwe} and Herman Kamper and Johan {Adam du Preez}},
title={A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1426--1430},
doi={10.21437/Interspeech.2022-11369}
}
@inproceedings{xie22b_interspeech,
author={Jiamin Xie and John H.L. Hansen},
title={DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1392--1396},
doi={10.21437/Interspeech.2022-11172}
}
Knowledge of accent differences can be used to predict speech recognition
@inproceedings{szalay22_interspeech,
author={Tuende Szalay and Mostafa Shahin and Beena Ahmed and Kirrie Ballard},
title={Knowledge of accent differences can be used to predict speech recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1372--1376},
doi={10.21437/Interspeech.2022-10162}
}
@inproceedings{rumberg22b_interspeech,
author={Lars Rumberg and Christopher Gebauer and Hanna Ehlert and Ulrike Lüdtke and Jörn Ostermann},
title={Improving Phonetic Transcriptions of Children’s Speech by Pronunciation Modelling with Constrained CTC-Decoding},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1357--1361},
doi={10.21437/Interspeech.2022-332}
}
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning
@inproceedings{kim22k_interspeech,
author={Eesung Kim and Jae-Jin Jeon and Hyeji Seo and Hoon Kim},
title={Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1411--1415},
doi={10.21437/Interspeech.2022-10245}
}
Probing phoneme, language and speaker information in unsupervised speech representations
@inproceedings{deseyssel22_interspeech,
author={Maureen {de Seyssel} and Marvin Lavechin and Yossi Adi and Emmanuel Dupoux and Guillaume Wisniewski},
title={Probing phoneme, language and speaker information in unsupervised speech representations},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1402--1406},
doi={10.21437/Interspeech.2022-373}
}
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
@inproceedings{shi22b_interspeech,
author={Jiatong Shi and George Saon and David Haws and Shinji Watanabe and Brian Kingsbury},
title={VQ-T: RNN Transducers using Vector-Quantized Prediction Network States},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1656--1660},
doi={10.21437/Interspeech.2022-414}
}
Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder, chomeyama/SiFiGAN
@misc{yoneyama2023sourcefilter,
title={Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder},
author={Reo Yoneyama and Yi-Chiao Wu and Tomoki Toda},
year={2023},
eprint={2210.15533},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
Textless Speech-to-Music Retrieval Using Emotion Similarity
UL2 20B: An Open Source Unified Language Learner, checkpoints
DeBERTa: Decoding-enhanced BERT with Disentangled Attention, code
@misc{he2021deberta,
title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
year={2021},
eprint={2006.03654},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation
@misc{bengio2021flow,
title={Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation},
author={Emmanuel Bengio and Moksh Jain and Maksym Korablyov and Doina Precup and Yoshua Bengio},
year={2021},
eprint={2106.04399},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
thammegowda/016-many-eng-v2 — Many-English v2
Sequence Modeling With CTC, Example CTC Decoder in Python
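Along the lines of the linked example decoder, a minimal greedy (best-path) CTC decode of my own: argmax per frame, merge repeated tokens, drop blanks.

```python
def ctc_greedy_decode(probs, blank=0):
    """probs: (T, V) per-frame scores (need not be normalised); returns token ids."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    decoded, prev = [], None
    for tok in best_path:
        if tok != blank and tok != prev:
            decoded.append(tok)
        prev = tok
    return decoded

# Frames argmax to [1, 1, 0, 2]: the repeated 1 collapses, blank 0 drops -> [1, 2]
frames = [[0.1, 0.8, 0.1], [0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.1, 0.7]]
print(ctc_greedy_decode(frames))
```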
Announcing OpenChatKit, togethercomputer/OpenChatKit, spaces
microsoft/computervision-recipes
An Ultrasound Investigation of Irish Palatalization
lohku, Modèle:se-décl-pari, Modèle:se-décl-contract, Buddhisma, 1700-lohku, okr, Modèle:se-décl-impari-sans-alt, ceahkki, lihtter, hávvi, Fága 6a ja 6á, Conjugaison:same_du_Nord/háddjet