Interesting links, 13/04/2023
Misc. interesting things.
Conjunctive/Imperative/Subjunctive mood in Hungarian
Splitting coverbs from verb root
linto-ai/whisper-timestamped — Multilingual Automatic Speech Recognition with word-level timestamps and confidence
lucidrains/medical-chatgpt — Implementation of ChatGPT, but tailored towards primary care medicine, with the reward being able to collect patient histories in a thorough and efficient manner and come up with a reasonable differential diagnosis
DIVA-DIA/Text-Line-Segmentation-Method-for-Medieval-Manuscripts
personwhofloat/Line-Segmentation-Model — LSM (Line Segmentation Model) is a model for text line segmentation in document images, robust to color, brightness, and page warping.
Tensor Calculus 0: Introduction
Matplotlib Graphs in Research Papers
“The duck pond”: showcase of TikZ-drawn animals/ducks
neonbjb/ocotillo — Performant and accurate speech recognition built on Pytorch
Customer Case Study: Building an end-to-end Speech Recognition model in PyTorch with AssemblyAI
@inproceedings{gao22b_interspeech,
author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},
title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={2063--2067},
doi={10.21437/Interspeech.2022-9996}
}
chroma-core/chroma — the AI-native open-source embedding database
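A minimal usage sketch, written from memory of the project's README (the collection name and documents are my own, and the exact API may have drifted since):

```python
import chromadb  # pip install chromadb

client = chromadb.Client()                      # in-memory client
collection = client.create_collection(name="notes")

# Chroma embeds the documents with its default embedding function
# and indexes them for nearest-neighbour search.
collection.add(
    documents=[
        "RWKV is an RNN with transformer-level language-model performance",
        "SiFiGAN is a fast, pitch-controllable neural vocoder",
    ],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["pitch controllable vocoder"], n_results=1)
print(results["documents"])
```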
Winfredy/SadTalker — Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation, demo
sdatkinson/neural-amp-modeler — Neural network emulator for guitar amplifiers.
facebookresearch/Aria_data_tools — Aria data tools provide an open-source C++ and Python toolkit for interacting with data from Project Aria
bootphon/articulatory_inversion — Inversion-articulatoire is a Python library for training and testing neural network models for acoustic-to-articulatory reconstruction.
@ARTICLE{9640504,
author={Shahrebabaki, Abdolreza Sabzi and Salvi, Giampiero and Svendsen, Torbjørn and Siniscalchi, Sabato Marco},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Acoustic-to-Articulatory Mapping With Joint Optimization of Deep Speech Enhancement and Articulatory Inversion Models},
year={2022},
volume={30},
number={},
pages={135-147},
doi={10.1109/TASLP.2021.3133218}
}
AI is becoming powerful in 2023.
But most people feel left out with million things happening around AI.
Here's a MEGA THREAD🧵 (with resources) to keep you up-to-date:
— Barsee 🐶 (@heyBarsee) February 25, 2023
SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization, code
@misc{kim2022soda,
title={SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization},
author={Hyunwoo Kim and Jack Hessel and Liwei Jiang and Ximing Lu and Youngjae Yu and Pei Zhou and Ronan Le Bras and Malihe Alikhani and Gunhee Kim and Maarten Sap and Yejin Choi},
year={2022},
eprint={2212.10465},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This is a baby GPT with two tokens 0/1 and context length of 3, viewing it as a finite state markov chain. It was trained on the sequence "111101111011110" for 50 iterations. The parameters and the architecture of the Transformer modifies the probabilities on the arrows. E.g. we… pic.twitter.com/vj10nZEXlH
— Andrej Karpathy (@karpathy) April 9, 2023
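As a toy illustration of the state space in that tweet (my own sketch, not Karpathy's code): with two tokens and context length 3 there are 2³ = 8 states, and emitting a token moves you to the context with the oldest bit dropped. The trained GPT puts learned probabilities on the arrows; the counts below just come from the training string itself.

```python
from collections import Counter, defaultdict

seq = "111101111011110"  # training sequence from the tweet
ctx_len = 3              # so the Markov chain has 2**3 = 8 states

# Count which token follows each length-3 context in the training string.
counts = defaultdict(Counter)
for i in range(ctx_len, len(seq)):
    counts[seq[i - ctx_len:i]][seq[i]] += 1

# Each context is a state; emitting token t moves state abc -> bct.
for state in sorted(counts):
    total = sum(counts[state].values())
    for tok, c in sorted(counts[state].items()):
        print(f"{state} --{tok} (p~{c / total:.2f})--> {state[1:] + tok}")
```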
Differentiable Finite State Machines
jerryjliu/llama_index — a project that provides a central interface to connect your LLMs with external data
auspicious3000/SpeechSplit — Unsupervised Speech Decomposition Via Triple Information Bottleneck
bjelkenhed/whisper-large-sv, train
KBLab/rixvox — Finding Speeches in the Riksdag’s Debates; RixVox: A Swedish Speech Corpus with 5500 Hours of Speech from Parliamentary Debates, code
Ultra fast ControlNet with Diffusers
Jzuluaga/wav2vec2-xls-r-300m-en-atc-uwb-atcc-and-atcosim
USC-TIMIT: a database of multimodal speech production data
Haskins_IEEE_Rate_Comparison_DB
lucidrains/lion-pytorch — Lion, new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch
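A tiny sketch of what the Lion update looks like, as I understand it from the paper (interpolate momentum with the gradient, keep only the sign, then refresh the momentum with a second beta); variable names are mine, not the repo's:

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion step on a NumPy parameter array; returns updated (param, m)."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # only the sign is used
    param = param - lr * (update + wd * param)        # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad                # momentum uses a second beta
    return param, m
```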
lucidrains/denoising-diffusion-pytorch — Implementation of Denoising Diffusion Probabilistic Model in Pytorch
lucidrains/audiolm-pytorch — Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
lucidrains/robotic-transformer-pytorch — Implementation of RT1 (Robotic Transformer) in Pytorch
lucidrains/recurrent-interface-network-pytorch — Implementation of Recurrent Interface Network (RIN), for highly efficient generation of images and video without cascading networks, in Pytorch
lucidrains/memory-efficient-attention-pytorch — Implementation of a memory efficient multi-head attention as proposed in the paper, “Self-attention Does Not Need O(n²) Memory”
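The trick behind that paper, as a rough NumPy sketch of my own (not the repo's code): walk over the keys/values in chunks and carry a running max, weighted sum, and normaliser, so the full n×m score matrix is never materialised.

```python
import numpy as np

def chunked_attention(q, k, v, chunk=128):
    """Softmax attention computed over key/value chunks with a streaming softmax."""
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    denom = np.zeros((n, 1))
    running_max = np.full((n, 1), -np.inf)
    for start in range(0, k.shape[0], chunk):
        s = q @ k[start:start + chunk].T / np.sqrt(d)           # (n, chunk) scores
        new_max = np.maximum(running_max, s.max(axis=1, keepdims=True))
        scale = np.exp(running_max - new_max)                   # rescale old sums
        p = np.exp(s - new_max)
        out = out * scale + p @ v[start:start + chunk]
        denom = denom * scale + p.sum(axis=1, keepdims=True)
        running_max = new_max
    return out / denom

# Sanity check against the quadratic-memory reference.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(64, 32)), rng.normal(size=(300, 32)), rng.normal(size=(300, 32))
s = q @ k.T / np.sqrt(32)
w = np.exp(s - s.max(axis=1, keepdims=True))
assert np.allclose(chunked_attention(q, k, v), w @ v / w.sum(axis=1, keepdims=True))
```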
lucidrains/PaLM-rlhf-pytorch — Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM
hazyResearch/flash-attention — Fast and memory-efficient exact attention
lucidrains/make-a-video-pytorch — Implementation of Make-A-Video, new SOTA text to video generator from Meta AI, in Pytorch
lucidrains/rvq-vae-gpt — My attempts at applying Soundstream design on learned tokenization of text and then applying hierarchical attention to text generation
lucidrains/imagen-pytorch — Implementation of Imagen, Google’s Text-to-Image Neural Network, in Pytorch
crowsonkb/v-diffusion-jax — v objective diffusion inference code for JAX.
Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech
@misc{fazelzarandi2023cocktail,
title={Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech},
author={Maryam Fazel-Zarandi and Wei-Ning Hsu},
year={2023},
eprint={2303.11131},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
BlinkDL/RWKV-LM — RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it’s combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, “infinite” ctx_len, and free sentence embedding.
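A numerically naive, per-channel sketch of the weighted-key-value recurrence as I read it from the RWKV write-ups (w is a positive channel decay, u a bonus for the current token; an assumption-laden simplification, not the repo's kernel). The point is that inference only needs a constant-size state (a, b) per channel:

```python
import numpy as np

def wkv_rnn(k, v, w, u):
    """Run an RWKV-style weighted-key-value recurrence over one channel.

    k, v: (T,) per-step keys and values; w: decay > 0; u: current-token bonus.
    Returns the (T,) outputs. Naive version without the usual max-tracking trick.
    """
    a = b = 0.0                                # running weighted sum and normaliser
    out = []
    for kt, vt in zip(k, v):
        out.append((a + np.exp(u + kt) * vt) / (b + np.exp(u + kt)))
        a = np.exp(-w) * a + np.exp(kt) * vt   # decay old contributions
        b = np.exp(-w) * b + np.exp(kt)
    return np.array(out)
```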
cohogain/whisper-large-v2-ga-IE, cohogain/whisper-medium-ga-IE-cv11-fleurs-livaud
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace, code
Efficient Audio Captioning Transformer with Patchout and Text Guidance
@misc{kouzelis2023efficient,
title={Efficient Audio Captioning Transformer with Patchout and Text Guidance},
author={Thodoris Kouzelis and Grigoris Bastas and Athanasios Katsamanis and Alexandros Potamianos},
year={2023},
eprint={2304.02916},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
Continuous Speech Separation with Conformer
@misc{chen2020continuous,
title={Continuous Speech Separation with Conformer},
author={Sanyuan Chen and Yu Wu and Zhuo Chen and Jian Wu and Jinyu Li and Takuya Yoshioka and Chengyi Wang and Shujie Liu and Ming Zhou},
year={2020},
eprint={2008.05773},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition
@inproceedings{handekabil22_interspeech,
author={Selen {Hande Kabil} and Herve Bourlard},
title={From Undercomplete to Sparse Overcomplete Autoencoders to Improve LF-MMI based Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1061--1065},
doi={10.21437/Interspeech.2022-11390}
}
Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition
@inproceedings{sustek22_interspeech,
author={Martin Sustek and Samik Sadhu and Hynek Hermansky},
title={Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1046--1050},
doi={10.21437/Interspeech.2022-11139}
}
A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery
@inproceedings{vandermerwe22_interspeech,
author={Werner {van der Merwe} and Herman Kamper and Johan {Adam du Preez}},
title={A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1426--1430},
doi={10.21437/Interspeech.2022-11369}
}
@inproceedings{xie22b_interspeech,
author={Jiamin Xie and John H.L. Hansen},
title={DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1392--1396},
doi={10.21437/Interspeech.2022-11172}
}
Knowledge of accent differences can be used to predict speech recognition
@inproceedings{szalay22_interspeech,
author={Tuende Szalay and Mostafa Shahin and Beena Ahmed and Kirrie Ballard},
title={Knowledge of accent differences can be used to predict speech recognition},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1372--1376},
doi={10.21437/Interspeech.2022-10162}
}
@inproceedings{rumberg22b_interspeech,
author={Lars Rumberg and Christopher Gebauer and Hanna Ehlert and Ulrike Lüdtke and Jörn Ostermann},
title={Improving Phonetic Transcriptions of Children’s Speech by Pronunciation Modelling with Constrained CTC-Decoding},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1357--1361},
doi={10.21437/Interspeech.2022-332}
}
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning
@inproceedings{kim22k_interspeech,
author={Eesung Kim and Jae-Jin Jeon and Hyeji Seo and Hoon Kim},
title={Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1411--1415},
doi={10.21437/Interspeech.2022-10245}
}
Probing phoneme, language and speaker information in unsupervised speech representations
@inproceedings{deseyssel22_interspeech,
author={Maureen {de Seyssel} and Marvin Lavechin and Yossi Adi and Emmanuel Dupoux and Guillaume Wisniewski},
title={Probing phoneme, language and speaker information in unsupervised speech representations},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1402--1406},
doi={10.21437/Interspeech.2022-373}
}
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
@inproceedings{shi22b_interspeech,
author={Jiatong Shi and George Saon and David Haws and Shinji Watanabe and Brian Kingsbury},
title={VQ-T: RNN Transducers using Vector-Quantized Prediction Network States},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={1656--1660},
doi={10.21437/Interspeech.2022-414}
}
Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder, chomeyama/SiFiGAN
@misc{yoneyama2023sourcefilter,
title={Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder},
author={Reo Yoneyama and Yi-Chiao Wu and Tomoki Toda},
year={2023},
eprint={2210.15533},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
Textless Speech-to-Music Retrieval Using Emotion Similarity
UL2 20B: An Open Source Unified Language Learner, checkpoints
DeBERTa: Decoding-enhanced BERT with Disentangled Attention, code
@misc{he2021deberta,
title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
year={2021},
eprint={2006.03654},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation
@misc{bengio2021flow,
title={Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation},
author={Emmanuel Bengio and Moksh Jain and Maksym Korablyov and Doina Precup and Yoshua Bengio},
year={2021},
eprint={2106.04399},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
thammegowda/016-many-eng-v2 — Many-English v2
Sequence Modeling With CTC, Example CTC Decoder in Python
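Along the lines of the linked example decoder, a minimal greedy (best-path) CTC decode of my own: argmax per frame, merge repeated tokens, drop blanks.

```python
def ctc_greedy_decode(probs, blank=0):
    """probs: (T, V) per-frame scores (need not be normalised); returns token ids."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    decoded, prev = [], None
    for tok in best_path:
        if tok != blank and tok != prev:
            decoded.append(tok)
        prev = tok
    return decoded

# Frames argmax to [1, 1, 0, 2]: the repeated 1 collapses, blank 0 drops -> [1, 2]
frames = [[0.1, 0.8, 0.1], [0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.1, 0.7]]
print(ctc_greedy_decode(frames))
```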
Announcing OpenChatKit, togethercomputer/OpenChatKit, spaces
microsoft/computervision-recipes
An Ultrasound Investigation of Irish Palatalization
lohku, Modèle:se-décl-pari, Modèle:se-décl-contract, Buddhisma, 1700-lohku, okr, Modèle:se-décl-impari-sans-alt, ceahkki, lihtter, hávvi, Fága 6a ja 6á, Conjugaison:same_du_Nord/háddjet