Interesting links, 04/01/2024
Misc. interesting things.
Dao-AILab/flash-attention — Fast and memory-efficient exact attention
facebookincubator/velox — A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
How 🤗 Accelerate runs very large models thanks to PyTorch
karkirowle/relative_phoneme_analysis — Repository for phoneme analysis on word-level Kaldi/ESPNet ASR transcripts
prajdabre/yanmtt — Yet Another Neural Machine Translation Toolkit
google-research-datasets/TextNormalizationCoveringGrammars — Covering grammars for English and Russian text normalization
WavJourney: Compositional Audio Creation with Large Language Models
Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition
It wasn't on my bingo card for 2024-W1, but MSFT dropped a decoder-only embedding model based on Mistral7B-instruct, trained on synthetic retrieval data (+ a bunch of train splits from datasets in BEIR & co...), claiming SotA on MTEB.
— dinos (@din0s_) January 2, 2024
Here are a few things that caught my eye: https://t.co/ObRUkmDgwg
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
thuhcsi/VAENAR-TTS — The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.
Automatic Generation of Subtitles for Videos of the Government of La Rioja
The Properly Illustrated Transformer
Efficient Sequence Transduction by Jointly Predicting Tokens and Durations
Instant3D: Instant Text-to-3D Generation
LRM: Large Reconstruction Model for Single Image to 3D
Can we model syntax from speech?
— Gašper Beguš (@begusgasper) May 9, 2023
Most models of syntax are text-based.
Here we propose that basic syntax can be modeled from raw speech.
GANs trained on individual words start to concatenate them into multiple-word outputs.
Sometimes the model even concatenates three words: pic.twitter.com/rZXAhEulmN
Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks
‘Dair’ (Live @ Urban Assault 2018)
lingjzhu/CharsiuG2P — Multilingual G2P in 100 languages
kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels, code
A🧵on beating the hardware lottery for retrieval: the internals of the late interaction stack.
— Omar Khattab (@lateinteraction) December 20, 2023
ColBERT introduced a quirky multi-vector retrieval architecture. It does wonders for quality.
But how can it search 100M docs in 0.1 sec on CPU? Or store 1 billion embeddings in 20GB? pic.twitter.com/Nc3MDFxrj6
Speculative Decoding for 2x Faster Whisper Inference
SHI-Labs/VCoder — VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023
ConvNets Match Vision Transformers at Scale
SD-HuBERT: Self-Distillation Induces Syllabic Organization in HuBERT
An Introduction to Transformers
Writing a good paper intro is difficult. I mostly recommend a 4-paragraph intro:
— Matthias Niessner (@MattNiessner) November 15, 2023
1) Motivation: Task description / why is it important?
2) Challenge: Why is the problem so difficult?
3) Trends: How does SotA approach it? What's missing?
4) Method: How do you solve it? Contributions!
Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer
Want high-quality Audio embeddings? CLAP! 👏
— Vaibhav (VB) Srivastav (@reach_vb) November 20, 2023
We support the latest general, music and speech CLAP models in Transformers! Use it for Text-to-Speech/ Text-to-Music training and more.
What is CLAP?
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on… pic.twitter.com/iQNF6Um9yJ
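The CLAP objective is CLIP-style: matched audio–text pairs should score higher than all mismatched pairs in the batch. A minimal sketch of that symmetric contrastive loss, with random vectors standing in for real encoder outputs (the function name, dimensions, and temperature are illustrative assumptions, not the library's API):

```python
import numpy as np

def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP/CLAP-style contrastive loss: matched audio-text pairs lie on the diagonal."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(a))

    def xent(l):
        # Softmax cross-entropy where the correct class is the diagonal entry.
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Symmetric: audio->text over rows, text->audio over columns.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = clap_style_loss(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
print(loss)
```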
Open Whisper-style Speech Model (OWSM) 🔉
— Vaibhav (VB) Srivastav (@reach_vb) November 21, 2023
OWSM reproduces Whisper training using an open-source toolkit (ESPnet) and publicly available datasets. OWSM is much more efficient to train and is robust at multi-directional translation.
Open source training, inference scripts and… pic.twitter.com/v9exxwevnO
What is Mixture-of-Experts (MoE)?
— Sophia Yang, Ph.D. (@sophiamyang) December 9, 2023
MoE is a neural network architecture design that integrates layers of experts/models within the Transformer block. As data flows through the MoE layers, each input token is dynamically routed to a subset of the experts for computation. This… pic.twitter.com/AnYeITgHVi
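The routing described above can be sketched in a few lines: a learned gate scores each token against every expert, and only the top-k experts run for that token. Everything here (single-matrix "experts", the gating weights, top-2 routing) is a toy illustration, not any particular model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Toy experts: each is just one linear map (random weights for illustration).
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # gating weights

def moe_layer(tokens):
    """Route each token to its top-k experts and mix their outputs by gate weight."""
    logits = tokens @ router                       # (n_tokens, n_experts)
    out = np.zeros_like(tokens)
    for i, (tok, row) in enumerate(zip(tokens, logits)):
        chosen = np.argsort(row)[-top_k:]          # indices of the top-k experts
        gate = np.exp(row[chosen]) / np.exp(row[chosen]).sum()  # renormalized softmax
        for w, e in zip(gate, chosen):
            out[i] += w * (tok @ experts[e])       # weighted sum of expert outputs
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_layer(tokens).shape)  # (5, 8)
```

Only top_k of the n_experts weight matrices are touched per token, which is the efficiency argument behind Mixtral and the Switch Transformer.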
wellecks/ntptutorial — Tutorial on neural theorem proving
Since Mixture of Expert (MoE) LLMs are all the rage as of this weekend, thanks to the Mixtral-8x-7B release, here's a quick explainer. The figure below shows the architecture behind the Switch Transformer (https://t.co/g3Awj99h24), a great intro to MoEs.
— Sebastian Raschka (@rasbt) December 11, 2023
The model depicted in… pic.twitter.com/2Wg5zjeFXU
THE LITTLE BOOK OF DEEP LEARNING
Solar rises ☀️ @upstageai just released Solar, a 10B open LLM outperforming other LLMs of up to 30B parameters, including Mistral 7B. 🤯 Solar achieves an MMLU score of 65.48, which is only 4 points lower than Meta Llama 2 while being 7x smaller.
— Philipp Schmid (@_philschmid) December 13, 2023
TL;DR:
🦙 Llama 2 architecture… pic.twitter.com/tqgVExY8Yx
fun idea I tested out this morning: Language model fine-tuning in embedding space
— jack morris (@jxmnop) December 13, 2023
here's the idea: learn a model of *embeddings* of a certain text distribution; then, to generate text, sample embedding and map back to text with vec2text
this lets us generate language without… pic.twitter.com/9PPI9q5KiM
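The idea in the thread is: model the distribution of embeddings, sample a new embedding, then invert it back to text. A toy sketch of that loop, with a random matrix standing in for a real text encoder, a single Gaussian as the embedding model, and nearest-neighbour lookup standing in for vec2text inversion (all of which are stand-in assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = ["the cat sat", "dogs bark loudly", "rain falls softly", "code runs fast"]
# Stand-in embeddings; a real setup would use a text encoder and vec2text for step 3.
emb = rng.standard_normal((len(corpus), 32))

# 1) Fit a model of the embedding distribution (here: one diagonal Gaussian).
mu, sigma = emb.mean(axis=0), emb.std(axis=0)

# 2) Sample a new embedding from that model.
sample = mu + sigma * rng.standard_normal(32)

# 3) Map the sampled embedding back to text; nearest neighbour stands in for vec2text.
scores = emb @ sample
decoded = corpus[int(np.argmax(scores))]
print(decoded)
```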
open-mmlab/Amphion — Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
SpeechAct: Towards Generating Whole-body Motion from Speech
Fine-tuning Whisper for Dutch Language: The Crucial Role of Size
Introduction to Speech Processing
OML-Team/open-metric-learning — Library for metric learning pipelines and models.
haotian-liu/LLaVA — [NeurIPS’23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Simplifying Transformer Blocks
Advanced RAG Techniques: an Illustrated Overview
Nvidia presents Incremental FastPitch
— AK (@_akhaliq) January 4, 2024
Chunk-based High Quality Text to Speech
paper page: https://t.co/v1FxDzo7uM
Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process… pic.twitter.com/pM4fnSdMAo
I created my YouTube series on Reinforcement Learning because I saw it applied profitably at Lyft. It was a counterexample to the stigma: "RL is only good for scenarios where a perfect simulator can be accessed endlessly. It's general-but-slow trial-and-error."
— DJ (@DuaneJRich) January 4, 2024
There's truth… pic.twitter.com/wowDxJaUWy
Parakeet RNNT & CTC models top the Open ASR Leaderboard! 👑
— Vaibhav (VB) Srivastav (@reach_vb) January 2, 2024
Brought to you by @NVIDIAAI and @suno_ai_, parakeet beats Whisper and regains its first place.
The models are released under a commercially permissive license! 🥳
The models inherit the same FastConformer… pic.twitter.com/jF96yecZ1t
The RAG wave is here to stay, but in practice, it's hard to retrieve the right docs w/ embeddings, & better IR models are hard to use!
— Ben Clavié (@bclavie) January 4, 2024
Let's fix that: Introducing 🪤RAGatouille, a lib to train & use the SotA retrieval model, ColBERT, in just a few lines of code! https://t.co/VRHiGQl0Xv pic.twitter.com/0EpOfV6UWn
Progress on dense retrievers is saturating.
— Omar Khattab (@lateinteraction) December 18, 2023
The best retrievers in 2024 will apply new forms of late interaction, i.e. scalable attention-like scoring for multi-vector embeddings.
A🧵on late interaction, how it works efficiently, and why/where it's been shown to improve quality pic.twitter.com/2XG33TtM9R
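The "attention-like scoring for multi-vector embeddings" in the thread is ColBERT's MaxSim: each query token takes the similarity of its best-matching document token, and those maxima are summed. A minimal sketch with random unit vectors (shapes and names are illustrative, not the ColBERT codebase):

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: sum, over query tokens, of the best doc-token match."""
    sims = query_vecs @ doc_vecs.T      # (n_q, n_d); cosine sims if rows are unit-norm
    return sims.max(axis=1).sum()       # max over doc tokens, summed over query tokens

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((20, 16)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = late_interaction_score(q, d)
print(score)
```

Because the per-token maxima decompose over document tokens, the search can be served from approximate nearest-neighbour indexes over individual token vectors, which is what makes the CPU-scale numbers in the thread plausible.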
Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models
LLM Augmented LLMs: Expanding Capabilities through Composition
Everyone building RAG uses dense embedding retrieval, but simply doing cosine distance doesn’t always capture fine-grained similarity.
— Jerry Liu (@jerryjliu0) January 5, 2024
That’s why SOTA retrieval like ColBERT models are so important; these new architectures are fast but more powerful than pure dense retrieval.… pic.twitter.com/W2RPBBxml4
Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory
Is it possible to teach LLMs a different language? 🤔 Can we transfer the capabilities of LLMs, like Llama, from English to non-English language?
— Philipp Schmid (@_philschmid) January 4, 2024
A group of researchers from Fudan University tried to answer those questions by running vast experiments on extending vocabulary… pic.twitter.com/fJLYFyQOqP
This AI Paper from Meta Introduces Hyper-VolTran: A Novel Neural Network for Transformative 3D Reconstruction and Rendering, paper
Phi-2: The surprising power of small language models
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
MotionScript: Natural Language Descriptions for Expressive 3D Human Motions
pjyazdian/Gesture2Vec — This is an official PyTorch implementation of “Gesture2Vec: Clustering Gestures using Representation Learning Methods for Co-speech Gesture Generation” (IROS 2022).
neuromorphs/NIR — Neuromorphic Intermediate Representation reference implementation
PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques
What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs
Love how RAGatouille makes it so easy to train new ColBERTs.
— Omar Khattab (@lateinteraction) January 4, 2024
ColBERT's real power is you can train it with as little as a few hundred queries. Other dense retrievers need tens of thousands!
Maybe the test for @bclavie's library is whether we see an uptick in ColBERT downloads😆 https://t.co/TnTPT0smff pic.twitter.com/n4hnHQODqB
100 tiny changes to transform your life: from the one-minute rule to pyjama yoga
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
LiteLlama: Reduced-Scale Llama — An open-source reproduction of Meta AI’s LLaMA 2 at significantly reduced scale: LiteLlama-460M-1T has 460M parameters trained on 1T tokens.
Token 1.3: What is Retrieval-Augmented Generation (RAG)?
VikParuchuri/surya — Accurate line-level text detection and recognition (OCR) in any language
gchrupala/neurospoken — Neural models of spoken language - LOT Winter school 2024
I just found a great introduction to embedding.
— Christoph Molnar 🦋 christophmolnar.bsky.social (@ChristophMolnar) January 12, 2024
The book is comprehensive yet short. Historical encoding tools, neural nets, and production - all covered.
Fantastic job by @vboykis. Thanks for making it free to read!
Looking forward to diving in. https://t.co/uFwaSjaysn pic.twitter.com/SKl2ExOJaw
My AI Timelines Have Sped Up (Again)
Mixtral 8x7B is currently the best open-source LLM, surpassing GPT-3.5
Foundations of Vector Retrieval
GARField: Group Anything with Radiance Fields
AlphaGeometry: An Olympiad-level AI system for geometry, code
Less horror. Probably full of typos.
— François Fleuret (@francoisfleuret) January 18, 2024
Source tex there: https://t.co/M1CPZs1kPl https://t.co/Spiy0JvC3f pic.twitter.com/9e4FdQol3b
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding