Interesting links, 04/01/2024
Misc. interesting things.
Dao-AILab/flash-attention — Fast and memory-efficient exact attention
facebookincubator/velox — A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
How 🤗 Accelerate runs very large models thanks to PyTorch
karkirowle/relative_phoneme_analysis — Repository for phoneme analysis on word-level Kaldi/ESPNet ASR transcripts
prajdabre/yanmtt — Yet Another Neural Machine Translation Toolkit
google-research-datasets/TextNormalizationCoveringGrammars — Covering grammars for English and Russian text normalization
WavJourney: Compositional Audio Creation with Large Language Models
Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition
It wasn't on my bingo card for 2024-W1, but MSFT dropped a decoder-only embedding model based on Mistral7B-instruct, trained on synthetic retrieval data (+ a bunch of train splits from datasets in BEIR & co...), claiming SotA on MTEB.
— dinos (@din0s_) January 2, 2024
Here are a few things that caught my eye: https://t.co/ObRUkmDgwg
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
thuhcsi/VAENAR-TTS — The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.
Automatic Generation of Subtitles for Videos of the Government of La Rioja
The Properly Illustrated Transformer
Efficient Sequence Transduction by Jointly Predicting Tokens and Durations
Instant3D: Instant Text-to-3D Generation
LRM: Large Reconstruction Model for Single Image to 3D
Can we model syntax from speech?
— Gašper Beguš (@begusgasper) May 9, 2023
Most models of syntax are text-based.
Here we propose that basic syntax can be modeled from raw speech.
GANs trained on individual words start to concatenate them into multiple-word outputs.
Sometimes the model even concatenates three words: pic.twitter.com/rZXAhEulmN
Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks
‘Dair’ (Live @ Urban Assault 2018)
lingjzhu/CharsiuG2P — Multilingual G2P in 100 languages
kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels, code
A🧵on beating the hardware lottery for retrieval: the internals of the late interaction stack.
— Omar Khattab (@lateinteraction) December 20, 2023
ColBERT introduced a quirky multi-vector retrieval architecture. It does wonders for quality.
But how can it search 100M docs in 0.1 sec on CPU? Or store 1 billion embeddings in 20GB? pic.twitter.com/Nc3MDFxrj6
Speculative Decoding for 2x Faster Whisper Inference
SHI-Labs/VCoder — VCoder: Versatile Vision Encoders for Multimodal Large Language Models, arXiv 2023
ConvNets Match Vision Transformers at Scale
SD-HuBERT: Self-Distillation Induces Syllabic Organization in HuBERT
An Introduction to Transformers
Writing a good paper intro is difficult. I mostly recommend a 4-paragraph intro:
— Matthias Niessner (@MattNiessner) November 15, 2023
1) Motivation: Task description / why is it important?
2) Challenge: Why is the problem so difficult?
3) Trends: How does SotA approach it? What's missing?
4) Method: How do you solve it? Contributions!
Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer
Want high-quality Audio embeddings? CLAP! 👏
— Vaibhav (VB) Srivastav (@reach_vb) November 20, 2023
We support the latest general, music and speech CLAP models in Transformers! Use it for Text-to-Speech/ Text-to-Music training and more.
What is CLAP?
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on… pic.twitter.com/iQNF6Um9yJ
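The CLAP objective is CLIP-style: matched audio–text pairs should score higher than all mismatched pairs in the batch. A minimal sketch of that symmetric contrastive loss, with random vectors standing in for real encoder outputs (the function name, dimensions, and temperature are illustrative assumptions, not the library's API):

```python
import numpy as np

def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP/CLAP-style contrastive loss: matched audio-text pairs lie on the diagonal."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(a))

    def xent(l):
        # Softmax cross-entropy where the correct class is the diagonal entry.
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Symmetric: audio->text over rows, text->audio over columns.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = clap_style_loss(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
print(loss)
```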
Open Whisper-style Speech Model (OWSM) 🔉
— Vaibhav (VB) Srivastav (@reach_vb) November 21, 2023
OWSM reproduces Whisper training using an open-source toolkit (ESPnet) and publicly available datasets. OWSM is much more efficient to train and is robust at multi-directional translation.
Open source training, inference scripts and… pic.twitter.com/v9exxwevnO
What is Mixture-of-Experts (MoE)?
— Sophia Yang, Ph.D. (@sophiamyang) December 9, 2023
MoE is a neural network architecture design that integrates layers of experts/models within the Transformer block. As data flows through the MoE layers, each input token is dynamically routed to a subset of the experts for computation. This… pic.twitter.com/AnYeITgHVi
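The routing described above can be sketched in a few lines: a learned gate scores each token against every expert, and only the top-k experts run for that token. Everything here (single-matrix "experts", the gating weights, top-2 routing) is a toy illustration, not any particular model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Toy experts: each is just one linear map (random weights for illustration).
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # gating weights

def moe_layer(tokens):
    """Route each token to its top-k experts and mix their outputs by gate weight."""
    logits = tokens @ router                       # (n_tokens, n_experts)
    out = np.zeros_like(tokens)
    for i, (tok, row) in enumerate(zip(tokens, logits)):
        chosen = np.argsort(row)[-top_k:]          # indices of the top-k experts
        gate = np.exp(row[chosen]) / np.exp(row[chosen]).sum()  # renormalized softmax
        for w, e in zip(gate, chosen):
            out[i] += w * (tok @ experts[e])       # weighted sum of expert outputs
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_layer(tokens).shape)  # (5, 8)
```

Only top_k of the n_experts weight matrices are touched per token, which is the efficiency argument behind Mixtral and the Switch Transformer.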
wellecks/ntptutorial — Tutorial on neural theorem proving
Since Mixture of Expert (MoE) LLMs are all the rage as of this weekend, thanks to the Mixtral-8x-7B release, here's a quick explainer. The figure below shows the architecture behind the Switch Transformer (https://t.co/g3Awj99h24), a great intro to MoEs.
— Sebastian Raschka (@rasbt) December 11, 2023
The model depicted in… pic.twitter.com/2Wg5zjeFXU
THE LITTLE BOOK OF DEEP LEARNING
Solar rises ☀️ @upstageai just released Solar, a 10B open LLM outperforming other LLMs of up to 30B parameters, including Mistral 7B. 🤯 Solar achieves an MMLU score of 65.48, which is only 4 points lower than Meta Llama 2 while being 7x smaller.
— Philipp Schmid (@_philschmid) December 13, 2023
TL;DR:
🦙 Llama 2 architecture… pic.twitter.com/tqgVExY8Yx
fun idea I tested out this morning: Language model fine-tuning in embedding space
— jack morris (@jxmnop) December 13, 2023
here's the idea: learn a model of *embeddings* of a certain text distribution; then, to generate text, sample embedding and map back to text with vec2text
this lets us generate language without… pic.twitter.com/9PPI9q5KiM
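The idea in the thread is: model the distribution of embeddings, sample a new embedding, then invert it back to text. A toy sketch of that loop, with a random matrix standing in for a real text encoder, a single Gaussian as the embedding model, and nearest-neighbour lookup standing in for vec2text inversion (all of which are stand-in assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = ["the cat sat", "dogs bark loudly", "rain falls softly", "code runs fast"]
# Stand-in embeddings; a real setup would use a text encoder and vec2text for step 3.
emb = rng.standard_normal((len(corpus), 32))

# 1) Fit a model of the embedding distribution (here: one diagonal Gaussian).
mu, sigma = emb.mean(axis=0), emb.std(axis=0)

# 2) Sample a new embedding from that model.
sample = mu + sigma * rng.standard_normal(32)

# 3) Map the sampled embedding back to text; nearest neighbour stands in for vec2text.
scores = emb @ sample
decoded = corpus[int(np.argmax(scores))]
print(decoded)
```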
open-mmlab/Amphion — Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
SpeechAct: Towards Generating Whole-body Motion from Speech
Fine-tuning Whisper for Dutch Language: The Crucial Role of Size
Introduction to Speech Processing
OML-Team/open-metric-learning — Library for metric learning pipelines and models.
haotian-liu/LLaVA — [NeurIPS’23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Simplifying Transformer Blocks
Advanced RAG Techniques: an Illustrated Overview
Nvidia presents Incremental FastPitch
— AK (@_akhaliq) January 4, 2024
Chunk-based High Quality Text to Speech
paper page: https://t.co/v1FxDzo7uM
Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process… pic.twitter.com/pM4fnSdMAo
I created my YouTube series on Reinforcement Learning because I saw it applied profitably at Lyft. It was a counterexample to the stigma: "RL is only good for scenarios where a perfect simulator can be accessed endlessly. It's general-but-slow trial-and-error."
— DJ (@DuaneJRich) January 4, 2024
There's truth… pic.twitter.com/wowDxJaUWy
Parakeet RNNT & CTC models top the Open ASR Leaderboard! 👑
— Vaibhav (VB) Srivastav (@reach_vb) January 2, 2024
Brought to you by @NVIDIAAI and @suno_ai_, parakeet beats Whisper and regains its first place.
The models are released under a commercially permissive license! 🥳
The models inherit the same FastConformer… pic.twitter.com/jF96yecZ1t
The RAG wave is here to stay, but in practice, it's hard to retrieve the right docs w/ embeddings, & better IR models are hard to use!
— Ben Clavié (@bclavie) January 4, 2024
Let's fix that: Introducing 🪤RAGatouille, a lib to train & use the SotA retrieval model, ColBERT, in just a few lines of code! https://t.co/VRHiGQl0Xv pic.twitter.com/0EpOfV6UWn
Progress on dense retrievers is saturating.
— Omar Khattab (@lateinteraction) December 18, 2023
The best retrievers in 2024 will apply new forms of late interaction, i.e. scalable attention-like scoring for multi-vector embeddings.
A🧵on late interaction, how it works efficiently, and why/where it's been shown to improve quality pic.twitter.com/2XG33TtM9R
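The "attention-like scoring for multi-vector embeddings" in the thread is ColBERT's MaxSim: each query token takes the similarity of its best-matching document token, and those maxima are summed. A minimal sketch with random unit vectors (shapes and names are illustrative, not the ColBERT codebase):

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: sum, over query tokens, of the best doc-token match."""
    sims = query_vecs @ doc_vecs.T      # (n_q, n_d); cosine sims if rows are unit-norm
    return sims.max(axis=1).sum()       # max over doc tokens, summed over query tokens

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((20, 16)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = late_interaction_score(q, d)
print(score)
```

Because the per-token maxima decompose over document tokens, the search can be served from approximate nearest-neighbour indexes over individual token vectors, which is what makes the CPU-scale numbers in the thread plausible.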
Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models
LLM Augmented LLMs: Expanding Capabilities through Composition
Everyone building RAG uses dense embedding retrieval, but simply doing cosine distance doesn’t always capture fine-grained similarity.
— Jerry Liu (@jerryjliu0) January 5, 2024
That’s why SOTA retrieval like ColBERT models are so important; these new architectures are fast but more powerful than pure dense retrieval.… pic.twitter.com/W2RPBBxml4
Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory
Is it possible to teach LLMs a different language? 🤔 Can we transfer the capabilities of LLMs, like Llama, from English to non-English language?
— Philipp Schmid (@_philschmid) January 4, 2024
A group of researchers from Fudan University tried to answer those questions by running vast experiments on extending vocabulary… pic.twitter.com/fJLYFyQOqP
This AI Paper from Meta Introduces Hyper-VolTran: A Novel Neural Network for Transformative 3D Reconstruction and Rendering, paper
Phi-2: The surprising power of small language models
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
MotionScript: Natural Language Descriptions for Expressive 3D Human Motions
pjyazdian/Gesture2Vec — This is an official PyTorch implementation of “Gesture2Vec: Clustering Gestures using Representation Learning Methods for Co-speech Gesture Generation” (IROS 2022).
neuromorphs/NIR — Neuromorphic Intermediate Representation reference implementation
PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques
What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs
Love how RAGatouille makes it so easy to train new ColBERTs.
— Omar Khattab (@lateinteraction) January 4, 2024
ColBERT's real power is you can train it with as little as a few hundred queries. Other dense retrievers need tens of thousands!
Maybe the test for @bclavie's library is whether we see an uptick in ColBERT downloads😆 https://t.co/TnTPT0smff pic.twitter.com/n4hnHQODqB
100 tiny changes to transform your life: from the one-minute rule to pyjama yoga
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
LiteLlama: Reduced-Scale Llama — An open-source reproduction of Meta AI’s LLaMA 2 at significantly reduced scale: LiteLlama-460M-1T has 460M parameters trained on 1T tokens.
Token 1.3: What is Retrieval-Augmented Generation (RAG)?
VikParuchuri/surya — Accurate line-level text detection and recognition (OCR) in any language
gchrupala/neurospoken — Neural models of spoken language - LOT Winter school 2024
I just found a great introduction to embedding.
— Christoph Molnar 🦋 christophmolnar.bsky.social (@ChristophMolnar) January 12, 2024
The book is comprehensive yet short. Historical encoding tools, neural nets, and production - all covered.
Fantastic job by @vboykis. Thanks for making it free to read!
Looking forward to diving in. https://t.co/uFwaSjaysn pic.twitter.com/SKl2ExOJaw
My AI Timelines Have Sped Up (Again)
Mixtral 8x7B is currently the best open-source LLM, surpassing GPT-3.5
Foundations of Vector Retrieval
GARField: Group Anything with Radiance Fields
AlphaGeometry: An Olympiad-level AI system for geometry, code
Less horror. Probably full of typos.
— François Fleuret (@francoisfleuret) January 18, 2024
Source tex there: https://t.co/M1CPZs1kPl https://t.co/Spiy0JvC3f pic.twitter.com/9e4FdQol3b
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding