Interesting links, 29/06/2024

here is my meticulously curated (and highly biased) summer paper reading list 📚:

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)
╰╴ https://t.co/UmRVv3YVua

LoRA: Low-Rank Adaptation of Large Language Models (2021)
╰╴ https://t.co/A7VHVnjMPt

Ring… https://t.co/WWgcbVK601
— dr. jack morris (@jxmnop) June 10, 2024

sherpa-onnx - audio-tagging-from-a-file

A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement

Evaluating Gammatone Frequency Cepstral Coefficients with Neural Networks for Emotion Recognition from Speech

Foundation Transformers

This book is a MUST read if you’re working in the field of maths.
Link below👇 pic.twitter.com/H8S4HMvvzB
— ₕₐₘₚₜₒₙ (@hamptonism) June 19, 2024

llm.c by Hand✍️

C programming + matrix multiplication by hand

This combination is perhaps as low as we can get to explain how the Transformer works.

Special thanks to @karpathy for encouraging early feedback and @7etsuo for helping me understand the pragma magic.

I hope… pic.twitter.com/jx1Ye0r0ei
— Tom Yeh (@ProfTomYeh) June 4, 2024

These 94 lines of code are everything that is needed to train a neural network. Everything else is just efficiency.

This is my earlier project Micrograd. It implements a scalar-valued auto-grad engine. You start with some numbers at the leafs (usually the input data and the… pic.twitter.com/2zVJP3cNJ0
— Andrej Karpathy (@karpathy) June 21, 2024
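
To make the idea of a scalar-valued autograd engine concrete, here is a minimal sketch in the same spirit. It is not the Micrograd code itself: the `Value` class below is a hypothetical cut-down version that supports only addition and multiplication, and the example numbers are made up.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative, not Micrograd's actual code)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(out)/d(self) = other and d(out)/d(other) = self
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule from the output back.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# Leaf values (made-up numbers), a tiny expression, and one backward pass.
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
loss = a * b + c
loss.backward()
print(loss.data, a.grad, b.grad, c.grad)
```

Each operation records its inputs and a small closure that applies the local derivative, so `backward()` only has to walk the graph in reverse order.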

Everything You Always Wanted To Know About Mathematics But didn’t even know to ask

In the last weeks, multiple works on diffusion-based language models were released. You might be wondering if you should consider them for your NLP tasks. In our latest preprint, we argue that text-based diffusion models have several properties that deserve your attention. 🧵1/8 pic.twitter.com/RnrfVyV0Dh
— Justin Deschenaux (@jdeschena) June 18, 2024

Flow Matching is SOOOO simple

GG denoising diffusion? pic.twitter.com/fdArTRk9Z1
— Cristian Garcia (@cgarciae88) June 18, 2024

Conditional flow matching

How Much Context Does My Attention-Based ASR System Need?, code

SkalskiP/top-cvpr-2024-papers

Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation, code

Understanding FAISS

An Empirical Study of Mamba-based Language Models, model, code

This talk by @bclavie is the highest value per second talk I have ever watched on RAG

Chapter summaries and additional links in next tweet pic.twitter.com/5uzmSbU6pa
— Hamel Husain (@HamelHusain) June 15, 2024

pkufool/librilight-text

Open CLIP - SigLipLoss

Very interesting Paper - "Mixture-of-Agents (MoA) Enhances Large Language Model Capabilities": - MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni. 🔥

📌 The paper introduces the… pic.twitter.com/P09kddjZMt
— Rohan Paul (@rohanpaul_ai) June 10, 2024

Faiss: A library for efficient similarity search

Faiss - Brute force search without an index

Robust solutions for audio fingerprinting

Neural Audio Fingerprint for High-Specific Audio Retrieval Based on Contrastive Learning, code

Audio Fingerprinting with Holographic Reduced Representations

MahmudulAlam/Holographic-Reduced-Representations

Learning with Holographic Reduced Representations, code

so this is nuts, if you're cool with the high frequency details of an image being reinterpreted/stochastic, you can encode an image quite faithfully into 32 tokens...
with a codebook size of 1024 as they use this is just 320bits, new upper bound for the information in an image… pic.twitter.com/DSZcmlWQf0
— Ethan (@torchcompiled) June 14, 2024
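
The 320-bit figure in the tweet above follows from simple arithmetic: each token picks one entry out of a 1024-entry codebook, which takes log2(1024) = 10 bits, and there are 32 tokens. A quick check, with variable names chosen here for illustration:

```python
import math

num_tokens = 32        # tokens per image, from the tweet
codebook_size = 1024   # codebook entries, from the tweet

bits_per_token = math.log2(codebook_size)  # bits needed to index one codebook entry
total_bits = num_tokens * bits_per_token
total_bytes = total_bits / 8

print(bits_per_token, total_bits, total_bytes)
```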

spotify/basic-pitch — A lightweight yet powerful audio-to-MIDI converter with pitch bend detection

Step-by-Step Diffusion: An Elementary Tutorial

Fourier Diffusion Models: A Method to Control MTF and NPS in Score-Based Stochastic Image Generation

Time Series Diffusion in the Frequency Domain, code

Data Augmentation in Time and Doppler Frequency Domain for Radar-based Gesture Recognition

Frequency Domain Audio Synthesis – With IFFT and Oscillators

Trajectories and revolutions in popular melody based on U.S. charts from 1950 to 2023

Speech Recognition and Multi-Speaker Diarization of Long Conversations, data

Vision Language Models Explained

Model	Actually open
------------------------	-------------
LLaVA 1.6 (Hermes 34B)
deepseek-vl-7b-base	❌
DeepSeek-VL-Chat	❌
moondream2
CogVLM-base	❌ (https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE)
CogVLM-Chat	❌ (https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE)
Fuyu-8B	❌
KOSMOS-2	✅
Qwen-VL	❌
Qwen-VL-Chat	❌
Yi-VL-34B	✅

nmslib/nmslib — Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, code

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition, code

It’s Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition, code

BAT: Learning to Reason about Spatial Sounds with Large Language Models

SpiRit-LM: Interleaved Spoken and Written Language Model

WavLLM: Towards Robust and Adaptive Speech Large Language Model, code

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity, code

Speech Trident - Awesome Speech LM

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

Creating a Pipeline for Generating Synthetic Data for Fine-Tuning Custom Embedding Models. 👀

Step 1 Create a Knowledge Base: Start with preparing your domain specific knowledge base, such as PDFs or other documents containing information. Convert the content of these documents… pic.twitter.com/0mYDJKMylY
— Philipp Schmid (@_philschmid) June 5, 2024

Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model

Memory3: Language Modeling with Explicit Memory

These ties should be noted somewhere for everyone's benefit ✌️ pic.twitter.com/jh5G9QoKoG
— Learn Something (@cooltechtipz) July 7, 2024

Information Theory: A Tutorial Introduction

Data curation via joint example selection further accelerates multimodal learning

Depth Anything V2, code, demo, coreml, model

Alice’s Adventures in a differentiable wonderland

supabase/supabase — The open source Firebase alternative. Supabase gives you a dedicated Postgres database to build your web, mobile, and AI applications.

Richard Feynman's Lectures on Physics are timeless: their main strength is in demonstrating how to reason about physics. You may not know all the lectures are completely online:

Volume 1: https://t.co/yDpyRVjdVz
Volume 2: https://t.co/oEctaDi5Sv
Volume 3: https://t.co/eXS03nuH5c pic.twitter.com/SsNOerIzoq
— Massimo (@Rainmaker1973) July 9, 2024

HazyResearch/flash-fft-conv — FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data, code in NeMo, model closed but available.

23606 Workshop on Human Motion Generation, key moment here

microsoft/graphrag

leaningtech/webvm — Virtual Machine for the Web

Cannot believe this finally happened! Over the last 1.5 years, we have been developing a new LLM architecture, with linear complexity and expressive hidden states, for long-context modeling. The following plots show our model trained from Books scale better (from 125M to 1.3B)… pic.twitter.com/Ku0oi8vqvX
— Xiaolong Wang (@xiaolonw) July 8, 2024

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

After going through 100s of AI papers in the past couple of weeks, I am noticing the deeper integration of ideas (e.g., Mixture of Million Experts and Internet of Agents) and the utility of simple yet very effective methods (e.g., RouteLLM and RankRAG).

If you are looking for… pic.twitter.com/hTVafuLbxQ
— elvis (@omarsar0) July 13, 2024

Audio Spotforming Using Nonnegative Tensor Factorization with Attractor-Based Regularization

Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures

Understanding Transformers via N-Gram Statistics

facebookincubator/submitit — Python 3.8+ toolbox for submitting jobs to Slurm

Video Diffusion Alignment via Reward Gradients, model

Deep Dive into LSTMs and xLSTMs by Hand

Chronos: Learning the Language of Time Series, code

lm-sys/FastChat

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning, code

Introducing Triton: Open-source GPU programming for neural networks

CUDA kernels in PyTorch made easy with Numba, notebook

a-brassard/ACORN — Home repository for the ACORN dataset: 3,500 explanations with aspect-wise human ratings of their quality.

Let’s reproduce GPT-2

facebookresearch/data2vec_vision

Fine-tuning Florence-2 - Microsoft’s Cutting-edge Vision Language Models

ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data, code

ColPali: Efficient Document Retrieval with Vision Language Models, code

Block Transformer: Global-to-Local Language Modeling for Fast Inference, code

AND: Audio Network Dissection for Interpreting Deep Acoustic Models

Sound Field Synthesis with Acoustic Waves

SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata

A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, code, model

AI by Hand:

1. Dot Product - AI by Hand✍️Workbook Series

I share original hand calculation exercises like this, with 36K followers on LinkedIn.

I just started to share on X.

If you find this post helpful,
[Follow] me for more! 🙌 pic.twitter.com/rbqWVZCmlP
— Tom Yeh (@ProfTomYeh) May 23, 2024

3. Linear Layer - AI by Hand✍️Workbook Series

I share original hand calculation exercises like this, with 36K followers on LinkedIn.

I just started sharing on X.

If you find this workbook helpful, [Follow] me for more! pic.twitter.com/X24n6PmydJ
— Tom Yeh (@ProfTomYeh) May 25, 2024

4. Activation - AI by Hand✍️Workbook Series

I share original hand calculation exercises like this, with 36K followers on LinkedIn. I just started sharing on X.

If you find this workbook helpful,
[Follow] me for more! pic.twitter.com/iqzGP1uSUj
— Tom Yeh (@ProfTomYeh) May 26, 2024

5. Artificial Neuron - AI by Hand✍️Workbook Series

Previous Workbooks:
4. Activation: https://t.co/8btQ2n5AAf
3. Linear Layer: https://t.co/V571mpwnTq
2. Matrix Multiplication: https://t.co/EqfK6AEutb
1. Dot Product: https://t.co/ou9GFdTV1f

I share original hand… https://t.co/dO0Ff5x4kQ pic.twitter.com/3vlR4EUoxZ
— Tom Yeh (@ProfTomYeh) May 27, 2024

Vector Database by Hand ✍️

Vector databases are revolutionizing how we search and analyze complex data. They have become the backbone of Retrieval Augmented Generation (#RAG).

How do vector databases work?

[1] Given
↳ A dataset of three sentences, each has 3 words (or tokens)… pic.twitter.com/IIJwqnVjaK
— Tom Yeh (@ProfTomYeh) May 27, 2024

GAN by Hand ✍️

Goal: Generate realistic 4-D data from 2-D noise.

[1] Given
↳ 4 noise vectors in 2D (N)
↳ 4 real data vectors in 4D (X)

[2] 🟩 Generator: First Layer
↳ Multiply the noise vectors with weights and biases to obtain new feature vectors

[3] 🟩 Generator: ReLU
↳… pic.twitter.com/7ECTTqzOJL
— Tom Yeh (@ProfTomYeh) May 24, 2024
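
The first three steps listed in the tweet above (given noise vectors, a first generator layer, then ReLU) can be sketched in plain Python. All numbers below are made up, and the hidden width of 3 is an assumption; the tweet is truncated, so the later steps are not reproduced here.

```python
# [1] Given: 4 noise vectors in 2D (illustrative values, not from the original exercise).
N = [[1.0, -1.0], [0.0, 2.0], [-1.0, 0.0], [2.0, 1.0]]

# Hypothetical first-layer parameters: 2 inputs -> 3 hidden features.
W1 = [[1.0, 0.0, -1.0],
      [0.5, 1.0, 1.0]]
b1 = [0.0, -1.0, 0.5]


def linear(rows, weights, bias):
    """[2] Multiply each input row by the weight matrix and add the bias."""
    columns = list(zip(*weights))  # one tuple of weights per output feature
    return [[sum(x * w for x, w in zip(row, col)) + b
             for col, b in zip(columns, bias)]
            for row in rows]


def relu(rows):
    """[3] Replace every negative entry with zero."""
    return [[max(0.0, v) for v in row] for row in rows]


H = relu(linear(N, W1, b1))
for row in H:
    print(row)
```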

[Transformer] by Hand✍️📺
5-minute Video Tutorial

Anna Rahn made this short video to explain the Transformer exercise for my Computer Vision course last spring.

In 5 minutes, she demonstrates the key calculations of the Transformer by hand with pen and paper!

Anna is a… pic.twitter.com/NAkESZKaQH
— Tom Yeh (@ProfTomYeh) July 10, 2024

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation, code, model

Zielon/PBRVulkan — Vulkan Real-time Path Tracer Engine

UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies

Generative AI Handbook: A Roadmap for Learning Resources

Crossmodal ASR Error Correction with Discrete Speech Units

CMU-MOSI Dataset — The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) dataset is a collection of 2199 opinion video clips.

The essence of calculus

Imitation and Mechanisms of Joint Attention: A Developmental Structure for Building Social Skills on a Humanoid Robot

Polish Public Domain Works

USER-LLM: Efficient LLM contextualization with user embeddings, arXiv

Perceiver: General Perception with Iterative Attention

Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis, code, vocos-mel-24khz, vocos-encodec-24khz

Improving Speech Decoding from ECoG with Self-Supervised Pretraining

BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation

TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification

Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion, code

Adapting Frechet Audio Distance for Generative Music Evaluation, arXiv, code

TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Spiketrum: An FPGA-based Implementation of a Neuromorphic Cochlea

Contextual Position Encoding: Learning to Count What’s Important

The Raven: Hungarian

IN LOVE WITH THE CZARINA

Simplified Grammar of the Hungarian Language

In Love With the Czarina, and Other Stories, A VAKMERŐ

The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

netease-youdao/EmotiVoice — a Multi-Voice and Prompt-Controlled TTS Engine

Voice Cloning with your personal data

ricosjp/truck — Truck is a rust CAD kernel

Hungarian Body Parts Flashcards

Hungarian Flashcards

OpenVoice: Versatile Instant Voice Cloning, code

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

A Complete Guide to Write your own Transformers

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models, FACodec model, code

Decades-Old Beer Ads Stitched Straight Into Original Star Wars Movies Go Viral

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR

I Felt a Little Homosexual Today, So I Called in Sick: The Formation of “Reverse Discourse” by Swedish Gay Activists in the 1970s

SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought

UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

Robots Beyond Borders: The Role of Social Robots in Spoken Second Language Practice

Once “too scary” to release, GPT-2 gets squeezed into an Excel spreadsheet

ianand/spreadsheets-are-all-you-need

The best CUDA intro course by @nvidia with 460 bite sized videos. It was the course released with Udacity 9 yrs ago.

It is kinda old, but you can grasp core ideas around it. https://t.co/OcRqwJ6phf
— chansung (@algo_diver) April 29, 2024

How do you teach a Large Language Model to understand images?

This paper proposes a technique called Visual Instruction Tuning that is now used by many of the language vision models we see in the field such as GPT4-Vision and Gemini etc.

In Short:
The paper introduces a method… pic.twitter.com/HJ6iRtnD3m
— Zain (@ZainHasan6) April 28, 2024

Open sourcing MS-DOS 4.0

microsoft/MS-DOS

sarah-walker-pcem/pcem/ — PC emulator

Infinite Mac — Infinite Mac is a collection of classic Macintosh and NeXT system releases and software, all easily accessible from the comfort of a (modern) web browser.

previous - NeXT emulator

dingusdev/dingusppc — PowerPC Mac emulator

Basilisk II, github

autc04/executor — A modern fork of the classic Mac emulator

Can Learned Optimization Make Reinforcement Learning Less Difficult?, code

google/learned_optimization — Meta-learning optimizers and more with JAX

Click-Gaussian: Interactive Segmentation to Any 3D Gaussians

Mistral NeMo

E2GAN: Efficient Training of Efficient GANs for Image-to-Image Translation

nerfstudio-project/nerfstudio — A collaboration friendly studio for NeRFs

Context Embeddings for Efficient Answer Generation in RAG

SpreadsheetLLM: Encoding Spreadsheets for Large Language Models