Interesting links, 29/06/2024
Misc. interesting things.
Analyzing Open AI’s Whisper ASR Accuracy: Word Error Rates Across Languages and Model Sizes
swerik-project/riksdagen-records
here is my meticulously curated (and highly biased) summer paper reading list 📚:
— jack morris (@jxmnop) June 10, 2024
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)
╰╴ https://t.co/UmRVv3YVua
LoRA: Low-Rank Adaptation of Large Language Models (2021)
╰╴ https://t.co/A7VHVnjMPt
Ring… https://t.co/WWgcbVK601
sherpa-onnx - audio-tagging-from-a-file
A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
This book is a MUST read if you’re working in the field of maths.
— ₕₐₘₚₜₒₙ — e/acc (@hamptonism) June 19, 2024
Link below👇 pic.twitter.com/H8S4HMvvzB
llm.c by Hand✍️
— Tom Yeh (@ProfTomYeh) June 4, 2024
C programming + matrix multiplication by hand
This combination is perhaps as low as we can get to explain how the Transformer works.
Special thanks to @karpathy for encouraging early feedback and @7etsuo for helping me understand the pragma magic.
I hope… pic.twitter.com/jx1Ye0r0ei
These 94 lines of code are everything that is needed to train a neural network. Everything else is just efficiency.
— Andrej Karpathy (@karpathy) June 21, 2024
This is my earlier project Micrograd. It implements a scalar-valued auto-grad engine. You start with some numbers at the leafs (usually the input data and the… pic.twitter.com/2zVJP3cNJ0
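To make the "scalar-valued auto-grad engine" idea concrete, here is a minimal sketch written for this note. It is not Karpathy's code: the `Value` class below is a simplified stand-in that supports only addition and multiplication, and its attribute names are assumptions made for illustration.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._backward = lambda: None    # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Build a topological order so each node is processed after its consumers.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# Leaf values, then a small expression built from them.
a = Value(2.0)
b = Value(-3.0)
c = a * b + a
c.backward()
print(c.data, a.grad, b.grad)  # -4.0 -2.0 2.0
```

Gradients are accumulated with `+=` because `a` is used twice in the expression, so its two contributions have to be summed.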
Everything You Always Wanted To Know About Mathematics But didn’t even know to ask
In the last weeks, multiple works on diffusion-based language models were released. You might be wondering if you should consider them for your NLP tasks. In our latest preprint, we argue that text-based diffusion models have several properties that deserve your attention. 🧵1/8 pic.twitter.com/RnrfVyV0Dh
— Justin Deschenaux (@jdeschena) June 18, 2024
Flow Matching is SOOOO simple
— Cristian Garcia (@cgarciae88) June 18, 2024
GG denoising diffusion? pic.twitter.com/fdArTRk9Z1
How Much Context Does My Attention-Based ASR System Need?, code
Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022
Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation, code
An Empirical Study of Mamba-based Language Models, model, code
This talk by @bclavie is the highest value per second talk I have ever watched on RAG
— Hamel Husain (@HamelHusain) June 15, 2024
Chapter summaries and additional links in next tweet pic.twitter.com/5uzmSbU6pa
Very interesting Paper - "Mixture-of-Agents (MoA) Enhances Large Language Model Capabilities": - MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni. 🔥
— Rohan Paul (@rohanpaul_ai) June 10, 2024
📌 The paper introduces the… pic.twitter.com/P09kddjZMt
Faiss: A library for efficient similarity search
Faiss - Brute force search without an index
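As a reminder of what brute-force search means, here is a plain-Python sketch of exact k-nearest-neighbour search by squared L2 distance. It does not use the Faiss API; the function name `knn_brute_force` and the toy vectors are made up for illustration.

```python
def knn_brute_force(queries, database, k):
    """Return, for each query, the k closest database entries as (distance, index)."""
    results = []
    for q in queries:
        # Compare the query against every stored vector: no index structure involved.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(q, x)), i)
            for i, x in enumerate(database)
        ]
        dists.sort()
        results.append(dists[:k])
    return results


database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
queries = [[0.9, 0.1]]
neighbors = knn_brute_force(queries, database, k=2)
print([index for _, index in neighbors[0]])  # [1, 0]
```

The cost is linear in the database size for every query, which is the trade-off that index structures exist to avoid.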
Robust solutions for audio fingerprinting
Neural Audio Fingerprint for High-Specific Audio Retrieval Based on Contrastive Learning, code
Audio Fingerprinting with Holographic Reduced Representations
MahmudulAlam/Holographic-Reduced-Representations
Learning with Holographic Reduced Representations, code
so this is nuts, if you're cool with the high frequency details of an image being reinterpreted/stochastic, you can encode an image quite faithfully into 32 tokens...
— Ethan is in Sydney (@torchcompiled) June 14, 2024
with a codebook size of 1024 as they use this is just 320bits, new upper bound for the information in an image… pic.twitter.com/DSZcmlWQf0
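The 320-bit figure follows from the two numbers in the tweet: each token picks one of 1024 codebook entries, which takes log2(1024) = 10 bits, and there are 32 tokens. A quick check, with variable names chosen for this note:

```python
import math

num_tokens = 32        # tokens per image, from the tweet
codebook_size = 1024   # codebook entries, from the tweet

bits_per_token = math.log2(codebook_size)  # bits needed to index one entry
total_bits = num_tokens * bits_per_token
total_bytes = total_bits / 8

print(bits_per_token, total_bits, total_bytes)  # 10.0 320.0 40.0
```

So the whole image code fits in 40 bytes, assuming each token is stored as a raw codebook index with no further compression.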
spotify/basic-pitch — A lightweight yet powerful audio-to-MIDI converter with pitch bend detection
Step-by-Step Diffusion: An Elementary Tutorial
Fourier Diffusion Models: A Method to Control MTF and NPS in Score-Based Stochastic Image Generation
Time Series Diffusion in the Frequency Domain, code
Data Augmentation in Time and Doppler Frequency Domain for Radar-based Gesture Recognition
Frequency Domain Audio Synthesis – With IFFT and Oscillators
Trajectories and revolutions in popular melody based on U.S. charts from 1950 to 2023
Speech Recognition and Multi-Speaker Diarization of Long Conversations, data
Vision Language Models Explained
| Model | Actually open |
| --- | --- |
| LLaVA 1.6 (Hermes 34B) | |
| deepseek-vl-7b-base | ❌ |
| DeepSeek-VL-Chat | ❌ |
| moondream2 | |
| CogVLM-base | [❌](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE) |
| CogVLM-Chat | [❌](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE) |
| Fuyu-8B | ❌ |
| KOSMOS-2 | ✅ |
| Qwen-VL | ❌ |
| Qwen-VL-Chat | ❌ |
| Yi-VL-34B | ✅ |
nmslib/nmslib — Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, code
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition, code
It’s Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition, code
BAT: Learning to Reason about Spatial Sounds with Large Language Models
SpiRit-LM: Interleaved Spoken and Written Language Model
WavLLM: Towards Robust and Adaptive Speech Large Language Model, code
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity, code
Speech Trident - Awesome Speech LM
Creating a Pipeline for Generating Synthetic Data for Fine-Tuning Custom Embedding Models. 👀
— Philipp Schmid (@_philschmid) June 5, 2024
Step 1 Create a Knowledge Base: Start with preparing your domain specific knowledge base, such as PDFs or other documents containing information. Convert the content of these documents… pic.twitter.com/0mYDJKMylY
Memory3 : Language Modeling with Explicit Memory
These ties should be noted somewhere for everyone's benefit ✌️ pic.twitter.com/jh5G9QoKoG
— Learn Something (@cooltechtipz) July 7, 2024
Information Theory: A Tutorial Introduction
Data curation via joint example selection further accelerates multimodal learning
Depth Anything V2, code, demo, coreml, model
Alice’s Adventures in a differentiable wonderland
supabase/supabase — The open source Firebase alternative. Supabase gives you a dedicated Postgres database to build your web, mobile, and AI applications.
Richard Feynman's Lectures on Physics are timeless: their main strength is in demonstrating how to reason about physics. You may not know all the lectures are completely online:
— Massimo (@Rainmaker1973) July 9, 2024
Volume 1: https://t.co/yDpyRVjdVz
Volume 2: https://t.co/oEctaDi5Sv
Volume 3: https://t.co/eXS03nuH5c pic.twitter.com/SsNOerIzoq
HazyResearch/flash-fft-conv — FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Less is More: Accurate Speech Recognition & Translation without Web-Scale Data, code in NeMo, model closed but available.
23606 Workshop on Human Motion Generation, key moment here
leaningtech/webvm — Virtual Machine for the Web
Cannot believe this finally happened! Over the last 1.5 years, we have been developing a new LLM architecture, with linear complexity and expressive hidden states, for long-context modeling. The following plots show our model trained from Books scale better (from 125M to 1.3B)… pic.twitter.com/Ku0oi8vqvX
— Xiaolong Wang (@xiaolonw) July 8, 2024
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
After going through 100s of AI papers in the past couple of weeks, I am noticing the deeper integration of ideas (e.g., Mixture of Million Experts and Internet of Agents) and the utility of simple yet very effective methods (e.g., RouteLLM and RankRAG).
— elvis (@omarsar0) July 13, 2024
If you are looking for… pic.twitter.com/hTVafuLbxQ
Audio Spotforming Using Nonnegative Tensor Factorization with Attractor-Based Regularization
Understanding Transformers via N-Gram Statistics
facebookincubator/submitit — Python 3.8+ toolbox for submitting jobs to Slurm
Video Diffusion Alignment via Reward Gradients, model
Deep Dive into LSTMs and xLSTMs by Hand
Chronos: Learning the Language of Time Series, code
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning, code
Introducing Triton: Open-source GPU programming for neural networks
CUDA kernels in PyTorch made easy with Numba, notebook
a-brassard/ACORN — Home repository for the ACORN dataset: 3,500 explanations with aspect-wise human ratings of their quality.
facebookresearch/data2vec_vision
Fine-tuning Florence-2 - Microsoft’s Cutting-edge Vision Language Models
ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data, code
ColPali: Efficient Document Retrieval with Vision Language Models, code
Block Transformer: Global-to-Local Language Modeling for Fast Inference, code
AND: Audio Network Dissection for Interpreting Deep Acoustic Models
Sound Field Synthesis with Acoustic Waves
Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata
A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, code, model
AI by Hand:
1. Dot Product - AI by Hand✍️Workbook Series
— Tom Yeh (@ProfTomYeh) May 23, 2024
I share original hand calculation exercises like this, with 36K followers on LinkedIn.
I just started to share on X.
If you find this post helpful,
[Follow] me for more! 🙌 pic.twitter.com/rbqWVZCmlP
3. Linear Layer - AI by Hand✍️Workbook Series
— Tom Yeh (@ProfTomYeh) May 25, 2024
I share original hand calculation exercises like this, with 36K followers on LinkedIn.
I just started sharing on X.
If you find this workbook helpful, [Follow] me for more! pic.twitter.com/X24n6PmydJ
4. Activation - AI by Hand✍️Workbook Series
— Tom Yeh (@ProfTomYeh) May 26, 2024
I share original hand calculation exercises like this, with 36K followers on LinkedIn. I just started sharing on X.
If you find this workbook helpful,
[Follow] me for more! pic.twitter.com/iqzGP1uSUj
5. Artificial Neuron - AI by Hand✍️Workbook Series
— Tom Yeh (@ProfTomYeh) May 27, 2024
Previous Workbooks:
4. Activation: https://t.co/8btQ2n5AAf
3. Linear Layer: https://t.co/V571mpwnTq
2. Matrix Multiplication: https://t.co/EqfK6AEutb
1. Dot Product: https://t.co/ou9GFdTV1f
I share original hand… https://t.co/dO0Ff5x4kQ pic.twitter.com/3vlR4EUoxZ
Vector Database by Hand ✍️
— Tom Yeh (@ProfTomYeh) May 27, 2024
Vector databases are revolutionizing how we search and analyze complex data. They have become the backbone of Retrieval Augmented Generation (#RAG).
How do vector databases work?
[1] Given
↳ A dataset of three sentences, each has 3 words (or tokens)… pic.twitter.com/IIJwqnVjaK
GAN by Hand ✍️
— Tom Yeh (@ProfTomYeh) May 24, 2024
Goal: Generate realistic 4-D data from 2-D noise.
[1] Given
↳ 4 noise vectors in 2D (N)
↳ 4 real data vectors in 4D (X)
[2] 🟩 Generator: First Layer
↳ Multiply the noise vectors with weights and biases to obtain new feature vectors
[3] 🟩 Generator: ReLU
↳… pic.twitter.com/7ECTTqzOJL
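Steps [2] and [3] above, a linear layer followed by ReLU on four 2-D noise vectors, can be sketched in plain Python. The noise values, the weights `W1`, the biases `b1`, and the hidden size of 3 are all invented here for illustration. They are not the numbers from the original exercise, and the later steps are left out because the tweet is truncated.

```python
def linear(vectors, weights, biases):
    """Apply y = W x + b to each input vector."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, biases)]
        for vec in vectors
    ]


def relu(vectors):
    """Clamp negative entries to zero."""
    return [[max(0.0, v) for v in vec] for vec in vectors]


# [1] Given: 4 noise vectors in 2D (hypothetical values).
noise = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]

# [2] Generator, first layer: hypothetical 3x2 weights and 3 biases.
W1 = [[1.0, -1.0], [0.0, 2.0], [-1.0, 1.0]]
b1 = [0.0, -1.0, 1.0]

# [3] Generator: ReLU on the new feature vectors.
hidden = relu(linear(noise, W1, b1))
for row in hidden:
    print(row)
```

Each noise vector becomes a 3-D feature vector with its negative entries zeroed; for example, the first one maps to `[1.0, 0.0, 0.0]`.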
[Transformer] by Hand✍️📺
— Tom Yeh (@ProfTomYeh) July 10, 2024
5-minute Video Tutorial
Anna Rahn made this short video to explain the Transformer exercise for my Computer Vision course last spring.
In 5 minutes, she demonstrates the key calculations of the Transformer by hand with pen and paper!
Anna is a… pic.twitter.com/NAkESZKaQH
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation, code, model
Zielon/PBRVulkan — Vulkan Real-time Path Tracer Engine
UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies
Generative AI Handbook: A Roadmap for Learning Resources
Crossmodal ASR Error Correction with Discrete Speech Units
CMU-MOSI Dataset — The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) dataset is a collection of 2199 opinion video clips.
USER-LLM: Efficient LLM contextualization with user embeddings, arXiv
Perceiver: General Perception with Iterative Attention
Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis, code, vocos-mel-24khz, vocos-encodec-24khz
Improving Speech Decoding from ECoG with Self-Supervised Pretraining
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification
Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion, code
Adapting Frechet Audio Distance for Generative Music Evaluation, arXiv, code
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Spiketrum: An FPGA-based Implementation of a Neuromorphic Cochlea
Contextual Position Encoding: Learning to Count What’s Important
Simplified Grammar of the Hungarian Language
In Love With the Czarina, and Other Stories, A VAKMERŐ
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
netease-youdao/EmotiVoice — a Multi-Voice and Prompt-Controlled TTS Engine
Voice Cloning with your personal data
ricosjp/truck — Truck is a rust CAD kernel
Hungarian Body Parts Flashcards
OpenVoice: Versatile Instant Voice Cloning, code
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
A Complete Guide to Write your own Transformers
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models, FACodec model, code
Decades-Old Beer Ads Stitched Straight Into Original Star Wars Movies Go Viral
Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Robots Beyond Borders: The Role of Social Robots in Spoken Second Language Practice
Once “too scary” to release, GPT-2 gets squeezed into an Excel spreadsheet
ianand/spreadsheets-are-all-you-need
The best CUDA intro course by @nvidia with 460 bite sized videos. It was the course released with Udacity 9 yrs ago.
— chansung (@algo_diver) April 29, 2024
It is kinda old, but you can grasp core ideas around it. https://t.co/OcRqwJ6phf
How do you teach a Large Language Model to understand images?
— Zain (@ZainHasan6) April 28, 2024
This paper proposes a technique called Visual Instruction Tuning that is now used by many of the language vision models we see in the field such as GPT4-Vision and Gemini etc.
In Short:
The paper introduces a method… pic.twitter.com/HJ6iRtnD3m
sarah-walker-pcem/pcem/ — PC emulator
Infinite Mac — Infinite Mac is a collection of classic Macintosh and NeXT system releases and software, all easily accessible from the comfort of a (modern) web browser.
dingusdev/dingusppc — PowerPC Mac emulator
autc04/executor — A modern fork of the classic Mac emulator
Can Learned Optimization Make Reinforcement Learning Less Difficult?, code
google/learned_optimization — Meta-learning optimizers and more with JAX
Click-Gaussian: Interactive Segmentation to Any 3D Gaussians
E2GAN: Efficient Training of Efficient GANs for Image-to-Image Translation
nerfstudio-project/nerfstudio — A collaboration friendly studio for NeRFs
Context Embeddings for Efficient Answer Generation in RAG
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models