Interesting links, 11/05/2024
Misc. interesting things.
xbpeng/DeepMimic — Motion imitation with deep reinforcement learning.
facebookresearch/t2motion — open source, but needs SMPLH.
ricsinaruto/gutenberg-dialog — Build a dialog dataset from online books in many languages
However you define "woke," anti-woke means being a cunt who wants to indulge bigots.
— regular steve albini (@electricalWSOP) June 4, 2023
Finetuning Whisper for dynamic audio context robustness
andrewgcodes/xlstm — my attempts at implementing various bits of Sepp Hochreiter’s new xLSTM architecture
CMU Graphics Lab Motion Capture Database
bulletphysics/bullet3 — Bullet Physics SDK: real-time collision detection and multi-physics simulation for VR, games, visual effects, robotics, machine learning etc.
ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition
Mathematicians throw shade like no others pic.twitter.com/5s6Ctmamkk
— Anthony Bonato (@Anthony_Bonato) May 6, 2024
Many in edtech are inspired by the AI education in The Diamond Age.
— Matt Bateman (@mbateman) May 5, 2024
The fictional edtech is a nanotech pseudointelligent book, The Young Lady’s Illustrated Primer. It bonds to a child at ~4 and educates them until ~16.
Features of interest of the Primer, then general thoughts… pic.twitter.com/z81QSntpJf
haosulab/ManiSkill — SAPIEN Manipulation Skill Framework, a GPU parallelized robotics simulator and benchmark (Code is open, assets are not)
XFeat: Accelerated Features for Lightweight Image Matching, code
sato-team/Stable-Text-to-motion-Framework
Generating Diverse and Natural 3D Human Motions from Text, code
Simple rules to decide when to stop fine-tuning.
— Boris Dayma 🖍️ (@borisdayma) May 2, 2024
Here we have 2 plots:
- chart 1: validation loss
- chart 2: training + validation loss pic.twitter.com/BRSxlHFqHK
Kerry King - Trophies of The Tyrant / Chemical Warfare
The evolution of Steve Albini: ‘If the dumbest person is on your side, you’re on the wrong side’
muditbhargava66/PyxLSTM — PyxLSTM is a Python library that provides an efficient and extensible implementation of the Extended Long Short-Term Memory (xLSTM) architecture. xLSTM enhances the traditional LSTM by introducing exponential gating, memory mixing, and a matrix memory structure, enabling improved performance and scalability for sequence modeling tasks.
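For a feel of what exponential gating means in practice, here is a heavily simplified, illustrative sketch of an sLSTM-style step (my own reading of the xLSTM paper, not PyxLSTM's code): exponential input/forget gates are paired with a normalizer state so the hidden state stays well-scaled. The stabilizer state, memory mixing, and the matrix-memory (mLSTM) variant are all omitted.

```python
import torch

def slstm_step(x, h, c, n, params):
    """One step of a scalar-memory cell with exponential gating, in the spirit
    of xLSTM's sLSTM. Simplified sketch: no stabilizer state, no block-diagonal
    memory mixing, no matrix memory."""
    W, R, b = params                      # input weights, recurrent weights, biases
    gates = x @ W + h @ R + b             # pre-activations for the 4 gates
    z_pre, i_pre, f_pre, o_pre = gates.chunk(4, dim=-1)
    z = torch.tanh(z_pre)                 # cell input
    i = torch.exp(i_pre)                  # exponential input gate
    f = torch.exp(f_pre)                  # exponential forget gate
    o = torch.sigmoid(o_pre)              # output gate
    c = f * c + i * z                     # cell state
    n = f * n + i                         # normalizer state tracks total gate mass
    h = o * (c / n)                       # normalized hidden state
    return h, c, n
```

Without the paper's stabilizer state the exponentials can overflow; real implementations subtract a running maximum inside the exponents.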
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Simple and Efficient Quantization Techniques for Neural Speech Coding
SpeechVerse: A Large-scale Generalizable Audio Language Model
Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding, code
Robots with the help of neuroimplants helped a paralyzed man! 🦾
— Lukas Ziegler (@lukas_m_ziegler) May 15, 2024
A 76-year-old paralyzed man has made history by using his thoughts to write 8 Chinese characters!
This incredible feat marks the first successful use of Zhejiang University's brain implants to enable writing… pic.twitter.com/OvlBfl5lVy
Naturalistic Music Decoding from EEG Data via Latent Diffusion Models
A vector quantized masked autoencoder for audiovisual speech emotion recognition
(1/5) @CKT_Conner, @dill_pkl, @emilyzsh, and I are excited to introduce Shard - a proof-of-concept for an infinitely scalable distributed system composed of consumer hardware for training and running ML models!
— Aksh Garg (@AkshGarg03) May 15, 2024
Features:
- Data + Pipeline Parallel for handling arbitrarily large… pic.twitter.com/LkVwrvU3it
Jan Kasprowicz - Krzak dzikiej róży w Ciemnych Smreczynach (A Wild Rose Bush in Ciemne Smreczyny)
Evolutionary Optimization of Model Merging Recipes, code
A Large-Scale Evaluation of Speech Foundation Models
Video ReCap: Recursive Captioning of Hour-Long Videos, code
Target Speech Extraction with Pre-trained Self-supervised Learning Models
Semi-Autoregressive Streaming ASR With Label Context
Towards audio language modeling – an overview
Probing Self-supervised Learning Models with Target Speech Extraction
A multimodal dynamical variational autoencoder for audiovisual speech representation learning
On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Best Practices for Robot Death
Understanding, Using, and Finetuning Gemma
Sigmoid Loss for Language Image Pre-Training, code
ChatGPT suspends Scarlett Johansson-like voice as actor speaks out against OpenAI
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor, code
From Motor Control to Team Play in Simulated Humanoid Football
Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing
Cross-modal Contrastive Learning for Speech Translation, code
nateraw/hf-hub-lightning — A PyTorch Lightning Callback for pushing models to the Hugging Face Hub
Images that Sound: Composing Images and Sounds on a Single Canvas, code, (model is not open).
The Doge Meme dog, Kabosu has died.
— Dexerto (@Dexerto) May 24, 2024
She was 18 years old. pic.twitter.com/ScMhYn2kuF
we've officially reached AGI pic.twitter.com/fKcex4ZFLH
— gaut (@0xgaut) May 24, 2024
An Introduction to Vision-Language Modeling
Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One, code
MyoHub/myoconverter — A tool to convert opensim 4.0+ MSK models into MuJoCo format with optimized muscle kinematics and kinetics
MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control, code
Multi-Modal Data Augmentation for End-to-End ASR
Big announcement: @pleiasfr releases a massive open corpus of 2 million Youtube videos in Creative Commons (CC-By) on @huggingface. Youtube-Commons features 30 billion words of audio transcriptions in multiple languages, and soon other modalities https://t.co/BevSENB7KZ pic.twitter.com/31Ya7utO7D
— Alexander Doria (@Dorialexander) April 18, 2024
(The dataset is noisy: no attempt is made to check that the original-language transcript matches the speech, or that there is any speech at all, and the original transcript is often omitted in favour of a translation.)
Self-Rewarding Language Models
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
Phonemes based detection of Parkinson's disease for telehealth applications
JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report, model
ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings. At search time it embeds every query into another matrix and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
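For intuition, a minimal sketch of that MaxSim scoring, assuming pre-computed, L2-normalized token embeddings (illustrative PyTorch, not ColBERT's actual code):

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction score: for each query token, take the best-matching
    passage token (cosine similarity), then sum over query tokens.
    query_emb: (q_tokens, dim), passage_emb: (p_tokens, dim), rows L2-normalized."""
    sim = query_emb @ passage_emb.T        # (q_tokens, p_tokens) cosine similarities
    return sim.max(dim=1).values.sum()     # best match per query token, summed

# Toy usage with random embeddings
q = F.normalize(torch.randn(8, 128), dim=-1)
p = F.normalize(torch.randn(100, 128), dim=-1)
print(maxsim_score(q, p))
```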
From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers
The seven sins of memory: Insights from psychology and cognitive neuroscience
mcdermottLab/pycochleagram — Generate cochleagrams natively in Python. Ported from Josh McDermott’s MATLAB code.
RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting, paper
General-purpose, long-context autoregressive modeling with Perceiver AR, code
Uneasy on the Ear: An Interview with Lola De La Mata, Left Ear, Right Ear
Using 𝚝𝚘𝚛𝚌𝚑.𝚌𝚘𝚖𝚙𝚒𝚕𝚎 makes KANs as fast as MLPs!
— Thomas Ahle (@thomasahle) June 5, 2024
I never thought I would be a fan, but they are starting to look pretty appetizing. pic.twitter.com/ti031u18YF
4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning, code
dgreenheck/tree-js — Procedural tree generator written with JavaScript and Three.js
TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion
Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination
MarcusLoppe/meshgpt-pytorch, model — based on lucidrains/meshgpt-pytorch
SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance, code
Language Table — Suite of human-collected datasets and a multi-task continuous control benchmark for open vocabulary visuolinguomotor learning.
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers, code (depends on diff-gaussian-rasterization, which is not open source)
Voice in Parkinson’s Disease: A Machine Learning Study
Parkinson’s Disease Detection Based on Running Speech Data From Phone Calls
lucidrains/BS-RoFormer — Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling, code
LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning, data
ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
hokema/Pop2Talk — Pop2Talk foreign-language pronunciation learning game. Code for the Unity client app.
kyegomez/VisionMamba — Implementation of Vision Mamba from the paper: “Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model” It’s 2.8x faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on high-res images
We embedded 250,000 works of art 🎨 from The Met using @nomic_ai's new SOTA #multimodal embeddings model!
— andrew gao (@itsandrewgao) June 5, 2024
It's the *first ever* semantic search tool of its kind 👩🎨 🔎
Search with smart queries like "oil painting with flowers & dogs".
How we did it & how to use it👇 pic.twitter.com/sWjW78zUtI
We recently open-sourced a relatively minimal implementation example of Transformer language model training in JAX, called NanoDO.
— Peter J. Liu (@peterjliu) June 5, 2024
If you stick to vanilla JAX components, the code is relatively straightforward to read -- the model file is <150 lines. We found it useful as a…
lucidrains/soundstorm-pytorch — Implementation of SoundStorm, Efficient Parallel Audio Generation from Google Deepmind, in Pytorch
lucidrains/mogrifier — Usable implementation of Mogrifier, a circuit for enhancing LSTMs and potentially other networks, from Deepmind
ina-foss/inaSpeechSegmenter — CNN-based audio segmentation toolkit for detecting speech, music, noise, and speaker gender. Designed for large-scale gender-equality studies based on speech time per gender.
mHuBERT-147: A Compact Multilingual HuBERT Model, fairseq fork, pre-processing scripts
A virtual rodent predicts the structure of neural activity across behaviors
openvla/openvla — OpenVLA: An Open-Source Vision-Language-Action Model (based on Llama, so model is not open)
TRI-ML/prismatic-vlms — A flexible and efficient codebase for training visually-conditioned language models (VLMs)
New paper just dropped, showing how to greatly increase math scores on LLMs by combining monte-carlo tree search (MCTS) with a language model.
— Jeremy Howard (@jeremyphoward) June 12, 2024
Nice! But... what if instead, we simply tell the LLM to read the paper, and *pretend* it followed those steps? pic.twitter.com/CizH4UnRwi
VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation
Scalable MatMul-free Language Modeling, code
Contextual and combinatorial structure in sperm whale vocalisations
Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
juletx/BertaQA — BertaQA: How Much Do Language Models Know About Local Culture?
Scientists have transplanted memory from one snail to another. So, what does it mean for humans?
Very nice paper - bGPT - Byte-Level Transformer. a model that processes data at the byte level and learns to simulate the digital world through next byte prediction.
— Rohan Paul (@rohanpaul_ai) June 10, 2024
Unlike traditional deep learning models that focus on human-interpretable data like text, audio and images, bGPT… pic.twitter.com/xzL6AruFhz
Beyond Language Models: Byte Models are Digital World Simulators, code
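The training signal itself is simple to picture: any file is a token stream over a 256-symbol vocabulary, and the target is the input shifted by one byte. A small illustrative sketch of that data preparation (helper name is my own, not bGPT's pipeline):

```python
import numpy as np

def next_byte_pairs(path: str, context: int = 256):
    """Yield (input, target) byte windows for next-byte prediction:
    the target sequence is the input shifted forward by one byte."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    for i in range(0, len(data) - context - 1, context):
        yield data[i:i + context], data[i + 1:i + context + 1]
```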
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization
Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning
CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning
RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection
Performant ASR Models for Medical Entities in Accented Speech
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis
Interface Design for Self-Supervised Speech Models
Towards Audio Codec-based Speech Separation
Logistic regression defines fuzzy classification boundaries using the softmax operator. At the heart of many supervised learning classification approaches. Introduced by David Cox in 1956. https://t.co/8G451RDsVo https://t.co/z3CItiyeFA pic.twitter.com/6eqmoEIOGC
— Gabriel Peyré (@gabrielpeyre) June 18, 2024
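As a refresher on the claim, here is a minimal NumPy sketch of the multinomial logistic regression forward pass: class probabilities are a softmax of a linear score, which is what makes the classification boundaries "fuzzy" rather than hard.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_proba(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multinomial logistic regression: P(class | x) = softmax(x W + b).
    X: (n_samples, n_features), W: (n_features, n_classes), b: (n_classes,)."""
    return softmax(X @ W + b)
```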
Susan Caplin, the voice behind Alexa confuses Alexa pic.twitter.com/KjAD2rseKV
— Historic Vids (@historyinmemes) June 18, 2024
OK, time for some tweets about distances between Markov chains! Actually this is about a preprint we've just posted on arxiv with Sergio Calo, Anders Jonsson, Ludovic Schwartz & Javier Segovia-Aguas. FFO optimal transport & bisimulation. Let's dig in!https://t.co/bwtcBqCcHg
— Gergely Neu (@neu_rips) June 18, 2024
1/n pic.twitter.com/jYbXSrITYs
Bisimulation Metrics are Optimal Transport Distances, and Can be Computed Efficiently
Check out the mind-blowing experiments of @dante_leoncini, a 3D Artist and Programmer, who managed to run Blender on an 18-year-old Nokia phone.
— 80 LEVEL (@80Level) June 19, 2024
Now I've seen everything: https://t.co/h7E1cXKCzT#blender #blender3d #b3d #blendercommunity #nokia #mobilephone #3dsoftware pic.twitter.com/fvW4ckCgvF
This Madlad Programmer Managed to Run Blender on a Nokia Phone
Flow-matching implementation: https://t.co/sP5DXLr4jI
— Kevin Frans (@kvfrans) June 17, 2024
Flow-matching is very similar to diffusion, but simplifies things. Noised images are linear interpolations between (data, noise) pairs, and the network predicts *velocity* of this trajectory. pic.twitter.com/KHsxhPJvV6
kvfrans/jax-flow — Flow-matching algorithms in JAX
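A minimal sketch of the objective as described in the tweet, written in PyTorch for illustration (kvfrans/jax-flow itself is in JAX, and sign/time conventions differ between implementations): sample a linear interpolation between data and noise, and regress the constant velocity of that path.

```python
import torch

def flow_matching_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """x_t = (1 - t) * x0 + t * noise; the network predicts the trajectory's
    velocity d x_t / d t = noise - x0. `model(x_t, t)` is an assumed signature."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))   # per-sample t in [0, 1)
    x_t = (1.0 - t) * x0 + t * noise                        # linear interpolation
    target_velocity = noise - x0                            # constant along the path
    return torch.mean((model(x_t, t) - target_velocity) ** 2)
```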
Audio Signal Processing for Machine Learning, slides
Domain-Informed Probing of wav2vec 2.0 Embeddings for Phonetic Features
neongeckocom/cv-tts-clean — TTS dataset from Common Voice
neongeckocom — multilingual VITS models
Not all ‘open source’ AI models are actually open: here’s a ranking
A letter from Rosa (Ní Dhochartaigh) Uí Néill, wife of Eoghan Ruadh Ó Néill, 1642, Louvain, Belgium. Published in Gilbert's ‘Affairs of Ireland’ vol 1, part 2 (1879). Transcription and translation work done by @silmeth. pic.twitter.com/jxSya2vdph
— Corbmacc (@erisceres) June 23, 2024
Wikipedia redirect for language code: https://en.wikipedia.org/wiki/ISO_639:$CODE
e.g.: gle - Irish
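A tiny helper built on that redirect pattern (function name is my own):

```python
def iso639_wikipedia_url(code: str) -> str:
    """Wikipedia redirects ISO_639:<code> to the article for that language."""
    return f"https://en.wikipedia.org/wiki/ISO_639:{code}"

print(iso639_wikipedia_url("gle"))  # resolves to the article on the Irish language
```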
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance, code
Exploring the Capability of Mamba in Speech Applications
Researchers craft smiling robot face from living human skin cells