Interesting links, 11/05/2024
Misc. interesting things.
xbpeng/DeepMimic — Motion imitation with deep reinforcement learning.
facebookresearch/t2motion — open source, but needs SMPLH.
ricsinaruto/gutenberg-dialog — Build a dialog dataset from online books in many languages
However you define "woke," anti-woke means being a cunt who wants to indulge bigots.
— regular steve albini (@electricalWSOP) June 4, 2023
Finetuning Whisper for dynamic audio context robustness
andrewgcodes/xlstm — my attempts at implementing various bits of Sepp Hochreiter’s new xLSTM architecture
CMU Graphics Lab Motion Capture Database
bulletphysics/bullet3 — Bullet Physics SDK: real-time collision detection and multi-physics simulation for VR, games, visual effects, robotics, machine learning etc.
ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition
Mathematicians throw shade like no others pic.twitter.com/5s6Ctmamkk
— Anthony Bonato (@Anthony_Bonato) May 6, 2024
Many in edtech are inspired by the AI education in The Diamond Age.
— Matt Bateman (@mbateman) May 5, 2024
The fictional edtech is a nanotech pseudointelligent book, The Young Lady’s Illustrated Primer. It bonds to a child at ~4 and educates them until ~16.
Features of interest of the Primer, then general thoughts… pic.twitter.com/z81QSntpJf
haosulab/ManiSkill — SAPIEN Manipulation Skill Framework, a GPU parallelized robotics simulator and benchmark (Code is open, assets are not)
XFeat: Accelerated Features for Lightweight Image Matching, code
sato-team/Stable-Text-to-motion-Framework
Generating Diverse and Natural 3D Human Motions from Text, code
Simple rules to decide when to stop fine-tuning.
— Boris Dayma 🖍️ (@borisdayma) May 2, 2024
Here we have 2 plots:
- chart 1: validation loss
- chart 2: training + validation loss pic.twitter.com/BRSxlHFqHK
Kerry King - Trophies of The Tyrant / Chemical Warfare
The evolution of Steve Albini: ‘If the dumbest person is on your side, you’re on the wrong side’
muditbhargava66/PyxLSTM — PyxLSTM is a Python library that provides an efficient and extensible implementation of the Extended Long Short-Term Memory (xLSTM) architecture. xLSTM enhances the traditional LSTM by introducing exponential gating, memory mixing, and a matrix memory structure, enabling improved performance and scalability for sequence modeling tasks.
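For a feel of what exponential gating means in practice, here is a heavily simplified, illustrative sketch of an sLSTM-style step (my own reading of the xLSTM paper, not PyxLSTM's code): exponential input/forget gates are paired with a normalizer state so the hidden state stays well-scaled. The stabilizer state, memory mixing, and the matrix-memory (mLSTM) variant are all omitted.

```python
import torch

def slstm_step(x, h, c, n, params):
    """One step of a scalar-memory cell with exponential gating, in the spirit
    of xLSTM's sLSTM. Simplified sketch: no stabilizer state, no block-diagonal
    memory mixing, no matrix memory."""
    W, R, b = params                      # input weights, recurrent weights, biases
    gates = x @ W + h @ R + b             # pre-activations for the 4 gates
    z_pre, i_pre, f_pre, o_pre = gates.chunk(4, dim=-1)
    z = torch.tanh(z_pre)                 # cell input
    i = torch.exp(i_pre)                  # exponential input gate
    f = torch.exp(f_pre)                  # exponential forget gate
    o = torch.sigmoid(o_pre)              # output gate
    c = f * c + i * z                     # cell state
    n = f * n + i                         # normalizer state tracks total gate mass
    h = o * (c / n)                       # normalized hidden state
    return h, c, n
```

Without the paper's stabilizer state the exponentials can overflow; real implementations subtract a running maximum inside the exponents.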
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Simple and Efficient Quantization Techniques for Neural Speech Coding
SpeechVerse: A Large-scale Generalizable Audio Language Model
Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning
No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding, code
Robots with the help of neuroimplants helped a paralyzed man! 🦾
— Lukas Ziegler (@lukas_m_ziegler) May 15, 2024
A 76-year-old paralyzed man has made history by using his thoughts to write 8 Chinese characters!
This incredible feat marks the first successful use of Zhejiang University's brain implants to enable writing… pic.twitter.com/OvlBfl5lVy
Naturalistic Music Decoding from EEG Data via Latent Diffusion Models
A vector quantized masked autoencoder for audiovisual speech emotion recognition
(1/5) @CKT_Conner, @dill_pkl, @emilyzsh, and I are excited to introduce Shard - a proof-of-concept for an infinitely scalable distributed system composed of consumer hardware for training and running ML models!
— Aksh Garg (@AkshGarg03) May 15, 2024
Features:
- Data + Pipeline Parallel for handling arbitrarily large… pic.twitter.com/LkVwrvU3it
Jan Kasprowicz - Krzak dzikiej róży w Ciemnych Smreczynach (A Wild Rose Bush in Ciemne Smreczyny)
Evolutionary Optimization of Model Merging Recipes, code
A Large-Scale Evaluation of Speech Foundation Models
Video ReCap: Recursive Captioning of Hour-Long Videos, code
Target Speech Extraction with Pre-trained Self-supervised Learning Models
Semi-Autoregressive Streaming ASR With Label Context
Towards audio language modeling – an overview
Probing Self-supervised Learning Models with Target Speech Extraction
A multimodal dynamical variational autoencoder for audiovisual speech representation learning
On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Best Practices for Robot Death
Understanding, Using, and Finetuning Gemma
Sigmoid Loss for Language Image Pre-Training, code
ChatGPT suspends Scarlett Johansson-like voice as actor speaks out against OpenAI
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor, code
From Motor Control to Team Play in Simulated Humanoid Football
Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing
Cross-modal Contrastive Learning for Speech Translation, code
nateraw/hf-hub-lightning — A PyTorch Lightning Callback for pushing models to the Hugging Face Hub
Images that Sound: Composing Images and Sounds on a Single Canvas, code, (model is not open).
The Doge Meme dog, Kabosu has died.
— Dexerto (@Dexerto) May 24, 2024
She was 18 years old. pic.twitter.com/ScMhYn2kuF
we've officially reached AGI pic.twitter.com/fKcex4ZFLH
— gaut (@0xgaut) May 24, 2024
An Introduction to Vision-Language Modeling
Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One, code
MyoHub/myoconverter — A tool to convert opensim 4.0+ MSK models into MuJoCo format with optimized muscle kinematics and kinetics
MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control, code
Multi-Modal Data Augmentation for End-to-End ASR
Big announcement: @pleiasfr releases a massive open corpus of 2 million Youtube videos in Creative Commons (CC-By) on @huggingface. Youtube-Commons features 30 billion words of audio transcriptions in multiple languages, and soon other modalities https://t.co/BevSENB7KZ pic.twitter.com/31Ya7utO7D
— Alexander Doria (@Dorialexander) April 18, 2024
(The dataset is noisy: no attempt is made to check that the original-language transcript matches the speech, or that there is any speech at all, and the original transcript is often omitted in favour of a translation.)
Self-Rewarding Language Models
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
Phonemes based detection of Parkinson's disease for telehealth applications
JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report, model
ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings. At search time it embeds every query into another matrix and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
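For intuition, a minimal sketch of that MaxSim scoring, assuming pre-computed, L2-normalized token embeddings (illustrative PyTorch, not ColBERT's actual code):

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction score: for each query token, take the best-matching
    passage token (cosine similarity), then sum over query tokens.
    query_emb: (q_tokens, dim), passage_emb: (p_tokens, dim), rows L2-normalized."""
    sim = query_emb @ passage_emb.T        # (q_tokens, p_tokens) cosine similarities
    return sim.max(dim=1).values.sum()     # best match per query token, summed

# Toy usage with random embeddings
q = F.normalize(torch.randn(8, 128), dim=-1)
p = F.normalize(torch.randn(100, 128), dim=-1)
print(maxsim_score(q, p))
```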
From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers
The seven sins of memory: Insights from psychology and cognitive neuroscience
mcdermottLab/pycochleagram — Generate cochleagrams natively in Python. Ported from Josh McDermott’s MATLAB code.
RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting, paper
General-purpose, long-context autoregressive modeling with Perceiver AR, code
Uneasy on the Ear: An Interview with Lola De La Mata, Left Ear, Right Ear
Using 𝚝𝚘𝚛𝚌𝚑.𝚌𝚘𝚖𝚙𝚒𝚕𝚎 makes KANs as fast as MLPs!
— Thomas Ahle (@thomasahle) June 5, 2024
I never thought I would be a fan, but they are starting to look pretty appetizing. pic.twitter.com/ti031u18YF
4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning, code
dgreenheck/tree-js — Procedural tree generator written with JavaScript and Three.js
TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion
Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination
MarcusLoppe/meshgpt-pytorch, model — based on lucidrains/meshgpt-pytorch
SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance, code
Language Table — Suite of human-collected datasets and a multi-task continuous control benchmark for open vocabulary visuolinguomotor learning.
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers, code (depends on diff-gaussian-rasterization, which is not open source)
Voice in Parkinson’s Disease: A Machine Learning Study
Parkinson’s Disease Detection Based on Running Speech Data From Phone Calls
lucidrains/BS-RoFormer — Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling, code
LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning, data
ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
hokema/Pop2Talk — Pop2Talk foreign-language pronunciation learning game. Code for the Unity client app.
kyegomez/VisionMamba — Implementation of Vision Mamba from the paper: “Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model” It’s 2.8x faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on high-res images
We embedded 250,000 works of art 🎨 from The Met using @nomic_ai's new SOTA #multimodal embeddings model!
— andrew gao (@itsandrewgao) June 5, 2024
It's the *first ever* semantic search tool of its kind 👩🎨 🔎
Search with smart queries like "oil painting with flowers & dogs".
How we did it & how to use it👇 pic.twitter.com/sWjW78zUtI
We recently open-sourced a relatively minimal implementation example of Transformer language model training in JAX, called NanoDO.
— Peter J. Liu (@peterjliu) June 5, 2024
If you stick to vanilla JAX components, the code is relatively straightforward to read -- the model file is <150 lines. We found it useful as a…
lucidrains/soundstorm-pytorch — Implementation of SoundStorm, Efficient Parallel Audio Generation from Google Deepmind, in Pytorch
lucidrains/mogrifier — Usable implementation of Mogrifier, a circuit for enhancing LSTMs and potentially other networks, from Deepmind
ina-foss/inaSpeechSegmenter — CNN-based audio segmentation toolkit for detecting speech, music, noise, and speaker gender. Designed for large-scale gender-equality studies based on speech time per gender.
mHuBERT-147: A Compact Multilingual HuBERT Model, fairseq fork, pre-processing scripts
A virtual rodent predicts the structure of neural activity across behaviors
openvla/openvla — OpenVLA: An Open-Source Vision-Language-Action Model (based on Llama, so model is not open)
TRI-ML/prismatic-vlms — A flexible and efficient codebase for training visually-conditioned language models (VLMs)
New paper just dropped, showing how to greatly increase math scores on LLMs by combining monte-carlo tree search (MCTS) with a language model.
— Jeremy Howard (@jeremyphoward) June 12, 2024
Nice! But... what if instead, we simply tell the LLM to read the paper, and *pretend* it followed those steps? pic.twitter.com/CizH4UnRwi
VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation
Scalable MatMul-free Language Modeling, code
Contextual and combinatorial structure in sperm whale vocalisations
Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
juletx/BertaQA — BertaQA: How Much Do Language Models Know About Local Culture?
Scientists have transplanted memory from one snail to another. So, what does it mean for humans?
Very nice paper - bGPT - Byte-Level Transformer. a model that processes data at the byte level and learns to simulate the digital world through next byte prediction.
— Rohan Paul (@rohanpaul_ai) June 10, 2024
Unlike traditional deep learning models that focus on human-interpretable data like text, audio and images, bGPT… pic.twitter.com/xzL6AruFhz
Beyond Language Models: Byte Models are Digital World Simulators, code
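The training signal itself is simple to picture: any file is a token stream over a 256-symbol vocabulary, and the target is the input shifted by one byte. A small illustrative sketch of that data preparation (helper name is my own, not bGPT's pipeline):

```python
import numpy as np

def next_byte_pairs(path: str, context: int = 256):
    """Yield (input, target) byte windows for next-byte prediction:
    the target sequence is the input shifted forward by one byte."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    for i in range(0, len(data) - context - 1, context):
        yield data[i:i + context], data[i + 1:i + context + 1]
```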
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization
Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning
CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning
RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection
Performant ASR Models for Medical Entities in Accented Speech
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis
Interface Design for Self-Supervised Speech Models
Towards Audio Codec-based Speech Separation
Logistic regression defines fuzzy classification boundaries using the softmax operator. At the heart of many supervised learning classification approaches. Introduced by David Cox in 1956. https://t.co/8G451RDsVo https://t.co/z3CItiyeFA pic.twitter.com/6eqmoEIOGC
— Gabriel Peyré (@gabrielpeyre) June 18, 2024
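As a refresher on the claim, here is a minimal NumPy sketch of the multinomial logistic regression forward pass: class probabilities are a softmax of a linear score, which is what makes the classification boundaries "fuzzy" rather than hard.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_proba(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multinomial logistic regression: P(class | x) = softmax(x W + b).
    X: (n_samples, n_features), W: (n_features, n_classes), b: (n_classes,)."""
    return softmax(X @ W + b)
```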
Susan Caplin, the voice behind Alexa confuses Alexa pic.twitter.com/KjAD2rseKV
— Historic Vids (@historyinmemes) June 18, 2024
OK, time for some tweets about distances between Markov chains! Actually this is about a preprint we've just posted on arxiv with Sergio Calo, Anders Jonsson, Ludovic Schwartz & Javier Segovia-Aguas. FFO optimal transport & bisimulation. Let's dig in!https://t.co/bwtcBqCcHg
— Gergely Neu (@neu_rips) June 18, 2024
1/n pic.twitter.com/jYbXSrITYs
Bisimulation Metrics are Optimal Transport Distances, and Can be Computed Efficiently
Check out the mind-blowing experiments of @dante_leoncini, a 3D Artist and Programmer, who managed to run Blender on an 18-year-old Nokia phone.
— 80 LEVEL (@80Level) June 19, 2024
Now I've seen everything: https://t.co/h7E1cXKCzT#blender #blender3d #b3d #blendercommunity #nokia #mobilephone #3dsoftware pic.twitter.com/fvW4ckCgvF
This Madlad Programmer Managed to Run Blender on a Nokia Phone
Flow-matching implementation: https://t.co/sP5DXLr4jI
— Kevin Frans (@kvfrans) June 17, 2024
Flow-matching is very similar to diffusion, but simplifies things. Noised images are linear interpolations between (data, noise) pairs, and the network predicts *velocity* of this trajectory. pic.twitter.com/KHsxhPJvV6
kvfrans/jax-flow — Flow-matching algorithms in JAX
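A minimal sketch of the objective as described in the tweet, written in PyTorch for illustration (kvfrans/jax-flow itself is in JAX, and sign/time conventions differ between implementations): sample a linear interpolation between data and noise, and regress the constant velocity of that path.

```python
import torch

def flow_matching_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """x_t = (1 - t) * x0 + t * noise; the network predicts the trajectory's
    velocity d x_t / d t = noise - x0. `model(x_t, t)` is an assumed signature."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))   # per-sample t in [0, 1)
    x_t = (1.0 - t) * x0 + t * noise                        # linear interpolation
    target_velocity = noise - x0                            # constant along the path
    return torch.mean((model(x_t, t) - target_velocity) ** 2)
```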
Audio Signal Processing for Machine Learning, slides
Domain-Informed Probing of wav2vec 2.0 Embeddings for Phonetic Features
neongeckocom/cv-tts-clean — TTS dataset from Common Voice
neongeckocom — multilingual VITS models
Not all ‘open source’ AI models are actually open: here’s a ranking
A letter from Rosa (Ní Dhochartaigh) Uí Néill, wife of Eoghan Ruadh Ó Néill, 1642, Louvain, Belgium. Published in Gilbert's ‘Affairs of Ireland’ vol 1, part 2 (1879). Transcription and translation work done by @silmeth. pic.twitter.com/jxSya2vdph
— Corbmacc (@erisceres) June 23, 2024
Wikipedia redirect for language code: https://en.wikipedia.org/wiki/ISO_639:$CODE
e.g.: gle - Irish
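A tiny helper built on that redirect pattern (function name is my own):

```python
def iso639_wikipedia_url(code: str) -> str:
    """Wikipedia redirects ISO_639:<code> to the article for that language."""
    return f"https://en.wikipedia.org/wiki/ISO_639:{code}"

print(iso639_wikipedia_url("gle"))  # resolves to the article on the Irish language
```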
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance, code
Exploring the Capability of Mamba in Speech Applications
Researchers craft smiling robot face from living human skin cells