word_prediction_criterion.py

xbpeng/DeepMimic — Motion imitation with deep reinforcement learning.

facebookresearch/fairmotion

facebookresearch/t2motion — open source, but needs SMPLH.

ricsinaruto/gutenberg-dialog — Build a dialog dataset from online books in many languages

Finetuning Whisper for dynamic audio context robustness

andrewgcodes/xlstm — my attempts at implementing various bits of Sepp Hochreiter’s new xLSTM architecture

CMU Graphics Lab Motion Capture Database

bulletphysics/bullet3 — Bullet Physics SDK: real-time collision detection and multi-physics simulation for VR, games, visual effects, robotics, machine learning etc.

ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition

WhisperForCTC #26242

BlinkDL/rwkv-4-world

Open Language Model: OLMo

haosulab/ManiSkill — SAPIEN Manipulation Skill Framework, a GPU parallelized robotics simulator and benchmark (Code is open, assets are not)

XFeat: Accelerated Features for Lightweight Image Matching, code

TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

sato-team/Stable-Text-to-motion-Framework

Generating Diverse and Natural 3D Human Motions from Text, code

Kerry King - Trophies of the Tyrant / Chemical Warfare

The evolution of Steve Albini: ‘If the dumbest person is on your side, you’re on the wrong side’

muditbhargava66/PyxLSTM — PyxLSTM is a Python library that provides an efficient and extensible implementation of the Extended Long Short-Term Memory (xLSTM) architecture. xLSTM enhances the traditional LSTM by introducing exponential gating, memory mixing, and a matrix memory structure, enabling improved performance and scalability for sequence modeling tasks.

cepstrum, quefrency, rahmonic

SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Simple and Efficient Quantization Techniques for Neural Speech Coding

SpeechVerse: A Large-scale Generalizable Audio Language Model

A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech

Learning the Meanings of Function Words From Grounded Language Using a Visual Question Answering Model

Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning

No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding, code

Naturalistic Music Decoding from EEG Data via Latent Diffusion Models

A vector quantized masked autoencoder for audiovisual speech emotion recognition

Jan Kasprowicz - Krzak dzikiej róży w Ciemnych Smreczynach

polska-poezja.pl

Evolutionary Optimization of Model Merging Recipes, code

Swedish Kelly list

A Large-Scale Evaluation of Speech Foundation Models

Video ReCap: Recursive Captioning of Hour-Long Videos, code

Target Speech Extraction with Pre-trained Self-supervised Learning Models

Semi-Autoregressive Streaming ASR With Label Context

Towards audio language modeling – an overview

Probing Self-supervised Learning Models with Target Speech Extraction

A multimodal dynamical variational autoencoder for audiovisual speech representation learning

On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models

MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning

Best Practices for Robot Death

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Understanding, Using, and Finetuning Gemma

Blind estimation of audio effects using an auto-encoder approach and differentiable digital signal processing

Sigmoid Loss for Language Image Pre-Training, code

ChatGPT suspends Scarlett Johansson-like voice as actor speaks out against OpenAI

CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor, code

From Motor Control to Team Play in Simulated Humanoid Football

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

Cross-modal Contrastive Learning for Speech Translation, code

nateraw/hf-hub-lightning — A PyTorch Lightning Callback for pushing models to the Hugging Face Hub

Learning Transformer Programs

Images that Sound: Composing Images and Sounds on a Single Canvas, code (model is not open).

2BP: 2-Stage Backpropagation

An Introduction to Vision-Language Modeling

Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One, code

MyoHub/myoconverter — A tool to convert opensim 4.0+ MSK models into MuJoCo format with optimized muscle kinematics and kinetics

MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control, code

From motor control to team play in simulated humanoid football

langchain-ai/langchain

Multi-Modal Data Augmentation for End-to-End ASR

PleIAs/YouTube-Commons

(The dataset is noisy: no attempt was made to check whether the original-language transcript matches the speech in any way, or whether there is any speech at all, and the original transcript is often omitted in favour of a translation.)

Self-Rewarding Language Models

Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective

Phonemes based detection of parkinson’s disease for telehealth applications

JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report, model

colbert-ir/colbertv2.0

ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings. Then, at search time, it embeds every query into another matrix and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
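The late-interaction scoring described above can be sketched in a few lines. This is a toy illustration only, not the colbert-ir implementation: the function names (`maxsim_score`, `rank_passages`) and the hand-written 2-dimensional "embeddings" are assumptions made for the example, whereas the real model produces token embeddings with a trained encoder and searches them through an index.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query: Sequence[Vector], passage: Sequence[Vector]) -> float:
    """For each query token take its best-matching passage token, then sum."""
    return sum(max(dot(q, p) for p in passage) for q in query)


def rank_passages(query: Sequence[Vector],
                  passages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return passage indices sorted from highest to lowest MaxSim score."""
    return sorted(range(len(passages)),
                  key=lambda i: maxsim_score(query, passages[i]),
                  reverse=True)


# Hypothetical token embeddings: one row per token.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
passage_b = [[0.6, 0.8], [0.8, 0.6]]

ranking = rank_passages(query, [passage_a, passage_b])
```

Because each query token is matched independently, the passage matrices can be computed offline and only the query has to be embedded at search time.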

From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

The seven sins of memory: Insights from psychology and cognitive neuroscience

mcdermottLab/pycochleagram — Generate cochleagrams natively in Python. Ported from Josh McDermott’s MATLAB code.

Codifying the Debates of the Riksdag: Towards a Framework for Semi-automatic Annotation of Swedish Parliamentary Discourse

RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting, paper

jaywalnut310/vits

Dutch people

SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

General-purpose, long-context autoregressive modeling with Perceiver AR, code

Uneasy on the Ear: An Interview with Lola De La Mata, Left Ear, Right Ear

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Audio Mamba: Bidirectional State Space Model for Audio Representation Learning, code

dgreenheck/tree-js — Procedural tree generator written with JavaScript and Three.js

xenova/transformers.js

TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion

Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination

MarcusLoppe/meshgpt-pytorch, model — based on lucidrains/meshgpt-pytorch

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance, code

Scaling Spherical CNNs, code

Language Table — Suite of human-collected datasets and a multi-task continuous control benchmark for open vocabulary visuolinguomotor learning.

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers, code (depends on diff-gaussian-rasterization which is not open source)

Voice in Parkinson’s Disease: A Machine Learning Study

Parkinson’s Disease Detection Based on Running Speech Data From Phone Calls

lucidrains/BS-RoFormer — Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling, code

LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning, data

ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings

Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

ReLU-KAN: New Kolmogorov-Arnold Networks that Only Need Matrix Addition, Dot Multiplication, and ReLU

hokema/Pop2Talk — Pop2Talk foreign language pronunciation learning game. Code for the Unity client app.

kyegomez/VisionMamba — Implementation of Vision Mamba from the paper: “Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model”. It’s 2.8x faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on high-res images

The maze is in the mouse

lucidrains/soundstorm-pytorch — Implementation of SoundStorm, Efficient Parallel Audio Generation from Google Deepmind, in Pytorch

lucidrains/mogrifier — Usable implementation of Mogrifier, a circuit for enhancing LSTMs and potentially other networks, from Deepmind

ina-foss/inaSpeechSegmenter — CNN-based audio segmentation toolkit. Detects speech, music, noise and speaker gender. Designed for large-scale gender equality studies based on speech time per gender.

mHuBERT-147: A Compact Multilingual HuBERT Model, fairseq fork, pre-processing scripts

A virtual rodent predicts the structure of neural activity across behaviors

openvla/openvla — OpenVLA: An Open-Source Vision-Language-Action Model (based on Llama, so model is not open)

TRI-ML/prismatic-vlms — A flexible and efficient codebase for training visually-conditioned language models (VLMs)

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Scalable MatMul-free Language Modeling, code

Contextual and combinatorial structure in sperm whale vocalisations

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR

SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages

juletx/BertaQA — BertaQA: How Much Do Language Models Know About Local Culture?

Scientists have transplanted memory from one snail to another. So, what does it mean for humans?

Beyond Language Models: Byte Models are Digital World Simulators, code

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization

Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning

CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning

RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

Performant ASR Models for Medical Entities in Accented Speech

A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Interface Design for Self-Supervised Speech Models

Towards Audio Codec-based Speech Separation

Notebook

I Felt a Little Homosexual Today, So I Called in Sick: The Formation of “Reverse Discourse” by Swedish Gay Activists in the 1970s

Bisimulation Metrics are Optimal Transport Distances, and Can be Computed Efficiently

This Madlad Programmer Managed to Run Blender on a Nokia Phone

kvfrans/jax-flow — Flow-matching algorithms in JAX

Audio Signal Processing for Machine Learning, slides

Domain-Informed Probing of wav2vec 2.0 Embeddings for Phonetic Features

neongeckocom/cv-tts-clean — TTS dataset from Common Voice

neongeckocom — multilingual VITS models

Not all ‘open source’ AI models are actually open: here’s a ranking

Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding

Wikipedia redirect for a language code: https://en.wikipedia.org/wiki/ISO_639:$CODE (e.g. gle for Irish)
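A minimal sketch of filling in that URL pattern. The helper name `iso639_redirect_url` is hypothetical, and trimming and lowercasing the code is an assumption made here, not something the note specifies.

```python
def iso639_redirect_url(code: str) -> str:
    """Build the Wikipedia redirect URL for a language code such as 'gle'."""
    # Assumption: codes are used in lowercase with no surrounding whitespace.
    normalized = code.strip().lower()
    if not normalized:
        raise ValueError("language code must not be empty")
    return f"https://en.wikipedia.org/wiki/ISO_639:{normalized}"


irish_url = iso639_redirect_url("gle")
```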

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance, code

Exploring the Capability of Mamba in Speech Applications

Researchers craft smiling robot face from living human skin cells