Interesting links, 23/03/2025
Misc. interesting things.
The big idea: what do we really mean by free speech?
Speaker Change Detection for Transformer Transducer ASR
Büszkeség és balítélet (Pride and Prejudice): audio, card, text
KnugiHK/WhatsApp-Chat-Exporter
EliteAndroidApps/WhatsApp-Key-DB-Extractor
Finetuning for ESPNet OWSM Model
Liquid AI’s new STAR model architecture outshines Transformer efficiency, Automated Architecture Synthesis via Targeted Evolution, arxiv
ESPnet pull requests:
- Classification Task and AudioSet-20K #5998
- add SASV support #5980
- Add RATS dataset for SV task
- Add SWBD text processing fix
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
Cold-start Active Learning through Self-supervised Language Modeling
CMU Researchers Introduce TNNGen
How Neural Networks Learn: A Probabilistic Viewpoint
FStarLang/karamel — KaRaMeL is a tool for extracting low-level F* programs to readable C code
Single directional chamfer distance and non-absolute cosine similarity
Render DensePose — Needs SMPL (which is not open source)
RVC-Boss/GPT-SoVITS — 1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
acids-ircam/ddsp_pytorch — Implementation of Differentiable Digital Signal Processing (DDSP) in Pytorch
DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input, code
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
mixup: Data-Dependent Data Augmentation
Gaussian Distributions are Soap Bubbles
nebius/kvax — A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism.
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion, code, arxiv
PeterGriffinJin/Search-R1 — Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL
MatthewCYM/VoiceBench — VoiceBench: Benchmarking LLM-Based Voice Assistants
IDEA-Research/Grounded-SAM-2 — Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2
Andrej Karpathy (@karpathy), February 5, 2025: "New 3h31m video on YouTube: 'Deep Dive into LLMs like ChatGPT'. This is a general audience deep dive into the Large Language Model (LLM) AI technology that powers ChatGPT and related products. It covers the full training stack of how the models are developed, along with mental…"
Deep Dive into LLMs like ChatGPT
Must-Watch Hungarian TV Series to Improve Your Hungarian
ml-explore/mlx-lm — Run LLMs with MLX
MoshiVis — Teaching Moshi to Converse about Images
Moshi: a speech-text foundation model for real-time dialogue, code
mistralai/Mistral-Small-3.1-24B-Instruct-2503
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
fasttransform: Reversible Pipelines Made Simple
nvidia/canary-1b-flash — CC-BY
Empowering innovation: The next generation of the Phi family, microsoft/Phi-4-multimodal-instruct
Traveling Waves Integrate Spatial Information Through Time
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, code
UniSep: Universal Target Audio Separation with Language Models at Scale
Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations
Scaling Rich Style-Prompted Text-to-Speech Datasets, code — models and data not open.
Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages, code, small, base
NeMo - AED Decoding with N-Gram LM
AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions
FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System, code
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics, dataset, eval code
CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition, code, espnet code
llava-hf/llava-v1.6-mistral-7b-hf
Prompt-based Alignment of Headlines and Images Using OpenCLIP, workshop
Optimizing Visual Pairings: A CLIP Framework for Precision News Image Rematching, workshop
rom1504/clip-retrieval — Easily compute clip embeddings and build a clip retrieval system with them
Wav2CLIP: Learning Robust Audio Representations From CLIP, arXiv, code, demo
Fine-tuned CLIP Models are Efficient Video Learners, code
dmlc/decord — An efficient video loader for deep learning with smart shuffling that’s super easy to digest
ibm-granite/granite-speech-3.2-8b, based on granite-3.1-8b-base
ishine/PnG-BERT – unofficial implementation of PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS
People are poorly equipped to detect AI-powered voice clones
r9y9/pysptk — A python wrapper for Speech Signal Processing Toolkit (SPTK).
jameslyons/python_speech_features — provides common speech features for ASR including MFCCs and filterbank energies