Interesting links, 23/03/2025

Misc. interesting things.

Mar 23, 2025 • 3 min read

The big idea: what do we really mean by free speech?

Speaker Change Detection for Transformer Transducer ASR

Büszkeség és balítélet (Pride and Prejudice): audio, card, text

Audiobooks in MP3

KnugiHK/WhatsApp-Chat-Exporter

EliteAndroidApps/WhatsApp-Key-DB-Extractor

Finetuning for ESPnet OWSM Model

Liquid AI’s new STAR model architecture outshines Transformer efficiency, Automated Architecture Synthesis via Targeted Evolution, arxiv

ESPnet pull requests:

The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

Cold-start Active Learning through Self-supervised Language Modeling

CMU Researchers Introduce TNNGen

Open sourcing h3i

How Neural Networks Learn: A Probabilistic Viewpoint

FStarLang/karamel — KaRaMeL is a tool for extracting low-level F* programs to readable C code

Single directional chamfer distance and non-absolute cosine similarity

Render DensePose — Needs SMPL (which is not open source)

RVC-Boss/GPT-SoVITS — 1 min voice data can also be used to train a good TTS model! (few shot voice cloning)

acids-ircam/ddsp_pytorch — Implementation of Differentiable Digital Signal Processing (DDSP) in Pytorch

DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input, code

jryban/frechet-music-distance

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

huggingface/speech-to-speech

mixup: Data-Dependent Data Augmentation

Gaussian Distributions are Soap Bubbles

Musings on Typicality

nebius/kvax — A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism.

DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion, code, arxiv

PeterGriffinJin/Search-R1 — Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL

MatthewCYM/VoiceBench — VoiceBench: Benchmarking LLM-Based Voice Assistants

IDEA-Research/Grounded-SAM-2 — Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2

New 3h31m video on YouTube:
"Deep Dive into LLMs like ChatGPT"

This is a general audience deep dive into the Large Language Model (LLM) AI technology that powers ChatGPT and related products. It covers the full training stack of how the models are developed, along with mental…
— Andrej Karpathy (@karpathy) February 5, 2025

Deep Dive into LLMs like ChatGPT

Must-Watch Hungarian TV Series to Improve Your Hungarian

r/hungarian resources

Speaking Hungarian S01

ml-explore/mlx-lm — Run LLMs with MLX

MoshiVis — Teaching Moshi to Converse about Images

Announcing Pixtral 12B

Moshi: a speech-text foundation model for real-time dialogue, code

PyTorch internals

Inductive Moment Matching

mistralai/Mistral-Small-3.1-24B-Instruct-2503

Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment

fasttransform: Reversible Pipelines Made Simple

nvidia/canary-1b-flash — CC-BY

Empowering innovation: The next generation of the Phi family, microsoft/Phi-4-multimodal-instruct

Traveling Waves Integrate Spatial Information Through Time

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, code

yzhuoning/Awesome-CLIP

UniSep: Universal Target Audio Separation with Language Models at Scale

Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations

Scaling Rich Style-Prompted Text-to-Speech Datasets, code — models and data not open.

Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages, code, small, base

NeMo - AED Decoding with N-Gram LM

Audio Compression using Periodic Gabor with Biorthogonal Exchange: Implementation Using the Zak Transform

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System, code

Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics, dataset, eval code

OWSM CTC, OWSM CTC aligner

e-branchformer encoder

CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition, code, espnet code

llava-hf/llava-v1.6-mistral-7b-hf

Prompt-based Alignment of Headlines and Images Using OpenCLIP, workshop

Optimizing Visual Pairings: A CLIP Framework for Precision News Image Rematching, workshop

rom1504/clip-retrieval — Easily compute clip embeddings and build a clip retrieval system with them

Wav2CLIP: Learning Robust Audio Representations From CLIP, arXiv, code, demo

Fine-tuned CLIP Models are Efficient Video Learners, code

dmlc/decord — An efficient video loader for deep learning with smart shuffling that’s super easy to digest

ibm-granite/granite-speech-3.2-8b, based on granite-3.1-8b-base

Qwen/Qwen2.5-Omni-7B, code

KBLab/kb-whisper-large

KBLab/rixvox-v2

ishine/PnG-BERT – unofficial implementation of PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

People are poorly equipped to detect AI-powered voice clones

r9y9/pysptk — A python wrapper for Speech Signal Processing Toolkit (SPTK).

jameslyons/python_speech_features — provides common speech features for ASR including MFCCs and filterbank energies