The big idea: what do we really mean by free speech?

Speaker Change Detection for Transformer Transducer ASR

Büszkeség és balítélet (Pride and Prejudice) Audio, card, Text

Audiobooks in MP3

KnugiHK/WhatsApp-Chat-Exporter

EliteAndroidApps/WhatsApp-Key-DB-Extractor

BBC Basic

Finetuning for ESPnet OWSM Model

Liquid AI’s new STAR model architecture outshines Transformer efficiency, Automated Architecture Synthesis via Targeted Evolution, arxiv

ESPnet pull requests:

The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

Cold-start Active Learning through Self-supervised Language Modeling

ModernBERT

CMU Researchers Introduce TNNGen

Open sourcing h3i

How Neural Networks Learn: A Probabilistic Viewpoint

FStarLang/karamel — KaRaMeL is a tool for extracting low-level F* programs to readable C code

Single directional chamfer distance and non-absolute cosine similarity

Render DensePose — Needs SMPL (which is not open source)

RVC-Boss/GPT-SoVITS — 1 min voice data can also be used to train a good TTS model! (few shot voice cloning)

acids-ircam/ddsp_pytorch — Implementation of Differentiable Digital Signal Processing (DDSP) in Pytorch

DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input, code

jryban/frechet-music-distance

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

huggingface/speech-to-speech

mixup: Data-Dependent Data Augmentation

Gaussian Distributions are Soap Bubbles

Musings on Typicality

nebius/kvax — A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism.

DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion, code, arxiv

PeterGriffinJin/Search-R1 — Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL

MatthewCYM/VoiceBench — VoiceBench: Benchmarking LLM-Based Voice Assistants

IDEA-Research/Grounded-SAM-2 — Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2

Deep Dive into LLMs like ChatGPT

Must-Watch Hungarian TV Series to Improve Your Hungarian

r/hungarian resources

Speaking Hungarian S01

ml-explore/mlx-lm — Run LLMs with MLX

MoshiVis — Teaching Moshi to Converse about Images

Announcing Pixtral 12B

Moshi: a speech-text foundation model for real-time dialogue, code

PyTorch internals

Inductive Moment Matching

mistralai/Mistral-Small-3.1-24B-Instruct-2503

Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment

fasttransform: Reversible Pipelines Made Simple

nvidia/canary-1b-flash — CC-BY

Empowering innovation: The next generation of the Phi family, microsoft/Phi-4-multimodal-instruct

Traveling Waves Integrate Spatial Information Through Time

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, code

yzhuoning/Awesome-CLIP

UniSep: Universal Target Audio Separation with Language Models at Scale

Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations

Scaling Rich Style-Prompted Text-to-Speech Datasets, code — models and data not open.

Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages, code, small, base

NeMo - AED Decoding with N-Gram LM

Audio Compression using Periodic Gabor with Biorthogonal Exchange: Implementation Using the Zak Transform

AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System, code

Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics, dataset, eval code

OWSM CTC, OWSM CTC aligner

e-branchformer encoder

CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition, code, espnet code

llava-hf/llava-v1.6-mistral-7b-hf

01-ai/Yi-34B

Prompt-based Alignment of Headlines and Images Using OpenCLIP, workshop

Optimizing Visual Pairings: A CLIP Framework for Precision News Image Rematching, workshop

rom1504/clip-retrieval — Easily compute clip embeddings and build a clip retrieval system with them

Wav2CLIP: Learning Robust Audio Representations From CLIP, arXiv, code, demo

Fine-tuned CLIP Models are Efficient Video Learners, code

dmlc/decord — An efficient video loader for deep learning with smart shuffling that’s super easy to digest

ibm-granite/granite-speech-3.2-8b, based on granite-3.1-8b-base

Qwen/Qwen2.5-Omni-7B, code

KBLab/kb-whisper-large

KBLab/rixvox-v2

ishine/PnG-BERT – unofficial implementation of PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

People are poorly equipped to detect AI-powered voice clones

r9y9/pysptk — A python wrapper for Speech Signal Processing Toolkit (SPTK).

jameslyons/python_speech_features — provides common speech features for ASR including MFCCs and filterbank energies