Interesting links, 19/01/2026
Misc. interesting things.
Integrating Lattice-Free MMI Into End-to-End Speech Recognition
Google discovers emergent temporal abstractions in autoregressive models
These models learn linearly controllable action representations in their residual streams; activating them executes long-horizon behaviors. This enables Internal RL to solve sparse-reward hierarchical tasks… pic.twitter.com/GxOObljGcB
DailyPapers (@HuggingPapers), December 26, 2025
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
icicle-emu/icicle-emu — Icicle is an experimental fuzzing-specific, multi-architecture emulation framework.
language-based-audio-retrieval
FLAM: Frame-Wise Language-Audio Modeling
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining, data
Italian-Ligurian Machine Translation in Its Cultural Context, data
xieh97/language-based-audio-retrieval
Tandem Long-Short Duration-based Modeling for Automatic Speech Recognition
BEA-Base: A Benchmark for ASR of Spontaneous Hungarian, arXiv
Bi-dialectal ASR of Armenian from Naturalistic and Read Speech
Byte Latent Transformer: Patches Scale Better Than Tokens
Hilbert - Foundations of Geometry
Calculus Made Easy
How to REALLY Learn a Language in 2026
pocketsphinx, clarinstudio, files
Deep Learning with PyTorch...
1) Cheat sheet [PDF]: https://t.co/TcRqfgqFOK
2) Learn fundamentals with hands-on coding [PDF]: https://t.co/IsXFjwVAhk
3) #GenerativeAI with Python and PyTorch: https://t.co/hfbERRk99u book v/ @PacktDataML pic.twitter.com/o8XVe0A7O1
Kirk Borne (@KirkDBorne), January 26, 2026
MiMo-Audio: Audio Language Models are Few-Shot Learners, code, Tokenizer, 7B-Base, 7B-Instruct, demo
Evaluation of speech and speech synthesis
- Submission deadline: 30 June 2026
- Submission portal: https://www.editorialmanager.com/ycsla/default.aspx
- Guide for Authors
Keyword Mamba: Spoken keyword spotting with state space models
Is self-supervised learning enough to fill in the gap? A study on speech inpainting
Mispronunciation detection and diagnosis based on large language models
SemanticAudio: Audio Generation and Editing in Semantic Space
A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models
You can now run 70B LLMs on a 4GB GPU.
AirLLM just killed the "you need expensive hardware" excuse.
It runs 70B models on 4GB VRAM.
It loads models one layer at a time, runs 405B Llama 3.1 on 8GB VRAM.
→ No quantization needed by default
→ Run Llama, Qwen, Mistral, Mixtral… pic.twitter.com/L697FHoeCi
Hasan Toor (@hasantoxr), January 31, 2026
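The "one layer at a time" idea from the AirLLM tweet above can be pictured with a toy sketch. This is not AirLLM's code or API: the file layout, the function names (`save_layers`, `apply_layer`, `layerwise_forward`, `demo`) and the tiny ReLU layers are all assumptions made up for illustration. It only shows the general trick of keeping weights on disk and holding a single layer in memory during the forward pass.

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each layer's weight matrix to its own file; return the paths in order."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")  # hypothetical file layout
        with open(path, "w") as f:
            json.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer (an assumption): matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def layerwise_forward(paths, x):
    """Run the forward pass while holding only one layer's weights in memory."""
    for path in paths:
        with open(path) as f:
            weights = json.load(f)  # load this layer only
        x = apply_layer(weights, x)
        del weights  # drop the reference before loading the next layer
    return x


def demo():
    """Three tiny layers stored on disk, applied to a 2-element input."""
    layers = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[2.0, -1.0], [0.5, 0.5]],
        [[1.0, 1.0]],
    ]
    with tempfile.TemporaryDirectory() as d:
        paths = save_layers(layers, d)
        return layerwise_forward(paths, [3.0, 1.0])


if __name__ == "__main__":
    print(demo())
```

Peak memory in this sketch is bounded by the largest single layer rather than the whole model, at the cost of re-reading every layer from disk on each forward pass, so throughput is presumably limited by storage speed.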
The Principles of Diffusion Models
Omnilingual ASR, code, dataset, arXiv
Poor WER when trying to fine-tune Parakeet v2 TDT to other dataset than English, bug
finetuning-parakeet-on-hindi-dataset
HiMo: High-Speed Objects Motion Compensation in Point Clouds, code, dataset
Survey of end-to-end multi-speaker automatic speech recognition for monaural audio
Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception
Disentangling Prosody Representations With Unsupervised Speech Reconstruction
Decoupling Speaker-Independent Emotions for Voice Conversion via Source-Filter Networks
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment, arXiv