Interesting links, 16/06/2025
Misc. interesting things.
MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages, code
SLURP: A Spoken Language Understanding Resource Package, data — text is open, audio is not
ZeroSep: Separate Anything in Audio with Zero Training, arXiv, code
There are 6 forms of depression, study shows. Here’s how they’re different.
Marconi Union - Weightless — supposed to help with anxiety
Dependency Parsing Evaluation for Low-resource Spontaneous Speech
Insert OCRed text and annotations in DjVu
TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks, code
Anthropic wins key US ruling on AI training in authors’ copyright lawsuit
CILI: the Collaborative Interlingual Index
omwn/omw-data — This packages up data for the Open Multilingual Wordnet
docker image save — Save one or more images to a tar archive
Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning
facebook/vjepa2-vitl-fpc64-256 — actually open source.
Terrible things happen in life – but it is possible to recover from them
InteractAnything: Zero-shot Human Object-Interaction Synthesis via LLM Feedback and Object Affordance Parsing, arXiv
Magyar népmesék sorozat, Hungarian Folk Tales
Finnish-NLP/wav2vec2-xlsr-300m-finnish-lm
- Conference: October 19-22, 2025
- Paper submission (5-6 pages including references, IEEE format): July 7, 2025
- OpenReview
- Overleaf
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Neural Network Activation Functions, pic.twitter.com/WYKOER1Ldz (Dan | Machine Learning Engineer, @DanKornas, July 2, 2025)
The single biggest argument about statistics: is probability frequentist or Bayesian? It's neither, and I'll explain why. Buckle up. Deep-dive explanation incoming. pic.twitter.com/PYlvOAGyB6 (Tivadar Danka, @TivadarDanka, July 3, 2025)
timtadh/zhang-shasha — Tree edit distance using the Zhang Shasha algorithm
The ParlaMint corpora of parliamentary proceedings
Spoken Spanish PoS tagging: gold standard dataset
ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark, code, dataset
Self-supervised learning of speech representations with Dutch archival data
hitachi-speech/EEND — EEND (End-to-End Neural Diarization) is a neural-network-based speaker diarization method.
myshell-ai/MeloTTS — High-quality multi-lingual text-to-speech library by MyShell.ai. Support English, Spanish, French, Chinese, Japanese and Korean.
AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
Self-supervised Learning (SSL) vs Contrastive Language-Image (CLIP) models is a never-ending battle. What about using both? This Google paper does exactly that, and results are really good on many different tasks. TIPS is a model trained with a CLIP loss and 2 SSL losses [1/9] pic.twitter.com/4tKDfM152W (Gabriele Berton, @gabriberton, July 4, 2025)
TIPS: Text-Image Pretraining with Spatial awareness, code
Binary Latent Diffusion, ZeWang95/BinaryLatentDiffusion
Phone-Level Pronunciation Scoring for L1 Using Weighted-Dynamic Time Warping
An Investigation of the Relation Between Grapheme Embeddings and Pronunciation for Tacotron-based Systems, models
The SIWIS French Speech Synthesis Database
Towards Distributed Neural Architectures
https://github.com/openai/whisper/commit/31243bad24cc746f07d4c8bfdd2d974872cb1803 — Add option to carry initial_prompt with the sliding window
Voice Conversion With Just Nearest Neighbors, code
atong01/conditional-flow-matching — TorchCFM: a Conditional Flow Matching library
einspace: Searching for Neural Architectures from Fundamental Operations
Voxtral, mistralai/Voxtral-Mini-3B-2507
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training, code
Real-Time Textless Dialogue Generation
Prosody Labeling with Phoneme-BERT and Speech Foundation Models