Interesting links, 22/01/2025

OuteAI/OuteTTS-0.3-500M, code (1B model is not open)

HidekiKawahara/SparkNG — MATLAB real-time/interactive speech tools. This series is obsolete; SP3ARK is (or will be) the up-to-date series.

VocalTractLab

Hakarps kyrka: audio, revision ?

RandNet-Parareal: a time-parallel PDE solver using Random Neural Networks, OpenReview, code

SEL-BALD: Deep Bayesian Active Learning for Selective Labeling with Instance Rejection, OpenReview

Theoretical Foundations of Deep Selective State-Space Models, OpenReview

Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning, OpenReview

What if English actually SOUNDED like this??

rviz — ROS 3D Robot Visualizer

parler-tts, code, parler_tts_mini_v0.1

HCI-LAB-UGSPEECHDATA/speech_data_ghana_ug — The dataset comprises a 5000-hour speech corpus in Akan, Ewe, Dagbani, Dagaare, and Ikposo. Each language includes 1000 hours of audio from indigenous speakers of the language and 100 hours of transcription.

001 - Hungarian short narrative A0

microsoft/GW-BASIC — The original source code of Microsoft GW-BASIC from 1983

microsoft/MS-DOS — The original sources of MS-DOS 1.25, 2.0, and 4.0 for reference purposes

Standard-Intelligence/hertz-dev — first base model for full-duplex conversational audio

wav2gloss/fieldwork — Mostly open, but includes closed data

juice500ml/finetune_owsm

vllm-project/vllm — A high-throughput and memory-efficient inference and serving engine for LLMs

espnet - Phoneme Recognition with IPAPack

Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound, inference code

How the RWKV language model works, RWKV_in_150_lines.py

clee704/audiodiff — A commandline tool that compares two audio files and prints the difference

PyGyat, code

torch.compile, the missing manual

Ways to use torch.compile

FaceFormer: Speech-Driven 3D Facial Animation with Transformers, code — Depends on Max Planck stuff, so probably not useable.

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer, model, code

modelscope/scepter — SCEPTER is an open-source framework used for training, fine-tuning, and inference with generative models.

TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

rusq/slackdump

black-forest-labs/flux — Official inference repo for FLUX.1 models, open model

Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio, data

deepseek-r1-webgpu

allenai/OLMo — Modeling, training, eval, and inference code for OLMo

m-a-p/Code-Feedback — OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities.

persian-tts-dataset-male, persian-tts-dataset-famale

DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework

Dynamic Time Warping Notebook

kamperh/speech_dtw

You Only Cache Once: Decoder-Decoder Architectures for Language Models, code

hitz-zentroa/latxa — Latxa: An Open Language Model and Evaluation Suite for Basque

kamperh/VectorQuantizedCPC

Looking Backward: Streaming Video-to-Video Translation with Feature Banks

A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR, code

Büszkeség és balítélet (Pride and Prejudice, in Hungarian)

EvaByte/EvaByte — EvaByte is a 6.5B byte-level language model built upon an improved architecture with multibyte prediction and EVA – an efficient attention mechanism designed for scalability and performance.

HKUNLP/efficient-attention — [EVA ICLR’23; LARA ICML’22] Efficient attention mechanisms via control variates, random features, and importance sampling

Probing the 3D Awareness of Visual Foundation Models, mbanani/probe3d

OpenScene: 3D Scene Understanding with Open Vocabularies, code

Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding, OpenReview, code

Gaussian Graph Network: Learning Efficient and Generalizable Gaussian Representations from Multi-view Images, OpenReview

PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds, code

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, code

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound, code (CC-BY)

CTC Networks and Language Models: Prefix Beam Search Explained

A Study on Effects of Implicit and Explicit Language Model Information for DBLSTM-CTC Based Handwriting Recognition

Pronunciation modeling for speech technology

D-LUCEA: Curation of the UCU Accent Project Data

@inbook{orr2017dlucea,
  author    = {Orr, Rosemary and Quené, Hugo},
  title     = {D-LUCEA: Curation of the UCU Accent Project data},
  booktitle = {CLARIN in the Low Countries},
  editor    = {Odijk, Jan and van~Hessen, Arjan},
  publisher = {Ubiquity Press},
  address   = {London},
  year      = {2017},
  month     = {12},
  pages     = {181--193},
  doi       = {10.5334/bbi.15}
}