UI-TARS: Pioneering Automated GUI Interaction with Native Agents, no code yet, desktop app, model

Swedish terms with IPA pronunciation

Combiner: full attention transformer with sparse computation cost, pdf

OuteAI/OuteTTS-0.3-500M, code (1B model is not open)

An Sgéaluidhe Gaedhealach

Diffusion Models and Their Applications

HidekiKawahara/SparkNG — MATLAB real-time/interactive speech tools. This series is obsolete. SP3ARK is the up-to-date series (will be).

VocalTractLab

Hakarps kyrka: audio, revision ?

RandNet-Parareal: a time-parallel PDE solver using Random Neural Networks, OpenReview, code

SEL-BALD: Deep Bayesian Active Learning for Selective Labeling with Instance Rejection, OpenReview

Theoretical Foundations of Deep Selective State-Space Models, OpenReview

Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning, OpenReview

Related: Class-Incremental Learning: Survey and Performance Evaluation on Image Classification, code

What if English actually SOUNDED like this??

rviz — ROS 3D Robot Visualizer

parler-tts, code, parler_tts_mini_v0.1

HCI-LAB-UGSPEECHDATA/speech_data_ghana_ug — The dataset comprises a 5000-hour speech corpus in Akan, Ewe, Dagbani, Daagare, and Ikposo. Each language includes 1000 hours of audio from indigenous speakers of the language and 100 hours of transcription.

001 - Hungarian short narrative A0

microsoft/GW-BASIC — The original source code of Microsoft GW-BASIC from 1983

microsoft/MS-DOS — The original sources of MS-DOS 1.25, 2.0, and 4.0 for reference purposes

Standard-Intelligence/hertz-dev — first base model for full-duplex conversational audio

wav2gloss/fieldwork — Mostly open, but includes closed data

juice500ml/finetune_owsm

vllm-project/vllm — A high-throughput and memory-efficient inference and serving engine for LLMs

espnet - Phoneme Recognition with IPAPack

Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound, inference code

How the RWKV language model works, RWKV_in_150_lines.py

clee704/audiodiff — A command-line tool that compares two audio files and prints the difference

PyGyat, code

torch.compile, the missing manual

Ways to use torch.compile

FaceFormer: Speech-Driven 3D Facial Animation with Transformers, code — Depends on Max Planck stuff, so probably not useable.

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer, model, code

modelscope/scepter — SCEPTER is an open-source framework used for training, fine-tuning, and inference with generative models.

TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

rusq/slackdump

black-forest-labs/flux — Official inference repo for FLUX.1 models, open model

Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio, data

deepseek-r1-webgpu

allenai/OLMo — Modeling, training, eval, and inference code for OLMo

m-a-p/Code-Feedback — OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities.

persian-tts-dataset-male, persian-tts-dataset-famale

DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework

Dynamic Time Warping Notebook

kamperh/speech_dtw

You Only Cache Once: Decoder-Decoder Architectures for Language Models, code

hitz-zentroa/latxa — Latxa: An Open Language Model and Evaluation Suite for Basque

kamperh/VectorQuantizedCPC

Looking Backward: Streaming Video-to-Video Translation with Feature Banks

A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR, code

Büszkeség és balítélet (Pride and Prejudice, in Hungarian)

EvaByte/EvaByte — EvaByte is a 6.5B byte-level language model built upon an improved architecture with multibyte prediction and EVA – an efficient attention mechanism designed for scalability and performance.

HKUNLP/efficient-attention — [EVA ICLR’23; LARA ICML’22] Efficient attention mechanisms via control variates, random features, and importance sampling

Probing the 3D Awareness of Visual Foundation Models, mbanani/probe3d

OpenScene: 3D Scene Understanding with Open Vocabularies, code

Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding, OpenReview, code

Gaussian Graph Network: Learning Efficient and Generalizable Gaussian Representations from Multi-view Images, OpenReview

PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds, code

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, code

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound, code (CC-BY)

CTC Networks and Language Models: Prefix Beam Search Explained

A Study on Effects of Implicit and Explicit Language Model Information for DBLSTM-CTC Based Handwriting Recognition

Pronunciation modeling for speech technology

D-LUCEA: Curation of the UCU Accent Project Data

@inbook{orr2017dlucea,
author = {Orr, Rosemary and Quené, Hugo},
year = {2017},
month = {12},
pages = {181-193},
booktitle = {CLARIN in the Low Countries},
editor    = {Odijk, Jan and van~Hessen, Arjan},
publisher = {Ubiquity Press},
address   = {London},
title = {D-LUCEA: Curation of the UCU Accent Project data},
doi = {10.5334/bbi.15}
}

rhasspy/sv_kaldi-rhasspy

Recent Advances in Discrete Speech Tokens: A Review

AbrahamSanders/codec-bpe — Implementation of Acoustic BPE (Shen et al., 2024), extended for RVQ-based Neural Audio Codecs

Undergraduate Upends a 40-Year-Old Data Science Conjecture

https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q56216056&limit=50&offset=120%7C110214728&dir=next — Wikidata property to identify lexicographical entities (Q56216056)

Implement batch decode for owsm-ctc

Some of My Best Friends Are Linguists

Generative AI and the Automating of Academia

VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model, model

Turn your Jupyter Notebook into interactive Presentation Slides using Anaconda

Introduction to Linear Prediction

The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language, code

Digital Archive of Pictures

FoQA: A Faroese Question-Answering Dataset, code

Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation, code

LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation, code

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks, code

Prompt What You Need: Enhancing Segmentation in Rainy Scenes with Anchor-based Prompting

SAMPolyBuild: Adapting the Segment Anything Model for polygonal building extraction

The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms

Introduction to Flow Matching and Diffusion Models

Phi-4 Technical Report

microsoft/PhiCookBook

microsoft/dstoolkit-phi2-finetune — This repository contains step-by-step instructions on how to finetune Microsoft’s Phi-2 model with your own data.

microsoft/Phi-4-multimodal-instruct

Probabilistic Artificial Intelligence

sprakbankental/braxen

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens, code

March of the Penguins