Interesting links, 22/01/2025
Misc. interesting things.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents, no code yet, desktop app, model
Swedish terms with IPA pronunciation
Combiner: full attention transformer with sparse computation cost, pdf
OuteAI/OuteTTS-0.3-500M, code (1B model is not open)
Diffusion Models and Their Applications
HidekiKawahara/SparkNG — MATLAB real-time/interactive speech tools. This series is obsolete. SP3ARK is the up-to-date series (will be).
Hakarps kyrka: audio, revision ?
RandNet-Parareal: a time-parallel PDE solver using Random Neural Networks, OpenReview, code
SEL-BALD: Deep Bayesian Active Learning for Selective Labeling with Instance Rejection, OpenReview
Theoretical Foundations of Deep Selective State-Space Models, OpenReview
Task-recency bias strikes back: Adapting covariances in Exemplar-Free Class Incremental Learning, OpenReview
Related: Class-Incremental Learning: Survey and Performance Evaluation on Image Classification, code
What if English actually SOUNDED like this??
rviz — ROS 3D Robot Visualizer
parler-tts, code, parler_tts_mini_v0.1
HCI-LAB-UGSPEECHDATA/speech_data_ghana_ug — The dataset comprises a 5000-hour speech corpus in Akan, Ewe, Dagbani, Daagare, and Ikposo. Each language includes 1000 hours of audio from indigenous speakers of the language and 100 hours of transcription.
001 - Hungarian short narrative A0
microsoft/GW-BASIC — The original source code of Microsoft GW-BASIC from 1983
microsoft/MS-DOS — The original sources of MS-DOS 1.25, 2.0, and 4.0 for reference purposes
Standard-Intelligence/hertz-dev — first base model for full-duplex conversational audio
wav2gloss/fieldwork — Mostly open, but includes closed data
vllm-project/vllm — A high-throughput and memory-efficient inference and serving engine for LLMs
espnet - Phoneme Recognition with IPAPack
Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound, inference code
How the RWKV language model works, RWKV_in_150_lines.py
clee704/audiodiff — A commandline tool that compares two audio files and prints the difference
torch.compile, the missing manual
FaceFormer: Speech-Driven 3D Facial Animation with Transformers, code — Depends on Max Planck stuff, so probably not useable.
ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer, model, code
modelscope/scepter — SCEPTER is an open-source framework used for training, fine-tuning, and inference with generative models.
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer
black-forest-labs/flux — Official inference repo for FLUX.1 models. open model
Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio, data
allenai/OLMo — Modeling, training, eval, and inference code for OLMo
m-a-p/Code-Feedback — OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities.
persian-tts-dataset-male, persian-tts-dataset-famale
DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework
You Only Cache Once: Decoder-Decoder Architectures for Language Models, code
hitz-zentroa/latxa — Latxa: An Open Language Model and Evaluation Suite for Basque
Looking Backward: Streaming Video-to-Video Translation with Feature Banks
A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR, code
EvaByte/EvaByte — EvaByte is a 6.5B byte-level language model built upon an improved architecture with multibyte prediction and EVA – an efficient attention mechanism designed for scalability and performance.
HKUNLP/efficient-attention — [EVA ICLR’23; LARA ICML’22] Efficient attention mechanisms via control variates, random features, and importance sampling
Probing the 3D Awareness of Visual Foundation Models, mbanani/probe3d
OpenScene: 3D Scene Understanding with Open Vocabularies, code
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding, OpenReview, code
Gaussian Graph Network: Learning Efficient and Generalizable Gaussian Representations from Multi-view Images, OpenReview
PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds, code
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, code
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound, code (CC-BY)
CTC Networks and Language Models: Prefix Beam Search Explained
Pronunciation modeling for speech technology
D-LUCEA: Curation of the UCU Accent Project Data
@inbook{orr2017dlucea,
  author    = {Orr, Rosemary and Quené, Hugo},
  year      = {2017},
  month     = {12},
  pages     = {181--193},
  booktitle = {CLARIN in the Low Countries},
  editor    = {Odijk, Jan and van~Hessen, Arjan},
  publisher = {Ubiquity Press},
  address   = {London},
  title     = {D-LUCEA: Curation of the UCU Accent Project data},
  doi       = {10.5334/bbi.15}
}
Recent Advances in Discrete Speech Tokens: A Review
AbrahamSanders/codec-bpe — Implementation of Acoustic BPE (Shen et al., 2024), extended for RVQ-based Neural Audio Codecs
Undergraduate Upends a 40-Year-Old Data Science Conjecture
https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q56216056&limit=50&offset=120%7C110214728&dir=next — Wikidata property to identify lexicographical entities (Q56216056)
implement batch decode for owsm-ctc
Some of My Best Friends Are Linguists
Generative AI and the Automating of Academia
VoxCommunis: A Corpus for Cross-linguistic Phonetic Analysis
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model, model
Turn your Jupyter Notebook into interactive Presentation Slides using Anaconda
Introduction to Linear Prediction
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language, code
FoQA: A Faroese Question-Answering Dataset, code
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation, code
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation, code
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks, code
Prompt What You Need: Enhancing Segmentation in Rainy Scenes with Anchor-based Prompting
SAMPolyBuild: Adapting the Segment Anything Model for polygonal building extraction
The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
Introduction to Flow Matching and Diffusion Models
microsoft/dstoolkit-phi2-finetune — This repository contains step by step instructions on how to finetune Microsoft’s Phi-2 model with your own data.
microsoft/Phi-4-multimodal-instruct
Probabilistic Artificial Intelligence
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens, code