Interesting links, 17/11/2024

Audiocite catalogue — French audiobooks.

OpenSLR 139 is built from Audiocite, but its creators made no attempt to distinguish between open and closed works.

ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding

LERF: Language Embedded Radiance Fields, code

NVIDIA/kvpress — LLM KV cache compression made easy

facebookresearch/vrs — VRS is a file format optimized to record and play back streams of sensor data, such as images, audio samples, and any other discrete sensors (IMU, temperature, etc.), stored in per-device streams of timestamped records.

Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations, code

An Open-Source Gloss-Based Baseline for Spoken to Signed Language Translation, code

rapidsai/cudf — GPU DataFrame Library

Corpas Náisiúnta na Gaeilge (National Corpus of Irish)

HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

Aedh Wishes for the Cloths of Heaven

LibriVox API info

LibriVox Poetry weekly/fortnightly

Rasmus Bååth (@rabaath@fosstodon.org) (@rabaath), November 29, 2024: "Finally a good use case for voice cloning! #warcraft2" pic.twitter.com/turMGWwsHU

Vocal Remover and Isolation

Transformer in Excel

Unsupervised speech intelligibility assessment with utterance level alignment distance between teacher and learner Wav2Vec-2.0 representations

Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction

Słownik podstawowy szkocko-polski (Basic Scottish-Polish dictionary)

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifier, code

Introduction to ggml

Encode, Tag, Realize: High-Precision Text Editing, code

Target Speaker ASR with Whisper

MotionCLIP: Exposing Human Motion Generation to CLIP Space, code

NB Uttale: A Norwegian Pronunciation Lexicon with Dialect Variation

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling, code

High-Fidelity Audio Compression with Improved RVQGAN, code

Multi-Stream Transformers

Project Aria tools, code

MEXMA: Token-level objectives improve sentence representations, code, model

edwko/OuteTTS

Sylber: Syllabic Embedding Representation of Speech from Raw Audio, code

Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech, code

Coding Speech through Vocal Tract Kinematics, model

Cross-domain Neural Pitch and Periodicity Estimation, code

maxrmorrison/torchcrepe — PyTorch implementation of the CREPE pitch tracker

A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis in Quantized Latent Spaces

Human Motion Diffusion Model, code — Uses SMPL-X, so not usably open source.

HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes, code — Uses SMPL-X.

Uni3D: Exploring Unified 3D Representation at Scale, code

allenai/objaverse — Partially open dataset of annotated 3D objects.

OpenScene: 3D Scene Understanding with Open Vocabularies, code

Tkd-Alex/WhatsDump

DigitalPhonetics/IMS-Toucan — Controllable and fast Text-to-Speech for over 7000 languages! demo

Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains

Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition

Sprakbanken/norwegian_orthographic_mapping — The script is licensed under the Apache License 2.0.

Sprakbanken/g2p-nb

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

Leipzig Glossing Rules

Summary for Hungarian verbs

brucemiller/LaTeXML

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis, code

Finally, a Replacement for BERT

Wiki Loves Downloads — Splits the images in a Wikimedia Commons category into a chosen number of lists and writes each list to a text file of image links, so that the images can be fetched with a download manager.

Lingua Libre language gallery

The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language, code

World’s First MIDI Shellcode

Einsum in Depth

serengil/deepface

Cassette Tape Plays MP3s

TaoRuijie/ECAPA-TDNN — Unofficial reimplementation of ECAPA-TDNN for speaker recognition

ARBML/klaam — Arabic speech recognition, classification and text-to-speech.

hexgrad/Kokoro-82M — Kokoro, an 82M-parameter text-to-speech model.

NVIDIA Sana — Now open source (code only, not models)

Emoji Kitchen

microsoft/phi-4

ByteDance/Sa2VA-8B — Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels.

The GAN is dead; long live the GAN! A Modern GAN Baseline

Why discrete units

Ancient Language Learning DESTROYS Modern Methods — Clickbait title, but interesting video.

microsoft/Microsoft-3D-Movie-Maker — This is the source code for the original Microsoft 3D Movie Maker, released in 1995.

foone/SierraDeathGenerator — deathgenerator.com

Sex, Death, and Moral Outrage: remembering Bill Labov

ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes, code

3D-LLM: Injecting the 3D World into Large Language Models, code

SIMS: Simulating Human-Scene Interactions with Real World Script Planning

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision, pytorch implementation

nGPT: Normalized Transformer with Representation Learning on the Hypersphere, code, lucidrains

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias, lucidrains