Interesting links, 17/11/2024
Misc. interesting things.
Lawyer Reacts To AI COPYRIGHT CLAIMED MY LAST VIDEO
Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance, code
PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI
Audiocite catalogue — French audiobooks.
OpenSLR 139 is built from Audiocite, but its authors made no attempt to distinguish openly licensed works from closed ones.
ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding
LERF: Language Embedded Radiance Fields, code
NVIDIA/kvpress — LLM KV cache compression made easy
facebookresearch/vrs — VRS is a file format optimized to record & playback streams of sensor data, such as images, audio samples, and any other discrete sensors (IMU, temperature, etc), stored in per-device streams of timestamped records.
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations, code
An Open-Source Gloss-Based Baseline for Spoken to Signed Language Translation, code
rapidsai/cudf — GPU DataFrame Library
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
Aedh Wishes for the Cloths of Heaven
LibriVox Poetry weekly/fortnightly
"Finally a good use case for voice cloning! #warcraft2" pic.twitter.com/turMGWwsHU (Rasmus Bååth, @rabaath@fosstodon.org / @rabaath, November 29, 2024)
Słownik podstawowy szkocko-polski (Polish: "Basic Scottish-Polish Dictionary")
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifier, code
Encode, Tag, Realize: High-Precision Text Editing, code
Target Speaker ASR with Whisper
MotionCLIP: Exposing Human Motion Generation to CLIP Space, code
NB Uttale: A Norwegian Pronunciation Lexicon with Dialect Variation
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling, code
High-Fidelity Audio Compression with Improved RVQGAN, code
MEXMA: Token-level objectives improve sentence representations, code, model
Sylber: Syllabic Embedding Representation of Speech from Raw Audio, code
Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech, code
Coding Speech through Vocal Tract Kinematics, model
Cross-domain Neural Pitch and Periodicity Estimation, code
maxrmorrison/torchcrepe — Pytorch implementation of the CREPE pitch tracker
A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis in Quantized Latent Spaces
Human Motion Diffusion Model, code — Uses SMPL-X, so not usably open source.
HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes, code — Uses SMPL-X.
Uni3D: Exploring Unified 3D Representation at Scale, code
allenai/objaverse — Partially open dataset of annotated 3D objects.
OpenScene: 3D Scene Understanding with Open Vocabularies, code
DigitalPhonetics/IMS-Toucan — Controllable and fast Text-to-Speech for over 7000 languages! demo
Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition
Sprakbanken/norwegian_orthographic_mapping — the script is under the Apache License 2.0 (ASL 2).
SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR
SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis, code
Finally, a Replacement for BERT
Wiki Loves Downloads — Divides the images of a category from Wikimedia Commons into a desired number of lists and generates these in the form of text files with links to the respective images so that they can be downloaded using a download manager.
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language, code
TaoRuijie/ECAPA-TDNN — Unofficial reimplementation of ECAPA-TDNN for speaker recognition
ARBML/klaam — Arabic speech recognition, classification and text-to-speech.
hexgrad/Kokoro-82M — Kokoro 82M parameter model.
NVIDIA Sana — Now open source (code only, not models)
ByteDance/Sa2VA-8B — Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels.
The GAN is dead; long live the GAN! A Modern GAN Baseline
Ancient Language Learning DESTROYS Modern Methods — Clickbait title, but interesting video.
microsoft/Microsoft-3D-Movie-Maker — This is the source code for the original Microsoft 3D Movie Maker released in 1995
foone/SierraDeathGenerator — deathgenerator.com
Sex, Death, and Moral Outrage: remembering Bill Labov
ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes, code
3D-LLM: Injecting the 3D World into Large Language Models, code
SIMS: Simulating Human-Scene Interactions with Real World Script Planning
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision, pytorch implementation
nGPT: Normalized Transformer with Representation Learning on the Hypersphere, code, lucidrains
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias, lucidrains
Ring Attention with Blockwise Transformers for Near-Infinite Context, lucidrains
Spline-based Transformers, lucidrains
zakaton/Pink-Trombone — A programmable version of Neil Thapen’s Pink Trombone
IPS-LMU/octra — OCTRA is a web-application for the orthographic transcription of audio files. Manual
interactiveaudiolab/penn — Pitch Estimating Neural Networks (PENN)