Analyzing Open AI’s Whisper ASR Accuracy: Word Error Rates Across Languages and Model Sizes

Lauler/rixvox-alignments

swerik-project/riksdagen-records

sherpa-onnx - audio-tagging-from-a-file

A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement

Evaluating Gammatone Frequency Cepstral Coefficients with Neural Networks for Emotion Recognition from Speech

Foundation Transformers

Everything You Always Wanted To Know About Mathematics But didn’t even know to ask

Conditional flow matching

How Much Context Does My Attention-Based ASR System Need?, code

SkalskiP/top-cvpr-2024-papers

Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation, code

Understanding FAISS

An Empirical Study of Mamba-based Language Models, model, code

pkufool/librilight-text

Open CLIP - SigLipLoss

Faiss: A library for efficient similarity search

Faiss - Brute force search without an index

Robust solutions for audio fingerprinting

Neural Audio Fingerprint for High-Specific Audio Retrieval Based on Contrastive Learning, code

Audio Fingerprinting with Holographic Reduced Representations

MahmudulAlam/Holographic-Reduced-Representations

Learning with Holographic Reduced Representations, code

spotify/basic-pitch — A lightweight yet powerful audio-to-MIDI converter with pitch bend detection

Step-by-Step Diffusion: An Elementary Tutorial

Fourier Diffusion Models: A Method to Control MTF and NPS in Score-Based Stochastic Image Generation

Time Series Diffusion in the Frequency Domain, code

Data Augmentation in Time and Doppler Frequency Domain for Radar-based Gesture Recognition

Frequency Domain Audio Synthesis – With IFFT and Oscillators

Trajectories and revolutions in popular melody based on U.S. charts from 1950 to 2023

Speech Recognition and Multi-Speaker Diarization of Long Conversations, data

Vision Language Models Explained

| Model | Actually open |
|---|---|
| LLaVA 1.6 (Hermes 34B) | |
| deepseek-vl-7b-base | |
| DeepSeek-VL-Chat | |
| moondream2 | |
| CogVLM-base | [❌](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE) |
| CogVLM-Chat | [❌](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE) |
| Fuyu-8B | |
| KOSMOS-2 | |
| Qwen-VL | |
| Qwen-VL-Chat | |
| Yi-VL-34B | |

nmslib/nmslib — Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, code

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition, code

It’s Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition, code

BAT: Learning to Reason about Spatial Sounds with Large Language Models

SpiRit-LM: Interleaved Spoken and Written Language Model

WavLLM: Towards Robust and Adaptive Speech Large Language Model, code

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity, code

Speech Trident - Awesome Speech LM

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model

Memory3: Language Modeling with Explicit Memory

Information Theory: A Tutorial Introduction

Data curation via joint example selection further accelerates multimodal learning

Depth Anything V2, code, demo, coreml, model

Alice’s Adventures in a differentiable wonderland

supabase/supabase — The open source Firebase alternative. Supabase gives you a dedicated Postgres database to build your web, mobile, and AI applications.

HazyResearch/flash-fft-conv — FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data, code in NeMo, model closed but available.

23606 Workshop on Human Motion Generation, key moment here

microsoft/graphrag

leaningtech/webvm — Virtual Machine for the Web

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Audio Spotforming Using Nonnegative Tensor Factorization with Attractor-Based Regularization

Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures

Understanding Transformers via N-Gram Statistics

facebookincubator/submitit — Python 3.8+ toolbox for submitting jobs to Slurm

Video Diffusion Alignment via Reward Gradients, model

Deep Dive into LSTMs and xLSTMs by Hand

Chronos: Learning the Language of Time Series, code

lm-sys/FastChat

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning, code

Introducing Triton: Open-source GPU programming for neural networks

CUDA kernels in PyTorch made easy with Numba, notebook

a-brassard/ACORN — Home repository for the ACORN dataset: 3,500 explanations with aspect-wise human ratings of their quality.

Let’s reproduce GPT-2

facebookresearch/data2vec_vision

Fine-tuning Florence-2 - Microsoft’s Cutting-edge Vision Language Models

ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data, code

ColPali: Efficient Document Retrieval with Vision Language Models, code

Block Transformer: Global-to-Local Language Modeling for Fast Inference, code

AND: Audio Network Dissection for Interpreting Deep Acoustic Models

Sound Field Synthesis with Acoustic Waves

SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata

A multi-speaker multi-lingual voice cloning system based on VITS2 for LIMMITS 2024 challenge

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, code, model

AI by Hand:

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation, code, model

Zielon/PBRVulkan — Vulkan Real-time Path Tracer Engine

UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies

Generative AI Handbook: A Roadmap for Learning Resources

Crossmodal ASR Error Correction with Discrete Speech Units

CMU-MOSI Dataset — The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) dataset is a collection of 2199 opinion video clips.

The essence of calculus

Imitation and Mechanisms of Joint Attention: A Developmental Structure for Building Social Skills on a Humanoid Robot

Polish Public Domain Works

USER-LLM: Efficient LLM contextualization with user embeddings, arXiv

Perceiver: General Perception with Iterative Attention

Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis, code, vocos-mel-24khz, vocos-encodec-24khz

Improving Speech Decoding from ECoG with Self-Supervised Pretraining

BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation

TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification

Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion, code

Adapting Frechet Audio Distance for Generative Music Evaluation, arXiv, code

TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Spiketrum: An FPGA-based Implementation of a Neuromorphic Cochlea

Contextual Position Encoding: Learning to Count What’s Important

The Raven: Hungarian

IN LOVE WITH THE CZARINA

Simplified Grammar of the Hungarian Language

In Love With the Czarina, and Other Stories, A VAKMERŐ

The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

netease-youdao/EmotiVoice — a Multi-Voice and Prompt-Controlled TTS Engine

Voice Cloning with your personal data

ricosjp/truck — Truck is a rust CAD kernel

Hungarian Body Parts Flashcards

Hungarian Flashcards

OpenVoice: Versatile Instant Voice Cloning, code

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

A Complete Guide to Write your own Transformers

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models, FACodec model, code

Decades-Old Beer Ads Stitched Straight Into Original Star Wars Movies Go Viral

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR

I Felt a Little Homosexual Today, So I Called in Sick: The Formation of “Reverse Discourse” by Swedish Gay Activists in the 1970s

SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought

UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

Robots Beyond Borders: The Role of Social Robots in Spoken Second Language Practice

Once “too scary” to release, GPT-2 gets squeezed into an Excel spreadsheet

ianand/spreadsheets-are-all-you-need

Open sourcing MS-DOS 4.0

microsoft/MS-DOS

sarah-walker-pcem/pcem — PC emulator

Infinite Mac — Infinite Mac is a collection of classic Macintosh and NeXT system releases and software, all easily accessible from the comfort of a (modern) web browser.

previous - NeXT emulator

dingusdev/dingusppc — PowerPC Mac emulator

Basilisk II, github

autc04/executor — A modern fork of the classic Mac emulator

Can Learned Optimization Make Reinforcement Learning Less Difficult?, code

google/learned_optimization — Meta-learning optimizers and more with JAX

Click-Gaussian: Interactive Segmentation to Any 3D Gaussians

Mistral NeMo

E2GAN: Efficient Training of Efficient GANs for Image-to-Image Translation

nerfstudio-project/nerfstudio — A collaboration friendly studio for NeRFs

Context Embeddings for Efficient Answer Generation in RAG

SpreadsheetLLM: Encoding Spreadsheets for Large Language Models