oyvindln/vhs-decode

MatFormer: Nested Transformer for Elastic Inference

What Do Self-Supervised Speech Models Know About Words?

What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

RAIVNLab/MatFormer-OLMo — Code repository for the public reproduction of the language modelling experiments on “MatFormer: Nested Transformer for Elastic Inference”

arcee-ai/mergekit — Tools for merging pretrained large language models.

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

E-Branchformer: Branchformer with Enhanced merging for speech recognition

PAM: Prompting Audio-Language Models for Audio Quality Assessment, no code yet

ChatQA: Building GPT-4 Level Conversational QA Models

3 Advanced Document Retrieval Techniques To Improve RAG Systems

Efficiently Modeling Long Sequences with Structured State Spaces, code

Add S4 decoder in ESPnet2

Alignment-Length Synchronous Decoding for RNN Transducer

init owsm v3.1 recipe

Lingit uttaleleksikon for nynorsk

NLB uttaleleksikon for bokmål

Tuva Taledatabase

NST uttaleleksikon for bokmål

NST uttaleleksikon for svensk

N-gram – svensk

LIA sápmi – LIA-korpuset for samiske dialekter

collabora/WhisperSpeech — An Open Source text-to-speech system built by inverting Whisper, space

Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation

SentenceTransformer: A Model For Computing Sentence Embedding

CLAP: Learning Audio Concepts from Natural Language Supervision, code

MambaByte: Token-free Selective State Space Model

kyegomez/MambaByte — Implementation of MambaByte in “MambaByte: Token-free Selective State Space Model” in Pytorch and Zeta

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model, code

Matryoshka Representation Learning, code

BlackMamba: Mixture of Experts for State-Space Models, code

V-IRL: Grounding Virtual Intelligence in Real Life

Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition

MM-LLMs: Recent Advances in MultiModal Large Language Models

Training-Free Consistent Text-to-Image Generation

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval, no code yet

Open Language Model: OLMo

Magyar nyelvtan

Compressing Transformer-based self-supervised models for speech processing

AdANNS: A Framework for Adaptive Semantic Search, code

Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

Self-Discover: Large Language Models Self-Compose Reasoning Structures

REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

Background Removal w/ 🤗 Transformers.js

Scaling Laws for Downstream Task Performance of Large Language Models

Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks

tonyzhaozh/aloha

Fast Timing-Conditioned Latent Audio Diffusion, code

StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation, code, weights, space

segmind/segmoe — Segmind Mixture of Diffusion Experts, blog

Unified Speech-Text Pretraining for Spoken Dialog Modeling

Memory Consolidation Enables Long-Context Video Understanding

AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

CNChTu/FCPE — fast pitch estimator using Transformer

SpiRit-LM: Interleaved Spoken and Written Language Model

Multilingual E5 Text Embeddings: A Technical Report, code

Learning to Route Among Specialized Experts for Zero-Shot Generalization, code

idT5: Indonesian Version of Multilingual T5 Transformer

mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs

Efficient Exploration for LLMs

Can Large Language Models Understand Context?

Spectral State Space Models

K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters, code

CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay

Hungarian - Geography

Texts - Easy Hungarian

Wordwall - Hungarian grammar

Resource List for Learning Hungarian, doc

Code LoRA from Scratch

Accelerating RNN Transducer Inference via Adaptive Expansion Search

CTC Segmentation for ESPnet 2

Implement wav2gloss

gemelo-ai/vocos — Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

CPJKU/onset_detection — Python implementation of the most common spectral based onset detection algorithms.

Taskmaster wiki

A Hackers’ Guide to Language Models

veeresht/CommPy — Digital Communication with Python

RVC-Project/Retrieval-based-Voice-Conversion-WebUI

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation, demo

Affective and Dynamic Beam Search for Story Generation

Automatic vocal tract landmark localization from midsagittal MRI data, code

lucidrains/phenaki-pytorch — Implementation of Phenaki Video, which uses Mask GIT to produce text guided videos of up to 2 minutes in length, in Pytorch

Describing Differences in Image Sets with Natural Language, code

google-research/multinerf — A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF

vasistalodagala/whisper-finetune — Fine-tune and evaluate Whisper models for Automatic Speech Recognition (ASR) on custom datasets or datasets from huggingface.

lucidrains/CALM-pytorch — Implementation of CALM from the paper “LLM Augmented LLMs: Expanding Capabilities through Composition”, out of Google Deepmind

lucidrains/llama-qrlhf — Implementation of the Llama architecture with RLHF + Q-learning

Robust Speech Recognition via Large-Scale Weak Supervision

ActiveVisionLab/Awesome-LLM-3D — Awesome-LLM-3D: a curated list of Multi-modal Large Language Model in 3D world Resources

Are Emergent Abilities of Large Language Models a Mirage?

facebookresearch/Pearl — A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.

metavoiceio/metavoice-src — Foundational model for human-like, expressive TTS

lucidrains/retro-pytorch — Implementation of RETRO, Deepmind’s Retrieval based Attention net, in Pytorch

The Illustrated Retrieval Transformer

lifeiteng/vall-e — PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), Reproduced Demo

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

cisnlp/simalign — Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)

Speech Recognition for Minority Languages Using HuBERT and Model Adaptation

Textually Pretrained Speech Language Models

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation, code

MiniCPM: Unveiling the Potential of End-side Large Language Models, code

Review — Flamingo: A Visual Language Model for Few-Shot Learning

Multimodal Language Models Explained: Visual Instruction Tuning

Mixtures of Experts Unlock Parameter Scaling for Deep RL

Cohere for AI launches open source LLM for 101 languages

100x less compute with GPT-level LLM performance: How a little known open source project could help solve the GPU power conundrum — RWKV looks promising but challenges remain

BAAI-DCAI/Bunny — A family of lightweight multimodal models.

theodorblackbird/lina-speech seems interesting, but it’s not open source, so I don’t care.

Large Language Models, GPT-2 — Language Models Are Unsupervised Multitask Learners

vosen/ZLUDA — CUDA on AMD GPUs