LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing

karpathy/minbpe — Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

MAGVIT: Masked Generative Video Transformer, code

DiffiT: Diffusion Vision Transformers for Image Generation

A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis in Quantized Latent Spaces

How to Train Data-Efficient LLMs

Fine-tuning Large Language Models for Adaptive Machine Translation

Robust agents learn causal world models

Mamba: The Hard Way

open-mmlab/Amphion — Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.

The effects of automatic speech recognition quality on human transcription latency — “We present results from 2 studies which indicate that starting with the ASR output is worse unless it is sufficiently accurate (Word Error Rate of under 30%).”

Lexicographical data/Statistics/Counts of various things by language

OLMo - Open Language Model

OpenAccess-AI-Collective/axolotl — Go ahead and axolotl questions

Listen, Think, and Understand

Neural Network Diffusion

mistralai/cookbook

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Encoding of multi-modal emotional information via personalized skin-integrated wireless facial interface

alterebro/IPA-Keyboard

BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

microsoft/torchscale — Foundation Architecture for (M)LLMs

A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models, code

lucidrains/flamingo-pytorch

lucidrains/RETRO-pytorch