Interesting links, 01/02/2024
Misc. interesting things.
VHS-Decode has to be one of the most interesting software projects I've stumbled upon recently.
— LaurieWired (@lauriewired) January 31, 2024
It replaces the decoding process of a VHS tape with a software stack, bypassing most of the original hardware.
Using an FPGA device for RF capture like the MiSTer, it creates a… pic.twitter.com/9c2R2Xhkyf
How can we get LLM-based agents to understand the *visual structure* of a webpage? Announcing Llama2D🦙👀!
— Rohan Pandey (@khoomeik) February 2, 2024
We fine-tuned Llama on OCR'd webpage screenshots but with 2D positional embeddings, enabling it to see the structure of a webpage rather than just a sequence of tokens. 🧵 pic.twitter.com/Rz2JocZyOq
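The core trick is to inject each OCR token's screen position into its embedding. A minimal sketch of the general idea, using standard sinusoidal embeddings with half the dimensions for x and half for y (the function name and scheme are my assumptions, not Llama2D's exact recipe):

```python
import numpy as np

def embed_with_2d_positions(token_embs, xy, dim):
    # Sinusoidal embeddings for the x coordinate fill the first half of the
    # dims, y fills the second half; the sum is added to each token embedding.
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half // 2) / (half // 2)))

    def sincos(pos):
        ang = pos[:, None] * freqs[None, :]
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

    pos_emb = np.concatenate([sincos(xy[:, 0]), sincos(xy[:, 1])], axis=1)
    return token_embs + pos_emb

# Toy example: 6 OCR tokens laid out in a 2x3 grid, embedding dim 8.
tokens = np.zeros((6, 8))
coords = np.array([[0, 0], [120, 0], [0, 40], [120, 40], [0, 80], [120, 80]],
                  dtype=float)
out = embed_with_2d_positions(tokens, coords, dim=8)
```

Unlike 1D positions, two tokens on the same row (or column) now share part of their positional signal, which is exactly the layout structure a flat token sequence throws away.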
BlackMamba Mixture of Experts
— Carlos E. Perez (@IntuitMachine) February 3, 2024
BlackMamba is a novel architecture which combines state-space models (SSMs) with mixture of experts (MoE). It uses Mamba as its SSM block and a Switch Transformer as its MoE block base. BlackMamba is extremely low latency for generation and… pic.twitter.com/ojhmAKfsUK
MatFormer: Nested Transformer for Elastic Inference
What Do Self-Supervised Speech Models Know About Words?
Progress on dense retrievers is saturating.
— Omar Khattab (@lateinteraction) December 18, 2023
The best retrievers in 2024 will apply new forms of late interaction, i.e. scalable attention-like scoring for multi-vector embeddings.
A🧵on late interaction, how it works efficiently, and why/where it's been shown to improve quality pic.twitter.com/2XG33TtM9R
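The thread's "scalable attention-like scoring for multi-vector embeddings" is, in its best-known form, ColBERT-style MaxSim: each query token embedding picks its best-matching document token embedding, and those maxima are summed. A minimal sketch with toy vectors:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    # Normalize so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                       # (query tokens, doc tokens)
    # Late interaction: each query token keeps its best doc-token match.
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))              # 3 query token embeddings, dim 4
doc_a = rng.normal(size=(5, 4))          # random document
doc_b = np.vstack([q, rng.normal(size=(2, 4))])  # document containing the query tokens
better = maxsim_score(q, doc_b) > maxsim_score(q, doc_a)
```

Because the per-token similarities reduce to a max and a sum rather than full cross-attention, the document-side embeddings can be precomputed and indexed, which is what makes this scalable.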
Is it possible to teach LLMs a different language? 🤔 Can we transfer the capabilities of LLMs, like Llama, from English to non-English language?
— Philipp Schmid (@_philschmid) January 4, 2024
A group of researchers from Fudan University tried to answer those questions by running vast experiments on extending vocabulary… pic.twitter.com/fJLYFyQOqP
RAIVNLab/MatFormer-OLMo — Code repository for the public reproduction of the language modelling experiments on “MatFormer: Nested Transformer for Elastic Inference”
arcee-ai/mergekit — Tools for merging pretrained large language models.
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
E-Branchformer: Branchformer with Enhanced merging for speech recognition
PAM: Prompting Audio-Language Models for Audio Quality Assessment, no code yet
ChatQA: Building GPT-4 Level Conversational QA Models
3 Advanced Document Retrieval Techniques To Improve RAG Systems
Efficiently Modeling Long Sequences with Structured State Spaces, code
Alignment-Length Synchronous Decoding for RNN Transducer
Lingit pronunciation lexicon for Nynorsk

LIA Sápmi – the LIA corpus for Sámi dialects
collabora/WhisperSpeech — An Open Source text-to-speech system built by inverting Whisper. Space
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
SentenceTransformer: A Model For Computing Sentence Embedding
CLAP: Learning Audio Concepts from Natural Language Supervision, code
MambaByte: Token-free Selective State Space Model
kyegomez/MambaByte — Implementation of MambaByte in “MambaByte: Token-free Selective State Space Model” in Pytorch and Zeta
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model, code
Matryoshka Representation Learning, code
It’s year 2024, and n-gram LMs are making a comeback!!
— Jiacheng Liu (@liujc1998) February 1, 2024
We develop infini-gram, an engine that efficiently processes n-gram queries with unbounded n and trillion-token corpora. It takes merely 20 milliseconds to count the frequency of an arbitrarily long n-gram in RedPajama (1.4T… pic.twitter.com/07O1o5pahv
BlackMamba: Mixture of Experts for State-Space Models, code
V-IRL: Grounding Virtual Intelligence in Real Life
MM-LLMs: Recent Advances in MultiModal Large Language Models
Training-Free Consistent Text-to-Image Generation
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval, no code yet
Compressing Transformer-based self-supervised models for speech processing
AdANNS: A Framework for Adaptive Semantic Search, code
SGI's 3D File System Navigator (1993) was real pic.twitter.com/UWbx3PS3Kk
— Retro Tech Dreams (@RetroTechDreams) February 6, 2024
Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience
Self-Discover: Large Language Models Self-Compose Reasoning Structures
RAG From Scratch: Video series focused on understanding the RAG landscape
— LangChain (@LangChainAI) February 6, 2024
RAG is central for LLM application development, connecting LLMs to external data sources.
But, the pace of innovation and new approaches makes it challenging to keep up.
We're launching a new video… pic.twitter.com/963lOnVLcP
We are releasing the Gen-2 weights.
— Cristóbal Valenzuela (@c_valenzuelab) February 6, 2024
This is a limited edition. Collect all 6,834 books to acquire the complete model. pic.twitter.com/VVVdLPWYSO
REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR
Background Removal w/ 🤗 Transformers.js
Scaling Laws for Downstream Task Performance of Large Language Models
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
Led by @GoogleDeepMind, we present ALOHA 2 🤙: An Enhanced Low-Cost Hardware for Bimanual Teleoperation.
— Tony Z. Zhao (@tonyzzhao) February 7, 2024
ALOHA 2 🤙 significantly improves the durability of the original ALOHA 🏖️, enabling fleet-scale data collection on more complex tasks.
As usual, everything is open-sourced! pic.twitter.com/5OEpO8EFrG
Fast Timing-Conditioned Latent Audio Diffusion, code
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation, code, weights, space
segmind/segmoe — Segmind Mixture of Diffusion Experts, blog
Unified Speech-Text Pretraining for Spoken Dialog Modeling
Memory Consolidation Enables Long-Context Video Understanding
> two genius tech hippies just want to make their music software run faster
— 👩💻 Paige Bailey (@DynamicWebPaige) February 9, 2024
> develop an algorithm, publish a paper
> algorithm not patented, but used in every commercial sampling synthesizer immediately after
> tech hippies run out of grant money for their lab 😅
Gossett: We… pic.twitter.com/LSz4INRKVy
CNChTu/FCPE — fast pitch estimator using Transformer
SpiRit-LM: Interleaved Spoken and Written Language Model
Multilingual E5 Text Embeddings: A Technical Report, code
Learning to Route Among Specialized Experts for Zero-Shot Generalization, code
Can "small" finetuned LLMs with less than 2B parameters outperform larger openly available LLMs (Mixtral, Llama 2 Chat) and proprietary LLMs (ChatGPT)? Here's a closer look at the Tiny Titans paper (https://t.co/WBFDJ9Q7th), where researchers tried to find the answer to this… pic.twitter.com/z6rDkBrLEj
— Sebastian Raschka (@rasbt) February 10, 2024
idT5: Indonesian Version of Multilingual T5 Transformer
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
Efficient Exploration for LLMs
Can Large Language Models Understand Context?
Long is more for alignment
— Yam Peleg (@Yampeleg) February 10, 2024
TL;DR: LIMA's paper [1] claimed that if you just train on 1000 high quality samples you will get a great model.
Well.. turns out it is even easier.
Just use the 1000 longest responses in the dataset.
You will get a surprisingly powerful model.
---… pic.twitter.com/0kaTByZ2ho
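The recipe in the thread needs no scoring model at all: sort by response length and keep the top 1,000. A sketch under the assumption that examples carry `"instruction"`/`"response"` fields (field names are mine, not the paper's):

```python
def longest_k(dataset, k=1000):
    # The thread's recipe: no quality classifier, no heuristics,
    # just keep the k examples with the longest responses.
    return sorted(dataset, key=lambda ex: len(ex["response"]), reverse=True)[:k]

# Toy dataset with strictly increasing response lengths.
data = [{"instruction": f"q{i}", "response": "x" * i} for i in range(5000)]
subset = longest_k(data, k=1000)
```

Length here acts as a cheap proxy for the "high quality" criterion LIMA curated by hand, which is what makes the result surprising.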
K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters, code
CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay
Resource List for Learning Hungarian, doc
Accelerating RNN Transducer Inference via Adaptive Expansion Search
gemelo-ai/vocos — Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
CPJKU/onset_detection — Python implementation of the most common spectral based onset detection algorithms.
A Hackers’ Guide to Language Models
veeresht/CommPy — Digital Communication with Python
RVC-Project/Retrieval-based-Voice-Conversion-WebUI
Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation, demo
Affective and Dynamic Beam Search for Story Generation
Automatic vocal tract landmark localization from midsagittal MRI data, code
lucidrains/phenaki-pytorch — Implementation of Phenaki Video, which uses Mask GIT to produce text guided videos of up to 2 minutes in length, in Pytorch
Describing Differences in Image Sets with Natural Language, code
google-research/multinerf — A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF
vasistalodagala/whisper-finetune — Fine-tune and evaluate Whisper models for Automatic Speech Recognition (ASR) on custom datasets or datasets from huggingface.
lucidrains/CALM-pytorch — Implementation of CALM from the paper “LLM Augmented LLMs: Expanding Capabilities through Composition”, out of Google Deepmind
lucidrains/llama-qrlhf — Implementation of the Llama architecture with RLHF + Q-learning
Robust Speech Recognition via Large-Scale Weak Supervision
ActiveVisionLab/Awesome-LLM-3D — Awesome-LLM-3D: a curated list of Multi-modal Large Language Model in 3D world Resources
Are Emergent Abilities of Large Language Models a Mirage?
facebookresearch/Pearl — A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.
metavoiceio/metavoice-src — Foundational model for human-like, expressive TTS
lucidrains/retro-pytorch — Implementation of RETRO, Deepmind’s Retrieval based Attention net, in Pytorch
The Illustrated Retrieval Transformer
lifeiteng/vall-e — PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
cisnlp/simalign — Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
Speech Recognition for Minority Languages Using HuBERT and Model Adaptation
Textually Pretrained Speech Language Models
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation, code
Can a 2B LLM outperform Mistral 7B or Llama 13B?
— Philipp Schmid (@_philschmid) February 7, 2024
Creators of the popular Ultrafeedback dataset released MiniCPM, a 2.4B parameter model claiming performance close to Mistral 7B, Llama 2 13B, or Falcon 40B. 🤯🤔
As part of the release, the researchers released a detailed…
MiniCPM: Unveiling the Potential of End-side Large Language Models, code
There's a strain of anti-anti-monopolist that insists that they're not *pro*-monopoly - they're just *realists* who understand that global gigacorporations are too big to fail, too big to jail, and that governments can't hope to rein them in.
— Cory Doctorow NONCONSENSUAL BLUE TICK (@doctorow) February 6, 2024
1/ pic.twitter.com/nx0lM4lKWu
Review — Flamingo: A Visual Language Model for Few-Shot Learning
Multimodal Language Models Explained: Visual Instruction Tuning
📢Mixtures of Experts unlock parameter scaling for deep RL!
— Pablo Samuel Castro (@pcastr) February 14, 2024
Adding MoEs, and in particular Soft MoEs, to value-based deep RL agents results in more parameter-scalable models.
Performance keeps increasing as we increase number of experts (green line below)!
1/9 https://t.co/SMFUrpdNXN pic.twitter.com/kb9mqfyg3m
Mixtures of Experts Unlock Parameter Scaling for Deep RL
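A Soft MoE layer replaces hard token routing with two softmaxes over the same token-slot logits: slots become convex mixtures of tokens (softmax over tokens), and each token's output becomes a convex mixture of expert outputs (softmax over slots). A minimal single-slot-per-expert sketch in NumPy (my own simplification of the layer the paper plugs into value-based RL agents):

```python
import numpy as np

def soft_moe(x, phi, experts):
    # x: (tokens, dim); phi: (dim, slots) learned slot parameters.
    logits = x @ phi
    # Dispatch weights: softmax over tokens -> each slot is a convex
    # mixture of input tokens.
    d = np.exp(logits - logits.max(axis=0))
    d = d / d.sum(axis=0)
    # Combine weights: softmax over slots -> each token's output is a
    # convex mixture of expert outputs.
    c = np.exp(logits - logits.max(axis=1, keepdims=True))
    c = c / c.sum(axis=1, keepdims=True)
    slots = d.T @ x                              # (slots, dim)
    y = np.stack([f(s) for f, s in zip(experts, slots)])
    return c @ y

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))                    # 10 tokens, dim 16
phi = rng.normal(size=(16, 4))                   # 4 slots, one per expert
experts = [lambda s, w=rng.normal(size=(16, 16)): np.tanh(s @ w)
           for _ in range(4)]
out = soft_moe(x, phi, experts)
```

Everything stays differentiable and every expert fires on every forward pass, which is why scaling the expert count scales parameters without the discrete-routing instabilities of classic MoE.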
🔥Half a year after its initial release we are upgrading self-expanding neural networks🔥
— Martin Mundt (@mundt_martin) February 12, 2024
* SENN based on full-connectivity + now with convolutions
* layer & width addition + now pruning any time during training
* jax code: https://t.co/ndnnGjBu4F https://t.co/QrYBkWyAV8
🧵 ⬇️ https://t.co/XLWcj8xoC1 pic.twitter.com/MnhObyE81B
Cohere for AI launches open source LLM for 101 languages
BAAI-DCAI/Bunny — A family of lightweight multimodal models.
theodorblackbird/lina-speech seems interesting, but it’s not open source, so I don’t care.
Large Language Models, GPT-2 — Language Models Are Unsupervised Multitask Learners
vosen/ZLUDA — CUDA on AMD GPUs