Interesting links, 01/02/2024
Misc. interesting things.
VHS-Decode has to be one of the most interesting software projects I've stumbled upon recently.
— LaurieWired (@lauriewired) January 31, 2024
It replaces the decoding process of a VHS tape with a software stack, bypassing most of the original hardware.
Using an FPGA device for RF capture like the MiSTer, it creates a… pic.twitter.com/9c2R2Xhkyf
How can we get LLM-based agents to understand the *visual structure* of a webpage? Announcing Llama2D🦙👀!
— Rohan Pandey (@khoomeik) February 2, 2024
We fine-tuned Llama on OCR'd webpage screenshots but with 2D positional embeddings, enabling it to see the structure of a webpage rather than just a sequence of tokens. 🧵 pic.twitter.com/Rz2JocZyOq
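The core trick is to inject each OCR token's screen position into its embedding. A minimal sketch of the general idea, using standard sinusoidal embeddings with half the dimensions for x and half for y (the function name and scheme are my assumptions, not Llama2D's exact recipe):

```python
import numpy as np

def embed_with_2d_positions(token_embs, xy, dim):
    # Sinusoidal embeddings for the x coordinate fill the first half of the
    # dims, y fills the second half; the sum is added to each token embedding.
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half // 2) / (half // 2)))

    def sincos(pos):
        ang = pos[:, None] * freqs[None, :]
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

    pos_emb = np.concatenate([sincos(xy[:, 0]), sincos(xy[:, 1])], axis=1)
    return token_embs + pos_emb

# Toy example: 6 OCR tokens laid out in a 2x3 grid, embedding dim 8.
tokens = np.zeros((6, 8))
coords = np.array([[0, 0], [120, 0], [0, 40], [120, 40], [0, 80], [120, 80]],
                  dtype=float)
out = embed_with_2d_positions(tokens, coords, dim=8)
```

Unlike 1D positions, two tokens on the same row (or column) now share part of their positional signal, which is exactly the layout structure a flat token sequence throws away.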
BlackMamba Mixture of Experts
— Carlos E. Perez (@IntuitMachine) February 3, 2024
BlackMamba is a novel architecture which combines state-space models (SSMs) with mixture of experts (MoE). It uses Mamba as its SSM block and a Switch Transformer as its MoE block base. BlackMamba is extremely low latency for generation and… pic.twitter.com/ojhmAKfsUK
MatFormer: Nested Transformer for Elastic Inference
What Do Self-Supervised Speech Models Know About Words?
Progress on dense retrievers is saturating.
— Omar Khattab (@lateinteraction) December 18, 2023
The best retrievers in 2024 will apply new forms of late interaction, i.e. scalable attention-like scoring for multi-vector embeddings.
A🧵on late interaction, how it works efficiently, and why/where it's been shown to improve quality pic.twitter.com/2XG33TtM9R
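The thread's "scalable attention-like scoring for multi-vector embeddings" is, in its best-known form, ColBERT-style MaxSim: each query token embedding picks its best-matching document token embedding, and those maxima are summed. A minimal sketch with toy vectors:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    # Normalize so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                       # (query tokens, doc tokens)
    # Late interaction: each query token keeps its best doc-token match.
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))              # 3 query token embeddings, dim 4
doc_a = rng.normal(size=(5, 4))          # random document
doc_b = np.vstack([q, rng.normal(size=(2, 4))])  # document containing the query tokens
better = maxsim_score(q, doc_b) > maxsim_score(q, doc_a)
```

Because the per-token similarities reduce to a max and a sum rather than full cross-attention, the document-side embeddings can be precomputed and indexed, which is what makes this scalable.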
Is it possible to teach LLMs a different language? 🤔 Can we transfer the capabilities of LLMs, like Llama, from English to non-English language?
— Philipp Schmid (@_philschmid) January 4, 2024
A group of researchers from Fudan University tried to answer those questions by running vast experiments on extending vocabulary… pic.twitter.com/fJLYFyQOqP
RAIVNLab/MatFormer-OLMo — Code repository for the public reproduction of the language modelling experiments on “MatFormer: Nested Transformer for Elastic Inference”
arcee-ai/mergekit — Tools for merging pretrained large language models.
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
E-Branchformer: Branchformer with Enhanced merging for speech recognition
PAM: Prompting Audio-Language Models for Audio Quality Assessment, no code yet
ChatQA: Building GPT-4 Level Conversational QA Models
3 Advanced Document Retrieval Techniques To Improve RAG Systems
Efficiently Modeling Long Sequences with Structured State Spaces, code
Alignment-Length Synchronous Decoding for RNN Transducer
Lingit pronunciation lexicon for Nynorsk

LIA Sápmi – the LIA corpus for Sámi dialects
collabora/WhisperSpeech — An Open Source text-to-speech system built by inverting Whisper. Space
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
SentenceTransformer: A Model For Computing Sentence Embedding
CLAP: Learning Audio Concepts from Natural Language Supervision, code
MambaByte: Token-free Selective State Space Model
kyegomez/MambaByte — Implementation of MambaByte in “MambaByte: Token-free Selective State Space Model” in Pytorch and Zeta
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model, code
Matryoshka Representation Learning, code
It’s year 2024, and n-gram LMs are making a comeback!!
— Jiacheng Liu (@liujc1998) February 1, 2024
We develop infini-gram, an engine that efficiently processes n-gram queries with unbounded n and trillion-token corpora. It takes merely 20 milliseconds to count the frequency of an arbitrarily long n-gram in RedPajama (1.4T… pic.twitter.com/07O1o5pahv
BlackMamba: Mixture of Experts for State-Space Models, code
V-IRL: Grounding Virtual Intelligence in Real Life
MM-LLMs: Recent Advances in MultiModal Large Language Models
Training-Free Consistent Text-to-Image Generation
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval, no code yet
Compressing Transformer-based self-supervised models for speech processing
AdANNS: A Framework for Adaptive Semantic Search, code
SGI's 3D File System Navigator (1993) was real pic.twitter.com/UWbx3PS3Kk
— Retro Tech Dreams (@RetroTechDreams) February 6, 2024
Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience
Self-Discover: Large Language Models Self-Compose Reasoning Structures
RAG From Scratch: Video series focused on understanding the RAG landscape
— LangChain (@LangChainAI) February 6, 2024
RAG is central for LLM application development, connecting LLMs to external data sources.
But, the pace of innovation and new approaches makes it challenging to keep up.
We're launching a new video… pic.twitter.com/963lOnVLcP
We are releasing the Gen-2 weights.
— Cristóbal Valenzuela (@c_valenzuelab) February 6, 2024
This is a limited edition. Collect all 6,834 books to acquire the complete model. pic.twitter.com/VVVdLPWYSO
REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR
Background Removal w/ 🤗 Transformers.js
Scaling Laws for Downstream Task Performance of Large Language Models
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
Led by @GoogleDeepMind, we present ALOHA 2 🤙: An Enhanced Low-Cost Hardware for Bimanual Teleoperation.
— Tony Z. Zhao (@tonyzzhao) February 7, 2024
ALOHA 2 🤙 significantly improves the durability of the original ALOHA 🏖️, enabling fleet-scale data collection on more complex tasks.
As usual, everything is open-sourced! pic.twitter.com/5OEpO8EFrG
Fast Timing-Conditioned Latent Audio Diffusion, code
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation, code, weights, space
segmind/segmoe — Segmind Mixture of Diffusion Experts, blog
Unified Speech-Text Pretraining for Spoken Dialog Modeling
Memory Consolidation Enables Long-Context Video Understanding
> two genius tech hippies just want to make their music software run faster
— 👩💻 Paige Bailey (@DynamicWebPaige) February 9, 2024
> develop an algorithm, publish a paper
> algorithm not patented, but used in every commercial sampling synthesizer immediately after
> tech hippies run out of grant money for their lab 😅
Gossett: We… pic.twitter.com/LSz4INRKVy
CNChTu/FCPE — fast pitch estimator using Transformer
SpiRit-LM: Interleaved Spoken and Written Language Model
Multilingual E5 Text Embeddings: A Technical Report, code
Learning to Route Among Specialized Experts for Zero-Shot Generalization, code
Can "small" finetuned LLMs with less than 2B parameters outperform larger openly available LLMs (Mixtral, Llama 2 Chat) and proprietary LLMs (ChatGPT)? Here's a closer look at the Tiny Titans paper (https://t.co/WBFDJ9Q7th), where researchers tried to find the answer to this… pic.twitter.com/z6rDkBrLEj
— Sebastian Raschka (@rasbt) February 10, 2024
idT5: Indonesian Version of Multilingual T5 Transformer
mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
Efficient Exploration for LLMs
Can Large Language Models Understand Context?
Long is more for alignment
— Yam Peleg (@Yampeleg) February 10, 2024
TL;DR: LIMA's paper [1] claimed that if you just train on 1000 high quality samples you will get a great model.
Well.. turns out it is even easier.
Just use the 1000 longest responses in the dataset.
You will get a surprisingly powerful model.
---… pic.twitter.com/0kaTByZ2ho
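The recipe in the thread needs no scoring model at all: sort by response length and keep the top 1,000. A sketch under the assumption that examples carry `"instruction"`/`"response"` fields (field names are mine, not the paper's):

```python
def longest_k(dataset, k=1000):
    # The thread's recipe: no quality classifier, no heuristics,
    # just keep the k examples with the longest responses.
    return sorted(dataset, key=lambda ex: len(ex["response"]), reverse=True)[:k]

# Toy dataset with strictly increasing response lengths.
data = [{"instruction": f"q{i}", "response": "x" * i} for i in range(5000)]
subset = longest_k(data, k=1000)
```

Length here acts as a cheap proxy for the "high quality" criterion LIMA curated by hand, which is what makes the result surprising.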
K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters, code
CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay
Resource List for Learning Hungarian, doc
Accelerating RNN Transducer Inference via Adaptive Expansion Search
gemelo-ai/vocos — Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
CPJKU/onset_detection — Python implementation of the most common spectral based onset detection algorithms.
A Hackers’ Guide to Language Models
veeresht/CommPy — Digital Communication with Python
RVC-Project/Retrieval-based-Voice-Conversion-WebUI
Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation, demo
Affective and Dynamic Beam Search for Story Generation
Automatic vocal tract landmark localization from midsagittal MRI data, code
lucidrains/phenaki-pytorch — Implementation of Phenaki Video, which uses Mask GIT to produce text guided videos of up to 2 minutes in length, in Pytorch
Describing Differences in Image Sets with Natural Language, code
google-research/multinerf — A Code Release for Mip-NeRF 360, Ref-NeRF, and RawNeRF
vasistalodagala/whisper-finetune — Fine-tune and evaluate Whisper models for Automatic Speech Recognition (ASR) on custom datasets or datasets from huggingface.
lucidrains/CALM-pytorch — Implementation of CALM from the paper “LLM Augmented LLMs: Expanding Capabilities through Composition”, out of Google Deepmind
lucidrains/llama-qrlhf — Implementation of the Llama architecture with RLHF + Q-learning
Robust Speech Recognition via Large-Scale Weak Supervision
ActiveVisionLab/Awesome-LLM-3D — Awesome-LLM-3D: a curated list of Multi-modal Large Language Model in 3D world Resources
Are Emergent Abilities of Large Language Models a Mirage?
facebookresearch/Pearl — A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.
metavoiceio/metavoice-src — Foundational model for human-like, expressive TTS
lucidrains/retro-pytorch — Implementation of RETRO, Deepmind’s Retrieval based Attention net, in Pytorch
The Illustrated Retrieval Transformer
lifeiteng/vall-e — PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
cisnlp/simalign — Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
Speech Recognition for Minority Languages Using HuBERT and Model Adaptation
Textually Pretrained Speech Language Models
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation, code
Can a 2B LLM outperform Mistral 7B or Llama 13B?
— Philipp Schmid (@_philschmid) February 7, 2024
Creators of the popular Ultrafeedback dataset released MiniCPM, a 2.4B parameter model claiming performance close to Mistral 7B, Llama 2 13B, or Falcon 40B. 🤯🤔
As part of the release, the researchers released a detailed…
MiniCPM: Unveiling the Potential of End-side Large Language Models, code
There's a strain of anti-anti-monopolist that insists that they're not *pro*-monopoly - they're just *realists* who understand that global gigacorporations are too big to fail, too big to jail, and that governments can't hope to rein them in.
— Cory Doctorow NONCONSENSUAL BLUE TICK (@doctorow) February 6, 2024
1/ pic.twitter.com/nx0lM4lKWu
Review — Flamingo: A Visual Language Model for Few-Shot Learning
Multimodal Language Models Explained: Visual Instruction Tuning
📢Mixtures of Experts unlock parameter scaling for deep RL!
— Pablo Samuel Castro (@pcastr) February 14, 2024
Adding MoEs, and in particular Soft MoEs, to value-based deep RL agents results in more parameter-scalable models.
Performance keeps increasing as we increase number of experts (green line below)!
1/9 https://t.co/SMFUrpdNXN pic.twitter.com/kb9mqfyg3m
Mixtures of Experts Unlock Parameter Scaling for Deep RL
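A Soft MoE layer replaces hard token routing with two softmaxes over the same token-slot logits: slots become convex mixtures of tokens (softmax over tokens), and each token's output becomes a convex mixture of expert outputs (softmax over slots). A minimal single-slot-per-expert sketch in NumPy (my own simplification of the layer the paper plugs into value-based RL agents):

```python
import numpy as np

def soft_moe(x, phi, experts):
    # x: (tokens, dim); phi: (dim, slots) learned slot parameters.
    logits = x @ phi
    # Dispatch weights: softmax over tokens -> each slot is a convex
    # mixture of input tokens.
    d = np.exp(logits - logits.max(axis=0))
    d = d / d.sum(axis=0)
    # Combine weights: softmax over slots -> each token's output is a
    # convex mixture of expert outputs.
    c = np.exp(logits - logits.max(axis=1, keepdims=True))
    c = c / c.sum(axis=1, keepdims=True)
    slots = d.T @ x                              # (slots, dim)
    y = np.stack([f(s) for f, s in zip(experts, slots)])
    return c @ y

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))                    # 10 tokens, dim 16
phi = rng.normal(size=(16, 4))                   # 4 slots, one per expert
experts = [lambda s, w=rng.normal(size=(16, 16)): np.tanh(s @ w)
           for _ in range(4)]
out = soft_moe(x, phi, experts)
```

Everything stays differentiable and every expert fires on every forward pass, which is why scaling the expert count scales parameters without the discrete-routing instabilities of classic MoE.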
🔥Half a year after its initial release we are upgrading self-expanding neural networks🔥
— Martin Mundt (@mundt_martin) February 12, 2024
* SENN based on full-connectivity + now with convolutions
* layer & width addition + now pruning any time during training
* jax code: https://t.co/ndnnGjBu4F https://t.co/QrYBkWyAV8
🧵 ⬇️ https://t.co/XLWcj8xoC1 pic.twitter.com/MnhObyE81B
Cohere for AI launches open source LLM for 101 languages
BAAI-DCAI/Bunny — A family of lightweight multimodal models.
theodorblackbird/lina-speech seems interesting, but it’s not open source, so I don’t care.
Large Language Models, GPT-2 — Language Models Are Unsupervised Multitask Learners
vosen/ZLUDA — CUDA on AMD GPUs