Interesting links, 25/11/2021

Todo

ELS-RD/transformer-deploy — Deploy optimized transformer based models in production

davidbrochart/nbterm

Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers, fairseq, Facebook AI blog

facebookresearch/covost

CoVoST 2 and Massively Multilingual Speech-to-Text Translation

Pygments lexer

jusText 3 — jusText is a tool for removing boilerplate content

Onion — onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts.

rsling/texrex — texrex web page cleaning & ClaraX random walk crawler

Common Crawled web corpora

Representation Learning with Contrastive Predictive Coding, facebookresearch/CPC_audio

bshall/VectorQuantizedCPC

menelik3/cmudict-ipa — The CMU Pronouncing Dictionary converted to IPA

A cross-linguistic database of phonetic transcription systems

glottobank/potential-of-cognate-detection — Source code and data accompanying the paper “The Potential of Automatic Word Comparison for Historical Linguistics”

glottobank/tukano — Repository for computer-guided reconstruction with Jena wordlist standard for Tukano language data

Cpc vox populi #965

flashlight/flashlight/app/asr/tools/alignment

wav2letter/recipes/lexicon_free

CMU Advanced NLP 2021 Prompting + Sequence-to-sequence Pre-training

ming024/FastSpeech2 — An implementation of Microsoft’s “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech”

[Phrase Retrieval and Beyond, Princeton NLP Group](https://princeton-nlp.github.io/phrase-retrieval-and-beyond/)

princeton-nlp/PURE: A Frustratingly Easy Approach for Entity and Relation Extraction

princeton-nlp/LM-BFF: Better Few-shot Fine-tuning of Language Models

Docusaurus

camelot-dev/camelot — A Python library to extract tabular data from PDFs

neural-network-and-data-loading.ipynb

jina-ai/finetuner — Finetuning any DNN for better embedding on neural search tasks

ddbourgin/numpy-ml

jina-ai/jina — Cloud-native neural search framework for 𝙖𝙣𝙮 kind of data

kaldialign/setup.py

nnmnkwii_gallery/01-DNN-based statistical speech synthesis (en).ipynb

Character-level Convolutional Networks for Text Classification

toganlabs/seanchlo_keyboard/

Todo

Die araner mundart/Lautlehre

Die araner mundart/Wörterbuch/æ ȧ – Wikisource

L’Accent dans le gaëlique du Munster - Wikisource

patrickvonplaten/Wav2Vec2_PyCTCDecode

kensho-technologies/pyctcdecode

What’s New in v3.2

kaldi/run_segmentation_long_utts.sh

kaldi/egs/wsj/s5/steps/cleanup

kaldi/clean_and_segment_data.sh

kaldi/decode_segmentation.sh

Paracrawl

[OSCAR 21.09, OSCAR](https://oscar-corpus.com/post/oscar-v21-09/)

kaldialign/calign.pxd

ga-conj-1a