patrick-kidger/equinox — Callable PyTrees and filtered transformations for building neural networks in JAX.

patrick-kidger/diffrax — Numerical differential equation solvers in JAX. Autodifferentiable and GPU-capable.


M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation

@misc{zhao2022madapter,
  doi = {10.48550/ARXIV.2207.00952},
  url = {https://arxiv.org/abs/2207.00952},
  author = {Zhao, Jinming and Yang, Hao and Shareghi, Ehsan and Haffari, Gholamreza},
  title = {M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Code is open source; the model is not.


TRILLsson in Hugging Face Transformers


Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition

@inproceedings{9414560,
  author={Shi, Yangyang and Wang, Yongqiang and Wu, Chunyang and Yeh, Ching-Feng and Chan, Julian and Zhang, Frank and Le, Duc and Seltzer, Mike},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition},
  year={2021},
  pages={6783-6787},
  doi={10.1109/ICASSP39728.2021.9414560}
}

lumaku/ctc-segmentation — Segment an audio file and obtain utterance alignments.



How much data do you need for a good MFA alignment?

  • If you only care about alignments of the training data itself, 3-5 hours should be enough.
    • Caveat: increasing the number of speakers/varieties in the training data will likely require more data.
  • If you care about generating models for more widespread use, 8-10 hours should be enough to generalize within the same variety.
    • The more speakers the better, though more speakers will also require more data.
    • I usually recommend about 20 hours for a decently performant model.

google-research/t5x — essentially a new and improved implementation of the T5 codebase (which was originally built on Mesh TensorFlow) in JAX and Flax.

google/seqio — Task-based datasets, preprocessing, and evaluation for sequence models.


r/weirddalle


Towards End-to-end Unsupervised Speech Recognition

@misc{liu2022unsupervised,
  doi = {10.48550/ARXIV.2204.02492},
  url = {https://arxiv.org/abs/2204.02492},
  author = {Liu, Alexander H. and Hsu, Wei-Ning and Auli, Michael and Baevski, Alexei},
  title = {Towards End-to-end Unsupervised Speech Recognition},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Unified Speech-Text Pre-training for Speech Translation and Recognition

@misc{tang2022unified,
      title={Unified Speech-Text Pre-training for Speech Translation and Recognition},
      author={Yun Tang and Hongyu Gong and Ning Dong and Changhan Wang and Wei-Ning Hsu and Jiatao Gu and Alexei Baevski and Xian Li and Abdelrahman Mohamed and Michael Auli and Juan Pino},
      year={2022},
      eprint={2204.05409},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}