Interesting links, 13/07/2022
Misc. interesting things.
patrick-kidger/equinox — Callable PyTrees and filtered transforms => neural networks in JAX.
patrick-kidger/diffrax — Numerical differential equation solvers in JAX. Autodifferentiable and GPU-capable.
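Equinox's pitch is that a model is just a PyTree that happens to be callable, and its "filtered" transforms split the numeric leaves from static metadata before differentiating or compiling. A toy plain-Python sketch of that partitioning idea (illustrative only; these names are mine, not the Equinox API):

```python
# Conceptual sketch of Equinox-style "filtering": a model is one nested
# structure holding both numeric leaves and static fields, and a filter
# splits it so numeric transforms only ever see the numbers.

def partition(tree, is_leaf=lambda x: isinstance(x, (int, float))):
    """Split a nested dict into (numeric part, static part), both mirroring
    the shape of the original tree; non-matching slots become None."""
    if isinstance(tree, dict):
        num, static = {}, {}
        for k, v in tree.items():
            num[k], static[k] = partition(v, is_leaf)
        return num, static
    return (tree, None) if is_leaf(tree) else (None, tree)

model = {"weight": 0.5, "bias": -0.1, "activation": "relu"}
params, static = partition(model)
# params keeps the numbers, static keeps the "relu" string
```

In Equinox the same trick is what lets a whole model be passed straight through `jit`/`grad`-style transforms without hand-maintaining a separate parameter container.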
M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation
@misc{zhao2022madapter,
  doi = {10.48550/arXiv.2207.00952},
  url = {https://arxiv.org/abs/2207.00952},
  author = {Zhao, Jinming and Yang, Hao and Shareghi, Ehsan and Haffari, Gholamreza},
  title = {M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
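The core idea of M-Adapter is adapting speech representations toward the text domain, in part by compressing the long frame sequence toward text-like lengths. A crude illustration of the length-compression aspect only (the paper uses a Transformer-based adapter, not plain pooling; everything here is my stand-in):

```python
import numpy as np

def pool_frames(frames: np.ndarray, stride: int = 4) -> np.ndarray:
    """Strided mean-pooling: (T, d) speech features -> (ceil(T/stride), d).
    Zero-pads the tail so T need not be a multiple of the stride."""
    T, d = frames.shape
    pad = (-T) % stride
    padded = np.concatenate([frames, np.zeros((pad, d))]) if pad else frames
    return padded.reshape(-1, stride, d).mean(axis=1)

feats = np.random.randn(100, 16)   # 100 speech frames, 16-dim features
short = pool_frames(feats, stride=4)  # -> shape (25, 16)
```

The point is just the shape change: an MT decoder downstream then sees a sequence whose length is much closer to a token sequence than to a frame sequence.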
Check out our latest breakthrough in machine translation that Mark Zuckerberg just announced. We built and open sourced a state-of-the-art AI model that now translates between 200 different languages.
— AI at Meta (@AIatMeta) July 6, 2022
Code is open source, model is not
@inproceedings{shi2021emformer,
  author={Shi, Yangyang and Wang, Yongqiang and Wu, Chunyang and Yeh, Ching-Feng and Chan, Julian and Zhang, Frank and Le, Duc and Seltzer, Mike},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition},
  year={2021},
  pages={6783-6787},
  doi={10.1109/ICASSP39728.2021.9414560}
}
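Emformer's low latency comes from processing the utterance in fixed-size blocks, where each block attends to a bounded left context (plus a learned memory bank) rather than the full history. A minimal sketch of just the chunking-with-left-context part, assuming nothing about the real model:

```python
import numpy as np

def stream_chunks(feats, chunk=8, left=4):
    """Yield (left_context, block) pairs for block-wise streaming, so each
    block only sees a bounded history instead of the whole utterance."""
    for start in range(0, len(feats), chunk):
        ctx = feats[max(0, start - left):start]
        yield ctx, feats[start:start + chunk]

feats = np.arange(20).reshape(20, 1)   # 20 toy frames
blocks = list(stream_chunks(feats))    # 3 blocks of up to 8 frames each
```

Bounding the context like this is what keeps per-block compute (and hence latency) constant as the utterance grows.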
lumaku/ctc-segmentation — Segment an audio file and obtain utterance alignments.
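CTC segmentation locates each utterance of a known transcript in the audio via dynamic programming over per-frame label probabilities. A heavily simplified monotonic-alignment sketch of that DP flavor (no blank token, and nothing from the library's actual API):

```python
import numpy as np

def align(logprobs, tokens):
    """Monotonic forced alignment by dynamic programming: at each frame we
    either stay on the current token or advance to the next one, maximizing
    total log-probability. Returns the per-frame token index."""
    T, _ = logprobs.shape
    N = len(tokens)
    dp = np.full((T, N), -np.inf)
    dp[0, 0] = logprobs[0, tokens[0]]
    for t in range(1, T):
        for n in range(N):
            best = dp[t - 1, n]
            if n > 0:
                best = max(best, dp[t - 1, n - 1])
            dp[t, n] = best + logprobs[t, tokens[n]]
    # Backtrack from the last token at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        n = path[-1]
        path.append(n - 1 if n > 0 and dp[t - 1, n - 1] > dp[t - 1, n] else n)
    return path[::-1]

# Frames 0-2 favor token 0, frames 3-5 favor token 1:
probs = np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
path = align(np.log(probs), [0, 1])   # -> [0, 0, 0, 1, 1, 1]
```

The real library additionally handles CTC's blank symbol and derives utterance-level start/end timestamps with confidence scores from this kind of trellis.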
This past week I spent some time learning about SentenceTransformers (https://t.co/5ZAV7lJq7u), and I'm pretty blown away by what sentence embeddings can be used for.
— Nima Boscarino (@NimaBoscarino) June 10, 2022
If you're curious to see what researchers have been getting up to with it, here's a 🧵 with some highlights:
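Most of those applications reduce to embedding sentences and comparing the vectors with cosine similarity. A sketch with hand-made stand-in vectors (real embeddings would of course come from a SentenceTransformers model, not be written by hand):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim "embeddings" standing in for model outputs.
cat    = np.array([0.9, 0.2, 0.0])
kitten = np.array([0.8, 0.3, 0.1])
car    = np.array([0.0, 0.1, 0.9])

# Semantically close sentences should score higher than unrelated ones.
```

Semantic search, clustering, deduplication, and paraphrase mining all follow from ranking pairs by this score.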
How much data do you need for a good MFA alignment?
- If you only care about alignments of the training data, 3-5 hours should be enough.
- Caveat: increasing the number of speakers/varieties in the training data will likely require more of it.
- If you care about generating models for more widespread use, 8-10 hours should be enough to generalize to the same variety.
- The more speakers the better, but more speakers will also require more data.
- I usually recommend about 20 hours for a decently performant model.
google-research/t5x — essentially a new and improved implementation of the T5 codebase (which was itself based on Mesh TensorFlow), rewritten in JAX and Flax.
google/seqio — Task-based datasets, preprocessing, and evaluation for sequence models.
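SeqIO's central abstraction is a named task that bundles a data source with a chain of preprocessors, looked up by name at training time. A plain-Python sketch of that registry pattern (the real library is built on tf.data; all names here are illustrative, not the seqio API):

```python
# Minimal task registry: a task couples a data source (a callable yielding
# examples) with preprocessors applied in order.

TASKS = {}

def register(name, source, preprocessors):
    """Register a task under a name for later lookup."""
    TASKS[name] = (source, preprocessors)

def get_dataset(name):
    """Stream the named task's examples through its preprocessor chain."""
    source, preprocessors = TASKS[name]
    for example in source():
        for fn in preprocessors:
            example = fn(example)
        yield example

register(
    "toy_translate",
    lambda: [{"src": "hallo welt", "tgt": "hello world"}],
    [lambda ex: {"inputs": "translate: " + ex["src"], "targets": ex["tgt"]}],
)
data = list(get_dataset("toy_translate"))
```

The payoff of the pattern is that training code only ever asks for a task by name, while data sourcing, preprocessing, and evaluation stay defined in one place.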
Towards End-to-end Unsupervised Speech Recognition
@misc{liu2022towards,
  doi = {10.48550/arXiv.2204.02492},
  url = {https://arxiv.org/abs/2204.02492},
  author = {Liu, Alexander H. and Hsu, Wei-Ning and Auli, Michael and Baevski, Alexei},
  title = {Towards End-to-end Unsupervised Speech Recognition},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
Unified Speech-Text Pre-training for Speech Translation and Recognition
@misc{tang2022unified,
  title={Unified Speech-Text Pre-training for Speech Translation and Recognition},
  author={Yun Tang and Hongyu Gong and Ning Dong and Changhan Wang and Wei-Ning Hsu and Jiatao Gu and Alexei Baevski and Xian Li and Abdelrahman Mohamed and Michael Auli and Juan Pino},
  year={2022},
  eprint={2204.05409},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}