patrick-kidger/equinox — Callable PyTrees and filtered transformations for building neural networks in JAX.

patrick-kidger/diffrax — Numerical differential equation solvers in JAX. Autodifferentiable and GPU-capable.


M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation

@misc{zhao2022madapter,
  doi = {10.48550/ARXIV.2207.00952},
  url = {https://arxiv.org/abs/2207.00952},
  author = {Zhao, Jinming and Yang, Hao and Shareghi, Ehsan and Haffari, Gholamreza},
  title = {M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Code is open source; the model is not.


TRILLsson in Hugging Face Transformers


Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition

@inproceedings{9414560,
  author={Shi, Yangyang and Wang, Yongqiang and Wu, Chunyang and Yeh, Ching-Feng and Chan, Julian and Zhang, Frank and Le, Duc and Seltzer, Mike},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition},
  year={2021},
  pages={6783-6787},
  doi={10.1109/ICASSP39728.2021.9414560}
}

lumaku/ctc-segmentation — Segment an audio file and obtain utterance alignments.



How much data do you need for a good MFA alignment?

  • If you only care about alignments of the training data itself, 3-5 hours should be enough.
    • Caveat: increasing the number of speakers/varieties in the training data will likely require more data.
  • If you care about generating models for more widespread use, 8-10 hours should be enough to generalize within the same variety.
    • The more speakers the better, though more speakers will also require more data.
    • I usually recommend about 20 hours for a decently performant model.

google-research/t5x — essentially a new and improved implementation of the T5 codebase (which was originally built on Mesh TensorFlow) in JAX and Flax.

google/seqio — Task-based datasets, preprocessing, and evaluation for sequence models.


r/weirddalle


Towards End-to-end Unsupervised Speech Recognition

@misc{liu2022unsupervised,
  doi = {10.48550/ARXIV.2204.02492},
  url = {https://arxiv.org/abs/2204.02492},
  author = {Liu, Alexander H. and Hsu, Wei-Ning and Auli, Michael and Baevski, Alexei},
  title = {Towards End-to-end Unsupervised Speech Recognition},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Unified Speech-Text Pre-training for Speech Translation and Recognition

@misc{tang2022unified,
      title={Unified Speech-Text Pre-training for Speech Translation and Recognition},
      author={Yun Tang and Hongyu Gong and Ning Dong and Changhan Wang and Wei-Ning Hsu and Jiatao Gu and Alexei Baevski and Xian Li and Abdelrahman Mohamed and Michael Auli and Juan Pino},
      year={2022},
      eprint={2204.05409},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}