Phonetic transcription with HuggingFace
wav2vec2 espeak phonetic model
Based on an earlier notebook
%%capture
!pip install youtube-dl
!pip install phonemizer
%%capture
!apt install espeak-ng
!youtube-dl -x --audio-format best -o '%(id)s.%(ext)s' 'https://www.youtube.com/watch?v=Kw5jkyLGFGc'
%%capture
!ffmpeg -i Kw5jkyLGFGc.m4a -acodec pcm_s16le -ac 1 -ar 16000 Kw5jkyLGFGc.wav
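Before transcribing, it can be worth confirming that the converted file really has the format requested from ffmpeg above (mono, 16 kHz, 16-bit PCM). The following is a minimal sketch using only the standard library `wave` module; the helper names `wav_format` and `is_16k_mono_pcm16` are made up for illustration and are not part of the original notebook.

```python
import os
import wave


def wav_format(path):
    """Return (channels, sample_rate_hz, sample_width_bytes) for a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth()


def is_16k_mono_pcm16(path):
    """True if the file matches -ac 1 -ar 16000 -acodec pcm_s16le."""
    channels, rate, width = wav_format(path)
    return channels == 1 and rate == 16000 and width == 2


# Only runs the check if the converted file is present in the working directory.
path = "Kw5jkyLGFGc.wav"
if os.path.exists(path):
    print(wav_format(path), is_16k_mono_pcm16(path))
```

If the check prints `False`, the conversion step is the first place to look.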
The actual ASR step starts here: load the phonetic model and transcribe the converted audio.
%%capture
!pip install transformers
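# Multilingual wav2vec2 checkpoint fine-tuned to emit espeak phoneme labels
_SWE_MODEL = "facebook/wav2vec2-lv-60-espeak-cv-ft"
from transformers import pipeline
# device=0 selects the first GPU
pipe = pipeline(model=_SWE_MODEL, device=0)
# Transcribe in 10-second chunks and keep a timestamp for each output character
output = pipe("/content/Kw5jkyLGFGc.wav", chunk_length_s=10, return_timestamps="char")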
import json
# Save the transcription and its timestamps for later processing
with open("/content/Kw5jkyLGFGc.json", "w") as f:
    json.dump(output, f)
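The saved output can then be turned into a simple time-aligned table. The sketch below assumes the pipeline result is a dict with a `"chunks"` list whose items each carry a `"text"` string and a `"timestamp"` pair of start and end times in seconds; the exact contents of each chunk for this model were not checked here, so treat that shape as an assumption. The helper `chunks_to_rows` and the sample data are hypothetical.

```python
def chunks_to_rows(output):
    """Flatten pipeline output into (start, end, text) rows, skipping blank chunks."""
    rows = []
    for chunk in output.get("chunks", []):
        text = chunk["text"]
        if not text.strip():
            continue  # ignore whitespace-only entries
        start, end = chunk["timestamp"]  # a list after a JSON round trip
        rows.append((start, end, text))
    return rows


# Hypothetical sample in the assumed shape, not real model output.
sample = {
    "text": "h ɛ",
    "chunks": [
        {"text": "h", "timestamp": [0.1, 0.14]},
        {"text": " ", "timestamp": [0.14, 0.16]},
        {"text": "ɛ", "timestamp": [0.2, 0.26]},
    ],
}

rows = chunks_to_rows(sample)
for start, end, text in rows:
    print(f"{start:.2f}\t{end:.2f}\t{text}")
```

To run it on the real transcription, load `/content/Kw5jkyLGFGc.json` with `json.load` and pass the result to `chunks_to_rows`.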