Getting timestamps on long audio
With wav2vec2 and HuggingFace transformers
First, an audio sample. I'm using this video from YouTube. YouTube lists it as 11 minutes, 51 seconds, which should be long enough to check that striding works.
!pip install youtube-dl
!youtube-dl -x --audio-format best -o '%(id)s.%(ext)s' 'https://www.youtube.com/watch?v=Kw5jkyLGFGc'
!ffmpeg -i Kw5jkyLGFGc.m4a -acodec pcm_s16le -ac 1 -ar 16000 Kw5jkyLGFGc.wav
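Before running the model, it can help to confirm that the conversion produced what we expect: 16 kHz, mono, and roughly the length YouTube reports. The following is a minimal sketch using only the standard library; `wav_info` is a helper name made up for this example.

```python
import wave


def wav_info(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM WAV file.

    Hypothetical helper for this notebook, not part of any library.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return frames / rate, rate, wav.getnchannels()
```

Calling `wav_info("Kw5jkyLGFGc.wav")` should give a duration of around 711 seconds (11 minutes, 51 seconds), a sample rate of 16000, and 1 channel if the ffmpeg step worked as intended.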
Here starts the actual ASR stuff.
!pip install transformers
from transformers import pipeline

_SWE_MODEL = "KBLab/wav2vec2-large-voxrex-swedish"
# The task (automatic speech recognition) is inferred from the model.
pipe = pipeline(model=_SWE_MODEL)
For working with strides, there's information in a blog post.
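To get a feel for what striding does, here is a simplified sketch of how a long file could be cut into overlapping windows. It is an illustration only, not the library's actual code: `chunk_windows` is a made-up helper, and the default stride of one sixth of the chunk length is an assumption taken from the documented default of the pipeline's `stride_length_s` argument.

```python
def chunk_windows(total_s, chunk_length_s=10.0, stride_length_s=None):
    """Return (start, end) windows in seconds that overlap by the stride.

    Simplified illustration; the real pipeline works on samples, not seconds.
    """
    if stride_length_s is None:
        # Assumption: mirror the documented default, chunk_length_s / 6.
        stride_length_s = chunk_length_s / 6
    # Each window advances by the chunk length minus a stride on each side.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be more than twice stride_length_s")
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# 30 seconds of audio, 10-second chunks, 2-second stride on each side.
for window in chunk_windows(30, chunk_length_s=10, stride_length_s=2):
    print(window)
```

Neighbouring windows share some audio, so the model has context on both sides of each cut, and the overlapping edges can be discarded when the pieces are joined.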
There isn't much information on getting timestamps from a pipeline, but the detail is in the pull request.
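# Plain transcription, processed in 10-second chunks.
output = pipe("/content/Kw5jkyLGFGc.wav", chunk_length_s=10)
# Same call, but also return a timestamp for each word.
output = pipe("/content/Kw5jkyLGFGc.wav", chunk_length_s=10, return_timestamps="word")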
import json
with open("/content/Kw5jkyLGFGc.json", "w") as f:
    json.dump(output, f)
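With `return_timestamps="word"`, the result should be a dict with the full `"text"` plus a `"chunks"` list, where each entry has a `"text"` and a `"timestamp"` pair of start and end times in seconds. A minimal sketch for turning that into readable lines follows; `format_words` is a made-up helper, and the sample data is invented for illustration rather than taken from the real transcript.

```python
def format_words(result):
    """Format each word as 'start end text', one line per word."""
    lines = []
    for chunk in result.get("chunks", []):
        # After a JSON round trip the timestamp is a list rather than a tuple;
        # unpacking works for both.
        start, end = chunk["timestamp"]
        lines.append(f"{start:7.2f} {end:7.2f}  {chunk['text']}")
    return lines


# Hypothetical sample in the same shape as the pipeline output.
sample = {
    "text": "hej världen",
    "chunks": [
        {"text": "hej", "timestamp": [0.5, 0.82]},
        {"text": "världen", "timestamp": [0.9, 1.4]},
    ],
}

for line in format_words(sample):
    print(line)
```

The same function can be applied to the saved file by loading it with `json.load` first.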