First, we need an audio sample. This one comes from a YouTube video that YouTube lists as 11 minutes, 51 seconds long, which should be enough to check that striding works.

!pip install youtube-dl

!youtube-dl -x --audio-format best -o '%(id)s.%(ext)s' 'https://www.youtube.com/watch?v=Kw5jkyLGFGc'

!ffmpeg -i Kw5jkyLGFGc.m4a -acodec pcm_s16le -ac 1 -ar 16000 Kw5jkyLGFGc.wav
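
Before feeding the file to the model, it can be worth checking that the conversion produced what was asked for (mono, 16 kHz, 16-bit) and that the duration roughly matches what YouTube reports. A minimal sketch using only the standard library's `wave` module; the helper name `wav_info` is made up for this notebook.

```python
import wave


def wav_info(path):
    """Return basic format information for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            # Number of frames divided by frames per second gives seconds.
            "duration_s": wav.getnframes() / rate,
        }


# Usage, assuming the ffmpeg cell above has been run:
# print(wav_info("Kw5jkyLGFGc.wav"))
```

For the file above I would expect 1 channel, a sample rate of 16000, a sample width of 2 bytes, and a duration of roughly 711 seconds.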

The actual ASR work starts here.

!pip install transformers

_SWE_MODEL = "KBLab/wav2vec2-large-voxrex-swedish"

from transformers import pipeline

pipe = pipeline(model=_SWE_MODEL)

For working with strides, there's information in a blog post.
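
To get a feel for what chunking with strides does, here is a rough sketch of how a long recording could be cut into overlapping windows. This is a simplified model written for this notebook, not the library's actual implementation: `chunk_windows` is a made-up helper, and it assumes the same stride on both sides of every chunk. As far as I know, the pipeline's `stride_length_s` parameter defaults to `chunk_length_s / 6`, so the sketch uses that default too.

```python
def chunk_windows(total_s, chunk_length_s=10.0, stride_length_s=None):
    """List (start, end, left_stride, right_stride) windows, in seconds.

    Simplified model: each window overlaps its neighbours by the stride,
    and the stride region at each edge is treated as context only.
    """
    if stride_length_s is None:
        # Assumption: mirrors the pipeline default of chunk_length_s / 6.
        stride_length_s = chunk_length_s / 6
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be more than twice stride_length_s")

    windows = []
    start = 0.0
    while True:
        is_last = start + chunk_length_s >= total_s
        end = min(start + chunk_length_s, total_s)
        # The first window has no left context, the last has no right context.
        left = 0.0 if start == 0 else stride_length_s
        right = 0.0 if is_last else stride_length_s
        windows.append((start, end, left, right))
        if is_last:
            break
        start += step
    return windows


# A 30 second clip with 10 second chunks and a 2 second stride.
for window in chunk_windows(30.0, chunk_length_s=10.0, stride_length_s=2.0):
    print(window)
```

The part of each window that is kept runs from `start + left_stride` to `end - right_stride`, and these kept parts line up end to end, which is why the overlapping chunks can be stitched back into one transcript. With the real pipeline, the stride is set by passing `stride_length_s` next to `chunk_length_s` in the call.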

There isn't much information on getting timestamps from a pipeline, but the detail is in the pull request.

output = pipe("/content/Kw5jkyLGFGc.wav", chunk_length_s=10)

output = pipe("/content/Kw5jkyLGFGc.wav", chunk_length_s=10, return_timestamps="word")

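import json

# ensure_ascii=False keeps Swedish characters (å, ä, ö) readable in the file.
with open("/content/Kw5jkyLGFGc.json", "w", encoding="utf-8") as f:
    json.dump(output, f, ensure_ascii=False)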
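
Once the word timestamps are saved, they can be grouped into longer segments, for example to make subtitles. The sketch below assumes the output has the shape `{"text": ..., "chunks": [{"text": word, "timestamp": [start, end]}, ...]}`, which is what I would expect from `return_timestamps="word"`; check the saved file if in doubt. The helper `group_words`, the `max_gap_s` threshold, and the sample words are all made up for illustration.

```python
def group_words(chunks, max_gap_s=0.5):
    """Merge consecutive words into segments, splitting on long pauses."""
    segments = []
    current = None
    for chunk in chunks:
        start, end = chunk["timestamp"]
        word = chunk["text"]
        if current is not None and start - current["end"] <= max_gap_s:
            # Short pause: the word continues the current segment.
            current["text"] += " " + word
            current["end"] = end
        else:
            # First word, or a long pause: start a new segment.
            current = {"start": start, "end": end, "text": word}
            segments.append(current)
    return segments


# Made-up sample in the assumed output shape (not real model output).
sample_chunks = [
    {"text": "HEJ", "timestamp": [0.5, 0.8]},
    {"text": "OCH", "timestamp": [0.9, 1.1]},
    {"text": "VÄLKOMNA", "timestamp": [1.2, 1.9]},
    {"text": "IDAG", "timestamp": [3.0, 3.4]},
]
for segment in group_words(sample_chunks):
    print(segment)
```

For the real data, load the JSON file back with `json.load` and pass its `"chunks"` list to `group_words`. JSON stores the timestamp tuples as lists, which the unpacking above handles either way.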