First, an audio sample. I'm using this video from YouTube. YouTube lists it as 11 minutes, 51 seconds long, which is many times the 10-second chunk length used below, so that should be enough to check that striding works.

!pip install youtube-dl
!youtube-dl -x --audio-format best -o '%(id)s.%(ext)s' https://www.youtube.com/watch?v=Kw5jkyLGFGc
!ffmpeg -i Kw5jkyLGFGc.m4a -acodec pcm_s16le -ac 1 -ar 16000 Kw5jkyLGFGc.wav
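
Before moving on, it can be worth checking that the conversion produced what the ffmpeg flags asked for: one channel (`-ac 1`), 16 kHz (`-ar 16000`) and 16-bit samples (`pcm_s16le`). The following is a minimal sketch using only the standard-library `wave` module; `wav_info` is a hypothetical helper name, not part of any library used in this notebook.

```python
import wave


def wav_info(path):
    """Return basic properties of a PCM WAV file as a dict."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            # Duration in seconds = number of frames / frames per second.
            "duration_s": frames / rate,
        }

# Example call (assumes the file from the ffmpeg step is in the working directory):
# wav_info("Kw5jkyLGFGc.wav")
```

For this video the duration should come out at roughly 711 seconds (11 minutes, 51 seconds), with 1 channel, a 16000 Hz sample rate and 2 bytes per sample.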

Here starts the actual ASR stuff.

!pip install transformers
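from transformers import pipeline

# Swedish wav2vec 2.0 model published by KBLab on the Hugging Face Hub.
_SWE_MODEL = "KBLab/wav2vec2-large-voxrex-swedish"
# Name the task explicitly instead of relying on it being inferred from the model.
pipe = pipeline("automatic-speech-recognition", model=_SWE_MODEL)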

For working with strides, there's information in a blog post.
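
As a rough illustration of what striding means here: the audio is cut into overlapping chunks, and the overlap (the stride) on each side of a chunk only gives the model context, so its predictions there are thrown away. The following is a minimal sketch of that bookkeeping in plain Python, not the library's actual implementation. `chunk_windows` is a hypothetical helper, and the 4 s / 2 s strides are example values only.

```python
def chunk_windows(total_s, chunk_length_s=10.0, stride_s=(4.0, 2.0)):
    """List (start, end, keep_from, keep_to) in seconds for each chunk.

    start/end is the audio the model sees; keep_from/keep_to is the part
    of the chunk whose predictions are kept after dropping the strides.
    """
    left, right = stride_s
    step = chunk_length_s - left - right
    if step <= 0:
        raise ValueError("the strides must be shorter than the chunk")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = end >= total_s
        # The first chunk has nothing before it and the last has nothing
        # after it, so no stride is dropped on those outer edges.
        keep_from = start if is_first else start + left
        keep_to = end if is_last else end - right
        windows.append((start, end, keep_from, keep_to))
        if is_last:
            break
        start += step
    return windows


# A 20-second clip with 10-second chunks and 2-second strides on each side.
for window in chunk_windows(20.0, chunk_length_s=10.0, stride_s=(2.0, 2.0)):
    print(window)
```

The kept regions line up end to end, so every second of audio is transcribed exactly once even though the chunks overlap. In the pipeline call itself, the overlap is set with the `stride_length_s` argument alongside `chunk_length_s`.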

There isn't much documentation on getting timestamps from a pipeline, but the details are in the pull request.

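# Plain transcription, processed in 10-second chunks so the long file is not handled in one pass.
output = pipe("/content/Kw5jkyLGFGc.wav", chunk_length_s=10)
# Same again with start/end timestamps for every word; this overwrites the first result.
output = pipe("/content/Kw5jkyLGFGc.wav", chunk_length_s=10, return_timestamps="word")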
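import json

# Save the transcript and word timestamps; ensure_ascii=False keeps å, ä and ö readable in the file.
with open("/content/Kw5jkyLGFGc.json", "w", encoding="utf-8") as f:
    json.dump(output, f, ensure_ascii=False)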
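
With `return_timestamps="word"`, the pipeline output is a dict with the full `"text"` plus a `"chunks"` list, where each entry holds one word and its `"timestamp"` as a (start, end) pair in seconds. The following is a small sketch of turning that into readable lines. `format_word_timestamps` is a hypothetical helper, and the sample dict is made up for illustration; it is not real output from this video.

```python
def format_word_timestamps(output):
    """Return one 'start-end WORD' line per word in a pipeline result."""
    lines = []
    for chunk in output["chunks"]:
        # The timestamp is a tuple in memory and a list after a JSON
        # round trip; unpacking works for both.
        start, end = chunk["timestamp"]
        lines.append(f"{start:.2f}-{end:.2f} {chunk['text']}")
    return lines


# Made-up sample in the same shape as the pipeline output.
sample = {
    "text": "HEJ OCH VÄLKOMNA",
    "chunks": [
        {"text": "HEJ", "timestamp": (0.5, 0.82)},
        {"text": "OCH", "timestamp": (0.9, 1.04)},
        {"text": "VÄLKOMNA", "timestamp": (1.1, 1.76)},
    ],
}

for line in format_word_timestamps(sample):
    print(line)
```

The same function can be applied to the dict loaded back from the saved JSON file with `json.load`.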