Word-level timestamps broken for short-form audio #30325

kamilakesbi · 2024-04-18T16:51:10Z

What does this PR do?

This PR aims to fix issue #30224: word-level timestamps are currently broken for Whisper large/large-v2/large-v3/distil-large-v3. The problem is that num_frames isn't passed to the generate method when stride is None.

I suggest adding an extra key, segment_size, to the output of the preprocess method. It stores the input length for short-form audio, is passed to the _forward method, and is used when stride is None to compute generate_kwargs["num_frames"] = segment_size // self.feature_extractor.hop_length.
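
To make the arithmetic concrete, here is a minimal sketch of that computation outside the pipeline. The helper name is hypothetical, and the hop length of 160 samples and the 16 kHz sampling rate are assumptions based on the usual Whisper feature extractor defaults; the actual code reads the value from self.feature_extractor.hop_length.

```python
# Assumed defaults (not taken from this PR): Whisper's feature extractor
# typically uses a hop length of 160 samples at a 16 kHz sampling rate.
HOP_LENGTH = 160
SAMPLING_RATE = 16_000


def num_frames_for_short_audio(segment_size: int, hop_length: int = HOP_LENGTH) -> int:
    """Hypothetical helper: number of feature frames covered by the audio.

    segment_size is the input length in samples, as stored by preprocess
    for short-form audio.
    """
    return segment_size // hop_length


# A 5-second clip at 16 kHz has 80,000 samples.
segment_size = 5 * SAMPLING_RATE
print(num_frames_for_short_audio(segment_size))  # 500
```

With these assumed values, a 5-second clip maps to 500 frames, so the timestamp alignment only looks at the part of the 30-second window that actually contains audio.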

This seems to fix the problem. Is there any particular test that should be run regarding this issue?

Who can review?

@sanchit-gandhi

HuggingFaceDocBuilderDev · 2024-04-18T17:11:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

kamilakesbi · 2024-04-23T13:47:16Z

Hi @sanchit-gandhi,

I've added slow tests that pass. With this PR we should be able to compute word-level timestamps for short form audio using the automatic-speech-recognition pipeline.

I've noticed that word-level timestamps cannot be computed when using WhisperForConditionalGeneration directly instead of the pipeline. I'll open an issue so we can fix this in another PR.

sanchit-gandhi

LGTM - thanks for fixing @kamilakesbi! Let's indeed do the model + processor API in a follow-up PR

cc @ylacombe and @xenova for info

tests/pipelines/test_pipelines_automatic_speech_recognition.py

kamilakesbi · 2024-04-30T11:05:08Z

Hey @amyeroberts! I would appreciate a final review here when you have time.
It should be pretty quick to review ;)

amyeroberts

Thanks for working on fixing this!

Just a few small questions/comments.

amyeroberts · 2024-04-30T12:48:59Z

src/transformers/pipelines/automatic_speech_recognition.py

+                        if isinstance(segment_size, int):
+                            generate_kwargs["num_frames"] = segment_size // self.feature_extractor.hop_length
+                        else:
+                            generate_kwargs["num_frames"] = segment_size[0] // self.feature_extractor.hop_length


In which cases would segment_size be an iterable? And will all the values of segment_size in the iterable be the same?

segment_size is an iterable when the pipeline is called with the batch_size argument (with bs > 1), as in here. In that case we get an iterable of size 1, and it stays at size 1 even if I increase the batch size.
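
The branch discussed in this thread can be sketched as a standalone function. This is only an illustration: the function name is hypothetical, and a plain list stands in for whatever iterable the batched pipeline actually produces.

```python
from typing import Sequence, Union


def resolve_num_frames(segment_size: Union[int, Sequence[int]], hop_length: int) -> int:
    """Hypothetical helper mirroring the int / iterable branch in the diff above."""
    if isinstance(segment_size, int):
        # Unbatched call: segment_size is a plain integer.
        return segment_size // hop_length
    # Batched call: a size-1 iterable is observed, so only the first item is read.
    return segment_size[0] // hop_length


print(resolve_num_frames(80_000, hop_length=160))    # 500
print(resolve_num_frames([80_000], hop_length=160))  # 500
```

Reading only the first element relies on the observation above that the iterable always has size 1; if that ever stopped holding, the remaining values would be silently ignored.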

amyeroberts · 2024-04-30T13:08:26Z

src/transformers/pipelines/automatic_speech_recognition.py

@@ -459,6 +460,7 @@ def preprocess(self, inputs, chunk_length_s=0, stride_length_s=None):
    def _forward(self, model_inputs, return_timestamps=False, **generate_kwargs):
        attention_mask = model_inputs.pop("attention_mask", None)
        stride = model_inputs.pop("stride", None)
+        segment_size = model_inputs.pop("segment_size", None)


Are there any cases when stride is None and we want to use segment_size?

Because we're popping both from model_inputs here, the user can technically pass them in directly as arguments when calling pipeline.forward. In that case, it would be good to have input validation on these arguments.

stride is None when the pipeline is created like this:

pipe = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3",
    return_timestamps="word",
)

In that case, segment_size is the alternative used to compute num_frames and add it to the generate_kwargs passed to the generate method here.

Sorry, I mistyped. I meant to say when stride is not None

We want to use segment_size only when stride is None.

OK. Then let's add a quick input validation so that the argument isn't silently ignored when stride is not None.

I've added the following lines:

if stride is not None and segment_size is not None:
    raise ValueError("segment_size must be used only when stride is None")
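
For illustration, the same check can be exercised in isolation. The function name and the example values below are hypothetical; in the PR the check sits inline in _forward.

```python
def check_stride_and_segment_size(stride, segment_size):
    """Hypothetical wrapper around the validation added in this PR."""
    if stride is not None and segment_size is not None:
        raise ValueError("segment_size must be used only when stride is None")


# Short-form audio: no stride, segment_size set -> accepted.
check_stride_and_segment_size(stride=None, segment_size=80_000)

# Both provided -> rejected instead of silently ignoring segment_size.
# The stride tuple here is an arbitrary placeholder value.
try:
    check_stride_and_segment_size(stride=(480_000, 0, 0), segment_size=80_000)
except ValueError as err:
    print(err)  # segment_size must be used only when stride is None
```

Raising early makes a conflicting call fail loudly, which is the behaviour requested in the review.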

amyeroberts · 2024-04-30T13:13:55Z

tests/pipelines/test_pipelines_automatic_speech_recognition.py

+            ],
+        }
+
+        # batch size 1: copy the audio sample since pipeline consumes it


👀 - as in, mutates it? It shouldn't do that....

I'm not sure why it does that; I simply copied the logic from another test (this one) here!

OK, let's open an issue to address this, so the work isn't forgotten but can be done in a separate PR. We should definitely avoid side effects like this 😬
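
As a small illustration of the side effect and of the workaround used in the test, the sketch below uses a hypothetical stand-in function that pops keys from its input dictionary. It does not call the real pipeline, and the dictionary keys are assumptions chosen for the example.

```python
import copy


def consuming_preprocess(sample: dict) -> list:
    """Hypothetical stand-in for code that removes keys from its input."""
    sample.pop("sampling_rate")
    return sample.pop("array")


sample = {"array": [0.0, 0.1, 0.2], "sampling_rate": 16_000}

# Passing a copy keeps the caller's dictionary intact between calls.
first = consuming_preprocess(copy.copy(sample))
second = consuming_preprocess(copy.copy(sample))

print(sorted(sample))   # ['array', 'sampling_rate']
print(first == second)  # True
```

A shallow copy is enough here because only dictionary keys are removed; copy.deepcopy would be the safer choice if the callee also modified the audio array in place.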

Ok, I'll open an issue!

amyeroberts

Thanks for fixing and iterating on this!

@kamilakesbi Do you have permission to merge the PR? If not, let me know and I can merge it for you.

kamilakesbi · 2024-05-03T07:34:52Z

Hi @amyeroberts, I don't! Could you merge it?
Thanks!

kamilakesbi added 2 commits April 15, 2024 14:17

force chunk_length_s in AutomaticSpeechRecognitionPipeline

fecd79d

compute num_frames even when stride is None

0f5fbae

kamilakesbi requested a review from sanchit-gandhi April 18, 2024 16:51

add slow tests

24a340a

sanchit-gandhi approved these changes Apr 23, 2024

fix test

15395df

amyeroberts reviewed Apr 30, 2024

kamilakesbi and others added 2 commits April 30, 2024 16:41

Update src/transformers/pipelines/automatic_speech_recognition.py

034449c

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

Update tests/pipelines/test_pipelines_automatic_speech_recognition.py

e52d11a

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

kamilakesbi changed the title ~~[WIP] - Word-level timestamps broken for short-form audio~~ Word-level timestamps broken for short-form audio Apr 30, 2024

kamilakesbi added 3 commits May 2, 2024 16:32

add input validation

47da641

Merge branch 'whisper_large_fix_word_level_timestamps' of github.com:kamilakesbi/transformers into whisper_large_fix_word_level_timestamps

a8cd483

fixup

76fbfd3

amyeroberts approved these changes May 2, 2024

small fix

a7a5d5e

amyeroberts merged commit 9c8979e into huggingface:main May 7, 2024
19 checks passed

ylacombe mentioned this pull request May 13, 2024

Whisper Word-level Timestamps broken on some inputs #29502

Closed

4 tasks

kamilakesbi mentioned this pull request May 17, 2024

AutomaticSpeechRecognition pipeline cannot predict WORD timestamps for Whisper models finetuned without timestamps prediction #30148

Closed

4 tasks

kamilakesbi added the Audio label May 17, 2024
