### Description
This PR updates how the Whisper model with beam search is exported and
run. Specifically, it:
- Adds `temperature` as a graph input to the exported model
- Fixes the token ids by adding them as attributes to
`WhisperBeamSearch` (see the inspection sketch after this list)
- Fixes the timestamps test cases so that they pass
- Fixes a bug with invoking `torch.onnx.export`
- Cleans up the Whisper scripts and groups the arguments in
`convert_to_onnx.py`
- Adds a `requirements.txt` file to specify package dependencies
- Adds `whisper-large-v3` to the list of pretrained models
- Fixes a bug with missing cross-attention KV cache inputs in the
decoder subgraph
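
As a sanity check, here is a minimal inspection sketch (hypothetical, not part of this PR; the model filename is a placeholder) that confirms the new `temperature` graph input and the token-id attributes on the `WhisperBeamSearch` node:

```python
# Minimal inspection sketch (illustrative; the model path is a placeholder).
import onnx

model = onnx.load("whisper_beamsearch.onnx")

# `temperature` should now appear among the top-level graph inputs.
print([graph_input.name for graph_input in model.graph.input])

# The token ids should now be attributes on the WhisperBeamSearch node.
node = next(n for n in model.graph.node if n.op_type == "WhisperBeamSearch")
for attr in node.attribute:
    if attr.name.endswith("token_id"):
        print(attr.name, onnx.helper.get_attribute_value(attr))
```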
### Motivation and Context
- This is a follow-up to [this
PR](#19188).
- The incorrect token ids in the timestamps processor were first noticed
during [this PR
review](#17500 (comment)).
When they were originally added in [this
PR](#15853), the offsets
were constant across the Whisper model sizes. Comparing the new
`whisper-large-v3` variant, the English-only variants (e.g.
`whisper-tiny.en`), and the original multilingual variants (e.g.
`whisper-tiny`) shows that both the values and the offsets now differ.
Therefore, it is easier to set the token ids as attributes of
`WhisperBeamSearch` when exporting to ensure the right values are used
in the timestamps processor (illustrated in the first sketch below).
- The Hugging Face API for returning timestamps and the expected outputs
from the PyTorch model have both changed (see the second sketch below).
- The fix for `torch.onnx.export` is a follow-up to [this PR
review](#17179 (comment)).
- The argument grouping is a follow-up to [this PR
review](#17500 (comment)).
- Specific package versions are needed to run the Whisper scripts, and
the `requirements.txt` file ensures that these versions are installed.
- The `whisper-large-v3` variant has been released and should be in the
list of official pretrained models.
- After the changes from [this
PR](#17316), the exported
model fails to load in an ORT inference session because the
cross-attention KV cache inputs are missing from the decoder subgraph
(see the third sketch below).
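
To illustrate the variant drift described above, here is a quick, hypothetical way (not part of this PR) to compare the special token ids across checkpoints via the Hugging Face tokenizers:

```python
# Compare special token ids across Whisper variants (illustrative only).
from transformers import WhisperTokenizer

for name in ("openai/whisper-tiny", "openai/whisper-tiny.en", "openai/whisper-large-v3"):
    tokenizer = WhisperTokenizer.from_pretrained(name)
    ids = {token: tokenizer.convert_tokens_to_ids(token)
           for token in ("<|startoftranscript|>", "<|transcribe|>", "<|notimestamps|>")}
    print(name, ids)
```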
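For the timestamps behavior, this is one current way to request timestamps through the Hugging Face API, shown here only for reference (the model choice and audio path are placeholders):

```python
# Request timestamps through the current Hugging Face pipeline API.
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
result = pipe("audio.mp3", return_timestamps=True)  # placeholder audio file
print(result["chunks"])  # [{"timestamp": (start, end), "text": "..."}, ...]
```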
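And a sketch (hypothetical; the path and input-name pattern are assumptions) of how the missing cross-attention KV cache inputs can be spotted and the fix verified:

```python
# Inspect the decoder subgraph and check that the model loads in ORT.
import onnx
import onnxruntime as ort

model_path = "whisper_beamsearch.onnx"  # placeholder path
model = onnx.load(model_path)

node = next(n for n in model.graph.node if n.op_type == "WhisperBeamSearch")
decoder = next(onnx.helper.get_attribute_value(a)
               for a in node.attribute if a.name == "decoder")

# With the fix, the cross-attention KV cache inputs appear in the decoder
# subgraph (the exact names depend on the export script).
print([i.name for i in decoder.input if "cross" in i.name])

# Without the fix, session creation fails; with it, the model loads.
ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
```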
The accompanying operator documentation diff (excerpt). Three further small hunks (near lines 467, 2252, and 5154) touch the same `vocab_mask` / `attention_mask` description in other operator sections; the `WhisperBeamSearch` changes are shown below:

```diff
@@ -5743,12 +5743,14 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
-<dd>The id of the token that indicates decoding starts.</dd>
+<dd>The id of the token that indicates decoding starts (i.e. the start of transcription token id)</dd>
 <dt><tt>early_stopping</tt> : int</dt>
 <dd>early stop or not</dd>
 <dt><tt>encoder</tt> : graph</dt>
@@ -5761,10 +5763,18 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
 <dd>Must be 2 for whisper</dd>
 <dt><tt>no_repeat_ngram_size</tt> : int</dt>
 <dd>no repeat ngrams size</dd>
-<dt><tt>no_speech_token</tt> : int</dt>
+<dt><tt>no_speech_token_id</tt> : int</dt>
 <dd>The token in whisper model that marks all sequence empty. With this model, whisper could output no_speech_prob after. Default -1.</dd>
+<dt><tt>no_timestamps_token_id</tt> : int</dt>
+<dd>The id of the token that indicates no timestamps</dd>
 <dt><tt>pad_token_id</tt> : int (required)</dt>
 <dd>The id of the padding token</dd>
+<dt><tt>start_of_lm_token_id</tt> : int</dt>
+<dd>The id of the token that indicates LM starts</dd>
+<dt><tt>transcribe_token_id</tt> : int</dt>
+<dd>The id of the transcribe task</dd>
+<dt><tt>translate_token_id</tt> : int</dt>
+<dd>The id of the translate task</dd>
 <dt><tt>vocab_size</tt> : int</dt>
 <dd>Size of the vocabulary. If not provided, it will be inferred from the decoder subgraph's output shape</dd>
 </dl>
@@ -5783,11 +5793,11 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
 <dt><tt>num_return_sequences</tt> : I</dt>
 <dd>The number of returned sequences in the batch. Shape is (1)</dd>
 <dt><tt>length_penalty</tt> (optional) : T</dt>
-<dd>Exponential penalty to the length. Default value 1.0 means no penalty.Value > 1.0 encourages longer sequences, while values < 1.0 produces shorter sequences.Shape is (1,)</dd>
+<dd>Exponential penalty to the length. Default value 1.0 means no penalty.Value > 1.0 encourages longer sequences, while values < 1.0 produces shorter sequences.Shape is (1,)</dd>
 <dd>Mask of vocabulary for first step. Words that masked with 0 are not allowed to be generated, and 1 is allowed. Shape is (batch_size, vocab_size)</dd>
 <dt><tt>attention_mask</tt> (optional) : I</dt>
@@ -5797,7 +5807,7 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
 <dt><tt>logits_processor</tt> (optional) : I</dt>
 <dd>Specific logits processor for different types of beamsearch models. Default value 0 means no specific logit processor. Accepts value >= 0. Shape is (1)</dd>
-<dd>Only keep this list of (layer, head) of QK in the final cross_qk output when use_cross_qk is set. Default collect allits shape is (number of (layer, head) to keep, 2), i.e., [[layer_id1, head_id1], [layer_id2, head_id2]......]</dd>
+<dd>Only keep this list of (layer, head) of QK in the final cross_qk output when use_cross_qk is set. Default collect all its shape is (number of (layer, head) to keep, 2), i.e., [[layer_id1, head_id1], [layer_id2, head_id2]......]</dd>
 <dd>Part of the decoder_input_ids that we need cross qk for it. it is of shape (batch_size, extra_decoding_ids_len).In such case, we should remove this from the tail of the decoder_input_ids, and put it here. ids < 0 in it (for multiple batch) are treated as stop of the extra_decoding_ids for corresponding batch.</dd>
 <dt><tt>temperature</tt> (optional) : T</dt>
@@ -5812,11 +5822,11 @@ This version of the operator has been available since version 1 of the 'com.microsoft' operator set.
 <dt><tt>sequences_scores</tt> (optional) : T</dt>
 <dd>Final beam score of the generated sequences. Shape is (batch_size, num_return_sequences)</dd>
 <dt><tt>scores</tt> (optional) : T</dt>
-<dd>Processed beam scores for each vocabulary token at each generation step.Beam scores consisting of log softmax scores for each vocabulary token and sum of log softmax of previously generated tokens in this beam.Shape is (max_length - sequence_length, batch_size, num_beams, vocab_size)</dd>
+<dd>Processed beam scores for each vocabulary token at each generation step.Beam scores consisting of log softmax scores for each vocabulary token and sum of log softmax of previously generated tokens in this beam.Shape is (max_length - sequence_length, batch_size, num_beams, vocab_size)</dd>
 <dt><tt>cross_qk</tt> (optional) : V</dt>
-<dd>Output the accumulated stacked Q*K in cross attentions. Let H = number of Head of cross attention, F = the frames or kv-seq-len of the cross attention input, T = real decoded token length, L = number of layers,B = batch size, R = num_return_sequences. It then should return tensor of shape [B, R, L*H, T, F].If cross_qk_layer_head is given, shape is [B, R, cross_qk_layer_head.shape[0], T, F]</dd>
+<dd>Output the accumulated stacked Q*K in cross attentions. Let H = number of Head of cross attention, F = the frames or kv-seq-len of the cross attention input, T = real decoded token length, L = number of layers,B = batch size, R = num_return_sequences. It then should return tensor of shape [B, R, L*H, T, F].If cross_qk_layer_head is given, shape is [B, R, cross_qk_layer_head.shape[0], T, F]</dd>
 <dt><tt>non_speech_probs</tt> (optional) : T</dt>
-<dd>For whisper model, output the probabilities from logits after encoder and context decoding for the no_speech_token.Currently we treat the last token's logits is what we need, in future extra graph logic may be add to the encoder/context-decoder subgraph.The prob is save before logits may be updated by extra-decoding-ids. The shape of non_speech_probs is [B]</dd>
+<dd>For whisper model, output the probabilities from logits after encoder and context decoding for the no_speech_token_id. The shape of non_speech_probs is [B]</dd>
```