[BugFix] Fix Llama4 - Index Error When Single Request Near Max Context #16209
```diff
@@ -264,7 +264,7 @@ def make_local_attention_virtual_batches(
         np.arange(pages_per_local_batch, dtype=np.int32),
         (virtual_batches, pages_per_local_batch)) \
         + np.expand_dims(block_starts, axis=1)
-    block_indices = block_indices.flatten()
+    block_indices = block_indices.flatten().clip(max=block_table.shape[1] - 1)
     batch_indices = np.repeat(np.arange(actual_batch_size, dtype=np.int32),
                               local_blocks * pages_per_local_batch)
     block_table_local = block_table[batch_indices, block_indices]\
```

Review comment:
I'm not sure what the correct number of elements is here, but

Reply:
Cool, ya, I can open up a separate PR to use that. I'm a bit hesitant to update this PR, since this one has been vetted by @LagPixelLOL (I'm not set up to repro it locally), and it would be nice to get something in.
Review comment:
Why does `block_indices` contain OOB items without the `clip`?

Reply:
It happens when `np.arange(pages_per_local_batch, dtype=np.int32)` runs off the end of the block table, i.e. when `max_model_len` is not a multiple of the `attention_chunk_size`. In that case we need to clip to simulate that there is a partial attention chunk at the end of the context.
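The failure mode described above can be reproduced in isolation. The sketch below (illustrative values only, not vLLM defaults; `block_starts`, `pages_per_local_batch`, etc. mirror the names in the diff but are constructed by hand here) shows how the last local-attention chunk can index one page past the end of the block table when `max_model_len` is not a multiple of `attention_chunk_size`, and how the `.clip()` from the patch clamps it:

```python
import numpy as np

# Illustrative sizes chosen so the max context is NOT a multiple
# of the local attention chunk (the condition described in the PR).
block_size = 2
attention_chunk_size = 6                 # local attention window, in tokens
max_model_len = 10                       # 10 % 6 != 0 -> partial final chunk
pages_per_local_batch = attention_chunk_size // block_size   # 3

# Block table only has pages for max_model_len tokens.
num_pages = max_model_len // block_size                      # 5
block_table = np.arange(num_pages, dtype=np.int32).reshape(1, num_pages)

# Two virtual batches; the second starts at page 3 (token 6), so a
# full chunk would touch pages 3, 4, 5 -- but page 5 does not exist.
block_starts = np.array([0, 3], dtype=np.int32)
virtual_batches = block_starts.shape[0]

block_indices = np.broadcast_to(
    np.arange(pages_per_local_batch, dtype=np.int32),
    (virtual_batches, pages_per_local_batch)) \
    + np.expand_dims(block_starts, axis=1)

flat = block_indices.flatten()
print(flat)        # [0 1 2 3 4 5] -- index 5 is out of bounds for 5 pages

# The fix: clamp to the last valid page, emulating the partial
# attention chunk at the end of the context.
clipped = flat.clip(max=block_table.shape[1] - 1)
print(clipped)     # [0 1 2 3 4 4] -- all indices now valid
```

Indexing `block_table` with the unclipped `flat` raises `IndexError`, which matches the symptom in the PR title; with `clipped` the gather succeeds, repeating the last page for the partial chunk.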