DeepSeek V2/V3 with -mla option #12725
Changes from 21 commits
@@ -1030,6 +1030,8 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
     { LLM_TENSOR_ATTN_Q_B, "blk.%d.attn_q_b" },
     { LLM_TENSOR_ATTN_KV_A_MQA, "blk.%d.attn_kv_a_mqa" },
     { LLM_TENSOR_ATTN_KV_B, "blk.%d.attn_kv_b" },
+    { LLM_TENSOR_ATTN_K_B, "blk.%d.attn_k_b" },
+    { LLM_TENSOR_ATTN_V_B, "blk.%d.attn_v_b" },
     { LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
     { LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
     { LLM_TENSOR_FFN_GATE, "blk.%d.ffn_gate" },
@@ -1471,23 +1473,8 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
     {LLM_TENSOR_ATTN_Q_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
     {LLM_TENSOR_ATTN_KV_A_MQA, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
     {LLM_TENSOR_ATTN_KV_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_DEC_ATTN_Q, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_DEC_ATTN_K, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_Q, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_K, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_V, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_QKV, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_OUT, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_FFN_GATE, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_FFN_DOWN, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_FFN_UP, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_FFN_DOWN_SHEXP, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_FFN_GATE_SHEXP, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_FFN_UP_SHEXP, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_Q_A, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_Q_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_KV_A_MQA, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
-    {LLM_TENSOR_ATTN_KV_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
Comment on lines -1474 to -1490

I think these were deleted inadvertently? For example, ffn_*_shexp are still used by Qwen MoE.

I think these were all accidentally duplicated in the main branch, so I removed the duplicates when inserting the new ones.
+    {LLM_TENSOR_ATTN_K_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
+    {LLM_TENSOR_ATTN_V_B, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
     {LLM_TENSOR_DEC_ATTN_Q, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
     {LLM_TENSOR_DEC_ATTN_K, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
     {LLM_TENSOR_DEC_ATTN_V, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
Please tell me if I missed something regarding the discussion about whether or not to duplicate these tensor (slices).

On the subject of not duplicating them: I'm thinking about an idea that could allow slicing kv_b_proj at load time without using too much memory, by doing something along these lines. The ggml_cpy and ggml_view calls won't allocate new memory on the device buffer; only ggml_cont needs to allocate memory.
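A minimal sketch of that slicing idea (not code from this PR), assuming an F32 wkv_b with ggml shape { kv_lora_rank, n_head * (n_qk_nope + n_v) } whose per-head block stores the K rows before the V rows; all variable names here are placeholders:

    // strided 3D view of the K rows of every head -- no new device memory
    struct ggml_tensor * k_b = ggml_view_3d(ctx, wkv_b,
            kv_lora_rank, n_qk_nope, n_head,
            wkv_b->nb[1],                      // stride between rows of one head
            wkv_b->nb[1] * (n_qk_nope + n_v),  // stride between heads
            0);

    // transpose each head's slice -- still only a view, no allocation
    struct ggml_tensor * k_b_t = ggml_permute(ctx, k_b, 1, 0, 2, 3);

    // only this contiguous copy actually allocates memory
    struct ggml_tensor * wk_b = ggml_cont(ctx, k_b_t);

The V rows could be sliced the same way by passing n_v rows per head and an offset of wkv_b->nb[1] * n_qk_nope.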
Yeah, I think we could easily do this (or something similar) so long as we keep kv_proj_b as float32, but this has different problems: it would force the -mla option to keep kv_proj_b stored as float32 when it doesn't really need it and can just use the ggml_mul_mat_set_prec(xxx, GGML_PREC_F32) call instead.
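A short sketch of that alternative (placeholder tensor names): keep the weight in its quantized type and request F32 precision for the sensitive matmul instead of converting the whole tensor.

    // ask ggml to compute this matmul in F32 rather than storing wk_b as F32
    struct ggml_tensor * q_absorbed = ggml_mul_mat(ctx, wk_b, q_nope);
    ggml_mul_mat_set_prec(q_absorbed, GGML_PREC_F32);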
This raises a good point though: I don't think we really need to save wv_b at all and can just use the upper slice of wkv_b (I think - will have to check tomorrow).
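If that checks out, a sketch of taking v_b as a view into wkv_b at graph-build time rather than storing a separate wv_b tensor, under the same assumed F32 layout and placeholder names as the earlier sketch:

    // the n_v V rows sit after the n_qk_nope K rows of each head's block
    struct ggml_tensor * v_b = ggml_view_3d(ctx, wkv_b,
            kv_lora_rank, n_v, n_head,
            wkv_b->nb[1],                      // stride between rows of one head
            wkv_b->nb[1] * (n_qk_nope + n_v),  // stride between heads
            wkv_b->nb[1] * n_qk_nope);         // skip the K rows
    v_b = ggml_cont(ctx, v_b);                 // materialize before the matmul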