DeepSeek V2/V3 with -mla option (final) #12772
Conversation
```python
if name.endswith("kv_b_proj.weight"):
    name_kb = name.replace("kv_b_proj", "k_b_proj_trans")
    name_vb = name.replace("kv_b_proj", "v_b_proj")

    n_head_kv = self.hparams["num_key_value_heads"]
    v_head_dim = self.hparams["v_head_dim"]
    qk_nope_head_dim = self.hparams["qk_nope_head_dim"]

    assert data_torch.shape[0] == n_head_kv * (v_head_dim + qk_nope_head_dim)

    kv_b = data_torch.view(n_head_kv, v_head_dim + qk_nope_head_dim, data_torch.shape[-1])
    k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
    k_b_trans = k_b.transpose(1, 2)
    k_b_trans = k_b_trans.reshape(n_head_kv * data_torch.shape[-1], qk_nope_head_dim)
    v_b = v_b.reshape(n_head_kv * v_head_dim, data_torch.shape[-1])

    return [
        (self.map_tensor_name(name), data_torch),
        (self.map_tensor_name(name_kb), k_b_trans),
        (self.map_tensor_name(name_vb), v_b)
    ]
```

One way we could save a little memory here is by saving:

```cpp
ggml_tensor * kv = ggml_mul_mat(ctx0, model.layers[il].wkv_b, kv_cmpr);
cb(kv, "kv", il);
// split into {n_embd_head_qk_nope, n_head, n_tokens}
ggml_tensor * k_nope = ggml_view_3d(ctx0, kv,
n_embd_head_qk_nope, n_head, n_tokens,
ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v),
ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v) * n_head,
0);
cb(k_nope, "k_nope", il);
// and {n_embd_head_v, n_head, n_tokens}
ggml_tensor * v_states = ggml_view_3d(ctx0, kv,
n_embd_head_v, n_head, n_tokens,
ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v),
ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v) * n_head,
ggml_row_size(kv->type, n_embd_head_qk_nope));
cb(v_states, "v_states", il);
v_states = ggml_cont(ctx0, v_states);
cb(v_states, "v_states", il);
```

…and there would have to be a special case for old GGUF files here.
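To make the split above easier to follow, here is a standalone sketch of the same reshaping on a dummy tensor. The hyper-parameter values are made up (roughly "lite"-sized) and only the shapes matter; `k_b_trans` and `v_b` correspond to the `attn_k_b_trans` / `attn_v_b` tensors discussed below:

```python
import torch

# Illustrative hyper-parameters (assumptions, not taken from this PR).
n_head_kv        = 16
qk_nope_head_dim = 128
v_head_dim       = 128
kv_lora_rank     = 512   # width of the compressed latent (data_torch.shape[-1] above)

# Dummy kv_b_proj.weight with the layout the conversion code asserts on:
# rows are per-head [k_nope ; v] blocks, columns are the compressed latent dim.
kv_b = torch.randn(n_head_kv * (qk_nope_head_dim + v_head_dim), kv_lora_rank)

kv_b3 = kv_b.view(n_head_kv, qk_nope_head_dim + v_head_dim, kv_lora_rank)
k_b, v_b = torch.split(kv_b3, [qk_nope_head_dim, v_head_dim], dim=1)

k_b_trans = k_b.transpose(1, 2).reshape(n_head_kv * kv_lora_rank, qk_nope_head_dim)
v_b       = v_b.reshape(n_head_kv * v_head_dim, kv_lora_rank)

print(k_b_trans.shape)  # torch.Size([8192, 128])  -> stored as "k_b_proj_trans"
print(v_b.shape)        # torch.Size([2048, 512])  -> stored as "v_b_proj"
```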
I've split it so the new GGUF files use: … I've tested on …, and it will likely be tomorrow by the time I have tested …
This change reduces the prompt processing quite badly for the "lite" model when not using …

[Benchmark tables comparing mainline with this branch omitted.]
Here's what the changes look like without the …

I won't open another PR as I already have several open/closed now 😁, but I think this looks much cleaner and easier to maintain long-term?
[Benchmark tables omitted.]
I agree. I don't think the …

Regarding backwards compatibility: it will obviously hurt having to requantize and re-upload all DeepSeek V2/V3 based models, but in the end MLA offers a better user experience, so it is likely something that should be done regardless of whether llama.cpp mainline still supports non-MLA quants or not. We are talking about 34 quants totaling 11579.4 GB for every V3 based model mradermacher offers, which makes this really painful but probably worth it. I agree that the simpler branch is much cleaner.

@jukofyork Do you know if the imatrix computed using the old Q8 quants needs to be recomputed as well? Computing the imatrix of a V3/R1 based model in Q8 using my RPC setup takes around 20 hours. It is for sure something I'm willing to do if there is a reason for it; I just don't want to waste weeks recomputing the imatrix for no reason.

I would prefer if MLA gets merged soon. V3 was released on 26 December 2024, and the longer we wait the more quants will be created without MLA, making the effort to requantize all of them harder and harder over time. I have decided to pause quantizing any DeepSeek V2/V3 based model for now until this is either merged or postponed again.
I think I have found a solution that keeps the backward compatibility but removes the …

This also means I don't have to make the ugly changes to …

@nicoboss I can probably create an imatrix file, as I have a machine with 1.5TB of RAM, but I want to be 100% clear on the final solution first.
That is awesome. This is the perfect solution. It will make the transition to MLA supported models much easier. Thanks a lot for your great work!
No problem, I can recompute them on my own; I just wanted to make sure doing so is required. Now that I know it is, I will recompute the imatrix for all DeepSeek V2/V3 based models once this is merged. For imatrix computation of such massive models I'm using an RPC setup with 512+256+128 GiB of RAM, with each node having a GPU. I'm using GGML_CUDA_ENABLE_UNIFIED_MEMORY on every RPC server to let GPU memory overflow into RAM, as there is no way to GPU-accelerate layers in RAM on RPC servers as far as I'm aware. 512+256+128 GiB is enough to compute the imatrix of all V3/R1 based models in Q8.
Is this better than loading the model into just system memory and then using the GPU for KV calculations? Sorry to sidetrack, I'd just never heard of this option before today somehow.
@bartowski1182 For optimal imatrix computation performance on a single server you just offload as many layers to the GPU as you can and keep the remaining ones in RAM. There you only use GGML_CUDA_ENABLE_UNIFIED_MEMORY in case we wrongly estimated the number of layers we can offload, so you don't OOM. You could offload all layers to the GPU and let it overflow to RAM using GGML_CUDA_ENABLE_UNIFIED_MEMORY, but then imatrix computation will take around twice as long.

For a multi-server RPC imatrix computation setup, using GGML_CUDA_ENABLE_UNIFIED_MEMORY on every RPC server seems mandatory, as all layers that get offloaded to an RPC server will be stored in GPU memory if you run it with the CUDA backend. If you don't use the CUDA backend the layers will be stored in RAM, but there will be no GPU acceleration, making it take hundreds of hours. There doesn't seem to be an option for RPC servers to store layers in RAM while also using GPU acceleration, which makes GGML_CUDA_ENABLE_UNIFIED_MEMORY the only option providing reasonable performance. Sorry to everyone who read this very off-topic message.
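As a rough sanity check of that last claim (back-of-the-envelope only; the parameter count and the ~8.5 bits per weight for Q8_0 are my assumptions, not measurements from this setup):

```python
# Rough estimate: does a Q8_0 DeepSeek-V3/R1 fit in 512 + 256 + 128 GiB of RAM?
params      = 671e9   # ~671B total parameters (assumption)
bits_per_w  = 8.5     # Q8_0 stores ~8.5 bits/weight including block scales (assumption)

model_gib = params * bits_per_w / 8 / 2**30
total_gib = 512 + 256 + 128

print(f"model ≈ {model_gib:.0f} GiB, available = {total_gib} GiB")
# model ≈ 664 GiB, available = 896 GiB -> fits, with headroom for KV cache and buffers
```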
So here is the cleaned-up version that adds a metadata tag called …
This looks a lot cleaner and easier to maintain to me, but loses the …

@ngxson can you take a look at this branch and see what you think, and whether it's OK to go with losing the …?

In case the compare window doesn't show up, it is this branch: https://github.com/jukofyork/llama.cpp/tree/mla--final-simplified
I've cleaned this up even more now and have almost got the context shifting to work: …

It's not quite working though, and I think it's possibly due to this pre-scaling:

```cpp
// We have to pre-scale kq_scale and attn_factor to make the YaRN RoPE work correctly.
// See https://github.com/ggerganov/llama.cpp/discussions/7416 for detailed explanation.
const float mscale = attn_factor * (1.0f + hparams.rope_yarn_log_mul * logf(1.0f / freq_scale));
const float kq_scale = 1.0f*mscale*mscale/sqrtf(float(n_embd_head_k));
const float attn_factor_scaled = 1.0f / (1.0f + 0.1f * logf(1.0f / freq_scale));
```

and here is `build_rope_shift` for reference:

```cpp
ggml_tensor * llama_context::build_rope_shift(
ggml_context * ctx0,
ggml_tensor * cur,
ggml_tensor * shift,
ggml_tensor * factors,
float freq_base,
float freq_scale,
ggml_backend_buffer * bbuf) const {
const auto & n_ctx_orig = cparams.n_ctx_orig_yarn;
const auto & yarn_ext_factor = cparams.yarn_ext_factor;
const auto & yarn_attn_factor = cparams.yarn_attn_factor;
const auto & yarn_beta_fast = cparams.yarn_beta_fast;
const auto & yarn_beta_slow = cparams.yarn_beta_slow;
const auto & hparams = model.hparams;
const auto & n_rot = hparams.n_rot;
const auto & rope_type = hparams.rope_type;
ggml_tensor * tmp;
if (ggml_is_quantized(cur->type)) {
// dequantize to f32 -> RoPE -> quantize back
tmp = ggml_cast(ctx0, cur, GGML_TYPE_F32);
if (bbuf) {
for (const auto & backend : backends) {
// Figure out which backend KV cache belongs to
if (ggml_backend_supports_buft(backend.get(), ggml_backend_buffer_get_type(bbuf))) {
ggml_backend_sched_set_tensor_backend(sched.get(), tmp, backend.get());
break;
}
}
}
tmp = ggml_rope_ext_inplace(ctx0, tmp,
shift, factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow);
tmp = ggml_cpy(ctx0, tmp, cur);
} else {
// we rotate only the first n_rot dimensions
tmp = ggml_rope_ext_inplace(ctx0, cur,
shift, factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow);
}
return tmp;
}
```

I've got to go out today so no more time to investigate. Changing this back to draft until I have this hopefully final version working.
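For anyone trying to follow the YaRN interaction above, here is a small numeric sketch of how I read that pre-scaling (my interpretation of the #7416 discussion; the `freq_scale`, `rope_yarn_log_mul` and head-size values are made up): ggml's YaRN RoPE path, as I understand it, multiplies the rotated values by an internal `1 + 0.1*ln(1/freq_scale)` factor, so `attn_factor_scaled` is chosen to cancel that, and the model's full `mscale` is instead applied once via `kq_scale`:

```python
import math

# Illustrative YaRN numbers (assumed, not from a real model config):
freq_scale        = 1.0 / 40.0   # e.g. a 40x context extension
attn_factor       = 1.0
rope_yarn_log_mul = 0.1          # hparams.rope_yarn_log_mul in the snippet above
n_embd_head_k     = 192

# What should end up applied to the attention scores (folded into kq_scale):
mscale   = attn_factor * (1.0 + rope_yarn_log_mul * math.log(1.0 / freq_scale))
kq_scale = mscale * mscale / math.sqrt(n_embd_head_k)

# ggml's YaRN RoPE internally multiplies by (1 + 0.1*ln(1/freq_scale)) (my understanding),
# so attn_factor_scaled is chosen to cancel that internal factor exactly:
internal           = 1.0 + 0.1 * math.log(1.0 / freq_scale)
attn_factor_scaled = 1.0 / internal

print(attn_factor_scaled * internal)  # ≈ 1.0 -> the RoPE applies no extra magnitude scaling
print(kq_scale)                       # the scaling is applied once, explicitly, in the KQ matmul
```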
Context shifting is working now for both non-MLA and MLA: https://github.com/jukofyork/llama.cpp/tree/mla--final-simplified--with-shift

Big respect for @fairydreaming for doing all the hard work in #7416! It turned out to just need the RoPE-ed part first for …

I'll tidy this up tomorrow and it should be done then.
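As a toy illustration of the "RoPE-ed part first" point (entirely my own sketch, with made-up sizes): if the rotated `k_pe` dimensions come first in each cached row, the shift only has to rotate the first `n_rot` entries and the compressed part is left untouched, which matches the "we rotate only the first n_rot dimensions" branch of `build_rope_shift` above.

```python
import numpy as np

n_rot, kv_lora_rank = 64, 512                   # made-up sizes for illustration
row = np.random.randn(n_rot + kv_lora_rank)     # cached row laid out as [ k_pe | c_kv ]

def rope_rotate(x, shift, base=10000.0):
    # Rotate dimension pairs by an extra `shift` positions (plain RoPE, no YaRN).
    half  = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) * 2.0 / x.shape[-1])
    ang   = shift * freqs
    x0, x1 = x[:half], x[half:]
    return np.concatenate([x0 * np.cos(ang) - x1 * np.sin(ang),
                           x0 * np.sin(ang) + x1 * np.cos(ang)])

shifted = row.copy()
shifted[:n_rot] = rope_rotate(row[:n_rot], shift=-5)  # rotate only the first n_rot dims
assert np.allclose(shifted[n_rot:], row[n_rot:])      # compressed c_kv part is untouched
```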
I don't have a strong opinion. From a UX perspective it's better not to have …

I think it's OK if the code is ugly for a while and we find a way to refactor it later; it's better to have something usable first (so users can test it) and then improve / clean it up in a follow-up PR.
Closing, as hopefully #12801 will be my last PR on this.
See the old PR for why `attn_k_b_trans` and `attn_v_b` have to stay as separate tensors: #12725

If you think you can find an acceptable way to avoid the `attn_k_b_trans` and `attn_v_b` duplicated tensors, then feel free to take the baton as I did from @fairydreaming, but I'm bowing out now. Assuming this tests OK (6 more hours of requanting - sigh), this is definitely my last attempt, as I have wasted far too much time on this already...

I'll move it out of draft as soon as I have tested the "lite" versions in an hour or so.
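For anyone who hasn't followed the earlier PRs: as I read it, the reason the two halves of `kv_b_proj` are used separately at all is the usual MLA "absorption" identity — the nope part of the score can be computed against the compressed latent directly by folding the k half into the query, while the v half is only applied after the attention-weighted sum. A minimal torch sketch of that identity (names and dimensions are illustrative, not the PR's):

```python
import torch

# Illustrative per-head dimensions (assumptions, not taken from this PR).
qk_nope_head_dim, v_head_dim, kv_lora_rank, n_tokens = 128, 128, 512, 7

k_b    = torch.randn(qk_nope_head_dim, kv_lora_rank, dtype=torch.float64)  # one head of the k half of kv_b_proj
v_b    = torch.randn(v_head_dim, kv_lora_rank, dtype=torch.float64)        # one head of the v half of kv_b_proj
c_kv   = torch.randn(n_tokens, kv_lora_rank, dtype=torch.float64)          # compressed latent the MLA cache stores
q_nope = torch.randn(qk_nope_head_dim, dtype=torch.float64)                # nope part of one query head

# Naive path: decompress per-token keys first, then score against the query.
k_nope       = c_kv @ k_b.T          # (n_tokens, qk_nope_head_dim)
scores_naive = k_nope @ q_nope       # (n_tokens,)

# "Absorbed" path: fold k_b into the query once and score against c_kv directly,
# which is what makes caching only the compressed latent worthwhile.
q_absorbed = k_b.T @ q_nope          # (kv_lora_rank,)
scores_mla = c_kv @ q_absorbed       # (n_tokens,)

assert torch.allclose(scores_naive, scores_mla)

# The value side needs the v half *after* the softmax-weighted sum over c_kv,
# so the two halves are used in different places (and orientations) in the graph.
weights = torch.softmax(scores_mla / qk_nope_head_dim**0.5, dim=0)
out     = (weights @ c_kv) @ v_b.T   # (v_head_dim,)
```

The real graph does this per head with `ggml_mul_mat`, which is presumably why the k half is stored pre-transposed; see #12725 for the details.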