
DeepSeek V2/V3 with -mla option (final) #12772


Conversation

jukofyork
Collaborator

See the old PR for why attn_k_b_trans and attn_v_b have to stay as separate tensors: #12725

If you think you can find an acceptable way to avoid the attn_k_b_trans and attn_v_b duplicated tensors, then feel free to take the baton as I did from @fairydreaming, but I'm bowing out now. Assuming this tests OK (6 more hours of requanting - sigh), this is definitely my last attempt, as I have wasted far too much time on this already...

I'll move it out of draft as soon as I have tested the "lite" versions in an hour or so.

@jukofyork jukofyork changed the title DeepSeek V2/V3 with -mla option (fiinal) DeepSeek V2/V3 with -mla option (final) Apr 5, 2025
@jukofyork
Collaborator Author

        if name.endswith("kv_b_proj.weight"):
            name_kb = name.replace("kv_b_proj", "k_b_proj_trans")
            name_vb = name.replace("kv_b_proj", "v_b_proj")

            n_head_kv = self.hparams["num_key_value_heads"]
            v_head_dim = self.hparams["v_head_dim"]
            qk_nope_head_dim = self.hparams["qk_nope_head_dim"]

            assert data_torch.shape[0] == n_head_kv * (v_head_dim + qk_nope_head_dim)

            kv_b = data_torch.view(n_head_kv, v_head_dim + qk_nope_head_dim, data_torch.shape[-1])
            k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
            k_b_trans = k_b.transpose(1, 2)
            k_b_trans = k_b_trans.reshape(n_head_kv * data_torch.shape[-1], qk_nope_head_dim)
            v_b = v_b.reshape(n_head_kv * v_head_dim, data_torch.shape[-1])

            return [
                (self.map_tensor_name(name),    data_torch),
                (self.map_tensor_name(name_kb), k_b_trans),
                (self.map_tensor_name(name_vb), v_b)
            ]

One way we could save a little memory here is by saving k_b, k_b_trans and v_b (instead of also keeping the full kv_b), but this will make the logic even more complex inside struct llm_build_deepseek2 : public llm_graph_context:

                    ggml_tensor * kv = ggml_mul_mat(ctx0, model.layers[il].wkv_b, kv_cmpr);
                    cb(kv, "kv", il);

                    // split into {n_embd_head_qk_nope, n_head, n_tokens}
                    ggml_tensor * k_nope = ggml_view_3d(ctx0, kv,
                            n_embd_head_qk_nope, n_head, n_tokens,
                            ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v),
                            ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v) * n_head,
                            0);
                    cb(k_nope, "k_nope", il);

                    // and {n_embd_head_v, n_head, n_tokens}
                    ggml_tensor * v_states = ggml_view_3d(ctx0, kv,
                            n_embd_head_v, n_head, n_tokens,
                            ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v),
                            ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v) * n_head,
                            ggml_row_size(kv->type, n_embd_head_qk_nope));
                    cb(v_states, "v_states", il);

                    v_states = ggml_cont(ctx0, v_states);
                    cb(v_states, "v_states", il);

and there would have to be a special case for old GGUF files here.
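
For illustration, with the split tensors available the non-MLA branch could drop the strided views and the ggml_cont() above in favour of two plain mul_mats. This is only a sketch: the member names wk_b and wv_b are placeholders for however the un-transposed k_b and v_b end up being exposed in the model struct.

    // sketch only — assumes wk_b is the un-transposed k_b with ne = {kv_lora_rank, n_head * n_embd_head_qk_nope}
    //               and     wv_b is v_b               with ne = {kv_lora_rank, n_head * n_embd_head_v}
    ggml_tensor * k_nope = ggml_mul_mat(ctx0, model.layers[il].wk_b, kv_cmpr);   // {n_head * n_embd_head_qk_nope, n_tokens}
    k_nope = ggml_reshape_3d(ctx0, k_nope, n_embd_head_qk_nope, n_head, n_tokens);
    cb(k_nope, "k_nope", il);

    ggml_tensor * v_states = ggml_mul_mat(ctx0, model.layers[il].wv_b, kv_cmpr); // {n_head * n_embd_head_v, n_tokens}
    v_states = ggml_reshape_3d(ctx0, v_states, n_embd_head_v, n_head, n_tokens);
    cb(v_states, "v_states", il);

Both mul_mat results are contiguous, so the reshapes are free and the ggml_cont() for v_states is no longer needed.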

@jukofyork
Collaborator Author

jukofyork commented Apr 5, 2025

I've split the tensors so the new GGUF files use k_b, k_b_trans and v_b, which saves a little extra RAM but still allows the option of using -mla or not (at the cost of uglier code in llm_build_deepseek2() though...).

I've tested DeepSeek-V2-Lite-Chat in all 4 cases (legacy GGUF and new GGUF, with and without -mla), and all seems to be working fine.

It will likely be tomorrow before I have tested r1 though.

@jukofyork jukofyork marked this pull request as ready for review April 5, 2025 14:52
@jukofyork jukofyork requested a review from ngxson as a code owner April 5, 2025 14:52
@jukofyork
Collaborator Author

jukofyork commented Apr 5, 2025

This change reduces prompt processing speed quite badly for the "lite" model when not using -mla:

Mainline llama.cpp

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |         pp512 |      2778.75 ± 20.30 |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |         tg128 |         60.68 ± 0.02 |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        273.10 ± 0.89 |

This PR

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B BF16             |  29.31 GiB |    15.73 B | CUDA       |  99 |         pp512 |      1857.67 ± 20.06 |
| deepseek2 16B BF16             |  29.31 GiB |    15.73 B | CUDA       |  99 |         tg128 |         60.51 ± 0.24 |
| deepseek2 16B BF16             |  29.31 GiB |    15.73 B | CUDA       |  99 |   pp512+tg128 |        266.23 ± 1.02 |

Being brutally honest:

I think we should just remove the -mla option, make everyone re-quant to use only k_b_trans and v_b, and remove about 2/3rds of this likely unmaintainable code in the PR.

I've tried my best to make it maintainable, but it's just ended up a huge mess after adding all these special cases, and in a couple of months even I won't remember what is going on... 😦

The performance loss between using the -mla option and not using it likely lies somewhere else in the back-end code, and can be dealt with later. I honestly don't think there is ever going to be a good reason not to use -mla, and the only reason anybody ever used the MLA models converted to MHA is because there was no other option back then.

@jukofyork
Collaborator Author

Here's what the changes look like without the -mla option and just using the same named tensors as @fairydreaming's original PR:

https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:mla--final-simplified?expand=1

I won't open another PR as I already have several open/closed now 😁 but I think this looks much cleaner and easier to maintain long-term?

@jukofyork
Collaborator Author

jukofyork commented Apr 5, 2025

Mainline llama.cpp

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |         pp512 |      2778.75 ± 20.30 |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |         tg128 |         60.68 ± 0.02 |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        273.10 ± 0.89 |

The simplified MLA branch linked above

All BF16

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B BF16             |  29.26 GiB |    15.71 B | CUDA       |  99 |         pp512 |      2367.43 ± 14.71 |
| deepseek2 16B BF16             |  29.26 GiB |    15.71 B | CUDA       |  99 |         tg128 |         71.14 ± 0.09 |
| deepseek2 16B BF16             |  29.26 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        310.34 ± 0.59 |

All Q8_0

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B Q8_0             |  15.55 GiB |    15.71 B | CUDA       |  99 |         pp512 |      3001.38 ± 26.43 |
| deepseek2 16B Q8_0             |  15.55 GiB |    15.71 B | CUDA       |  99 |         tg128 |         74.54 ± 0.02 |
| deepseek2 16B Q8_0             |  15.55 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        329.66 ± 0.34 |

All Q8_0, but with attn_k_b and attn_v_b using BF16

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B Q8_0 & BF16      |  15.60 GiB |    15.71 B | CUDA       |  99 |         pp512 |      3123.83 ± 13.79 |
| deepseek2 16B Q8_0 & BF16      |  15.60 GiB |    15.71 B | CUDA       |  99 |         tg128 |         91.92 ± 0.05 |
| deepseek2 16B Q8_0 & BF16      |  15.60 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        398.45 ± 0.46 |

So at least this shows that it's not that MLA has worse performance; it's simply that the CUDA back-end doesn't like the {xxx, yyy, 128} batched multiplications for quantised attn_k_b and attn_v_b. I'm not sure if the same is true for the other back-ends though.
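
To make that shape pattern concrete, here is a tiny standalone toy (CPU backend, f32, zeroed data — the dimensions are assumptions loosely based on a 128-head model with a 512-wide latent, not taken from the PR) that builds the kind of thin, 128-way batched mul_mat that attn_k_b introduces per layer:

    // toy_batched_mulmat.cpp — illustrative only; build against ggml from the llama.cpp tree
    #include "ggml.h"
    #include "ggml-cpu.h"   // ggml_graph_compute_with_ctx lives here in newer trees

    int main() {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 1024u * 1024u * 1024u,   // 1 GiB scratch is plenty for this toy
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        // assumed dims: 128-dim nope part, 512-dim latent (kv_lora_rank), 128 heads, 512-token batch
        const int64_t qk_nope = 128, lora = 512, n_head = 128, n_tokens = 512;

        // weight {qk_nope, lora, n_head} and activations {qk_nope, n_tokens, n_head}:
        // i.e. 128 small independent mat-muls, one per head
        struct ggml_tensor * wk_b   = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, qk_nope, lora,     n_head);
        struct ggml_tensor * q_nope = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, qk_nope, n_tokens, n_head);
        ggml_set_zero(wk_b);
        ggml_set_zero(q_nope);

        struct ggml_tensor * out = ggml_mul_mat(ctx, wk_b, q_nope);   // -> {lora, n_tokens, n_head}

        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, out);
        ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 8);

        ggml_free(ctx);
        return 0;
    }

Swapping wk_b's type for a quantised one in this pattern is where the CUDA back-end seems to fall off a cliff, which is why keeping attn_k_b / attn_v_b in BF16 helps in the numbers above.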

@nicoboss
Contributor

nicoboss commented Apr 5, 2025

I think we should just remove the -mla option, make everyone re-quant to use only k_b_trans and v_b, and remove about 2/3rds of this likely unmaintainable code in the PR.
I honestly don't think there is ever going to be a good reason not to use -mla, and the only reason anybody ever used the MLA models converted to MHA is because there was no other option back then.

I agree. I don't think the -mla flag is needed. It would make sense to enable MLA by default if it is available in the model. I fear that if we make it opt-in, most casual users will not enable it because they are not aware that it is a feature they want. If anything, there should be a flag to disable it.

Regarding backwards compatibility, it will obviously hurt having to requant and reupload all DeepSeek V2/V3-based models, but in the end MLA offers a better user experience and so is likely something that should be done regardless of whether llama.cpp mainline still supports non-MLA quants or not. We are talking about 34 quants totaling 11579.4 GB for every V3-based model mradermacher offers, making this really painful but probably worth it. I agree that the simpler branch is much cleaner.

@jukofyork Do you know if the imatrix computed using the old Q8 quants needs to be recomputed as well? Computing the imatrix of a V3/R1-based model in Q8 using my RPC setup takes around 20 hours. It is for sure something I'm willing to do if there is a reason for it; I just don't want to waste weeks recomputing the imatrix for no reason.

I would prefer it if MLA gets merged soon. V3 was released on 26th December 2024, and the longer we wait, the more quants will be created without MLA, making the effort to requantize all of them harder and harder over time. I have decided to pause quantizing any DeepSeek V2/V3-based models for now until this is either merged or postponed again.

@jukofyork
Collaborator Author

I think I have found a solution that keeps the backward compatibility but removes the -mla option: you either use the old version of the GGUF and don't get to use MLA, or you use the new version of the GGUF and use MLA only.

This also means I don't have to make the ugly changes to llama-kv-cache.cpp. I'll post a link when I've tested it.

@nicoboss I can probably create an imatrix file as I have a machine with 1.5TB of RAM, but I want to be 100% clear on the final solution first.

@nicoboss
Contributor

nicoboss commented Apr 5, 2025

I think I have found a solution that keeps the backward compatibility but removes the -mla option: you either use the old version of the GGUF and don't get to use MLA, or you use the new version of the GGUF and use MLA only.
This also means I don't have to make the ugly changes to llama-kv-cache.cpp.

That is awesome. This is the perfect solution. It will make the transition to MLA supported models much easier. Thanks a lot for your great work!

@nicoboss I can probably create an imatrix file as I have a machine with 1.5TB of RAM, but I want to be 100% clear on the final solution first.

No problem, I can recompute them on my own. I just wanted to make sure doing so is required. Now that I know it is, I will recompute the imatrix for all DeepSeek V2/V3-based models once this is merged. For imatrix computation of such massive models I'm using an RPC setup with 512+256+128 GiB of RAM, with each node having a GPU. I'm using GGML_CUDA_ENABLE_UNIFIED_MEMORY on every RPC server to let GPU memory overflow into RAM, as there is no way to GPU-accelerate layers stored in RAM on RPC servers as far as I'm aware. 512+256+128 GiB is enough to compute the imatrix of all V3/R1-based models in Q8.

@bartowski1182
Contributor

GGML_CUDA_ENABLE_UNIFIED_MEMORY

Is this better than loading the model into just system memory and then using the GPU for KV calculations? Sorry to sidetrack, I just never heard of this option before today somehow.

@nicoboss
Contributor

nicoboss commented Apr 5, 2025

Is this better than loading the model into just system memory and then using the GPU for KV calculations? Sorry to sidetrack, I just never heard of this option before today somehow.

@bartowski1182 For optimal imatrix computation performance on a single server you just offload as many layers to the GPU as you can and keep the remaining ones in RAM. There you only use GGML_CUDA_ENABLE_UNIFIED_MEMORY in case we wrongly estimated the number of layers we can offload, so you don't OOM. You could offload all layers to the GPU and let it overflow to RAM using GGML_CUDA_ENABLE_UNIFIED_MEMORY, but then imatrix computation will take around twice as long.

For a multi-server RPC imatrix computation setup, using GGML_CUDA_ENABLE_UNIFIED_MEMORY on every RPC server seems mandatory, as all layers that get offloaded to an RPC server will be stored in GPU memory if you run it with the CUDA backend. If you don't use the CUDA backend, the layers will be stored in RAM but without GPU acceleration, making it take hundreds of hours. There doesn't seem to be an option for RPC servers to store layers in RAM while also using GPU acceleration, making GGML_CUDA_ENABLE_UNIFIED_MEMORY the only option providing reasonable performance.

Sorry to everyone who read this very off-topic message.

@jukofyork
Collaborator Author

jukofyork commented Apr 5, 2025

So here is the cleaned-up version, which adds a metadata tag called mla_optimized:

https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:mla--final-simplified?expand=1

  • If the tag is not in the file (or is set to false), then it will still load and run the legacy GGUF files with the unsplit attn_kv_proj in them.
  • If the tag is in the file and set to true, then it will expect to load the new GGUF file format with the split attn_k_proj and attn_v_proj in it, and use the optimised MLA branch in the code.

This looks a lot cleaner and easier to maintain to me, but it loses the -mla flag's ability to run the split attn_k_proj and attn_v_proj without MLA (which I can't really see the point of now).
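
For what it's worth, the loader-side check for such a tag is tiny with the gguf API; a minimal sketch (the full key name and helper below are my guesses, not necessarily what the branch uses):

    // sketch — the key name "deepseek2.mla_optimized" is assumed; check the branch for the real one
    #include "gguf.h"   // gguf_find_key / gguf_get_val_bool

    static bool gguf_mla_optimized(const struct gguf_context * ctx_gguf) {
        const int64_t kid = gguf_find_key(ctx_gguf, "deepseek2.mla_optimized");
        if (kid < 0) {
            return false;   // legacy GGUF: unsplit attn_kv_b, build the old non-MLA graph
        }
        return gguf_get_val_bool(ctx_gguf, kid);   // new GGUF: split tensors, build the MLA graph
    }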

@ngxson can you take a look at this branch and see what you think? If it's OK to go with losing the -mla option then I'll make that the new (hopefully final) PR instead. I've got rid of the ugliness in llama-kv-cache.cpp now, but maybe build_attn_mla() and how it calls build_attn_mha() can be refactored to be less ugly?

In case the compare window doesn't show up then it is this branch: https://github.com/jukofyork/llama.cpp/tree/mla--final-simplified

@jukofyork
Collaborator Author

I've cleaned this up even more now and have almost got the context shifting to work:

https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:mla--final-simplified--with-shift?expand=1

It's not quite working though, and I think it's possibly due to attn_factor_scaled or something in one of these:

        // We have to pre-scale kq_scale and attn_factor to make the YaRN RoPE work correctly.
        // See https://github.com/ggerganov/llama.cpp/discussions/7416 for detailed explanation.
        const float mscale = attn_factor * (1.0f + hparams.rope_yarn_log_mul * logf(1.0f / freq_scale));
        const float kq_scale = 1.0f*mscale*mscale/sqrtf(float(n_embd_head_k));
        const float attn_factor_scaled = 1.0f / (1.0f + 0.1f * logf(1.0f / freq_scale));

ggml_tensor * llama_context::build_rope_shift(
        ggml_context * ctx0,
        ggml_tensor * cur,
        ggml_tensor * shift,
        ggml_tensor * factors,
              float   freq_base,
              float   freq_scale,
        ggml_backend_buffer * bbuf) const {
    const auto & n_ctx_orig = cparams.n_ctx_orig_yarn;

    const auto & yarn_ext_factor  = cparams.yarn_ext_factor;
    const auto & yarn_attn_factor = cparams.yarn_attn_factor;
    const auto & yarn_beta_fast   = cparams.yarn_beta_fast;
    const auto & yarn_beta_slow   = cparams.yarn_beta_slow;

    const auto & hparams = model.hparams;

    const auto & n_rot     = hparams.n_rot;
    const auto & rope_type = hparams.rope_type;

    ggml_tensor * tmp;

    if (ggml_is_quantized(cur->type)) {
        // dequantize to f32 -> RoPE -> quantize back
        tmp = ggml_cast(ctx0, cur, GGML_TYPE_F32);

        if (bbuf) {
            for (const auto & backend : backends) {
                // Figure out which backend KV cache belongs to
                if (ggml_backend_supports_buft(backend.get(), ggml_backend_buffer_get_type(bbuf))) {
                    ggml_backend_sched_set_tensor_backend(sched.get(), tmp, backend.get());
                    break;
                }
            }
        }

        tmp = ggml_rope_ext_inplace(ctx0, tmp,
                shift, factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
                yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow);

        tmp = ggml_cpy(ctx0, tmp, cur);
    } else {
        // we rotate only the first n_rot dimensions
        tmp = ggml_rope_ext_inplace(ctx0, cur,
                shift, factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
                yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow);
    }

    return tmp;
}

I've got to go out today so no more time to investigate.

Changing this back to draft until I have this hopefully final version working.

@jukofyork jukofyork marked this pull request as draft April 6, 2025 07:19
@jukofyork
Collaborator Author

jukofyork commented Apr 6, 2025

Context shifting working now for both non-MLA and MLA:

https://github.com/jukofyork/llama.cpp/tree/mla--final-simplified--with-shift

https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:mla--final-simplified--with-shift?expand=1

Big respect to @fairydreaming for doing all the hard work in #7416!

It turned out to just need the RoPE-ed part stored first (so ggml_rope_ext_inplace() has it to work on) and the yarn_attn_factor_scaled setting in build_rope_shift().
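
For anyone following along, my reading of the second part looks roughly like this inside build_rope_shift() (the guard and variable name are an approximation, not a verbatim diff; the first part just means the RoPE-ed slice of the cached row has to come first so that the "we rotate only the first n_rot dimensions" path picks up the right slice):

    // DeepSeek2 must shift the cached K with the same pre-scaled YaRN attention factor
    // that the main graph uses, otherwise the re-rotated entries get the wrong magnitude
    const float yarn_attn_factor_scaled = model.arch == LLM_ARCH_DEEPSEEK2
        ? 1.0f / (1.0f + 0.1f * logf(1.0f / freq_scale))
        : cparams.yarn_attn_factor;

    tmp = ggml_rope_ext_inplace(ctx0, cur,
            shift, factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
            yarn_ext_factor, yarn_attn_factor_scaled, yarn_beta_fast, yarn_beta_slow);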

I'll tidy this up tomorrow and it should be done then.

@ngxson
Collaborator

ngxson commented Apr 7, 2025

@ngxson can you take a look at this branch and see what you think? If it's OK to go with losing the -mla option then I'll make that the new (hopefully final) PR instead. I've got rid of the ugliness in llama-kv-cache.cpp now, but maybe build_attn_mla() and how it calls build_attn_mha() can be refactored to be less ugly?

I don't have a strong opinion. From a UX perspective it's better not to have -mla because most users won't even notice that it's a thing. If we need to support both MLA and non-MLA, I think having MLA enabled by default and a -no-mla flag to disable it makes more sense.

I think it's OK if the code is ugly for a while and we find a way to refactor it later; it's better to have something usable first (so users can test it) and then improve / clean it up in a follow-up PR.

@jukofyork
Collaborator Author

Closing, as hopefully #12801 will be my last PR on this.

@jukofyork jukofyork closed this Apr 7, 2025