
DeepSeek V2/V3 with -mla option (final) #12772


Conversation

jukofyork
Collaborator

See the old PR for why attn_k_b_trans and attn_v_b have to stay as separate tensors: #12725

If you think you can find an acceptable way to avoid the attn_k_b_trans and attn_v_b duplicated tensors, then feel free to take the baton as I did from @fairydreaming, but I'm bowing out now. Assuming this tests OK (6 more hours of requanting - sigh), this is definitely my last attempt, as I have wasted far too much time on this already...

I'll move it out of draft as soon as I have tested the "lite" versions in an hour or so.

@jukofyork jukofyork changed the title DeepSeek V2/V3 with -mla option (fiinal) DeepSeek V2/V3 with -mla option (final) Apr 5, 2025
@jukofyork
Collaborator Author

        if name.endswith("kv_b_proj.weight"):
            name_kb = name.replace("kv_b_proj", "k_b_proj_trans")
            name_vb = name.replace("kv_b_proj", "v_b_proj")

            n_head_kv = self.hparams["num_key_value_heads"]
            v_head_dim = self.hparams["v_head_dim"]
            qk_nope_head_dim = self.hparams["qk_nope_head_dim"]

            assert data_torch.shape[0] == n_head_kv * (v_head_dim + qk_nope_head_dim)

            kv_b = data_torch.view(n_head_kv, v_head_dim + qk_nope_head_dim, data_torch.shape[-1])
            k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
            k_b_trans = k_b.transpose(1, 2)
            k_b_trans = k_b_trans.reshape(n_head_kv * data_torch.shape[-1], qk_nope_head_dim)
            v_b = v_b.reshape(n_head_kv * v_head_dim, data_torch.shape[-1])

            return [
                (self.map_tensor_name(name),    data_torch),
                (self.map_tensor_name(name_kb), k_b_trans),
                (self.map_tensor_name(name_vb), v_b)
            ]

One way we could save a little memory here is by saving k_b, k_b_trans and v_b (instead of also keeping the full kv_b), but this will make the logic even more complex inside struct llm_build_deepseek2 : public llm_graph_context:

                    ggml_tensor * kv = ggml_mul_mat(ctx0, model.layers[il].wkv_b, kv_cmpr);
                    cb(kv, "kv", il);

                    // split into {n_embd_head_qk_nope, n_head, n_tokens}
                    ggml_tensor * k_nope = ggml_view_3d(ctx0, kv,
                            n_embd_head_qk_nope, n_head, n_tokens,
                            ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v),
                            ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v) * n_head,
                            0);
                    cb(k_nope, "k_nope", il);

                    // and {n_embd_head_v, n_head, n_tokens}
                    ggml_tensor * v_states = ggml_view_3d(ctx0, kv,
                            n_embd_head_v, n_head, n_tokens,
                            ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v),
                            ggml_row_size(kv->type, n_embd_head_qk_nope + n_embd_head_v) * n_head,
                            ggml_row_size(kv->type, n_embd_head_qk_nope));
                    cb(v_states, "v_states", il);

                    v_states = ggml_cont(ctx0, v_states);
                    cb(v_states, "v_states", il);

and there would have to be a special case for old GGUF files here.
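
For illustration, with the split tensors available the non-MLA branch could drop the strided views and the ggml_cont() above in favour of two plain mul_mats. This is only a sketch: the member names wk_b and wv_b are placeholders for however the un-transposed k_b and v_b end up being exposed in the model struct.

    // sketch only — assumes wk_b is the un-transposed k_b with ne = {kv_lora_rank, n_head * n_embd_head_qk_nope}
    //               and     wv_b is v_b               with ne = {kv_lora_rank, n_head * n_embd_head_v}
    ggml_tensor * k_nope = ggml_mul_mat(ctx0, model.layers[il].wk_b, kv_cmpr);   // {n_head * n_embd_head_qk_nope, n_tokens}
    k_nope = ggml_reshape_3d(ctx0, k_nope, n_embd_head_qk_nope, n_head, n_tokens);
    cb(k_nope, "k_nope", il);

    ggml_tensor * v_states = ggml_mul_mat(ctx0, model.layers[il].wv_b, kv_cmpr); // {n_head * n_embd_head_v, n_tokens}
    v_states = ggml_reshape_3d(ctx0, v_states, n_embd_head_v, n_head, n_tokens);
    cb(v_states, "v_states", il);

Both mul_mat results are contiguous, so the reshapes are free and the ggml_cont() for v_states is no longer needed.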

@jukofyork
Collaborator Author

jukofyork commented Apr 5, 2025

I've split the tensors so the new GGUF files use k_b, k_b_trans and v_b, which saves a little extra RAM but still allows the option of using -mla or not (at the cost of uglier code in llm_build_deepseek2() though...).

I've tested DeepSeek-V2-Lite-Chat in all 4 cases (legacy GGUF and new GGUF, with and without -mla), and all seems to be working fine.

It will likely be tomorrow before I have tested r1 though.

@jukofyork jukofyork marked this pull request as ready for review April 5, 2025 14:52
@jukofyork jukofyork requested a review from ngxson as a code owner April 5, 2025 14:52
@jukofyork
Collaborator Author

jukofyork commented Apr 5, 2025

This change reduces prompt processing speed quite badly for the "lite" model when not using -mla:

Mainline llama.cpp

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |         pp512 |      2778.75 ± 20.30 |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |         tg128 |         60.68 ± 0.02 |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        273.10 ± 0.89 |

This PR

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B BF16             |  29.31 GiB |    15.73 B | CUDA       |  99 |         pp512 |      1857.67 ± 20.06 |
| deepseek2 16B BF16             |  29.31 GiB |    15.73 B | CUDA       |  99 |         tg128 |         60.51 ± 0.24 |
| deepseek2 16B BF16             |  29.31 GiB |    15.73 B | CUDA       |  99 |   pp512+tg128 |        266.23 ± 1.02 |

Being brutally honest:

I think we should just remove the -mla option, make everyone re-quant to use only k_b_trans and v_b, and remove about 2/3rds of this likely unmaintainable code in the PR.

I've tried my best to make it maintainable, but it's just ended up a huge mess after adding all these special cases, and in a couple of months even I won't remember what is going on... 😦

The performance loss between using the -mla option and not using it likely lies somewhere else in the back-end code, and can be dealt with later. I honestly don't think there is ever going to be a good reason not to use -mla, and the only reason anybody ever used the MLA models converted to MHA is because there was no other option back then.

@jukofyork
Collaborator Author

Here's what the changes look like without the -mla option and just using the same named tensors as @fairydreaming's original PR:

https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:mla--final-simplified?expand=1

I won't open another PR as I already have several open/closed now 😁 but I think this looks much cleaner and easier to maintain long-term?

@jukofyork
Collaborator Author

jukofyork commented Apr 5, 2025

Mainline llama.cpp

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |         pp512 |      2778.75 ± 20.30 |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |         tg128 |         60.68 ± 0.02 |
| deepseek2 16B F16              |  29.26 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        273.10 ± 0.89 |

The simplified MLA branch linked above

All BF16

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B BF16             |  29.26 GiB |    15.71 B | CUDA       |  99 |         pp512 |      2367.43 ± 14.71 |
| deepseek2 16B BF16             |  29.26 GiB |    15.71 B | CUDA       |  99 |         tg128 |         71.14 ± 0.09 |
| deepseek2 16B BF16             |  29.26 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        310.34 ± 0.59 |

All Q8_0

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B Q8_0             |  15.55 GiB |    15.71 B | CUDA       |  99 |         pp512 |      3001.38 ± 26.43 |
| deepseek2 16B Q8_0             |  15.55 GiB |    15.71 B | CUDA       |  99 |         tg128 |         74.54 ± 0.02 |
| deepseek2 16B Q8_0             |  15.55 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        329.66 ± 0.34 |

All Q8_0, but with attn_k_b and attn_v_b using BF16

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 16B Q8_0 & BF16      |  15.60 GiB |    15.71 B | CUDA       |  99 |         pp512 |      3123.83 ± 13.79 |
| deepseek2 16B Q8_0 & BF16      |  15.60 GiB |    15.71 B | CUDA       |  99 |         tg128 |         91.92 ± 0.05 |
| deepseek2 16B Q8_0 & BF16      |  15.60 GiB |    15.71 B | CUDA       |  99 |   pp512+tg128 |        398.45 ± 0.46 |

So at least this shows that it's not that MLA has worse performance; it's simply that the CUDA back-end doesn't like the {xxx, yyy, 128} batched multiplications for quantised attn_k_b and attn_v_b. I'm not sure if the same is true for the other back-ends though.
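
To make that shape pattern concrete, here is a tiny standalone toy (CPU backend, f32, zeroed data — the dimensions are assumptions loosely based on a 128-head model with a 512-wide latent, not taken from the PR) that builds the kind of thin, 128-way batched mul_mat that attn_k_b introduces per layer:

    // toy_batched_mulmat.cpp — illustrative only; build against ggml from the llama.cpp tree
    #include "ggml.h"
    #include "ggml-cpu.h"   // ggml_graph_compute_with_ctx lives here in newer trees

    int main() {
        struct ggml_init_params params = {
            /*.mem_size   =*/ 1024u * 1024u * 1024u,   // 1 GiB scratch is plenty for this toy
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);

        // assumed dims: 128-dim nope part, 512-dim latent (kv_lora_rank), 128 heads, 512-token batch
        const int64_t qk_nope = 128, lora = 512, n_head = 128, n_tokens = 512;

        // weight {qk_nope, lora, n_head} and activations {qk_nope, n_tokens, n_head}:
        // i.e. 128 small independent mat-muls, one per head
        struct ggml_tensor * wk_b   = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, qk_nope, lora,     n_head);
        struct ggml_tensor * q_nope = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, qk_nope, n_tokens, n_head);
        ggml_set_zero(wk_b);
        ggml_set_zero(q_nope);

        struct ggml_tensor * out = ggml_mul_mat(ctx, wk_b, q_nope);   // -> {lora, n_tokens, n_head}

        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, out);
        ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 8);

        ggml_free(ctx);
        return 0;
    }

Swapping wk_b's type for a quantised one in this pattern is where the CUDA back-end seems to fall off a cliff, which is why keeping attn_k_b / attn_v_b in BF16 helps in the numbers above.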

@nicoboss
Contributor

nicoboss commented Apr 5, 2025

I think we should just remove the -mla option, make everyone re-quant to use only k_b_trans and v_b, and remove about 2/3rds of this likely unmaintainable code in the PR.
I honestly don't think there is ever going to be a good reason not to use -mla, and the only reason anybody ever used the MLA models converted to MHA is because there was no other option back then.

I agree. I don't think the -mla flag is needed. It would make sense to enable MLA by default if it is available in the model. I fear that if we make it opt-in, most casual users will not enable it because they are not aware that it is a feature they want. If anything, there should be a flag to disable it.

Regarding backwards compatibility, it will obviously hurt having to requant and reupload all DeepSeek V2/V3-based models, but in the end MLA offers a better user experience and so is likely something that should be done regardless of whether llama.cpp mainline still supports non-MLA quants or not. We are talking about 34 quants totaling 11579.4 GB for every V3-based model mradermacher offers, making this really painful but probably worth it. I agree that the simpler branch is much cleaner.

@jukofyork Do you know if the imatrix computed using the old Q8 quants needs to be recomputed as well? Computing the imatrix of a V3/R1-based model in Q8 using my RPC setup takes around 20 hours. It is for sure something I'm willing to do if there is a reason for it; I just don't want to waste weeks recomputing the imatrix for no reason.

I would prefer it if MLA gets merged soon. V3 was released on 26th December 2024, and the longer we wait, the more quants will be created without MLA, making the effort to requantize all of them harder and harder over time. I have decided to pause quantizing any DeepSeek V2/V3-based models for now until this is either merged or postponed again.

@jukofyork
Collaborator Author

I think I have found a solution that keeps the backward compatibility but removes the -mla option: you either use the old version of the GGUF and don't get to use MLA, or you use the new version of the GGUF and use MLA only.

This also means I don't have to make the ugly changes to llama-kv-cache.cpp. I'll post a link when I've tested it.

@nicoboss I can probably create an imatrix file as I have a machine with 1.5TB of RAM, but I want to be 100% clear on the final solution first.

@nicoboss
Contributor

nicoboss commented Apr 5, 2025

I think I have found a solution that keeps the backward compatibility but removes the -mla option: you either use the old version of the GGUF and don't get to use MLA, or you use the new version of the GGUF and use MLA only.
This also means I don't have to make the ugly changes to llama-kv-cache.cpp.

That is awesome. This is the perfect solution. It will make the transition to MLA supported models much easier. Thanks a lot for your great work!

@nicoboss I can probably create an imatrix file as I have a machine with 1.5TB of RAM, but I want to be 100% clear on the final solution first.

No problem, I can recompute them on my own. I just wanted to make sure doing so is required. Now that I know it is, I will recompute the imatrix for all DeepSeek V2/V3-based models once this is merged. For imatrix computation of such massive models I'm using an RPC setup with 512+256+128 GiB of RAM, with each node having a GPU. I'm using GGML_CUDA_ENABLE_UNIFIED_MEMORY on every RPC server to let GPU memory overflow into RAM, as there is no way to GPU-accelerate layers stored in RAM on RPC servers as far as I'm aware. 512+256+128 GiB is enough to compute the imatrix of all V3/R1-based models in Q8.

@bartowski1182
Contributor

GGML_CUDA_ENABLE_UNIFIED_MEMORY

Is this better than loading the model into just system memory and then using the GPU for KV calculations? Sorry to sidetrack, I just never heard of this option before today somehow.

@nicoboss
Contributor

nicoboss commented Apr 5, 2025

Is this better than loading the model into just system memory and then using the GPU for KV calculations? Sorry to sidetrack, I just never heard of this option before today somehow.

@bartowski1182 For optimal imatrix computation performance on a single server you just offload as many layers to the GPU as you can and keep the remaining ones in RAM. There you only use GGML_CUDA_ENABLE_UNIFIED_MEMORY in case we wrongly estimated the number of layers we can offload, so you don't OOM. You could offload all layers to the GPU and let it overflow to RAM using GGML_CUDA_ENABLE_UNIFIED_MEMORY, but then imatrix computation will take around twice as long.

For a multi-server RPC imatrix computation setup, using GGML_CUDA_ENABLE_UNIFIED_MEMORY on every RPC server seems mandatory, as all layers that get offloaded to an RPC server will be stored in GPU memory if you run it with the CUDA backend. If you don't use the CUDA backend, the layers will be stored in RAM but without GPU acceleration, making it take hundreds of hours. There doesn't seem to be an option for RPC servers to store layers in RAM while also using GPU acceleration, making GGML_CUDA_ENABLE_UNIFIED_MEMORY the only option providing reasonable performance.

Sorry to everyone who read this very off-topic message.

@jukofyork
Collaborator Author

jukofyork commented Apr 5, 2025

So here is the cleaned-up version, which adds a metadata tag called mla_optimized:

https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:mla--final-simplified?expand=1

  • If the tag is not in the file (or is set to false), then it will still load and run the legacy GGUF files with the unsplit attn_kv_proj in them.
  • If the tag is in the file and set to true, then it will expect to load the new GGUF file format with the split attn_k_proj and attn_v_proj in it, and use the optimised MLA branch in the code.

This looks a lot cleaner and easier to maintain to me, but it loses the -mla flag's ability to run the split attn_k_proj and attn_v_proj without MLA (which I can't really see the point of now).
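
For what it's worth, the loader-side check for such a tag is tiny with the gguf API; a minimal sketch (the full key name and helper below are my guesses, not necessarily what the branch uses):

    // sketch — the key name "deepseek2.mla_optimized" is assumed; check the branch for the real one
    #include "gguf.h"   // gguf_find_key / gguf_get_val_bool

    static bool gguf_mla_optimized(const struct gguf_context * ctx_gguf) {
        const int64_t kid = gguf_find_key(ctx_gguf, "deepseek2.mla_optimized");
        if (kid < 0) {
            return false;   // legacy GGUF: unsplit attn_kv_b, build the old non-MLA graph
        }
        return gguf_get_val_bool(ctx_gguf, kid);   // new GGUF: split tensors, build the MLA graph
    }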

@ngxson can you take a look at this branch and see what you think? If it's OK to go with losing the -mla option then I'll make that the new (hopefully final) PR instead. I've got rid of the ugliness in llama-kv-cache.cpp now, but maybe build_attn_mla() and how it calls build_attn_mha() can be refactored to be less ugly?

In case the compare window doesn't show up then it is this branch: https://github.com/jukofyork/llama.cpp/tree/mla--final-simplified

@jukofyork
Collaborator Author

I've cleaned this up even more now and have almost got the context shifting to work:

https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:mla--final-simplified--with-shift?expand=1

It's not quite working though, and I think it's possibly due to attn_factor_scaled or something in one of these:

        // We have to pre-scale kq_scale and attn_factor to make the YaRN RoPE work correctly.
        // See https://github.com/ggerganov/llama.cpp/discussions/7416 for detailed explanation.
        const float mscale = attn_factor * (1.0f + hparams.rope_yarn_log_mul * logf(1.0f / freq_scale));
        const float kq_scale = 1.0f*mscale*mscale/sqrtf(float(n_embd_head_k));
        const float attn_factor_scaled = 1.0f / (1.0f + 0.1f * logf(1.0f / freq_scale));

ggml_tensor * llama_context::build_rope_shift(
        ggml_context * ctx0,
        ggml_tensor * cur,
        ggml_tensor * shift,
        ggml_tensor * factors,
              float   freq_base,
              float   freq_scale,
        ggml_backend_buffer * bbuf) const {
    const auto & n_ctx_orig = cparams.n_ctx_orig_yarn;

    const auto & yarn_ext_factor  = cparams.yarn_ext_factor;
    const auto & yarn_attn_factor = cparams.yarn_attn_factor;
    const auto & yarn_beta_fast   = cparams.yarn_beta_fast;
    const auto & yarn_beta_slow   = cparams.yarn_beta_slow;

    const auto & hparams = model.hparams;

    const auto & n_rot     = hparams.n_rot;
    const auto & rope_type = hparams.rope_type;

    ggml_tensor * tmp;

    if (ggml_is_quantized(cur->type)) {
        // dequantize to f32 -> RoPE -> quantize back
        tmp = ggml_cast(ctx0, cur, GGML_TYPE_F32);

        if (bbuf) {
            for (const auto & backend : backends) {
                // Figure out which backend KV cache belongs to
                if (ggml_backend_supports_buft(backend.get(), ggml_backend_buffer_get_type(bbuf))) {
                    ggml_backend_sched_set_tensor_backend(sched.get(), tmp, backend.get());
                    break;
                }
            }
        }

        tmp = ggml_rope_ext_inplace(ctx0, tmp,
                shift, factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
                yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow);

        tmp = ggml_cpy(ctx0, tmp, cur);
    } else {
        // we rotate only the first n_rot dimensions
        tmp = ggml_rope_ext_inplace(ctx0, cur,
                shift, factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
                yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow);
    }

    return tmp;
}

I've got to go out today so no more time to investigate.

Changing this back to draft until I have this hopefully final version working.

@jukofyork jukofyork marked this pull request as draft April 6, 2025 07:19
@jukofyork
Collaborator Author

jukofyork commented Apr 6, 2025

Context shifting working now for both non-MLA and MLA:

https://github.com/jukofyork/llama.cpp/tree/mla--final-simplified--with-shift

https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:mla--final-simplified--with-shift?expand=1

Big respect to @fairydreaming for doing all the hard work in #7416!

It turned out to just need the RoPE-ed part stored first (so ggml_rope_ext_inplace() has it to work on) and the yarn_attn_factor_scaled setting in build_rope_shift().
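
For anyone following along, my reading of the second part looks roughly like this inside build_rope_shift() (the guard and variable name are an approximation, not a verbatim diff; the first part just means the RoPE-ed slice of the cached row has to come first so that the "we rotate only the first n_rot dimensions" path picks up the right slice):

    // DeepSeek2 must shift the cached K with the same pre-scaled YaRN attention factor
    // that the main graph uses, otherwise the re-rotated entries get the wrong magnitude
    const float yarn_attn_factor_scaled = model.arch == LLM_ARCH_DEEPSEEK2
        ? 1.0f / (1.0f + 0.1f * logf(1.0f / freq_scale))
        : cparams.yarn_attn_factor;

    tmp = ggml_rope_ext_inplace(ctx0, cur,
            shift, factors, n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
            yarn_ext_factor, yarn_attn_factor_scaled, yarn_beta_fast, yarn_beta_slow);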

I'll tidy this up tomorrow and it should be done then.

@ngxson
Collaborator

ngxson commented Apr 7, 2025

@ngxson can you take a look at this branch and see what you think? If it's OK to go with losing the -mla option then I'll make that the new (hopefully final) PR instead. I've got rid of the ugliness in llama-kv-cache.cpp now, but maybe build_attn_mla() and how it calls build_attn_mha() can be refactored to be less ugly?

I don't have a strong opinion. From a UX perspective it's better not to have -mla because most users won't even notice that it's a thing. If we need to support both MLA and non-MLA, I think having MLA enabled by default and a -no-mla flag to disable it makes more sense.

I think it's OK if the code is ugly for a while and we find a way to refactor it later; it's better to have something usable first (so users can test it) and then improve / clean it up in a follow-up PR.

@jukofyork
Collaborator Author

Closing, as hopefully #12801 will be my last PR on this.

@jukofyork jukofyork closed this Apr 7, 2025