
llama/ggml: add LLM training support #10544


Open · wants to merge 2 commits into base: master

Conversation

@JohannesGaessler (Collaborator) commented Nov 27, 2024

See ggml-org/ggml#1025 except I decided to implement the training directly in llama.cpp after all because the GPT-2 GGML example is already pretty complex, would require a significant amount of effort to refactor, and I'm not familiar with the codebase at all.

The goal of this PR is to add general training support to llama.cpp using ggml_opt. CPU training seems to work; other backends are missing support for some GGML ops. It's currently not possible to actually save the finetuned model to disk, but you can confirm that the finetuning works by doing one epoch over the input text prior to perplexity calculation (or by observing how the loss goes down with the new finetune example). One epoch over the test set of Wikitext-2 (with the stride chosen in such a way that each token is used twice per epoch) currently takes ~1 minute with Stories 260k or ~20 hours and ~100 GB RAM with LLaMA 3 8b. For the user-facing API my concrete plans are:

  • The parameter n_ctx determines the max. sequence length with which the model is trained.
  • The parameter n_batch determines how many tokens are consumed per optimizer step.
  • The parameter n_ubatch determines the number of tokens processed in parallel; it enables a speed <-> memory use tradeoff and should have no effect on the result except for differences in floating point rounding error.
  • A function with which the user can initialize a dataset from a std::vector<llama_token>. Currently I have this as part of llama.h, but maybe it would make more sense to put it in common.h?
  • A function llama_opt_init that prepares a llama_context for training and lets the user define things like the learning rate or which tensors should be trainable parameters.
  • A function llama_opt_epoch that performs one epoch over a ggml_opt_dataset, equivalent to ggml_opt_epoch.
  • Maybe a function like llama_opt_fit equivalent to ggml_opt_fit that is even more high-level?
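
A rough sketch of how this could look from the user side (every signature here is a placeholder for illustration, not a final API):

// hypothetical usage of the planned training API; names and arguments are assumptions
std::vector<llama_token> tokens = tokenize_training_text(); // assumed helper

// planned helper (llama.h or common.h): build a ggml_opt_dataset from the token stream
ggml_opt_dataset_t dataset = llama_opt_dataset_init(ctx, tokens.data(), tokens.size());

// planned llama_opt_init: set e.g. the learning rate and which tensors are trainable
llama_opt_init(ctx, /* assumed struct with training parameters */);

// planned llama_opt_epoch: one pass over the dataset, equivalent to ggml_opt_epoch;
// n_ctx / n_batch / n_ubatch of the context control the sequence length, the tokens
// per optimizer step and the tokens processed in parallel, respectively
llama_opt_epoch(ctx, dataset);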

Currently, while functional, the PR is in a bad state in terms of software design and is in need of a refactor. The reason I'm already opening it now is that I want to ask for advice regarding how to best implement llama_opt_epoch. My current approach was to try to hijack the first half of llama_decode_internal, but I found that in the end all I needed from it was the generation of the next llama_ubatch and the corresponding manipulation of the KV cache. But maybe it would make more sense to instead write a function like llama_prepare_next_ubatch and to use that function in both llama_decode_internal and llama_opt_epoch?

@github-actions bot added the testing, Nvidia GPU, examples, and ggml labels on Nov 27, 2024
@JohannesGaessler added the Review Complexity : High label on Nov 27, 2024
@JohannesGaessler marked this pull request as ready for review on December 1, 2024 at 23:15
@JohannesGaessler (Collaborator, Author)

I pushed a version that I think is in a state where it could be merged.

  • I refactored llama_decode_internal and split off functions llama_prepare_sbatch and llama_prepare_ubatch that can be called from llama_opt_epoch.
  • ggml training now uses calls to ggml_opt_alloc and ggml_opt_eval instead of ggml_opt_forward and ggml_opt_forward_backward. When not using static graphs, a call to ggml_opt_prepare_alloc is also needed to provide a new forward graph.
  • I added a function llama_save_model_to_file for converting a llama_model to a GGUF file. For finetuning it would have been possible to copy a lot of the data from the input file, but for training a model from scratch a method like this will be needed anyway. Currently tensors with non-CPU data cause a segfault when passed to the GGUF code, see ggml#1033 (GGUF: ggml backend support for writing tensor data).
  • To control which tensors should be trainable parameters the user can pass a function that filters the tensors in a model.
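
For illustration, such a filter could look roughly like this (the callback signature is only a placeholder, not necessarily what the PR ends up with):

#include <string.h>

// hypothetical parameter filter: only tensors whose name contains "attn" become
// trainable parameters, everything else stays frozen
static bool train_attention_only(const struct ggml_tensor * tensor, void * userdata) {
    (void) userdata; // unused in this sketch
    return strstr(tensor->name, "attn") != NULL;
}

A filter like this would then be passed to llama_opt_init together with the other training settings.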

My immediate next goals will be:

  • Fixing GGUF for non-CPU tensors.
  • CUDA support for the operations missing for training.
  • Support for FP16/BF16.

On a somewhat related note, it may make sense to refactor llama.cpp in a way that moves code to other files; in some cases my IDE is starting to get a little sluggish when working on a 22k LOC file.

@lexasub (Contributor) commented Dec 29, 2024

@JohannesGaessler you may see #10902

@JohannesGaessler (Collaborator, Author)

The link doesn't work.

@lexasub (Contributor) commented Dec 30, 2024

@JohannesGaessler sorry, #10902

@JohannesGaessler (Collaborator, Author)

I've started working on this again; I rebased my local branch onto master and am currently adding the missing ops for CUDA training. This PR is getting quite large; in terms of reviewing, would you prefer that I split off things that can be reviewed and merged on their own?

@ggerganov (Member)

If you can separate things in standalone PRs, it's always helpful (maybe the CUDA ops can be in a standalone PR).

@JohannesGaessler (Collaborator, Author)

I pushed an update where the finetuning of Stories 260k and, more relevantly, LLaMA 3.2 1b works either on CPU or with CUDA and 24 GB VRAM. For LLaMA 3.2 1b one epoch over the Wikitext-2 test set takes ~3 minutes on an RTX 4090 and ~15 hours on an Epyc 7742. The finetuned model should then have a lower perplexity score when given the text it was finetuned on again. For Stories 260k the speed is mostly the same due to its diminutive size.

I will soon have more time for llama.cpp and will try to get this PR into a state where it can be merged. My goal is simply to have finetuning technically functional for CPU and CUDA with a single GPU and the maximum number of GPU layers. I will work on partial GPU layers and multi-GPU in later PRs. My immediate next goal after having a technically functional finetuning setup will be to implement methods for actually evaluating the quality of a finetuned model using language model benchmarks such as MMLU.

@nonetrix commented Jan 26, 2025

I'm kinda confused: was training removed and is it now being added back? I just want to train Qwen 7B on a dataset of Japanese sentence grammar explanations.

This seems outdated:
https://rentry.org/cpu-lora

@JohannesGaessler (Collaborator, Author)

There was at some point limited training support that was single-threaded and only worked with the CPU backend. It was later removed because it was broken and unmaintained. I am currently working on adding back training support in a way that is compatible with all backends.

@nonetrix

Neato, I'll give it a go then and see if anything explodes. Does it support all the models llama.cpp already supports?

@JohannesGaessler (Collaborator, Author)

No, the support is currently extremely limited and I think you will just waste your time trying to use the current state of the code for anything other than testing.

Commits:

  • more compact progress bar
  • refactor: llama_prepare_sbatch/ubatch
  • llama_save_model_to_file
  • gqa_mode arg for repeat_back
  • llama_opt_param_filter
  • ggml_graph_dup force_grads
  • refactor ggml_opt, fix test-opt

@JohannesGaessler (Collaborator, Author)

From my end I would now consider this PR ready to be merged. Things are still relatively janky, but I don't think that will change in a reasonable time frame. My next goals will be better support for model quality evaluation and then better performance for training. I can already work on these things regardless of what happens with this PR, so it's fine if you just proceed in a way that's convenient for you.

Question regarding the header files: right now I put the llama_opt API into llama.h. Should this be put into a separate header like llama-opt.h since most users will not need it?

@ggerganov (Member)

Question regarding the header files: right now I put the llama_opt API into llama.h. Should this be put into a separate header like llama-opt.h since most users will not need it?

IMO it's fine as it is. We can split the header in the future if it becomes too heavy, but for now I think it is still quite manageable.

@ttkciar commented Feb 16, 2025

Is this waiting on #11769 or just on a review? Been waiting for it with bated breath.

@ggerganov (Member)

We should first merge #11213 and then adapt and merge this PR. I will help with that.

@killjaqular

See ggml-org/ggml#1025 except I decided to implement the training directly in llama.cpp after all because the GPT-2 GGML example is already pretty complex, would require a significant amount of effort to refactor, and I'm not familiar with the codebase at all.

The goal of this PR is to add general training support to llama.cpp using ggml_opt. CPU training seems to work; other backends are missing support for some GGML ops. It's currently not possible to actually save the finetuned model to disk, but you can confirm that the finetuning works by doing one epoch over the input text prior to perplexity calculation (or by observing how the loss goes down with the new finetune example). One epoch over the test set of Wikitext-2 (with the stride chosen in such a way that each token is used twice per epoch) currently takes ~1 minute with Stories 260k or ~20 hours and ~100 GB RAM with LLaMA 3 8b. For the user-facing API my concrete plans are:

  • The parameter n_ctx determines the max. sequence length with which the model is trained.
  • The parameter n_batch determines how many tokens are consumed per optimizer step.
  • The parameter n_ubatch determines the number of tokens processed in parallel; it enables a speed <-> memory use tradeoff and should have no effect on the result except for differences in floating point rounding error.
  • A function with which the user can initialize a dataset from a std::vector<llama_token>. Currently I have this as part of llama.h, but maybe it would make more sense to put it in common.h?
  • A function llama_opt_init that prepares a llama_context for training and lets the user define things like the learning rate or which tensors should be trainable parameters.
  • A function llama_opt_epoch that performs one epoch over a ggml_opt_dataset, equivalent to ggml_opt_epoch.
  • Maybe a function like llama_opt_fit equivalent to ggml_opt_fit that is even more high-level?

Currently, while functional, the PR is in a bad state in terms of software design and is in need of a refactor. The reason I'm already opening it now is that I want to ask for advice regarding how to best implement llama_opt_epoch. My current approach was to try to hijack the first half of llama_decode_internal, but I found that in the end all I needed from it was the generation of the next llama_ubatch and the corresponding manipulation of the KV cache. But maybe it would make more sense to instead write a function like llama_prepare_next_ubatch and to use that function in both llama_decode_internal and llama_opt_epoch?

Hello team, I am currently trying to write my own application to finetune a GGUF model with your branch (https://github.com/JohannesGaessler/llama.cpp/tree/llama-opt-3) using Llama CPP and GGML.

And so, I would appreciate an effort to divorce as much as possible from common.h/cpp and write the Llama CPP training and finetuning logic into its own llama-XXX.h/cpp files or into llama.h/cpp itself. This would allow maximum flexibility both for you in-house devs and for us little people downstream from you guys.

Furthermore, while I have your attention on finetuning in Llama CPP: in the application I am trying to Frankenstein together, I am failing to understand why I am hitting an assert() in ggml-backend.cpp, specifically in ggml_backend_sched_split_graph():
assert(node_backend_id != -1); // all nodes should be assigned by now, this can happen if there is no CPU fallback

I am registering the CPU as a backend device, but it seems like a call to ggml_backend_sched_reset() is resetting all hv_tensor_backend_ids to -1?

ggml-backend.cpp:

void ggml_backend_sched_reset(ggml_backend_sched_t sched) {
    // reset state for the next run
    if (!sched->is_reset) {
        ggml_hash_set_reset(&sched->hash_set);
        memset(sched->hv_tensor_backend_ids, -1, sched->hash_set.size * sizeof(sched->hv_tensor_backend_ids[0]));
        memset(sched->hv_tensor_copies,       0, sched->hash_set.size * sched->n_backends * sched->n_copies * sizeof(struct ggml_tensor *));
        sched->is_reset = true;
    }
    sched->is_alloc = false;
}

If this is not the appropriate place to be discussing this (I kind of mixed two different topics into one reply), please refer me to the appropriate platform to do so.

A million thanks in advance!

@JohannesGaessler (Collaborator, Author)

Sorry, but until there is an agreed-upon version on master or at least an imminent merge you're essentially on your own. As the comment suggests, I encountered this issue when a tensor needed for the backward pass did not have an implementation in the CPU backend.

And so, I would appreciate an effort to divorce as much as possible from common.h/cpp and write the Llama CPP training and finetuning logic into its own llama-XXX.h/cpp files or into llama.h/cpp itself. This would allow maximum flexibility both for you in-house devs and for us little people downstream from you guys.

Noted.

@killjaqular

I appreciate the quick reply!

Sorry, but until there is an agreed-upon version on master or at least an imminent merge you're essentially on your own. As the comment suggests, I encountered this issue when a tensor needed for the backward pass did not have an implementation in the CPU backend.

That is fair. At least the highest in the ranks are aware. I appreciate the entire Llama CPP team for their efforts!
Is there an open issue/ticket for the missing tensor backend implementation?
I will patiently wait for the official merge.

Noted.

There is a massive community benefiting from all of your hard work, and I am not alone in expressing great gratitude.
THANK YOU! 👏

@JohannesGaessler (Collaborator, Author) commented Apr 22, 2025

Is there an open issue/ticket for the missing tensor backend implementation?

As far as I know there isn't one. If the problem is indeed a missing implementation for the CPU backend that can only happen if ggml_backend_cpu_device_supports_op in ggml/src/ggml-cpu/ggml-cpu.cpp returns false. So you can check what op the node that triggers the assert uses and compare that to the unsupported ops.
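
If it helps, adding something like this just before the failing assert will show which node and op are involved (I am assuming node and node_backend_id are the variables in scope at that point):

// hypothetical debug print in ggml_backend_sched_split_graph()
if (node_backend_id == -1) {
    fprintf(stderr, "unassigned node '%s' (op = %s)\n", node->name, ggml_op_name(node->op));
}
assert(node_backend_id != -1); // all nodes should be assigned by now, this can happen if there is no CPU fallback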

@killjaqular commented Apr 23, 2025

Is there an open issue/ticket for the missing tensor backend implementation?

As far as I know there isn't one. If the problem is indeed a missing implementation for the CPU backend that can only happen if ggml_backend_cpu_device_supports_op in ggml/src/ggml-cpu/ggml-cpu.cpp returns false. So you can check what op the node that triggers the assert uses and compare that to the unsupported ops.

Buildtime context:
Windows 11

cmake -B Build_local -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_EXAMPLES=off -DLLAMA_BUILD_TESTS=off -T v143,version=14.36

cmake -B Build_local_d -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_EXAMPLES=off -DLLAMA_BUILD_TESTS=off -T v143,version=14.36 -D CMAKE_POLICY_DEFAULT_CMP0091=NEW -D CMAKE_MSVC_RUNTIME_LIBRARY=MultiThreadedDebug -DGGML_OPENMP=OFF

cmake --build Build_local_d --config Debug --clean-first

cmake --install Build_local_d --config Debug --prefix ..\my_llama_finetuner\

This is where the assert is occurring:
File: ggml-backend.cpp
Function: ggml_backend_sched_split_graph()
Line: assert(node_backend_id != -1); // all nodes should be assigned by now, this can happen if there is no CPU fallback

As suggested, I am now looking in ggml_backend_cpu_device_supports_op() and have manually added some fprintf()s to print out the:

  1. ggml_tensor type of src0
  2. ggml_tensor type of src1
  3. ggml_op of the ggml_tensor

I've also added an abort() call prior to ggml_backend_cpu_device_supports_op() returning its first false. This condition occurs on the very first call to ggml_backend_cpu_device_supports_op().
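
In rough form the added prints look like this (a simplified sketch, not the exact code; it uses the ggml name helpers, so the printed strings differ slightly from the raw enum identifiers shown below):

// hypothetical fprintf()s at the top of ggml_backend_cpu_device_supports_op(),
// where op is the tensor whose support is being checked
fprintf(stderr, "[i]:GGML Tensor Type src0: %s\n", op->src[0] ? ggml_type_name(op->src[0]->type) : "(none)");
fprintf(stderr, "[i]:GGML Tensor Type src1: %s\n", op->src[1] ? ggml_type_name(op->src[1]->type) : "(none)");
fprintf(stderr, "[i]:GGML Tensor Operation: %s\n", ggml_op_name(op->op));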

This is what I have gathered from one trial run:

[i]:GGML Tensor Type src0: GGML_TYPE_F32
[i]:GGML Tensor Type src1: GGML_TYPE_I32
[i]:GGML Tensor Operation: GGML_OP_GET_ROWS

The following condition is satisfied, causing ggml_backend_cpu_device_supports_op() to return false:

[i]:C:\Users\some_user\dev\llama.cpp\ggml\src\ggml-cpu\ggml-cpu.cpp:ggml_backend_cpu_device_supports_op():if (op->src[i] && op->src[i]->buffer && !ggml_backend_buft_is_host(op->src[i]->buffer->buft))
[i]:RETURNING: False

Is this the expected behavior?

When my GGUF model is loaded, I am told it is all F32. It says so here:

register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i9-12900HK)
llama_model_loader: loaded meta data with 29 key-value pairs and 272 tensors from ..\models\some_model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = some_model 135M
llama_model_loader: - kv   3:                           general.basename str              = some_model
llama_model_loader: - kv   4:                         general.size_label str              = 135M
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   7:                          llama.block_count u32              = 30
llama_model_loader: - kv   8:                       llama.context_length u32              = 8192
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 576
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 1536
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 9
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 3
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                          general.file_type u32              = 0
llama_model_loader: - kv  16:                           llama.vocab_size u32              = 49152
llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = some_model
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,49152]   = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,48900]   = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  25:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  26:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  272 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = all F32
print_info: file size   = 513.13 MiB (32.00 BPW)

Are there forums/threads/platforms better suited to ask and share these kinds of questions/problems?
This thread is very specific to your branch and incoming changes. If there is a better place for me to be having this discussion, please let me know.

Again, a million thanks. 🙏
