Feature Request: Ability to pack multiple GGUFs into single one #13028


Open
ngxson opened this issue Apr 19, 2025 · 3 comments
Labels
enhancement (New feature or request)

Comments

@ngxson
Collaborator

ngxson commented Apr 19, 2025

Feature Description

From an idea brought up by @ggerganov in this discussion: #11139 (reply in thread)

While it is NOT a good idea to pack the mmproj and text models together (because vision support is still messy atm), there are still some interesting use cases:

  • For TTS models, this can be useful because some models may require more than 2 GGUFs to run (for example, Sesame CSM requires backbone, decoder and Mimi models)
  • For the Phi-4-mm model, while the mmproj can't be packed, it is still interesting to pack the LoRA adapters and the text model together
  • There are some techniques which use LoRA to recover quality loss due to quantization; it could be useful to pack the LoRA with the model (though I don't know how effective this can be, cc @compilade )
  • Some models have more than one modality (e.g. Phi-4-mm with both audio and vision input), so it could be useful to pack the audio encoder and vision encoder into a single GGUF

Motivation

I created this issue to discuss possible implementations.

Possible Implementation

One possible implementation is to introduce a "namespace" prefix for KV metadata keys and tensor names, plus a "super" key holding the list of namespaces.

For example, in the case of Sesame CSM, given 2 GGUFs (backbone and decoder), the routine to pack them is as follows (a rough sketch follows the list):

  • We create a blank GGUF
  • Add metadata general.namespaces = ["backbone", "decoder"]
  • Copy all metadata + tensors from backbone while adding the backbone. prefix to each key and tensor name
  • Copy all metadata + tensors from decoder while adding the decoder. prefix to each key and tensor name
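
A rough sketch of such a packing routine, assuming ggml's gguf C API (gguf_init_from_file, gguf_add_tensor, gguf_set_arr_str, gguf_write_to_file). This is only an illustration of the idea, not a final tool: the file names are made up, only a couple of KV value types are copied for brevity, and prefixed tensor names would still need to fit into GGML_MAX_NAME.

    #include "ggml.h"
    #include "gguf.h"

    #include <string>

    // copy every KV pair of `src` into `dst` under "<prefix>.<key>"
    // (only a couple of value types are handled here for brevity)
    static void copy_kv_with_prefix(gguf_context * dst, const gguf_context * src, const char * prefix) {
        const int64_t n_kv = gguf_get_n_kv(src);
        for (int64_t i = 0; i < n_kv; i++) {
            const std::string key = std::string(prefix) + "." + gguf_get_key(src, i);
            switch (gguf_get_kv_type(src, i)) {
                case GGUF_TYPE_STRING: gguf_set_val_str(dst, key.c_str(), gguf_get_val_str(src, i)); break;
                case GGUF_TYPE_UINT32: gguf_set_val_u32(dst, key.c_str(), gguf_get_val_u32(src, i)); break;
                // ... the remaining GGUF value types (arrays, floats, ...) are elided here
                default: break;
            }
        }
    }

    // read one GGUF and add its metadata + tensors to `dst` under the given namespace prefix
    static void pack_with_prefix(gguf_context * dst, const char * path, const char * prefix) {
        ggml_context * ctx_data = nullptr; // receives the tensors (kept alive until the final write)
        gguf_init_params params = { /*no_alloc =*/ false, /*ctx =*/ &ctx_data };
        gguf_context * src = gguf_init_from_file(path, params);

        copy_kv_with_prefix(dst, src, prefix); // e.g. "general.architecture" -> "backbone.general.architecture"

        // rename each tensor with the namespace prefix before adding it
        // (note: the prefixed name still has to fit into GGML_MAX_NAME)
        for (ggml_tensor * t = ggml_get_first_tensor(ctx_data); t != nullptr; t = ggml_get_next_tensor(ctx_data, t)) {
            const std::string name = std::string(prefix) + "." + ggml_get_name(t);
            ggml_set_name(t, name.c_str());
            gguf_add_tensor(dst, t);
        }

        gguf_free(src);
    }

    int main() {
        gguf_context * dst = gguf_init_empty();                        // 1. blank GGUF

        const char * namespaces[] = { "backbone", "decoder" };
        gguf_set_arr_str(dst, "general.namespaces", namespaces, 2);    // 2. the "super" key

        pack_with_prefix(dst, "csm-backbone.gguf", "backbone");        // 3.
        pack_with_prefix(dst, "csm-decoder.gguf",  "decoder");         // 4.

        gguf_write_to_file(dst, "csm-packed.gguf", /*only_meta =*/ false);
        gguf_free(dst);
        return 0;
    }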

These APIs would need to be added to libllama:

  • int32_t llama_model_n_namespaces(llama_model * model): returns the number of namespaces, 0 meaning no namespace
  • const char ** llama_model_list_namespaces(llama_model * model): returns the list of namespaces as strings
  • llama_model * llama_model_get_namespace(int idx): returns the sub llama_model * object corresponding to a namespace index
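
To illustrate, user code for a packed Sesame CSM file could then look roughly like this (purely hypothetical: none of these functions exist yet, the signatures are simply the ones listed above, and "csm-packed.gguf" is a made-up file name):

    llama_model_params params = llama_model_default_params();
    llama_model * packed = llama_model_load_from_file("csm-packed.gguf", params);

    // enumerate the namespaces stored in general.namespaces
    const int32_t n_ns = llama_model_n_namespaces(packed);
    const char ** names = llama_model_list_namespaces(packed);

    llama_model * backbone = NULL;
    llama_model * decoder  = NULL;

    // pick out the sub-models by namespace name (using the proposed signatures verbatim)
    for (int32_t i = 0; i < n_ns; i++) {
        if (strcmp(names[i], "backbone") == 0) { backbone = llama_model_get_namespace(i); }
        if (strcmp(names[i], "decoder")  == 0) { decoder  = llama_model_get_namespace(i); }
    }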

Problems

  1. For existing models (like TTS), how do we make a smooth transition to the new packed format? Or should we just accept breaking changes, since not many people are using it anyway?
  2. How can we design the API so that it requires the fewest changes to user code?
ngxson added the enhancement label Apr 19, 2025
@ggerganov
Member

An alternative to the proposed approach would be the following:

  • Add enum llama_model_type:

    enum llama_model_type {
        LLAMA_MODEL_TYPE_DEFAULT,
        LLAMA_MODEL_TYPE_CUSTOM,
        LLAMA_MODEL_TYPE_ENCODER,
        LLAMA_MODEL_TYPE_DECODER,
        LLAMA_MODEL_TYPE_BACKBONE,
        ...
    };
  • Extend struct llama_model_params:

    enum llama_model_type type;
    const char * type_str; // when `type == LLAMA_MODEL_TYPE_CUSTOM`
  • In the user code, we can load models like this:

    llama_model_params model_params_enc = llama_model_default_params();
    llama_model_params model_params_dec = llama_model_default_params();
    
    model_params_enc.type = LLAMA_MODEL_TYPE_ENCODER;
    model_params_dec.type = LLAMA_MODEL_TYPE_DECODER;
    
    // load different models from the same file
    llama_model * model_enc = llama_model_load_from_file(path, model_params_enc);
    llama_model * model_dec = llama_model_load_from_file(path, model_params_dec);

Internally, the implementation would use the respective namespace prefix for each llama_model_type, as you suggested. We could also support the option to provide a custom prefix.
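
For example, loading a sub-model stored under a custom namespace could look like this (hypothetical: type/type_str are the fields proposed above, and "mimi" is just an example prefix, say for a codec packed alongside a Sesame CSM model):

    llama_model_params model_params_mimi = llama_model_default_params();

    model_params_mimi.type     = LLAMA_MODEL_TYPE_CUSTOM;
    model_params_mimi.type_str = "mimi"; // custom namespace prefix inside the packed GGUF

    llama_model * model_mimi = llama_model_load_from_file(path, model_params_mimi);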

There could be an API for querying the available namespaces in a GGUF file, but it seems optional for now and we can add it later on.

@JohnLoveJoy

So that's why MS's bitnet-25 doesn't work yet?

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/tree/main

@compilade
Collaborator

compilade commented Apr 21, 2025

So that's why MS's bitnet-25 doesn't work yet?

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/tree/main

@JohnLoveJoy

No, that one doesn't work yet because they used a custom quant type (called i2_s, I think), and also a different architecture from the other BitNet models (notably using squared ReLU instead of SiLU).

The architecture of that model can be added relatively easily (using the changes from microsoft/BitNet@4f2e41a as suggested in https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/discussions/2, and also adapting the conversion script to handle their packed format), and then using TQ1_0 and TQ2_0 should be possible. However, the i2_s models would not be usable as-is, because that type was not upstreamed (and I did not look into what it brings compared to TQ2_0).

It's not related to having multiple models in a single GGUF.
