Feature Request: Ability to pack multiple GGUFs into single one #13028


Open
ngxson opened this issue Apr 19, 2025 · 3 comments
Labels
enhancement (New feature or request)

Comments

@ngxson
Collaborator

ngxson commented Apr 19, 2025

Feature Description

From an idea brought up by @ggerganov in this discussion: #11139 (reply in thread)

While it is NOT a good idea to pack the mmproj and text models together (because vision support is still messy atm), there are still some interesting use cases:

  • For TTS models, this can be useful because some models may require more than 2 GGUFs to run (for example, Sesame CSM requires backbone, decoder and Mimi models)
  • For the Phi-4-mm model, while the mmproj can't be packed, it is still interesting to pack the LoRA adapters and the text model together
  • There are some techniques which use LoRA to recover quality loss due to quantization; it could be useful to pack the LoRA with the model (though I don't know how effective this can be, cc @compilade )
  • Some models have more than one modality (e.g. Phi-4-mm with both audio and vision input), so it could be useful to pack the audio encoder and vision encoder into a single GGUF

Motivation

I created this issue to discuss possible implementations.

Possible Implementation

One possible implementation is to introduce a "namespace" prefix for KV metadata keys and tensor names, plus a "super" key holding the list of namespaces.

For example, in the case of Sesame CSM, given 2 GGUFs (backbone and decoder), the routine to pack them is as follows (a rough sketch follows the list):

  • We create a blank GGUF
  • Add metadata general.namespaces = ["backbone", "decoder"]
  • Copy all metadata + tensors from backbone while adding the backbone. prefix to each key and tensor name
  • Copy all metadata + tensors from decoder while adding the decoder. prefix to each key and tensor name
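
A rough sketch of such a packing routine, assuming ggml's gguf C API (gguf_init_from_file, gguf_add_tensor, gguf_set_arr_str, gguf_write_to_file). This is only an illustration of the idea, not a final tool: the file names are made up, only a couple of KV value types are copied for brevity, and prefixed tensor names would still need to fit into GGML_MAX_NAME.

    #include "ggml.h"
    #include "gguf.h"

    #include <string>

    // copy every KV pair of `src` into `dst` under "<prefix>.<key>"
    // (only a couple of value types are handled here for brevity)
    static void copy_kv_with_prefix(gguf_context * dst, const gguf_context * src, const char * prefix) {
        const int64_t n_kv = gguf_get_n_kv(src);
        for (int64_t i = 0; i < n_kv; i++) {
            const std::string key = std::string(prefix) + "." + gguf_get_key(src, i);
            switch (gguf_get_kv_type(src, i)) {
                case GGUF_TYPE_STRING: gguf_set_val_str(dst, key.c_str(), gguf_get_val_str(src, i)); break;
                case GGUF_TYPE_UINT32: gguf_set_val_u32(dst, key.c_str(), gguf_get_val_u32(src, i)); break;
                // ... the remaining GGUF value types (arrays, floats, ...) are elided here
                default: break;
            }
        }
    }

    // read one GGUF and add its metadata + tensors to `dst` under the given namespace prefix
    static void pack_with_prefix(gguf_context * dst, const char * path, const char * prefix) {
        ggml_context * ctx_data = nullptr; // receives the tensors (kept alive until the final write)
        gguf_init_params params = { /*no_alloc =*/ false, /*ctx =*/ &ctx_data };
        gguf_context * src = gguf_init_from_file(path, params);

        copy_kv_with_prefix(dst, src, prefix); // e.g. "general.architecture" -> "backbone.general.architecture"

        // rename each tensor with the namespace prefix before adding it
        // (note: the prefixed name still has to fit into GGML_MAX_NAME)
        for (ggml_tensor * t = ggml_get_first_tensor(ctx_data); t != nullptr; t = ggml_get_next_tensor(ctx_data, t)) {
            const std::string name = std::string(prefix) + "." + ggml_get_name(t);
            ggml_set_name(t, name.c_str());
            gguf_add_tensor(dst, t);
        }

        gguf_free(src);
    }

    int main() {
        gguf_context * dst = gguf_init_empty();                        // 1. blank GGUF

        const char * namespaces[] = { "backbone", "decoder" };
        gguf_set_arr_str(dst, "general.namespaces", namespaces, 2);    // 2. the "super" key

        pack_with_prefix(dst, "csm-backbone.gguf", "backbone");        // 3.
        pack_with_prefix(dst, "csm-decoder.gguf",  "decoder");         // 4.

        gguf_write_to_file(dst, "csm-packed.gguf", /*only_meta =*/ false);
        gguf_free(dst);
        return 0;
    }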

These APIs would need to be added to libllama:

  • int32_t llama_model_n_namespaces(llama_model * model): returns the number of namespaces, 0 meaning no namespace
  • const char ** llama_model_list_namespaces(llama_model * model): returns the list of namespaces as strings
  • llama_model * llama_model_get_namespace(int idx): returns the sub llama_model * object corresponding to a namespace index
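
To illustrate, user code for a packed Sesame CSM file could then look roughly like this (purely hypothetical: none of these functions exist yet, the signatures are simply the ones listed above, and "csm-packed.gguf" is a made-up file name):

    llama_model_params params = llama_model_default_params();
    llama_model * packed = llama_model_load_from_file("csm-packed.gguf", params);

    // enumerate the namespaces stored in general.namespaces
    const int32_t n_ns = llama_model_n_namespaces(packed);
    const char ** names = llama_model_list_namespaces(packed);

    llama_model * backbone = NULL;
    llama_model * decoder  = NULL;

    // pick out the sub-models by namespace name (using the proposed signatures verbatim)
    for (int32_t i = 0; i < n_ns; i++) {
        if (strcmp(names[i], "backbone") == 0) { backbone = llama_model_get_namespace(i); }
        if (strcmp(names[i], "decoder")  == 0) { decoder  = llama_model_get_namespace(i); }
    }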

Problems

  1. For existing models (like TTS), how do we make a smooth transition to the new packed format? Or should we just accept breaking changes, since not many people are using it anyway?
  2. How can we design the API so that it requires the fewest changes to user code?
ngxson added the enhancement label Apr 19, 2025
@ggerganov
Member

An alternative to the proposed approach would be the following:

  • Add enum llama_model_type:

    enum llama_model_type {
        LLAMA_MODEL_TYPE_DEFAULT,
        LLAMA_MODEL_TYPE_CUSTOM,
        LLAMA_MODEL_TYPE_ENCODER,
        LLAMA_MODEL_TYPE_DECODER,
        LLAMA_MODEL_TYPE_BACKBONE,
        ...
    };
  • Extend struct llama_model_params:

    enum llama_model_type type;
    const char * type_str; // when `type == LLAMA_MODEL_TYPE_CUSTOM`
  • In the user code, we can load models like this:

    llama_model_params model_params_enc = llama_model_default_params();
    llama_model_params model_params_dec = llama_model_default_params();
    
    model_params_enc.type = LLAMA_MODEL_TYPE_ENCODER;
    model_params_dec.type = LLAMA_MODEL_TYPE_DECODER;
    
    // load different models from the same file
    llama_model * model_enc = llama_model_load_from_file(path, model_params_enc);
    llama_model * model_dec = llama_model_load_from_file(path, model_params_dec);

Internally, the implementation would use the respective namespace prefix for each llama_model_type, as you suggested. We could also support the option to provide a custom prefix.
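
For example, loading a sub-model stored under a custom namespace could look like this (hypothetical: type/type_str are the fields proposed above, and "mimi" is just an example prefix, say for a codec packed alongside a Sesame CSM model):

    llama_model_params model_params_mimi = llama_model_default_params();

    model_params_mimi.type     = LLAMA_MODEL_TYPE_CUSTOM;
    model_params_mimi.type_str = "mimi"; // custom namespace prefix inside the packed GGUF

    llama_model * model_mimi = llama_model_load_from_file(path, model_params_mimi);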

There could be an API for querying the available namespaces in a GGUF file, but it seems optional for now and we can add it later on.

@JohnLoveJoy

So that's why MS's bitnet-25 doesn't work yet?

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/tree/main

@compilade
Collaborator

compilade commented Apr 21, 2025

So that's why MS's bitnet-25 doesn't work yet?

https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/tree/main

@JohnLoveJoy

No, that one doesn't work yet because they used a custom quant type (called i2_s, I think), and also a different architecture from the other BitNet models (notably using squared ReLU instead of SiLU).

The architecture of that model can be added relatively easily (using the changes from microsoft/BitNet@4f2e41a as suggested in https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/discussions/2, and also adapting the conversion script to handle their packed format), and then using TQ1_0 and TQ2_0 should be possible. However, the i2_s models would not be usable as-is, because that type was not upstreamed (and I did not look into what it brings compared to TQ2_0).

It's not related to having multiple models in a single GGUF.
