
Commit b2e245c

DN6 committed (co-authored by sayakpaul and stevhliu)
[Single File] Add GGUF support (#9964)
Squashed from a long series of iterative "update" commits, with review suggestions applied along the way:

* Update src/diffusers/quantizers/gguf/utils.py (co-authored by Sayak Paul)
* Update docs/source/en/quantization/gguf.md (co-authored by Steven Liu)

Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
1 parent 54c838e commit b2e245c

22 files changed: +1321, -21 lines

.github/workflows/nightly_tests.yml

+2 lines changed

```diff
@@ -357,6 +357,8 @@ jobs:
         config:
           - backend: "bitsandbytes"
             test_location: "bnb"
+          - backend: "gguf"
+            test_location: "gguf"
     runs-on:
       group: aws-g6e-xlarge-plus
     container:
```

docs/source/en/_toctree.yml

+2 lines changed

```diff
@@ -157,6 +157,8 @@
       title: Getting Started
     - local: quantization/bitsandbytes
       title: bitsandbytes
+    - local: quantization/gguf
+      title: gguf
     - local: quantization/torchao
       title: torchao
     title: Quantization Methods
```

docs/source/en/api/quantization.md

+3 lines changed

```diff
@@ -28,6 +28,9 @@ Learn how to quantize models in the [Quantization](../quantization/overview) guide

 [[autodoc]] BitsAndBytesConfig

+## GGUFQuantizationConfig
+
+[[autodoc]] GGUFQuantizationConfig
 ## TorchAoConfig

 [[autodoc]] TorchAoConfig
```

docs/source/en/quantization/gguf.md

+70 lines changed (new file)

````markdown
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# GGUF

The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block-wise quantization options. Diffusers supports loading checkpoints that were prequantized and saved in the GGUF format via `from_single_file` loading with model classes. Loading GGUF checkpoints via pipelines is currently not supported.

The following example loads the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.

Before starting, please install gguf in your environment:

```shell
pip install -U gguf
```

Since GGUF is a single-file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].

When using GGUF checkpoints, the quantized weights remain in a low-memory `dtype` (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.

The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the PyTorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).

```python
import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
```

## Supported Quantization Types

- BF16
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q2_K
- Q3_K
- Q4_K
- Q5_K
- Q6_K
````
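The loader reads tensor metadata through `gguf.GGUFReader` (see `load_gguf_checkpoint` further down in this commit), so the same reader can be used to check which of the quantization types above a checkpoint actually contains before loading it. A minimal sketch, assuming the `.gguf` file has already been downloaded locally (the filename is illustrative):

```python
# Sketch: list the tensor quantization types present in a local GGUF file.
# "flux1-dev-Q2_K.gguf" is an assumed local path; substitute your own checkpoint.
from gguf import GGMLQuantizationType, GGUFReader

reader = GGUFReader("flux1-dev-Q2_K.gguf")
quant_types = {GGMLQuantizationType(t.tensor_type).name for t in reader.tensors}
print(quant_types)  # e.g. {"Q2_K", "F32", ...} for a Q2_K checkpoint
```

If any tensor uses a type outside the supported list, loading raises a `ValueError` naming the offending tensor (see `load_gguf_checkpoint` in `model_loading_utils.py` below).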

docs/source/en/quantization/overview.md

+7, -2 lines changed

```diff
@@ -17,7 +17,7 @@ Quantization techniques focus on representing data with less information while a

 <Tip>

-Interested in adding a new quantization method to Transformers? Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) to learn more about adding a new quantization method.
+Interested in adding a new quantization method to Diffusers? Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) to learn more about adding a new quantization method.

 </Tip>

@@ -32,4 +32,9 @@ If you are new to the quantization field, we recommend you to check out these be

 ## When to use what?

-Diffusers supports [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index) and [torchao](https://github.com/pytorch/ao). Refer to this [table](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) to help you determine which quantization backend to use.
+Diffusers currently supports the following quantization methods.
+- [BitsandBytes]()
+- [TorchAO]()
+- [GGUF]()
+
+[This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.
```

src/diffusers/__init__.py

+2, -2 lines changed

```diff
@@ -31,7 +31,7 @@
     "loaders": ["FromOriginalModelMixin"],
     "models": [],
     "pipelines": [],
-    "quantizers.quantization_config": ["BitsAndBytesConfig", "TorchAoConfig"],
+    "quantizers.quantization_config": ["BitsAndBytesConfig", "GGUFQuantizationConfig", "TorchAoConfig"],
     "schedulers": [],
     "utils": [
         "OptionalDependencyNotAvailable",
@@ -569,7 +569,7 @@

 if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
     from .configuration_utils import ConfigMixin
-    from .quantizers.quantization_config import BitsAndBytesConfig, TorchAoConfig
+    from .quantizers.quantization_config import BitsAndBytesConfig, GGUFQuantizationConfig, TorchAoConfig

 try:
     if not is_onnx_available():
```
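With these two entries in place, the new config class is importable from the package root alongside the existing quantization configs. A quick sanity check (a sketch, not part of the commit):

```python
# Sketch: GGUFQuantizationConfig is now exposed at the top level of diffusers.
from diffusers import BitsAndBytesConfig, GGUFQuantizationConfig, TorchAoConfig

print(GGUFQuantizationConfig)  # no need to reach into diffusers.quantizers directly
```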

src/diffusers/loaders/single_file_model.py

+44, -2 lines changed

```diff
@@ -17,8 +17,10 @@
 from contextlib import nullcontext
 from typing import Optional

+import torch
 from huggingface_hub.utils import validate_hf_hub_args

+from ..quantizers import DiffusersAutoQuantizer
 from ..utils import deprecate, is_accelerate_available, logging
 from .single_file_utils import (
     SingleFileComponentError,
@@ -214,6 +216,8 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
         subfolder = kwargs.pop("subfolder", None)
         revision = kwargs.pop("revision", None)
         torch_dtype = kwargs.pop("torch_dtype", None)
+        quantization_config = kwargs.pop("quantization_config", None)
+        device = kwargs.pop("device", None)

         if isinstance(pretrained_model_link_or_path_or_dict, dict):
             checkpoint = pretrained_model_link_or_path_or_dict
@@ -227,6 +231,12 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
                 local_files_only=local_files_only,
                 revision=revision,
             )
+        if quantization_config is not None:
+            hf_quantizer = DiffusersAutoQuantizer.from_config(quantization_config)
+            hf_quantizer.validate_environment()
+
+        else:
+            hf_quantizer = None

         mapping_functions = SINGLE_FILE_LOADABLE_CLASSES[mapping_class_name]

@@ -309,8 +319,36 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
         with ctx():
             model = cls.from_config(diffusers_model_config)

+        # Check if `_keep_in_fp32_modules` is not None
+        use_keep_in_fp32_modules = (cls._keep_in_fp32_modules is not None) and (
+            (torch_dtype == torch.float16) or hasattr(hf_quantizer, "use_keep_in_fp32_modules")
+        )
+        if use_keep_in_fp32_modules:
+            keep_in_fp32_modules = cls._keep_in_fp32_modules
+            if not isinstance(keep_in_fp32_modules, list):
+                keep_in_fp32_modules = [keep_in_fp32_modules]
+
+        else:
+            keep_in_fp32_modules = []
+
+        if hf_quantizer is not None:
+            hf_quantizer.preprocess_model(
+                model=model,
+                device_map=None,
+                state_dict=diffusers_format_checkpoint,
+                keep_in_fp32_modules=keep_in_fp32_modules,
+            )
+
         if is_accelerate_available():
-            unexpected_keys = load_model_dict_into_meta(model, diffusers_format_checkpoint, dtype=torch_dtype)
+            param_device = torch.device(device) if device else torch.device("cpu")
+            unexpected_keys = load_model_dict_into_meta(
+                model,
+                diffusers_format_checkpoint,
+                dtype=torch_dtype,
+                device=param_device,
+                hf_quantizer=hf_quantizer,
+                keep_in_fp32_modules=keep_in_fp32_modules,
+            )

         else:
             _, unexpected_keys = model.load_state_dict(diffusers_format_checkpoint, strict=False)
@@ -324,7 +362,11 @@ def from_single_file(cls, pretrained_model_link_or_path_or_dict: Optional[str] =
                 f"Some weights of the model checkpoint were not used when initializing {cls.__name__}: \n {[', '.join(unexpected_keys)]}"
             )

-        if torch_dtype is not None:
+        if hf_quantizer is not None:
+            hf_quantizer.postprocess_model(model)
+            model.hf_quantizer = hf_quantizer
+
+        if torch_dtype is not None and hf_quantizer is None:
            model.to(torch_dtype)

         model.eval()
```
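In short, `from_single_file` now accepts two new keyword arguments: `quantization_config`, which activates a `DiffusersAutoQuantizer` that pre- and post-processes the model, and `device`, which controls where `load_model_dict_into_meta` materializes the parameters (CPU by default). A caller-side sketch of how they combine (the checkpoint URL is the one used in the new docs; the `device` value is illustrative):

```python
# Sketch: the new from_single_file kwargs from the caller's perspective.
import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
    device="cuda",  # optional; when omitted, weights are first placed on CPU
)
```

Note that when a quantizer is active, the final `model.to(torch_dtype)` cast is skipped so the quantized weights keep their storage dtype.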

src/diffusers/loaders/single_file_utils.py

+19, -6 lines changed

```diff
@@ -81,8 +81,14 @@
     "open_clip_sd3": "text_encoders.clip_g.transformer.text_model.embeddings.position_embedding.weight",
     "stable_cascade_stage_b": "down_blocks.1.0.channelwise.0.weight",
     "stable_cascade_stage_c": "clip_txt_mapper.weight",
-    "sd3": "model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
-    "sd35_large": "model.diffusion_model.joint_blocks.37.x_block.mlp.fc1.weight",
+    "sd3": [
+        "joint_blocks.0.context_block.adaLN_modulation.1.bias",
+        "model.diffusion_model.joint_blocks.0.context_block.adaLN_modulation.1.bias",
+    ],
+    "sd35_large": [
+        "joint_blocks.37.x_block.mlp.fc1.weight",
+        "model.diffusion_model.joint_blocks.37.x_block.mlp.fc1.weight",
+    ],
     "animatediff": "down_blocks.0.motion_modules.0.temporal_transformer.transformer_blocks.0.attention_blocks.0.pos_encoder.pe",
     "animatediff_v2": "mid_block.motion_modules.0.temporal_transformer.norm.bias",
     "animatediff_sdxl_beta": "up_blocks.2.motion_modules.0.temporal_transformer.norm.weight",
@@ -542,13 +548,20 @@ def infer_diffusers_model_type(checkpoint):
     ):
         model_type = "stable_cascade_stage_b"

-    elif CHECKPOINT_KEY_NAMES["sd3"] in checkpoint and checkpoint[CHECKPOINT_KEY_NAMES["sd3"]].shape[-1] == 9216:
-        if checkpoint["model.diffusion_model.pos_embed"].shape[1] == 36864:
+    elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["sd3"]) and any(
+        checkpoint[key].shape[-1] == 9216 if key in checkpoint else False for key in CHECKPOINT_KEY_NAMES["sd3"]
+    ):
+        if "model.diffusion_model.pos_embed" in checkpoint:
+            key = "model.diffusion_model.pos_embed"
+        else:
+            key = "pos_embed"
+
+        if checkpoint[key].shape[1] == 36864:
             model_type = "sd3"
-        elif checkpoint["model.diffusion_model.pos_embed"].shape[1] == 147456:
+        elif checkpoint[key].shape[1] == 147456:
             model_type = "sd35_medium"

-    elif CHECKPOINT_KEY_NAMES["sd35_large"] in checkpoint:
+    elif any(key in checkpoint for key in CHECKPOINT_KEY_NAMES["sd35_large"]):
         model_type = "sd35_large"

     elif CHECKPOINT_KEY_NAMES["animatediff"] in checkpoint:
```

src/diffusers/models/model_loading_utils.py

+83, -1 lines changed

```diff
@@ -17,6 +17,7 @@
 import importlib
 import inspect
 import os
+from array import array
 from collections import OrderedDict
 from pathlib import Path
 from typing import List, Optional, Union
@@ -26,13 +27,16 @@
 from huggingface_hub.utils import EntryNotFoundError

 from ..utils import (
+    GGUF_FILE_EXTENSION,
     SAFE_WEIGHTS_INDEX_NAME,
     SAFETENSORS_FILE_EXTENSION,
     WEIGHTS_INDEX_NAME,
     _add_variant,
     _get_model_file,
     deprecate,
     is_accelerate_available,
+    is_gguf_available,
+    is_torch_available,
     is_torch_version,
     logging,
 )
@@ -139,6 +143,8 @@ def load_state_dict(checkpoint_file: Union[str, os.PathLike], variant: Optional[
     file_extension = os.path.basename(checkpoint_file).split(".")[-1]
     if file_extension == SAFETENSORS_FILE_EXTENSION:
         return safetensors.torch.load_file(checkpoint_file, device="cpu")
+    elif file_extension == GGUF_FILE_EXTENSION:
+        return load_gguf_checkpoint(checkpoint_file)
     else:
         weights_only_kwarg = {"weights_only": True} if is_torch_version(">=", "1.13") else {}
         return torch.load(
@@ -211,13 +217,14 @@ def load_model_dict_into_meta(
             set_module_kwargs["dtype"] = dtype

         # bnb params are flattened.
+        # gguf quants have a different shape based on the type of quantization applied
         if empty_state_dict[param_name].shape != param.shape:
             if (
                 is_quantized
                 and hf_quantizer.pre_quantized
                 and hf_quantizer.check_if_quantized_param(model, param, param_name, state_dict, param_device=device)
             ):
-                hf_quantizer.check_quantized_param_shape(param_name, empty_state_dict[param_name].shape, param.shape)
+                hf_quantizer.check_quantized_param_shape(param_name, empty_state_dict[param_name], param)
             else:
                 model_name_or_path_str = f"{model_name_or_path} " if model_name_or_path is not None else ""
                 raise ValueError(
@@ -396,3 +403,78 @@ def _fetch_index_file_legacy(
         index_file = None

     return index_file
+
+
+def _gguf_parse_value(_value, data_type):
+    if not isinstance(data_type, list):
+        data_type = [data_type]
+    if len(data_type) == 1:
+        data_type = data_type[0]
+        array_data_type = None
+    else:
+        if data_type[0] != 9:
+            raise ValueError("Received multiple types, therefore expected the first type to indicate an array.")
+        data_type, array_data_type = data_type
+
+    if data_type in [0, 1, 2, 3, 4, 5, 10, 11]:
+        _value = int(_value[0])
+    elif data_type in [6, 12]:
+        _value = float(_value[0])
+    elif data_type in [7]:
+        _value = bool(_value[0])
+    elif data_type in [8]:
+        _value = array("B", list(_value)).tobytes().decode()
+    elif data_type in [9]:
+        _value = _gguf_parse_value(_value, array_data_type)
+    return _value
+
+
+def load_gguf_checkpoint(gguf_checkpoint_path, return_tensors=False):
+    """
+    Load a GGUF file and return a dictionary of parsed parameters containing tensors, the parsed tokenizer and config
+    attributes.
+
+    Args:
+        gguf_checkpoint_path (`str`):
+            The path to the GGUF file to load.
+        return_tensors (`bool`, defaults to `False`):
+            Whether to read the tensors from the file and return them. Not doing so is faster and only loads the
+            metadata in memory.
+    """
+
+    if is_gguf_available() and is_torch_available():
+        import gguf
+        from gguf import GGUFReader
+
+        from ..quantizers.gguf.utils import SUPPORTED_GGUF_QUANT_TYPES, GGUFParameter
+    else:
+        logger.error(
+            "Loading a GGUF checkpoint in PyTorch requires both PyTorch and gguf>=0.10.0 to be installed. Please see "
+            "https://pytorch.org/ and https://github.com/ggerganov/llama.cpp/tree/master/gguf-py for installation instructions."
+        )
+        raise ImportError("Please install torch and gguf>=0.10.0 to load a GGUF checkpoint in PyTorch.")
+
+    reader = GGUFReader(gguf_checkpoint_path)
+
+    parsed_parameters = {}
+    for tensor in reader.tensors:
+        name = tensor.name
+        quant_type = tensor.tensor_type
+
+        # if the tensor is a torch supported dtype do not use GGUFParameter
+        is_gguf_quant = quant_type not in [gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16]
+        if is_gguf_quant and quant_type not in SUPPORTED_GGUF_QUANT_TYPES:
+            _supported_quants_str = "\n".join([str(type) for type in SUPPORTED_GGUF_QUANT_TYPES])
+            raise ValueError(
+                (
+                    f"{name} has a quantization type: {str(quant_type)} which is unsupported."
+                    "\n\nCurrently the following quantization types are supported: \n\n"
+                    f"{_supported_quants_str}"
+                    "\n\nTo request support for this quantization type please open an issue here: https://github.com/huggingface/diffusers"
+                )
+            )
+
+        weights = torch.from_numpy(tensor.data.copy())
+        parsed_parameters[name] = GGUFParameter(weights, quant_type=quant_type) if is_gguf_quant else weights
+
+    return parsed_parameters
```