
[Feature]: Make auto load format handle bitsandbytes models #11867

Closed
alugowski opened this issue Jan 8, 2025 · 3 comments · Fixed by #16027
Labels
feature request New feature or request

Comments

@alugowski
Contributor

🚀 The feature, motivation and pitch

Common bitsandbytes models like unsloth/meta-llama-3.1-8b-bnb-4bit require the user to pass --load-format bitsandbytes --quantization bitsandbytes command-line arguments.

I could be wrong, but I believe both of these could be auto-detected by vLLM. The default load format auto could select bitsandbytes when a bitsandbytes model is loaded.

AFAIK this detection should work:

config.get("quantization_config", {}).get("quant_method") == "bitsandbytes"
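
For illustration, a minimal standalone sketch of that check, assuming the checkpoint's config.json is fetched with huggingface_hub (the helper name is_bnb_checkpoint is hypothetical and not part of vLLM):

    import json

    from huggingface_hub import hf_hub_download

    def is_bnb_checkpoint(model_id: str) -> bool:
        """Return True if the HF config declares bitsandbytes quantization."""
        config_path = hf_hub_download(model_id, "config.json")
        with open(config_path) as f:
            config = json.load(f)
        quant_cfg = config.get("quantization_config", {})
        return quant_cfg.get("quant_method") == "bitsandbytes"

    # Expected to print True for the model mentioned above.
    print(is_bnb_checkpoint("unsloth/meta-llama-3.1-8b-bnb-4bit"))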

Similarly, the --quantization bitsandbytes argument seems redundant since the quantization method is already specified in the model config, but if the user omits it, this happens:

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1034, in create_engine_config
    raise ValueError(
ValueError: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None
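
Until then, the workaround is to pass both options explicitly. A minimal offline-API sketch, assuming LLM() accepts the same quantization and load_format options as the CLI flags:

    from vllm import LLM

    # Both options have to be spelled out today; neither is auto-detected
    # from the checkpoint's quantization_config.
    llm = LLM(
        model="unsloth/meta-llama-3.1-8b-bnb-4bit",
        quantization="bitsandbytes",
        load_format="bitsandbytes",
    )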

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@alugowski alugowski added the feature request New feature or request label Jan 8, 2025
@noooop
Contributor

noooop commented Jan 9, 2025

https://github.com/vllm-project/vllm/blob/fd3a62a122fcbc9331d000b325e72687629ef1bd/vllm/config.py#L559C1-L576C26

        if self.quantization is not None:
            self.quantization = self.quantization.lower()

        # Parse quantization method from the HF model config, if available.
        quant_cfg = self._parse_quant_hf_config()

        if quant_cfg is not None:
            quant_method = quant_cfg.get("quant_method", "").lower()

            # Detect which checkpoint is it
            for name in QUANTIZATION_METHODS:
                method = get_quantization_config(name)
                quantization_override = method.override_quantization_method(
                    quant_cfg, self.quantization)
                if quantization_override:
                    quant_method = quantization_override
                    self.quantization = quantization_override
                    break

I feel that _verify_quantization has already done automatic detection.
Will it not work if --quantization bitsandbytes is removed?

@alugowski
Contributor Author

I feel that _verify_quantization has already done automatic detection. Will it not work if --quantization bitsandbytes is removed?

Nope.

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1034, in create_engine_config
    raise ValueError(
ValueError: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None

@noooop
Contributor

noooop commented Jan 16, 2025

https://github.com/vllm-project/vllm/blob/cd9d06fb8d1f89fc1bcc9305bc20d57c6d8b73d8/vllm/engine/arg_utils.py#L1022C1-L1043C50

        # bitsandbytes quantization needs a specific model loader
        # so we make sure the quant method and the load format are consistent
        if (self.quantization == "bitsandbytes" or
           self.qlora_adapter_name_or_path is not None) and \
           self.load_format != "bitsandbytes":
            raise ValueError(
                "BitsAndBytes quantization and QLoRA adapter only support "
                f"'bitsandbytes' load format, but got {self.load_format}")

        if (self.load_format == "bitsandbytes" or
            self.qlora_adapter_name_or_path is not None) and \
            self.quantization != "bitsandbytes":
            raise ValueError(
                "BitsAndBytes load format and QLoRA adapter only support "
                f"'bitsandbytes' quantization, but got {self.quantization}")

        assert self.cpu_offload_gb >= 0, (
            "CPU offload space must be non-negative"
            f", but got {self.cpu_offload_gb}")

        device_config = DeviceConfig(device=self.device)
        model_config = self.create_model_config()     # <- do auto detection there

The self.quantization == "bitsandbytes" check runs before the auto detection does (detection only happens later, inside create_model_config()).

sad
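
A minimal sketch of how the ordering could be flipped instead (hypothetical illustration only, not the change merged in #16027): build the model config first so _verify_quantization has already filled in the quantization method, then let the auto load format follow it:

    # Hypothetical reordering inside create_engine_config, for illustration only.
    device_config = DeviceConfig(device=self.device)
    model_config = self.create_model_config()  # auto detection happens in here

    # For bnb checkpoints model_config.quantization is now "bitsandbytes",
    # so the "auto" load format can simply follow it instead of raising a
    # ValueError against the still-unset self.quantization.
    if (model_config.quantization == "bitsandbytes"
            and self.load_format == "auto"):
        self.load_format = "bitsandbytes"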
