
[Feature]: Make auto load format handle bitsandbytes models #11867

Closed
alugowski opened this issue Jan 8, 2025 · 3 comments · Fixed by #16027
Labels
feature request New feature or request

Comments

@alugowski
Contributor

🚀 The feature, motivation and pitch

Common bitsandbytes models like unsloth/meta-llama-3.1-8b-bnb-4bit require the user to pass --load-format bitsandbytes --quantization bitsandbytes command-line arguments.

I could be wrong, but I believe both of these could be auto-detected by vLLM. The default load format auto could select bitsandbytes when a bitsandbytes model is loaded.

AFAIK this detection should work:

config.get("quantization_config", {}).get("quant_method") == "bitsandbytes"
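
For illustration, a minimal standalone sketch of that check, assuming the checkpoint's config.json is fetched with huggingface_hub (the helper name is_bnb_checkpoint is hypothetical and not part of vLLM):

    import json

    from huggingface_hub import hf_hub_download

    def is_bnb_checkpoint(model_id: str) -> bool:
        """Return True if the HF config declares bitsandbytes quantization."""
        config_path = hf_hub_download(model_id, "config.json")
        with open(config_path) as f:
            config = json.load(f)
        quant_cfg = config.get("quantization_config", {})
        return quant_cfg.get("quant_method") == "bitsandbytes"

    # Expected to print True for the model mentioned above.
    print(is_bnb_checkpoint("unsloth/meta-llama-3.1-8b-bnb-4bit"))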

Similarly, the --quantization bitsandbytes argument seems redundant since the quantization method is already specified in the model config, but if the user omits it, this happens:

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1034, in create_engine_config
    raise ValueError(
ValueError: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None
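
Until then, the workaround is to pass both options explicitly. A minimal offline-API sketch, assuming LLM() accepts the same quantization and load_format options as the CLI flags:

    from vllm import LLM

    # Both options have to be spelled out today; neither is auto-detected
    # from the checkpoint's quantization_config.
    llm = LLM(
        model="unsloth/meta-llama-3.1-8b-bnb-4bit",
        quantization="bitsandbytes",
        load_format="bitsandbytes",
    )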

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@alugowski alugowski added the feature request New feature or request label Jan 8, 2025
@noooop
Contributor

noooop commented Jan 9, 2025

https://github.com/vllm-project/vllm/blob/fd3a62a122fcbc9331d000b325e72687629ef1bd/vllm/config.py#L559C1-L576C26

        if self.quantization is not None:
            self.quantization = self.quantization.lower()

        # Parse quantization method from the HF model config, if available.
        quant_cfg = self._parse_quant_hf_config()

        if quant_cfg is not None:
            quant_method = quant_cfg.get("quant_method", "").lower()

            # Detect which checkpoint is it
            for name in QUANTIZATION_METHODS:
                method = get_quantization_config(name)
                quantization_override = method.override_quantization_method(
                    quant_cfg, self.quantization)
                if quantization_override:
                    quant_method = quantization_override
                    self.quantization = quantization_override
                    break

I feel that _verify_quantization has already done automatic detection.
Will it not work if --quantization bitsandbytes is removed?

@alugowski
Contributor Author

I feel that _verify_quantization has already done automatic detection. Will it not work if --quantization bitsandbytes is removed?

Nope.

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1034, in create_engine_config
    raise ValueError(
ValueError: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None

@noooop
Contributor

noooop commented Jan 16, 2025

https://github.com/vllm-project/vllm/blob/cd9d06fb8d1f89fc1bcc9305bc20d57c6d8b73d8/vllm/engine/arg_utils.py#L1022C1-L1043C50

        # bitsandbytes quantization needs a specific model loader
        # so we make sure the quant method and the load format are consistent
        if (self.quantization == "bitsandbytes" or
           self.qlora_adapter_name_or_path is not None) and \
           self.load_format != "bitsandbytes":
            raise ValueError(
                "BitsAndBytes quantization and QLoRA adapter only support "
                f"'bitsandbytes' load format, but got {self.load_format}")

        if (self.load_format == "bitsandbytes" or
            self.qlora_adapter_name_or_path is not None) and \
            self.quantization != "bitsandbytes":
            raise ValueError(
                "BitsAndBytes load format and QLoRA adapter only support "
                f"'bitsandbytes' quantization, but got {self.quantization}")

        assert self.cpu_offload_gb >= 0, (
            "CPU offload space must be non-negative"
            f", but got {self.cpu_offload_gb}")

        device_config = DeviceConfig(device=self.device)
        model_config = self.create_model_config()     # <- do auto detection there

The self.quantization == "bitsandbytes" check runs before the auto detection does (detection only happens later, inside create_model_config()).

sad
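
A minimal sketch of how the ordering could be flipped instead (hypothetical illustration only, not the change merged in #16027): build the model config first so _verify_quantization has already filled in the quantization method, then let the auto load format follow it:

    # Hypothetical reordering inside create_engine_config, for illustration only.
    device_config = DeviceConfig(device=self.device)
    model_config = self.create_model_config()  # auto detection happens in here

    # For bnb checkpoints model_config.quantization is now "bitsandbytes",
    # so the "auto" load format can simply follow it instead of raising a
    # ValueError against the still-unset self.quantization.
    if (model_config.quantization == "bitsandbytes"
            and self.load_format == "auto"):
        self.load_format = "bitsandbytes"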
