
Support for Apple silicon #252

Open
rickardp opened this issue Apr 1, 2023 · 68 comments
Labels
Cross Platform · Duplicate (This issue or pull request already exists) · Enhancement (New feature or request) · macOS

Comments

@rickardp
Contributor

rickardp commented Apr 1, 2023

Would it make sense for this library to support platforms other than CUDA on x64 Linux? I am specifically looking for Apple silicon support. Currently not even cpuonly works, since it assumes SSE2 support (and there is no Neon support).

I would guess that the first step would be a full cross-platform compile (arm64), then ideally support for Metal Performance Shaders as an alternative to CUDA (assuming that is at all feasible).

I could probably contribute somewhat toward this if there is interest in making bitsandbytes multi-platform. I have some experience setting up cross-platform Python libraries.

@TheStoneMX

TheStoneMX commented Apr 2, 2023

Hi there, I will contribute too, in order to get this working with Metal on Apple M1.

This is my trace:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (no such file), '/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (no such file), '/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
--------------------------------------------------------------------------------------------
# What version of Python do you have?
import sys
import platform
import torch

has_gpu = torch.cuda.is_available()
has_mps = getattr(torch, 'has_mps', False)
print('has_mps', has_mps)
# "cuda" (not "gpu") is the valid torch device string; reuse the flags computed above
device = "mps" if has_mps else "cuda" if has_gpu else "cpu"

print(f"Python Platform: {platform.platform()}")
print(f"PyTorch Version: {torch.__version__}")
print()
print(f"Python {sys.version}")
print("GPU is", "available" if has_gpu else "NOT AVAILABLE")
print("MPS (Apple Metal) is", "AVAILABLE" if has_mps else "NOT AVAILABLE")
print(f"Target device is {device}")
----------------------------------------------------------------------------------
has_mps True
Python Platform: macOS-13.3-arm64-arm-64bit
PyTorch Version: 2.0.0

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:38:11) 
[Clang 14.0.6 ]
GPU is NOT AVAILABLE
MPS (Apple Metal) is AVAILABLE
Target device is mps

@rickardp
Contributor Author

rickardp commented Apr 3, 2023

Nice to hear! It would be good to hear from the maintainers whether they are at all interested in making this package cross-platform. It is very much CUDA-focused at the moment.

Getting libbitsandbytes_cpu.so to compile for macOS arm64 was not at all difficult, just an exercise in moving around some #ifdefs, but the CPU backend would obviously need Neon (SIMD) support to make any sense, IMHO. Then, of course, MPS support would be needed at some point (though I expect that's quite a lot more work).

I've just started looking at the unit tests and the Python libraries.

The C++ code is quite nicely structured, but the Python code would need some refactoring, since most of the calls assume CUDA (x.cuda() instead of x.to(device), etc.). Also, since the CPU version does not cover 100% of the feature set, testing is going to be quite some work, as there is no real baseline. I suppose one question is whether it would make sense to have the CPU backend cover 100% of the API calls, even if inefficiently, just to provide a baseline that the GPU implementations could be compared against.
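
For illustration, a minimal sketch of the device-agnostic pattern meant here (the pick_test_device helper is hypothetical, not existing bitsandbytes code):

import torch

def pick_test_device() -> torch.device:
    # Hypothetical helper: prefer CUDA, then MPS, then plain CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_test_device()
x = torch.randn(16, 16).to(device)  # instead of x.cuda(), which hard-codes CUDA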

If pursuing this, I propose implementing cross-platform CPU support first, then tackling MPS. MPS is of course what makes it useful.

(I have the exact same setup BTW, 2021 MBP)

Edit: Specifically, here's how I imagine the unit tests would have to work
https://github.com/TimDettmers/bitsandbytes/pull/257/files#diff-659bad232c71219167252c1a5ccbc427b6f54925b78741df18613c3c49aaa4c1R153

[Screenshot: pytest output showing a CPU test passing on an M1 Mac]

So at least one CPU test passes on my M1 Mac :)

@janrinze

janrinze commented Apr 5, 2023

Please have a look at "Building on Jetson AGX Xavier Development Kit fails" (#221).
It addresses the same AArch64 issue, but on CUDA-supported platforms such as the NVIDIA Jetson.

@UserHIJ

UserHIJ commented Jun 24, 2023

Wow... not to be inflammatory, but are we saying that there's no immediate solution for this if you have any MacBook from the last, like, 5 years? Yuck.

@janrinze

The Apple M1 (https://en.wikipedia.org/wiki/Apple_M1) was introduced less than 3 years ago.
Things take time in the world of open source, especially when the hardware is Apple's.

@KotlinFactory

When will this be done?

@benjaminhuo

benjaminhuo commented Aug 21, 2023

Would it make sense for this library to support platforms other than CUDA on x64 Linux? I am specifically looking for Apple silicon support. Currently not even cpuonly works, since it assumes SSE2 support (and there is no Neon support).

I would guess that the first step would be a full cross-platform compile (arm64), then ideally support for Metal Performance Shaders as an alternative to CUDA (assuming that is at all feasible).

I could probably contribute somewhat toward this if there is interest in making bitsandbytes multi-platform. I have some experience setting up cross-platform Python libraries.

Looking forward to support for this too. I got the errors below when I tried to fine-tune Llama 2 7B with load_in_8bit=True enabled on my MacBook M2. PyTorch's support for MPS is getting better, and I hope this project can support it as well:

  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 293, in forward
    using_igemmlt = supports_igemmlt(A.device) and not state.force_no_igemmlt
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 226, in supports_igemmlt
    if torch.cuda.get_device_capability(device=device) < (7, 5):
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
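
For context, a hedged sketch of the kind of loading call that reaches this code path (the model id is a placeholder and the exact arguments used above may differ):

from transformers import AutoModelForCausalLM

# Illustrative only: load_in_8bit=True routes nn.Linear layers through
# bitsandbytes' 8-bit path, whose forward pass assumes a CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    load_in_8bit=True,
)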

@AlexandreCassagne

@benjaminhuo Getting the same issue as you.

@id4thomas

  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 293, in forward
    using_igemmlt = supports_igemmlt(A.device) and not state.force_no_igemmlt
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 226, in supports_igemmlt
    if torch.cuda.get_device_capability(device=device) < (7, 5):
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/autograd/_functions.py#L227

This seems to be due to calling torch.cuda even when the device type isn't cuda.
One way to patch these unchecked torch.cuda calls is to add device checks like:

if device.type != 'cuda':
    return False

MPS devices report "mps" as device.type.
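
For illustration, a hedged sketch of how such a guard could look in supports_igemmlt (simplified; the actual function performs additional checks):

import torch

def supports_igemmlt(device: torch.device) -> bool:
    # Sketch only: bail out before touching torch.cuda on non-CUDA devices.
    if device.type != "cuda":  # covers "mps" and "cpu"
        return False
    # Only reached for CUDA devices, so this call is now safe.
    return torch.cuda.get_device_capability(device=device) >= (7, 5)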

@pechaut78

Same issue here; MPS seems to be the problem.

@ProjectProgramAMark

Getting the same issue with Apple silicon. Would love to see some support for it soon!

@ivan-digital

Same issue. Would be nice to have support for MPS.

@ageorgios

Same here, please add support for MPS:
https://github.com/ml-explore/mlx

@592319702

(torch-gpu) I542464@DY4GPKX1J0 test % python3 fine_tune_llama_2_in_google_colab.py
/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
Loading checkpoint shards: 100%|██████████| 2/2 [00:32<00:00, 16.06s/it]
/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:159: UserWarning: You didn't pass a max_seq_length argument to the SFTTrainer, this will default to 1024
  warnings.warn(
  0%|          | 0/250 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
Traceback (most recent call last):
  File "/Users/I542464/test/fine_tune_llama_2_in_google_colab.py", line 229, in <module>
    trainer.train()
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
    query_states = self.q_proj(hidden_states)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/peft/tuners/lora.py", line 1123, in forward
    result = super().forward(x)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 221, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 567, in matmul_4bit
    assert quant_state is not None
AssertionError
  0%|          | 0/250 [00:01<?, ?it/s]

@mbtre

mbtre commented Feb 15, 2024

+1 MPS support would be absolutely great!

@morkapronczay

Adding a comment to keep this alive. MPS support would be awesome!

@rickardp
Contributor Author

Once the device abstraction has been merged, we can start adding MPS-accelerated versions of the functions.

@Satyam7166-tech

Once the device abstraction has been merged, we can start adding MPS-accelerated versions of the functions.

Yay, thanks for all your efforts.
On a side note: how does someone become skilled enough to contribute to this stuff? What topics should they cover?

@sislam-provenir

Looking forward to MPS support!

@anilkul98

Looking forward to MPS Support!!!!

@JohnSilverman

Looking forward to MPS support.

@anelook

anelook commented Nov 16, 2024

+1

1 similar comment
@sojkin

sojkin commented Nov 16, 2024

+1

@liygzting

Please add MPS support!

@Limeslices

+1

3 similar comments
@rholdorf

rholdorf commented Dec 2, 2024

+1

@Phuket2

Phuket2 commented Dec 17, 2024

+1

@ITHwang

ITHwang commented Dec 18, 2024

+1

@SebasCrucer

MPS support! Don't let it die

@rajneesh-git

rajneesh-git commented Dec 28, 2024

Looking forward to MPS support!

@benkap

benkap commented Jan 3, 2025

+1

@lu4p

lu4p commented Jan 6, 2025

I don't know if it helps, but here's https://github.com/filipstrand/mflux, with support for quantized FLUX on MLX.

@ZerocoolZa

Here's a walkthrough with a detailed comparison table, including Python version, required packages, pros, and cons:

Lesson: Optimizing a Llama Model on an Apple M1 Mac with Quantization

Introduction

For those working with Llama models on Apple M1 Macs, using bitsandbytes for 8-bit quantization can present compatibility issues due to limited GPU support. Instead, leveraging PyTorch’s native MPS (Metal Performance Shaders) backend for quantization provides a robust solution that works seamlessly.

Prerequisites
• Python Version: 3.10.x
• Packages Installed:

pip install torch torchvision numpy

• Path to a Llama model checkpoint

Step-by-Step Guide

  1. Install Required Packages

Ensure the necessary libraries are installed:

pip install torch torchvision numpy

  2. Load Your Model

Load your pre-trained Llama model using PyTorch:

import torch

# Load the pre-trained Llama model
model_path = '/path/to/llama_model.pt'
model = torch.load(model_path)

# Move the model to the MPS device
device = torch.device('mps')
model.to(device)

  3. Quantize the Model

Use PyTorch’s dynamic quantization feature for MPS:

from torch.quantization import quantize_dynamic

# Quantize the model (dynamic 8-bit quantization of the Linear layers)
quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Save the quantized model
quantized_model_path = '/path/to/quantized_llama_model.pt'
torch.save(quantized_model, quantized_model_path)

print(f"Quantized Model Saved at: {quantized_model_path}")

  4. Inference with the Quantized Model

Perform inference using the quantized model:

# Note: this assumes the loaded model exposes an input_size attribute
input_tensor = torch.randn(1, *model.input_size, device=device)
output = quantized_model(input_tensor)

print(f"Inference Output: {output}")

Key Benefits of This Approach
1. No Need for bitsandbytes: By utilizing native PyTorch functionality, there’s no dependency on external libraries like bitsandbytes.
2. Optimized for MPS: PyTorch’s dynamic quantization handles 8-bit quantization efficiently, leveraging Metal Performance Shaders for GPU-like acceleration.
3. Simple and Seamless: This method integrates smoothly with your existing PyTorch workflow without requiring additional configuration or troubleshooting.
4. Cross-Platform Flexibility: With MPS support, your model is optimized for Apple M1, ensuring compatibility and performance.
Comparison Table

| Feature | bitsandbytes | PyTorch MPS Quantization |
| --- | --- | --- |
| Python Version | 3.x | 3.10.x |
| Required Packages | bitsandbytes, numpy, scipy | torch, torchvision, numpy |
| Backend | GPU (CUDA) | MPS (Metal Performance Shaders) |
| Dependency | External library | Native PyTorch |
| Compatibility | Limited GPU support | Full MPS backend support |
| Optimization | Focused on image models | Flexible for both image & text |
| Performance | Dependent on GPU | Leverages Metal Performance Shaders |
| Seamlessness | Requires additional setup | Integrates seamlessly |
| Cross-Platform | Limited | Fully cross-platform |
| Pros | High-performance GPU quantization | Easy to use, seamless integration, optimized for M1 Mac |
| Cons | Complex setup, external dependency | Limited quantization options compared to bitsandbytes |
Conclusion

This guide provides a practical solution for optimizing Llama models on Apple M1 Macs using native PyTorch MPS quantization. By avoiding external libraries like bitsandbytes, the process remains efficient, reliable, and straightforward for machine learning practitioners.

@chigkim

chigkim commented Feb 4, 2025

I'd really appreciate MPS support please!

@matthewdouglas added the Duplicate label Feb 28, 2025
@matthewdouglas marked this as a duplicate of #1460 Mar 3, 2025
@matthewdouglas marked this as a duplicate of #1406 Mar 3, 2025
@matthewdouglas marked this as a duplicate of #485 Mar 3, 2025
@attentionmech

MPS support, please!

@foggyghost0

MPS support, please! How many more years?

@giadefa

giadefa commented Apr 7, 2025

Having the capability to install it and run it would allow me to debug the code locally.
It does not have to be super optimized at first.

@reneleonhardt

reneleonhardt commented Apr 8, 2025

Prototype contribution #947 from Jan 2, 2024 was closed despite all that hard work... because it was not needed? 🤔

That seems to be the reasoning: #257 (comment)

We will try to add Apple silicon support soon along with AMD and Intel support. We need to figure out next steps first though. It will likely take some weeks, but we will get there. This is one of the highest-priority features that we have and we start working on this soon.

Many contributors have offered to help with Apple silicon since then (#1340).
What happened? Are the efforts being coordinated?

And if there already is a native way in PyTorch (#252 (comment)), can't that be integrated into bitsandbytes, at least as a fallback until a new implementation is ready?

@rickardp
Contributor Author

rickardp commented Apr 8, 2025

Prototype contribution #947 from Jan 2, 2024 was closed despite all that hard work... because it was not needed? 🤔

Most of this was actually merged in separate PRs to make it more manageable. This PR was mainly about making the library portable at all.

Before contributing, I think the architectural direction for this library has to be established. I was arguing at the time that we need a CPU implementation with 100% test coverage to start from, so that MPS support can be added gradually. If my understanding is correct, the latest architectural direction is to unify backend initialization with PyTorch. Once we get to a level where kernels can be added one by one and there are unit tests to verify correctness, I think a lot more can be done by the community. Right now, IMHO, there is quite a dependence on the core maintainers.

@matthewdouglas
Member

Hi folks,

We're not quite there yet, but after merging #1544 we've started to pave the path forward. I want to make it clear that we haven't abandoned Apple silicon support; we have simply had competing priorities to sift through. We're still working toward this.

In fact, we now have a CI test suite that is stable and reliably deterministic on our main branch for the CUDA implementation. The CUDA implementation is always going to be the "gold standard" reference implementation, but this is a step toward being able to validate implementations for new hardware.

We have some new PyTorch native fallback implementations of some custom operators which can be used for reference/fallback on CPU as well as other devices. To @rickardp's point, this will allow incremental support for more optimal implementations on additional platforms, as well as serve as a secondary reference to evaluate against.
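
For readers unfamiliar with that mechanism, here is a minimal illustrative sketch of a PyTorch-native fallback operator registered through torch.library; the op name and the toy absmax quantization scheme are placeholders rather than bitsandbytes code, and it assumes PyTorch >= 2.4:

import torch

@torch.library.custom_op("mylib::absmax_quantize", mutates_args=())
def absmax_quantize(x: torch.Tensor) -> torch.Tensor:
    # Pure-PyTorch reference implementation: runs unchanged on CPU, CUDA, or MPS,
    # and can later be overridden by a device-specific kernel.
    scale = x.abs().amax().clamp(min=1e-12)
    return torch.round(x / scale * 127.0).to(torch.int8)

@absmax_quantize.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype metadata so the op works under torch.compile and meta tracing.
    return torch.empty_like(x, dtype=torch.int8)

# Usage: q = torch.ops.mylib.absmax_quantize(torch.randn(4, 4))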

I have a WIP branch that I have not pushed yet, but it has an implementation of the NF4 quantization/dequantization ops which can run on CPU and, with a little work, even on MPS. I've also acquired an M4 MacBook Pro for future development and validation. Soon we should be able to build out enough plumbing to better enable community contributions of kernel implementations.
