Support for Apple silicon #252
Hi there, I will contribute too, in order to get it to work on Metal / Apple M1. This is my trace:
Nice to hear! It would be good to hear from the maintainers that they are at all interested in making this package cross-platform. It is very much CUDA focused at the moment.

I've just started looking at the unit tests and the Python libraries. The C++ code is quite nicely structured, but the Python code would need some refactoring since most of the calls assume CUDA (x.cuda() instead of x.to(device), etc.). Also, since the CPU version does not cover 100% of the feature set, testing is going to be quite some work as there is no real baseline. I suppose one question is whether it would make sense to make the CPU cover 100% of the API calls, even if inefficiently, just to provide a baseline that the GPU implementations could compare against? If pursuing this, I propose implementing cross-platform CPU support first, then tackling MPS. MPS is of course what makes it useful. (I have the exact same setup BTW, 2021 MBP)

Edit: Specifically, here's how I imagine the unit tests would have to work. So at least one CPU test passes on my M1 Mac :)
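For illustration, here is a rough sketch of the kind of device-agnostic pattern I mean; `pick_device` and `absmax_dequant_reference` are made-up names for this example, not functions from this repo:

```python
import torch

def pick_device() -> torch.device:
    """Return the best available accelerator, falling back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def absmax_dequant_reference(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Device-agnostic: computes wherever the inputs already live, no .cuda() calls.
    return x_q.to(torch.float32) * scale

device = pick_device()
x_q = torch.randint(-128, 127, (4, 4), dtype=torch.int8, device=device)
scale = torch.tensor(0.05, device=device)
print(absmax_dequant_reference(x_q, scale))
```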
Please have a look at "Building on Jetson AGX Xavier Development Kit fails" #221
Wow... not to be inflammatory, but are we saying that there's no immediate solution for this if you have any MacBook from the last, like, 5 years? Yuck.
The Apple M1 (https://en.wikipedia.org/wiki/Apple_M1) was introduced less than 3 years ago.
When will this be done?
Looking forward to the support for this too. I got the errors below when I tried to fine-tune Llama 2 7B:

```
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 293, in forward
    using_igemmlt = supports_igemmlt(A.device) and not state.force_no_igemmlt
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 226, in supports_igemmlt
    if torch.cuda.get_device_capability(device=device) < (7, 5):
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
@benjaminhuo Getting the same issue as you.
This seems to be due to calling torch.cuda even if the device type isn't cuda:

```python
if device.type != 'cuda':
    return False
```

MPS returns "mps" as device.type.
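Something like this would avoid the crash on MPS and CPU. This is just a simplified sketch of the guard being discussed (the name `supports_igemmlt_sketch` and the capability check are illustrative, not the actual bitsandbytes implementation):

```python
import torch

def supports_igemmlt_sketch(device: torch.device) -> bool:
    # Return early for any non-CUDA device (mps, cpu, ...), so torch.cuda is
    # never touched on machines where PyTorch was not built with CUDA.
    if device.type != "cuda":
        return False
    return torch.cuda.get_device_capability(device=device) >= (7, 5)

print(supports_igemmlt_sketch(torch.device("mps")))  # False, no torch.cuda call made
print(supports_igemmlt_sketch(torch.device("cpu")))  # False
```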
Same issue here; MPS seems to be the problem.
Getting the same issue with Apple silicon. Would love to see some support for it soon!
Same issue. Would be nice to have support for MPS.
Same here, please add support for MPS.
+1 MPS support would be absolutely great!
Adding a comment to keep this alive. MPS support would be awesome!
Once the device abstraction has been merged, we can start adding MPS-accelerated versions of the functions.
Yay. Thanks for all your efforts.
Looking forward to MPS support!
Looking forward to MPS support!!!!
Looking forward to MPS support.
+1
+1
Add MPS support, please!
+1
+1
+1
+1
MPS support! Don't let it die.
Looking forward to MPS support!
+1
I don't know if it helps, but here's https://github.com/filipstrand/mflux with support for quantized Flux on MLX.
Here's the updated inline text with a detailed comparison table, including Python version, required packages, pros, and cons:

**Lesson: Optimizing a Llama Model on an Apple M1 Mac with Quantization**

**Introduction**

For those working with Llama models on Apple M1 Macs, using bitsandbytes for 8-bit quantization can present compatibility issues due to limited GPU support. Instead, leveraging PyTorch's native MPS (Metal Performance Shaders) backend for quantization provides a robust solution that works seamlessly.

**Prerequisites**

```
pip install torch torchvision numpy
```

**Step-by-Step Guide**

1. Ensure the necessary libraries are installed:

```
pip install torch torchvision numpy
```

2. Load your pre-trained Llama model using PyTorch:

```python
import torch

# Load the pre-trained Llama model
model_path = '/path/to/llama_model.pt'
model = torch.load(model_path)

# Move the model to the MPS device
device = torch.device('mps')
model.to(device)
```

3. Use PyTorch's dynamic quantization feature:

```python
from torch.quantization import quantize_dynamic

# Quantize the model
quantized_model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Save the quantized model
quantized_model_path = '/path/to/quantized_llama_model.pt'
torch.save(quantized_model, quantized_model_path)
print(f"Quantized Model Saved at: {quantized_model_path}")
```

4. Perform inference using the quantized model:

```python
input_tensor = torch.randn(1, *model.input_size, device=device)
output = quantized_model(input_tensor)
print(f"Inference Output: {output}")
```

**Key Benefits of This Approach**

This guide provides a practical solution for optimizing Llama models on Apple M1 Macs using native PyTorch MPS quantization. By avoiding external libraries like bitsandbytes, the process remains efficient, reliable, and straightforward for machine learning practitioners.
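One caveat worth adding to the guide above: hard-coding torch.device('mps') fails on machines or PyTorch builds without Metal support, so it may be safer to probe first. A minimal sketch:

```python
import torch

# Prefer MPS when PyTorch was built with it and the hardware supports it,
# otherwise fall back to CPU instead of crashing.
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")
```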
I'd really appreciate MPS support, please!
MPS support please!
MPS support please! How many more years?
Having the capability to install it and run it would allow me to debug the code locally.
Prototype contribution #947 from Jan 2, 2024 was closed despite all that hard work... because it was not needed? 🤔 That seems to be the reasoning: #257 (comment)

Many contributors have offered to help with Apple Silicon since then (#1340). And if there already is a native way in PyTorch (#252 (comment)), can't that be integrated into bitsandbytes, at least as a fallback until a new implementation is ready?
Most of this was actually merged in separate PRs to make it more manageable. This PR was mainly about making the library portable at all.

Before contributing, I think the architectural direction for this library has to be established. I was at the time arguing that we need a 100% test-covered CPU implementation to start from, so MPS support can be added gradually. If my understanding is correct, the latest architectural direction is to unify backend initialization with PyTorch.

Once we get to a level where kernels can be added one by one and there are unit tests to verify correctness, I think a lot more can be done by the community. Right now, IMHO, there is quite a dependence on the core maintainers.
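To make the "CPU baseline" idea concrete, here is roughly how I imagine such a test could look. This is only a sketch with a made-up `quantize_roundtrip` stand-in, not actual bitsandbytes test code:

```python
import pytest
import torch

def quantize_roundtrip(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the op under test; a real test would call the library's own
    # quantize/dequantize pair on the given device.
    scale = x.abs().max() / 127
    x_q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return x_q.to(torch.float32) * scale

@pytest.mark.parametrize("device", ["cpu", "cuda", "mps"])
def test_matches_cpu_baseline(device):
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("CUDA not available")
    if device == "mps" and not torch.backends.mps.is_available():
        pytest.skip("MPS not available")
    x = torch.randn(256, 256)
    expected = quantize_roundtrip(x)                 # CPU reference baseline
    actual = quantize_roundtrip(x.to(device)).cpu()  # same op on the target device
    torch.testing.assert_close(actual, expected, rtol=1e-3, atol=1e-3)
```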
Hi folks,

We're not quite there yet, but after merging #1544 we've started to pave the path forward. I want to make it clear that we haven't abandoned Apple silicon support, but instead have had competing priorities to sift through. We're still working toward this.

In fact, we've now got a CI test suite that is stable and reliably deterministic. We have some new PyTorch-native fallback implementations of some custom operators which can be used for reference/fallback on CPU as well as other devices. To @rickardp's point, this will allow incremental support for more optimal implementations on additional platforms, as well as serve as a secondary reference to evaluate against.

I have a WIP branch which I have not pushed yet, but it has an implementation of the NF4 quantization/dequantization ops which can run on CPU and even on MPS with a little bit of work. I've also acquired an M4 MacBook Pro for future development/validation. Soon we should be able to build out enough plumbing to better enable community contributions for kernel implementations.
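As a rough illustration of the "PyTorch-native fallback plus per-device kernels" idea, here is what registering a custom op with a device-specific override can look like using torch.library (PyTorch 2.4+). The `demo::absmax_dequant` name and its body are made up for this sketch and are not the actual bitsandbytes operators:

```python
import torch
from torch.library import custom_op

# Hypothetical op in a demo namespace; the real bitsandbytes operator names
# and signatures differ.
@custom_op("demo::absmax_dequant", mutates_args=())
def absmax_dequant(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Default implementation in plain PyTorch: runs on CPU (and any other
    # device) and serves as the reference/fallback.
    return x_q.to(torch.float32) * scale

@absmax_dequant.register_kernel("mps")
def _(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # A Metal-optimized kernel would be registered here; for now this simply
    # reuses the same PyTorch math.
    return x_q.to(torch.float32) * scale

x_q = torch.randint(-128, 127, (8,), dtype=torch.int8)
print(absmax_dequant(x_q, torch.tensor(0.1)))
```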
Would it make sense for this library to support platforms other than CUDA on x64 Linux? I am specifically looking for Apple silicon support. Currently not even cpuonly works, since it assumes SSE2 support (even without NEON support).

I would guess that the first step would be a full cross-platform compile (arm64), then ideally support for Metal Performance Shaders as an alternative to CUDA (assuming it is at all feasible).

I could probably contribute some towards support if there is interest in bitsandbytes being multi-platform. I have some experience setting up cross-platform Python libraries.