Hugging Face Inference Endpoints now supports GGUF out of the box! #9669
ngxson started this conversation in Show and tell
You can now deploy any GGUF model on your own endpoint, in just a few clicks!

Simply select GGUF, pick a hardware configuration, and you're done! An endpoint powered by llama-server (built from the master branch) will be deployed automatically. It works with all llama.cpp-compatible models, at any size from 0.1B up to 405B parameters. Try it now --> https://ui.endpoints.huggingface.co/

And the best part is:

A huge thanks to @ggerganov, @slaren, and the @huggingface team for making this possible!

(Demo video: llama.hfe.ok.mp4)
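For anyone who wants to call the endpoint from code rather than the playground, here is a minimal sketch. It assumes the endpoint exposes llama-server's OpenAI-compatible /v1/chat/completions route and that you authenticate with your Hugging Face access token; the endpoint URL and token below are placeholders you'd replace with your own values.

```python
import requests

# Hypothetical placeholders: copy the real URL from the Endpoints UI
# and use your own Hugging Face access token.
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."

# llama-server speaks the OpenAI chat-completions protocol, so a plain
# POST with a messages list is enough to generate text.
response = requests.post(
    f"{ENDPOINT_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```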
Replies: 1 comment

The Hermes 405B model can be deployed on 2x A100 GPUs. The generation speed is around 8 t/s, which is not bad!