Commit a4cdba6

fix(review): add suggested changes
1 parent: da44046


β€Žintel-gaudi-backend-for-tgi.md

Lines changed: 13 additions & 8 deletions
@@ -11,7 +11,8 @@ We're excited to announce the native integration of Intel Gaudi hardware support
 
 ## ✨ What's New?
 
-We've fully integrated Gaudi support into TGI's main codebase. Previously, we maintained a separate fork for Gaudi devices at [tgi-gaudi](https://github.com/huggingface/tgi-gaudi). This was cumbersome for users and prevented us from supporting the latest TGI features at launch. Now using the new [TGI multi-backend architecture](https://huggingface.co/blog/tgi-multi-backend), we support Gaudi directly on TGI no more finicking on a custom repository 🙌
+We've fully integrated Gaudi support into TGI's main codebase in PR [#3091](https://github.com/huggingface/text-generation-inference/pull/3091). Previously, we maintained a separate fork for Gaudi devices at [tgi-gaudi](https://github.com/huggingface/tgi-gaudi). This was cumbersome for users and prevented us from supporting the latest TGI features at launch. Now using the new [TGI multi-backend architecture](https://huggingface.co/blog/tgi-multi-backend), we support Gaudi directly on TGI – no more finicking on a custom repository 🙌
+
 This integration supports Intel's full line of Gaudi hardware:
 - Gaudi1 💻: Available on [AWS EC2 DL1 instances](https://aws.amazon.com/ec2/instance-types/dl1/)
 - Gaudi2 💻💻: Available on [Intel Dev Cloud](https://ai.cloud.intel.com/)
@@ -34,11 +35,13 @@ The easiest way to run TGI on Gaudi is to use our official Docker image. You nee
 model=meta-llama/Meta-Llama-3.1-8B-Instruct
 volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
 hf_token=YOUR_HF_ACCESS_TOKEN
-docker run --runtime=habana --cap-add=sys_nice --ipc=host
--p 8080:80 -v $volume:/data
--e HF_TOKEN=$hf_token
--e HABANA_VISIBLE_DEVICES=all
-ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi
+
+docker run --runtime=habana --cap-add=sys_nice --ipc=host \
+    -p 8080:80 \
+    -v $volume:/data \
+    -e HF_TOKEN=$hf_token \
+    -e HABANA_VISIBLE_DEVICES=all \
+    ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
 --model-id $model
 ```
 
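Once the container above is up, the server can be probed before sending any prompts. This is a minimal sketch, assuming the `-p 8080:80` mapping from the command above and TGI's standard `/health` and `/info` routes:

```bash
# Wait for the model to finish loading: /health returns 200 once the server is ready.
curl -sf 127.0.0.1:8080/health && echo "server is ready"

# Inspect the loaded model and runtime configuration.
curl -s 127.0.0.1:8080/info
```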
@@ -51,6 +54,8 @@ curl 127.0.0.1:8080/generate
 -H 'Content-Type: application/json'
 ```
 
+For comprehensive documentation on using TGI with Gaudi, including how-to guides and advanced configurations, refer to the new dedicated [Gaudi backend documentation](https://huggingface.co/docs/text-generation-inference/backends/gaudi).
+
 ## 🎉 Top features
 
 We have optimized the following models for both single and multi-card configurations. This means these models run as fast as possible on Intel Gaudi. We've specifically optimized the modeling code to target Intel Gaudi hardware, ensuring we offer the best performance and fully utilize Gaudi's capabilities:
@@ -65,10 +70,10 @@ We have optimized the following models for both single and multi-card configurat
 - Gemma (7B)
 - Llava-v1.6-Mistral-7B
 
-Furthermore, we also support all models implemented in Transformers library, providing a [fallback mechanism](https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models) that ensures you can still run any model on Gaudi hardware even if it's not yet specifically optimized.
+Furthermore, we also support all models implemented in the [Transformers library](https://huggingface.co/docs/transformers/index), providing a [fallback mechanism](https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models) that ensures you can still run any model on Gaudi hardware even if it's not yet specifically optimized.
 
 🏃‍♂️ We also offer many advanced features on Gaudi hardware, such as FP8 quantization thanks to [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html), enabling even greater performance optimizations.
 
 ## 💪 Getting Involved
 
-We invite the community to try out TGI on Gaudi hardware and provide feedback. The full documentation is available in the TGI repository. 📚 If you're interested in contributing, check out our contribution guidelines or open an issue with your feedback on GitHub. 🤝 By bringing Intel Gaudi support directly into TGI, we're continuing our mission to provide flexible, efficient, and production-ready tools for deploying LLMs. We're excited to see what you'll build with this new capability! 🎉
+We invite the community to try out TGI on Gaudi hardware and provide feedback. The full documentation is available in the [TGI Gaudi backend documentation](https://huggingface.co/docs/text-generation-inference/backends/gaudi). 📚 If you're interested in contributing, check out our contribution guidelines or open an issue with your feedback on GitHub. 🤝 By bringing Intel Gaudi support directly into TGI, we're continuing our mission to provide flexible, efficient, and production-ready tools for deploying LLMs. We're excited to see what you'll build with this new capability! 🎉
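The fallback mechanism described above means the same launch command also works for models outside the optimized list; only `--model-id` changes. A minimal sketch, where the model name is just an illustrative pick rather than one called out in the post:

```bash
# Any model implemented in Transformers can be served through the fallback path.
# The model below is only an illustrative choice; swap in the one you need.
model=Qwen/Qwen2.5-1.5B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid re-downloading weights
hf_token=YOUR_HF_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 \
    -v $volume:/data \
    -e HF_TOKEN=$hf_token \
    -e HABANA_VISIBLE_DEVICES=all \
    ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
    --model-id $model
```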
