
Commit 746d6aa

Gaudi on TGI (#2752)

* Gaudi on TGI
* fix(review): add suggested changes
* remove fallback to transformers section
* fix: add new thumbnail and rename Intel Dev Cloud to Intel Tiber AI Cloud
* change date to today
* fix: add suggested changes by Intel
* Update intel-gaudi-backend-for-tgi.md: added link to Gaudi product page
* Update intel-gaudi-backend-for-tgi.md: added co-authors
* Update _blog.yml
* fix date

Co-authored-by: Jeff Boudier <[email protected]>

1 parent a66c534 commit 746d6aa

3 files changed (+103 lines, βˆ’0 lines)

_blog.yml

Lines changed: 13 additions & 0 deletions
```diff
@@ -5760,3 +5760,16 @@
     - guide
     - community
     - open-source
+
+- local: intel-gaudi-backend-for-tgi
+  title: "Accelerating LLM Inference with TGI on Intel Gaudi"
+  author: baptistecolle
+  thumbnail: /blog/assets/optimum_intel/intel_thumbnail.png
+  date: March 28, 2025
+  tags:
+    - tgi
+    - intel
+    - gaudi
+    - llm
+    - inference
+    - partnerships
```

intel-gaudi-backend-for-tgi.md

Lines changed: 90 additions & 0 deletions
---
title: "πŸš€ Accelerating LLM Inference with TGI on Intel Gaudi"
thumbnail: /blog/assets/intel-gaudi-backend-for-tgi/tgi-gaudi-thumbnail.png
authors:
- user: baptistecolle
- user: regisss
- user: IlyasMoutawwakil
- user: echarlaix
- user: kding1
  guest: true
  org: intel
---
# πŸš€ Accelerating LLM Inference with TGI on Intel Gaudi

We're excited to announce the native integration of Intel Gaudi hardware support directly into Text Generation Inference (TGI), our production-ready serving solution for Large Language Models (LLMs). This integration brings the power of Intel's specialized AI accelerators to our high-performance inference stack, enabling more deployment options for the open-source AI community πŸŽ‰
## ✨ What's New?

We've fully integrated Gaudi support into TGI's main codebase in PR [#3091](https://github.com/huggingface/text-generation-inference/pull/3091). Previously, we maintained a separate fork for Gaudi devices at [tgi-gaudi](https://github.com/huggingface/tgi-gaudi). This was cumbersome for users and prevented us from supporting the latest TGI features at launch. Now, using the new [TGI multi-backend architecture](https://huggingface.co/blog/tgi-multi-backend), we support Gaudi directly in TGI – no more fiddling with a custom repository πŸ™Œ
This integration supports Intel's full line of [Gaudi hardware](https://www.intel.com/content/www/us/en/developer/platform/gaudi/develop/overview.html):
- Gaudi1 πŸ’»: Available on [AWS EC2 DL1 instances](https://aws.amazon.com/ec2/instance-types/dl1/)
- Gaudi2 πŸ’»πŸ’»: Available on [Intel Tiber AI Cloud](https://ai.cloud.intel.com/) and [Denvr Dataworks](https://www.denvrdata.com/guadi2)
- Gaudi3 πŸ’»πŸ’»πŸ’»: Available on [Intel Tiber AI Cloud](https://ai.cloud.intel.com/), [IBM Cloud](https://www.ibm.com/cloud), and from OEMs such as [Dell](https://www.dell.com/en-us/lp/intel-gaudi), [HP](https://www.hpe.com/us/en/compute/proliant-xd680.html), and [Supermicro](https://www.supermicro.com/en/accelerators/intel)

You can also find more information on Gaudi hardware on [Intel's Gaudi product page](https://www.intel.com/content/www/us/en/developer/platform/gaudi/develop/overview.html).
## 🌟 Why This Matters

The Gaudi backend for TGI provides several key benefits:
- Hardware Diversity πŸ”„: More options for deploying LLMs in production beyond traditional GPUs
- Cost Efficiency πŸ’°: Gaudi hardware often provides compelling price-performance for specific workloads
- Production-Ready βš™οΈ: All the robustness of TGI (dynamic batching, streamed responses, etc.) now available on Gaudi
- Model Support πŸ€–: Run popular models like Llama 3.1, Mixtral, Mistral, and more on Gaudi hardware
- Advanced Features πŸ”₯: Support for multi-card inference (sharding), vision-language models, and FP8 precision
## 🚦 Getting Started with TGI on Gaudi

The easiest way to run TGI on Gaudi is to use our official Docker image, which needs to run on a machine equipped with Gaudi hardware. Here's a basic example to get you started:
```bash
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_HF_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
  -p 8080:80 \
  -v $volume:/data \
  -e HF_TOKEN=$hf_token \
  -e HABANA_VISIBLE_DEVICES=all \
  ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
  --model-id $model
```

Once the server is running, you can send inference requests:
```bash
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json' \
```

For comprehensive documentation on using TGI with Gaudi, including how-to guides and advanced configurations, refer to the new dedicated [Gaudi backend documentation](https://huggingface.co/docs/text-generation-inference/backends/gaudi).
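
One such advanced configuration is multi-card inference, which goes through TGI's standard sharding flags. As a rough sketch, assuming a machine with 8 Gaudi cards and the same `volume` and `hf_token` variables as above (check the Gaudi backend documentation for the recommended settings per model):

```bash
# Sketch: shard a larger model across 8 Gaudi cards (assumed card count)
model=meta-llama/Meta-Llama-3.1-70B-Instruct

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
  -p 8080:80 \
  -v $volume:/data \
  -e HF_TOKEN=$hf_token \
  -e HABANA_VISIBLE_DEVICES=all \
  ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
  --model-id $model \
  --sharded true --num-shard 8
```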
## πŸŽ‰ Top features

We have optimized the following models for both single- and multi-card configurations. Their modeling code specifically targets Intel Gaudi hardware, ensuring they run as fast as possible and fully utilize Gaudi's capabilities:
- Llama 3.1 (8B and 70B)
- Llama 3.3 (70B)
- Llama 3.2 Vision (11B)
- Mistral (7B)
- Mixtral (8x7B)
- CodeLlama (13B)
- Falcon (180B)
- Qwen2 (72B)
- Starcoder and Starcoder2
- Gemma (7B)
- Llava-v1.6-Mistral-7B
- Phi-2

πŸƒβ€β™‚οΈ We also offer many advanced features on Gaudi hardware, such as FP8 quantization thanks to [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html), enabling even greater performance optimizations.

✨ Coming soon! We're excited to expand our model lineup with cutting-edge additions, including DeepSeek-R1/V3, Qwen-VL, and more, to power your AI applications! πŸš€

## πŸ’ͺ Getting Involved

We invite the community to try out TGI on Gaudi hardware and provide feedback. The full documentation is available in the [TGI Gaudi backend documentation](https://huggingface.co/docs/text-generation-inference/backends/gaudi). πŸ“š If you're interested in contributing, check out our contribution guidelines or open an issue with your feedback on GitHub. 🀝

By bringing Intel Gaudi support directly into TGI, we're continuing our mission to provide flexible, efficient, and production-ready tools for deploying LLMs. We're excited to see what you'll build with this new capability! πŸŽ‰
