
Commit 40fc870

Add GPU example for GLM-4 (#11267)
* Add GPU example for GLM-4
* Update streamchat.py
* Fix pretrained arguments in generate.py and streamchat.py
* Update README: add installation of tiktoken, required for GLM-4
* Update comments in generate.py
1 parent 0d9cc9c commit 40fc870

File tree

5 files changed: +636 -0 lines changed

@@ -0,0 +1,265 @@
# GLM-4
In this directory, you will find examples of how you can apply IPEX-LLM INT4 optimizations on GLM-4 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) as a reference GLM-4 model.

## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, there are some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.

## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
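At its core, the script loads the checkpoint with IPEX-LLM's 4-bit loader, moves it to the XPU device, and calls `generate()`. Below is a condensed sketch of that flow; see [generate.py](./generate.py) for the full, runnable version.

```python
# Condensed sketch of the generate() flow used in generate.py.
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/glm-4-9b-chat"
# Load with INT4 optimizations and move the model to the Intel GPU (XPU)
model = AutoModel.from_pretrained(model_path, load_in_4bit=True,
                                  trust_remote_code=True, use_cache=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Apply the GLM-4 chat prompt format, tokenize, and generate up to 32 new tokens
prompt = "<|user|>\nWhat is AI?\n<|assistant|>"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```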
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```

#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```
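Optionally, before running the examples you can sanity-check that the XPU device is visible from Python. A minimal sketch, assuming the `ipex-llm[xpu]` install above has pulled in `intel_extension_for_pytorch`:

```python
# Quick environment check: confirm the Intel GPU (XPU) backend is available.
import torch
import intel_extension_for_pytorch as ipex  # registers the torch.xpu backend

print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    print("Device name:", torch.xpu.get_device_name(0))
```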
### 2. Configure oneAPI environment variables for Linux

> [!NOTE]
> Skip this step if you are running on Windows.

This is a required step on Linux for APT or offline-installed oneAPI. Skip this step for PIP-installed oneAPI.

```bash
source /opt/intel/oneapi/setvars.sh
```

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

<details>

<summary>For Intel iGPU</summary>

```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A-Series Graphics</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

> [!NOTE]
> The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the GLM-4 model (e.g. `THUDM/glm-4-9b-chat`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'THUDM/glm-4-9b-chat'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (the GLM-4 chat prompt format is applied automatically; see the sketch below). It defaults to `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
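For reference, the script wraps the raw `--prompt` value in the GLM-4 chat template before tokenization, using the same `GLM4_PROMPT_FORMAT` string defined in [generate.py](./generate.py):

```python
# The chat template applied to the raw prompt before tokenization
# (matches GLM4_PROMPT_FORMAT in generate.py).
GLM4_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

prompt = GLM4_PROMPT_FORMAT.format(prompt="AI是什么?")
print(prompt)
# <|user|>
# AI是什么?
# <|assistant|>
```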
#### Sample Output
##### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
AI是什么?
<|assistant|>
-------------------- Output --------------------

AI是什么?

AI,即人工智能(Artificial Intelligence),是指由人创造出来的,能够模拟、延伸和扩展人的智能的计算机系统或机器。人工智能的目标
```

```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
What is AI?
<|assistant|>
-------------------- Output --------------------

What is AI?

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
```

## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
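The `stream_chat()` API is provided by the model's remote code rather than by IPEX-LLM itself. Below is a minimal sketch of how streaming output is typically consumed, assuming the checkpoint exposes the ChatGLM-style `stream_chat(tokenizer, query, history=...)` generator; see [streamchat.py](./streamchat.py) for the full example.

```python
# Minimal streaming sketch; assumes the checkpoint's remote code provides
# the ChatGLM-style stream_chat() generator.
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/glm-4-9b-chat"
model = AutoModel.from_pretrained(model_path, load_in_4bit=True,
                                  trust_remote_code=True, use_cache=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

question = "AI是什么?"
printed = ""
# stream_chat() yields progressively longer responses; print only the newly added text each time.
for response, history in model.stream_chat(tokenizer, question, history=[]):
    print(response[len(printed):], end="", flush=True)
    printed = response
print()
```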
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```

#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```

### 2. Configure oneAPI environment variables for Linux

> [!NOTE]
> Skip this step if you are running on Windows.

This is a required step on Linux for APT or offline-installed oneAPI. Skip this step for PIP-installed oneAPI.

```bash
source /opt/intel/oneapi/setvars.sh
```

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

<details>

<summary>For Intel iGPU</summary>

```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A-Series Graphics</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

> [!NOTE]
> The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60 GPU, it may take several minutes to compile.
### 4. Running examples

**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
```

**Chat using `chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the GLM-4 model to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'THUDM/glm-4-9b-chat'`.
- `--question QUESTION`: argument defining the question to ask. It defaults to `"AI是什么?"`.
- `--disable-stream`: flag defining whether to disable stream chat. If `--disable-stream` is included when running the script, streaming is disabled and the `chat()` API is used instead (see the sketch below).
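When `--disable-stream` is passed, the script makes a single blocking call instead of streaming. A short sketch of that path, assuming the checkpoint's remote code exposes the ChatGLM-style `chat(tokenizer, query, history=...)` method and that `model` and `tokenizer` are loaded as in the streaming sketch above:

```python
# Non-streaming path: one blocking chat() call returns the complete response.
response, history = model.chat(tokenizer, "AI是什么?", history=[])
print(response)
```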
@@ -0,0 +1,78 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse
import numpy as np

from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

# you could tune the prompt based on your own model;
# here the prompt format follows https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/tokenization_chatglm.py
GLM4_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for GLM-4 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat",
                        help='The huggingface repo id for the GLM-4 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load the model in 4 bit, which converts the relevant layers in the model into INT4 format.
    # When running LLMs on an Intel iGPU under Windows, we recommend setting `cpu_embedding=True`
    # in the from_pretrained function. This allows the memory-intensive embedding layer to
    # utilize the CPU instead of the iGPU.
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      optimize_model=True,
                                      trust_remote_code=True,
                                      use_cache=True)
    model = model.to("xpu")

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = GLM4_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

        # the ipex_llm model needs a warmup run; after that, the measured inference time is accurate
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)

        st = time.time()

        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)

        torch.xpu.synchronize()
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)
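As the comment in the script notes, Windows users running on an Intel iGPU may prefer to keep the embedding layer on the CPU. A sketch of that variant of the `from_pretrained` call (only `cpu_embedding=True` is added):

```python
# Variant for Intel iGPU on Windows: keep the memory-intensive embedding layer on the CPU.
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  optimize_model=True,
                                  trust_remote_code=True,
                                  use_cache=True,
                                  cpu_embedding=True)
model = model.to("xpu")
```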
