# GLM-4
In this directory, you will find examples of how you can apply IPEX-LLM INT4 optimizations on GLM-4 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) as a reference GLM-4 model.

## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.

## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
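
The core flow of [generate.py](./generate.py) can be summarized in a few calls. Below is a minimal sketch, assuming the usual `ipex_llm.transformers` loading API; the model path, prompt, and `max_new_tokens` value are illustrative, and the real script additionally parses command-line arguments and reports inference time:

```python
import torch
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/glm-4-9b-chat"  # or a local checkpoint folder

# Load the model with IPEX-LLM INT4 optimizations and move it to the Intel GPU ('xpu').
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True,
                                  use_cache=True).half().to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# GLM-4 chat prompt format (matches the Sample Output section below).
prompt = "<|user|>\nAI是什么?\n<|assistant|>"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```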
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```
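
After installation on either OS, you can optionally verify that PyTorch can see the Intel GPU. This short check assumes the `ipex-llm[xpu]` install above pulled in `intel_extension_for_pytorch` with XPU support; on Linux, run it after completing step 2 below so the oneAPI environment is active:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch

# Expect True if the Intel GPU driver and oneAPI runtime are set up correctly.
print(torch.xpu.is_available())
if torch.xpu.is_available():
    print(torch.xpu.get_device_name(0))
```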

### 2. Configure OneAPI environment variables for Linux

> [!NOTE]
> Skip this step if you are running on Windows.

This step is required on Linux when oneAPI was installed with APT or the offline installer; skip it if oneAPI was installed with pip.

```bash
source /opt/intel/oneapi/setvars.sh
```

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

<details>

<summary>For Intel iGPU</summary>

```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A-Series Graphics</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

> [!NOTE]
> The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.

### 4. Running examples

```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info (see the parsing sketch after this list):
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id of the GLM-4 model (e.g. `THUDM/glm-4-9b-chat`) to be downloaded, or the path to a local checkpoint folder. The default is `'THUDM/glm-4-9b-chat'`.
- `--prompt PROMPT`: argument defining the prompt to run inference on (the chat prompt format is applied automatically). The default is `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the maximum number of tokens to predict. The default is `32`.
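
A minimal, hypothetical `argparse` setup along these lines would accept the same command line (the actual script may differ in details):

```python
import argparse

parser = argparse.ArgumentParser(description="Predict tokens with a GLM-4 model using generate()")
parser.add_argument("--repo-id-or-model-path", type=str, default="THUDM/glm-4-9b-chat",
                    help="Hugging Face repo id, or path to a local checkpoint folder")
parser.add_argument("--prompt", type=str, default="AI是什么?",
                    help="Prompt to run inference on")
parser.add_argument("--n-predict", type=int, default=32,
                    help="Maximum number of new tokens to generate")
args = parser.parse_args()
```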

#### Sample Output
##### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
AI是什么?
<|assistant|>
-------------------- Output --------------------

AI是什么?

AI,即人工智能(Artificial Intelligence),是指由人创造出来的,能够模拟、延伸和扩展人的智能的计算机系统或机器。人工智能的目标
```

```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
What is AI?
<|assistant|>
-------------------- Output --------------------

What is AI?

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
```

## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
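
The streaming loop relies on the chat helpers exposed by the GLM-4 checkpoint's remote code. Below is a minimal sketch, assuming ChatGLM-style `stream_chat()`/`chat()` methods that take the tokenizer, a question, and a history list; the model path and question are illustrative:

```python
import torch
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/glm-4-9b-chat"
model = AutoModel.from_pretrained(model_path, load_in_4bit=True,
                                  trust_remote_code=True).half().to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

question = "AI是什么?"
with torch.inference_mode():
    # Streaming: each iteration yields the full partial response, so print only the new suffix.
    printed = ""
    for response, history in model.stream_chat(tokenizer, question, history=[]):
        print(response[len(printed):], end="", flush=True)
        printed = response

    # Non-streaming alternative (what --disable-stream switches to): a one-shot chat() call.
    # response, history = model.chat(tokenizer, question, history=[])
    # print(response)
```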
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```

#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm

# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# install tiktoken required for GLM-4
pip install tiktoken
```

### 2. Configure OneAPI environment variables for Linux

> [!NOTE]
> Skip this step if you are running on Windows.

This step is required on Linux when oneAPI was installed with APT or the offline installer; skip it if oneAPI was installed with pip.

```bash
source /opt/intel/oneapi/setvars.sh
```

### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>

<details>

<summary>For Intel iGPU</summary>

```bash
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```

</details>

#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A-Series Graphics</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

> [!NOTE]
> The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.

### 4. Running examples

**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
```

**Chat using `chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id of the GLM-4 model to be downloaded, or the path to a local checkpoint folder. The default is `'THUDM/glm-4-9b-chat'`.
- `--question QUESTION`: argument defining the question to ask. The default is `"AI是什么?"`.
- `--disable-stream`: argument defining whether to disable streaming. If `--disable-stream` is included when running the script, streaming is disabled and the `chat()` API is used instead.