# GLM-4

In this directory, you will find examples of how to apply IPEX-LLM INT4 optimizations to GLM-4 models. For illustration purposes, we use [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) as a reference GLM-4 model.

## 0. Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.

## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the environment:

On Linux:

```bash
conda create -n llm python=3.11 # Python 3.11 is recommended
conda activate llm

# install the latest ipex-llm nightly build with the 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# install tiktoken, which is required by GLM-4
pip install tiktoken
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]

pip install tiktoken
```

### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id for the GLM-4 model to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'THUDM/glm-4-9b-chat'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (using the integrated prompt format for chat). It defaults to `'AI是什么?'` ("What is AI?").
- `--n-predict N_PREDICT`: argument defining the maximum number of tokens to predict. It defaults to `32`.

> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference. For the 9B reference model, that is roughly 18 GB to load the 16-bit checkpoint and about 4.5 GB for INT4 inference.
>
> Please select the appropriate size of the GLM-4 model based on the capabilities of your machine.

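
For reference, the core of [generate.py](./generate.py) follows the pattern sketched below: load the checkpoint through `ipex_llm.transformers` with `load_in_4bit=True`, then call the standard Hugging Face `generate()` API. This is a minimal, illustrative sketch rather than the script itself; the hand-written chat prompt and the token budget simply mirror the defaults listed above.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "THUDM/glm-4-9b-chat"  # or a local checkpoint folder

# load_in_4bit=True converts the linear layers to INT4 while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# GLM-4 chat prompt format, matching the sample output shown below
prompt = "<|user|>\nAI是什么?\n<|assistant|>\n"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```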
|
#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```cmd
python ./generate.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information) and to run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py
```

#### 2.3 Sample Output
##### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
AI是什么?
<|assistant|>
-------------------- Output --------------------

AI是什么?

AI,即人工智能(Artificial Intelligence),是指由人创造出来的,能够模拟、延伸和扩展人的智能的计算机系统或机器。人工智能技术
```

```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
What is AI?
<|assistant|>
-------------------- Output --------------------

What is AI?

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
```

## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to conduct a streaming chat, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the environment:

On Linux:

```bash
conda create -n llm python=3.11 # Python 3.11 is recommended
conda activate llm

# install the latest ipex-llm nightly build with the 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# install tiktoken, which is required by GLM-4
pip install tiktoken
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]

pip install tiktoken
```

### 2. Run
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
```

**Chat using `chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id for the GLM-4 model to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'THUDM/glm-4-9b-chat'`.
- `--question QUESTION`: argument defining the question to ask. It defaults to `"晚上睡不着应该怎么办"` ("What should I do if I can't sleep at night?").
- `--disable-stream`: argument defining whether to disable streaming. If `--disable-stream` is included when running the script, streaming chat is disabled and the `chat()` API is used instead.

> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference. For the 9B reference model, that is roughly 18 GB to load the 16-bit checkpoint and about 4.5 GB for INT4 inference.
>
> Please select the appropriate size of the GLM-4 model based on the capabilities of your machine.

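
For reference, a condensed sketch of the streaming pattern is shown below. It assumes the ChatGLM-style `stream_chat()` and `chat()` methods exposed by the checkpoint's remote code, which yield/return `(response, history)` pairs; the exact signatures may differ, so treat this as an illustration and see [streamchat.py](./streamchat.py) for the actual implementation.

```python
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModel

model_path = "THUDM/glm-4-9b-chat"  # or a local checkpoint folder

# load_in_4bit=True converts the linear layers to INT4 while loading
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

question = "晚上睡不着应该怎么办"  # "What should I do if I can't sleep at night?"

# Streaming: each iteration yields the full response generated so far,
# so print only the newly generated part
printed_len = 0
for response, history in model.stream_chat(tokenizer, question, history=[]):
    print(response[printed_len:], end="", flush=True)
    printed_len = len(response)
print()

# Non-streaming alternative (what the script falls back to with --disable-stream)
response, history = model.chat(tokenizer, question, history=[])
print(response)
```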
|
#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```cmd
$env:PYTHONUNBUFFERED=1 # ensure stdout and stderr streams are sent straight to the terminal without being buffered first
python ./streamchat.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information) and to run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
export PYTHONUNBUFFERED=1 # ensure stdout and stderr streams are sent straight to the terminal without being buffered first
numactl -C 0-47 -m 0 python ./streamchat.py
```