Add GLM-4 CPU example #11223
158 changes: 158 additions & 0 deletions
python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4/README.md
@@ -0,0 +1,158 @@
# GLM-4

In this directory, you will find examples of how to apply IPEX-LLM INT4 optimizations to GLM-4 models. For illustration purposes, we use [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) as a reference GLM-4 model.

## 0. Requirements
IPEX-LLM has some recommended machine requirements for running these examples; please refer to [here](../README.md#recommended-requirements) for more information.

## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the environment:

On Linux:

```bash
conda create -n llm python=3.11 # Python 3.11 is recommended
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

pip install tiktoken # additional package required for GLM-4 to conduct generation
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
```

### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the GLM-4 model to be downloaded, or the path to the Hugging Face checkpoint folder. Defaults to `'THUDM/glm-4-9b-chat'`.
- `--prompt PROMPT`: the prompt to run inference on (the script wraps it in the chat prompt format). Defaults to `'AI是什么?'` ("What is AI?").
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `32`.

> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for subsequent inference.
>
> Please select the appropriate size of the GLM-4 model based on the capabilities of your machine.

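The rule of thumb in the note can be turned into a quick calculation. The sketch below is only an illustration of that arithmetic: `estimate_memory_gb` is a hypothetical helper, not part of IPEX-LLM, and the 9 billion parameter count is assumed from the `glm-4-9b-chat` model name. Real usage will be higher, since the estimate ignores activations and the key/value cache.

```python
def estimate_memory_gb(params_billion: float) -> tuple[float, float]:
    """Rough memory estimate following the 2X / 0.5X rule of thumb.

    Hypothetical helper for illustration only.
    """
    # 16-bit checkpoint: 2 bytes per parameter, so about 2X GB to load.
    load_gb = 2.0 * params_billion
    # INT4 weights: about 0.5 byte per parameter, so about 0.5X GB afterwards.
    inference_gb = 0.5 * params_billion
    return load_gb, inference_gb


# Assumption: a "9b" model has roughly 9 billion parameters.
load_gb, inference_gb = estimate_memory_gb(9)
print(f"loading: ~{load_gb} GB, inference: ~{inference_gb} GB")
```

For a 9B model this gives roughly 18 GB while loading and roughly 4.5 GB for the INT4 weights afterwards.
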
#### 2.1 Client
On a Windows client machine, it is recommended to run directly with full utilization of all cores:
```cmd
python ./generate.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py
```

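The 48-core values above are only an example. As a rough sketch of how the thread count and core range relate, the hypothetical helper below builds the same command for a given socket size. It assumes that core IDs are numbered contiguously per socket, which is not true on every machine; check the actual layout with `lscpu` or `numactl --hardware` before relying on it.

```python
def numactl_command(cores_per_socket: int, socket: int = 0,
                    script: str = "./generate.py") -> str:
    """Build a numactl command pinning the script to one socket.

    Hypothetical helper; assumes contiguous core numbering per socket.
    """
    first_core = socket * cores_per_socket
    last_core = first_core + cores_per_socket - 1
    # -C binds to the core range, -m binds memory to the same NUMA node.
    return f"numactl -C {first_core}-{last_core} -m {socket} python {script}"


cores = 48  # physical cores per socket, as in the example above
print(f"export OMP_NUM_THREADS={cores}")
print(numactl_command(cores))
```
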
#### 2.3 Sample Output
#### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)

```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
AI是什么?
<|assistant|>
-------------------- Output --------------------

AI是什么?

AI,即人工智能(Artificial Intelligence),是指由人创造出来的,能够模拟、延伸和扩展人的智能的计算机系统或机器。人工智能技术
```

```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
What is AI?
<|assistant|>
-------------------- Output --------------------

What is AI?

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
```

## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the environment:

On Linux:

```bash
conda create -n llm python=3.11 # Python 3.11 is recommended
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
```

### 2. Run
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
```

**Chat using `chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the GLM-4 model to be downloaded, or the path to the Hugging Face checkpoint folder. Defaults to `'THUDM/glm-4-9b-chat'`.
- `--question QUESTION`: the question to ask. Defaults to `"晚上睡不着应该怎么办"` ("What should I do if I can't sleep at night?").
- `--disable-stream`: flag that turns off streaming. If `--disable-stream` is passed when running the script, stream chat is disabled and the `chat()` API is used instead.

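The behavior of `--disable-stream` follows from how `streamchat.py` declares it with `action="store_true"`: the flag takes no value and is `False` unless it appears on the command line. The sketch below repeats just the argument definitions from the script so this can be checked without loading a model; `build_parser` is a hypothetical wrapper name used here for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the streamchat.py arguments (wrapper name is illustrative)."""
    parser = argparse.ArgumentParser(description="Stream Chat for GLM-4 model")
    parser.add_argument("--repo-id-or-model-path", type=str,
                        default="THUDM/glm-4-9b-chat")
    parser.add_argument("--question", type=str, default="晚上睡不着应该怎么办")
    # store_true: absent means False, present means True.
    parser.add_argument("--disable-stream", action="store_true")
    return parser


default_args = build_parser().parse_args([])
no_stream_args = build_parser().parse_args(["--disable-stream"])

print(default_args.disable_stream)    # streaming stays on by default
print(no_stream_args.disable_stream)  # flag present: chat() path is used
```
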
> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for subsequent inference.
>
> Please select the appropriate size of the GLM-4 model based on the capabilities of your machine.

#### 2.1 Client
On a Windows client machine, it is recommended to run directly with full utilization of all cores:
```powershell
$env:PYTHONUNBUFFERED=1 # ensure stdout and stderr streams are sent straight to terminal without being first buffered
python ./streamchat.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
export PYTHONUNBUFFERED=1 # ensure stdout and stderr streams are sent straight to terminal without being first buffered
numactl -C 0-47 -m 0 python ./streamchat.py
```
69 changes: 69 additions & 0 deletions
python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4/generate.py
@@ -0,0 +1,69 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse
import numpy as np

from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/tokenization_chatglm.py
GLM4_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for GLM-4 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat",
                        help='The huggingface repo id for the GLM-4 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = GLM4_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        # if your selected model is capable of utilizing previous key/value attentions
        # to enhance decoding speed, but has `"use_cache": false` in its model config,
        # it is important to set `use_cache=True` explicitly in the `generate` function
        # to obtain optimal performance with IPEX-LLM INT4 optimizations
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)
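
The script wraps the user prompt in the GLM-4 chat format before tokenizing it. The sketch below reproduces only that formatting step, using the same `GLM4_PROMPT_FORMAT` string, so the resulting text can be inspected without loading a model. `build_prompt` is a hypothetical helper name and is not part of the example script.

```python
# Same template string as in generate.py.
GLM4_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"


def build_prompt(user_text: str) -> str:
    """Wrap raw user text in the chat template (illustrative helper)."""
    return GLM4_PROMPT_FORMAT.format(prompt=user_text)


prompt = build_prompt("What is AI?")
# Prints three lines: the user tag, the question, and the assistant tag.
print(prompt)
```

This matches the "Prompt" section of the sample output in the README, where the question appears between the `<|user|>` and `<|assistant|>` markers.
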
62 changes: 62 additions & 0 deletions
python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4/streamchat.py
@@ -0,0 +1,62 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse
import numpy as np

from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Stream Chat for GLM-4 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat",
                        help='The huggingface repo id for the GLM-4 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--question', type=str, default="晚上睡不着应该怎么办",
                        help='Question you want to ask')
    parser.add_argument('--disable-stream', action="store_true",
                        help='Disable stream chat')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    disable_stream = args.disable_stream

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    with torch.inference_mode():
        if disable_stream:
            # Chat
            response, history = model.chat(tokenizer, args.question, history=[])
            print('-'*20, 'Chat Output', '-'*20)
            print(response)
        else:
            # Stream chat
            response_ = ""
            print('-'*20, 'Stream Chat Output', '-'*20)
            for response, history in model.stream_chat(tokenizer, args.question, history=[]):
                print(response.replace(response_, ""), end="")
                response_ = response
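
The streaming branch prints only the newly generated text on each iteration by removing the previous response from the current one. Below is a minimal, model-free sketch of that idea. `fake_stream` is a hypothetical stand-in for `model.stream_chat()` and is assumed to yield the cumulative response each time; `iter_deltas` is likewise an illustrative name. The sketch uses prefix slicing rather than `str.replace`, as one possible alternative, because `replace` removes every occurrence of the earlier text and not only the leading one.

```python
def fake_stream():
    """Hypothetical stand-in for model.stream_chat(): yields cumulative text."""
    for text in ["I", "I think", "I think I can help"]:
        yield text


def iter_deltas(cumulative_responses):
    """Yield only the text added since the previous cumulative response."""
    previous = ""
    for response in cumulative_responses:
        if response.startswith(previous):
            # Common case: the new response extends the old one.
            delta = response[len(previous):]
        else:
            # Fallback assumption: the text was revised, so emit it whole.
            delta = response
        previous = response
        yield delta


deltas = list(iter_deltas(fake_stream()))
for delta in deltas:
    # flush=True pushes each chunk to the terminal immediately.
    print(delta, end="", flush=True)
print()
```

To see why the distinction can matter: `"I think I".replace("I", "")` yields `" think "`, dropping the second "I", whereas slicing off the one-character prefix keeps it.
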
75 changes: 75 additions & 0 deletions
python/llm/example/CPU/PyTorch-Models/Model/glm4/README.md
@@ -0,0 +1,75 @@
# GLM-4
In this directory, you will find examples of how to use the IPEX-LLM `optimize_model` API to accelerate GLM-4 models. For illustration purposes, we use [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) as a reference GLM-4 model.

## Requirements
IPEX-LLM has some recommended machine requirements for running these examples; please refer to [here](../README.md#recommended-requirements) for more information.

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://conda-forge.org/download/).

After installing conda, create a Python environment for IPEX-LLM:

On Linux:

```bash
conda create -n llm python=3.11 # Python 3.11 is recommended
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

pip install tiktoken # additional package required for GLM-4 to conduct generation
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
```

### 2. Run
After setting up the Python environment, you can run the example by following the steps below.

#### 2.1 Client
On client Windows machines, it is recommended to run directly with full utilization of all cores:
```cmd
python ./generate.py --prompt 'AI是什么?'
```
More information about arguments can be found in the [Arguments Info](#23-arguments-info) section. The expected output can be found in the [Sample Output](#24-sample-output) section.

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py --prompt 'AI是什么?'
```
More information about arguments can be found in the [Arguments Info](#23-arguments-info) section. The expected output can be found in the [Sample Output](#24-sample-output) section.

#### 2.3 Arguments Info
In the example, several arguments can be passed to satisfy your requirements:

- `--repo-id-or-model-path`: str, the Hugging Face repo id of the GLM-4 model to be downloaded, or the path to the Hugging Face checkpoint folder. Defaults to `'THUDM/glm-4-9b-chat'`.
- `--prompt`: str, the prompt to run inference on (the chat prompt format is applied to it). Defaults to `'AI是什么?'` ("What is AI?").
- `--n-predict`: int, the maximum number of tokens to predict. Defaults to `32`.

#### 2.4 Sample Output
#### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)

```log
Inference time: xxxx s
-------------------- Output --------------------

AI是什么?

AI,即人工智能(Artificial Intelligence),是指由人创造出来的,能够模拟、延伸和扩展人的智能的计算机系统或机器。人工智能技术
```