Add GLM-4 CPU example #11223
Merged
166 changes: 166 additions & 0 deletions
python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4/README.md
@@ -0,0 +1,166 @@
# GLM-4

In this directory, you will find examples of how to apply IPEX-LLM INT4 optimizations to GLM-4 models. For illustration purposes, we use [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) as the reference GLM-4 model.

## 0. Requirements
To run these examples with IPEX-LLM, your machine should meet some recommended requirements; please refer to [here](../README.md#recommended-requirements) for more information.

## Example 1: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the environment:

On Linux:

```bash
conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# install tiktoken required for GLM-4
pip install tiktoken
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]

pip install tiktoken
```

### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the GLM-4 model to download, or the path to a Hugging Face checkpoint folder. Defaults to `'THUDM/glm-4-9b-chat'`.
- `--prompt PROMPT`: the prompt to run inference on (the script wraps it in the chat prompt format). Defaults to `'AI是什么?'` ("What is AI?").
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `32`.

> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference.
>
> Please select the appropriate size of the GLM-4 model based on the capabilities of your machine.

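The rule of thumb in the note above can be worked through numerically. The following is a minimal sketch, not part of the PR: `estimate_memory_gb` is a hypothetical helper, and it assumes exactly 2 bytes per parameter for the 16-bit checkpoint and 0.5 bytes per parameter after INT4 conversion. It ignores activations, the KV cache, and any layers that stay in higher precision, so real usage will be somewhat higher.

```python
def estimate_memory_gb(params_billion: float) -> tuple[float, float]:
    """Return (loading_gb, inference_gb) for a model with `params_billion` billion parameters.

    Assumption (from the README note): 2 bytes/param when loading the 16-bit
    checkpoint, 0.5 bytes/param once linear layers are converted to INT4.
    """
    bytes_per_param_fp16 = 2.0
    bytes_per_param_int4 = 0.5
    loading_gb = params_billion * bytes_per_param_fp16
    inference_gb = params_billion * bytes_per_param_int4
    return loading_gb, inference_gb


# glm-4-9b-chat, treated here as a 9B-parameter model
loading, inference = estimate_memory_gb(9)
print(f"load: ~{loading:.1f} GB, inference: ~{inference:.1f} GB")
```

For the 9B reference model this gives roughly 18 GB while loading and about 4.5 GB for the converted weights, which is a useful first check against the RAM available on your machine.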
#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```cmd
python ./generate.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py
```

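The server recipe above hard-codes 48 cores. As a rough sketch of how the same launch line could be derived for other core counts, the hypothetical helper below builds it from the number of cores per socket. It assumes that the physical cores of socket 0 are numbered `0` to `N-1` and that its memory is NUMA node 0; this numbering varies between machines, so check it with `lscpu` or `numactl -H` before relying on it.

```python
def numactl_command(cores_per_socket: int, script: str = "./generate.py") -> str:
    """Build a launch line that pins `script` to socket 0's cores and memory.

    Assumption: cores 0..cores_per_socket-1 are the physical cores of socket 0,
    and NUMA node 0 is the memory attached to that socket.
    """
    if cores_per_socket < 1:
        raise ValueError("cores_per_socket must be positive")
    last_core = cores_per_socket - 1
    return (
        f"OMP_NUM_THREADS={cores_per_socket} "
        f"numactl -C 0-{last_core} -m 0 python {script}"
    )


print(numactl_command(48))
```

Here `-C` binds the process to the listed CPUs and `-m` restricts memory allocation to the given NUMA node, matching the flags used in the README command.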
#### 2.3 Sample Output
##### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
AI是什么?
<|assistant|>
-------------------- Output --------------------

AI是什么?

AI,即人工智能(Artificial Intelligence),是指由人创造出来的,能够模拟、延伸和扩展人的智能的计算机系统或机器。人工智能技术
```

```log
Inference time: xxxx s
-------------------- Prompt --------------------
<|user|>
What is AI?
<|assistant|>
-------------------- Output --------------------

What is AI?

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art
```

## Example 2: Stream Chat using `stream_chat()` API
In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the environment:

On Linux:

```bash
conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# install tiktoken required for GLM-4
pip install tiktoken
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]

pip install tiktoken
```

### 2. Run
**Stream Chat using `stream_chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION
```

**Chat using `chat()` API**:
```
python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the GLM-4 model to download, or the path to a Hugging Face checkpoint folder. Defaults to `'THUDM/glm-4-9b-chat'`.
- `--question QUESTION`: the question to ask. Defaults to `"晚上睡不着应该怎么办"` ("What should I do if I can't sleep at night?").
- `--disable-stream`: disables streaming. If `--disable-stream` is passed when running the script, the `chat()` API is used instead of `stream_chat()`.

> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference.
>
> Please select the appropriate size of the GLM-4 model based on the capabilities of your machine.

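To make the effect of `--disable-stream` concrete, here is a small self-contained sketch of how a `store_true` flag behaves in `argparse`. The `parse_mode` helper is hypothetical and only mirrors the flag described above; it returns the name of the API that would be selected rather than calling a model.

```python
import argparse


def parse_mode(argv: list[str]) -> str:
    """Return the API selected by the flag: 'stream_chat' by default, 'chat' when disabled."""
    parser = argparse.ArgumentParser(description="Illustration of the --disable-stream flag")
    # store_true: the attribute is False unless the flag appears on the command line
    parser.add_argument("--disable-stream", action="store_true")
    args = parser.parse_args(argv)
    return "chat" if args.disable_stream else "stream_chat"


print(parse_mode([]))                    # no flag given
print(parse_mode(["--disable-stream"]))  # flag given
```

Because the flag takes no value, streaming stays on unless the flag is explicitly present, which is why the default invocation uses `stream_chat()`.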
#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores (the `$env:` syntax below is for PowerShell):
```powershell
$env:PYTHONUNBUFFERED=1 # ensure stdout and stderr streams are sent straight to terminal without being first buffered
python ./streamchat.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
export PYTHONUNBUFFERED=1 # ensure stdout and stderr streams are sent straight to terminal without being first buffered
numactl -C 0-47 -m 0 python ./streamchat.py
```
67 changes: 67 additions & 0 deletions
python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4/generate.py
@@ -0,0 +1,67 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse
import numpy as np

from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/tokenization_chatglm.py
GLM4_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for GLM-4 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat",
                        help='The huggingface repo id for the GLM-4 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      optimize_model=True,
                                      trust_remote_code=True,
                                      use_cache=True)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        prompt = GLM4_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)
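The sample output in the README repeats the question before the answer. That is consistent with `generate()` returning the prompt tokens followed by the newly generated ones, with `skip_special_tokens=True` dropping the `<|user|>` and `<|assistant|>` markers on decode. The sketch below illustrates the prompt template and one possible way to keep only the new tokens; `build_prompt`, `new_tokens_only`, and the token ids are hypothetical stand-ins, and plain lists are used instead of tensors so the example runs without a model.

```python
GLM4_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"


def build_prompt(user_text: str) -> str:
    """Wrap raw user text in the chat template used by generate.py."""
    return GLM4_PROMPT_FORMAT.format(prompt=user_text)


def new_tokens_only(output_ids: list[int], prompt_length: int) -> list[int]:
    """Drop the echoed prompt tokens from a generate()-style output sequence."""
    return output_ids[prompt_length:]


prompt = build_prompt("What is AI?")
print(repr(prompt))

# Stand-in token ids: the first three play the role of the encoded prompt.
fake_output = [101, 102, 103, 7, 8, 9]
print(new_tokens_only(fake_output, prompt_length=3))
```

In the real script the equivalent slice would presumably be `output[0][input_ids.shape[1]:]` before decoding; the PR itself decodes the full sequence, so this is an optional variation rather than a correction.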
62 changes: 62 additions & 0 deletions
python/llm/example/CPU/HF-Transformers-AutoModels/Model/glm4/streamchat.py
@@ -0,0 +1,62 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse
import numpy as np

from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Stream Chat for GLM-4 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat",
                        help='The huggingface repo id for the GLM-4 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--question', type=str, default="晚上睡不着应该怎么办",
                        help='Question you want to ask')
    parser.add_argument('--disable-stream', action="store_true",
                        help='Disable stream chat')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    disable_stream = args.disable_stream

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    with torch.inference_mode():
        if disable_stream:
            # Chat
            response, history = model.chat(tokenizer, args.question, history=[])
            print('-'*20, 'Chat Output', '-'*20)
            print(response)
        else:
            # Stream chat
            response_ = ""
            print('-'*20, 'Stream Chat Output', '-'*20)
            for response, history in model.stream_chat(tokenizer, args.question, history=[]):
                print(response.replace(response_, ""), end="")
                response_ = response
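The streaming loop above prints only the part of each response that is new, which implies that `stream_chat()` yields the full text generated so far on every step. The following sketch shows that incremental-printing idea in isolation. `iter_deltas` and `fake_stream` are hypothetical; the fake generator stands in for `model.stream_chat()`. Slicing by the previous length is used here instead of `str.replace`, as one way to avoid also removing a later repetition of the earlier text.

```python
from typing import Iterable, Iterator


def iter_deltas(cumulative: Iterable[str]) -> Iterator[str]:
    """Yield only the text appended since the previous cumulative response."""
    previous = ""
    for response in cumulative:
        if response.startswith(previous):
            yield response[len(previous):]
        else:
            # The stream rewrote earlier text; fall back to the full response.
            yield response
        previous = response


def fake_stream() -> Iterator[str]:
    # Hypothetical stand-in for model.stream_chat(); each item is the full text so far.
    yield from ["Try", "Try to", "Try to relax"]


for delta in iter_deltas(fake_stream()):
    # flush=True pushes each piece to the terminal immediately
    print(delta, end="", flush=True)
print()
```

Passing `flush=True` to `print` is an alternative to setting `PYTHONUNBUFFERED=1` as the README does; either should make the pieces appear as they are produced.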