Commit 8c36b5b
Add qwen2 example (#11252)
* Add GPU example for Qwen2
* Update comments in README
* Update README for Qwen2 GPU example
* Add CPU example for Qwen2 (sample output under README pending)
* Update generate.py and README for CPU Qwen2
* Update GPU example for Qwen2
* Small update
* Small fix
* Add Qwen2 table
* Update README for Qwen2 CPU and GPU (update sample output under README)

---------

Co-authored-by: Zijie Li <[email protected]>
1 parent 85df5e7 commit 8c36b5b

File tree

10 files changed: +788, -0 lines changed


README.md (+1)

@@ -169,6 +169,7 @@ Over 50 models have been optimized/verified on `ipex-llm`, including *LLaMA/LLaM
 | InternLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm) |
 | Qwen | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen) |
 | Qwen1.5 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen1.5) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5) |
+| Qwen2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2) |
 | Qwen-VL | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen-vl) |
 | Aquila | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila) |
 | Aquila2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila2) |

docs/readthedocs/source/index.rst (+7)

@@ -363,6 +363,13 @@ Verified Models
         <td>
           <a href="https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen1.5">link</a></td>
       </tr>
+      <tr>
+        <td>Qwen2</td>
+        <td>
+          <a href="https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen2">link</a></td>
+        <td>
+          <a href="https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen2">link</a></td>
+      </tr>
       <tr>
         <td>Qwen-VL</td>
         <td>

@@ -0,0 +1,83 @@
# Qwen2

In this directory, you will find examples of how to apply IPEX-LLM INT4 optimizations on Qwen2 models. For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) as a reference Qwen2 model.

## 0. Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Qwen2 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations.
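At its core, the example boils down to a handful of calls: load the model with `ipex_llm.transformers.AutoModelForCausalLM` (a drop-in replacement for the Hugging Face class) using `load_in_4bit=True`, apply the tokenizer's chat template, and call `generate()`. The snippet below is a condensed sketch of [generate.py](./generate.py); the model id, prompt, and token count are simply the example's defaults, and argument parsing and timing are omitted.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for the HF class

model_path = "Qwen/Qwen2-7B-Instruct"
# load_in_4bit=True converts the linear layers to INT4 while loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "AI是什么?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(inputs.input_ids, max_new_tokens=32)
# drop the prompt tokens so that only the newly generated text is printed
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```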
### 1. Install
We suggest using conda to manage the environment:

On Linux:

```bash
conda create -n llm python=3.11
conda activate llm

# install ipex-llm with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install transformers==4.37.0 # install a transformers version that supports Qwen2
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
pip install transformers==4.37.0
```

### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Qwen2 model to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'Qwen/Qwen2-7B-Instruct'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.

> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference.
>
> Please select the appropriate size of the Qwen2 model based on the capabilities of your machine.

#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```cmd
python ./generate.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information) and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py
```

#### 2.3 Sample Output
##### [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
AI是什么?
-------------------- Output --------------------
AI,即人工智能(Artificial Intelligence),是一种计算机科学领域,旨在开发能够模拟、延伸和增强人类智能的算法和系统。人工智能涉及许多
```

```log
Inference time: xxxx s
-------------------- Prompt --------------------
What is AI?
-------------------- Output --------------------
AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines that are programmed to think and learn like humans and mimic their actions. The term may
```
@@ -0,0 +1,80 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse
import numpy as np

from transformers import AutoTokenizer


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Qwen2-7B-Instruct')
    parser.add_argument('--repo-id-or-model-path', type=str, default="Qwen/Qwen2-7B-Instruct",
                        help='The huggingface repo id for the Qwen2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    from ipex_llm.transformers import AutoModelForCausalLM
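    # Load the model in 4 bit, which converts the relevant layers in the model into INT4 format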
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
                                                 trust_remote_code=True,
                                                 use_cache=True)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)

    prompt = args.prompt

    # Generate predicted tokens
    with torch.inference_mode():
        # The following code for generation is adapted from https://huggingface.co/Qwen/Qwen2-7B-Instruct#quickstart
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        model_inputs = tokenizer([text], return_tensors="pt")
        st = time.time()
        generated_ids = model.generate(
            model_inputs.input_ids,
            max_new_tokens=args.n_predict
        )
        end = time.time()
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(response)
@@ -0,0 +1,84 @@
# Qwen2

In this directory, you will find examples of how to use the IPEX-LLM `optimize_model` API to accelerate Qwen2 models. For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) as a reference Qwen2 model.

## Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Qwen2 model to predict the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations.
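The flow mirrors the standard Hugging Face workflow, with one extra `optimize_model` call after loading. The snippet below is a condensed sketch of [generate.py](./generate.py); the model id, prompt, and token count are simply the example's defaults, and argument parsing and timing are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model

model_path = "Qwen/Qwen2-7B-Instruct"
# load with the native transformers API first ...
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             torch_dtype='auto',
                                             low_cpu_mem_usage=True,
                                             use_cache=True)
# ... then let IPEX-LLM apply its low-bit (INT4) optimization
model = optimize_model(model)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "AI是什么?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(inputs.input_ids, max_new_tokens=32)
# drop the prompt tokens so that only the newly generated text is printed
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```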
### 1. Install
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://conda-forge.org/download/).

After installing conda, create a Python environment for IPEX-LLM:

On Linux:

```bash
conda create -n llm python=3.11 # recommend using Python 3.11
conda activate llm

# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install transformers==4.37.0 # install a transformers version that supports Qwen2
```

On Windows:

```cmd
conda create -n llm python=3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
pip install transformers==4.37.0
```

### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Qwen2 model to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'Qwen/Qwen2-7B-Instruct'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.

> **Note**: When loading the model in 4-bit, IPEX-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference.
>
> Please select the appropriate size of the Qwen2 model based on the capabilities of your machine.

#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```cmd
python ./generate.py --prompt 'What is AI?'
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information) and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set IPEX-LLM env variables
source ipex-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py --prompt 'What is AI?'
```

#### 2.3 Sample Output
##### [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
AI是什么?
-------------------- Output --------------------
AI,即人工智能(Artificial Intelligence),是一门研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术及应用系统的学科
```

```log
Inference time: xxxx s
-------------------- Prompt --------------------
What is AI?
-------------------- Output --------------------
AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence. These tasks may include learning from experience,
```
@@ -0,0 +1,82 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse
import numpy as np

from transformers import AutoTokenizer


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Qwen2-7B-Instruct')
    parser.add_argument('--repo-id-or-model-path', type=str, default="Qwen/Qwen2-7B-Instruct",
                        help='The huggingface repo id for the Qwen2 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

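    # Load the model with the native Hugging Face transformers API; IPEX-LLM optimization is applied below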
    from transformers import AutoModelForCausalLM
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 trust_remote_code=True,
                                                 torch_dtype='auto',
                                                 low_cpu_mem_usage=True,
                                                 use_cache=True)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)
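    # Apply IPEX-LLM low-bit optimization (INT4) to the loaded model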
    from ipex_llm import optimize_model
    model = optimize_model(model)

    prompt = args.prompt
    # Generate predicted tokens
    with torch.inference_mode():
        # The following code for generation is adapted from https://huggingface.co/Qwen/Qwen2-7B-Instruct#quickstart
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        model_inputs = tokenizer([text], return_tensors="pt")
        st = time.time()
        generated_ids = model.generate(
            model_inputs.input_ids,
            max_new_tokens=args.n_predict
        )
        end = time.time()
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(response)
