
Commit 105e124

optimize phi3-v encoder npu performance and add multimodal example (#11553)

* phi3-v
* readme

1 parent 70ab1a6 commit 105e124

File tree

4 files changed: +370 -0 lines changed

@@ -0,0 +1,75 @@
# Run Large Multimodal Model on Intel NPU

In this directory, you will find examples of how to apply IPEX-LLM INT4 or INT8 optimizations on Large Multimodal Models on [Intel NPUs](../../../README.md). See the table below for verified models.

## Verified Models

| Model | Model Link |
|------------|----------------------------------------------------------------|
| Phi-3-Vision | [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) |

## 0. Requirements
To run these examples with IPEX-LLM on Intel NPUs, make sure to install the latest Intel NPU driver.
Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html to download and unzip the driver.
Then go to **Device Manager** and find **Neural Processors** -> **Intel(R) AI Boost**.
Right-click it, select **Update Driver**, and then manually select the folder unzipped from the driver package.

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case of a Phi-3-Vision model predicting the next N tokens using the `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs. The core calls are sketched just below.
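In condensed form, the example boils down to the following calls (a sketch mirroring the full `generate.py` shown later in this commit; `your_image.jpg` is just a placeholder path):

```python
import torch
from PIL import Image
from transformers import AutoProcessor
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model_path = "microsoft/Phi-3-vision-128k-instruct"

# load_in_low_bit applies INT4 (or INT8) quantization to the relevant layers;
# the vision embedding is left unconverted, as in generate.py
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             load_in_low_bit="sym_int4",
                                             _attn_implementation="eager",
                                             modules_to_not_convert=["vision_embed_tokens"])
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<|image_1|>\nWhat is in the image?"}],
    tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [Image.open("your_image.jpg")], return_tensors="pt")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```
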
### 1. Install
#### 1.1 Installation on Windows
We suggest using conda to manage the environment:

```bash
conda create -n llm python=3.10 libuv
conda activate llm

# the command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# the command below installs intel_npu_acceleration_library
pip install intel-npu-acceleration-library==1.3

pip install transformers==4.40
```

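As an optional sanity check (assuming the packages expose the import names used in the example, i.e. `ipex_llm` and `intel_npu_acceleration_library`), you can confirm that everything imports correctly:

```cmd
python -c "import ipex_llm, intel_npu_acceleration_library, transformers; print(transformers.__version__)"
```
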
### 2. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 2.1 Configurations for Windows

**The following environment variables are required**:

```cmd
set BIGDL_USE_NPU=1
```

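If you prefer to keep the configuration inside the script, the same flag can be set from Python. This assumes (it is not documented here) that the flag only needs to be present in the process environment before `ipex_llm` is imported:

```python
import os

# assumed equivalent to `set BIGDL_USE_NPU=1`, provided it runs before ipex_llm is imported
os.environ["BIGDL_USE_NPU"] = "1"

from ipex_llm.transformers.npu_model import AutoModelForCausalLM  # import after setting the flag
```
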
### 3. Running examples

```cmd
python ./generate.py
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id for the Phi-3-vision model (e.g. `microsoft/Phi-3-vision-128k-instruct`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'microsoft/Phi-3-vision-128k-instruct'`; for more verified models, please see the list in [Verified Models](#verified-models).
- `--image-url-or-path IMAGE_URL_OR_PATH`: argument defining the image to be inferred. It defaults to `'http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is in the image?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
- `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It defaults to `sym_int4`; `sym_int8` can also be used. An example invocation with explicit arguments is shown below.

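For example, to run with every argument spelled out (the values here are simply the documented defaults):

```cmd
python ./generate.py --repo-id-or-model-path microsoft/Phi-3-vision-128k-instruct --image-url-or-path http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg --prompt "What is in the image?" --n-predict 32 --load_in_low_bit sym_int4
```
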
#### Sample Output
#### [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)

```log
Inference time: xxxx s
-------------------- Prompt --------------------
Message: [{'role': 'user', 'content': '<|image_1|>\nWhat is in the image?'}]
Image link/path: http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg
-------------------- Output --------------------


What is in the image?
The image shows a young girl holding a white teddy bear. She is wearing a pink dress with a heart on it. The background includes a stone
```

The sample input image (fetched from the [COCO dataset](https://cocodataset.org/#explore?id=264959)) is:

<a href="http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg"><img width=400px src="http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg" ></a>
@@ -0,0 +1,93 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import os
import time
import torch
import argparse
import requests

from PIL import Image
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoProcessor

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for phi-3 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="microsoft/Phi-3-vision-128k-instruct",
                        help='The huggingface repo id for the phi-3-vision model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--image-url-or-path', type=str,
                        default="http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg",
                        help='The URL or path to the image to infer')
    parser.add_argument('--prompt', type=str, default="What is in the image?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    parser.add_argument('--load_in_low_bit', type=str, default="sym_int4",
                        help='Load in low bit to use')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    image_path = args.image_url_or_path

    # Load the model in SYM_INT4,
    # which converts the relevant layers in the model into SYM_INT4 format.
    # You could also try `'sym_int8'` for INT8.
    # `_attn_implementation="eager"` is required for phi-3-vision.
    # `modules_to_not_convert=["vision_embed_tokens"]` and `model = model.half()` are for acceleration and are optional.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 trust_remote_code=True,
                                                 load_in_low_bit=args.load_in_low_bit,
                                                 _attn_implementation="eager",
                                                 modules_to_not_convert=["vision_embed_tokens"])

    # Load the processor
    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

    # The message formatting here follows https://huggingface.co/microsoft/Phi-3-vision-128k-instruct#sample-inference-code
    messages = [
        {"role": "user", "content": "<|image_1|>\n{prompt}".format(prompt=args.prompt)},
    ]
    prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    if os.path.exists(image_path):
        image = Image.open(image_path)
    else:
        image = Image.open(requests.get(image_path, stream=True).raw)

    # Generate predicted tokens
    with torch.inference_mode():
        # start inference
        st = time.time()

        inputs = processor(prompt, [image], return_tensors="pt")
        output = model.generate(**inputs,
                                eos_token_id=processor.tokenizer.eos_token_id,
                                num_beams=1,
                                do_sample=False,
                                max_new_tokens=args.n_predict,
                                temperature=0.0)
        end = time.time()
        print(f'Inference time: {end-st} s')
        output_str = processor.decode(output[0],
                                      skip_special_tokens=True,
                                      clean_up_tokenization_spaces=False)
        print('-'*20, 'Prompt', '-'*20)
        print(f'Message: {messages}')
        print(f'Image link/path: {image_path}')
        print('-'*20, 'Output', '-'*20)
        print(output_str)

python/llm/src/ipex_llm/transformers/npu_models/convert.py (+12 lines)

@@ -177,3 +177,15 @@ def optimize_llm(model: torch.nn.Module):
        model.apply(merge_mlp)

        convert_forward(model, module.MLP, baichuan_mlp_forward)

    elif model.config.model_type == "phi3_v":
        modeling_module_name = model.__class__.__module__
        module = importlib.import_module(modeling_module_name)
        from ipex_llm.transformers.npu_models.phi3_v import merge_qkv
        from ipex_llm.transformers.npu_models.phi3_v import phi3v_encoder_attention_forward
        from ipex_llm.transformers.npu_models.phi3_v import phi3v_model_forward
        model.apply(merge_qkv)

        from transformers.models.clip.modeling_clip import CLIPAttention
        convert_forward(model, CLIPAttention, phi3v_encoder_attention_forward)
        convert_forward(model, module.Phi3VModel, phi3v_model_forward)
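
For readers unfamiliar with the dispatch above: `convert_forward` follows the usual forward-patching pattern used throughout ipex-llm. A simplified sketch of the idea (illustrative only, not the library's exact implementation):

```python
import types
import torch


def convert_forward(model: torch.nn.Module, target_cls: type, new_forward) -> None:
    # Walk every submodule and rebind `forward` on instances of `target_cls`
    # so that they use the optimized implementation instead of the original one.
    for module in model.modules():
        if isinstance(module, target_cls):
            module.forward = types.MethodType(new_forward, module)
```
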
@@ -0,0 +1,190 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Some parts of this file are adapted from
# https://github.com/huggingface/transformers/blob/v4.40.0/src/transformers/models/llama/modeling_llama.py
# which is licensed under Apache License 2.0:
#
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import torch
import importlib
from torch import nn
from typing import Optional, Tuple, List
from transformers.models.clip.modeling_clip import CLIPAttention
from ipex_llm.utils.common.log4Error import invalidInputError


def merge_qkv(module: torch.nn.Module):
    if isinstance(module, CLIPAttention):
        new_weight = torch.cat([
            module.q_proj.weight.data,
            module.k_proj.weight.data,
            module.v_proj.weight.data,
        ], dim=0)

        if module.q_proj.bias is not None:
            qkv_proj = torch.nn.Linear(0, 0, bias=True)
            new_bias = torch.cat([
                module.q_proj.bias.data,
                module.k_proj.bias.data,
                module.v_proj.bias.data,
            ], dim=0)
            qkv_proj.bias = torch.nn.Parameter(new_bias, requires_grad=False)
        else:
            qkv_proj = torch.nn.Linear(0, 0, bias=False)
        qkv_proj.weight = torch.nn.Parameter(new_weight, requires_grad=False)
        qkv_proj.in_features = new_weight.size(1)
        qkv_proj.out_features = new_weight.size(0)
        module.qkv_proj = qkv_proj

        del module.q_proj, module.k_proj, module.v_proj


def phi3v_model_forward(
    self,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    pixel_values: Optional[torch.FloatTensor] = None,
    image_sizes: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
):
    # ipex-llm changes start
    from ipex_llm.transformers.kv import DynamicNormalCache
    # IPEX-LLM OPT: kv cache and quantize kv cache
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    if use_cache:
        if not isinstance(past_key_values, DynamicNormalCache):
            past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
    modeling_module_name = self.__class__.__module__
    module = importlib.import_module(modeling_module_name)
    return module.Phi3VModel.forward(
        self=self,
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        pixel_values=pixel_values,
        image_sizes=image_sizes,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )


def phi3v_encoder_attention_forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    causal_attention_mask: Optional[torch.Tensor] = None,
    output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    bsz, tgt_len, embed_dim = hidden_states.size()

    qkv = self.qkv_proj(hidden_states)
    qkv = qkv.view(bsz, tgt_len, self.num_heads * 3, self.head_dim)
    qkv = qkv.transpose(1, 2)
    query_states, key_states, value_states = qkv.split([self.num_heads,
                                                        self.num_heads,
                                                        self.num_heads], dim=1)

    proj_shape = (bsz * self.num_heads, -1, self.head_dim)
    query_states = query_states.reshape(*proj_shape)
    key_states = key_states.reshape(*proj_shape)
    value_states = value_states.reshape(*proj_shape)

    src_len = key_states.size(1)
    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))

    if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
        invalidInputError(
            False,
            f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)},"
            f" but is {attn_weights.size()}"
        )

    # apply the causal_attention_mask first
    if causal_attention_mask is not None:
        if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
            invalidInputError(
                False,
                f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is"
                f" {causal_attention_mask.size()}"
            )
        attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) \
            + causal_attention_mask
        attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

    if attention_mask is not None:
        if attention_mask.size() != (bsz, 1, tgt_len, src_len):
            invalidInputError(
                False,
                f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)},"
                f" but is {attention_mask.size()}"
            )
        attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
        attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

    attn_weights = nn.functional.softmax(attn_weights, dim=-1)

    if output_attentions:
        # this operation is a bit awkward, but it's required to
        # make sure that attn_weights keeps its gradient.
        # In order to do so, attn_weights have to be reshaped
        # twice and have to be reused in the following
        attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
        attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
    else:
        attn_weights_reshaped = None

    attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)

    attn_output = torch.bmm(attn_probs, value_states)

    if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
        invalidInputError(
            False,
            f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)},"
            f" but is {attn_output.size()}"
        )

    attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
    attn_output = attn_output.transpose(1, 2)
    attn_output = attn_output.reshape(bsz, tgt_len, embed_dim)

    attn_output = self.out_proj(attn_output)

    return attn_output, attn_weights_reshaped
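
As a quick illustration of what `merge_qkv` above does to a vision-encoder attention block, the following toy check builds a default-configured `CLIPAttention` (illustrative only, not the actual Phi-3-Vision encoder configuration) and inspects the fused projection:

```python
import torch
from transformers import CLIPVisionConfig
from transformers.models.clip.modeling_clip import CLIPAttention

from ipex_llm.transformers.npu_models.phi3_v import merge_qkv

attn = CLIPAttention(CLIPVisionConfig())   # hidden_size=768, 12 heads by default
merge_qkv(attn)                            # replaces q/k/v_proj with a single fused qkv_proj

print(attn.qkv_proj.weight.shape)          # torch.Size([2304, 768]) -> 3 * hidden_size
x = torch.randn(1, 16, 768)                # (batch, sequence, hidden)
print(attn.qkv_proj(x).shape)              # torch.Size([1, 16, 2304])
```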
