ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.
We often want to deploy generative AI models on edge devices and run them in environments with limited computing power or without a network connection. Quantizing the model makes this practical, and the quantized model can then be exported to either GGUF or ONNX format.
Microsoft Olive can convert a small language model (SLM) to a quantized ONNX format, and the conversion process is straightforward.
- Install the Microsoft Olive SDK
pip install olive-ai
pip install transformers
- Convert the model to ONNX format with CPU support
olive auto-opt --model_name_or_path Your Phi-4-mini location --output_path Your ONNX output location --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1
Note: this example runs the conversion on the CPU.
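For illustration only, here is the same command with hypothetical local paths filled in (replace them with your own model and output directories):

olive auto-opt --model_name_or_path ./Phi-4-mini-instruct --output_path ./phi-4-mini-onnx-cpu-int4 --device cpu --provider CPUExecutionProvider --precision int4 --use_model_builder --log_level 1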
- Install ONNX Runtime GenAI
pip install --pre onnxruntime-genai
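Because the generation API changed between releases, check which version you have installed before choosing one of the examples below, for example with:

pip show onnxruntime-genai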
- Python Code
This example targets ONNX Runtime GenAI 0.5.2:
import onnxruntime_genai as og
import numpy as np
import os

# Load the quantized ONNX model and its tokenizer
model_folder = "Your Phi-4-mini-onnx-cpu-int4 location"
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Generation settings
search_options = {}
search_options['max_length'] = 2048
search_options['past_present_share_buffer'] = False

# Build the prompt from the chat template
chat_template = "<|user|>\n{input}</s>\n<|assistant|>"
text = """Can you introduce yourself"""
prompt = f'{chat_template.format(input=text)}'
input_tokens = tokenizer.encode(prompt)

# Configure the generator (in 0.5.x the prompt tokens are set on the params object)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
params.input_ids = input_tokens
generator = og.Generator(model, params)

# Generate and stream tokens one at a time
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)
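If you plan to run several prompts, it can help to wrap the loop above in a small helper. The sketch below is only an illustration against the 0.5.2 API shown above; the generate function and its return value are hypothetical conveniences, not part of the library, and it reuses the model, tokenizer, and chat_template objects created earlier.

def generate(model, tokenizer, prompt, max_length=2048):
    # Encode the prompt and configure a fresh generator (0.5.x API)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length, past_present_share_buffer=False)
    params.input_ids = tokenizer.encode(prompt)
    generator = og.Generator(model, params)
    # Decode tokens with a streaming tokenizer and collect the pieces
    stream = tokenizer.create_stream()
    pieces = []
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        pieces.append(stream.decode(new_token))
    return ''.join(pieces)

print(generate(model, tokenizer, chat_template.format(input="What is ONNX?")))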
This example targets ONNX Runtime GenAI 0.6.0:
import onnxruntime_genai as og
import numpy as np
import os
import time
import psutil

# Load the quantized ONNX model and its tokenizer
model_folder = "Your Phi-4-mini-onnx model path"
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Generation settings
search_options = {}
search_options['max_length'] = 1024
search_options['past_present_share_buffer'] = False

# Build the prompt from the chat template
chat_template = "<|user|>{input}<|assistant|>"
text = """can you introduce yourself"""
prompt = f'{chat_template.format(input=text)}'
input_tokens = tokenizer.encode(prompt)

# Configure the generator (in 0.6.x the prompt tokens are appended to the generator)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

# Generate and stream tokens, measuring the time to the first token
token_count = 0
start_time = time.time()
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    token_text = tokenizer.decode(new_token)
    # print(tokenizer_stream.decode(new_token), end='', flush=True)
    if token_count == 0:
        first_token_time = time.time()
        first_response_latency = first_token_time - start_time
        print(f"First token latency: {first_response_latency:.4f} s")
    print(token_text, end='', flush=True)
    token_count += 1
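Once the loop finishes, you can also report the overall decoding speed. This is a small optional extension that only uses the token_count and start_time values defined above; tokens per second is simply the token count divided by the elapsed wall-clock time.

# After the generation loop: summarize overall decoding speed
total_time = time.time() - start_time
if token_count > 0 and total_time > 0:
    print(f"\nGenerated {token_count} tokens in {total_time:.2f} s ({token_count / total_time:.2f} tokens/s)")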