This tutorial demonstrates how to create an iOS application that runs the Phi-3 or Phi-4 model on-device using Apple's MLX framework, a machine learning framework optimized for Apple Silicon chips.
- macOS with Xcode 16 (or higher)
- a target device running iOS 18 (or higher) with at least 8 GB of RAM (an iPhone or iPad that meets the Apple Intelligence requirements, which are comparable to the memory requirements of the quantized Phi models)
- basic knowledge of Swift and SwiftUI
Start by creating a new iOS project in Xcode:
- launch Xcode and select "Create a new Xcode project"
- choose "App" as the template
- name your project (e.g., "Phi3-iOS-App") and select SwiftUI as the interface
- choose a location to save your project
Add the MLX Examples package which contains all the necessary dependencies and helpers for preloading models and performing inference:
// In Xcode: File > Add Package Dependencies
// URL: https://github.com/ml-explore/mlx-swift-examples
While the base MLX Swift package would be enough for core tensor operations and basic ML functionality, the MLX Examples package provides several additional components designed for working with language models and for making the inference process easier (a quick import check follows the list below):
- model loading utilities that handle downloading from Hugging Face
- tokenizer integration
- inference helpers for text generation
- pre-configured model definitions
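Once the package is added, the modules used throughout this tutorial should resolve in your app target. A quick check, using the module names the later code relies on:

import MLX          // core tensor, GPU, and memory APIs
import MLXLLM       // model registry and LLMModelFactory
import MLXLMCommon  // shared types such as ModelContainer and the generate helpers

// If these imports compile, the package products are correctly linked to the app target.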
To allow our app to download models and allocate sufficient memory, we need to add specific entitlements. Create an .entitlements
file for your app with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>com.apple.security.app-sandbox</key>
    <true/>
    <key>com.apple.security.files.user-selected.read-only</key>
    <true/>
    <key>com.apple.security.network.client</key>
    <true/>
    <key>com.apple.developer.kernel.increased-memory-limit</key>
    <true/>
</dict>
</plist>
Note: The com.apple.developer.kernel.increased-memory-limit entitlement is important for running larger models, as it allows the app to request more memory than normally permitted.
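Even with the entitlement in place, it is worth checking the device's memory budget before loading a model. A minimal sketch using ProcessInfo (the 6 GB threshold below is illustrative, not an official requirement):

import Foundation

// Log the device's physical memory so you can judge whether a 4-bit Phi model is likely to fit.
func logMemoryBudget() {
    let physicalMemoryGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    print(String(format: "Physical memory: %.1f GB", physicalMemoryGB))

    if physicalMemoryGB < 6 {
        // Illustrative threshold: quantized Phi models may exhaust memory on smaller devices.
        print("Warning: this device may not have enough memory for on-device inference.")
    }
}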
First, let's create a basic structure to represent our chat messages:
import SwiftUI

enum MessageState {
    case ok
    case waiting
}

struct ChatMessage: Identifiable {
    let id = UUID()
    let text: String
    let isUser: Bool
    let state: MessageState
}
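To make the intent of these types concrete, here is a small standalone sketch of the pattern the view model uses later: append a waiting placeholder for the assistant, then replace it in place once text arrives.

var messages: [ChatMessage] = []

// The user's question and an empty assistant placeholder in the waiting state
messages.append(ChatMessage(text: "What is MLX?", isUser: true, state: .ok))
messages.append(ChatMessage(text: "", isUser: false, state: .waiting))

// Once tokens start arriving, the placeholder is replaced in place
if let index = messages.lastIndex(where: { !$0.isUser }) {
    messages[index] = ChatMessage(text: "MLX is Apple's array framework...", isUser: false, state: .ok)
}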
Next, we'll create the PhiViewModel class that handles model loading and inference:
import MLX
import MLXLLM
import MLXLMCommon
import SwiftUI

@MainActor
class PhiViewModel: ObservableObject {
    @Published var isLoading: Bool = false
    @Published var isLoadingEngine: Bool = false
    @Published var messages: [ChatMessage] = []
    @Published var prompt: String = ""
    @Published var isReady: Bool = false

    private let maxTokens = 1024

    private var modelContainer: ModelContainer?
    func loadModel() async {
        DispatchQueue.main.async {
            self.isLoadingEngine = true
        }

        do {
            MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)

            // Phi 3.5 mini is preconfigured in Swift MLX Examples
            let modelConfig = ModelRegistry.phi3_5_4bit

            // Phi 4 mini can be pulled from Hugging Face, but requires referencing Swift MLX Examples from the main branch
            //let modelConfig = ModelConfiguration(
            //    id: "mlx-community/Phi-4-mini-instruct-4bit",
            //    defaultPrompt: "You are a helpful assistant.",
            //    extraEOSTokens: ["<|end|>"]
            //)

            print("Loading \(modelConfig.name)...")
            self.modelContainer = try await LLMModelFactory.shared.loadContainer(
                configuration: modelConfig
            ) { progress in
                print("Download progress: \(Int(progress.fractionCompleted * 100))%")
            }

            // Log model parameters
            if let container = self.modelContainer {
                let numParams = await container.perform { context in
                    context.model.numParameters()
                }
                print("Model loaded. Parameters: \(numParams / (1024*1024))M")
            }

            DispatchQueue.main.async {
                self.isLoadingEngine = false
                self.isReady = true
            }
        } catch {
            print("Failed to load model: \(error)")
            DispatchQueue.main.async {
                self.isLoadingEngine = false
            }
        }
    }
    func fetchAIResponse() async {
        guard !isLoading, let container = self.modelContainer else {
            print("Cannot generate: model not loaded or already processing")
            return
        }

        let userQuestion = prompt
        let currentMessages = self.messages

        DispatchQueue.main.async {
            self.isLoading = true
            self.prompt = ""
            self.messages.append(ChatMessage(text: userQuestion, isUser: true, state: .ok))
            self.messages.append(ChatMessage(text: "", isUser: false, state: .waiting))
        }

        do {
            let _ = try await container.perform { context in
                var messageHistory: [[String: String]] = [
                    ["role": "system", "content": "You are a helpful assistant."]
                ]

                for message in currentMessages {
                    let role = message.isUser ? "user" : "assistant"
                    messageHistory.append(["role": role, "content": message.text])
                }
                messageHistory.append(["role": "user", "content": userQuestion])

                let input = try await context.processor.prepare(
                    input: .init(messages: messageHistory))
                let startTime = Date()

                let result = try MLXLMCommon.generate(
                    input: input,
                    parameters: GenerateParameters(temperature: 0.6),
                    context: context
                ) { tokens in
                    let output = context.tokenizer.decode(tokens: tokens)
                    Task { @MainActor in
                        if let index = self.messages.lastIndex(where: { !$0.isUser }) {
                            self.messages[index] = ChatMessage(
                                text: output,
                                isUser: false,
                                state: .ok
                            )
                        }
                    }

                    if tokens.count >= self.maxTokens {
                        return .stop
                    } else {
                        return .more
                    }
                }

                let finalOutput = context.tokenizer.decode(tokens: result.tokens)
                Task { @MainActor in
                    if let index = self.messages.lastIndex(where: { !$0.isUser }) {
                        self.messages[index] = ChatMessage(
                            text: finalOutput,
                            isUser: false,
                            state: .ok
                        )
                    }
                    self.isLoading = false

                    print("Inference complete:")
                    print("Tokens: \(result.tokens.count)")
                    print("Tokens/second: \(result.tokensPerSecond)")
                    print("Time: \(Date().timeIntervalSince(startTime))s")
                }

                return result
            }
        } catch {
            print("Inference error: \(error)")
            DispatchQueue.main.async {
                if let index = self.messages.lastIndex(where: { !$0.isUser }) {
                    self.messages[index] = ChatMessage(
                        text: "Sorry, an error occurred: \(error.localizedDescription)",
                        isUser: false,
                        state: .ok
                    )
                }
                self.isLoading = false
            }
        }
    }
}
The ViewModel demonstrates the key MLX integration points:
- setting GPU cache limits with MLX.GPU.set(cacheLimit:) to optimize memory usage on mobile devices
- using LLMModelFactory to download the model on demand and initialize the MLX-optimized model
- accessing the model's parameters and structure through the ModelContainer
- leveraging MLX's token-by-token generation through the MLXLMCommon.generate method
- managing the inference process with appropriate temperature settings and token limits
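If you want to go beyond the fixed cache limit used above, MLX exposes a few memory knobs worth experimenting with. A hedged sketch, assuming the GPU.set(cacheLimit:) and GPU.snapshot() APIs exposed by current mlx-swift releases (property names may differ between versions):

import MLX

// Tighten the GPU buffer cache and inspect memory usage after a generation pass.
func tuneAndReportGPUMemory() {
    // A small cache returns freed buffers to the system sooner, at some speed cost.
    MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)

    // Snapshot property names as in current mlx-swift; verify against the version you pin.
    let snapshot = MLX.GPU.snapshot()
    print("Active memory: \(snapshot.activeMemory / (1024 * 1024)) MB")
    print("Cache memory:  \(snapshot.cacheMemory / (1024 * 1024)) MB")
    print("Peak memory:   \(snapshot.peakMemory / (1024 * 1024)) MB")
}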
The streaming token generation approach provides immediate feedback to users as the model generates text. This is similar to how server-based models function, as they stream the tokens back to the user, but without the latency of network requests.
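One practical refinement: the callback above fires for every generated token, and decoding plus publishing on each one can make the UI churn. A hedged sketch of throttling updates (the helper type and the interval of 4 tokens are illustrative, not part of the MLX API):

// Decide whether the UI should be refreshed for the current token count.
struct StreamThrottle {
    let updateEveryNTokens: Int

    func shouldUpdate(tokenCount: Int) -> Bool {
        tokenCount % updateEveryNTokens == 0
    }
}

// Inside the didGenerate callback shown earlier, decode and publish only occasionally:
// if StreamThrottle(updateEveryNTokens: 4).shouldUpdate(tokenCount: tokens.count) {
//     let output = context.tokenizer.decode(tokens: tokens)
//     ...update the placeholder message on the main actor...
// }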
In terms of UI interaction, the two key functions are loadModel(), which initializes the LLM, and fetchAIResponse(), which processes user input and generates AI responses.
Important: Phi models cannot be used with MLX in their original or GGUF formats. They must be converted to the MLX format, a task handled by the MLX community. You can find pre-converted models at huggingface.co/mlx-community.
The MLX Examples package includes pre-configured registrations for several models, including Phi-3. When you call ModelRegistry.phi3_5_4bit, it references a specific pre-converted MLX model that will be automatically downloaded:
static public let phi3_5_4bit = ModelConfiguration(
    id: "mlx-community/Phi-3.5-mini-instruct-4bit",
    defaultPrompt: "What is the gravity on Mars and the moon?",
    extraEOSTokens: ["<|end|>"]
)
You can create your own model configurations to point to any compatible model on Hugging Face. For example, to use Phi-4 mini instead, you could define your own configuration:
let phi4_mini_4bit = ModelConfiguration(
    id: "mlx-community/Phi-4-mini-instruct-4bit",
    defaultPrompt: "Explain quantum computing in simple terms.",
    extraEOSTokens: ["<|end|>"]
)

// Then use this configuration when loading the model
self.modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: phi4_mini_4bit
) { progress in
    print("Download progress: \(Int(progress.fractionCompleted * 100))%")
}
Note: Phi-4 support was added to the MLX Swift Examples repository at the end of February 2025 (in PR #216). As of March 2025, the latest official release (2.21.2 from December 2024) does not include built-in Phi-4 support. To use Phi-4 models, you'll need to reference the package directly from the main branch:
// In your Package.swift or via Xcode's package manager interface
.package(url: "https://github.com/ml-explore/mlx-swift-examples.git", branch: "main")
This gives you access to the latest model configurations, including Phi-4, before they're included in an official release. The same approach lets you pull in other versions of the Phi models, or any other model that has been converted to the MLX format.
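If you manage dependencies through a Package.swift (for example, for a shared module) rather than through Xcode's UI, a minimal sketch might look like the following. The module name PhiChatCore is hypothetical, and the product names are assumed from the current mlx-swift-examples layout, so verify them against the package manifest:

// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "PhiChatCore",                       // hypothetical module name
    platforms: [.iOS(.v16), .macOS(.v14)],
    dependencies: [
        .package(url: "https://github.com/ml-explore/mlx-swift-examples.git", branch: "main")
    ],
    targets: [
        .target(
            name: "PhiChatCore",
            dependencies: [
                .product(name: "MLXLLM", package: "mlx-swift-examples"),
                .product(name: "MLXLMCommon", package: "mlx-swift-examples")
            ]
        )
    ]
)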
Let's now implement a simple chat interface to interact with our view model:
import SwiftUI

struct ContentView: View {
    @StateObject var viewModel = PhiViewModel()

    var body: some View {
        NavigationStack {
            if !viewModel.isReady {
                Spacer()
                if viewModel.isLoadingEngine {
                    ProgressView()
                } else {
                    Button("Load model") {
                        Task {
                            await viewModel.loadModel()
                        }
                    }
                }
                Spacer()
            } else {
                VStack(spacing: 0) {
                    ScrollViewReader { proxy in
                        ScrollView {
                            VStack(alignment: .leading, spacing: 8) {
                                ForEach(viewModel.messages) { message in
                                    MessageView(message: message).padding(.bottom)
                                }
                            }
                            .id("wrapper")
                            .padding()
                        }
                        .onChange(of: viewModel.messages.last?.id, perform: { value in
                            if viewModel.isLoading {
                                proxy.scrollTo("wrapper", anchor: .bottom)
                            } else if let lastMessage = viewModel.messages.last {
                                proxy.scrollTo(lastMessage.id, anchor: .bottom)
                            }
                        })
                    }

                    HStack {
                        TextField("Type a question...", text: $viewModel.prompt, onCommit: {
                            Task {
                                await viewModel.fetchAIResponse()
                            }
                        })
                        .padding(10)
                        .background(Color.gray.opacity(0.2))
                        .cornerRadius(20)
                        .padding(.horizontal)

                        Button(action: {
                            Task {
                                await viewModel.fetchAIResponse()
                            }
                        }) {
                            Image(systemName: "paperplane.fill")
                                .font(.system(size: 24))
                                .foregroundColor(.blue)
                        }
                        .padding(.trailing)
                    }
                    .padding(.bottom)
                }
            }
        }.navigationTitle("Phi Sample")
    }
}
struct MessageView: View {
    let message: ChatMessage

    var body: some View {
        HStack {
            if message.isUser {
                Spacer()
                Text(message.text)
                    .padding()
                    .background(Color.blue)
                    .foregroundColor(.white)
                    .cornerRadius(10)
            } else {
                if message.state == .waiting {
                    TypingIndicatorView()
                } else {
                    VStack {
                        Text(message.text)
                            .padding()
                    }
                    .background(Color.gray.opacity(0.1))
                    .cornerRadius(10)
                    Spacer()
                }
            }
        }
        .padding(.horizontal)
    }
}
struct TypingIndicatorView: View {
    @State private var shouldAnimate = false

    var body: some View {
        HStack {
            ForEach(0..<3) { index in
                Circle()
                    .frame(width: 10, height: 10)
                    .foregroundColor(.gray)
                    .offset(y: shouldAnimate ? -5 : 0)
                    .animation(
                        Animation.easeInOut(duration: 0.5)
                            .repeatForever()
                            .delay(Double(index) * 0.2),
                        value: shouldAnimate
                    )
            }
        }
        .onAppear { shouldAnimate = true }
        .onDisappear { shouldAnimate = false }
    }
}
The UI consists of three main components that work together to create a basic chat interface. ContentView presents a two-state interface that shows either a loading button or the chat view, depending on model readiness. MessageView renders individual chat messages differently based on whether they are user messages (right-aligned, blue background) or Phi model responses (left-aligned, gray background). TypingIndicatorView provides a simple animated indicator to show that the AI is processing a response.
We are now ready to build and run the application.
Important! MLX does not support the iOS simulator. You must run the app on a physical device with an Apple Silicon chip.
When the app launches, tap the "Load model" button to download and initialize the Phi-3 (or, depending on your configuration, Phi-4) model. This process may take some time depending on your internet connection, as it involves downloading the model from Hugging Face. Our implementation includes only a spinner to indicate loading, but you can see the actual progress in the Xcode console.
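If you want more than a spinner, the progress callback already receives a Foundation Progress object, so it is straightforward to surface it. A hedged sketch of the changes (the downloadProgress property and the ProgressView binding are additions of this sketch, not part of the code above):

// In PhiViewModel: expose the download progress to the UI.
@Published var downloadProgress: Double = 0

// In loadModel(), publish the fraction instead of only printing it:
self.modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: modelConfig
) { progress in
    Task { @MainActor in
        self.downloadProgress = progress.fractionCompleted
    }
}

// In ContentView, replace the bare ProgressView() while the engine is loading:
// ProgressView(value: viewModel.downloadProgress)
//     .progressViewStyle(.linear)
//     .padding(.horizontal)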
Once loaded, you can interact with the model by typing questions in the text field and tapping the send button.
Here is how our application should behave, running on an iPad Air M1:
And that's it! By following these steps, you've created an iOS application that runs the Phi-3 (or Phi-4) model directly on device using Apple's MLX framework.
Congratulations!