How to deploy large language models on iOS and Android with ExecuTorch

Unsloth and ExecuTorch provide a workflow to fine-tune large language models with quantization-aware training (QAT) and run them locally on iOS and Android devices, including Qwen3 models on recent flagship phones. The tutorial walks through model training, export to a .pte file, and end-to-end deployment on both iPhone and Android, with detailed tooling setup and file-transfer steps.

The tutorial explains how Unsloth, TorchAO, and ExecuTorch can be combined to fine-tune large language models and deploy them directly on iOS and Android phones. The workflow is built around quantization-aware training so that models run efficiently on edge devices while preserving accuracy. The article notes that the same ExecuTorch technology is used inside Meta products to power experiences for billions of users on services such as Instagram and WhatsApp, and that the result is privacy-first, offline text generation with instant responses on the user's own device.

The guide focuses on Qwen3 models and demonstrates fine-tuning Qwen3-0.6B and exporting it for phone deployment using a free Google Colab notebook. It notes that Qwen3-0.6B runs locally on a Pixel 8 and an iPhone 15 Pro at roughly 40 tokens/s, and that applying quantization-aware training via TorchAO can recover about 70% of the accuracy otherwise lost to quantization. During training, users set full_finetuning = True and qat_scheme = "phone-deployment", which internally maps to qat_scheme = "int8-int4": linear layers simulate INT8 dynamic activation quantization with INT4 weight quantization through fake-quantization operations while computation stays in 16-bit precision. After fine-tuning, the model is exported as a qwen3_0.6B_model.pte file of around 472 MB; the fake-quantized training graph is converted into a genuinely quantized model that is smaller and more accurate than naive post-training quantization.
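The core idea behind the fake quantization mentioned above is a quantize-dequantize round trip: during training, weights are snapped onto a low-bit integer grid and immediately mapped back to floats, so the forward pass sees real quantization error while gradients still flow in high precision. The following is a minimal, framework-free sketch of symmetric per-tensor fake quantization (the function name and list-based representation are illustrative, not Unsloth's or TorchAO's API):

```python
def fake_quantize(values, num_bits=4):
    """Simulate symmetric integer quantization: round values onto a
    signed num_bits grid, then map them back to floats.
    The output has the same shape and dtype as the input but carries
    the quantization error that QAT lets the model adapt to."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 7 for INT4
    scale = max(abs(v) for v in values) / qmax or 1.0
    quantized = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]          # dequantize back to float

weights = [0.31, -0.92, 0.05, 0.77]
approx = fake_quantize(weights, num_bits=4)        # close to, not equal to, weights
```

In real QAT these round trips are inserted into the linear layers during training, while a straight-through estimator passes gradients through the non-differentiable rounding step.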

The iOS section walks through running the ExecuTorch demo app and loading the custom model on an iPhone. Users install Xcode from the Mac App Store, confirm Xcode 15 or later is present, complete the first-launch component installation, and then optionally enroll in the Apple Developer Program to unlock the increased-memory-limit capability needed for deployment to a physical iPhone. The article shows how to open the etLLM example project, run it in the iOS Simulator, and copy qwen3_0.6B_model.pte and tokenizer.json into a Qwen3test folder in the Files app so the simulator app can load them. For physical devices, it details connecting the iPhone over USB, trusting the Mac, configuring signing in Xcode, enabling the increased-memory-limit capability, turning on Developer Mode on the phone, and then using the Files tab in Finder to copy the .pte file and tokenizer.json into the app's shared storage before loading them through the app interface.
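As an alternative to dragging files through the Files app, the same simulator transfer can be scripted with xcrun simctl, which resolves a simulator app's on-disk data container. This is a sketch under stated assumptions: the bundle identifier below is a placeholder, and the Documents/Qwen3test destination mirrors the folder the tutorial creates; check the actual bundle ID in the example project's Xcode settings.

```shell
# Resolve the booted simulator's data container for the demo app.
# "com.example.etLLM" is a placeholder bundle ID; replace with the real one.
APP_DATA=$(xcrun simctl get_app_container booted com.example.etLLM data)

# Copy the exported model and tokenizer where the app can browse to them.
mkdir -p "$APP_DATA/Documents/Qwen3test"
cp qwen3_0.6B_model.pte tokenizer.json "$APP_DATA/Documents/Qwen3test/"
```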

The Android deployment guide shows how to build and install the ExecuTorch Llama demo app from source on a Linux or Mac development machine, without Android Studio. Java 17 must be installed and verified with java -version showing openjdk version "17.0.x". The guide then walks through installing the Android command-line tools, configuring the ANDROID_HOME, ANDROID_NDK, and PATH variables, and using sdkmanager to accept licenses and install "platforms;android-34", "platform-tools", "build-tools;34.0.0", and "ndk;25.0.8775105". It explains how to clone the executorch-examples repository, add a local.properties file with sdk.dir, patch deprecated getDetailedError() calls, and run ./gradlew :app:assembleDebug with JAVA_HOME pointing to OpenJDK 17 to produce app/build/outputs/apk/debug/app-debug.apk. It then covers installing the APK with adb install -r or by manually transferring app-debug.apk to the phone.
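The environment setup and build steps above can be sketched as a single shell session. The SDK root, JAVA_HOME path, and the checkout subdirectory are assumptions that vary by machine; the sdkmanager package names and the Gradle task come from the tutorial itself.

```shell
# Assumed SDK root; point this at wherever cmdline-tools were unpacked.
export ANDROID_HOME="$HOME/Android/sdk"
export ANDROID_NDK="$ANDROID_HOME/ndk/25.0.8775105"
export PATH="$ANDROID_HOME/cmdline-tools/latest/bin:$ANDROID_HOME/platform-tools:$PATH"

# Accept licenses, then install the exact versions the tutorial pins.
yes | sdkmanager --licenses
sdkmanager "platforms;android-34" "platform-tools" \
           "build-tools;34.0.0" "ndk;25.0.8775105"

# Build the debug APK with Java 17 (JAVA_HOME path is distro-specific).
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
cd executorch-examples   # enter the app module directory inside the checkout
./gradlew :app:assembleDebug
# Output: app/build/outputs/apk/debug/app-debug.apk

# Install over USB, replacing any previous build.
adb install -r app/build/outputs/apk/debug/app-debug.apk
```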

To use custom models in the Android app, the tutorial describes copying the model .pte and tokenizer.bin or tokenizer.model files to the device. For the generic file-picker flow, users move the files into a folder such as Downloads, open the LlamaDemo app, tap the settings gear, choose the model and tokenizer files, and tap Load Model to start chatting once loading completes. For the executorchllamademo app, which expects files in a protected directory, the article shows how to use adb devices to verify the connection, then adb shell mkdir -p /data/local/tmp/llama followed by adb shell chmod 777 /data/local/tmp/llama, adb shell ls -l /data/local/tmp/llama, and adb push commands to copy tokenizer.json and the model .pte file into that path. After launching the app, users select the model and tokenizer entries, optionally choose the matching model type such as Qwen3, and wait for the "successfully loaded model" message before starting an on-device chat session.
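The adb transfer steps above amount to the following sequence, which requires USB debugging enabled on the phone (the model filename matches the export from earlier in the tutorial):

```shell
# Confirm the phone is visible over USB debugging.
adb devices

# Create the staging directory the demo app reads from and open it up.
adb shell mkdir -p /data/local/tmp/llama
adb shell chmod 777 /data/local/tmp/llama

# Push the exported model and tokenizer, then verify they arrived.
adb push qwen3_0.6B_model.pte /data/local/tmp/llama/
adb push tokenizer.json /data/local/tmp/llama/
adb shell ls -l /data/local/tmp/llama
```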

Beyond Qwen3-0.6B, the documentation notes that all Qwen3 dense models (such as Qwen3-0.6B, Qwen3-4B, and Qwen3-32B) are supported, alongside all Gemma 3 models (such as Gemma3-270M, Gemma3-4B, and Gemma3-27B) and all Llama 3 models, including Llama 3.1 8B and Llama 3.3 70B Instruct, plus Qwen 2.5, Phi-4 Mini, and other variants. Users can customize the main Qwen3-0.6B Colab notebook to target any of these architectures for phone deployment. The article reiterates that ExecuTorch powers on-device machine learning for billions of people and supports hardware backends across Apple, Qualcomm, Arm, Meta Quest 3, and Ray-Ban smart glasses, positioning this workflow as a practical path to running advanced large language models natively on modern phones.
