How to deploy large language models on iOS and Android with Executorch

Unsloth and Executorch provide a workflow to fine-tune large language models with quantization-aware training and run them locally on iOS and Android devices, including Qwen3 models on recent flagship phones. The tutorial walks through model training, export to .pte, and end-to-end deployment on both iPhone and Android, with detailed tooling setup and file-transfer steps.

The tutorial explains how Unsloth, TorchAO, and Executorch can be combined to fine-tune large language models and deploy them directly on iOS and Android phones. The workflow is built around quantization-aware training so that models run efficiently on edge devices while preserving accuracy. The article highlights that the same Executorch technology powers experiences for billions of users inside Meta products such as Instagram and WhatsApp, and that users can achieve privacy-first, offline text generation with instant responses on their own devices.

The guide focuses on Qwen3 models and demonstrates fine-tuning Qwen3-0.6B and exporting it for phone deployment using a free Google Colab notebook. It notes that Qwen3-0.6B runs locally on a Pixel 8 and iPhone 15 Pro at ~40 tokens/s, and that applying quantization-aware training via TorchAO can recover roughly 70% of the accuracy otherwise lost to quantization. During training, users set full_finetuning = True and qat_scheme = "phone-deployment", which internally maps to qat_scheme = "int8-int4": fake-quantization operations simulate INT8 dynamic activation quantization with INT4 weight quantization for linear layers, while computations stay in 16-bit precision. After fine-tuning, the model is exported as a qwen3_0.6B_model.pte file of around 472 MB, then converted into a genuinely quantized version that is both smaller and more accurate than naive post-training quantization.
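As a sketch of what the int8-int4 scheme simulates (a standard symmetric fake-quantization formulation; the group sizes and rounding details TorchAO actually uses are assumptions here), each linear layer's weights and activations pass through round-and-clamp operations while remaining stored as 16-bit floats:

```latex
% INT4 weight fake quantization (symmetric, weight scale s_w):
\hat{w} \;=\; s_w \cdot \mathrm{clamp}\!\left(\mathrm{round}\!\left(\tfrac{w}{s_w}\right),\,-8,\,7\right)

% INT8 dynamic activation fake quantization; the scale s_x is
% recomputed from each activation tensor at runtime:
s_x \;=\; \frac{\max_i |x_i|}{127}, \qquad
\hat{x} \;=\; s_x \cdot \mathrm{clamp}\!\left(\mathrm{round}\!\left(\tfrac{x}{s_x}\right),\,-128,\,127\right)
```

Training against these rounded values lets the model adapt to the quantization error it will encounter after export, which is where the recovered accuracy comes from.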

The iOS section walks through running the Executorch demo app and loading the custom model on an iPhone. Users must install Xcode from the Mac App Store, ensure Xcode 15 or later is present, and complete the first-launch component installation, then optionally enroll in the Apple Developer Program to unlock the increased-memory-limit capability for deployment to a physical iPhone. The article shows how to open the etLLM example project, run it in the iOS simulator, and then copy qwen3_0.6B_model.pte and tokenizer.json into a Qwen3test folder in the Files app so the simulator app can load them. For physical devices, it details connecting an iPhone over USB, trusting the Mac, configuring signing in Xcode, enabling the increased-memory-limit capability, turning on Developer Mode on the phone, and then using Finder's Files tab to copy the .pte file and tokenizer.json into the app's shared storage before loading them through the app interface.

The Android deployment guide shows how to build and install the Executorch Llama demo app from source on a Linux or Mac development machine, without Android Studio. It specifies that Java 17 must be installed and verified with java -version output showing openjdk version "17.0.x", then walks through installing the Android command line tools, configuring the ANDROID_HOME, ANDROID_NDK, and PATH variables, and using sdkmanager to accept licenses and install "platforms;android-34", "platform-tools", "build-tools;34.0.0", and "ndk;25.0.8775105". The article explains how to clone the executorch-examples repository, add a local.properties file with sdk.dir, patch deprecated getDetailedError() calls, and run ./gradlew :app:assembleDebug with JAVA_HOME pointing to openjdk-17 to produce app/build/outputs/apk/debug/app-debug.apk. It then covers installing the APK with adb install -r or by manually transferring app-debug.apk to the phone.
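The environment setup and build steps above can be collected into a short script. The install paths below are assumptions; point ANDROID_HOME at wherever you unpacked the command line tools and JAVA_HOME at your machine's JDK 17:

```shell
# Assumed SDK location; adjust to your layout.
export ANDROID_HOME="$HOME/android-sdk"
export ANDROID_NDK="$ANDROID_HOME/ndk/25.0.8775105"
export PATH="$ANDROID_HOME/cmdline-tools/latest/bin:$ANDROID_HOME/platform-tools:$PATH"

# Accept licenses, then install the components the build expects.
yes | sdkmanager --licenses
sdkmanager "platforms;android-34" "platform-tools" \
           "build-tools;34.0.0" "ndk;25.0.8775105"

# Build the debug APK with Java 17 (JDK path is an assumption).
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
./gradlew :app:assembleDebug
# Output lands in app/build/outputs/apk/debug/app-debug.apk
```

Run it from the cloned executorch-examples app directory after adding local.properties with sdk.dir, as described above.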

To make the Android app usable with custom models, the tutorial describes copying the model .pte and tokenizer.bin or tokenizer.model files to the device. For the generic file-picker flow, users move the files into a folder such as Downloads, open the LlamaDemo app, tap the settings gear, choose the model and tokenizer files, and tap Load Model to start chatting once the load completes. For the executorchllamademo app, which expects files in a protected directory, the article shows how to use adb devices to verify a connection, then adb shell mkdir -p /data/local/tmp/llama followed by adb shell chmod 777 /data/local/tmp/llama, adb shell ls -l /data/local/tmp/llama, and adb push commands to copy tokenizer.json and the model .pte file into that path. After launching the app, users select the model and tokenizer entries, optionally choose the appropriate model type such as Qwen3, and wait for the "successfully loaded model" message before beginning an on-device chat session.
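The adb file-transfer sequence described above, collected into one copy-paste block (the .pte filename follows the tutorial's qwen3_0.6B_model.pte example; substitute your own exported model):

```shell
# Verify the phone is connected and authorized for debugging.
adb devices

# Create the directory the demo app reads from and open up permissions.
adb shell mkdir -p /data/local/tmp/llama
adb shell chmod 777 /data/local/tmp/llama

# Push the tokenizer and the exported model.
adb push tokenizer.json /data/local/tmp/llama/
adb push qwen3_0.6B_model.pte /data/local/tmp/llama/

# Confirm both files landed before launching the app.
adb shell ls -l /data/local/tmp/llama
```

This requires USB debugging enabled in the phone's Developer Options; the app's model and tokenizer pickers then point at /data/local/tmp/llama.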

Beyond Qwen3-0.6B, the documentation notes that all Qwen3 dense models are supported, such as Qwen3-0.6B, Qwen3-4B, and Qwen3-32B, alongside all Gemma 3 models such as Gemma3-270M, Gemma3-4B, and Gemma3-27B, and all Llama 3 models including Llama 3.1 8B and Llama 3.3 70B Instruct, plus Qwen 2.5, Phi 4 Mini, and other variants. Users can customize the main Qwen3-0.6B Colab notebook to target any of these architectures for phone deployment. The article reiterates that Executorch powers on-device machine learning experiences for billions of people and supports hardware backends across Apple, Qualcomm, Arm, Meta Quest 3, and Ray-Ban smart glasses, positioning this workflow as a practical path to run advanced large language models natively on modern phones.


Vals publishes public enterprise language model benchmarks

Vals lists a broad set of public enterprise benchmarks spanning law, finance, healthcare, math, education, academics, coding, and beta agent tasks. The index highlights which models currently lead specific enterprise-focused evaluations and how widely each benchmark has been tested.

MIT method spots overconfident Artificial Intelligence models

MIT researchers developed a way to detect when large language models are confidently wrong by comparing their answers with outputs from similar models. The combined uncertainty measure outperformed standard techniques across a range of tasks and may help reduce unreliable responses.

MEPs back delay for parts of Artificial Intelligence Act

European Parliament committees have endorsed targeted delays to parts of the Artificial Intelligence Act while adding a proposed ban on certain non-consensual image manipulation tools. The changes aim to give companies clearer deadlines, reduce overlap with other EU rules, and extend support to small mid-cap enterprises.

Publisher alliance seeks leverage over Artificial Intelligence web access

A new publisher coalition is trying to reshape how Artificial Intelligence companies access journalism by combining collective bargaining with tougher technical controls. The effort reflects growing pressure on Artificial Intelligence firms to pay for content used in training, search, and user-facing responses.

Military advantage in the age of algorithmic diffusion

American leadership in Artificial Intelligence research and infrastructure may not translate into lasting military advantage. Rapid diffusion of algorithms is shifting the contest toward compute, talent, and the speed of military adoption.
