Unsloth shows how to run Gemma 4 locally

Gemma 4 can run through Unsloth Studio, llama.cpp, MLX, and Ollama, with guidance for model selection, memory needs, and multimodal prompting.

Google DeepMind’s Gemma 4 family is presented as a set of open, Apache-2.0-licensed models that can run locally, with variants including E2B, E4B, 12B, 26B-A4B, and 31B. The models support more than 140 languages and multimodal inputs, come in dense and mixture-of-experts (MoE) designs, and offer context windows of up to 256K tokens. The smaller E2B and E4B are positioned for phones and laptops.

Unsloth recommends its Studio web UI for running and fine-tuning Gemma 4 on macOS, Windows, and Linux. Studio supports GGUF and MLX files, model search and download, tool calling, web search, code execution, and automatic tuning of inference parameters. The documentation also provides llama.cpp and Ollama workflows, plus recommended sampling defaults of temperature = 1.0, top_p = 0.95, and top_k = 64.
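For readers taking the llama.cpp route, the recommended sampling defaults can be passed on the command line. The sketch below is only an illustration: it assumes llama.cpp's `llama-cli` binary is installed and uses its `-m`, `--temp`, `--top-p`, and `--top-k` options. The helper name `build_llama_cli_args` and the GGUF file name are hypothetical and not taken from Unsloth's documentation.

```python
import shlex

# Sampling defaults quoted in the summary (temperature, top_p, top_k).
GEMMA4_SAMPLING = {"temp": 1.0, "top_p": 0.95, "top_k": 64}


def build_llama_cli_args(model_path, sampling=GEMMA4_SAMPLING):
    """Return an argv list for llama-cli using the given sampling settings."""
    return [
        "llama-cli",
        "-m", model_path,
        "--temp", str(sampling["temp"]),
        "--top-p", str(sampling["top_p"]),
        "--top-k", str(sampling["top_k"]),
    ]


if __name__ == "__main__":
    # Hypothetical file name; substitute the GGUF file you actually downloaded.
    print(shlex.join(build_llama_cli_args("gemma-4-E4B-it-Q4_K_M.gguf")))
```

Building the arguments as a list rather than a single string keeps the values easy to inspect and can be handed directly to `subprocess.run` if desired.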

Hardware guidance varies by model and quantization: Gemma-4-12B runs on 8GB of RAM at 4-bit or 14GB at 8-bit, Gemma-4-26B-A4B on 18GB at 4-bit or 28GB at 8-bit, and Gemma-4-31B on 20GB at 4-bit or 34GB at 8-bit. Multi-token prediction (MTP) support is described as enabling 1.4-2.2x faster inference without accuracy loss, and quantization-aware training (QAT) variants are said to reduce memory requirements by around 3x while preserving model quality.
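The memory figures above can be turned into a quick lookup for deciding what fits on a given machine. This is a rough sketch, not part of Unsloth's tooling: the function name `options_that_fit` is hypothetical, the numbers are copied from the guidance quoted here, and it assumes the quoted figure is the full requirement, leaving no extra headroom for the operating system or long contexts.

```python
# Approximate RAM needs in GB, copied from the figures quoted in the summary.
MEMORY_GB = {
    "Gemma-4-12B": {"4-bit": 8, "8-bit": 14},
    "Gemma-4-26B-A4B": {"4-bit": 18, "8-bit": 28},
    "Gemma-4-31B": {"4-bit": 20, "8-bit": 34},
}


def options_that_fit(available_gb):
    """Return (model, quantization, required_gb) tuples that fit in memory.

    Results are sorted with the largest memory requirement first.
    """
    fits = [
        (model, quant, need)
        for model, quants in MEMORY_GB.items()
        for quant, need in quants.items()
        if need <= available_gb
    ]
    return sorted(fits, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    for model, quant, need in options_that_fit(24):
        print(f"{model} ({quant}): about {need} GB")
```

With 24GB available, for example, the first entry returned is Gemma-4-31B at 4-bit, while the 8-bit builds of the two larger models are excluded.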
