Google DeepMind’s Gemma 4 family is presented as a set of open, Apache-2.0-licensed models that can run locally, with variants including E2B, E4B, 12B, 26B-A4B, and 31B. The models support 140+ languages and multimodal inputs, come in both dense and MoE designs, and offer context windows of up to 256K tokens. The smaller E2B and E4B variants are positioned for phones and laptops.
Unsloth recommends its Studio web UI for running and fine-tuning Gemma 4 on macOS, Windows, and Linux. Studio supports GGUF and MLX files, model search and download, tool calling, web search, code execution, and automatic tuning of inference parameters. The documentation also provides llama.cpp and Ollama workflows, and recommends sampling defaults of temperature = 1.0, top_p = 0.95, and top_k = 64.
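To make the recommended sampling defaults concrete, here is a minimal pure-Python sketch of what temperature, top_k, and top_p each do to a next-token distribution. It is an illustration only, not the code used by Studio, llama.cpp, or Ollama: the function name `sample_token` is hypothetical, and the order in which the filters are applied (temperature, then top-k, then top-p) is an assumption, since real runtimes may order them differently.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Pick a token index from raw logits (hypothetical helper, illustration only)."""
    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it; 1.0 leaves the logits unchanged.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable token indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p (nucleus): keep the shortest prefix of the ranked list whose
    # cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the survivors; random.choices normalises the weights itself.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With the defaults above, a distribution dominated by one token collapses to that token (its probability alone already exceeds 0.95), while a flatter distribution leaves up to 64 candidates in play.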
Hardware guidance varies by model and quantization: Gemma-4-12B needs 8GB of RAM at 4-bit or 14GB at 8-bit, Gemma-4-26B-A4B needs 18GB at 4-bit or 28GB at 8-bit, and Gemma-4-31B needs 20GB at 4-bit or 34GB at 8-bit. MTP (multi-token prediction) support is described as enabling 1.4x to 2.2x faster inference without accuracy loss, and QAT (quantization-aware training) variants are said to reduce memory requirements by around 3x while preserving model quality.
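The memory figures above can be turned into a small lookup. The sketch below is a hypothetical helper (the names `RAM_REQUIREMENTS_GB` and `models_that_fit` are not from the Unsloth documentation) that lists which model and quantization pairs fit in a given amount of RAM. It uses only the numbers quoted in this summary and assumes they are hard minimums, ignoring any extra headroom needed for long contexts.

```python
# RAM needed in GB, keyed by model name, then by quantization bit width.
# Values are the figures quoted in the summary above.
RAM_REQUIREMENTS_GB = {
    "Gemma-4-12B": {4: 8, 8: 14},
    "Gemma-4-26B-A4B": {4: 18, 8: 28},
    "Gemma-4-31B": {4: 20, 8: 34},
}


def models_that_fit(available_gb):
    """Return (model, bits) pairs whose quoted requirement fits in available_gb."""
    fits = []
    for model, by_bits in RAM_REQUIREMENTS_GB.items():
        for bits, needed_gb in by_bits.items():
            if needed_gb <= available_gb:
                fits.append((model, bits))
    return fits


if __name__ == "__main__":
    for ram in (8, 16, 32):
        print(ram, "GB:", models_that_fit(ram))
```

For example, a 16GB machine fits Gemma-4-12B at either quantization but none of the larger models, while 20GB is the first point at which all three models fit at 4-bit.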
