MiniMax-M2.5 is positioned as a new open large language model focused on coding, agentic tool use, search, and office workflows, with reported scores of 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. The model has 230B total parameters with 10B active, a 200K context window, and requires 457GB of memory in unquantized bf16 form. Unsloth Dynamic 3-bit GGUF quantization reduces the model to 101GB, a 62% reduction, and all uploads use Unsloth Dynamic 2.0, which upcasts selected important layers to 8- or 16-bit for better quality. Users can also fine-tune MiniMax-M2.5 through Unsloth with multi-GPU configurations.
The usage guidance centers on fitting the quantized model into local hardware while maintaining performance. The 3-bit dynamic quant UD-Q3_K_XL takes 101GB of disk space and suits a 128GB unified-memory Mac at roughly 20+ tokens/s; it runs faster, at 25+ tokens/s, on a setup with one 16GB GPU and 96GB of RAM. The 2-bit quants, including the largest, fit on a 96GB device, while users who want near-full precision are advised to use Q8_0 (8-bit), which takes 243GB and fits on a 256GB RAM device or Mac at 10+ tokens/s. For stable operation, total available memory (VRAM plus system RAM) should exceed the size of the chosen quantized model file; otherwise llama.cpp falls back to SSD or HDD offloading with much slower inference.
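The memory rule above can be sketched as a quick check. This is a sketch only: the function name and the simple greater-than heuristic are illustrative, not part of llama.cpp.

```python
def quant_fits_in_memory(quant_size_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """Rough heuristic (illustrative): a GGUF quant runs without SSD/HDD
    offload when combined VRAM + system RAM exceeds the model file size."""
    return vram_gb + ram_gb > quant_size_gb

# UD-Q3_K_XL (101GB) on a 16GB GPU + 96GB RAM box: 112GB > 101GB, so it fits.
print(quant_fits_in_memory(101, 16, 96))   # True
# Q8_0 (243GB) on the same box would fall back to disk offloading.
print(quant_fits_in_memory(243, 16, 96))   # False
```

In practice leave some headroom beyond the file size for the KV cache and runtime buffers, which grow with the chosen context length.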
MiniMax provides recommended decoding parameters to balance quality and diversity: temperature = 1.0, top_p = 0.95, and top_k = 40. Default settings for most tasks are summarized as temperature = 1.0, top_p = 0.95, top_k = 40, repeat penalty = 1.0 (or disabled), a maximum context window of 196,608 tokens, min_p = 0.01 (the default may be 0.05), and a default system prompt that sets the assistant's persona and name as MiniMax-M2.5 built by MiniMax. For running locally, the guide walks through building the latest llama.cpp with CMake, cloning from GitHub, and compiling targets such as llama-cli, llama-server, and llama-gguf-split. Users can load the model via llama-cli using a Hugging Face URI like unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL, configure a cache directory with the LLAMA_CACHE environment variable, and adjust flags including --ctx-size 16384, --flash-attn on, --temp 1.0, --top-p 0.95, --min-p 0.01, --top-k 40, --threads 32, and --n-gpu-layers 2 for GPU offloading (or 0 for CPU-only inference). The model weights can be downloaded with the Hugging Face CLI using patterns such as "*UD-Q3_K_XL*" or "*Q8_0*" to select the desired quant; the 3-bit UD-Q3_K_XL quant is highlighted as a practical default that fits in a 128GB RAM device.
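The flags above can be assembled into a single invocation; a minimal sketch (the helper name and defaults are illustrative; the flags and sampling values are the ones listed above, and the launch itself is commented out so the snippet runs standalone):

```python
import subprocess

def llama_cli_args(model_uri: str = "unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL",
                   n_gpu_layers: int = 2) -> list[str]:
    """Assemble a llama-cli command using MiniMax's recommended decoding
    parameters (temperature 1.0, top_p 0.95, min_p 0.01, top_k 40)."""
    return [
        "llama-cli",
        "-hf", model_uri,            # pull the quant directly from Hugging Face
        "--ctx-size", "16384",
        "--flash-attn", "on",
        "--temp", "1.0",
        "--top-p", "0.95",
        "--min-p", "0.01",
        "--top-k", "40",
        "--threads", "32",
        "--n-gpu-layers", str(n_gpu_layers),  # 0 for CPU-only inference
    ]

print(" ".join(llama_cli_args()))
# subprocess.run(llama_cli_args())  # uncomment to launch; set LLAMA_CACHE first
```

Setting the LLAMA_CACHE environment variable before launching controls where the downloaded GGUF shards are stored.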
For deployment in production-style environments, the guide recommends llama-server or an OpenAI-compatible completions API on top of llama.cpp. An example starts llama-server with the UD-Q3_K_XL checkpoint, assigns the alias "unsloth/MiniMax-M2.5", and sets the decoding parameters and context size, serving on port 8001. A corresponding Python snippet points the OpenAI client library at base_url = "http://127.0.0.1:8001/v1" with api_key = "sk-no-key-required" and calls chat.completions.create with model = "unsloth/MiniMax-M2.5" to generate content such as a Snake game. Benchmark information from an external 750-prompt suite indicates that Unsloth quantizations, including UD-Q4_K_XL, aim for a favorable quality-to-size tradeoff: the best variant, UD-Q4_K_XL, is described as only 6.0 points below the original with +22.8% more errors, while other Unsloth Q4 variants cluster around ~64.5-64.9 accuracy with ~33-35% more errors relative to the baseline.
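The client call described above can also be made with only the standard library; a sketch assuming a llama-server instance listening on port 8001 with the alias above (the network request is commented out so the payload builder runs standalone):

```python
import json
import urllib.request

def chat_request(prompt: str,
                 base_url: str = "http://127.0.0.1:8001/v1") -> urllib.request.Request:
    """Build an OpenAI-compatible chat.completions request for llama-server."""
    payload = {
        "model": "unsloth/MiniMax-M2.5",  # must match the alias given to llama-server
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # llama-server does not validate the key; any placeholder works
            "Authorization": "Bearer sk-no-key-required",
        },
    )

req = chat_request("Write a Snake game in Python.")
print(req.full_url)
# with urllib.request.urlopen(req) as resp:  # uncomment with a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

This mirrors what the OpenAI client library does under the hood, which is why pointing that library at the local base_url works unchanged.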
Official benchmark tables compare MiniMax-M2.5 to MiniMax-M2.1, Claude Opus 4.5, Claude Opus 4.6, Gemini 3 Pro, and GPT-5.2 (thinking) on a broad range of tasks. On AIME2, MiniMax-M2.5 scores 58.6 where the table lists 83.0 for MiniMax-M2.1, 91.0 for Claude Opus 4.5, 95.6 for Claude Opus 4.6, 96.0 for Gemini 3 Pro, and 98.0 for GPT-5.2 (thinking). On GPQA-D, MiniMax-M2.5 records 85.2 compared to 83.0 for MiniMax-M2.1 and 90.0 or 91.0 for the larger proprietary models, and on SciCode it has 44.4 versus 41.0 for MiniMax-M2.1 and 50.0 to 56.0 for competitors. IFBench shows MiniMax-M2.5 at 70.0, SWE-Bench Verified at 80.2, SWE-Bench Pro at 55.4, Terminal Bench 2 at 51.7, as well as 19.4 on HLE w/o tools and 51.3 on Multi-SWE-Bench. Additional entries include 74.1 on SWE-Bench Multilingual, 54.2 on VIBE-Pro (AVG), 76.3 on BrowseComp (w/ctx), 70.3 on Wide Search, 50.2 on RISE, 76.8 on BFCL multi-turn, 97.8 on τ² Telecom, 74.4 on MEWC, 59.0 on GDPval-MM, and 21.6 on Finance Modeling. Across these metrics, MiniMax-M2.5 is presented as competitive with or close to strong proprietary systems on several coding, reasoning, and domain benchmarks while remaining available as an open model that can be run locally in quantized GGUF form.
