MiniMax-M2.5 is positioned as a new open large language model focused on coding, agentic tool use, search, and office workflows, with reported scores of 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. The model has 230B total parameters with 10B active, a 200K context window, and requires 457GB of memory in unquantized bf16 form. Unsloth Dynamic 3-bit GGUF quantization reduces the model to 101GB, a 62% reduction, and all uploads use Unsloth Dynamic 2.0, which upcasts selected important layers to 8- or 16-bit for better quality. Users can also fine-tune MiniMax-M2.5 through Unsloth with multi-GPU configurations.
The usage guidance centers on fitting the quantized model into local hardware while maintaining performance. The 3-bit dynamic quant UD-Q3_K_XL takes 101GB of disk space and suits a 128GB unified-memory Mac at roughly 20+ tokens/s; it runs faster, at 25+ tokens/s, on a setup with one 16GB GPU and 96GB of RAM. The 2-bit quants, including the largest, fit on a 96GB device, while users who want near-full precision are advised to use Q8_0 (8-bit), which takes 243GB and fits on a 256GB RAM device or Mac at 10+ tokens/s. For stable operation, total available memory (VRAM plus system RAM) should exceed the size of the chosen quantized model file; otherwise llama.cpp falls back to SSD or HDD offloading with much slower inference.
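The memory rule above can be sketched as a quick check. This is a sketch only: the function name and the simple greater-than heuristic are illustrative, not part of llama.cpp.

```python
def quant_fits_in_memory(quant_size_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """Rough heuristic (illustrative): a GGUF quant runs without SSD/HDD
    offload when combined VRAM + system RAM exceeds the model file size."""
    return vram_gb + ram_gb > quant_size_gb

# UD-Q3_K_XL (101GB) on a 16GB GPU + 96GB RAM box: 112GB > 101GB, so it fits.
print(quant_fits_in_memory(101, 16, 96))   # True
# Q8_0 (243GB) on the same box would fall back to disk offloading.
print(quant_fits_in_memory(243, 16, 96))   # False
```

In practice leave some headroom beyond the file size for the KV cache and runtime buffers, which grow with the chosen context length.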
MiniMax provides recommended decoding parameters to balance quality and diversity: temperature = 1.0, top_p = 0.95, and top_k = 40. Default settings for most tasks are summarized as temperature = 1.0, top_p = 0.95, top_k = 40, repeat penalty = 1.0 (or disabled), a maximum context window of 196,608 tokens, min_p = 0.01 (the default may be 0.05), and a default system prompt that sets the assistant's persona and name as MiniMax-M2.5 built by MiniMax. For running locally, the guide walks through building the latest llama.cpp with CMake, cloning from GitHub, and compiling targets such as llama-cli, llama-server, and llama-gguf-split. Users can load the model via llama-cli using a Hugging Face URI like unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL, configure a cache directory with the LLAMA_CACHE environment variable, and adjust flags including --ctx-size 16384, --flash-attn on, --temp 1.0, --top-p 0.95, --min-p 0.01, --top-k 40, --threads 32, and --n-gpu-layers 2 for GPU offloading (or 0 for CPU-only inference). The model weights can be downloaded with the Hugging Face CLI using patterns such as "*UD-Q3_K_XL*" or "*Q8_0*" to select the desired quant; the 3-bit UD-Q3_K_XL quant is highlighted as a practical default that fits in a 128GB RAM device.
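The flags above can be assembled into a single invocation; a minimal sketch (the helper name and defaults are illustrative; the flags and sampling values are the ones listed above, and the launch itself is commented out so the snippet runs standalone):

```python
import subprocess

def llama_cli_args(model_uri: str = "unsloth/MiniMax-M2.5-GGUF:UD-Q3_K_XL",
                   n_gpu_layers: int = 2) -> list[str]:
    """Assemble a llama-cli command using MiniMax's recommended decoding
    parameters (temperature 1.0, top_p 0.95, min_p 0.01, top_k 40)."""
    return [
        "llama-cli",
        "-hf", model_uri,            # pull the quant directly from Hugging Face
        "--ctx-size", "16384",
        "--flash-attn", "on",
        "--temp", "1.0",
        "--top-p", "0.95",
        "--min-p", "0.01",
        "--top-k", "40",
        "--threads", "32",
        "--n-gpu-layers", str(n_gpu_layers),  # 0 for CPU-only inference
    ]

print(" ".join(llama_cli_args()))
# subprocess.run(llama_cli_args())  # uncomment to launch; set LLAMA_CACHE first
```

Setting the LLAMA_CACHE environment variable before launching controls where the downloaded GGUF shards are stored.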
For deployment in production-style environments, the guide recommends llama-server or an OpenAI-compatible completions API on top of llama.cpp. An example starts llama-server with the UD-Q3_K_XL checkpoint, assigns the alias "unsloth/MiniMax-M2.5", and sets the decoding parameters and context size, serving on port 8001. A corresponding Python snippet points the OpenAI client library at base_url = "http://127.0.0.1:8001/v1" with api_key = "sk-no-key-required" and calls chat.completions.create with model = "unsloth/MiniMax-M2.5" to generate content such as a Snake game. Benchmark information from an external 750-prompt suite indicates that Unsloth quantizations, including UD-Q4_K_XL, aim for a favorable quality-to-size tradeoff: the best variant, UD-Q4_K_XL, is described as only 6.0 points below the original with +22.8% more errors, while other Unsloth Q4 variants cluster around ~64.5-64.9 accuracy with ~33-35% more errors relative to the baseline.
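The client call described above can also be made with only the standard library; a sketch assuming a llama-server instance listening on port 8001 with the alias above (the network request is commented out so the payload builder runs standalone):

```python
import json
import urllib.request

def chat_request(prompt: str,
                 base_url: str = "http://127.0.0.1:8001/v1") -> urllib.request.Request:
    """Build an OpenAI-compatible chat.completions request for llama-server."""
    payload = {
        "model": "unsloth/MiniMax-M2.5",  # must match the alias given to llama-server
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # llama-server does not validate the key; any placeholder works
            "Authorization": "Bearer sk-no-key-required",
        },
    )

req = chat_request("Write a Snake game in Python.")
print(req.full_url)
# with urllib.request.urlopen(req) as resp:  # uncomment with a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

This mirrors what the OpenAI client library does under the hood, which is why pointing that library at the local base_url works unchanged.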
Official benchmark tables compare MiniMax-M2.5 to MiniMax-M2.1, Claude Opus 4.5, Claude Opus 4.6, Gemini 3 Pro, and GPT-5.2 (thinking) on a broad range of tasks. On AIME2, MiniMax-M2.5 scores 58.6 where the table lists 83.0 for MiniMax-M2.1, 91.0 for Claude Opus 4.5, 95.6 for Claude Opus 4.6, 96.0 for Gemini 3 Pro, and 98.0 for GPT-5.2 (thinking). On GPQA-D, MiniMax-M2.5 records 85.2 compared to 83.0 for MiniMax-M2.1 and 90.0 or 91.0 for the larger proprietary models, and on SciCode it has 44.4 versus 41.0 for MiniMax-M2.1 and 50.0 to 56.0 for competitors. IFBench shows MiniMax-M2.5 at 70.0, SWE-Bench Verified at 80.2, SWE-Bench Pro at 55.4, Terminal Bench 2 at 51.7, as well as 19.4 on HLE w/o tools and 51.3 on Multi-SWE-Bench. Additional entries include 74.1 on SWE-Bench Multilingual, 54.2 on VIBE-Pro (AVG), 76.3 on BrowseComp (w/ctx), 70.3 on Wide Search, 50.2 on RISE, 76.8 on BFCL multi-turn, 97.8 on τ² Telecom, 74.4 on MEWC, 59.0 on GDPval-MM, and 21.6 on Finance Modeling. Across these metrics, MiniMax-M2.5 is presented as competitive with or close to strong proprietary systems on several coding, reasoning, and domain benchmarks while remaining available as an open model that can be run locally in quantized GGUF form.
