Running local LLMs with Claude Code using Unsloth and llama.cpp

Claude Code can be pointed at locally hosted open models through Unsloth or llama.cpp, giving developers a way to run coding agents on their own machines. The setup depends on environment variables, local endpoints, and a key fix to avoid major performance loss from KV cache invalidation.

Claude Code can be configured to work with locally hosted open large language models through either Unsloth or llama.cpp. The guide centers on using open models such as Qwen3.5 and GLM-4.7-Flash, with Unsloth Dynamic GGUFs recommended for quantized deployments that aim to preserve accuracy. Claude Code is positioned as a terminal-based coding agent that understands codebases and Git workflows, while local hosting lets developers connect it to models running on macOS, Linux, Windows, or WSL.

Setup begins by redirecting Claude Code to a local server with environment variables. For llama.cpp deployments, ANTHROPIC_BASE_URL points to http://localhost:8001, while Unsloth uses http://localhost:8888 together with ANTHROPIC_AUTH_TOKEN and ANTHROPIC_MODEL. On first run, local setups may still require onboarding-related values in ~/.claude.json, and editor integrations for VS Code and Cursor can also be used. A key caveat is a Claude Code attribution header that invalidates the KV cache, making inference up to 90% slower with local models. The recommended fix is to set "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" under the env section of ~/.claude/settings.json, rather than relying on a temporary export alone.
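A minimal sketch of the two pieces above, assuming a llama.cpp server on port 8001. Note that it writes a fresh settings.json purely for illustration; in practice, merge the env entry into any existing file rather than overwriting it:

```shell
# Persist the attribution-header fix in ~/.claude/settings.json so it
# survives new shells (a temporary `export` alone is not recommended).
# This overwrites the file; merge into an existing settings.json instead.
mkdir -p "$HOME/.claude"
cat > "$HOME/.claude/settings.json" <<'EOF'
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}
EOF

# Point Claude Code at the local server for this session
export ANTHROPIC_BASE_URL="http://localhost:8001"  # llama.cpp (use 8888 for Unsloth)
```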

For Unsloth, the workflow is comparatively direct. Users install Unsloth, launch it with unsloth studio -H 0.0.0.0 -p 8888, copy an API key and model name from the interface, and then start Claude Code in the same terminal. Unsloth exposes an Anthropic-compatible /v1/messages endpoint and adds features beyond basic inference, including self-healing tool calling, web search, code execution in Python and Bash, and automatic tuning of inference parameters. It also supports both GGUF and safetensor models and is described as offering fast CPU and GPU inference through llama.cpp.
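The Unsloth workflow above can be sketched as the following session; the token and model values are placeholders, since the real ones are copied from the Unsloth Studio interface (this fragment launches a server, so it is not meant to run standalone):

```shell
# Launch Unsloth Studio, serving an Anthropic-compatible /v1/messages API
unsloth studio -H 0.0.0.0 -p 8888

# In the same terminal, point Claude Code at it. The token and model
# values below are hypothetical placeholders; copy the real ones from
# the Unsloth interface.
export ANTHROPIC_BASE_URL="http://localhost:8888"
export ANTHROPIC_AUTH_TOKEN="<api-key-from-unsloth>"
export ANTHROPIC_MODEL="<model-name-from-unsloth>"
claude
```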

For llama.cpp, the process is more manual but gives tighter control over serving. The instructions build llama.cpp with GPU bindings, download a quantized GGUF model from Hugging Face, and launch llama-server on port 8001. For Qwen3.5-35B-A3B, the recommended sampling parameters are --temp 0.6, --top-p 0.95, and --top-k 20, and the example deployment fits on a 24GB GPU such as an RTX 4090 (using about 23GB) with --ctx-size 131072. Qwen3.5-27B is suggested for users who want a smarter model or lack enough VRAM, though it runs roughly 2x slower than 35B-A3B. For GLM-4.7-Flash, the suggested parameters are --temp 1.0 and --top-p 0.95, with the same 24GB GPU guidance and --ctx-size 131072.
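Assembled from the parameters above, a llama-server invocation for Qwen3.5-35B-A3B might look like this sketch; the model filename is a hypothetical placeholder for whichever quantized GGUF was downloaded (a server launch, so not runnable standalone):

```shell
# Serve Qwen3.5-35B-A3B locally with the recommended sampling parameters
# (the .gguf filename below is a placeholder, not from the guide)
./llama-server \
  --model Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --port 8001 \
  --ctx-size 131072 \
  --temp 0.6 --top-p 0.95 --top-k 20
```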

Across both model families, KV cache settings are emphasized as important for quality and memory use. The setup uses --cache-type-k q8_0 --cache-type-v q8_0 to reduce VRAM consumption, while bf16 is presented as an alternative that raises memory use. For Qwen3.5, f16 KV cache is specifically discouraged because it can degrade accuracy. Both Qwen3.5 and GLM-4.7-Flash can also run with thinking disabled through chat template arguments to improve performance for agentic coding workloads.
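Combining the KV cache flags with the GLM-4.7-Flash parameters from earlier gives a sketch like the following; again, the model filename is a hypothetical placeholder, and the block launches a server rather than producing testable output:

```shell
# Serve GLM-4.7-Flash with a q8_0-quantized KV cache to cut VRAM use
# (bf16 would raise memory use; the .gguf filename is a placeholder)
./llama-server \
  --model GLM-4.7-Flash-Q4_K_M.gguf \
  --port 8001 \
  --ctx-size 131072 \
  --temp 1.0 --top-p 0.95 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```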


Chrome downloads Gemini Nano model locally without clear consent

Google Chrome is reported to download a 4 GB Gemini Nano model onto some PCs automatically when certain Artificial Intelligence features are active. The process happens without clear notice in browser settings and can repeat after the model is deleted.

AMD plans specialized EPYC CPUs for Artificial Intelligence, HPC, and cloud

AMD is preparing a broader EPYC strategy with task-specific server CPUs aimed at agentic Artificial Intelligence, HPC, training and inference, and cloud deployments. The shift starts with the Zen 6 generation and adds Verano as an Artificial Intelligence-focused variant within the same EPYC family.

Nvidia expands Spectrum-X Ethernet with open MRC protocol

Nvidia is positioning Spectrum-X Ethernet as a foundation for large-scale Artificial Intelligence training, with Multipath Reliable Connection adding open, multi-path RDMA transport for higher resilience and throughput. OpenAI, Microsoft and Oracle are among the organizations using the technology in large Artificial Intelligence environments.

Anthropic explores Fractile chips to diversify supply

Anthropic is reportedly in early talks with London-based Fractile to secure high-performance Artificial Intelligence chips for inference workloads. The move would reduce reliance on Nvidia and broaden the company’s hardware supply chain.
