Running local LLMs with Claude Code using Unsloth and llama.cpp

Claude Code can be pointed at locally hosted open models through Unsloth or llama.cpp, giving developers a way to run coding agents on their own machines. The setup depends on environment variables, local endpoints, and a key fix to avoid major performance loss from KV cache invalidation.

The guide centers on open models such as Qwen3.5 and GLM-4.7-Flash, served through either Unsloth or llama.cpp, and recommends Unsloth Dynamic GGUFs for quantized deployments that aim to preserve accuracy. Claude Code is described as a terminal-based coding agent that understands codebases and Git workflows, and local hosting lets developers connect it to models running on macOS, Linux, Windows, or WSL.

Setup begins by redirecting Claude Code to a local server with environment variables. For llama.cpp deployments, ANTHROPIC_BASE_URL points to http://localhost:8001, while Unsloth uses http://localhost:8888 together with ANTHROPIC_AUTH_TOKEN and ANTHROPIC_MODEL. On first run, local setups may still require onboarding-related values in ~/.claude.json. Editor integrations for VS Code and Cursor can also be used. A key caveat is that Claude Code sends an attribution header that invalidates the KV cache, which the guide says makes inference 90% slower with local models. The recommended fix is to add "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" to the env section of ~/.claude/settings.json, rather than relying on a temporary export alone.
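The settings change described above can be sketched in Python. This is a minimal illustration, not the guide's own tooling: the function names are hypothetical, and the only details taken from the text are the ~/.claude/settings.json path, the env section, and the CLAUDE_CODE_ATTRIBUTION_HEADER key with the string value "0".

```python
import json
from pathlib import Path


def with_attribution_header_disabled(settings: dict) -> dict:
    """Return a copy of the settings with the attribution header set to "0".

    Existing top-level keys and existing env entries are preserved.
    """
    patched = dict(settings)
    env = dict(patched.get("env") or {})
    env["CLAUDE_CODE_ATTRIBUTION_HEADER"] = "0"
    patched["env"] = env
    return patched


def patch_settings_file(path: Path) -> None:
    """Apply the change to a settings file, creating the file if it is absent."""
    current = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    patched = with_attribution_header_disabled(current)
    path.write_text(json.dumps(patched, indent=2) + "\n")
```

Calling patch_settings_file(Path.home() / ".claude" / "settings.json") would apply the change to the file the guide names. Back up the file first, since the sketch rewrites it in place.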

For Unsloth, the workflow is comparatively direct. Users install Unsloth, launch it with unsloth studio -H 0.0.0.0 -p 8888, copy an API key and model name from the interface, and then start Claude Code in the same terminal. Unsloth exposes an Anthropic-compatible /v1/messages endpoint and adds features beyond basic inference, including self-healing tool calling, web search, code execution in Python and Bash, and automatic tuning of inference parameters. It also supports both GGUF and safetensors models and is described as offering fast CPU and GPU inference through llama.cpp.
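As a rough sketch of the redirection step, the snippet below assembles the environment that Claude Code would be started with. The helper name and the placeholder key and model strings are hypothetical. The variable names and the two local URLs come from the text, and the real key and model name would be copied from the Unsloth interface.

```python
import os
from typing import Mapping, Optional


def claude_code_env(
    base_url: str,
    auth_token: Optional[str] = None,
    model: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build an environment that points Claude Code at a local server.

    base_env defaults to the current process environment, so PATH and
    similar variables are carried over unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url
    if auth_token is not None:
        env["ANTHROPIC_AUTH_TOKEN"] = auth_token
    if model is not None:
        env["ANTHROPIC_MODEL"] = model
    return env


# Unsloth on port 8888; both strings below are placeholders, not real values.
unsloth_env = claude_code_env(
    "http://localhost:8888",
    auth_token="PASTE_KEY_FROM_UNSLOTH",
    model="PASTE_MODEL_NAME_FROM_UNSLOTH",
)

# llama.cpp on port 8001; the text only names the base URL for this case.
llama_cpp_env = claude_code_env("http://localhost:8001")
```

Assuming the Claude Code CLI is installed as claude, passing one of these dictionaries as env to subprocess.run(["claude"], env=...) is equivalent to exporting the variables in the terminal before launching it.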

For llama.cpp, the process is more manual but gives tighter control over serving. The instructions build llama.cpp with GPU support, download a quantized GGUF model from Hugging Face, and launch llama-server on port 8001. For Qwen3.5-35B-A3B, the recommended sampling parameters are temp 0.6, top_p 0.95, and top_k 20, and the deployment example is described as fitting on a 24 GB GPU such as an RTX 4090 (using about 23 GB) with --ctx-size 131072. Qwen3.5-27B is suggested for users who want a smarter model or lack enough VRAM, although it is about 2x slower than 35B-A3B. For GLM-4.7-Flash, the suggested parameters are temp 1.0 and top_p 0.95, with the same 24 GB GPU guidance (about 23 GB used) and --ctx-size 131072.
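The launch parameters above can be collected into a llama-server command line, sketched here in Python. The builder function and the model paths are hypothetical placeholders. The port, context size, and sampling values are the ones quoted in the text, and the q8_0 cache flags are the ones the guide uses for the KV cache. The exact command in the original guide may include further options not shown here.

```python
import shlex
from typing import List, Optional


def llama_server_args(
    model_path: str,
    *,
    temp: float,
    top_p: float,
    top_k: Optional[int] = None,
    port: int = 8001,
    ctx_size: int = 131072,
    cache_type: str = "q8_0",
) -> List[str]:
    """Assemble an argument list for llama-server (nothing is executed here)."""
    args = [
        "llama-server",
        "--model", model_path,
        "--port", str(port),
        "--ctx-size", str(ctx_size),
        "--temp", str(temp),
        "--top-p", str(top_p),
    ]
    if top_k is not None:
        args += ["--top-k", str(top_k)]
    # Quantized KV cache for both keys and values, as in the guide.
    args += ["--cache-type-k", cache_type, "--cache-type-v", cache_type]
    return args


# Placeholder paths: substitute the GGUF files actually downloaded.
qwen_cmd = llama_server_args("path/to/qwen3.5-35b-a3b.gguf", temp=0.6, top_p=0.95, top_k=20)
glm_cmd = llama_server_args("path/to/glm-4.7-flash.gguf", temp=1.0, top_p=0.95)

print(shlex.join(qwen_cmd))
print(shlex.join(glm_cmd))
```

Returning a list rather than a single string keeps each argument separate, so it can be passed directly to subprocess.run without shell quoting concerns. shlex.join is used only to print a copy-pasteable form.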

Across both model families, KV cache settings are emphasized as important for quality and memory use. The setup uses --cache-type-k q8_0 --cache-type-v q8_0 to reduce VRAM consumption, while bf16 is presented as an alternative that raises memory use. For Qwen3.5, an f16 KV cache is specifically discouraged because it can degrade accuracy. Both Qwen3.5 and GLM-4.7-Flash can also run with thinking disabled through chat template arguments to improve performance for agentic coding workloads.
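To give a feel for why the cache type matters at a 131072-token context, here is a back-of-the-envelope estimate for a plain attention cache. The layer count, KV head count, and head dimension below are hypothetical and are not the real figures for Qwen3.5 or GLM-4.7-Flash. Models with hybrid or otherwise non-standard attention will not follow this formula exactly. The per-element sizes assume 2 bytes for f16 and bf16, and 34 bytes per block of 32 values for q8_0 (32 one-byte values plus a two-byte scale).

```python
# Approximate storage cost per cached element, in bytes.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "bf16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one 16-bit scale per block
}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_size: int, cache_type: str) -> float:
    """Estimate KV cache size for a standard attention stack.

    The factor of 2 accounts for storing both keys and values.
    """
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return elements_per_token * ctx_size * BYTES_PER_ELEMENT[cache_type]


# Hypothetical architecture, chosen only to make the arithmetic easy to follow.
shape = dict(n_layers=32, n_kv_heads=4, head_dim=128, ctx_size=131072)

for cache_type in ("bf16", "q8_0"):
    gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
```

For this made-up shape the estimate is 8.00 GiB with bf16 and 4.25 GiB with q8_0, which illustrates the roughly 47% saving that quantizing the cache to q8_0 provides under these assumptions.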
