Claude Code can be configured to work with locally hosted open large language models through either Unsloth or llama.cpp. The guide centers on using open models such as Qwen3.5 and GLM-4.7-Flash, with Unsloth Dynamic GGUFs recommended for quantized deployments that aim to preserve accuracy. Claude Code is positioned as a terminal-based coding agent that understands codebases and Git workflows, while local hosting lets developers connect it to models running on macOS, Linux, Windows, or WSL.
Setup begins by redirecting Claude Code to a local server with environment variables. For llama.cpp deployments, ANTHROPIC_BASE_URL is pointed to http://localhost:8001, while Unsloth uses http://localhost:8888 together with ANTHROPIC_AUTH_TOKEN and ANTHROPIC_MODEL. On first run, local setups may still require onboarding-related values in ~/.claude.json, and editor integrations for VS Code and Cursor can also be used. A key caveat is a Claude Code attribution header that invalidates the KV cache, making inference roughly 90% slower with local models. The recommended fix is to add "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" to the env section of ~/.claude/settings.json, rather than relying on a temporary export alone.
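A minimal sketch of that configuration, assuming a fresh ~/.claude/settings.json (merge the env block by hand if the file already exists):

```bash
# Point Claude Code at the local server (port 8001 for llama.cpp, 8888 for Unsloth).
export ANTHROPIC_BASE_URL="http://localhost:8001"

# Persist the attribution-header fix so the KV cache is not invalidated on every request.
mkdir -p ~/.claude
cat > ~/.claude/settings.json <<'EOF'
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}
EOF
```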
For Unsloth, the workflow is comparatively direct. Users install Unsloth, launch it with unsloth studio -H 0.0.0.0 -p 8888, copy an API key and model name from the interface, and then start Claude Code in the same terminal. Unsloth exposes an Anthropic-compatible /v1/messages endpoint and adds features beyond basic inference, including self-healing tool calling, web search, code execution in Python and Bash, and automatic tuning of inference parameters. It also supports both GGUF and safetensors models and is described as offering fast CPU and GPU inference through llama.cpp.
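A condensed sketch of that workflow; the API key and model name are placeholders copied from the interface, and the pip install command is an assumption about how Unsloth is installed:

```bash
pip install unsloth                      # install Unsloth (assumed install path)

unsloth studio -H 0.0.0.0 -p 8888 &      # launch the Studio server on port 8888

# Copy the API key and model name shown in the interface, then start Claude Code
# against the Anthropic-compatible /v1/messages endpoint.
export ANTHROPIC_BASE_URL="http://localhost:8888"
export ANTHROPIC_AUTH_TOKEN="<api-key-from-interface>"   # placeholder
export ANTHROPIC_MODEL="<model-name-from-interface>"     # placeholder
claude
```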
For llama.cpp, the process is more manual but gives tighter control over serving. The instructions build llama.cpp with GPU support, download a quantized GGUF model from Hugging Face, and launch llama-server on port 8001. For Qwen3.5-35B-A3B, the recommended sampling parameters are temperature 0.6, top_p 0.95, and top_k 20, and the example deployment fits on a 24GB GPU such as an RTX 4090 (using about 23GB) with --ctx-size 131072. Qwen3.5-27B is suggested for users who want a smarter model or lack enough VRAM, though it runs roughly 2x slower than 35B-A3B. For GLM-4.7-Flash, the suggested parameters are temperature 1.0 and top_p 0.95, with the same 24GB GPU guidance (RTX 4090, about 23GB used) and --ctx-size 131072.
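A sketch of the llama-server launch under those settings; the Hugging Face repository path and quantization tag are placeholders, since the exact Unsloth Dynamic GGUF repo is not named here:

```bash
# Build llama.cpp with CUDA support (assumes CMake and the CUDA toolkit are installed).
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j

# Serve a quantized GGUF on port 8001 with the recommended Qwen sampling parameters.
# The -hf repo below is a placeholder; substitute the actual Unsloth Dynamic GGUF repo.
./llama.cpp/build/bin/llama-server \
  -hf unsloth/SOME-MODEL-GGUF:Q4_K_XL \
  --port 8001 \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --temp 0.6 --top-p 0.95 --top-k 20
```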
Across both model families, KV cache settings are emphasized as important for quality and memory use. The setup uses --cache-type-k q8_0 --cache-type-v q8_0 to reduce VRAM consumption, while bf16 is presented as an alternative that raises memory use. For Qwen3.5, f16 KV cache is specifically discouraged because it can degrade accuracy. Both Qwen3.5 and GLM-4.7-Flash can also run with thinking disabled through chat template arguments to improve performance for agentic coding workloads.
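These options extend the llama-server command sketched above; the --chat-template-kwargs flag used to disable thinking is an assumption about recent llama.cpp builds, so verify it against your version's --help output:

```bash
# Quantized KV cache to cut VRAM use; bf16 uses more memory, and f16 is discouraged for Qwen.
# Some builds also require flash attention (-fa) for a quantized V cache.
./llama.cpp/build/bin/llama-server \
  -hf unsloth/SOME-MODEL-GGUF:Q4_K_XL \
  --port 8001 --ctx-size 131072 --n-gpu-layers 99 \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --chat-template-kwargs '{"enable_thinking": false}'
```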
