A Hacker News thread centers on IBM’s Granite 4.0 large language models and their hybrid Mamba/Transformer design, with commenters trading hands-on notes, links to official resources, and early impressions. Several participants pointed readers to IBM’s own announcement and a company explainer on the Mamba architecture, while noting that community write-ups can be more informative than high-level coverage. One commenter also highlighted IBM’s reference to ISO/IEC 42001 certification for AI management systems and asked what concrete practices that implies for product design and deployment.
Tooling and local inference support featured prominently. Contributors reported that support for Granite 4.0’s hybrid architecture landed in llama.cpp earlier this year, and that Ollama’s engine uses GGML directly while falling back to llama.cpp for models it does not yet support. Unsloth released dynamic GGUF conversions for a 32-billion-parameter mixture-of-experts variant and shared a support-agent fine-tuning notebook. Users tested local setups across LM Studio, Vulkan and ROCm backends, and different quantizations, with one noting that switching to ROCm resolved a GPU loading issue. Another user tried an Ollama package that downloaded at roughly 1.9 GB and ran quickly, though without Mamba components and with a default context window well below the claimed maximum.
Early performance anecdotes were mixed. A practitioner running a quantized build of the 32-billion-parameter mixture-of-experts variant reported a roughly 19 GB footprint, growing to about 20 GB at a 100,000-token context, consuming about 26 GB of VRAM in one runtime and 22 GB in another, and generating around 30 tokens per second. The same commenter judged coding ability to be underwhelming in initial tests, later citing third-party dashboards showing approximately 25.1 percent on LiveCodeBench, 2 percent on a terminal-use benchmark, and 16 percent on a coding index for one Granite 4.0 variant. Elsewhere in the thread, developers asked for head-to-head comparisons with leading closed models and noted that third-party benchmarks to date appear less favorable than vendor materials.
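The reported footprint is roughly consistent with simple arithmetic: a quantized model’s on-disk size is approximately parameter count times bits per weight, plus some overhead for embeddings, quantization scales, and metadata. A minimal sketch, assuming a 32B model quantized near 4.5–5 bits per weight and a hypothetical 5 percent overhead factor (both are illustrative assumptions, not figures from the thread):

```python
def quantized_size_gb(n_params_billion: float,
                      bits_per_weight: float,
                      overhead: float = 1.05) -> float:
    """Rough on-disk size of a quantized model in GB.

    n_params_billion: total parameter count in billions.
    bits_per_weight: average bits per weight after quantization.
    overhead: multiplier for scales, embeddings, and metadata (assumed).
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A 32B model at ~4.6 bits per weight lands close to the ~19 GB
# figure reported in the thread.
print(round(quantized_size_gb(32, 4.6), 1))
```

The small gap between the 19 GB base size and roughly 20 GB at a 100,000-token context also fits the hybrid design’s pitch: Mamba layers carry fixed-size state rather than a per-token KV cache, so long contexts add far less memory than in a pure Transformer.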
The conversation also placed Granite 4.0 in the broader context of long-context and hybrid architectures. Commenters referenced other systems, including a model with a 256,000-token context window and a newly released model that slows markedly beyond 40,000 tokens. Some expressed caution rooted in past experiences with IBM’s AI offerings and skepticism about marketing claims, while others praised the pace of open tooling and the ability to run models locally. Overall, the thread captures an active, early-stage evaluation: strong momentum in ecosystem support, clear enterprise and governance positioning, and a wait-and-see posture on independent benchmarks and real-world task performance.
