Qwen QwQ-32B reasoning model overview

Qwen positions QwQ-32B as a medium-sized reasoning model built for stronger downstream performance on difficult tasks. The release highlights architecture details, deployment guidance, and recommended inference settings for long-context and multi-turn use.

Qwen presents QwQ-32B as the reasoning model in the Qwen series, designed to outperform conventional instruction-tuned models on hard downstream tasks through stronger thinking and reasoning capabilities. It is described as a medium-sized reasoning model with competitive performance against state-of-the-art reasoning models, including DeepSeek-R1 and o1-mini. The model is a causal language model trained through pretraining and post-training, including supervised finetuning and reinforcement learning.

The technical profile lists a transformer architecture with RoPE, SwiGLU, RMSNorm, and attention QKV bias; 32.5B parameters (31.0B non-embedding); 64 layers; 40 attention heads for Q and 8 for KV (GQA); and a full context length of 131,072 tokens, with YaRN required for prompts exceeding 8,192 tokens. Qwen also notes that QwQ is based on Qwen2.5 and recommends using the latest version of transformers, warning that with transformers<4.37.0, users will encounter the error KeyError: 'qwen2'.
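As a rough illustration of these requirements, the sketch below loads the model with a recent transformers release (at or above 4.37.0, which adds the qwen2 model type); the model ID "Qwen/QwQ-32B" and the dtype/device settings are assumptions based on common Hugging Face usage rather than details quoted from the release notes.

```python
# Sketch: loading QwQ-32B with a recent transformers release (>= 4.37.0).
# The model ID "Qwen/QwQ-32B" is assumed from Hugging Face hub naming.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # pick a dtype appropriate for the hardware
    device_map="auto",    # shard the 32.5B parameters across available devices
)
```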

Qwen recommends several inference settings to improve output quality and reduce repetition. The model's output should begin with "<think>\n" to avoid generating empty thinking content, a behavior already handled when apply_chat_template is used with add_generation_prompt=True. For sampling, Qwen recommends Temperature=0.6, TopP=0.95, and MinP=0 instead of greedy decoding to avoid endless repetitions, with TopK between 20 and 40 to filter out rare tokens while maintaining output diversity. For supported frameworks, `presence_penalty` can be adjusted between 0 and 2, though higher values may introduce language mixing and a slight drop in performance.
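A minimal sketch of how these settings could be passed to a transformers generation call, assuming the model and tokenizer loaded above; the prompt and token budget are illustrative, and the TopK value of 30 is just one point in the recommended range.

```python
# Sketch: generation with the recommended sampling settings.
# Assumes `model` and `tokenizer` from the loading example above.
prompt = "How many prime numbers are there below 100?"  # illustrative prompt
messages = [{"role": "user", "content": prompt}]

# With add_generation_prompt=True, the chat template starts the model's turn
# so that it does not produce empty thinking content.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=4096,   # illustrative budget; long reasoning may need more
    do_sample=True,        # sampling rather than greedy decoding
    temperature=0.6,
    top_p=0.95,
    top_k=30,              # anywhere in the recommended 20-40 range
    min_p=0.0,
)
response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True
)
print(response)
```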

For multi-turn conversations, the conversation history should include only the model's final outputs and exclude thinking content, which apply_chat_template already implements. Qwen also recommends prompt standardization for benchmarking, including step-by-step reasoning with a boxed final answer for math problems and a fixed JSON answer field for multiple-choice tasks. For inputs exceeding 8,192 tokens, YaRN can be enabled through rope_scaling with "factor": 4.0 and "original_max_position_embeddings": 32768 (see the sketch below). Qwen recommends vLLM for deployment, while noting that current vLLM support is limited to static YaRN, which may affect performance on shorter texts.
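As a concrete illustration of the YaRN setting, the snippet below patches a local copy of the model's config.json with the rope_scaling values mentioned above; the "type": "yarn" key and the local path are assumptions based on how Qwen2.5-family configs are typically extended, not details stated in the text.

```python
# Sketch: enabling YaRN for prompts longer than 8,192 tokens by adding a
# rope_scaling block to a locally downloaded config.json.
import json
from pathlib import Path

config_path = Path("QwQ-32B/config.json")  # hypothetical local checkout
config = json.loads(config_path.read_text())

config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",   # assumed key; follows Qwen2.5-style YaRN configuration
}

config_path.write_text(json.dumps(config, indent=2))
```

Because vLLM currently applies YaRN scaling statically regardless of input length, one practical approach is to add the rope_scaling block only when workloads routinely exceed 8,192 tokens and keep the unscaled config for shorter texts.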

Impact Score: 50

SK Group warns DRAM shortages could curb memory use

SK Group chairman Chey Tae-won warned that customers may reduce memory consumption through infrastructure and software optimization if DRAM suppliers fail to raise output. Demand from Artificial Intelligence data centers is keeping the market tight as memory makers weigh expansion against the long timelines for new fabs.

BitUnlocker bypasses TPM-only Windows 11 BitLocker

Intrinsec disclosed BitUnlocker, a downgrade attack that can bypass TPM-only Windows 11 BitLocker protections with physical access to a machine. The technique abuses a flaw in Windows recovery and deployment components and relies on older trusted boot code.

Micron samples 256 GB DDR5 9200 MT/s RDIMM server modules

Micron has begun sampling 256 GB DDR5 RDIMM server modules built on its 1-gamma technology to key ecosystem partners. The company positions the new modules as a higher-speed, more power-efficient option for scaling next-generation Artificial Intelligence and HPC infrastructure.
