Qwen presents QwQ-32B as the reasoning model in the Qwen series, designed to outperform conventional instruction-tuned models on hard downstream tasks through stronger thinking and reasoning capabilities. It is described as a medium-sized reasoning model with competitive performance against state-of-the-art reasoning models, including DeepSeek-R1 and o1-mini. The model is a causal language model trained through pretraining and post-training, including supervised finetuning and reinforcement learning.
The technical profile includes a transformer architecture with RoPE, SwiGLU, RMSNorm, and attention QKV bias. Number of Parameters: 32.5B. Number of Parameters (Non-Embedding): 31.0B. Number of Layers: 64. Number of Attention Heads (GQA): 40 for Q and 8 for KV. Context Length: full 131,072 tokens; for prompts exceeding 8,192 tokens, YaRN must be enabled. Qwen also notes that QwQ is based on Qwen2.5 and recommends using the latest version of transformers, warning that with transformers<4.37.0 users will encounter the error KeyError: 'qwen2'.
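The YaRN requirement for long prompts is enabled through the model's config.json. A minimal sketch of the relevant entry, assuming the `"type": "yarn"` key used by Qwen2.5-family configs (the surrounding `config` dict here is illustrative, not a complete model config):

```python
# Sketch: enabling YaRN for prompts longer than 8,192 tokens by adding a
# rope_scaling block to the model's config.json. The "type": "yarn" key is
# an assumption based on how Qwen2.5-family configs declare YaRN scaling.
import json

rope_scaling = {
    "factor": 4.0,                              # 4x extension of the base window
    "original_max_position_embeddings": 32768,  # pre-scaling context length
    "type": "yarn",
}

# Illustrative partial config; a real config.json has many more fields.
config = {"max_position_embeddings": 131072}
config["rope_scaling"] = rope_scaling
print(json.dumps(config, indent=2))
```

With factor 4.0 over a 32,768-token base, this matches the full 131,072-token context length stated above.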
Qwen recommends several inference settings to improve output quality and reduce repetition. The model's output should begin with "<think>\n" to avoid empty thinking content, a behavior already handled when apply_chat_template is used with add_generation_prompt=True. Sampling Parameters: use Temperature=0.6, TopP=0.95, and MinP=0 instead of greedy decoding to avoid endless repetitions. Use TopK between 20 and 40 to filter out rare token occurrences while maintaining output diversity. For supported frameworks, `presence_penalty` can be adjusted between 0 and 2, though higher values may introduce language mixing and a slight drop in performance.
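For pipelines that render prompts without apply_chat_template, the think-prefix rule above can be enforced with a small helper; a minimal sketch (`ensure_think_prefix` is a hypothetical name, and the example prompt fragment is illustrative):

```python
# Sketch: ensure the rendered prompt ends by opening the thinking block with
# "<think>\n", mirroring what apply_chat_template(add_generation_prompt=True)
# does automatically. ensure_think_prefix is a hypothetical helper name.
THINK_OPEN = "<think>\n"

def ensure_think_prefix(prompt: str) -> str:
    """Append the opening think tag if the prompt does not already end with it."""
    if prompt.rstrip().endswith(THINK_OPEN.strip()):
        return prompt
    return prompt + THINK_OPEN

# Illustrative tail of a hand-rendered chat prompt.
rendered = "<|im_start|>assistant\n"
print(ensure_think_prefix(rendered))
```

The check is idempotent, so it is safe to apply to prompts that already end with the opening tag.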
For multi-turn conversations, historical outputs should include only the final output and exclude thinking content, which is already implemented in apply_chat_template. Qwen also recommends prompt standardization for benchmarking, including step-by-step reasoning with a boxed final answer for math problems and a fixed JSON answer field for multiple-choice tasks. For inputs exceeding 8,192 tokens, YaRN can be enabled through rope_scaling with "factor": 4.0 and "original_max_position_embeddings": 32768. Qwen recommends vLLM for deployment, while noting that current vLLM support is limited to static YaRN, which may affect performance on shorter texts.
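For custom pipelines that manage history themselves rather than relying on apply_chat_template, the multi-turn rule above can be sketched as a small filter, assuming reasoning is wrapped in <think>…</think> tags (the helper name and sample text are illustrative):

```python
# Sketch: keep only the final answer in conversation history by stripping
# <think>...</think> spans from assistant turns. strip_thinking is a
# hypothetical helper name; apply_chat_template does this automatically.
import re

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_thinking(assistant_text: str) -> str:
    """Remove thinking spans so only the final output enters the history."""
    return THINK_RE.sub("", assistant_text).strip()

raw = "<think>\nLet me work through this step by step...\n</think>\n\nThe answer is 4."
history_entry = {"role": "assistant", "content": strip_thinking(raw)}
print(history_entry["content"])  # -> The answer is 4.
```

Dropping the thinking content keeps the context window free for new turns, which matters given the 8,192-token threshold at which YaRN becomes necessary.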
