As 2025 closes, large language model development is defined less by raw scale and more by reasoning, reinforcement learning, and smarter training and deployment pipelines. Sebastian Raschka identifies DeepSeek R1, with its Reinforcement Learning with Verifiable Rewards (RLVR) approach implemented via the GRPO algorithm, as the year's central breakthrough, because it showed that reasoning-like behavior can be induced in models through post-training on verifiable tasks such as math and code. The DeepSeek V3 and R1 work also challenged assumptions about training costs, with estimates closer to $5 million than $50 or $500 million for V3, plus roughly $294,000 to train R1 on top of it, though Raschka acknowledges these figures exclude salaries and extensive experimentation. He summarizes the field's annual focus shift: RLHF + PPO in 2022, LoRA-based SFT in 2023, mid-training in 2024, and RLVR + GRPO in 2025. For 2026 he predicts RLVR extensions and more inference-time scaling, followed by continual learning in 2027.
Raschka explains that RLVR's key advantage is its use of deterministic, verifiable labels, which let models learn complex problem-solving at scale without the bottlenecks of human-written responses or preference labels. He notes growing interest in process reward models that score not just final answers but the explanations behind them, and he expects explanation scoring and RLVR to expand into domains beyond math and code, with secondary models used to evaluate intermediate reasoning steps. In parallel, inference-time scaling has become a major lever: platforms are increasingly willing to trade latency and cost for accuracy in high-stakes tasks, and methods such as self-consistency and self-refinement can push models to gold-level performance on challenging math benchmarks when paired with heavy inference budgets. Continual learning is also gaining attention as teams search for ways to update models on new data without catastrophic forgetting, although no decisive breakthroughs have emerged yet.
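To make the contrast with preference-based labeling concrete, here is a minimal Python sketch of a verifiable reward and of self-consistency voting. The GSM8K-style `####` answer marker, the function names, and the assumption that a sampling loop exists elsewhere are illustrative choices, not code from Raschka's post.

```python
from collections import Counter
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Deterministic RLVR-style reward: 1.0 if the model's final answer
    matches the reference exactly, else 0.0. No human preference labels."""
    # Assumed convention: the completion ends with a line "#### <answer>".
    match = re.search(r"####\s*(.+)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def self_consistency_vote(final_answers: list[str]) -> str:
    """Inference-time scaling via self-consistency: sample many completions
    and return the majority answer, buying accuracy with compute and latency."""
    return Counter(final_answers).most_common(1)[0][0]
```

Because the reward is a cheap, deterministic check rather than a learned preference model, it can be applied to millions of sampled completions, which is what lets RLVR scale past human labeling.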
On the research side, GRPO has become a favored playground because it is conceptually rich yet not prohibitively expensive to experiment with, enabling both academic and industrial teams to propose modifications that have already made their way into state-of-the-art training pipelines. Raschka lists refinements such as zero-gradient-signal filtering, active sampling, token-level loss, variants that remove the KL loss, truncated importance sampling, KL tuning with domain-specific strengths, and off-policy masking, and reports that these tricks significantly stabilize runs and prevent bad updates from corrupting training. Architecturally, most leading models still rely on decoder-style transformers, but open-weight systems are converging on mixture-of-experts layers and efficiency tweaks like grouped-query and sliding-window attention, alongside more radical linear-time components such as Gated DeltaNet and Mamba-2 layers. He expects transformers to remain dominant for a few more years while efficiency-focused hybrids grow in importance, and he highlights emerging text diffusion models such as Google's planned Gemini Diffusion and the 100B-parameter LLaDA 2.0 as promising for low-latency use cases, even if they do not yet set state-of-the-art records.
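A short sketch of the group-relative advantage at GRPO's core, together with the zero-gradient-signal filter from the list above. The 1-D shape (G rewards for one prompt) and the filter's exact criterion are assumptions made for illustration.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize the scalar rewards of G
    completions sampled from the same prompt. Comparing within the group
    removes the need for a separately trained value network (critic)."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

def has_gradient_signal(group_rewards: torch.Tensor) -> bool:
    """Zero-gradient-signal filtering, as assumed here: skip groups whose
    rewards are all identical (all solved or all failed), since their
    advantages are exactly zero and the update carries no learning signal."""
    return bool(group_rewards.std() > 0)
```

Filters like this matter in practice because uniform groups produce all-zero advantages and simply waste compute that active sampling can redirect to informative prompts.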
Raschka argues that 2025 is also the year when inference scaling and tool use rival raw model scaling as primary drivers of capability gains. He points to GPT-4.5 as an example where a large training budget yielded limited returns compared with smarter post-training and inference strategies. Reasoning-heavy offerings such as GPT Heavy Thinking or DeepSeekMath-V2 illustrate that high-latency, high-cost inference is acceptable in narrow domains like competition-level math or complex coding. Tool use, meanwhile, is reducing hallucinations by letting models call search engines, official databases, calculators, and APIs instead of relying purely on memorized knowledge, and OpenAI's gpt-oss is cited as an early open-weight model designed explicitly around tool use. While security and tooling gaps mean that many open-source stacks still default to non-tool-use modes, Raschka expects local tool use and more agentic capabilities to become standard in the coming years.
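A minimal sketch of the tool-calling loop described above, assuming a JSON call format, a `model_step` callable, and a toy calculator tool; all three are illustrative stand-ins, not gpt-oss's actual protocol.

```python
import json

# Hypothetical tool registry; real deployments expose search engines,
# databases, and APIs rather than a toy eval-based calculator.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def answer_with_tools(model_step, prompt: str, max_turns: int = 5) -> str:
    """Minimal tool-use loop: the model (model_step: str -> str, assumed
    given) either replies in plain text, treated as the final answer, or
    emits a JSON call such as {"tool": "calculator", "input": "19 * 23"}.
    Tool output is appended to the transcript so the next step is grounded
    in it rather than in memorized knowledge."""
    transcript = prompt
    reply = ""
    for _ in range(max_turns):
        reply = model_step(transcript)
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text: final answer
        if not isinstance(call, dict) or "tool" not in call:
            return reply
        result = TOOLS[call["tool"]](call["input"])
        transcript += f'\n[tool:{call["tool"]}] {result}'
    return reply
```

The anti-hallucination effect comes from the transcript append: the model's final answer can quote tool output verbatim instead of reconstructing facts from its weights.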
Evaluation, however, has become increasingly strained. Raschka uses the term "benchmaxxing" for the heavy optimization against public benchmarks that has inflated scores without corresponding real-world gains, citing Llama 4 as a model that appeared dominant on leaderboards but disappointed many users in practice. He argues that once a test set is public and leaks into the training data, benchmark numbers cease to be reliable indicators of relative quality, especially when models are tuned directly on those tasks. Benchmarks remain necessary as minimum thresholds, but exceeding a given score no longer guarantees superior performance. The breadth of large language model use cases, compared with single-task systems like image classifiers, further complicates evaluation, and Raschka stresses that practical testing and more nuanced, varied benchmarks are essential.
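The leakage concern can be illustrated with a crude n-gram overlap probe, the basic idea behind common decontamination heuristics; the window size and exact-match rule here are assumptions, not a description of any lab's actual pipeline.

```python
def looks_contaminated(test_item: str, train_docs: list[str], n: int = 8) -> bool:
    """Flag a benchmark item if any n-token span from it appears verbatim in
    a training document. Real pipelines add normalization and fuzzy matching;
    this only illustrates why public test sets stop measuring generalization."""
    tokens = test_item.split()
    ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return any(g in doc for doc in train_docs for g in ngrams)
```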
Raschka dedicates substantial space to how Artificial Intelligence is reshaping coding, writing, and research work. He positions large language models as tools that grant “superpowers” rather than full replacements: they are excellent at boilerplate, refactoring, and debugging, and they can help non-experts automate tasks that would otherwise be out of reach, but expert-crafted codebases and designs still outperform purely generated systems. In his view, high-quality technical books and research remain anchored in human expertise, with models serving as assistants for error checking, literature discovery, and experiment design. He warns that overdelegating to models can undermine deep learning and intrinsic satisfaction, potentially accelerating burnout if humans become mere supervisors of automation instead of active creators. Drawing an analogy to chess, he suggests treating Artificial Intelligence as a partner for exploration and analysis rather than a full outsourcing of thinking.
Looking at the ecosystem, Raschka identifies proprietary, domain-specific data as the next major source of competitive edge and a potential bottleneck. While large providers would like access to high-quality proprietary corpora in areas such as finance, biotech, or healthcare, many organizations are reluctant to share data that underpins their differentiation, and he argues that selling such assets to external Artificial Intelligence vendors could be short-sighted. Instead, he anticipates that as skills and tooling commoditize, more well-funded vertical players will train or adapt strong base models like DeepSeek V3.2, Kimi K2, or GLM 4.7 in-house, keeping sensitive data local and secure. He envisions a future where state-of-the-art capabilities are widely accessible as open-weight bases, while private fine-tuning and post-training on proprietary data become key differentiators for enterprises.
Raschka also reflects on his own work, including his decision to remain an independent researcher focused on long-form writing, consulting, and books. He notes that his book Build a Large Language Model (From Scratch) has been widely adopted in universities and industry and has now been translated into at least nine languages, and that he prefers to keep it centered on accessible core architectures rather than more exotic attention variants. To cover newer advances, he has been adding extensive bonus materials to the book's GitHub repository and is writing a sequel, Build a Reasoning Model (From Scratch), which focuses on inference-time scaling and reinforcement learning for reasoning. He provides a detailed breakdown of the roughly 75 to 120 hours he invests per chapter, from topic selection to experiments, figures, exercises, and revisions, and reports that he is midway through implementing RLVR and GRPO training code for reasoning models.
In closing, Raschka lists the developments that most surprised him in 2025 and outlines his expectations for 2026. He was struck by how quickly several reasoning models achieved gold-level performance in major math competitions; by the rapid shift in open-weight popularity from Llama to Qwen; by Mistral adopting the DeepSeek V3 architecture in its latest flagship model; by the rise of additional open-weight contenders like Kimi, GLM, MiniMax, and Yi; by the early prioritization of hybrid efficiency architectures; by OpenAI releasing an open-weight model; and by MCP quickly becoming a de facto standard for tool and data access in agent systems. For 2026, he predicts at least one industry-scale, consumer-facing diffusion model for cheap low-latency inference; broader adoption of local tool-using and agentic models in the open-weight community; RLVR expansion into domains such as chemistry and biology; a gradual move away from classical retrieval-augmented generation in favor of better long-context models; and a significant portion of apparent progress coming from tooling and inference-time scaling rather than core model training. The overarching lesson he draws from 2025 is that progress is driven by many independent levers working in concert, and that making sense of these gains will require more rigorous, transparent, and consistent evaluation.
