xAI has officially launched Grok 3, along with Grok 3 Reasoning in beta and a smaller Grok 3 mini Reasoning model. Reasoning models aim to step beyond standard generative systems by iteratively working through problems, which can reduce hallucinations. xAI is promoting Grok 3 as best in class, saying it surpasses models from OpenAI, Google, Anthropic, and DeepSeek on key benchmarks, and it performed strongly under the codename “chocolate” in Chatbot Arena’s blind tests. The model appears to have largely caught up to rivals despite xAI’s late start, though it still inherits some familiar frontier-model limitations.
Early user assessments frame Grok 3 as competitive but not definitively superior. Andrej Karpathy, a founding member of OpenAI and former director of artificial intelligence at Tesla, said Grok 3 with its Deep Search reasoning feature feels in the state-of-the-art range of OpenAI’s strongest models and slightly ahead of DeepSeek-R1 and Gemini 2.0 Flash Thinking on his stress tests. Wharton professor Ethan Mollick called the release in line with expectations, arguing it does not alter the broader consensus: rapid progress continues, speed is a moat, compute still matters, and there is no obvious secret sauce if a team has talent and chips. In other words, Grok 3 may satisfy enthusiasts but is not an obvious reason for most users to cancel a ChatGPT subscription.
After xAI’s benchmark slides circulated, OpenAI product engineer Rex Asabor posted an “updated” comparison indicating OpenAI’s unreleased o3 beats Grok 3 Reasoning on math and science benchmarks. Since o3 is not public, xAI may not have had access to those scores, but the exchange tempers claims that Grok 3 is the outright leader in reasoning.
Observers also highlighted the pace of xAI’s catch-up. Mollick noted how quickly xAI got to the frontier and said the key question is whether the trend continues. Elon Musk said Grok 3 training used 10 times the compute of Grok 2, powered by 200,000 GPUs, reinforcing the near-term view that more compute correlates with better performance. Still, researcher Gary Marcus remains skeptical that scaling laws will continue to yield linear gains in intelligence.
Grok 3 shows familiar shortcomings. Karpathy described its humor as limited to punny dad jokes, calling this a common large language model issue tied to mode collapse. In a prompt to generate an SVG of a pelican riding a bicycle, Grok 3 did better than some peers but still missed elements. On politically charged prompts, Karpathy said the model produced a cautious, noncommittal essay, suggesting it may be more guarded on ethical questions than some of Elon Musk’s supporters expect. Previous Grok versions have leaned left on political questions, which Musk has attributed to public training data, and he has vowed to steer the system toward political neutrality. First access to Grok 3 goes to X Premium+ subscribers, a plan whose price was recently increased.