Evaluating whether an artificial intelligence model truly reasons or merely parrots information from its training data remains a core challenge in the field. HongShan Capital Group (HSG) has launched Xbench, a new benchmarking tool that aims to address this issue directly. Unlike conventional benchmarks built around static academic tests, Xbench also assesses models on real-world task execution and pledges to update its question set regularly, helping the benchmark keep pace with evolving capabilities and stay topically relevant.
This week, HSG made a portion of Xbench's question set open source and freely available. Accompanying the release is a public leaderboard comparing mainstream artificial intelligence models on Xbench's demanding criteria. The latest rankings see ChatGPT o3 leading across all measured categories, but notable performances from ByteDance's Doubao, Google's Gemini 2.5 Pro, Elon Musk's Grok, and Claude Sonnet demonstrate stiff competition. Initially conceived in 2022 as an internal tool to guide investment decisions, Xbench has since evolved, with external researchers and professionals joining the project. The team, led by partner Gong Yuan, recognized the value of broader access and has gradually upgraded Xbench to serve the wider artificial intelligence community.
Xbench operates via two primary methodologies. One resembles traditional benchmarking, testing academic proficiency through postgraduate-level STEM questions authored by graduate students and validated by faculty. This component, called ScienceQA, rewards not just correct answers but also the logical chain of reasoning behind them. The second component, DeepResearch, shifts focus to the Chinese-language web, presenting models with complex, research-heavy questions across fields like music, finance, history, and literature. Here, success requires depth, source diversity, internal consistency, and acknowledging insufficient data when warranted. Illustrative challenges include queries that require specialized geographic knowledge, such as counting the border cities in China's northwestern provinces, a question only a third of tested models answered correctly.
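HSG has not published Xbench's grading code, but the DeepResearch criteria described above suggest a rubric-style evaluation. The short Python sketch below is a hypothetical illustration of how such a rubric might be combined into a single score; the criteria names, weights, and caps are assumptions for exposition, not Xbench's actual scheme.

```python
from dataclasses import dataclass

# Hypothetical rubric for a DeepResearch-style answer. The fields and
# weights are illustrative assumptions, not Xbench's real scoring scheme.
@dataclass
class AnswerAssessment:
    factually_correct: bool      # did the model reach the right answer?
    distinct_sources: int        # number of independent sources cited
    internally_consistent: bool  # no self-contradictions in the answer
    admits_data_gaps: bool       # flags missing data instead of guessing

def score(a: AnswerAssessment) -> float:
    """Combine the rubric checks into a single score between 0 and 1."""
    total = 0.0
    total += 0.5 if a.factually_correct else 0.0
    total += min(a.distinct_sources, 3) / 3 * 0.2  # cap credit at 3 sources
    total += 0.2 if a.internally_consistent else 0.0
    total += 0.1 if a.admits_data_gaps else 0.0
    return total

# Example: a correct, consistent answer citing two sources that never
# acknowledges uncertainty scores roughly 0.83 under these assumed weights.
print(score(AnswerAssessment(True, 2, True, False)))
```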
To simulate professional real-world scenarios, Xbench also collaborates with field experts to create tasks drawn from the recruitment and marketing domains. Sample tasks include sourcing specialized engineering candidates or matching advertisers with suitable video influencers from vast candidate pools. New domains such as law, finance, accounting, and design are slated for future inclusion, and the test will be updated quarterly with a blend of public and private questions. On the current leaderboard, ChatGPT o3 maintains its lead in both the recruiting and marketing applications, with Perplexity Search and Claude 3.5 Sonnet performing strongly. While quantifying certain model skills remains difficult, external researchers recognize Xbench as an important, promising advance in the effort to provide rigorous, real-world-relevant benchmarking for artificial intelligence models.