Chinese firm unveils dynamic benchmark for artificial intelligence models

June 24, 2025

HongShan Capital Group's Xbench provides a continually updated framework for assessing artificial intelligence performance on both rigorous academic tests and real-world tasks.

Evaluating whether an artificial intelligence model truly reasons or merely parrots information from its training data remains a core challenge in the field. HongShan Capital Group, known as HSG, has launched Xbench, a new benchmarking tool that aims to address this issue directly. Unlike conventional benchmarks that rely on static academic tests, Xbench also assesses how well models execute real-world tasks, and HSG has pledged to update its dataset regularly. The approach is meant to keep the benchmark current as model capabilities evolve and to keep its questions topical.

This week, HSG made a portion of Xbench's question set open-source and freely available. Accompanying the release is a public leaderboard comparing mainstream artificial intelligence models on Xbench's demanding criteria. In the latest rankings, ChatGPT o3 leads across all measured categories, but strong performances from ByteDance's Doubao, Google's Gemini 2.5 Pro, xAI's Grok, and Anthropic's Claude Sonnet show that the competition is stiff. Xbench was conceived in 2022 as an internal tool to guide investment decisions and has since evolved, with external researchers and professionals joining the project. The team, led by partner Gong Yuan, recognized the value of broader access and has gradually upgraded Xbench to serve the wider artificial intelligence community.

Xbench operates via two primary methodologies. One resembles traditional benchmarking, testing academic proficiency through postgraduate-level STEM questions written by graduate students and validated by faculty. This component, called ScienceQA, rewards not just correct answers but also the chain of reasoning that leads to them. The second component, DeepResearch, shifts the focus to the Chinese-language web, presenting models with complex, research-heavy questions in fields such as music, finance, history, and literature. Here, success requires depth, diverse sources, consistency, and a willingness to admit when the available data is insufficient. Example questions include ones that demand specialized geographic knowledge, such as the number of border cities in China's northwestern provinces, which only a third of the tested models answered correctly.
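The article does not publish Xbench's scoring formula, so the following is only a minimal sketch of how a grader might combine the two signals that ScienceQA is said to reward: a correct final answer and a sound chain of reasoning. The names `Response`, `science_qa_score`, and `pass_rate`, along with the 50/50 weighting, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


# Hypothetical structures: the article does not describe Xbench's real
# data model or weights. Everything here is an illustrative assumption.
@dataclass
class Response:
    answer_correct: bool    # final answer matches the reference answer
    reasoning_score: float  # grader's rating of the reasoning chain, 0 to 1


def science_qa_score(resp: Response, answer_weight: float = 0.5) -> float:
    """Blend answer correctness with reasoning quality (assumed weighting)."""
    if not 0.0 <= resp.reasoning_score <= 1.0:
        raise ValueError("reasoning_score must be between 0 and 1")
    if not 0.0 <= answer_weight <= 1.0:
        raise ValueError("answer_weight must be between 0 and 1")
    correctness = 1.0 if resp.answer_correct else 0.0
    return answer_weight * correctness + (1.0 - answer_weight) * resp.reasoning_score


def pass_rate(results: list[bool]) -> float:
    """Fraction of tested models that answered a given question correctly."""
    if not results:
        return 0.0
    return sum(results) / len(results)


if __name__ == "__main__":
    # A correct answer backed by only partly sound reasoning.
    print(science_qa_score(Response(answer_correct=True, reasoning_score=0.5)))
    # One of three hypothetical models answers a question correctly.
    print(pass_rate([True, False, False]))
```

Under this assumed weighting, a model that reaches the right answer through weak reasoning scores lower than one that shows its work, which is the behavior the article attributes to ScienceQA. The `pass_rate` helper mirrors the kind of per-question statistic quoted for the border-cities example.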

To simulate real-world professional scenarios, the Xbench team also works with field experts to create tasks drawn from recruitment and marketing. Sample tasks include sourcing specialized engineering candidates and matching advertisers with suitable video influencers from large pools. Additional domains such as legal, finance, accounting, and design are slated for future inclusion, and the test will be updated quarterly using a blend of public and private questions. On the current leaderboard, ChatGPT o3 leads in both recruiting and marketing applications, with Perplexity Search and Claude 3.5 Sonnet also performing strongly. While some model skills remain difficult to quantify, external researchers regard Xbench as a promising step toward rigorous, real-world benchmarking for artificial intelligence models.
