Evaluating whether an artificial intelligence model truly reasons or merely parrots information from its training data remains a core challenge in the field. HongShan Capital Group (HSG) has launched Xbench, a new benchmarking tool that aims to address this issue directly. Unlike conventional benchmarks built around static academic tests, Xbench also assesses models on real-world task execution and pledges to update its question set regularly, helping the benchmark keep pace with evolving capabilities and stay topically relevant.
This week, HSG made a portion of Xbench's question set open source and freely available. Accompanying the release is a public leaderboard comparing mainstream artificial intelligence models on Xbench's demanding criteria. The latest rankings see ChatGPT o3 leading across all measured categories, but notable performances from ByteDance's Doubao, Google's Gemini 2.5 Pro, Elon Musk's Grok, and Claude Sonnet demonstrate stiff competition. Initially conceived in 2022 as an internal tool to guide investment decisions, Xbench has since evolved, with external researchers and professionals joining the project. The team, led by partner Gong Yuan, recognized the value of broader access and has gradually upgraded Xbench to serve the wider artificial intelligence community.
Xbench operates via two primary methodologies. One resembles traditional benchmarking, testing academic proficiency through postgraduate-level STEM questions authored by graduate students and validated by faculty. This component, called ScienceQA, rewards not just correct answers but also the logical chain of reasoning behind them. The second component, DeepResearch, shifts focus to the Chinese-language web, presenting models with complex, research-heavy questions across fields like music, finance, history, and literature. Here, success requires depth, source diversity, internal consistency, and acknowledging insufficient data when warranted. Illustrative challenges include queries that require specialized geographic knowledge, such as counting the border cities in China's northwestern provinces, a question only a third of tested models answered correctly.
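HSG has not published Xbench's grading code, but the DeepResearch criteria described above suggest a rubric-style evaluation. The short Python sketch below is a hypothetical illustration of how such a rubric might be combined into a single score; the criteria names, weights, and caps are assumptions for exposition, not Xbench's actual scheme.

```python
from dataclasses import dataclass

# Hypothetical rubric for a DeepResearch-style answer. The fields and
# weights are illustrative assumptions, not Xbench's real scoring scheme.
@dataclass
class AnswerAssessment:
    factually_correct: bool      # did the model reach the right answer?
    distinct_sources: int        # number of independent sources cited
    internally_consistent: bool  # no self-contradictions in the answer
    admits_data_gaps: bool       # flags missing data instead of guessing

def score(a: AnswerAssessment) -> float:
    """Combine the rubric checks into a single score between 0 and 1."""
    total = 0.0
    total += 0.5 if a.factually_correct else 0.0
    total += min(a.distinct_sources, 3) / 3 * 0.2  # cap credit at 3 sources
    total += 0.2 if a.internally_consistent else 0.0
    total += 0.1 if a.admits_data_gaps else 0.0
    return total

# Example: a correct, consistent answer citing two sources that never
# acknowledges uncertainty scores roughly 0.83 under these assumed weights.
print(score(AnswerAssessment(True, 2, True, False)))
```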
To simulate professional real-world scenarios, Xbench also collaborates with field experts to create tasks drawn from the recruitment and marketing domains. Sample tasks include sourcing specialized engineering candidates or matching advertisers with suitable video influencers from vast candidate pools. New domains such as law, finance, accounting, and design are slated for future inclusion, and the test will be updated quarterly with a blend of public and private questions. On the current leaderboard, ChatGPT o3 maintains its lead in both the recruiting and marketing applications, with Perplexity Search and Claude 3.5 Sonnet performing strongly. While quantifying certain model skills remains difficult, external researchers recognize Xbench as an important, promising advance in the effort to provide rigorous, real-world-relevant benchmarking for artificial intelligence models.