Chinese firm unveils dynamic benchmark for artificial intelligence models

HongShan Capital Group's Xbench provides a constantly evolving framework to assess artificial intelligence performance in both academic rigor and real-world tasks.

Evaluating whether an artificial intelligence model truly reasons or merely parrots information from its training data remains a core challenge in the field. HongShan Capital Group, known as HSG, has launched Xbench, a new benchmarking tool that aims to address this issue directly. Unlike conventional benchmarks that focus on static academic tests, Xbench also assesses how well models execute real-world tasks, and HSG has pledged to update its dataset regularly. This approach is meant to keep the benchmark in step with evolving model capabilities and to keep its questions topical.

This week, HSG made a portion of Xbench's question set open-source and freely available. Accompanying the release is a public leaderboard comparing mainstream artificial intelligence models on Xbench's demanding criteria. The latest rankings show ChatGPT o3 leading across all measured categories, but strong performances from ByteDance's Doubao, Google's Gemini 2.5 Pro, Elon Musk's Grok, and Claude Sonnet point to stiff competition. Initially conceived in 2022 as an internal tool to guide investment decisions, Xbench has since evolved, with external researchers and professionals joining the project. The team, led by partner Gong Yuan, recognized the value of broader access and has gradually upgraded Xbench to serve the wider artificial intelligence community.

Xbench operates via two primary methodologies. One resembles traditional benchmarking, testing academic proficiency through postgraduate-level STEM questions authored by graduate students and validated by faculty. This component, called ScienceQA, rewards not just correct answers but also the logical chain of reasoning. The second component, DeepResearch, shifts focus to the Chinese-language web, presenting models with complex, research-heavy questions across fields like music, finance, history, and literature. Here, success requires depth, source diversity, consistency, and an admission of insufficient data when warranted. Illustrative challenges include queries that require specialized geographic knowledge, such as the count of border cities in China's northwestern provinces, a question only a third of tested models could answer correctly.

To simulate professional real-world scenarios, Xbench also collaborates with field experts to create tasks drawn from the recruitment and marketing domains. Sample tasks include sourcing specialized engineering candidates or matching advertisers with suitable video influencers from vast pools. New domains such as legal, finance, accounting, and design are slated for future inclusion, and the test will be updated quarterly using a blend of public and private questions. On the current leaderboard, ChatGPT o3 maintains its lead in both recruiting and marketing applications, with Perplexity Search and Claude 3.5 Sonnet performing strongly. While quantifying certain model skills remains difficult, external researchers regard Xbench as a promising advance toward rigorous, real-world-relevant benchmarking for artificial intelligence models.
