Vals AI Launches Comprehensive Public Enterprise LLM Benchmarks

Vals AI introduces in-depth, industry-focused benchmarks for large language models, highlighting the current capabilities and shortcomings of Artificial Intelligence on enterprise tasks.

Vals AI has launched a suite of public enterprise benchmarks designed to evaluate large language models on tasks relevant to real-world industry applications. The platform addresses the critical shortage of practical benchmarking by focusing on how language models perform in enterprise settings, such as finance, law, and tax, providing transparency on their strengths and limitations.

Among the latest updates, the newly released Finance Agent Benchmark rigorously tests Artificial Intelligence agents on tasks expected of entry-level financial analysts. Developed in collaboration with industry experts, the benchmark includes 537 questions covering skills such as data retrieval, market research, and projections. The benchmark assesses the models' ability to use up to four digital tools, including web and EDGAR database searches, to address realistic queries. Results indicate significant performance gaps: no current Artificial Intelligence model surpasses 50% accuracy. Notably, OpenAI's o3 model led with 48.3% accuracy but incurred higher operational costs, while Anthropic's Claude Sonnet 3.7 Thinking achieved 44.1% accuracy at a much lower price per question.
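To put those percentages in concrete terms, the short Python sketch below converts the reported accuracies into approximate counts of correctly answered questions. It assumes the accuracies are measured over all 537 questions, which the announcement does not state explicitly; the function and variable names are illustrative only and are not part of any Vals AI tooling.

```python
# Back-of-the-envelope arithmetic, not Vals AI's scoring code.
# Assumption: reported accuracy is computed over all 537 benchmark questions.
TOTAL_QUESTIONS = 537

# Accuracies as reported in the article.
reported_accuracy = {
    "OpenAI o3": 0.483,
    "Claude Sonnet 3.7 Thinking": 0.441,
}


def approx_correct(accuracy: float, total: int = TOTAL_QUESTIONS) -> int:
    """Return the approximate number of correctly answered questions."""
    return round(accuracy * total)


correct_counts = {
    name: approx_correct(acc) for name, acc in reported_accuracy.items()
}

for name, count in correct_counts.items():
    print(f"{name}: ~{count} of {TOTAL_QUESTIONS} questions correct")
```

Under that assumption, the 4.2-point accuracy gap between the two models corresponds to a difference of roughly 22 questions.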

Comprehensive model evaluations reveal clear trends in model capabilities across domains. OpenAI's o3 and o4 Mini recently claimed top accuracy rankings, especially on complex reasoning, knowledge, and math benchmarks such as MMMU, MMLU Pro, and MGSM. However, all models, including GPT 4.1 and its smaller variants, face persistent challenges in legal benchmarks such as ContractLaw and CaseLaw, highlighting significant domain-specific weaknesses. The platform also emphasizes cost-effectiveness, latency, and real-world usability in its evaluations, with smaller models like GPT 4.1 Nano excelling in speed but trailing in accuracy on complex tasks. Vals AI regularly updates its database with new model releases from major players, including Anthropic, OpenAI, and DeepSeek, making it a dynamic resource for tracking advancements and gaps in language model performance for enterprise adoption.

