Vals AI Launches Comprehensive Public Enterprise LLM Benchmarks

Vals AI introduces in-depth, industry-focused benchmarks for large language models, highlighting their current capabilities and shortcomings in enterprise tasks.

Vals AI has launched a suite of public enterprise benchmarks designed to evaluate large language models on tasks relevant to real-world industry applications. The platform addresses the critical shortage of practical benchmarking by focusing on how language models perform in enterprise settings, such as finance, law, and tax, providing transparency on their strengths and limitations.

Among the latest updates, the newly released Finance Agent Benchmark tests AI agents on tasks expected of entry-level financial analysts. Developed in collaboration with industry experts, the benchmark includes 537 questions covering skills such as data retrieval, market research, and projections. It assesses the models' ability to use up to four digital tools, including web search and EDGAR database search, to address realistic queries. Results indicate significant performance gaps: no current AI model surpasses 50% accuracy. OpenAI's o3 model led with 48.3% accuracy but incurred higher operational costs, while Anthropic's Claude Sonnet 3.7 Thinking achieved 44.1% accuracy at a much lower price per question.

Model evaluations across the platform reveal clear trends in capabilities by domain. OpenAI's o3 and o4 Mini recently claimed top accuracy rankings, especially on complex reasoning and math benchmarks such as MMMU, MMLU Pro, and MGSM. However, all models, including GPT 4.1 and its smaller variants, face persistent challenges in legal benchmarks such as ContractLaw and CaseLaw, highlighting significant domain-specific weaknesses. The platform also emphasizes cost-effectiveness, latency, and real-world usability in its evaluations, with smaller models like GPT 4.1 Nano excelling in speed but trailing in accuracy on complex tasks. Vals AI regularly updates its database with new model releases from major players, including Anthropic, OpenAI, and DeepSeek, making it a dynamic resource for tracking advances and gaps in language model performance for enterprise adoption.
