Vals AI Launches Comprehensive Public Enterprise LLM Benchmarks

Vals AI introduces in-depth, industry-focused benchmarks for large language models, highlighting the current capabilities and shortcomings of AI on enterprise tasks.

Vals AI has launched a suite of public enterprise benchmarks designed to evaluate large language models on tasks relevant to real-world industry applications. The platform addresses the critical shortage of practical benchmarking by focusing on how language models perform in enterprise settings, such as finance, law, and tax, providing transparency on their strengths and limitations.

Among the latest updates, the newly released Finance Agent Benchmark tests AI agents on tasks expected of entry-level financial analysts. Developed in collaboration with industry experts, the benchmark includes 537 questions covering skills such as data retrieval, market research, and projections. It assesses the models' ability to use up to four digital tools, including web and EDGAR database searches, to answer realistic queries. Results indicate significant performance gaps: no current AI model surpasses 50% accuracy. OpenAI's o3 model led with 48.3% accuracy but at a higher operational cost, while Anthropic's Claude Sonnet 3.7 Thinking achieved 44.1% accuracy at a much lower price per question.

Model evaluations across domains reveal clear trends. OpenAI's o3 and o4 Mini recently claimed the top accuracy rankings, especially on complex reasoning and math benchmarks such as MMMU, MMLU Pro, and MGSM. However, all models, including GPT 4.1 and its smaller variants, face persistent challenges on legal benchmarks such as ContractLaw and CaseLaw, highlighting significant domain-specific weaknesses. The platform also weighs cost-effectiveness, latency, and real-world usability in its evaluations, with smaller models like GPT 4.1 Nano excelling in speed but trailing in accuracy on complex tasks. Vals AI regularly updates its database with new model releases from major players, including Anthropic, OpenAI, and DeepSeek, making it a dynamic resource for tracking advances and gaps in language model performance for enterprise adoption.


