Vals AI has launched a suite of public enterprise benchmarks designed to evaluate large language models on tasks relevant to real-world industry applications. The platform addresses the critical shortage of practical benchmarking by focusing on how language models perform in enterprise settings, such as finance, law, and tax, providing transparency on their strengths and limitations.
Among the latest updates, the newly released Finance Agent Benchmark rigorously tests AI agents on tasks expected of entry-level financial analysts. Developed in collaboration with industry experts, the benchmark includes 537 questions covering skills such as data retrieval, market research, and projections. It assesses the models' ability to use up to four digital tools, including web and EDGAR database searches, to address realistic queries. Results indicate significant performance gaps: no current AI model surpasses 50% accuracy. Notably, OpenAI's o3 model led with 48.3% accuracy but incurred higher operational costs, while Anthropic's Claude Sonnet 3.7 Thinking achieved 44.1% accuracy at a much lower price per question.
Comprehensive model evaluations reveal clear trends in model capabilities across domains. OpenAI's o3 and o4 Mini recently claimed top accuracy rankings, especially on reasoning, knowledge, and math benchmarks such as MMMU, MMLU Pro, and MGSM. However, all models, including GPT 4.1 and its smaller variants, face persistent challenges in legal benchmarks such as ContractLaw and CaseLaw, highlighting significant domain-specific weaknesses. The platform also emphasizes cost-effectiveness, latency, and real-world usability in its evaluations, with smaller models like GPT 4.1 Nano excelling in speed but trailing in accuracy on complex tasks. Vals AI regularly updates its database with new model releases from major players, including Anthropic, OpenAI, and DeepSeek, making it a dynamic resource for tracking advancements and gaps in language model performance for enterprise adoption.