Vals publishes public enterprise language model benchmarks

Vals lists a broad set of public enterprise benchmarks spanning law, finance, healthcare, math, education, academics, coding, and beta agent tasks. The index highlights which models currently lead specific enterprise-focused evaluations and how widely each benchmark has been tested.

Vals presents a public benchmark suite designed to track how large language models perform on enterprise and domain-specific work. The Vals Index weights performance across finance, law, and coding tasks to show the potential impact that LLMs can have on the economy; as of 3/17/2026, Claude Sonnet 4.6 is the top model among 36 models tested. The Vals Multimodal Index extends that approach across finance, law, coding, and education tasks; as of 3/17/2026, Claude Sonnet 4.6 also leads there among 25 models tested.

Legal and finance benchmarks focus on practical professional workflows. CaseLaw (v2), a private question-answer benchmark over Canadian court cases, is led by GPT 5.1 with 41 models tested (updated 3/17/2026). LegalBench, which evaluates language models on a wide range of open-source legal reasoning tasks, is led by Gemini 3.1 Pro Preview (02/26) with 113 models tested (updated 3/18/2026). In finance, CorpFin (v2) evaluates understanding of long-context credit agreements and is led by Kimi K2.5 with 92 models tested (updated 3/17/2026). Finance Agent v1.1 evaluates agents on core financial analyst tasks and is led by Claude Sonnet 4.6 with 40 models tested (updated 3/17/2026). MortgageTax evaluates reading and understanding tax certificates as images and is led by Gemini 3.1 Pro Preview (02/26) with 66 models tested (updated 3/17/2026). TaxEval (v2), a Vals-created set of questions and responses to tax questions, is led by Claude Sonnet 4.6 with 100 models tested (updated 3/18/2026).

Healthcare, math, academic, education, and coding evaluations broaden the coverage. MedCode asks whether models can support the medical billing process and is led by Gemini 3.1 Pro Preview (02/26) with 47 models tested (updated 3/17/2026). MedScribe asks whether models can support doctors with their administrative work and is led by GPT 5.1 with 47 models tested (updated 3/17/2026). AIME lists Gemini 3.1 Pro Preview (02/26) as the top model with 92 models tested (updated 3/17/2026), while ProofBench lists Aristotle as the top system among 23 systems tested (updated 3/19/2026). GPQA (95 models tested, updated 3/17/2026), MMLU Pro (93 models tested, updated 3/18/2026), and MMMU (63 models tested, updated 3/17/2026) all show Gemini 3.1 Pro Preview (02/26) leading. SAGE lists Claude Opus 4.5 (Thinking) as the top model with 46 models tested (updated 3/17/2026).

Coding and agent benchmarks emphasize software and task execution. IOI lists GPT 5.4 as the top model with 50 models tested (updated 3/18/2026). LiveCodeBench lists Gemini 3.1 Pro Preview (02/26) as the top model with 101 models tested (updated 3/17/2026). SWE-bench lists Claude Opus 4.6 (Thinking) as the top model with 62 models tested (updated 3/17/2026). Terminal-Bench 2.0 lists Gemini 3.1 Pro Preview (02/26) as the top model with 46 models tested (updated 3/17/2026). Vibe Code Bench v1.1 lists GPT 5.4 as the top model with 22 models tested (updated 3/18/2026). Among beta benchmarks, Poker Agent asks which model can make the most money playing poker; GPT 5.2 leads with 17 models tested (updated 12/23/2025). Vals says it reports how language models perform on the industry-specific tasks where they will be used.

Impact Score: 52

Governance risk highlights from Infosecurity Magazine

Governance and risk coverage centers on regulation, compliance, cybersecurity policy, and the growing role of Artificial Intelligence in enterprise security. Recent headlines point to pressure on critical infrastructure, standards updates, insider threat guidance, and concerns over guardrails for large language models.

MIT method spots overconfident Artificial Intelligence models

MIT researchers developed a way to detect when large language models are confidently wrong by comparing their answers with outputs from similar models. The combined uncertainty measure outperformed standard techniques across a range of tasks and may help reduce unreliable responses.

MEPs back delay for parts of Artificial Intelligence Act

European Parliament committees have endorsed targeted delays to parts of the Artificial Intelligence Act while adding a proposed ban on certain non-consensual image manipulation tools. The changes aim to give companies clearer deadlines, reduce overlap with other EU rules, and extend support to small mid-cap enterprises.

Publisher alliance seeks leverage over Artificial Intelligence web access

A new publisher coalition is trying to reshape how Artificial Intelligence companies access journalism by combining collective bargaining with tougher technical controls. The effort reflects growing pressure on Artificial Intelligence firms to pay for content used in training, search, and user-facing responses.

Military advantage in the age of algorithmic diffusion

American leadership in Artificial Intelligence research and infrastructure may not translate into lasting military advantage. Rapid diffusion of algorithms is shifting the contest toward compute, talent, and the speed of military adoption.
