Vals presents a public benchmark suite designed to track how large language models perform on enterprise and domain-specific work. The Vals Index reports a weighted performance score across finance, law, and coding tasks, indicating the potential impact that LLMs can have on the economy. Updated 3/17/2026, its top model is Claude Sonnet 4.6, with 36 models tested. The Vals Multimodal Index extends that approach across finance, law, coding, and education tasks; updated 3/17/2026, its top model is also Claude Sonnet 4.6, with 25 models tested.
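A weighted index of this kind can be sketched as a weighted average of per-domain scores. The domain names below come from the source; the weights, scores, and function name are illustrative assumptions, not Vals' actual methodology.

```python
# Hypothetical sketch of a weighted benchmark index combining per-domain
# scores. The weights and scores are made-up placeholders for illustration.

def weighted_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-domain benchmark scores (0-100 scale)."""
    total_weight = sum(weights.values())
    return sum(scores[d] * weights[d] for d in weights) / total_weight

# Illustrative inputs only -- not real Vals Index data or weights.
model_scores = {"finance": 78.0, "law": 82.0, "coding": 74.0}
domain_weights = {"finance": 0.4, "law": 0.3, "coding": 0.3}

print(round(weighted_index(model_scores, domain_weights), 2))
```

Normalizing by the total weight keeps the index meaningful even if the weights do not sum exactly to one.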
Legal and finance benchmarks focus on practical professional workflows:
- CaseLaw (v2), updated 3/17/2026: a private question-answer benchmark over Canadian court cases. Top model: GPT 5.1; 41 models tested.
- LegalBench, updated 3/18/2026: evaluates language models on a wide range of open-source legal reasoning tasks. Top model: Gemini 3.1 Pro Preview (02/26); 113 models tested.
- CorpFin (v2), updated 3/17/2026: evaluates understanding of long-context credit agreements. Top model: Kimi K2.5; 92 models tested.
- Finance Agent v1.1, updated 3/17/2026: evaluates agents on core financial analyst tasks. Top model: Claude Sonnet 4.6; 40 models tested.
- MortgageTax, updated 3/17/2026: evaluates reading and understanding tax certificates as images. Top model: Gemini 3.1 Pro Preview (02/26); 66 models tested.
- TaxEval (v2), updated 3/18/2026: a Vals-created set of questions and responses to tax questions. Top model: Claude Sonnet 4.6; 100 models tested.
Healthcare, math, academic, education, and coding evaluations broaden the coverage:
- MedCode, updated 3/17/2026: asks whether models can support the medical billing process. Top model: Gemini 3.1 Pro Preview (02/26); 47 models tested.
- MedScribe, updated 3/17/2026: asks whether models can support doctors with their administrative work. Top model: GPT 5.1; 47 models tested.
- AIME, updated 3/17/2026: Top model: Gemini 3.1 Pro Preview (02/26); 92 models tested.
- ProofBench, updated 3/19/2026: Top system: Aristotle; 23 systems tested.
- GPQA, updated 3/17/2026: Top model: Gemini 3.1 Pro Preview (02/26); 95 models tested.
- MMLU Pro, updated 3/18/2026: Top model: Gemini 3.1 Pro Preview (02/26); 93 models tested.
- MMMU, updated 3/17/2026: Top model: Gemini 3.1 Pro Preview (02/26); 63 models tested.
- SAGE, updated 3/17/2026: Top model: Claude Opus 4.5 (Thinking); 46 models tested.
Coding and agent benchmarks emphasize software and task execution:
- IOI, updated 3/18/2026: Top model: GPT 5.4; 50 models tested.
- LiveCodeBench, updated 3/17/2026: Top model: Gemini 3.1 Pro Preview (02/26); 101 models tested.
- SWE-bench, updated 3/17/2026: Top model: Claude Opus 4.6 (Thinking); 62 models tested.
- Terminal-Bench 2.0, updated 3/17/2026: Top model: Gemini 3.1 Pro Preview (02/26); 46 models tested.
- Vibe Code Bench v1.1, updated 3/18/2026: Top model: GPT 5.4; 22 models tested.
- Poker Agent (beta), updated 12/23/2025: asks which model can make the most money playing poker. Top model: GPT 5.2; 17 models tested.

Vals says it reports how language models perform on the industry-specific tasks where they will be used.
