ADeLe Offers Predictive and Explanatory Evaluation for AI Models

A new method called ADeLe breaks down Artificial Intelligence tasks by ability, enabling clearer predictions of model performance and revealing the 'why' behind successes or failures.

Researchers supported by Microsoft and its Accelerating Foundation Models Research initiative have introduced a novel approach, ADeLe (annotated-demand-levels), for systematically evaluating Artificial Intelligence model performance. Unlike conventional benchmarks, ADeLe predicts how models will perform on unfamiliar tasks and provides detailed explanations for their successes and failures. It does this by decomposing tasks into demands across 18 cognitive and knowledge-based ability scales, such as reasoning, attention, and domain-specific knowledge, and quantifying how much of each ability a task requires on a detailed 0–5 rubric originally designed for human evaluation.
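
To make the idea concrete, a task's demand annotation can be pictured as one rubric level per ability scale. The sketch below is a loose illustration only: the scale names are hypothetical stand-ins, not ADeLe's actual 18-scale taxonomy or data format.

```python
# Hypothetical sketch of a task demand profile: one 0-5 rubric level per
# ability scale. Scale names are illustrative, not ADeLe's actual taxonomy.
ABILITY_SCALES = ("reasoning", "attention", "domain_knowledge")  # 3 of 18 shown

def make_demand_profile(levels: dict) -> dict:
    """Validate and return a demand profile mapping scale -> level (0-5)."""
    for scale, level in levels.items():
        if scale not in ABILITY_SCALES:
            raise ValueError(f"unknown ability scale: {scale}")
        if not 0 <= level <= 5:
            raise ValueError("rubric levels run from 0 to 5")
    return levels

# A task that leans heavily on reasoning but needs little domain knowledge.
task_demands = make_demand_profile(
    {"reasoning": 4, "attention": 2, "domain_knowledge": 1}
)
```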

To generate an ability profile for an Artificial Intelligence model, researchers compare the model's capabilities on a large, annotated benchmark against these task requirements. The result is a profile that highlights which abilities a particular model possesses and clarifies why it succeeds or fails on given tasks. This ability matching not only supports rigorous analysis but also enables accurate performance prediction, reaching about 88% accuracy in forecasting whether leading models like GPT-4o and LLaMA-3.1-405B will correctly solve new and unfamiliar challenges, outperforming traditional single-metric approaches.
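
One way to picture the ability-matching step is a scale-by-scale comparison between the model's estimated abilities and a task's demands. The simple thresholding rule below is an assumption made for illustration; it is a minimal sketch, not the paper's actual predictor.

```python
# Minimal sketch of ability matching: predict success when the model's
# estimated ability meets or exceeds the task's demand on every scale.
# The thresholding rule is an illustrative assumption, not ADeLe's
# actual prediction model.
def predict_success(model_abilities: dict, task_demands: dict) -> bool:
    return all(
        model_abilities.get(scale, 0.0) >= level
        for scale, level in task_demands.items()
    )

# Illustrative (made-up) ability profile for a strong general model.
model_profile = {"reasoning": 4.5, "attention": 3.0, "domain_knowledge": 2.0}
print(predict_success(model_profile, {"reasoning": 4, "attention": 2}))  # True
print(predict_success(model_profile, {"domain_knowledge": 5}))           # False
```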

Extensive testing across 63 tasks and 20 benchmarks revealed measurement shortcomings in existing Artificial Intelligence evaluation methods, such as tests that do not genuinely assess the abilities they claim to measure or that lack variation in difficulty. The analysis also exposed distinct model strengths and weaknesses: newer and larger models generally perform better, but with diminishing returns; reasoning-specific models excel where logical inference or social cognition is needed; and different training approaches critically affect knowledge-based abilities. Further, ADeLe's results provide nuanced visualizations through radial ability plots, helping developers and policymakers better grasp a model's readiness for deployment. The researchers suggest that this approach could become a standardized framework for evaluating future Artificial Intelligence systems, extending to multimodal or embodied settings and facilitating safer, more transparent societal adoption.
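
For a sense of what such a radial plot involves, here is a minimal matplotlib sketch that draws a made-up ability profile on polar axes. The scale names and values are illustrative only, and this is an assumption about the general plot style, not ADeLe's actual plotting code.

```python
# Minimal sketch of a radial (radar) ability plot using matplotlib.
# Scale names and ability values are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

scales = ["reasoning", "attention", "domain_knowledge", "social_cognition"]
levels = [4.5, 3.0, 2.0, 3.5]  # illustrative 0-5 ability estimates

angles = np.linspace(0, 2 * np.pi, len(scales), endpoint=False).tolist()
angles += angles[:1]           # repeat the first angle to close the polygon
values = levels + levels[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(scales)
ax.set_ylim(0, 5)
ax.set_title("Illustrative model ability profile")
plt.show()
```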

Impact Score: 77

Samsung shows 96% power reduction in NAND flash

Samsung researchers report a design that combines ferroelectric materials with oxide semiconductors to cut NAND flash string-level power by up to 96%. The team says the approach supports high density, including up to 5 bits per cell, and could lower power for data centers and for mobile and edge Artificial Intelligence devices.

The Download: fossil fuels and new endometriosis tests

This edition of The Download highlights how this year’s UN climate talks again omitted the phrase “fossil fuels” and why new noninvasive tests could shorten the nearly 10 years it now takes to diagnose endometriosis.

SAP unveils EU Artificial Intelligence Cloud: a unified vision for Europe’s sovereign Artificial Intelligence and cloud future

SAP launched EU Artificial Intelligence Cloud, a sovereign offering that brings together its sovereignty milestones into a full-stack cloud and Artificial Intelligence framework. The offering supports EU data residency and gives customers flexible sovereignty and deployment choices across SAP data centers, trusted European infrastructure, or fully managed on-site solutions.

HPC won’t be an x86 monoculture forever

x86 dominance in high-performance computing is receding: its share of the TOP500 has fallen from almost nine in ten machines a decade ago to 57 percent today. The rise of GPUs, Arm, and RISC-V, together with the demands of Artificial Intelligence and hyperscale workloads, is reshaping processor choices.

A trillion dollars is a terrible thing to waste

Gary Marcus argues that the machine learning mainstream’s prolonged focus on scaling large language models may have cost roughly a trillion dollars and produced diminishing returns. He urges a pivot toward new ideas such as neurosymbolic techniques and built-in inductive constraints to address persistent problems.
