Assessing LLM autograders and introducing HiBayES in AISI research

AISI publishes methods for evaluating large language model judges and unveils HiBayES, a hierarchical Bayesian framework for improving artificial intelligence evaluation.

The AI Security Institute's 'our work' hub aggregates recent research, tools and partnership activity aimed at improving how advanced AI systems are measured and managed. The site highlights methodological advances alongside applied evaluations and programmatic initiatives. It is organised as a catalogue of reports, tool releases and posts that explain why rigorous testing matters for system deployment and policy.

Two recent research outputs stand out. 'LLM judges on trial' presents a new statistical framework for assessing autograders, the model-based evaluators increasingly used to score other models; the framework estimates grader reliability at the same time as it answers the primary research question. HiBayES proposes a hierarchical Bayesian modelling approach to evaluation, which accounts for the nested, correlated structure of large-scale LLM testing and provides more robust uncertainty estimates. Together these contributions tackle practical problems in evaluation: bias in model-graded labels, variability across tasks and annotators, and the need to quantify confidence in aggregate results.
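
To give a rough sense of what hierarchical modelling buys in this setting, the sketch below applies partial pooling to per-task pass rates. It is not the HiBayES implementation and does not use its API: it is a simplified empirical-Bayes beta-binomial stand-in, and the function name `partial_pool`, the crude method-of-moments prior and the example counts are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def partial_pool(successes, trials):
    """Shrink per-task pass rates toward the average rate across tasks.

    Illustrative empirical-Bayes beta-binomial sketch (an assumption,
    not the HiBayES method). Tasks with few trials are pulled more
    strongly toward the shared mean than tasks with many trials.
    """
    rates = [s / n for s, n in zip(successes, trials)]
    m = mean(rates)        # average of the per-task rates
    v = pvariance(rates)   # spread of the per-task rates

    # Crude method-of-moments estimate of the Beta prior strength (a + b).
    # Fall back to a weak prior when the variance is degenerate.
    if 0 < v < m * (1 - m):
        strength = m * (1 - m) / v - 1
    else:
        strength = 2.0
    a, b = m * strength, (1 - m) * strength

    # Posterior mean of each task's pass rate under the shared Beta(a, b) prior.
    return [(s + a) / (n + a + b) for s, n in zip(successes, trials)]


if __name__ == "__main__":
    # Hypothetical counts: passes and attempts for four evaluation tasks.
    successes = [9, 45, 2, 80]
    trials = [10, 50, 5, 100]
    for s, n, p in zip(successes, trials, partial_pool(successes, trials)):
        print(f"raw={s / n:.3f}  pooled={p:.3f}  (n={n})")
```

Each pooled estimate sits between the task's raw rate and the shared mean, and the task with only five attempts moves furthest. A full hierarchical Bayesian model would additionally return a posterior distribution, and hence an uncertainty interval, for every level of the hierarchy.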

The site also surfaces a broad portfolio of complementary work. Inspect, an open-source testing framework, and the Inspect sandboxing toolkit aim to make agentic evaluations scalable and reproducible. RepliBench offers benchmarks for autonomous replication capability. AISI's pre-deployment evaluations of major models, international joint testing exercises on agentic behaviour, and investigative work on cyber, chemical and biological misuse demonstrate an applied pipeline from methodology to real-world assessment. The alignment project and funding announcements show parallel investment in longer-term research and capacity building.

The overall narrative emphasises empirical rigour, transparency and collaboration. The institute publishes methods, benchmarks and software; it runs joint testing partnerships and prize programmes to crowdsource difficult evaluations. Readers are invited to explore individual reports and tools on the site to understand specific methods, datasets and results. The collection reflects an attempt to standardise how evaluators measure risk, to strengthen the evidence base for policy and to make evaluations more reproducible and informative for stakeholders.


