Assessing LLM autograders and introducing HiBayES in AISI research

AISI publishes methods for evaluating large language model judges and introduces HiBayES, a hierarchical Bayesian framework for more robust AI evaluation.

The AI Security Institute's 'our work' hub aggregates recent research, tools, and partnership activity aimed at improving how advanced AI systems are measured and managed. The site highlights methodological advances alongside applied evaluations and programmatic initiatives. It is organised as a catalogue of reports, tool releases and explanatory posts on why rigorous testing matters for system deployment and policy.

Two recent research outputs stand out. 'LLM judges on trial' presents a new statistical framework for assessing autograders, the model-based evaluators increasingly used to score other models; the framework jointly estimates grader reliability and the underlying quantity the evaluation is meant to answer. HiBayES proposes a hierarchical Bayesian modelling approach to evaluation, addressing the nested, correlated structure of large-scale LLM testing and providing more robust uncertainty estimates. Together these contributions tackle practical problems in evaluation: bias in model-graded labels, variability across tasks and annotators, and the need to quantify confidence in aggregate results.
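To make the idea of hierarchical modelling of nested evaluation data concrete, the sketch below fits a simple partial-pooling model of per-task pass rates with PyMC. It is not the HiBayES implementation; the toy data, variable names and priors are illustrative assumptions only, intended to show why treating samples as grouped by task changes the uncertainty estimates.

```python
# A minimal hierarchical Bayesian sketch (not HiBayES): binary pass/fail
# scores grouped by task, with per-task pass rates partially pooled toward
# a shared population prior. Toy data and priors are assumptions.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)

# Toy data: 5 tasks, 40 graded samples each, binary pass/fail outcomes.
n_tasks, n_per_task = 5, 40
task_idx = np.repeat(np.arange(n_tasks), n_per_task)
true_rates = np.array([0.4, 0.55, 0.6, 0.7, 0.8])
passed = rng.binomial(1, true_rates[task_idx])

with pm.Model() as model:
    # Population-level parameters shared across tasks.
    mu = pm.Normal("mu", 0.0, 1.5)        # mean pass rate on the logit scale
    sigma = pm.HalfNormal("sigma", 1.0)   # between-task spread

    # Task-level effects, partially pooled toward the population mean.
    theta = pm.Normal("theta", mu, sigma, shape=n_tasks)

    # Likelihood: each sample's outcome depends on its task's pass rate.
    pm.Bernoulli("obs", logit_p=theta[task_idx], observed=passed)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior per-task pass rates, with uncertainty that reflects the nested
# structure instead of treating all samples as independent draws.
print(idata.posterior["theta"].mean(dim=("chain", "draw")))
```

The point of the hierarchy is that tasks with few samples borrow strength from the population-level estimate, which is one way to get calibrated aggregate uncertainty from correlated, grouped evaluation data.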

The site also surfaces a broad portfolio of complementary work. Inspect, an open-source testing framework, and the Inspect sandboxing toolkit aim to make agentic evaluations scalable and reproducible. RepliBench offers benchmarks for autonomous replication capability. AISI's pre-deployment evaluations of major models, international joint testing exercises on agentic behaviour, and investigative work on cyber, chemical and biological misuse demonstrate an applied pipeline from methodology to real-world assessment. The alignment project and funding announcements show parallel investment in longer-term research and capacity building.
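Because Inspect is publicly documented, a minimal task definition gives a feel for what an evaluation looks like in that framework. The example below is illustrative rather than drawn from any AISI evaluation; the toy sample, the model identifier, and the exact keyword arguments are assumptions and may differ between Inspect versions.

```python
# A minimal Inspect task sketch: a one-sample dataset, a plain generation
# solver, and a string-matching scorer. Illustrative only; the sample,
# model name and keyword names are assumptions.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_arithmetic():
    return Task(
        dataset=[Sample(input="What is 17 + 25?", target="42")],
        solver=generate(),   # just prompt the model, no agent scaffolding
        scorer=match(),      # grade by matching the target string
    )

# Run against a model of choice (identifier is a placeholder):
# eval(toy_arithmetic(), model="openai/gpt-4o-mini")
```

Keeping dataset, solver and scorer as separate components is what makes it straightforward to swap in agentic solvers or model-graded scorers while reusing the same task definition.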

The overall narrative emphasises empirical rigour, transparency and collaboration. The institute publishes methods, benchmarks and software; it runs joint testing partnerships and prize programmes to crowdsource difficult evaluations. Readers are invited to explore individual reports and tools on the site to understand specific methods, datasets and results. The collection reflects an attempt to standardise how evaluators measure risk, to strengthen the evidence base for policy and to make evaluations more reproducible and informative for stakeholders.
