Assessing LLM autograders and introducing HiBayES in AISI research

AISI publishes methods to evaluate large language model judges and unveils HiBayES, a hierarchical Bayesian framework to improve Artificial Intelligence evaluation.

The AI Security Institute's 'our work' hub aggregates recent research, tools, and partnership activity aimed at improving how advanced AI systems are measured and managed. The site highlights methodological advances alongside applied evaluations and programmatic initiatives. It is organised as a catalogue of reports, tool releases and explanatory posts on why rigorous testing matters for system deployment and policy.

Two recent research outputs stand out. 'LLM judges on trial' presents a new statistical framework for assessing autograders, the model-based evaluators increasingly used to score other models; the framework estimates grader reliability jointly with the primary research question, rather than treating graded labels as ground truth. HiBayES proposes a hierarchical Bayesian modelling approach to evaluation, addressing the nested, correlated structure of large-scale LLM testing and providing more robust uncertainty estimates. Together these contributions tackle practical problems in evaluation: bias in model-graded labels, variability across tasks and annotators, and the need to quantify confidence in aggregate results.
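HiBayES' actual model is not reproduced here, but the core idea behind hierarchical approaches of this kind, partial pooling, can be sketched in a few lines: each task's score is shrunk toward the population mean, with sparsely sampled tasks shrunk hardest, which stabilises small-sample estimates. The data, function name, and `prior_weight` parameter below are illustrative assumptions, not drawn from the HiBayES paper.

```python
import statistics

def partial_pool(task_scores, prior_weight=10.0):
    """Shrink each task's mean score toward the grand mean.

    A crude stand-in for the partial pooling a hierarchical
    Bayesian model performs: tasks with few samples are pulled
    harder toward the overall mean. `prior_weight` plays the
    role of the prior's effective sample size (an illustrative
    choice, not a HiBayES parameter).
    """
    all_scores = [s for scores in task_scores.values() for s in scores]
    grand_mean = statistics.mean(all_scores)
    pooled = {}
    for task, scores in task_scores.items():
        n = len(scores)
        task_mean = statistics.mean(scores)
        # Weighted average of the task's own mean and the grand mean:
        # the fewer samples a task has, the more the prior dominates.
        pooled[task] = (n * task_mean + prior_weight * grand_mean) / (n + prior_weight)
    return pooled

# Hypothetical pass/fail scores: "code" has only two samples,
# so its raw mean of 1.0 is shrunk strongly toward the grand mean.
scores = {
    "math": [1, 0, 1, 1, 1, 0, 1, 1],  # raw mean 0.75
    "code": [1, 1],                     # raw mean 1.00
}
print(partial_pool(scores))
```

A full hierarchical model would also propagate uncertainty (posterior intervals rather than point estimates), which is where the more robust confidence quantification described above comes from.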

The site also surfaces a broad portfolio of complementary work. Inspect, an open-source testing framework, and the Inspect sandboxing toolkit aim to make agentic evaluations scalable and reproducible. RepliBench offers benchmarks for autonomous replication capability. AISI's pre-deployment evaluations of major models, international joint testing exercises on agentic behaviour, and investigative work on cyber, chemical and biological misuse demonstrate an applied pipeline from methodology to real-world assessment. The alignment project and funding announcements show parallel investment in longer-term research and capacity building.

The overall narrative emphasises empirical rigour, transparency and collaboration. The institute publishes methods, benchmarks and software; it runs joint testing partnerships and prize programmes to crowdsource difficult evaluations. Readers are invited to explore individual reports and tools on the site to understand specific methods, datasets and results. The collection reflects an attempt to standardise how evaluators measure risk, to strengthen the evidence base for policy and to make evaluations more reproducible and informative for stakeholders.

Impact Score: 72

Artificial Intelligence tool targets forged radiology reports

University at Buffalo researchers developed a detection system aimed at identifying radiology reports generated by Artificial Intelligence rather than clinicians. The work targets a growing risk of fraud in health care, insurance, and other record-driven industries.

NSF funds teacher training to expand Artificial Intelligence education nationwide

The U.S. National Science Foundation is awarding $11 million to the Computer Science Teachers Association to train K-12 educators in computer science and Artificial Intelligence instruction. The multistate initiative is designed to scale classroom-ready teaching capacity and broaden high-quality learning opportunities for students across the country.

NVIDIA DLSS 5 uses 2D frames and motion vectors

NVIDIA has outlined DLSS 5 as a system that takes 2D frames and motion vectors as input, then uses a generative Artificial Intelligence model to produce its final output. The approach focuses on 2D imagery rather than full 3D scene generation to improve computational efficiency.
