BenchmarkQED automates rigorous benchmarking for retrieval-augmented generation systems

BenchmarkQED introduces an open-source toolkit for automated benchmarking of retrieval-augmented generation systems, advancing evaluation methodology in Artificial Intelligence.

Microsoft Research has unveiled BenchmarkQED, an open-source framework engineered to automate the benchmarking of retrieval-augmented generation (RAG) systems. The toolkit addresses the need for robust, reproducible evaluation as new RAG techniques emerge to answer questions over private datasets. BenchmarkQED automates the entire benchmarking stack, including query synthesis, automated answer evaluation, and standardized dataset preparation, enabling head-to-head comparison of advanced systems such as LazyGraphRAG and traditional vector-based RAG systems across a spectrum of query challenges and datasets.

The toolkit’s AutoQ component uses synthetic query generation to cover both local and global question types, spanning four distinct classes: data-local, activity-local, data-global, and activity-global. This allows for consistent benchmarking even across disparate datasets, removing the need for manual query design. BenchmarkQED leverages the capabilities of GraphRAG, which deploys large language models to construct and summarize knowledge graphs, yielding comprehensive and nuanced responses, especially for complex or global queries that typically challenge standard RAG systems. The benchmarking corpus includes both the AP News health articles dataset and updated Behind the Tech podcast transcripts, which have been made freely available to the research community.
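The four query classes can be read as a two-by-two grid: one axis is scope (local or global), the other is what the question targets (data or activity). The short sketch below only enumerates that grid to make the taxonomy concrete. The variable names are illustrative and are not taken from the BenchmarkQED codebase.

```python
from itertools import product

# Two axes named in the article; the variable names are hypothetical.
scopes = ("local", "global")
sources = ("data", "activity")

# Build labels in the same order the article lists them.
query_classes = [f"{source}-{scope}" for scope, source in product(scopes, sources)]

print(query_classes)
```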

Evaluation is handled by the AutoE component, which applies the LLM-as-a-Judge method. It compares competing system outputs on comprehensiveness, diversity, empowerment, and relevance, four metrics of high-quality question answering. In extensive trials, LazyGraphRAG significantly outperformed standard vector RAG implementations, including those given a one-million-token context window. Gains appeared not just for global queries but also for certain local tasks, with win rates above 50 percent against comparison systems in nearly all scenarios. The comparison systems, which included GraphRAG Global, Vector RAG, LightRAG, RAPTOR, and TREX, among others, were evaluated under identical conditions so that the comparisons are like for like.
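To make the win-rate figure concrete, here is a minimal sketch of how a win rate might be computed from a batch of pairwise judge verdicts. It assumes each verdict is one of "A", "B", or "tie", and that a tie counts as half a win. Both the verdict labels and the tie-scoring rule are assumptions for illustration, not details confirmed by the article, and the sample verdicts are made up.

```python
from collections import Counter


def win_rate(judgments, system="A"):
    """Share of pairwise comparisons won by `system`; a tie counts as half a win (assumption)."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    return (counts[system] + 0.5 * counts["tie"]) / len(judgments)


# Hypothetical verdicts from an LLM judge comparing system A against system B.
verdicts = ["A", "A", "B", "tie", "A", "B", "A", "tie"]
rate = win_rate(verdicts)
print(f"{rate:.3f}")  # 0.625
```

Under this scoring, a win rate above 0.5 means system A was preferred more often than system B across the judged queries.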

To further guarantee reliable evaluation, the AutoD component enables the creation of structurally aligned datasets by sampling for topic breadth and depth. These data preparation steps ensure that benchmarking reflects the capabilities of RAG systems rather than the peculiarities of individual corpora. By releasing both the BenchmarkQED toolkit and curated datasets under open licenses, Microsoft Research aims to accelerate the development and validation of next-generation Artificial Intelligence question-answering systems. The team invites practitioners and researchers to engage with the toolkit and datasets on GitHub to drive progress in the field.
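One way to picture sampling for breadth and depth is to treat breadth as the number of topics drawn and depth as the number of documents drawn per topic. The sketch below follows that reading. It assumes documents already carry a topic label, and the function name, document IDs, and topics are all hypothetical. It is not AutoD's actual implementation.

```python
import random
from collections import defaultdict


def sample_breadth_depth(docs, breadth, depth, seed=0):
    """Pick `breadth` topics and `depth` documents per topic from (doc_id, topic) pairs."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    by_topic = defaultdict(list)
    for doc_id, topic in docs:
        by_topic[topic].append(doc_id)
    # Only topics with at least `depth` documents can be sampled to full depth.
    eligible = sorted(t for t, ids in by_topic.items() if len(ids) >= depth)
    if len(eligible) < breadth:
        raise ValueError("not enough topics with the requested depth")
    chosen = rng.sample(eligible, breadth)
    return {topic: rng.sample(by_topic[topic], depth) for topic in chosen}


# Hypothetical corpus: (document id, topic label).
docs = [
    ("d1", "vaccines"), ("d2", "vaccines"),
    ("d3", "nutrition"), ("d4", "nutrition"), ("d5", "nutrition"),
    ("d6", "policy"),
]
sample = sample_breadth_depth(docs, breadth=2, depth=2)
print(sample)
```

Applying the same breadth and depth targets to every corpus gives each benchmark dataset the same shape, which is the kind of structural alignment the article describes.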
