BenchmarkQED automates rigorous benchmarking for retrieval-augmented generation systems

BenchmarkQED is an open-source toolkit for automated benchmarking of retrieval-augmented generation systems, advancing evaluation methodology for AI question answering.

Microsoft Research has unveiled BenchmarkQED, an open-source framework engineered to automate the benchmarking of retrieval-augmented generation (RAG) systems. The toolkit addresses the crucial need for robust, reproducible evaluation as new RAG techniques emerge for answering questions over private datasets. BenchmarkQED automates the entire benchmarking stack, including query synthesis, automated answer evaluation, and standardized dataset preparation, enabling head-to-head comparison of advanced systems such as LazyGraphRAG against conventional vector-based RAG baselines across a spectrum of query types and datasets.
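
To picture how these pieces fit together, here is a hypothetical end-to-end loop: queries are synthesized from a shared corpus, two systems answer each query, and a judge model picks a winner per query class. The function and object names (synthesize_queries, judge_pair, systems with an answer method) are illustrative assumptions, not BenchmarkQED's actual API.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    wins: int = 0
    losses: int = 0
    ties: int = 0


def run_benchmark(corpus, system_a, system_b, synthesize_queries, judge_pair):
    """Compare two RAG systems head-to-head on synthetic queries from one corpus.

    `synthesize_queries(corpus, query_class)` yields query strings and
    `judge_pair(query, answer_a, answer_b)` returns "a", "b", or "tie".
    Both are caller-supplied placeholders in this sketch.
    """
    results = {}
    for query_class in ("data-local", "activity-local", "data-global", "activity-global"):
        tally = Tally()
        for query in synthesize_queries(corpus, query_class):
            answer_a = system_a.answer(query)
            answer_b = system_b.answer(query)
            verdict = judge_pair(query, answer_a, answer_b)
            if verdict == "a":
                tally.wins += 1
            elif verdict == "b":
                tally.losses += 1
            else:
                tally.ties += 1
        results[query_class] = tally
    return results
```

Per-class win rates then fall out directly as wins divided by the number of non-tie comparisons.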

The toolkit’s AutoQ component uses synthetic query generation to cover both local and global question types, spanning four distinct classes: data-local, activity-local, data-global, and activity-global. This allows consistent benchmarking across disparate datasets and removes the need for manual query design. BenchmarkQED leverages GraphRAG, which uses large language models to construct and summarize knowledge graphs, yielding comprehensive and nuanced responses, especially for the complex, global queries that typically challenge standard RAG systems. The benchmarking corpus includes the AP News health articles dataset and updated Behind the Tech podcast transcripts, both of which have been made freely available to the research community.
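
To make the four query classes concrete, the sketch below shows class-conditioned synthetic query generation. It assumes a generic complete(prompt) LLM callable and a corpus summary as input; the prompt wording is an invented illustration, not AutoQ's actual prompts.

```python
# Illustrative prompts for each AutoQ-style query class (assumed wording).
QUERY_CLASS_PROMPTS = {
    "data-local": "Write a question answerable from a single passage of the source text below.",
    "data-global": "Write a question that requires synthesizing themes across the entire source text below.",
    "activity-local": "Write a question a user performing a specific task would ask about one part of the source text below.",
    "activity-global": "Write a question a user performing a broad task would ask that spans the whole source text below.",
}


def synthesize_queries(complete, corpus_summary, per_class=5):
    """Generate `per_class` synthetic queries for each of the four classes.

    `complete(prompt) -> str` is any LLM completion function supplied by the caller.
    """
    queries = {}
    for query_class, instruction in QUERY_CLASS_PROMPTS.items():
        prompt = f"{instruction}\n\nSOURCE TEXT:\n{corpus_summary}"
        queries[query_class] = [complete(prompt) for _ in range(per_class)]
    return queries
```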

Evaluation is handled by the AutoE component, which applies an LLM-as-a-judge methodology. It assesses competing systems' outputs on comprehensiveness, diversity, empowerment, and relevance, key qualities of high-quality question answering. In extensive trials, LazyGraphRAG significantly outperformed standard vector RAG implementations, including configurations given a one-million-token context window. Performance gains were noted not just for global queries but also for certain local tasks, with win rates exceeding 50 percent in nearly all scenarios. Competing systems, including GraphRAG Global, Vector RAG, LightRAG, RAPTOR, and TREX, were evaluated under identical conditions, ensuring rigorous and fair comparisons.
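
A minimal sketch of this style of pairwise, criterion-by-criterion judging follows, again assuming a generic complete(prompt) callable. The prompt, order randomization, and win-rate aggregation here are simplified assumptions; the toolkit's actual prompts and scoring are more involved.

```python
import random

CRITERIA = ("comprehensiveness", "diversity", "empowerment", "relevance")


def judge_pair(complete, query, answer_a, answer_b, criterion):
    """Ask an LLM which answer is better on one criterion; returns 'a', 'b', or 'tie'."""
    # Randomize presentation order to reduce position bias in the judge.
    first, second = ("a", "b") if random.random() < 0.5 else ("b", "a")
    answers = {"a": answer_a, "b": answer_b}
    prompt = (
        f"Question: {query}\n\n"
        f"Answer 1:\n{answers[first]}\n\n"
        f"Answer 2:\n{answers[second]}\n\n"
        f"Which answer is better in terms of {criterion}? Reply with '1', '2', or 'tie'."
    )
    verdict = complete(prompt).strip()
    if verdict == "1":
        return first
    if verdict == "2":
        return second
    return "tie"


def win_rate(outcomes):
    """Fraction of non-tie comparisons won by system A."""
    wins = outcomes.count("a")
    losses = outcomes.count("b")
    return wins / (wins + losses) if (wins + losses) else 0.0
```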

To further support reliable evaluation, the AutoD component creates structurally aligned datasets by sampling for topic breadth and depth. This data preparation step ensures that benchmarks reflect the capabilities of RAG systems rather than the peculiarities of individual corpora. By releasing both the BenchmarkQED toolkit and curated datasets under open licenses, Microsoft Research aims to accelerate the development and validation of next-generation AI question-answering systems. The team invites practitioners and researchers to engage with the toolkit and datasets on GitHub to drive progress in the field.
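
As an illustration of the breadth-and-depth idea behind AutoD, here is a minimal sketch of topic-stratified sampling. It assumes documents already carry topic labels (for example, from a prior clustering step); the parameter names and grouping logic are illustrative, not AutoD's actual design.

```python
import random
from collections import defaultdict


def sample_dataset(documents, breadth, depth, seed=0):
    """Sample `breadth` topics and up to `depth` documents per topic.

    `documents` is an iterable of (topic, text) pairs.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for topic, text in documents:
        by_topic[topic].append(text)

    # Breadth: pick a fixed number of topics; depth: cap documents per topic.
    topics = rng.sample(sorted(by_topic), min(breadth, len(by_topic)))
    sampled = []
    for topic in topics:
        docs = by_topic[topic]
        sampled.extend(rng.sample(docs, min(depth, len(docs))))
    return sampled
```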
