BenchmarkQED automates rigorous benchmarking for retrieval-augmented generation systems

BenchmarkQED is an open-source toolkit for automated benchmarking of retrieval-augmented generation systems, advancing evaluation methodology in artificial intelligence.

Microsoft Research has released BenchmarkQED, an open-source framework that automates the benchmarking of retrieval-augmented generation (RAG) systems. The toolkit addresses the need for robust, reproducible evaluation as new RAG techniques emerge for answering questions over private datasets. It automates query synthesis, answer evaluation, and standardized dataset preparation, enabling head-to-head comparison of systems such as LazyGraphRAG and traditional vector-based RAG across a range of query types and datasets.

The toolkit’s AutoQ component uses synthetic query generation to cover both local and global question types, spanning four distinct classes: data-local, activity-local, data-global, and activity-global. This allows for consistent benchmarking even across disparate datasets, removing the need for manual query design. BenchmarkQED leverages the capabilities of GraphRAG, which deploys large language models to construct and summarize knowledge graphs, yielding comprehensive and nuanced responses, especially for complex or global queries that typically challenge standard RAG systems. The benchmarking corpus includes both the AP News health articles dataset and updated Behind the Tech podcast transcripts, which have been made freely available to the research community.
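The four query classes can be read as a two-by-two grid: one axis for local versus global scope, one for data versus activity. The Python sketch below is illustrative only and is not AutoQ's actual API; the names Scope, Source, QueryClass, and balanced_plan are hypothetical, and the idea of requesting an equal number of queries per class is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Scope(Enum):
    LOCAL = "local"
    GLOBAL = "global"


class Source(Enum):
    DATA = "data"
    ACTIVITY = "activity"


@dataclass(frozen=True)
class QueryClass:
    """One cell of the 2x2 grid named in the article (hypothetical type)."""
    source: Source
    scope: Scope

    @property
    def label(self) -> str:
        # Produces labels in the article's form, e.g. "data-local".
        return f"{self.source.value}-{self.scope.value}"


def all_query_classes() -> list[QueryClass]:
    # Iterate scope first so the order matches the article:
    # data-local, activity-local, data-global, activity-global.
    return [QueryClass(source, scope) for scope, source in product(Scope, Source)]


def balanced_plan(queries_per_class: int) -> dict[str, int]:
    """Assumed policy: request the same number of synthetic queries per class."""
    return {qc.label: queries_per_class for qc in all_query_classes()}


if __name__ == "__main__":
    print(balanced_plan(50))
```

Keeping the grid explicit means a benchmark run over any dataset can report results per class, rather than as a single blended score.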

Evaluation is handled by the AutoE component, which applies the LLM-as-a-Judge method. It compares competing system outputs on four metrics of answer quality: comprehensiveness, diversity, empowerment, and relevance. In extensive trials, LazyGraphRAG significantly outperformed standard vector RAG implementations, including those given a one-million-token context window. Gains were observed not just for global queries but also for certain local tasks, with win rates exceeding 50 percent across nearly all scenarios. Comparison systems, including GraphRAG Global, Vector RAG, LightRAG, RAPTOR, and TREX, were evaluated under identical conditions to keep the comparisons fair.
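To make the win-rate figures concrete, here is a minimal Python sketch of how pairwise judge verdicts could be aggregated per query class and metric. It is not AutoE's implementation: the Judgment record, the win_rates function, and the convention of counting a tie as half a win are all assumptions for illustration. Only the four metric names come from the article.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple

# Metric names as listed in the article.
METRICS = ("comprehensiveness", "diversity", "empowerment", "relevance")


class Judgment(NamedTuple):
    """One pairwise verdict from an LLM judge (hypothetical record shape)."""
    query_class: str  # e.g. "data-global"
    metric: str       # one of METRICS
    winner: str       # "a", "b", or "tie"


def win_rates(judgments: Iterable[Judgment]) -> dict[tuple[str, str], float]:
    """Win rate of system A per (query_class, metric).

    Assumption: a tie counts as half a win, so 0.5 means parity.
    """
    score: dict[tuple[str, str], float] = defaultdict(float)
    count: dict[tuple[str, str], int] = defaultdict(int)
    for j in judgments:
        if j.metric not in METRICS:
            raise ValueError(f"unknown metric: {j.metric}")
        if j.winner not in ("a", "b", "tie"):
            raise ValueError(f"unknown winner: {j.winner}")
        key = (j.query_class, j.metric)
        count[key] += 1
        if j.winner == "a":
            score[key] += 1.0
        elif j.winner == "tie":
            score[key] += 0.5
    return {key: score[key] / count[key] for key in count}


if __name__ == "__main__":
    verdicts = [
        Judgment("data-global", "relevance", "a"),
        Judgment("data-global", "relevance", "a"),
        Judgment("data-global", "relevance", "b"),
        Judgment("data-global", "relevance", "tie"),
    ]
    print(win_rates(verdicts))
```

Under this convention, a win rate above 0.5 for a given class and metric means system A was preferred more often than system B.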

To support reliable evaluation, the AutoD component enables the creation of structurally aligned datasets by sampling for topic breadth and depth. These data preparation steps help ensure that benchmarking reflects the capabilities of RAG systems rather than the peculiarities of individual corpora. By releasing both the BenchmarkQED toolkit and curated datasets under open licenses, Microsoft Research aims to accelerate the development and validation of next-generation artificial intelligence question-answering systems. The team invites practitioners and researchers to engage with the toolkit and datasets on GitHub to drive progress in the field.
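One way to picture sampling for breadth and depth is to choose a fixed number of topics (breadth) and a fixed number of documents within each chosen topic (depth). The Python sketch below assumes every document already carries a topic label; the article does not say how AutoD derives topics or draws samples, so the sample_breadth_depth function and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def sample_breadth_depth(
    doc_topics: dict[str, str],
    breadth: int,
    depth: int,
    seed: int = 0,
) -> list[str]:
    """Pick `breadth` topics and `depth` documents from each.

    doc_topics maps a document id to its (pre-assigned, assumed) topic label.
    A fixed seed keeps the sample reproducible across runs.
    """
    by_topic: dict[str, list[str]] = defaultdict(list)
    for doc_id, topic in doc_topics.items():
        by_topic[topic].append(doc_id)

    # Only topics with enough documents can satisfy the requested depth.
    eligible = sorted(t for t, docs in by_topic.items() if len(docs) >= depth)
    if len(eligible) < breadth:
        raise ValueError("not enough topics with the requested depth")

    rng = random.Random(seed)
    sample: list[str] = []
    for topic in rng.sample(eligible, breadth):
        # Sort first so the result depends only on the seed, not dict order.
        sample.extend(rng.sample(sorted(by_topic[topic]), depth))
    return sample


if __name__ == "__main__":
    corpus = {f"t{t}-d{d}": f"t{t}" for t in range(5) for d in range(4)}
    print(sample_breadth_depth(corpus, breadth=3, depth=2))
```

Applying the same breadth and depth settings to two different corpora yields samples with the same shape, which is the sense in which the resulting datasets would be structurally aligned.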


