Evaluating Compound AI Systems through Behaviors, Not Benchmarks

Fujitsu Research of India

Abstract

Compound AI (CAI) systems (also known as LLM Agents) combine large language models with retrievers and tools to perform multi-step information-seeking tasks. Traditional benchmark-based evaluation often fails to capture real-world operational failure modes. This work presents a behavior-driven evaluation framework that (1) generates explicit test specifications aligned with real usage scenarios and (2) implements them as document-grounded conversational QA tests. Our framework uses submodular selection to maximize diversity and coverage of tests and graph-based pipelines to realize scenarios over textual and tabular sources. Evaluations on QuAC and HybriDialogue show that the behavior-driven tests reveal failure modes missed by standard benchmarks.

Why this matters — for developers

Benchmarks are useful, but they often hide targeted failure modes that only surface in realistic usage scenarios. To capture those failure modes without the expense of exhaustive manual testing, we devised a behavior-driven approach that gives you interpretable, scenario-based tests (think: “Given / When / Then”). These tests help you understand how your agent behaves across realistic scenarios before it's deployed, so you can find and fix brittle behaviors.
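Concretely, a behavior specification can be captured in a handful of fields. The sketch below is illustrative only; the dataclass name, its fields, and the example scenario are assumptions, not the framework's exact schema.

```python
from dataclasses import dataclass

@dataclass
class BehaviorSpec:
    """One behavior-driven test specification (illustrative schema)."""
    scenario: str  # short name for the behavior under test
    given: str     # grounding context the agent is expected to use
    when: str      # the user interaction that triggers the behavior
    then: str      # the expected, checkable outcome

# Hypothetical specification for an information-seeking agent:
spec = BehaviorSpec(
    scenario="Answer grounded in a retrieved table",
    given="the agent has retrieved a table of Olympic host cities",
    when="the user asks which city hosted the 1996 Summer Olympics",
    then="the agent answers 'Atlanta' and cites the retrieved table",
)
```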

  • Surfaces realistic failure cases early
  • Makes evaluation requirements explicit
  • Supports tests across tables & text
  • Provides targeted tests for your LLM Agent

How the evaluation works — 3 steps

  1. Generate test specifications: Use LLMs together with human domain experts to create BDD-style test specifications (“Scenario / Given / When / Then”) that describe the expected behaviors of your LLM Agent.
  2. Select a diverse subset: Apply our submodular selection mechanism to maximize semantic diversity and document coverage, yielding a compact but representative test-specification suite (a minimal selection sketch follows this list).
  3. Implement as conversational QA: Transform each specification into document-grounded QA pairs for evaluation, drawing on both textual and tabular sources (see the QA sketch further below).
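As a rough illustration of step 2, the sketch below greedily maximizes a facility-location-style coverage objective over specification embeddings. The objective, the greedy procedure, and the assumption of precomputed, nonnegative-similarity embeddings are stand-ins; the framework's exact selection criterion may differ.

```python
import numpy as np

def greedy_submodular_select(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k specification indices under a facility-location
    coverage objective (a stand-in for the framework's criterion).
    embeddings: (n, d) array of specification embeddings; assumes
    nonnegative similarities (e.g., cosine on sentence embeddings)."""
    sim = embeddings @ embeddings.T       # pairwise similarities
    n = sim.shape[0]
    selected: list[int] = []
    covered = np.zeros(n)                 # best similarity of each spec to the chosen set
    for _ in range(min(k, n)):
        # Marginal gain of each candidate: how much it improves total coverage.
        gains = np.maximum(sim, covered[None, :]).sum(axis=1) - covered.sum()
        gains[selected] = -np.inf         # never re-pick an already selected spec
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected
```

Because facility-location coverage is monotone and submodular, this greedy pass comes within a (1 - 1/e) factor of the optimal coverage, which is why a small selected suite can remain representative.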

These steps are designed to complement — not replace — standard benchmarks, while exposing operational failures that aggregate metrics often miss.
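To make step 3 concrete, here is one way a specification could be realized as a document-grounded conversational QA check. The QATest fields, the agent callable's signature, and the containment-based scoring are illustrative assumptions, not the framework's actual interface or metric.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QATest:
    """One conversational QA turn derived from a behavior specification."""
    context: str         # grounding passage or linearized table
    history: list[str]   # earlier turns in the conversation, if any
    question: str
    expected: str        # reference answer grounded in `context`

# `agent` is a placeholder callable: (context, history, question) -> answer
def run_qa_test(test: QATest, agent: Callable[[str, list[str], str], str]) -> bool:
    answer = agent(test.context, test.history, test.question)
    # Simple containment check; swap in token F1 or an LLM judge as needed.
    return test.expected.lower() in answer.lower()
```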

Key Contributions

  • Behavior-driven test specifications for evaluating information-seeking CAI systems.
  • Automated graph-based pipelines to implement scenario-aligned conversational QA over textual and tabular sources.
  • Submodular selection procedure to ensure diverse, high-coverage test suites.

BibTeX

@inproceedings{bhagat2025evaluating,
  title     = {Evaluating Compound AI Systems through Behaviors, Not Benchmarks},
  author    = {Bhagat, Pranav and K.N. Ajay Shastry and Panda, Pranoy and Devaguptapu, Chaitanya},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  pages     = {24193--24222},
  year      = {2025},
  publisher = {Association for Computational Linguistics}
}