Compound AI (CAI) systems (also known as LLM Agents) combine large language models with retrievers and tools to perform multi-step information-seeking tasks. Traditional benchmark-based evaluation often fails to capture real-world operational failure modes. This work presents a behavior-driven evaluation framework that (1) generates explicit test specifications aligned with real usage scenarios and (2) implements them as document-grounded conversational QA tests. Our framework uses submodular selection to maximize diversity and coverage of tests and graph-based pipelines to realize scenarios over textual and tabular sources. Evaluations on QuAC and HybriDialogue show that the behavior-driven tests reveal failure modes missed by standard benchmarks.
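To make the test-selection step concrete, here is a minimal sketch of greedy submodular selection over a pool of candidate test scenarios. The facility-location objective, the cosine-similarity embeddings, and the helper names (`facility_location_gain`, `select_diverse_tests`) are illustrative assumptions for this sketch, not necessarily the paper's exact formulation.

```python
# Minimal sketch: greedy maximization of a facility-location objective to pick
# a diverse, high-coverage subset of candidate test scenarios.
# Assumptions: scenarios are represented as embedding vectors; coverage is
# measured by cosine similarity. These choices are illustrative only.
import numpy as np


def facility_location_gain(sim, selected, candidate):
    """Marginal gain of adding `candidate` under f(S) = sum_i max_{j in S} sim[i, j]."""
    if not selected:
        return sim[:, candidate].sum()
    current_best = sim[:, selected].max(axis=1)
    return np.maximum(current_best, sim[:, candidate]).sum() - current_best.sum()


def select_diverse_tests(embeddings, k):
    """Greedily pick k scenarios whose embeddings best 'cover' the full pool."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T  # pairwise cosine similarity
    selected, remaining = [], set(range(len(embeddings)))
    for _ in range(min(k, len(embeddings))):
        best = max(remaining, key=lambda c: facility_location_gain(sim, selected, c))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Example: 200 candidate scenarios embedded in 384-d space; keep the 20
    # most representative ones.
    rng = np.random.default_rng(0)
    pool = rng.normal(size=(200, 384))
    print(select_diverse_tests(pool, k=20))
```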
Benchmarks are useful, but they often hide targeted failure modes that only surface in realistic usage scenarios. To capture such failure modes cheaply, we devise a behavior-driven approach that gives you interpretable, scenario-based tests (think: "Given / When / Then") which help you understand how your agent behaves in different realistic scenarios before it is deployed, so you can find and fix brittle behaviors. A minimal sketch of such a test appears after the feature list below.
Surfaces realistic failure cases early
Makes evaluation requirements explicit
Supports tests across tables & text
Generates targeted tests for your LLM Agent
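As a rough illustration of what a behavior-driven, document-grounded conversational QA test can look like, the sketch below encodes a "Given / When / Then" scenario as a small Python structure. The `BehaviorSpec` dataclass and the `run_agent` callable are hypothetical placeholders, not the framework's actual API.

```python
# Minimal sketch of a "Given / When / Then" behavior specification rendered as a
# document-grounded, multi-turn QA test. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BehaviorSpec:
    given: str                    # grounding context, e.g. a document or table snippet
    when: List[str]               # multi-turn user questions posed to the agent
    then: Callable[[str], bool]   # predicate the agent's final answer must satisfy


def run_behavior_test(spec: BehaviorSpec,
                      run_agent: Callable[[str, List[str]], str]) -> bool:
    """Run the conversation against the agent and check the expected behavior."""
    answer = run_agent(spec.given, spec.when)
    return spec.then(answer)


# Example: the agent should ground its answer in the supplied table rather than guess.
spec = BehaviorSpec(
    given="Table: | Year | Champion |\n| 2021 | Team A |\n| 2022 | Team B |",
    when=["Who won in 2022?", "And the year before that?"],
    then=lambda answer: "Team A" in answer,
)
```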
These tests are designed to complement, not replace, standard benchmarks, while exposing operational failures that aggregate metrics often miss.
@inproceedings{bhagat2025evaluating,
title = {Evaluating Compound AI Systems through Behaviors, Not Benchmarks},
author = {Bhagat, Pranav and Shastry, K. N. Ajay and Panda, Pranoy and Devaguptapu, Chaitanya},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
pages = {24193--24222},
year = {2025},
publisher = {Association for Computational Linguistics}
}