Abstract
Enterprise systems are crucial for enhancing productivity and decision-making for employees and customers. Integrating LLM-based agents into these systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls.
Our benchmark, EnterpriseBench, provides an enterprise simulation environment along with 500 realistic tasks for comprehensive agent assessment. Through extensive evaluation across multiple domains, EnterpriseBench reveals significant gaps between current LLM agent capabilities and enterprise requirements, establishing new benchmarks for real-world AI deployment readiness.
🎯 Key Contributions:
- Realistic Enterprise Simulation: Comprehensive sandbox with authentic business data across 10+ domains
- Diverse Tasks across Domains: Search-based and CRUD-based task assessment across different domains
- Automated Task Generation: Dynamic creation of enterprise tasks with configurable complexity
Introduction
The deployment of LLM agents in enterprise environments presents unique challenges that current benchmarks fail to address. While existing evaluation frameworks focus on isolated capabilities like question-answering or code generation, real enterprise scenarios require agents to navigate complex, interconnected business systems with authentic data relationships and domain-specific constraints.
Why Enterprise-Specific Evaluation Matters
Enterprise environments are characterized by:
- Multi-domain Integration: Tasks often span HR, IT, Sales, and Engineering departments
- Complex Data Relationships: Information is interconnected across multiple business systems
- Domain-Specific Constraints: Each department has unique workflows, terminology, and requirements
- Realistic Scale: Enterprise data volumes and complexity far exceed academic benchmarks
EnterpriseBench addresses these gaps by providing the first comprehensive framework specifically designed for enterprise LLM agent evaluation.
Figure 1: EnterpriseBench agent workflow showing the complete task execution process from user query through planning, execution, and task completion within the enterprise environment.
EnterpriseBench Framework
Architecture Overview
EnterpriseBench consists of three core components working together to provide comprehensive enterprise agent evaluation:
1. Enterprise Sandbox Environment
- Realistic Data: Synthetic but authentic business data across 10+ domains
- Interconnected Systems: Data relationships mirror real enterprise architectures
- Scalable Infrastructure: Supports various task types and complexity levels
- Privacy-Compliant: Synthetic data ensures privacy while maintaining realism
2. Comprehensive Task Suite
- Search Tasks: Information retrieval, conversation analysis, and database queries
- CRUD Tasks: Create, Read, Update, Delete operations on enterprise data
- Multi-Domain Coverage: Tasks spanning HR, IT, Sales, and Engineering departments
- Realistic Complexity: Enterprise-grade scenarios with authentic data relationships
3. Advanced Evaluation System
- Automated Assessment: AI-powered evaluation with multiple scoring criteria
- Performance Metrics: Comprehensive evaluation framework for agent capabilities
- Interactive Interfaces: Streamlit-powered demos for real-time testing
- Comparative Analysis: Benchmarking across different LLM architectures
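To make the evaluation pipeline concrete, here is a minimal sketch of how a benchmark task and an LLM-judge prompt could be represented; the `EnterpriseTask` fields and the scoring prompt are illustrative assumptions, not EnterpriseBench's actual schema or evaluator.

```python
# Minimal sketch of a task record and an LLM-as-judge prompt builder.
# The dataclass fields and prompt wording are illustrative assumptions,
# not EnterpriseBench's actual schema or evaluation criteria.
from dataclasses import dataclass, field


@dataclass
class EnterpriseTask:
    task_id: str
    domain: str                 # e.g. "HR", "IT", "Sales", "Engineering"
    task_type: str              # "search" or "crud"
    instruction: str            # natural-language task given to the agent
    gold_answer: str            # reference outcome used for scoring
    data_sources: list[str] = field(default_factory=list)


def judge_prompt(task: EnterpriseTask, agent_answer: str) -> str:
    """Build a prompt for an LLM evaluator (e.g. GPT-4 or Gemini) that
    compares the agent's answer against the reference outcome."""
    return (
        f"Task: {task.instruction}\n"
        f"Reference outcome: {task.gold_answer}\n"
        f"Agent answer: {agent_answer}\n"
        "Score the agent answer from 0 to 1 for correctness and completeness. "
        "Return only the number."
    )


if __name__ == "__main__":
    task = EnterpriseTask(
        task_id="hr-0042",
        domain="HR",
        task_type="search",
        instruction="Find the onboarding buddy assigned to the newest hire in the Bangalore office.",
        gold_answer="Priya Sharma",
        data_sources=["hr_records", "email_threads"],
    )
    print(judge_prompt(task, "Priya Sharma"))
```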
Supported Domains
EnterpriseBench covers a comprehensive set of business domains with realistic data and task scenarios:
Human Resources
Employee management, recruitment, and policy administration
IT Service Management
Helpdesk operations, incident management, and system administration
Customer Relations
Customer support, sales processes, and relationship management
Software Engineering
Code management, issue tracking, and development collaboration
Business Operations
Project management, partnerships, and strategic planning
Enterprise Communications
Email systems, collaboration tools, and social platforms
Evaluation Methods
Search-Based Evaluation
Search tasks evaluate an agent's ability to find, analyze, and synthesize information across enterprise systems (a minimal sketch follows the list below):
- Information Retrieval: Locate specific data points across multiple systems
- Conversation Analysis: Extract insights from communication threads
- Database Queries: Navigate complex data relationships
- Cross-Domain Search: Find information spanning multiple departments
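As a rough illustration of a cross-domain search task, the sketch below runs a keyword lookup over in-memory stand-ins for a few enterprise sources; the source names and records are invented for the example and are not the sandbox's actual data.

```python
# Minimal sketch of a cross-domain search over in-memory stand-ins for
# enterprise sources. Source names and records are illustrative only.
RECORDS = {
    "hr_records": [
        {"employee": "A. Rao", "department": "Engineering", "manager": "S. Iyer"},
    ],
    "it_tickets": [
        {"ticket": "INC-1208", "employee": "A. Rao", "status": "open",
         "summary": "Laptop replacement request"},
    ],
    "email_threads": [
        {"from": "A. Rao", "to": "it-helpdesk", "subject": "Re: INC-1208",
         "body": "Any update on my laptop replacement?"},
    ],
}


def cross_domain_search(query: str) -> list[tuple[str, dict]]:
    """Return (source, record) pairs whose fields mention the query string."""
    query = query.lower()
    hits = []
    for source, records in RECORDS.items():
        for record in records:
            if any(query in str(value).lower() for value in record.values()):
                hits.append((source, record))
    return hits


if __name__ == "__main__":
    for source, record in cross_domain_search("INC-1208"):
        print(source, record)
```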
CRUD-Based Evaluation
CRUD tasks assess an agent's capability to perform standard business operations (a minimal sketch follows the list below):
- Create: Generate new records, documents, or communications
- Read: Access and interpret existing business data
- Update: Modify records while maintaining data integrity
- Delete: Remove outdated or incorrect information safely
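The sketch below illustrates the four CRUD operations against a small in-memory record store with a basic integrity check on update; a real benchmark run operates on the sandbox's own systems, so the store and field names here are assumptions for illustration only.

```python
# Minimal sketch of CRUD operations against an in-memory record store.
# The store and field names are illustrative assumptions, not the
# sandbox's actual systems.
from copy import deepcopy


class RecordStore:
    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def create(self, record_id: str, record: dict) -> None:
        if record_id in self._records:
            raise ValueError(f"{record_id} already exists")
        self._records[record_id] = deepcopy(record)

    def read(self, record_id: str) -> dict:
        return deepcopy(self._records[record_id])

    def update(self, record_id: str, changes: dict) -> None:
        # Only touch fields that already exist, so unrelated data stays intact.
        current = self._records[record_id]
        unknown = set(changes) - set(current)
        if unknown:
            raise KeyError(f"unknown fields: {sorted(unknown)}")
        current.update(changes)

    def delete(self, record_id: str) -> None:
        del self._records[record_id]


if __name__ == "__main__":
    store = RecordStore()
    store.create("emp-001", {"name": "A. Rao", "title": "Engineer", "location": "Pune"})
    store.update("emp-001", {"title": "Senior Engineer"})
    print(store.read("emp-001"))
    store.delete("emp-001")
```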
Performance Results
Table 3: EnterpriseBench Evaluation - Comparison of performance across agents using different models and planning strategies with LangChain and DSPy frameworks
| Model | GPT-4 Evaluator: w/o Planning | GPT-4 Evaluator: CoT | GPT-4 Evaluator: ReAct | GPT-4 Evaluator: w/ Gold Planning | Gemini Evaluator: w/o Planning | Gemini Evaluator: CoT | Gemini Evaluator: ReAct | Gemini Evaluator: w/ Gold Planning |
|---|---|---|---|---|---|---|---|---|
| **LangChain Framework** | | | | | | | | |
| GPT-4o | 0.29 | 0.27 | 0.32 | 0.43 | 0.27 | 0.28 | 0.29 | 0.44 |
| Claude-3.5-Sonnet | 0.31 | 0.27 | 0.28 | 0.38 | 0.32 | 0.30 | 0.30 | 0.41 |
| o1-mini | 0.31 | 0.28 | 0.35 | 0.51 | 0.28 | 0.27 | 0.32 | 0.47 |
| Llama-3.1-8B | 0.04 | 0.06 | 0.14 | 0.20 | 0.03 | 0.04 | 0.09 | 0.21 |
| Llama-3.3-70B | 0.23 | 0.22 | 0.21 | 0.40 | 0.24 | 0.23 | 0.23 | 0.36 |
| **DSPy Framework** | | | | | | | | |
| GPT-4o | 0.19 | 0.32 | 0.34 | 0.50 | 0.25 | 0.26 | 0.27 | 0.47 |
| Claude-3.5-Sonnet | 0.19 | 0.24 | 0.30 | 0.50 | 0.21 | 0.29 | 0.26 | 0.44 |
| o1-mini | 0.29 | 0.33 | 0.38 | 0.62 | 0.27 | 0.32 | 0.41 | 0.63 |
| Llama-3.1-8B | 0.10 | 0.15 | 0.15 | 0.34 | 0.07 | 0.14 | 0.16 | 0.34 |
| Llama-3.3-70B | 0.20 | 0.27 | 0.30 | 0.47 | 0.24 | 0.25 | 0.28 | 0.48 |
Interactive Demos
EnterpriseBench provides interactive Streamlit applications for hands-on agent evaluation across different enterprise scenarios:
Task Generation
Experience automated task creation across different enterprise domains with configurable complexity and real-time generation.
- Department Selection: Choose from 6 major business domains
- Complexity Control: Adjust task difficulty and scope
- Real-time Generation: Create tasks dynamically based on parameters
- JSON Export: Download generated tasks for evaluation
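A minimal sketch of what parameterised task generation with JSON export might look like is shown below; the department list, templates, and complexity levels are placeholders, not the demo's actual generation logic.

```python
# Sketch of parameterised task generation with JSON export. The templates
# and complexity levels are placeholders, not the demo's actual logic.
import json
import random

DEPARTMENTS = ["HR", "IT", "Sales", "Engineering", "Operations", "Communications"]
TEMPLATES = {
    "simple": "Retrieve the latest {artifact} for the {department} team.",
    "complex": "Cross-reference {artifact} across {department} and one other "
               "department, then summarise any inconsistencies.",
}


def generate_task(department: str, complexity: str = "simple") -> dict:
    """Create one task description for the given department and complexity."""
    artifact = random.choice(["status report", "ticket queue", "policy document"])
    return {
        "department": department,
        "complexity": complexity,
        "instruction": TEMPLATES[complexity].format(artifact=artifact, department=department),
    }


if __name__ == "__main__":
    tasks = [generate_task(dep, "complex") for dep in DEPARTMENTS]
    with open("generated_tasks.json", "w") as fh:
        json.dump(tasks, fh, indent=2)   # downloadable JSON export
```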
Search Evaluation
Test agent capabilities on information retrieval and analysis tasks across interconnected business data sources.
- Multi-Domain Queries: Search across HR, IT, Sales, and Engineering data
- Complex Relationships: Navigate interconnected business data
- Real-time Results: See agent performance in real-time
- Performance Analytics: Detailed metrics and failure analysis
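For flavour, here is a minimal Streamlit sketch of a search-evaluation page; the domains, sample data, and lookup logic are placeholders, not the actual EnterpriseBench demo.

```python
# Minimal Streamlit sketch of a search-evaluation page
# (run with `streamlit run search_demo.py`). The domains, sample data,
# and lookup logic are placeholders, not the actual demo.
import streamlit as st

SAMPLE_DATA = {
    "HR": [{"employee": "A. Rao", "manager": "S. Iyer"}],
    "IT": [{"ticket": "INC-1208", "status": "open"}],
    "Sales": [{"account": "Acme Corp", "stage": "negotiation"}],
    "Engineering": [{"repo": "billing-service", "open_issues": 7}],
}

st.title("Search Evaluation (sketch)")
domains = st.multiselect("Domains to search", list(SAMPLE_DATA), default=list(SAMPLE_DATA))
query = st.text_input("Query", placeholder="e.g. INC-1208")

if st.button("Run search") and query:
    hits = [
        (domain, record)
        for domain in domains
        for record in SAMPLE_DATA[domain]
        if query.lower() in str(record).lower()
    ]
    st.write(f"{len(hits)} matching record(s)")
    for domain, record in hits:
        st.json({"domain": domain, **record})
```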
Email Communication
Watch agents handle enterprise communication tasks, including drafting and sending emails for business operations.
- Email Drafting: Automated composition of professional emails
- Context Awareness: Understanding business context and requirements
- Multi-Step Process: Complete workflow from analysis to action
- Business Integration: Seamless integration with enterprise systems
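As a rough sketch of the email-drafting step, the snippet below composes a message with the standard library's `email.message.EmailMessage`; the addresses and the sandbox "send" stub are illustrative, and the actual demo's integration with the enterprise systems is not shown here.

```python
# Sketch of drafting an enterprise email with the standard library.
# Addresses, subject, and the sandbox "send" stub are illustrative only;
# the real demo integrates with the benchmark's own communication system.
from email.message import EmailMessage


def draft_status_email(recipient: str, project: str, summary: str) -> EmailMessage:
    """Compose a professional status update email for the given project."""
    msg = EmailMessage()
    msg["From"] = "agent@example.com"
    msg["To"] = recipient
    msg["Subject"] = f"Weekly status: {project}"
    msg.set_content(
        f"Hi,\n\nHere is this week's update on {project}:\n\n{summary}\n\n"
        "Best regards,\nEnterprise Agent"
    )
    return msg


def sandbox_send(msg: EmailMessage) -> None:
    """Stand-in for the sandbox's outbox; a real run would not use SMTP."""
    print(msg)


if __name__ == "__main__":
    email = draft_status_email(
        recipient="pm@example.com",
        project="billing-service migration",
        summary="Schema migration complete; load tests scheduled for Friday.",
    )
    sandbox_send(email)
```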
How to Cite
If you use EnterpriseBench in your research, please cite our work:
@inproceedings{vishwakarma-etal-2025-llms,
title = "Can {LLM}s Help You at Work? A Sandbox for Evaluating {LLM} Agents in Enterprise Environments",
author = "Vishwakarma, Harsh and
Agarwal, Ankush and
Patil, Ojas and
Devaguptapu, Chaitanya and
Chandran, Mahesh",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.466/",
pages = "9178--9212",
ISBN = "979-8-89176-332-6",
}
🚀 Ready to Evaluate Your LLM Agents?
EnterpriseBench provides a comprehensive framework for testing LLM agents in realistic enterprise environments. Start evaluating today!