Harsh Vishwakarma1,*, Ankush Agarwal1,*, Ojas Patil1, Chaitanya Devaguptapu1, Mahesh Chandran1
1Fujitsu Research India
*Equal contribution

Abstract

Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM-based systems into them enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls.

Our benchmark, EnterpriseBench, provides an enterprise simulation environment along with 500 realistic tasks for comprehensive agent assessment. Through extensive evaluation across multiple domains, EnterpriseBench reveals significant gaps between current LLM agent capabilities and enterprise requirements, establishing a new benchmark for real-world AI deployment readiness.

🎯 Key Contributions:

  • Realistic Enterprise Simulation: Comprehensive sandbox with authentic business data across 10+ domains
  • Diverse Tasks across Domains: Search-based and CRUD-based task assessment across different domains
  • Automated Task Generation: Dynamic creation of enterprise tasks with configurable complexity

Introduction

The deployment of LLM agents in enterprise environments presents unique challenges that current benchmarks fail to address. While existing evaluation frameworks focus on isolated capabilities like question-answering or code generation, real enterprise scenarios require agents to navigate complex, interconnected business systems with authentic data relationships and domain-specific constraints.

Why Enterprise-Specific Evaluation Matters

Enterprise environments are characterized by:

  • Multi-domain Integration: Tasks often span HR, IT, Sales, and Engineering departments
  • Complex Data Relationships: Information is interconnected across multiple business systems
  • Domain-Specific Constraints: Each department has unique workflows, terminology, and requirements
  • Realistic Scale: Enterprise data volumes and complexity far exceed academic benchmarks

EnterpriseBench addresses these gaps by providing the first comprehensive framework specifically designed for enterprise LLM agent evaluation.

Figure 1: EnterpriseBench agent workflow, showing the complete task-execution process from user query through planning, execution, and task completion within the enterprise environment.

EnterpriseBench Framework

Architecture Overview

EnterpriseBench consists of three core components working together to provide comprehensive enterprise agent evaluation:

1. Enterprise Sandbox Environment

  • Realistic Data: Synthetic but authentic business data across 10+ domains
  • Interconnected Systems: Data relationships mirror real enterprise architectures
  • Scalable Infrastructure: Supports various task types and complexity levels
  • Privacy-Compliant: Synthetic data ensures privacy while maintaining realism
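
To make the data-relationship idea concrete, below is a minimal sketch of how interconnected records might look, using hypothetical Python dataclasses; the actual EnterpriseBench schema is richer and may differ.

```python
# Hypothetical sketch of interconnected enterprise records (illustrative only;
# not the actual EnterpriseBench data model).
from dataclasses import dataclass, field


@dataclass
class Employee:
    employee_id: str
    name: str
    department: str                 # e.g. "HR", "IT", "Sales", "Engineering"
    manager_id: str | None = None   # links to another Employee


@dataclass
class SupportTicket:
    ticket_id: str
    reporter_id: str                # links back to Employee.employee_id
    status: str = "open"
    comments: list[str] = field(default_factory=list)


# Relationships span systems: an IT ticket references an HR employee record,
# so answering "which open tickets belong to Alice's reports?" requires
# traversing both sources.
alice = Employee("E001", "Alice", "Engineering")
bob = Employee("E002", "Bob", "Engineering", manager_id="E001")
ticket = SupportTicket("T100", reporter_id="E002")
```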

2. Comprehensive Task Suite

  • Search Tasks: Information retrieval, conversation analysis, and database queries
  • CRUD Tasks: Create, Read, Update, Delete operations on enterprise data
  • Multi-Domain Coverage: Tasks spanning HR, IT, Sales, and Engineering departments
  • Realistic Complexity: Enterprise-grade scenarios with authentic data relationships
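
For illustration, a single task entry might look roughly like the following; the field names are assumptions for this sketch, not the benchmark's released format.

```python
# Hypothetical shape of one EnterpriseBench-style task entry (field names are
# illustrative assumptions, not the released format).
search_task = {
    "task_id": "it-search-042",
    "domain": "IT Service Management",
    "type": "search",                    # "search" or "crud"
    "instruction": "Find all unresolved tickets opened by Engineering "
                   "employees during the last quarter.",
    "data_sources": ["ticketing_system", "employee_directory"],
    "gold_answer": ["T100", "T117"],     # reference used by the evaluator
}
```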

3. Advanced Evaluation System

  • Automated Assessment: AI-powered evaluation with multiple scoring criteria
  • Performance Metrics: Comprehensive evaluation framework for agent capabilities
  • Interactive Interfaces: Streamlit-powered demos for real-time testing
  • Comparative Analysis: Benchmarking across different LLM architectures
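
As a rough illustration of automated assessment, the sketch below scores an agent's output with an LLM judge; the `judge` callable and the grading prompt are assumptions, not the benchmark's actual evaluation prompt or criteria.

```python
# Minimal LLM-as-judge sketch. `judge` is an assumed callable that sends a
# prompt to whichever evaluator model is configured (e.g. GPT-4 or Gemini)
# and returns its text reply; the real scoring criteria may differ.
def score_task(task: dict, agent_output: str, judge) -> float:
    """Ask the evaluator model to grade the agent's output in [0, 1]."""
    prompt = (
        "You are grading an enterprise agent.\n"
        f"Task: {task['instruction']}\n"
        f"Reference answer: {task['gold_answer']}\n"
        f"Agent output: {agent_output}\n"
        "Reply with a single number between 0 and 1."
    )
    reply = judge(prompt)
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0   # unparsable judgments count as failures
```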

Supported Domains

EnterpriseBench covers comprehensive business domains with authentic data and realistic task scenarios:

  • 🏢 Human Resources: employee management, recruitment, and policy administration (task types: Search, CRUD, Communication)
  • 💻 IT Service Management: helpdesk operations, incident management, and system administration (task types: Search, CRUD, Troubleshooting)
  • 🤝 Customer Relations: customer support, sales processes, and relationship management (task types: Search, CRUD, Analysis)
  • ⚙️ Software Engineering: code management, issue tracking, and development collaboration (task types: Search, CRUD, Code Review)
  • 📊 Business Operations: project management, partnerships, and strategic planning (task types: Search, CRUD, Analysis)
  • 📧 Enterprise Communications: email systems, collaboration tools, and social platforms (task types: Search, CRUD, Communication)

Evaluation Methods

Search-Based Evaluation

Search tasks evaluate an agent’s ability to find, analyze, and synthesize information across enterprise systems:

  • Information Retrieval: Locate specific data points across multiple systems
  • Conversation Analysis: Extract insights from communication threads
  • Database Queries: Navigate complex data relationships
  • Cross-Domain Search: Find information spanning multiple departments
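
As a toy example of the cross-domain lookups these tasks require, the sketch below joins two small in-memory data sources; the data and function are illustrative, not part of the benchmark.

```python
# Toy cross-domain search: plain dicts stand in for the sandbox's employee
# directory and ticketing system (illustrative data, not benchmark data).
employees = [
    {"id": "E001", "name": "Alice", "department": "Engineering"},
    {"id": "E002", "name": "Bob", "department": "Engineering"},
]
tickets = [
    {"id": "T100", "reporter": "E002", "status": "open"},
    {"id": "T101", "reporter": "E001", "status": "closed"},
]

def open_tickets_for_department(department: str) -> list[str]:
    """Join the two sources: filter employees by department, then tickets."""
    dept_ids = {e["id"] for e in employees if e["department"] == department}
    return [t["id"] for t in tickets
            if t["status"] == "open" and t["reporter"] in dept_ids]

print(open_tickets_for_department("Engineering"))  # ['T100']
```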

CRUD-Based Evaluation

CRUD tasks assess an agent’s capability to perform standard business operations:

  • Create: Generate new records, documents, or communications
  • Read: Access and interpret existing business data
  • Update: Modify records while maintaining data integrity
  • Delete: Remove outdated or incorrect information safely
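
The sketch below shows the four operations against a toy in-memory store; it assumes nothing about the benchmark's own data layer.

```python
# Hedged sketch of Create/Read/Update/Delete against a toy in-memory store;
# the benchmark runs the same operations against richer enterprise data.
records: dict[str, dict] = {}

def create(record_id: str, data: dict) -> None:
    if record_id in records:
        raise ValueError(f"{record_id} already exists")
    records[record_id] = dict(data)

def read(record_id: str) -> dict:
    return records[record_id]

def update(record_id: str, changes: dict) -> None:
    # Preserve fields that are not being changed (data integrity).
    records[record_id] = {**records[record_id], **changes}

def delete(record_id: str) -> None:
    records.pop(record_id, None)   # deleting a missing record is a no-op

create("emp-007", {"name": "Carol", "role": "Analyst"})
update("emp-007", {"role": "Senior Analyst"})
print(read("emp-007"))             # {'name': 'Carol', 'role': 'Senior Analyst'}
delete("emp-007")
```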

Performance Results

Table 3: EnterpriseBench evaluation. Comparison of agent performance across models and planning strategies with the LangChain and DSPy frameworks. Columns prefixed "GPT-4:" are scored by the GPT-4 evaluator; columns prefixed "Gemini:" by the Gemini evaluator.

LangChain Framework

| Model | GPT-4: w/o Planning | GPT-4: CoT | GPT-4: ReAct | GPT-4: w/ Gold Planning | Gemini: w/o Planning | Gemini: CoT | Gemini: ReAct | Gemini: w/ Gold Planning |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.29 | 0.27 | 0.32 | 0.43 | 0.27 | 0.28 | 0.29 | 0.44 |
| Claude-3.5-Sonnet | 0.31 | 0.27 | 0.28 | 0.38 | 0.32 | 0.30 | 0.30 | 0.41 |
| o1-mini | 0.31 | 0.28 | 0.35 | 0.51 | 0.28 | 0.27 | 0.32 | 0.47 |
| Llama-3.1-8B | 0.04 | 0.06 | 0.14 | 0.20 | 0.03 | 0.04 | 0.09 | 0.21 |
| Llama-3.3-70B | 0.23 | 0.22 | 0.21 | 0.40 | 0.24 | 0.23 | 0.23 | 0.36 |

DSPy Framework

| Model | GPT-4: w/o Planning | GPT-4: CoT | GPT-4: ReAct | GPT-4: w/ Gold Planning | Gemini: w/o Planning | Gemini: CoT | Gemini: ReAct | Gemini: w/ Gold Planning |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.19 | 0.32 | 0.34 | 0.50 | 0.25 | 0.26 | 0.27 | 0.47 |
| Claude-3.5-Sonnet | 0.19 | 0.24 | 0.30 | 0.50 | 0.21 | 0.29 | 0.26 | 0.44 |
| o1-mini | 0.29 | 0.33 | 0.38 | 0.62 | 0.27 | 0.32 | 0.41 | 0.63 |
| Llama-3.1-8B | 0.10 | 0.15 | 0.15 | 0.34 | 0.07 | 0.14 | 0.16 | 0.34 |
| Llama-3.3-70B | 0.20 | 0.27 | 0.30 | 0.47 | 0.24 | 0.25 | 0.28 | 0.48 |

Interactive Demos

EnterpriseBench provides interactive Streamlit applications for hands-on agent evaluation across different enterprise scenarios:

🎲 Task Generation

Experience automated task creation across different enterprise domains with configurable complexity and real-time generation.

  • Department Selection: Choose from 6 major business domains
  • Complexity Control: Adjust task difficulty and scope
  • Real-time Generation: Create tasks dynamically based on parameters
  • JSON Export: Download generated tasks for evaluation
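
A minimal sketch of what parameterized task generation could look like is shown below; the function name and template strings are hypothetical and do not mirror the demo's internals.

```python
# Hypothetical task generator: department and complexity knobs drive a simple
# template fill. Names and templates are illustrative, not the demo's code.
import json
import random

def generate_task(department: str, complexity: str, seed: int | None = None) -> dict:
    """Return a toy task spec for the chosen department and difficulty."""
    rng = random.Random(seed)
    templates = {
        "easy": "Look up the on-call contact for the {dept} team.",
        "hard": "Summarize all open {dept} requests from the last 30 days and "
                "draft a status email to the department head.",
    }
    return {
        "task_id": f"{department.lower()}-{rng.randint(1000, 9999)}",
        "domain": department,
        "complexity": complexity,
        "instruction": templates[complexity].format(dept=department),
    }

# JSON export, mirroring the demo's download option.
print(json.dumps(generate_task("HR", "easy", seed=7), indent=2))
```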

πŸ” Search Evaluation

Test agent capabilities on information retrieval and analysis tasks across interconnected business data sources.

  • Multi-Domain Queries: Search across HR, IT, Sales, and Engineering data
  • Complex Relationships: Navigate interconnected business data
  • Real-time Results: See agent performance in real-time
  • Performance Analytics: Detailed metrics and failure analysis

📧 Email Communication

Watch agents handle enterprise communication tasks, including drafting and sending emails for business operations.

  • Email Drafting: Automated composition of professional emails
  • Context Awareness: Understanding business context and requirements
  • Multi-Step Process: Complete workflow from analysis to action
  • Business Integration: Seamless integration with enterprise systems
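
To illustrate the analyze-draft-send flow, here is a minimal sketch with the send step stubbed out; the function names and message format are assumptions for this example only.

```python
# Sketch of the multi-step email workflow (context -> draft -> send). The send
# step only prints; in the sandbox it would target a simulated mail system.
def draft_email(recipient: str, context: str) -> str:
    """Compose a short professional email body from the given context."""
    return (
        f"Hi {recipient},\n\n"
        f"{context}\n\n"
        "Please let me know if you need any further details.\n\n"
        "Best regards,\nEnterprise Agent"
    )

def send_email(recipient: str, subject: str, body: str) -> None:
    print(f"To: {recipient}\nSubject: {subject}\n\n{body}")

body = draft_email(
    recipient="Alice",
    context="Two Engineering tickets remain open from last quarter (T100, T117).",
)
send_email("Alice", "Quarterly ticket summary", body)
```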

How to Cite

If you use EnterpriseBench in your research, please cite our work:

@inproceedings{vishwakarma-etal-2025-llms,
    title = "Can {LLM}s Help You at Work? A Sandbox for Evaluating {LLM} Agents in Enterprise Environments",
    author = "Vishwakarma, Harsh  and
      Agarwal, Ankush  and
      Patil, Ojas  and
      Devaguptapu, Chaitanya  and
      Chandran, Mahesh",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.466/",
    pages = "9178--9212",
    ISBN = "979-8-89176-332-6",
}

🚀 Ready to Evaluate Your LLM Agents?

EnterpriseBench provides the most comprehensive framework for testing LLM agents in realistic enterprise environments. Start evaluating today!