Harsh Vishwakarma1,*, Ankush Agarwal1,*, Ojas Patil1, Chaitanya Devaguptapu1, Mahesh Chandran1
1Fujitsu Research India
*Equal contribution

Abstract

Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM-based systems into them enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls.

Our benchmark, EnterpriseBench, provides an enterprise simulation environment along with 500 realistic tasks for comprehensive agent assessment. Through extensive evaluation across multiple domains, EnterpriseBench reveals significant gaps between current LLM agent capabilities and enterprise requirements, establishing a new benchmark for real-world AI deployment readiness.

🎯 Key Contributions:

  • Realistic Enterprise Simulation: Comprehensive sandbox with authentic business data across 10+ domains
  • Diverse Tasks across Domains: Search-based and CRUD-based task assessment across different domains
  • Automated Task Generation: Dynamic creation of enterprise tasks with configurable complexity

Introduction

The deployment of LLM agents in enterprise environments presents unique challenges that current benchmarks fail to address. While existing evaluation frameworks focus on isolated capabilities like question-answering or code generation, real enterprise scenarios require agents to navigate complex, interconnected business systems with authentic data relationships and domain-specific constraints.

Why Enterprise-Specific Evaluation Matters

Enterprise environments are characterized by:

  • Multi-domain Integration: Tasks often span HR, IT, Sales, and Engineering departments
  • Complex Data Relationships: Information is interconnected across multiple business systems
  • Domain-Specific Constraints: Each department has unique workflows, terminology, and requirements
  • Realistic Scale: Enterprise data volumes and complexity far exceed academic benchmarks

EnterpriseBench addresses these gaps by providing the first comprehensive framework specifically designed for enterprise LLM agent evaluation.

Figure 1: EnterpriseBench agent workflow, showing the complete task-execution process from user query through planning, execution, and task completion within the enterprise environment.

EnterpriseBench Framework

Architecture Overview

EnterpriseBench consists of three core components working together to provide comprehensive enterprise agent evaluation:

1. Enterprise Sandbox Environment

  • Realistic Data: Synthetic but authentic business data across 10+ domains
  • Interconnected Systems: Data relationships mirror real enterprise architectures
  • Scalable Infrastructure: Supports various task types and complexity levels
  • Privacy-Compliant: Synthetic data ensures privacy while maintaining realism
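
To make the data-relationship idea concrete, below is a minimal sketch of how interconnected records might look, using hypothetical Python dataclasses; the actual EnterpriseBench schema is richer and may differ.

```python
# Hypothetical sketch of interconnected enterprise records (illustrative only;
# not the actual EnterpriseBench data model).
from dataclasses import dataclass, field


@dataclass
class Employee:
    employee_id: str
    name: str
    department: str                 # e.g. "HR", "IT", "Sales", "Engineering"
    manager_id: str | None = None   # links to another Employee


@dataclass
class SupportTicket:
    ticket_id: str
    reporter_id: str                # links back to Employee.employee_id
    status: str = "open"
    comments: list[str] = field(default_factory=list)


# Relationships span systems: an IT ticket references an HR employee record,
# so answering "which open tickets belong to Alice's reports?" requires
# traversing both sources.
alice = Employee("E001", "Alice", "Engineering")
bob = Employee("E002", "Bob", "Engineering", manager_id="E001")
ticket = SupportTicket("T100", reporter_id="E002")
```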

2. Comprehensive Task Suite

  • Search Tasks: Information retrieval, conversation analysis, and database queries
  • CRUD Tasks: Create, Read, Update, Delete operations on enterprise data
  • Multi-Domain Coverage: Tasks spanning HR, IT, Sales, and Engineering departments
  • Realistic Complexity: Enterprise-grade scenarios with authentic data relationships
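
For illustration, a single task entry might look roughly like the following; the field names are assumptions for this sketch, not the benchmark's released format.

```python
# Hypothetical shape of one EnterpriseBench-style task entry (field names are
# illustrative assumptions, not the released format).
search_task = {
    "task_id": "it-search-042",
    "domain": "IT Service Management",
    "type": "search",                    # "search" or "crud"
    "instruction": "Find all unresolved tickets opened by Engineering "
                   "employees during the last quarter.",
    "data_sources": ["ticketing_system", "employee_directory"],
    "gold_answer": ["T100", "T117"],     # reference used by the evaluator
}
```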

3. Advanced Evaluation System

  • Automated Assessment: AI-powered evaluation with multiple scoring criteria
  • Performance Metrics: Comprehensive evaluation framework for agent capabilities
  • Interactive Interfaces: Streamlit-powered demos for real-time testing
  • Comparative Analysis: Benchmarking across different LLM architectures
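
As a rough illustration of automated assessment, the sketch below scores an agent's output with an LLM judge; the `judge` callable and the grading prompt are assumptions, not the benchmark's actual evaluation prompt or criteria.

```python
# Minimal LLM-as-judge sketch. `judge` is an assumed callable that sends a
# prompt to whichever evaluator model is configured (e.g. GPT-4 or Gemini)
# and returns its text reply; the real scoring criteria may differ.
def score_task(task: dict, agent_output: str, judge) -> float:
    """Ask the evaluator model to grade the agent's output in [0, 1]."""
    prompt = (
        "You are grading an enterprise agent.\n"
        f"Task: {task['instruction']}\n"
        f"Reference answer: {task['gold_answer']}\n"
        f"Agent output: {agent_output}\n"
        "Reply with a single number between 0 and 1."
    )
    reply = judge(prompt)
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0   # unparsable judgments count as failures
```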

Supported Domains

EnterpriseBench covers comprehensive business domains with authentic data and realistic task scenarios:

  • 🏢 Human Resources: employee management, recruitment, and policy administration (task types: Search, CRUD, Communication)
  • 💻 IT Service Management: helpdesk operations, incident management, and system administration (task types: Search, CRUD, Troubleshooting)
  • 🤝 Customer Relations: customer support, sales processes, and relationship management (task types: Search, CRUD, Analysis)
  • ⚙️ Software Engineering: code management, issue tracking, and development collaboration (task types: Search, CRUD, Code Review)
  • 📊 Business Operations: project management, partnerships, and strategic planning (task types: Search, CRUD, Analysis)
  • 📧 Enterprise Communications: email systems, collaboration tools, and social platforms (task types: Search, CRUD, Communication)

Evaluation Methods

Search-Based Evaluation

Search tasks evaluate an agent’s ability to find, analyze, and synthesize information across enterprise systems:

  • Information Retrieval: Locate specific data points across multiple systems
  • Conversation Analysis: Extract insights from communication threads
  • Database Queries: Navigate complex data relationships
  • Cross-Domain Search: Find information spanning multiple departments
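
As a toy example of the cross-domain lookups these tasks require, the sketch below joins two small in-memory data sources; the data and function are illustrative, not part of the benchmark.

```python
# Toy cross-domain search: plain dicts stand in for the sandbox's employee
# directory and ticketing system (illustrative data, not benchmark data).
employees = [
    {"id": "E001", "name": "Alice", "department": "Engineering"},
    {"id": "E002", "name": "Bob", "department": "Engineering"},
]
tickets = [
    {"id": "T100", "reporter": "E002", "status": "open"},
    {"id": "T101", "reporter": "E001", "status": "closed"},
]

def open_tickets_for_department(department: str) -> list[str]:
    """Join the two sources: filter employees by department, then tickets."""
    dept_ids = {e["id"] for e in employees if e["department"] == department}
    return [t["id"] for t in tickets
            if t["status"] == "open" and t["reporter"] in dept_ids]

print(open_tickets_for_department("Engineering"))  # ['T100']
```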

CRUD-Based Evaluation

CRUD tasks assess an agent’s capability to perform standard business operations:

  • Create: Generate new records, documents, or communications
  • Read: Access and interpret existing business data
  • Update: Modify records while maintaining data integrity
  • Delete: Remove outdated or incorrect information safely
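
The sketch below shows the four operations against a toy in-memory store; it assumes nothing about the benchmark's own data layer.

```python
# Hedged sketch of Create/Read/Update/Delete against a toy in-memory store;
# the benchmark runs the same operations against richer enterprise data.
records: dict[str, dict] = {}

def create(record_id: str, data: dict) -> None:
    if record_id in records:
        raise ValueError(f"{record_id} already exists")
    records[record_id] = dict(data)

def read(record_id: str) -> dict:
    return records[record_id]

def update(record_id: str, changes: dict) -> None:
    # Preserve fields that are not being changed (data integrity).
    records[record_id] = {**records[record_id], **changes}

def delete(record_id: str) -> None:
    records.pop(record_id, None)   # deleting a missing record is a no-op

create("emp-007", {"name": "Carol", "role": "Analyst"})
update("emp-007", {"role": "Senior Analyst"})
print(read("emp-007"))             # {'name': 'Carol', 'role': 'Senior Analyst'}
delete("emp-007")
```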

Performance Results

Table 3: EnterpriseBench evaluation. Comparison of agent performance across models and planning strategies with the LangChain and DSPy frameworks. Columns prefixed "GPT-4:" are scored by the GPT-4 evaluator; columns prefixed "Gemini:" by the Gemini evaluator.

LangChain Framework

| Model | GPT-4: w/o Planning | GPT-4: CoT | GPT-4: ReAct | GPT-4: w/ Gold Planning | Gemini: w/o Planning | Gemini: CoT | Gemini: ReAct | Gemini: w/ Gold Planning |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.29 | 0.27 | 0.32 | 0.43 | 0.27 | 0.28 | 0.29 | 0.44 |
| Claude-3.5-Sonnet | 0.31 | 0.27 | 0.28 | 0.38 | 0.32 | 0.30 | 0.30 | 0.41 |
| o1-mini | 0.31 | 0.28 | 0.35 | 0.51 | 0.28 | 0.27 | 0.32 | 0.47 |
| Llama-3.1-8B | 0.04 | 0.06 | 0.14 | 0.20 | 0.03 | 0.04 | 0.09 | 0.21 |
| Llama-3.3-70B | 0.23 | 0.22 | 0.21 | 0.40 | 0.24 | 0.23 | 0.23 | 0.36 |

DSPy Framework

| Model | GPT-4: w/o Planning | GPT-4: CoT | GPT-4: ReAct | GPT-4: w/ Gold Planning | Gemini: w/o Planning | Gemini: CoT | Gemini: ReAct | Gemini: w/ Gold Planning |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.19 | 0.32 | 0.34 | 0.50 | 0.25 | 0.26 | 0.27 | 0.47 |
| Claude-3.5-Sonnet | 0.19 | 0.24 | 0.30 | 0.50 | 0.21 | 0.29 | 0.26 | 0.44 |
| o1-mini | 0.29 | 0.33 | 0.38 | 0.62 | 0.27 | 0.32 | 0.41 | 0.63 |
| Llama-3.1-8B | 0.10 | 0.15 | 0.15 | 0.34 | 0.07 | 0.14 | 0.16 | 0.34 |
| Llama-3.3-70B | 0.20 | 0.27 | 0.30 | 0.47 | 0.24 | 0.25 | 0.28 | 0.48 |

Interactive Demos

EnterpriseBench provides interactive Streamlit applications for hands-on agent evaluation across different enterprise scenarios:

🎲 Task Generation

Experience automated task creation across different enterprise domains with configurable complexity and real-time generation.

  • Department Selection: Choose from 6 major business domains
  • Complexity Control: Adjust task difficulty and scope
  • Real-time Generation: Create tasks dynamically based on parameters
  • JSON Export: Download generated tasks for evaluation
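
A minimal sketch of what parameterized task generation could look like is shown below; the function name and template strings are hypothetical and do not mirror the demo's internals.

```python
# Hypothetical task generator: department and complexity knobs drive a simple
# template fill. Names and templates are illustrative, not the demo's code.
import json
import random

def generate_task(department: str, complexity: str, seed: int | None = None) -> dict:
    """Return a toy task spec for the chosen department and difficulty."""
    rng = random.Random(seed)
    templates = {
        "easy": "Look up the on-call contact for the {dept} team.",
        "hard": "Summarize all open {dept} requests from the last 30 days and "
                "draft a status email to the department head.",
    }
    return {
        "task_id": f"{department.lower()}-{rng.randint(1000, 9999)}",
        "domain": department,
        "complexity": complexity,
        "instruction": templates[complexity].format(dept=department),
    }

# JSON export, mirroring the demo's download option.
print(json.dumps(generate_task("HR", "easy", seed=7), indent=2))
```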

πŸ” Search Evaluation

Test agent capabilities on information retrieval and analysis tasks across interconnected business data sources.

  • Multi-Domain Queries: Search across HR, IT, Sales, and Engineering data
  • Complex Relationships: Navigate interconnected business data
  • Real-time Results: See agent performance in real-time
  • Performance Analytics: Detailed metrics and failure analysis

📧 Email Communication

Watch agents handle enterprise communication tasks, including drafting and sending emails for business operations.

  • Email Drafting: Automated composition of professional emails
  • Context Awareness: Understanding business context and requirements
  • Multi-Step Process: Complete workflow from analysis to action
  • Business Integration: Seamless integration with enterprise systems
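
To illustrate the analyze-draft-send flow, here is a minimal sketch with the send step stubbed out; the function names and message format are assumptions for this example only.

```python
# Sketch of the multi-step email workflow (context -> draft -> send). The send
# step only prints; in the sandbox it would target a simulated mail system.
def draft_email(recipient: str, context: str) -> str:
    """Compose a short professional email body from the given context."""
    return (
        f"Hi {recipient},\n\n"
        f"{context}\n\n"
        "Please let me know if you need any further details.\n\n"
        "Best regards,\nEnterprise Agent"
    )

def send_email(recipient: str, subject: str, body: str) -> None:
    print(f"To: {recipient}\nSubject: {subject}\n\n{body}")

body = draft_email(
    recipient="Alice",
    context="Two Engineering tickets remain open from last quarter (T100, T117).",
)
send_email("Alice", "Quarterly ticket summary", body)
```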

How to Cite

If you use EnterpriseBench in your research, please cite our work:

@inproceedings{vishwakarma-etal-2025-llms,
    title = "Can {LLM}s Help You at Work? A Sandbox for Evaluating {LLM} Agents in Enterprise Environments",
    author = "Vishwakarma, Harsh  and
      Agarwal, Ankush  and
      Patil, Ojas  and
      Devaguptapu, Chaitanya  and
      Chandran, Mahesh",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.466/",
    pages = "9178--9212",
    ISBN = "979-8-89176-332-6",
}

🚀 Ready to Evaluate Your LLM Agents?

EnterpriseBench provides the most comprehensive framework for testing LLM agents in realistic enterprise environments. Start evaluating today!