EnterpriseLab

EnterpriseLab: A Full-Stack Platform for Developing and Deploying Agents in Enterprises

Unified infrastructure for training privacy-preserving, cost-effective enterprise AI agents

Enterprise AI | Agent Training | Tool Integration | Cost Efficiency

Ankush Agarwal1,*, Harsh Vishwakarma1,*, Suraj Nagaje1,*, Chaitanya Devaguptapu1
1Fujitsu Research India
*Equal contribution

Paper | GitHub | Dataset | Leaderboard

Abstract

Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints. While frontier models like GPT-4o demonstrate strong reasoning abilities, their high inference costs ($3-$15 per million tokens) and data privacy concerns hinder enterprise adoption.

We introduce EnterpriseLab, a full-stack platform that unifies tool integration, data generation, and model training into a closed-loop framework. The platform enables enterprises to train small 8B-parameter models that match GPT-4o's performance while reducing inference costs by 8-10×.

Key Contributions

Key Results at a Glance

8-10× Cost Reduction vs. GPT-4o
140+ Enterprise Tools
500 Expert-Curated Tasks
+10% Improvement on Benchmarks

Interactive Demos

Watch EnterpriseLab agents in action performing complex enterprise workflows:

Task Demonstration Videos

Task Demo 1

Multi-step enterprise workflow demonstration showing agent orchestration across HR, document management, and communication systems.

Task Demo 2

Complex cross-functional workflow demonstrating integration of version control, project management, and notification systems.

These demonstrations showcase EnterpriseLab's ability to handle realistic enterprise scenarios involving multiple tools and stateful decision-making.


Introduction

The Enterprise AI Challenge

Enterprise environments require intelligent automation across complex, cross-departmental workflows spanning HR, IT, sales, and engineering. While frontier language models demonstrate strong capabilities, their deployment faces critical constraints:

The Infrastructure Gap

Small Language Models (SLMs) in the 8B-32B parameter range offer a promising alternative through on-premises deployment and 10× cost reduction. However, effective specialization is hindered by fragmented development pipelines:

Current Challenges:

The EnterpriseLab Solution

EnterpriseLab addresses these challenges by providing a unified platform that integrates:

  1. Modular Tool Environment: MCP-based architecture for plug-and-play tool integration
  2. Automated Trajectory Synthesis: Programmatic training data generation from environment schemas
  3. Integrated Training Pipeline: SFT, DPO, and online RL with continuous evaluation

The EnterpriseLab Platform

EnterpriseLab Architecture

Figure 1: EnterpriseLab's three-module architecture for developing enterprise agents

The platform integrates tool environments, data synthesis, and training infrastructure in a closed-loop system

1. Modular Tool Environment Architecture

The environment layer implements a client-server system built on Model Context Protocol (MCP), featuring:

Dynamic Tool Registry

Stateful Execution Containers

Observation Normalizer
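The page names these components, but their details are not given here. As a rough, hypothetical sketch of how a registry and an observation normalizer could fit together: `ToolRegistry`, `normalize_observation`, the dict-shaped observation, and the `add_numbers` tool are all illustrative assumptions, not EnterpriseLab's API. The real environment is an MCP client-server system rather than in-process Python calls, and stateful execution containers are not modeled.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


def normalize_observation(result: Any = None, error: Optional[str] = None) -> Dict[str, Any]:
    """Map heterogeneous tool outputs onto one fixed shape for the agent."""
    return {"ok": error is None, "result": result, "error": error}


@dataclass
class ToolRegistry:
    """Hypothetical registry; tools can be added or removed at runtime."""
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def unregister(self, name: str) -> None:
        self.tools.pop(name, None)

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        """Invoke a tool and always return a normalized observation."""
        if name not in self.tools:
            return normalize_observation(error=f"unknown tool: {name}")
        try:
            return normalize_observation(result=self.tools[name](**kwargs))
        except Exception as exc:  # tool failures become observations, not crashes
            return normalize_observation(error=str(exc))


registry = ToolRegistry()
registry.register("add_numbers", lambda a, b: a + b)  # hypothetical tool
ok_obs = registry.call("add_numbers", a=2, b=3)
missing_obs = registry.call("missing_tool")
bad_args_obs = registry.call("add_numbers", a=2)  # missing argument
```

Returning errors as observations, rather than raising, lets an agent see a failed call and attempt recovery within the same trajectory.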

2. Task Synthesis Pipeline

Automated generation of high-quality, executable training data through four phases:

Phase 1: Tool Graph Construction

Build a dependency graph whose edges represent data-flow compatibility between tools. The graph ensures that any path through it corresponds to an executable tool sequence.
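One way to picture this phase, as a minimal sketch: treat each tool as a node with typed inputs and outputs, and add an edge from tool A to tool B whenever something A produces can feed B. The tool names and type labels below are hypothetical, and the compatibility rule (a non-empty intersection of type labels) is an assumption; the page does not specify how compatibility is decided.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical tool schemas: name -> (input types, output types)
ToolSchemas = Dict[str, Tuple[Set[str], Set[str]]]

TOOLS: ToolSchemas = {
    "search_employee": ({"employee_name"}, {"employee_id"}),
    "get_leave_balance": ({"employee_id"}, {"leave_days"}),
    "send_email": ({"employee_id", "message"}, {"email_status"}),
}


def build_tool_graph(tools: ToolSchemas) -> Dict[str, List[str]]:
    """Add an edge a -> b when some output of a matches an input of b."""
    graph: Dict[str, List[str]] = {name: [] for name in tools}
    for a, (_, outputs) in tools.items():
        for b, (inputs, _) in tools.items():
            if a != b and outputs & inputs:
                graph[a].append(b)
    return graph


def enumerate_paths(graph: Dict[str, List[str]], start: str, max_len: int) -> List[List[str]]:
    """List all simple paths from start containing at most max_len tools."""
    paths: List[List[str]] = []

    def walk(path: List[str]) -> None:
        paths.append(list(path))
        if len(path) == max_len:
            return
        for nxt in graph[path[-1]]:
            if nxt not in path:  # keep paths simple (no repeated tool)
                walk(path + [nxt])

    walk([start])
    return paths


graph = build_tool_graph(TOOLS)
paths = enumerate_paths(graph, "search_employee", max_len=2)
```

Note that this loose rule links `search_employee` to `send_email` even though `message` is not produced upstream. A stricter variant would require every input to be covered by earlier outputs or by the task prompt before an edge counts as executable.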

Phase 2: Constraint-Aware Trajectory Sampling

Phase 3: Hierarchical Task Synthesis

Phase 4: Validation and Filtering

3. Integrated Training Infrastructure

Agent Scaffolding

Support for multiple execution strategies:

Offline Training Methods

Agentic GRPO: Online Reinforcement Learning

Group Relative Policy Optimization adapted for agentic settings:

Trajectory Reward Design

Composite reward combining four execution-grounded signals:

Overall reward: r(τ) = Σₖ wₖ rₖ(τ), normalized to [0, 1]
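As a worked illustration of the weighted-sum reward and of the group-relative idea behind GRPO, the sketch below combines per-signal rewards with weights, then standardizes the rewards within a group of trajectories sampled for the same task. The signal names and weights are placeholders (the four signals are not listed here), dividing by the total weight is one assumed way to keep the result in [0, 1], and mean/standard-deviation standardization is the common GRPO formulation rather than a confirmed detail of Agentic GRPO.

```python
from statistics import mean, pstdev
from typing import Dict, List


def composite_reward(signals: Dict[str, float], weights: Dict[str, float]) -> float:
    """r(tau) = sum_k w_k * r_k(tau), with weights rescaled to sum to 1.

    If every r_k lies in [0, 1], the result also lies in [0, 1].
    """
    total_weight = sum(weights.values())
    return sum(weights[k] * signals[k] for k in weights) / total_weight


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of trajectories for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder signal names and weights, for illustration only.
WEIGHTS = {"signal_a": 0.4, "signal_b": 0.3, "signal_c": 0.2, "signal_d": 0.1}

group = [
    {"signal_a": 1.0, "signal_b": 1.0, "signal_c": 1.0, "signal_d": 1.0},
    {"signal_a": 1.0, "signal_b": 0.0, "signal_c": 0.5, "signal_d": 0.0},
    {"signal_a": 0.0, "signal_b": 0.0, "signal_c": 0.0, "signal_d": 0.0},
]
rewards = [composite_reward(s, WEIGHTS) for s in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, a trajectory is reinforced only when it beats its siblings on the same task, which removes the need for a separate learned value function.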


EnterpriseArena: Benchmark Instantiation

EnterpriseArena demonstrates EnterpriseLab's capabilities through a comprehensive benchmark environment with 15 specialized MCP servers and 500 expert-curated tasks.

MCP Server Ecosystem

| Category | MCP Servers | Tools | Purpose |
|---|---|---|---|
| Communication | RocketChat, Mail System | 20 | Messaging and email |
| Development | GitLab MCP | 22 | Version control and CI/CD |
| Operations & IT | Zammad, Plane (Jira) | 24 | Ticketing and project management |
| Human Resources | Frappe HR, Calendar | 20 | Employee management |
| Data & Storage | Mongoose MCP, OwnCloud | 15 | Database and file operations |
| Business (CRM) | Dolibarr, Salesforce | 19 | Customer relationship management |
| Finance | Invoice System | 7 | Invoicing and payments |
| Utilities | File System, Bash, Browser | 18 | System operations |
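As a quick consistency check, the per-category figures above can be totaled and compared with the headline numbers (15 MCP servers, 140+ tools). The dictionary below simply transcribes the categories listed above; nothing else is assumed.

```python
# Category -> (servers named above, tool count stated above)
ECOSYSTEM = {
    "Communication": (["RocketChat", "Mail System"], 20),
    "Development": (["GitLab MCP"], 22),
    "Operations & IT": (["Zammad", "Plane"], 24),
    "Human Resources": (["Frappe HR", "Calendar"], 20),
    "Data & Storage": (["Mongoose MCP", "OwnCloud"], 15),
    "Business (CRM)": (["Dolibarr", "Salesforce"], 19),
    "Finance": (["Invoice System"], 7),
    "Utilities": (["File System", "Bash", "Browser"], 18),
}

total_servers = sum(len(servers) for servers, _ in ECOSYSTEM.values())
total_tools = sum(count for _, count in ECOSYSTEM.values())
print(f"{total_servers} servers, {total_tools} tools")
```

The totals come to 15 servers and 145 tools, in line with the "15 specialized MCP servers" and "140+ Enterprise Tools" figures quoted elsewhere on the page.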

Task Complexity and Categories

The 500 expert-curated tasks span five workflow categories with realistic cross-departmental orchestration:

| Task Category | Description | % of Tasks |
|---|---|---|
| CRUD Operations | Create, Read, Update, Delete tasks across systems | 35% |
| Search & Orchestration | Multi-system information retrieval and coordination | 28% |
| Multi-entity Workflow | Complex tasks involving multiple data entities | 18% |
| Version Control | Code management and development operations | 12% |
| Cross-functional Integration | Tasks spanning multiple departments | 7% |

Example Complex Task

Cross-Functional Recruitment Workflow

Task: "Read the 2026 Software Engineer job description, fetch relevant resumes, identify the top three candidates based on required skills, and coordinate interview scheduling with engineering managers via email."

Required Orchestration:

Complexity: 6-8 tool invocations across 3 systems with stateful reasoning

Stateful Environment Dependencies

Unlike static benchmarks, EnterpriseArena maintains a unified backend where data changes propagate automatically:

Expert Validation

Tasks were developed through structured reviews with 9 domain experts across Software Engineering, Business Development, Sales, IT Security, HR, and Finance. All tasks were rated "Realistic" or above on a five-point Likert scale.


Results and Analysis

Performance Across Benchmarks

We evaluate Qwen3-8B models trained with EnterpriseLab across four environments: EnterpriseArena (EA, ours), EnterpriseBench (EB), CRMArena (CRM), and τ-Bench (τ-B).

| Model | EA | EB | CRM | τ-B |
|---|---|---|---|---|
| Closed-Source Models | | | | |
| GPT-4o (2-shot) | 0.45 | 0.47 | 0.32 | 0.54 |
| Claude-3.5-Sonnet (2-shot) | 0.60 | 0.55 | 0.34 | 0.56 |
| Gemini-2.5-Pro (2-shot) | 0.71 | 0.55 | 0.49 | 0.59 |
| Open-Source Models | | | | |
| Qwen3-8B Base (2-shot) | 0.31 | 0.35 | 0.25 | 0.33 |
| ToolACE (26K-trained) | 0.39 | 0.41 | 0.10 | 0.15 |
| xLAM-2-70B (60K-trained) | 0.15 | 0.40 | 0.12 | 0.17 |
| Our Platform-Trained Models (<1K examples) | | | | |
| Qwen3-8B SFT | 0.35 | 0.38 | 0.30 | 0.36 |
| Qwen3-8B Agentic GRPO | 0.43 | 0.51 | 0.35 | 0.42 |

Key Performance Insights

30% Improvement over Base Model
≈ GPT-4o Performance Parity
+10% Over GPT-4o on EnterpriseBench
26-60× Less Training Data vs. Baselines
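These headline figures can be related back to the results table with simple arithmetic. The sketch below computes, per benchmark, the relative gain of the Agentic GRPO model over the base model and its absolute gap to GPT-4o, using scores transcribed from the table. How the headline percentages were aggregated is not stated, so this is a way to explore the table rather than a reproduction of those figures.

```python
BENCHMARKS = ["EA", "EB", "CRM", "tau-B"]

# Scores transcribed from the results table.
SCORES = {
    "GPT-4o (2-shot)": [0.45, 0.47, 0.32, 0.54],
    "Qwen3-8B Base (2-shot)": [0.31, 0.35, 0.25, 0.33],
    "Qwen3-8B Agentic GRPO": [0.43, 0.51, 0.35, 0.42],
}


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of new over old, e.g. 0.25 means +25%."""
    return (new - old) / old


grpo = SCORES["Qwen3-8B Agentic GRPO"]
base = SCORES["Qwen3-8B Base (2-shot)"]
gpt4o = SCORES["GPT-4o (2-shot)"]

gain_vs_base = {b: relative_gain(g, o) for b, g, o in zip(BENCHMARKS, grpo, base)}
gap_vs_gpt4o = {b: g - o for b, g, o in zip(BENCHMARKS, grpo, gpt4o)}

for b in BENCHMARKS:
    print(f"{b}: {gain_vs_base[b]:+.1%} vs. base, {gap_vs_gpt4o[b]:+.2f} vs. GPT-4o")
```

On these scores the trained model is ahead of GPT-4o on EnterpriseBench and CRMArena and behind it on EnterpriseArena and τ-Bench.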

Tool Selection Accuracy

For benchmarks with tool-level annotations (EnterpriseArena and EnterpriseBench):

| Model | EA | EB |
|---|---|---|
| GPT-4o (2-shot) | 0.31 | 0.21 |
| Qwen3-8B Base (2-shot) | 0.14 | 0.14 |
| Qwen3-8B Agentic GRPO | 0.28 | 0.21 |

Cost Efficiency Analysis

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| GPT-4o | $5.00 | $15.00 |
| Claude-3.5-Sonnet | $3.00 | $15.00 |
| Gemini-2.5-Pro | $1.25 | $10.00 |
| Qwen3-8B Agentic GRPO (Self-hosted) | $0.50–$1.00 (combined) | |

Result: An 8-10× cost reduction with competitive performance makes EnterpriseLab-trained models well suited to cost-sensitive, large-scale deployments.
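How a single cost ratio falls out of these prices depends on the mix of input and output tokens, which the page does not state. As a sketch, assuming a hypothetical workload of 75% input and 25% output tokens (`INPUT_SHARE` is an illustrative assumption, not a figure from the paper), a blended API price can be compared against the self-hosted range:

```python
# (input, output) prices in USD per 1M tokens, transcribed from the table.
API_PRICES = {
    "GPT-4o": (5.00, 15.00),
    "Claude-3.5-Sonnet": (3.00, 15.00),
    "Gemini-2.5-Pro": (1.25, 10.00),
}
SELF_HOSTED_RANGE = (0.50, 1.00)  # combined USD per 1M tokens

INPUT_SHARE = 0.75  # assumption: 75% of tokens are input tokens


def blended_price(input_price: float, output_price: float, input_share: float) -> float:
    """Average USD per 1M tokens for a given input/output token mix."""
    return input_share * input_price + (1.0 - input_share) * output_price


gpt4o_blended = blended_price(*API_PRICES["GPT-4o"], INPUT_SHARE)
ratio_low = gpt4o_blended / SELF_HOSTED_RANGE[1]   # vs. the upper self-hosted cost
ratio_high = gpt4o_blended / SELF_HOSTED_RANGE[0]  # vs. the lower self-hosted cost
print(f"GPT-4o blended: ${gpt4o_blended:.2f}; ratio {ratio_low:.1f}x to {ratio_high:.1f}x")
```

Under this assumed mix the ratio spans roughly 7.5× to 15×, a range that contains the 8-10× figure. A more output-heavy workload would push the ratio higher.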

Impact of Trajectory-Level Optimization

Comparing optimization strategies on EnterpriseBench:

Trajectory-level optimization is critical for multi-turn agentic tasks, validating EnterpriseLab's design for complete trajectory collection and training.

Synthetic Data Quality Analysis

Analysis of 1,500 synthetic trajectories for EnterpriseBench:

Diversity

Complexity

Correctness

Adaptation to Environment Changes

Testing robustness with 30% tool modifications (schemas, parameters, data) on EnterpriseBench:

| Scenario | LLM Eval | Tool Eval |
|---|---|---|
| Original environment | 0.50 | 0.20 |
| Modified environment (30% changes) | 0.43 (-15%) | 0.15 |
| + 200 incremental training samples | 0.48 (95% recovery) | 0.18 |

Insight: EnterpriseLab supports rapid model adaptation to evolving environments with minimal additional data, without full retraining.

Training Efficiency

Time to Production

Error Analysis

Analysis of 50 failure cases reveals systematic patterns:

| Failure Mode | Frequency | Description |
|---|---|---|
| Tool Parameter Errors | 42% | Incorrect arguments causing API failures; limited error recovery |
| Domain Misselection | 28% | Ambiguous cues lead to wrong tool selection and recursion loops |
| Task Decomposition | 18% | Completing initial sub-task but failing to plan subsequent steps |
| Context Loss | 12% | Loss of coherence in longer interactions |

Comparison with Existing Benchmarks

EnterpriseLab and EnterpriseArena uniquely address multi-application enterprise orchestration with dynamic data:

| Benchmark | Domain Focus | Multi-App Flow | Dynamic Data | Training Platform |
|---|---|---|---|---|
| AgentBench | General Reasoning | ✗ | ✗ | ✗ |
| WebArena | Web UI | ✗ | ✓ | ✗ |
| SWE-bench | Software Eng. | ✗ | ✗ | ✗ |
| CRMArena | CRM | ✗ | ✗ | ✗ |
| EnterpriseBench | General Enterprise | ✓ | ✗ | ✗ |
| τ-Bench | Customer Service | ✗ | ✓ | ✗ |
| EnterpriseLab + Arena | Cross-Functional Enterprise | ✓ | ✓ | ✓ |

Citation

If you use EnterpriseLab or EnterpriseArena in your research, please cite our work:

@article{enterpriselab2026,
  title = {{EnterpriseLab}: A Full-Stack Platform for Developing and Deploying Agents in Enterprises},
  author = {Nagaje, Suraj and Vishwakarma, Harsh and Agarwal, Ankush and Devaguptapu, Chaitanya},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year = {2026}
}

Get Started with EnterpriseLab

EnterpriseLab provides the first unified platform for training privacy-preserving, cost-effective enterprise AI agents. Transform your enterprise tools into specialized agents with:

Visit our GitHub repository to get started!



Contact: suraj.nagaje@fujitsu.com
© 2026 EnterpriseLab. Preliminary work under review.