Unified infrastructure for training privacy-preserving, cost-effective enterprise AI agents
Ankush Agarwal¹*, Harsh Vishwakarma¹*, Suraj Nagaje¹*, Chaitanya Devaguptapu¹
¹Fujitsu Research India
*Equal contribution
Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints. While frontier models like GPT-4o demonstrate strong reasoning abilities, their high inference costs ($3-$15 per million tokens) and data privacy concerns hinder enterprise adoption.
We introduce EnterpriseLab, a full-stack platform that unifies tool integration, data generation, and model training into a closed-loop framework. The platform enables enterprises to train small 8B-parameter models that match GPT-4o's performance while reducing inference costs by 8-10×.
Watch EnterpriseLab agents in action performing complex enterprise workflows:
Multi-step enterprise workflow demonstration showing agent orchestration across HR, document management, and communication systems.
Complex cross-functional workflow demonstrating integration of version control, project management, and notification systems.
These demonstrations showcase EnterpriseLab's ability to handle realistic enterprise scenarios involving multiple tools and stateful decision-making.
Enterprise environments require intelligent automation across complex, cross-departmental workflows spanning HR, IT, sales, and engineering. While frontier language models demonstrate strong capabilities, their deployment faces critical constraints:
Small Language Models (SLMs) in the 8B-32B parameter range offer a promising alternative through on-premises deployment and 10× cost reduction. However, effective specialization is hindered by fragmented development pipelines:
Current Challenges:
EnterpriseLab addresses these challenges by providing a unified platform that integrates:
Figure 1: EnterpriseLab's three-module architecture for developing enterprise agents
The platform integrates tool environments, data synthesis, and training infrastructure in a closed-loop system
The environment layer implements a client-server system built on Model Context Protocol (MCP), featuring:
repository and project to standard workspace_id)
Automated generation of high-quality, executable training data through four phases:
Build a dependency graph whose edges represent data-flow compatibility between tools. The graph ensures that any path through it corresponds to an executable tool sequence.
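The graph construction can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `Tool` record, the tool names, and the rule "add an edge a -> b when a's outputs cover all of b's inputs" are assumptions made here so that every path is executable once the first tool's inputs are supplied.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool description: the data types it consumes and produces."""
    name: str
    inputs: frozenset
    outputs: frozenset


def build_dependency_graph(tools):
    """Add an edge a -> b when the outputs of a cover every input of b."""
    graph = {t.name: [] for t in tools}
    for a, b in permutations(tools, 2):
        if b.inputs and b.inputs <= a.outputs:
            graph[a.name].append(b.name)
    return graph


def enumerate_paths(graph, start, max_len):
    """Depth-first enumeration of simple paths (length >= 2) from `start`."""
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            paths.append(path)
        if len(path) < max_len:
            for nxt in graph[path[-1]]:
                if nxt not in path:  # avoid revisiting a tool
                    stack.append(path + [nxt])
    return paths


# Hypothetical tools, loosely modeled on the recruiting example on this page.
TOOLS = [
    Tool("read_job_description", frozenset({"job_id"}), frozenset({"skills"})),
    Tool("search_resumes", frozenset({"skills"}), frozenset({"candidate_id"})),
    Tool("send_email", frozenset({"candidate_id"}), frozenset({"message_id"})),
]

if __name__ == "__main__":
    graph = build_dependency_graph(TOOLS)
    for path in enumerate_paths(graph, "read_job_description", max_len=3):
        print(" -> ".join(path))
```

Each enumerated path is a candidate tool sequence for a synthetic task. A real system would likely let a tool draw inputs from any earlier step rather than only the previous one; the stricter rule here keeps the sketch short.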
Support for multiple execution strategies:
Group Relative Policy Optimization adapted for agentic settings:
Composite reward combining four execution-grounded signals:
Overall reward: r(τ) = Σₖ wₖ rₖ(τ), normalized to [0,1]
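A minimal sketch of how a weighted composite reward can feed a GRPO-style update is shown below. The four signal names and the equal weights are hypothetical placeholders, since this page does not list them. The advantage step uses the standard group-relative normalization from GRPO (each reward standardized against the other trajectories sampled for the same task), not necessarily the exact variant used by EnterpriseLab.

```python
from statistics import mean, pstdev

# HYPOTHETICAL signal names and weights, for illustration only.
WEIGHTS = {
    "signal_1": 0.25,
    "signal_2": 0.25,
    "signal_3": 0.25,
    "signal_4": 0.25,
}


def composite_reward(signals, weights=WEIGHTS):
    """Weighted sum of per-signal rewards, divided by the total weight.

    If every signal lies in [0, 1], the result also lies in [0, 1].
    """
    total_weight = sum(weights.values())
    return sum(weights[k] * signals[k] for k in weights) / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each trajectory reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four trajectories sampled for the same task, each scored on four signals.
    group = [
        {"signal_1": 1.0, "signal_2": 1.0, "signal_3": 0.5, "signal_4": 1.0},
        {"signal_1": 1.0, "signal_2": 0.0, "signal_3": 0.5, "signal_4": 0.0},
        {"signal_1": 0.0, "signal_2": 0.0, "signal_3": 0.0, "signal_4": 0.0},
        {"signal_1": 1.0, "signal_2": 1.0, "signal_3": 1.0, "signal_4": 1.0},
    ]
    rewards = [composite_reward(s) for s in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the advantage is computed relative to the group, trajectories that beat their siblings get positive advantages and the rest get negative ones, without a separate value model.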
EnterpriseArena demonstrates EnterpriseLab's capabilities through a comprehensive benchmark environment with 15 specialized MCP servers and 500 expert-curated tasks.
| MCP Servers | Tools |
|---|---|
| RocketChat, Mail System | 20 tools for messaging and email |
| GitLab MCP | 22 tools for version control and CI/CD |
| Zammad, Plane (Jira) | 24 tools for ticketing and project management |
| Frappe HR, Calendar | 20 tools for employee management |
| Mongoose MCP, OwnCloud | 15 tools for database and file operations |
| Dolibarr, Salesforce | 19 tools for customer relationship management |
| Invoice System | 7 tools for invoicing and payments |
| File System, Bash, Browser | 18 tools for system operations |
The 500 expert-curated tasks span five workflow categories with realistic cross-departmental orchestration:
| Task Category | Description | % of Tasks |
|---|---|---|
| CRUD Operations | Create, Read, Update, Delete tasks across systems | 35% |
| Search & Orchestration | Multi-system information retrieval and coordination | 28% |
| Multi-entity Workflow | Complex tasks involving multiple data entities | 18% |
| Version Control | Code management and development operations | 12% |
| Cross-functional Integration | Tasks spanning multiple departments | 7% |
Task: "Read the 2026 Software Engineer job description, fetch relevant resumes, identify the top three candidates based on required skills, and coordinate interview scheduling with engineering managers via email."
Required Orchestration:
Complexity: 6-8 tool invocations across 3 systems with stateful reasoning
Unlike static benchmarks, EnterpriseArena maintains a unified backend where data changes propagate automatically:
Tasks were developed through structured reviews with 9 domain experts across Software Engineering, Business Development, Sales, IT Security, HR, and Finance. All tasks were rated "Realistic" or above on a five-point Likert scale.
We evaluate Qwen3-8B models trained with EnterpriseLab across four environments: EnterpriseArena (ours), EnterpriseBench, CRMArena, and τ-Bench.
| Model | EnterpriseArena (EA) | EnterpriseBench (EB) | CRMArena (CRM) | τ-Bench (τ-B) |
|---|---|---|---|---|
| Closed-Source Models | | | | |
| GPT-4o (2-shot) | 0.45 | 0.47 | 0.32 | 0.54 |
| Claude-3.5-Sonnet (2-shot) | 0.60 | 0.55 | 0.34 | 0.56 |
| Gemini-2.5-Pro (2-shot) | 0.71 | 0.55 | 0.49 | 0.59 |
| Open-Source Models | | | | |
| Qwen3-8B Base (2-shot) | 0.31 | 0.35 | 0.25 | 0.33 |
| ToolACE (26K-trained) | 0.39 | 0.41 | 0.10 | 0.15 |
| xLAM-2-70B (60K-trained) | 0.15 | 0.40 | 0.12 | 0.17 |
| Our Platform-Trained Models (<1K examples) | | | | |
| Qwen3-8B SFT | 0.35 | 0.38 | 0.30 | 0.36 |
| Qwen3-8B Agentic GRPO | 0.43 | 0.51 | 0.35 | 0.42 |
For benchmarks with tool-level annotations (EnterpriseArena and EnterpriseBench):
| Model | EA | EB |
|---|---|---|
| GPT-4o (2-shot) | 0.31 | 0.21 |
| Qwen3-8B Base (2-shot) | 0.14 | 0.14 |
| Qwen3-8B Agentic GRPO | 0.28 | 0.21 |
| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| GPT-4o | $5.00 | $15.00 |
| Claude-3.5-Sonnet | $3.00 | $15.00 |
| Gemini-2.5-Pro | $1.25 | $10.00 |
| Qwen3-8B Agentic GRPO (Self-hosted) | $0.50–$1.00 (combined) | |
Result: an 8-10× cost reduction with competitive performance makes EnterpriseLab-trained models well suited to cost-sensitive, large-scale deployments.
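The comparison mixes separate input/output API prices with a single combined self-hosted price, so any ratio depends on the workload's token mix. The sketch below shows one way to put both on the same footing. The `output_share` parameter (fraction of tokens that are output tokens) and its value of 0.25 are assumptions made here for illustration; the source does not state the mix behind its figure.

```python
def blended_price(input_price, output_price, output_share):
    """Blended $/1M tokens when `output_share` of all tokens are output tokens."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be in [0, 1]")
    return (1.0 - output_share) * input_price + output_share * output_price


def cost_ratio(api_blended, self_hosted_combined):
    """How many times more expensive the API is than the self-hosted model."""
    return api_blended / self_hosted_combined


# List prices from the table above ($ per 1M tokens).
GPT_4O = {"input": 5.00, "output": 15.00}
SELF_HOSTED_RANGE = (0.50, 1.00)  # combined price, low and high estimates

if __name__ == "__main__":
    # ASSUMPTION: 25% of tokens are output tokens (not stated in the source).
    api = blended_price(GPT_4O["input"], GPT_4O["output"], output_share=0.25)
    for self_hosted in SELF_HOSTED_RANGE:
        ratio = cost_ratio(api, self_hosted)
        print(f"self-hosted at ${self_hosted:.2f}/1M tokens: API costs {ratio:.1f}x more")
```

Changing `output_share` or the self-hosted estimate shifts the ratio, which is why a deployment-specific token mix is worth measuring before relying on a single headline number.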
Comparing optimization strategies on EnterpriseBench:
Trajectory-level optimization is critical for multi-turn agentic tasks, validating EnterpriseLab's design for complete trajectory collection and training.
Analysis of 1,500 synthetic trajectories for EnterpriseBench:
Diversity
Complexity
Correctness
Testing robustness with 30% tool modifications (schemas, parameters, data) on EnterpriseBench:
| Scenario | LLM Eval | Tool Eval |
|---|---|---|
| Original environment | 0.50 | 0.20 |
| Modified environment (30% changes) | 0.43 (-15%) | 0.15 |
| + 200 incremental training samples | 0.48 (95% recovery) | 0.18 |
Insight: EnterpriseLab supports rapid model adaptation to evolving environments with minimal additional data, without full retraining.
Analysis of 50 failure cases reveals systematic patterns:
| Failure Mode | Frequency | Description |
|---|---|---|
| Tool Parameter Errors | 42% | Incorrect arguments causing API failures; limited error recovery |
| Domain Misselection | 28% | Ambiguous cues lead to wrong tool selection and recursion loops |
| Task Decomposition | 18% | Completing initial sub-task but failing to plan subsequent steps |
| Context Loss | 12% | Loss of coherence in longer interactions |
EnterpriseLab and EnterpriseArena uniquely address multi-application enterprise orchestration with dynamic data:
| Benchmark | Domain Focus | Multi-App Flow | Dynamic Data | Training Platform |
|---|---|---|---|---|
| AgentBench | General Reasoning | โ | โ | โ |
| WebArena | Web UI | โ | โ | โ |
| SWE-bench | Software Eng. | โ | โ | โ |
| CRMArena | CRM | โ | โ | โ |
| EnterpriseBench | General Enterprise | โ | โ | โ |
| τ-Bench | Customer Service | โ | โ | โ |
| EnterpriseLab + Arena | Cross-Functional Enterprise | ✓ | ✓ | ✓ |
If you use EnterpriseLab or EnterpriseArena in your research, please cite our work:
@article{enterpriselab2026,
title = {{EnterpriseLab}: A Full-Stack Platform for Developing and Deploying Agents in Enterprises},
author = {Nagaje, Suraj and Vishwakarma, Harsh and Agarwal, Ankush and Devaguptapu, Chaitanya},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026}
}
EnterpriseLab provides the first unified platform for training privacy-preserving, cost-effective enterprise AI agents. Transform your enterprise tools into specialized agents with:
Visit our GitHub repository to get started!
Full-Stack Platform for Enterprise AI Agents
Contact: suraj.nagaje@fujitsu.com
© 2026 EnterpriseLab. Preliminary work under review.