AssetOpsBench: Industrial Agents Meet a Real-World Benchmark

Industrial assets do not fail neatly; they fail in ways that force engineers to pull signals from sensors, recall failure modes, and translate insights into work orders. AssetOpsBench is IBM's open-source answer: a domain-specific, multi-agent benchmark that evaluates how well AI agents can plan and act across the messy, end-to-end workflows of asset operations and maintenance.
It blends realistic data access (via CouchDB), four task-focused agents, and two orchestration styles to probe not just final answers but how those answers are produced across steps and tools.
The problem and the shape of a solution
Asset lifecycle work - condition monitoring, fault isolation, forecasting, and maintenance planning - has historically been tackled as a series of disconnected point solutions. LLM-based agents invite a different approach: let specialized agents cooperate, with an orchestrator drafting a plan and delegating tool use.
AssetOpsBench embodies that idea as a reproducible environment. The repository's README frames 140+ scenarios that mix single-step and multi-step tasks across IoT data retrieval, failure-mode reasoning, time-series modeling, and work order generation.
The benchmark runs inside Docker and comes with a CouchDB bootstrap that preloads a chiller dataset for realistic queries; see benchmark/README.md, benchmark/docker-compose.yml, and the sample data at src/assetopsbench/sample_data/...csv.
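Once the stack from benchmark/docker-compose.yml is up, you can confirm the bootstrap completed using CouchDB's standard HTTP endpoints. A minimal sketch follows; the host, port, and credentials are placeholders for whatever benchmark/.env configures.

# Minimal sketch: check that the CouchDB bootstrap completed.
# Host, port, and credentials are placeholders; use the values from benchmark/.env.
import requests

COUCH = "http://localhost:5984"
AUTH = ("admin", "password")  # placeholder credentials

# /_up and /_all_dbs are standard CouchDB endpoints
print(requests.get(f"{COUCH}/_up", auth=AUTH, timeout=10).json())       # {'status': 'ok'} when healthy
print(requests.get(f"{COUCH}/_all_dbs", auth=AUTH, timeout=10).json())  # should include the preloaded chiller database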
Why this repository stands out
Two design choices make AssetOpsBench unusually practical for enterprise R&D. First, transparency by design: trajectories are captured and can be analyzed with an evaluation module and external viewers, making it easier to classify where multi-agent workflows go wrong - misplanning, tool misuse, premature commitment, or missing context - echoing a recent failure taxonomy from UC Berkeley (UC Berkeley, 2025).
Second, orchestration flexibility: the repo ships both an agents-as-tools ReAct style (MetaAgent) and a plan-and-execute style (AgentHive), so teams can evaluate which paradigm fits their workloads, or swap in their own.
Why I like it
Three things clicked for me while exploring this repo: it treats multi-agent workflows as first-class, not an afterthought; it captures trajectories so you can audit and improve agents systematically; and it runs end-to-end with real storage (CouchDB) and realistic scenarios. That combination makes it useful not just for scoring models, but for engineering better orchestrators.
Key features and how they map to code
AssetOpsBench defines four domain agents with purpose-built tools:
- An IoT agent for sites, assets, sensors, and sensor history
- A failure-mode and sensor-relation (FMSR) agent
- A time series forecasting and anomaly detection agent (TSFM)
- A work order agent (see the sketch after this list)
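To make the division of labor concrete, here is a generic agents-as-tools sketch with hypothetical names, not the repository's actual classes: each domain agent is wrapped as a named, described callable that a ReAct-style orchestrator can select.

# Illustrative agents-as-tools pattern (hypothetical names, not the repo's API):
# each domain agent is exposed to the orchestrator as a described callable.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentTool:
    name: str
    description: str
    run: Callable[[str], str]

tools = [
    AgentTool("iot", "List sites, assets, and sensors; fetch sensor history", lambda q: "..."),
    AgentTool("fmsr", "Relate failure modes to their indicative sensors", lambda q: "..."),
    AgentTool("tsfm", "Forecasting and anomaly detection over sensor time series", lambda q: "..."),
    AgentTool("workorder", "Create or query maintenance work orders", lambda q: "..."),
]

# A ReAct-style orchestrator exposes these descriptions to the LLM and routes
# each chosen Action to the matching AgentTool.run call.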
Multi-agent orchestration is provided in two flavors:
- MetaAgent (ReAct, agents-as-tools), centered in src/meta_agent/meta_agent.py and its companions.
- AgentHive (plan-and-execute), with workflows in src/agent_hive/workflows.
Scenarios live under scenarios/ (including multi-agent end-to-end tasks), alongside a Croissant schema.
The benchmark/run_plan_execute.py script wires agents into a planning workflow and writes trajectories. A typical container run mounts scenarios and calls the entrypoint to launch plan-and-execute over multiple utterances, as defined in benchmark/entrypoint.sh.
# spin up CouchDB + assetopsbench container
# see benchmark/docker-compose.yml
docker compose -f benchmark/docker-compose.yml build
docker compose -f benchmark/docker-compose.yml up
# snippet from the container's run command (entrypoint.sh)
python /home/run_plan_execute.py \
--utterance_file /home/scenarios/multi_agent/end2end_utterance.json \
--llm 6 \
--workflow p
Under the hood
AssetOpsBench's technical architecture balances flexibility with reproducibility. Let's examine the key components that make it work:
Runtime Environment: The Docker images expose a Python 3.12 Conda environment called "assetopsbench." Dependencies are split across two levels: basic_requirements.txt covers core libraries like pandas and scipy, while extra_requirements.txt adds specialized IBM agent packages and ReActXen for ReAct-style reasoning. This layered approach means you can start with the basic image and add domain-specific tools as needed.
Data Infrastructure: CouchDB serves as the operational data store, mimicking real-world asset management systems. The couchdb_setup.sh script handles database initialization and bulk-loads the chiller sample dataset, creating a realistic environment where agents must query structured operational data rather than work with pre-formatted test inputs.
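The practical consequence is that agent tools issue real queries. A minimal sketch follows, with a hypothetical database name and document fields and placeholder credentials; inspect CouchDB after couchdb_setup.sh runs to find the real ones.

# Minimal sketch: query operational data the way an agent tool would.
# Database name, document fields, and credentials are assumptions.
import requests

COUCH = "http://localhost:5984"
AUTH = ("admin", "password")  # placeholder; take real values from benchmark/.env

# POST /{db}/_find with a Mango selector is a standard CouchDB endpoint
query = {"selector": {"asset_type": "chiller"}, "limit": 5}  # hypothetical fields
resp = requests.post(f"{COUCH}/chiller_data/_find", json=query, auth=AUTH, timeout=10)
for doc in resp.json().get("docs", []):
    print(doc)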
Agent Orchestration: The core execution logic lives in run_plan_execute.py. This script instantiates either ReAct or ReAct-Reflect agents, configures them with domain-specific tools (IoT queries, failure-mode lookups, forecasting models), and coordinates their execution across planning workflows. Each agent interaction generates trajectory data that captures not just the final output, but the intermediate reasoning steps and tool calls.
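The value of those trajectories is easiest to see as data. An illustrative shape of one captured step is shown below; the field names are assumptions, not the benchmark's exact schema.

# Illustrative trajectory step (field names are assumptions, not the exact schema):
# enough to audit what was planned, which tool was called, and what came back.
trajectory_step = {
    "step": 3,
    "thought": "Need recent sensor history before forecasting.",
    "action": {"agent": "iot", "tool": "get_sensor_history",   # hypothetical tool name
               "args": {"asset": "Chiller 6", "window": "7d"}},
    "observation": "Returned 2,016 readings for supply temperature.",
}

# A full trajectory is an ordered list of such steps plus the final answer,
# which is the input that the evaluation module grades.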
Evaluation Framework: The evaluation/analyze.py module implements a sophisticated grading system that goes beyond simple accuracy metrics. It supports rubric-based analysis over captured trajectories, allowing researchers to score agents on planning quality, tool usage efficiency, and reasoning coherence. The team also employs an LLM judge (Llama-4 Maverick 17B) to evaluate responses across six dimensions including correctness, completeness, and safety considerations, as detailed in the README and research blog (Martineau, 2025).
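The pattern generalizes well: score each rubric dimension over the trajectory and final answer with an LLM judge. A minimal sketch, not analyze.py's actual API, with the judge call left as a parameter:

# Minimal rubric-grading sketch (illustrative; not analyze.py's API).
from typing import Callable

RUBRIC = ["correctness", "completeness", "safety"]  # a subset of the six dimensions

def grade(trajectory: list[dict], final_answer: str, expected: str,
          judge: Callable[[str], str]) -> dict[str, str]:
    """Ask an LLM judge to score one dimension at a time over the full trajectory."""
    scores = {}
    for dim in RUBRIC:
        prompt = (
            f"Dimension: {dim}\n"
            f"Expected outcome: {expected}\n"
            f"Final answer: {final_answer}\n"
            f"Tool calls: {[step.get('action') for step in trajectory]}\n"
            "Return a 1-5 score with a one-sentence justification."
        )
        scores[dim] = judge(prompt)
    return scores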
Scenario Management: Scenarios are defined using the Croissant metadata schema, providing structured descriptions of tasks, expected inputs, and evaluation criteria. This standardized format ensures that new scenarios can be added systematically and that benchmark results remain comparable across different research groups and model versions.
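Because Croissant is JSON-LD, scenario metadata can be inspected with nothing more than the standard library. A minimal sketch with a placeholder filename; point it at the actual Croissant file under scenarios/.

# Minimal sketch: inspect the Croissant (JSON-LD) metadata that describes the scenarios.
# The filename is a placeholder; use the actual file under scenarios/.
import json

with open("scenarios/croissant.json") as f:  # placeholder path
    meta = json.load(f)

# Croissant files are JSON-LD: top-level keys describe the dataset and its record sets
print(meta.get("name"), (meta.get("description") or "")[:80])
print(sorted(meta.keys()))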
What AssetOpsBench adds to the benchmark landscape
Classic LLM benchmarks like MMLU (Hendrycks, 2020) and HellaSwag (Zellers, 2019) test general knowledge and commonsense inference. A newer wave measures agentic behavior in interactive settings: AgentBench spans eight environments from databases to household tasks (Liu, 2023); SWE-bench turns GitHub issues into executable software-fixing challenges (Jimenez, 2023); and WebArena builds a realistic, self-hosted web for long-horizon tasks (Zhou, 2023).
AssetOpsBench complements these by zeroing in on Industry 4.0: it evaluates how an orchestrator coordinates domain agents over sensor time series, codified failure modes, and maintenance processes - then grades both the journey and the destination.
IBM's paper deepens this with end-to-end design guidance for perception, reasoning, and control across the asset lifecycle (Patel et al., 2025).
- Compared with AgentBench's broad environment suite, AssetOpsBench is narrower but deeper in industrial semantics (e.g., failure-mode-sensor mappings, work orders) and deploys true multi-agent orchestration with reusable domain tools.
- Compared with SWE-bench's rigorous software-edit tasks, AssetOpsBench emphasizes tool-mediated data retrieval and forecasting under real-time ops constraints, with evaluation that inspects intermediate reasoning, not just the final patch.
- Compared with WebArena's web interaction, AssetOpsBench focuses on operational data systems (CouchDB-backed) and domain-specific agent tools; both share a commitment to verifiable end-to-end tasks and reproducible environments.
Use cases and who benefits
The scenarios are recognizable to reliability and facilities teams: list sensors on an asset or site; correlate symptoms with failure modes and their indicative signals; forecast equipment behavior; or write a maintenance ticket when the model flags risk.
These map to real processes in condition-based maintenance and reliability-centered maintenance. Multi-agent orchestration is especially relevant where data access, physics-informed rules, and historical context must combine before a human commits to an intervention. IBM's blog outlines how similar agents could surface insights in Maximo Application Suite and beyond (Martineau, 2025).
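In data terms, "write a maintenance ticket" comes down to a structured payload like the hypothetical one below; the field names are illustrative, not the work order agent's schema.

# Hypothetical work-order payload a work order agent might emit when risk is flagged.
# Field names are illustrative, not the benchmark's schema.
work_order = {
    "asset": "Chiller 6",
    "priority": "high",
    "suspected_failure_mode": "condenser fouling",
    "evidence": ["rising approach temperature", "forecasted efficiency degradation"],
    "recommended_action": "Schedule condenser tube cleaning within 7 days",
}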
Running the benchmark
The fastest path is Docker. Use the compose file to bring up CouchDB and the benchmark image, then mount scenarios and a run script. Environment variables can be passed via benchmark/.env. If you extend the image, the docs show how to add dependencies on top of the assetopsbench-basic or assetopsbench-extra base.
Note that extra_requirements.txt includes several IBM-hosted agent packages; the prebuilt images already bundle them, but building locally may require credentials. For experimentation, the Hugging Face dataset is public: ibm-research/AssetOpsBench.
Community, contributions, and roadmap signals
The repository documents contributors and links to a research blog that previews future directions like cost-aware evaluation and safety scoring (Martineau, 2025). The design encourages external agents: bring your own tools and orchestrator and compare across the same scenarios. Transparent trajectories and the evaluation module (analyze.py) make it feasible to submit error analyses, not just scores. Related work from IBM also catalogs gaps across 120+ agent benchmarks and calls for step-wise grading and safe-by-default evaluation; AssetOpsBench adopts that posture.
License and usage
AssetOpsBench is Apache 2.0 licensed; see LICENSE. In short, you can use, modify, and distribute the software; if you redistribute modifications, retain notices and attribution, and note that the license's patent grant and warranty disclaimer apply. Contributions are accepted under the same license.
About IBM Research
IBM Research builds enterprise-grade AI systems, from Granite foundation models to AI-for-operations, and integrates them into products like IBM Maximo Application Suite for asset management. AssetOpsBench fits squarely in that mission: make agentic evaluation realistic enough to guide design and trustworthy enough to deploy. See the research overview and benchmark announcement (Martineau, 2025).
Closing thoughts
Industrial AI agents are only as good as their ability to plan across tools, reason with domain signals, and show their work. AssetOpsBench gives teams a grounded, multi-agent yardstick - complete with trajectories, interpretable grading, and two orchestration paradigms - to iterate toward that standard. Explore the repo, try the Docker flow, and benchmark your orchestrator side by side: IBM/AssetOpsBench. For deeper background and context, see the arXiv paper (Patel et al., 2025) and IBM's blog (Martineau, 2025). Also scan related efforts like AgentBench (Liu, 2023), SWE-bench (Jimenez, 2023), and WebArena (Zhou, 2023) to see how different communities are measuring progress.