
Agent Lightning: Decoupled RL Training for Any AI Agent

A practical, framework-agnostic way to train real-world agents without invasive rewrites

Agent Lightning is a Microsoft Research project that turns existing agents into trainable systems with minimal code changes. Instead of rewriting your agent to fit a trainer loop, you attach a lightweight client that streams traces and rewards to a centralized training server, where reinforcement learning (and other algorithms) can improve the model behind the agent. The result: a practical path to optimize agents you already have, built with frameworks like LangGraph, AutoGen, or the OpenAI Agents SDK, without invasive refactors.

GitHub: microsoft/agent-lightning, "The absolute trainer to light up AI agents." (Python, ~1.7k stars; topics: agent, agentic-ai, llm, mlops, reinforcement-learning)

The problem and the solution

Most real agents are glue around an LLM: they plan, call tools, loop, and decide when to stop. Training that LLM in the context of the full workflow is hard because the agent runtime and the ML training stack pull in different dependencies and control flows. Traditional RL setups often couple simulation and training in one process, forcing developers to contort agent code into trainer APIs. 

Agent Lightning breaks this coupling. A server manages tasks, resources, and an LLM endpoint exposed inside the training infrastructure; one or more clients run the agent as-is, collect trajectories, compute rewards, and report rollouts back. This keeps agent logic where it belongs and lets training evolve independently.

Why it stands out

It is genuinely plug-in: you can selectively optimize a single step in a multi-step agent (for example, only the write and rewrite steps of a SQL agent) and leave the rest untouched. It is framework-agnostic and embraces multiple optimization methods: reinforcement learning today, automatic prompt optimization tomorrow. 

The repo reads like production software with tests, docs, examples, and a clear type model in types.py. There is also a live docs site and an arXiv paper that formalizes the approach as a decoupled MDP with hierarchical credit assignment (Luo et al., 2025).

Key features

  • Zero-to-minimal code changes: Wrap existing agents with a thin client; keep your loops, tools, and state logic intact (README).

  • Server-client architecture: A FastAPI server queues tasks, versions resources, and receives rollouts; clients poll tasks and report results.

  • Selective optimization: Target specific sub-agents or steps in a multi-agent workflow (see examples/spider).

  • RL-first, extensible beyond RL: Integrates with VERL for algorithms like GRPO; also demonstrates prompt optimization (examples/apo).

  • Agent observability: Built-in AgentOps tracing and a tracer abstraction for exporting step-level triplets and spans (trainer.py).

  • Works with your stack: Agents built with LangGraph, AutoGen, OpenAI Agents, or plain Python OpenAI can be optimized without switching frameworks (Microsoft Research, 2025).

Under the hood

The Lightning Server hosts a versioned resource registry and an OpenAI-compatible LLM endpoint inside the training fabric (VERL + vLLM). It exposes simple endpoints for next-task, resources, and rollout, implemented with FastAPI and Uvicorn in server.py.
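A minimal sketch of that shape, with illustrative endpoint names and payloads rather than the actual server.py code:

# Illustrative FastAPI server sketch; endpoint names and models are assumptions,
# not the real server.py implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
task_queue = [{"task_id": "t1", "input": "List the youngest singer"}]  # toy task queue
resources = {"version": 1, "llm_endpoint": "http://localhost:9999/v1"}  # toy registry
rollouts = []

class Rollout(BaseModel):
    task_id: str
    final_reward: float

@app.get("/next-task")
def next_task():
    # Hand the next queued task to a polling client, or signal an empty queue.
    return task_queue.pop(0) if task_queue else {"task_id": None}

@app.get("/resources")
def get_resources():
    # Clients fetch the current resource version (prompts, model endpoint, etc.).
    return resources

@app.post("/rollout")
def post_rollout(rollout: Rollout):
    # Collect completed rollouts for the training loop to consume.
    rollouts.append(rollout)
    return {"status": "ok"}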

The Lightning Client, implemented with aiohttp and requests, polls tasks, fetches resource versions, executes agent logic, and posts rollouts as Pydantic models defined in types.py.
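The client side can stay equally thin. A rough sketch of the polling loop, assuming the toy endpoints from the server sketch above:

# Illustrative client loop; URLs and payloads match the toy server sketch, not the real client.
import requests

SERVER = "http://localhost:8000"

def run_agent(task, resources):
    # Call your existing agent here (LangGraph, AutoGen, plain OpenAI, ...)
    # and return a scalar reward for the completed task.
    return 1.0

while True:
    task = requests.get(f"{SERVER}/next-task").json()
    if task["task_id"] is None:
        break
    resources = requests.get(f"{SERVER}/resources").json()
    reward = run_agent(task, resources)
    requests.post(f"{SERVER}/rollout", json={"task_id": task["task_id"], "final_reward": reward})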

The Trainer in trainer.py orchestrates parallel workers that run your agent, connect to the server, export traces via AgentOps, and surface per-turn triplets for RL. A short deep dive on the server-client architecture explains the design tradeoffs.
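The per-turn data the trainer works with is conceptually simple: for each LLM call in the loop, what the model saw, what it generated, and what reward that step earned. A hypothetical shape for such a record (the real Pydantic models live in types.py):

# Hypothetical step record for illustration; see types.py for the actual model definitions.
from dataclasses import dataclass

@dataclass
class StepTriplet:
    prompt: str    # what the model was shown at this step (state)
    response: str  # what it generated (action)
    reward: float  # credit assigned to this step

trajectory = [
    StepTriplet("Write a SQL query for ...", "SELECT name FROM singer ...", 0.0),
    StepTriplet("The query failed; rewrite it ...", "SELECT name FROM singer WHERE ...", 1.0),
]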

On the RL side, the project uses VERL as the training backend and GRPO-style advantage estimation, with a patched vLLM instrumentation layer that exposes the token-level information needed for loss computation. That detail matters: agent loops interleave tool calls and generations, and the trainer needs consistent tokenization to compute per-step losses. See the instrumentation directory and the paper for discussion (Volcengine VERL, 2024; vLLM, 2023).
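GRPO-style advantage estimation itself is easy to picture: sample a group of rollouts for the same task and normalize each reward against the group's statistics, so no learned value function is required. A minimal illustration, not the VERL implementation:

# Minimal GRPO-style group advantage: normalize rewards within a group of rollouts
# sampled for the same task. Illustrative only; VERL handles the real loss computation.
import numpy as np

def group_advantages(rewards, eps=1e-6):
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # positive for successful rollouts, negative otherwise

With those pieces in place, the developer-facing surface stays small; wiring an agent into the trainer looks like this: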

from agentlightning.trainer import Trainer
from agentlightning.litagent import LitAgent

class MyAgent(LitAgent):
    def training_rollout(self, task, rollout_id, resources):
        # Implement your agent's single-task logic.
        # Return a final reward (float) or a list of step triplets.
        return 1.0  # toy reward for illustration

trainer = Trainer(n_workers=4)
trainer.fit(agent=MyAgent(), backend="http://localhost:8000")  # Point to a running Lightning Server

Use cases

Three concrete examples ship with the repo. calc_x shows a math agent with tool use trained via RL on Calc-X (examples/calc_x). spider implements a LangGraph SQL agent that writes, executes, checks, and rewrites queries against a database; training selectively optimizes the write and rewrite steps and improves held-out accuracy significantly on Spider (examples/spider; Zhang, 2025). apo demonstrates how to plug in training-free prompt optimization under the same server-client pattern (examples/apo).
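For a benchmark like Spider, the reward can be as direct as execution match: run the predicted and gold queries and compare result sets. The shipped example defines its own reward; the sketch below only illustrates the idea:

# Illustrative execution-match reward for a SQL agent; the examples/spider code
# defines the actual reward, this only shows the general shape.
import sqlite3

def execution_match_reward(db_path, predicted_sql, gold_sql):
    conn = sqlite3.connect(db_path)
    try:
        pred = set(conn.execute(predicted_sql).fetchall())
        gold = set(conn.execute(gold_sql).fetchall())
        return 1.0 if pred == gold else 0.0
    except sqlite3.Error:
        return 0.0  # malformed query earns no reward
    finally:
        conn.close()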

Community and contributing

The project is MIT-licensed and maintained under the Microsoft org. Contributions require signing the Microsoft Contributor License Agreement, with a bot guiding you through on first PRs. There is a Discord for support and discussion, a Code of Conduct, and active CI for CPU and GPU examples (LICENSE; SECURITY.md; Actions).

Usage and license terms

Agent Lightning ships under the MIT License, which allows use, modification, distribution, and sublicensing, provided the copyright notice and permission appear in copies. Software is provided "as is" without warranty (MIT, 1988).

How it compares: RLHF, DPO, SFT, and prompt tuning

Most production LLM training uses supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF), commonly with PPO variants that optimize a reward model trained on preference data (Schulman et al., 2017; Ouyang et al., 2022). 

More recently, Direct Preference Optimization (DPO) sidesteps the reward model by optimizing toward preferred responses directly (Rafailov et al., 2023). Prompt-tuning and adapters (for example, LoRA) avoid updating full model weights and are cheaper and safer for narrow domains (Lester et al., 2021; Hu et al., 2021).
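For orientation, the DPO objective fits in a few lines: it pushes up the log-probability of the preferred response relative to the rejected one, anchored to a frozen reference model. A minimal sketch of the loss, independent of Agent Lightning:

# Minimal DPO loss over per-response log-probabilities (Rafailov et al., 2023).
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Each argument is a tensor of summed log-probs for a batch of responses.
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()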

Agent Lightning's value is orthogonal: it is a plumbing and abstraction layer that lets you apply these techniques to agents rather than single-turn chat. Because it decouples agent execution from training, you can run online RL with environment rewards (for example, SQL correctness), offline RL on logged trajectories, or training-free prompt optimization under the same interface. 

The VERL backend provides GRPO and related methods out of the box, while the examples show how to plug in non-RL algorithms. If your problem suits preference learning (DPO) or pure SFT, nothing in the architecture stops you from using them; the main win is that your agent code and your optimization code stay cleanly separated.

About Microsoft Research

Agent Lightning is an active Microsoft Research project with a public overview site outlining design goals, architecture, and roadmap. The team emphasizes decoupling, multi-agent readiness, richer reward signals, and future integrations with additional optimization backends (Microsoft Research, 2025).

Impact and what's next

For practitioners, the biggest gain is operational: you can keep using your favorite agent frameworks and iterate on training strategies without churn. For organizations, the server-client split enables centralized training with federated data collection from many agent clients, and clearer routes to observability and safety checks. Near-term, expect richer reward shaping, hierarchical RL, curriculum learning, and more backends (for example, DSPy or LLaMA-Factory) to plug into the same interface (Gupta et al., 2024).

Conclusion

Agent Lightning bridges a long-standing gap: it makes agent training feel like adding a component, not rewriting your app. If you have an agent that needs to get better at the tasks it already performs, start with the Getting Started guide, skim the architecture deep dive, and run an example like spider. Then decide whether RL, DPO, or prompt tuning is the right first step. The framework will meet you where you are.


Author: Joshua Berkowitz, October 8, 2025