Step-level MDP¶
A Principled Foundation for RL Agent Training¶
Most existing frameworks treat the LLM agent as a token-level process: the "state" is the ever-growing concatenation of all past tokens, and the "action" is the next token. This token-level view forces context to grow monotonically and makes it hard to apply standard RL algorithms at a meaningful granularity.
Agent-R1 adopts a step-level MDP that models the LLM as an agent acting inside an environment:
| MDP Element | Definition |
|---|---|
| State \(s_t\) | The prompt presented to the LLM at step \(t\), determined entirely by the environment |
| Action \(a_t\) | The LLM's complete response at step \(t\) |
| Transition \(T(s_{t+1} \mid s_t, a_t)\) | The environment produces the next observation given the current state and the LLM's response |
| Reward \(r_t\) | A per-step reward signal from the environment |
| Policy \(\pi(a_t \mid s_t)\) | The LLM itself |
```mermaid
graph LR
state_t["State s_t"] -->|"Policy π (LLM)"| action_t["Action a_t"]
action_t -->|"Environment"| state_t1["State s_{t+1}"]
action_t -->|"Environment"| reward_t["Reward r_t"]
state_t1 -->|"Policy π (LLM)"| action_t1["Action a_{t+1}"]
action_t1 -->|"..."| more_steps["..."]
```
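The table and diagram above can be sketched as a plain interaction loop. This is a minimal illustration, not Agent-R1's actual API; the `EchoToolEnv`, `policy`, and `rollout` names are hypothetical stand-ins for the environment, the LLM, and the trajectory collector.

```python
from dataclasses import dataclass

@dataclass
class StepRecord:
    state: str     # prompt s_t presented to the LLM
    action: str    # complete LLM response a_t
    reward: float  # per-step reward r_t

class EchoToolEnv:
    """Toy environment: it owns the next prompt, rather than the prompt
    being a blind concatenation of all prior tokens."""

    def reset(self) -> str:
        return "Question: what is 2 + 2?"

    def step(self, action: str):
        done = "4" in action
        reward = 1.0 if done else 0.0
        next_state = "Observation: tool returned 4. Answer now."
        return next_state, reward, done

def policy(state: str) -> str:
    # Stand-in for the LLM pi(a_t | s_t): always answers "4".
    return "The answer is 4."

def rollout(env, policy, max_steps: int = 4):
    """Collect one trajectory of (s_t, a_t, r_t) step records."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append(StepRecord(state, action, reward))
        if done:
            break
        state = next_state
    return trajectory
```

Each loop iteration is one MDP step: the environment produces \(s_t\), the policy produces \(a_t\), and the environment returns \(s_{t+1}\) and \(r_t\).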
This formulation leads to three key insights:
Flexible Context
Because the state \(s_t\) is provided by the environment -- not derived by concatenating all prior tokens -- the environment is free to summarize, truncate, restructure, or even completely replace the context between steps. As long as the transition function is well-defined, the MDP remains valid.
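As a concrete (hypothetical) illustration of this freedom, a transition function might discard the raw history and hand the LLM only a compact summary plus the latest observation:

```python
def summarizing_transition(state: str, action: str, env_output: str) -> str:
    """One valid choice of T(s_{t+1} | s_t, a_t): replace the full token
    history with a short summary of the last action plus the new observation.
    Illustrative only; a real environment could summarize with an LLM."""
    summary = f"Last action (truncated): {action[:40]}"
    return f"{summary}\nObservation: {env_output}"
```

Because the next prompt is fully determined by `(state, action, env_output)`, the MDP stays well-defined even though the context no longer grows monotonically.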
Valid RL Training
Each step has its own observation, action, and reward. Log-probabilities are computed conditioned on \(s_t\) independently at each step, so standard policy gradient methods (PPO, GRPO, etc.) apply directly at the step level.
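To make the "per-step log-prob" point concrete, here is a minimal REINFORCE-style surrogate objective at step granularity. This is a didactic sketch, not Agent-R1's trainer: `logprob_fn` stands in for the LLM's log-probability of a full response given its prompt, and each term is conditioned only on its own \(s_t\).

```python
import math

def step_level_policy_gradient(trajectory, logprob_fn, gamma=1.0):
    """trajectory: list of (state, action, reward) tuples.
    logprob_fn(state, action) -> log pi(a_t | s_t).
    Returns the surrogate objective sum_t log pi(a_t | s_t) * G_t."""
    # Returns-to-go: G_t = sum_{k >= t} gamma^(k - t) * r_k
    returns, g = [], 0.0
    for _, _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return sum(logprob_fn(s, a) * G
               for (s, a, _), G in zip(trajectory, returns))

# Two-step toy trajectory with a reward only at the second step.
traj = [("s0", "a0", 0.0), ("s1", "a1", 1.0)]
obj = step_level_policy_gradient(traj, lambda s, a: math.log(0.5))
```

PPO or GRPO would replace the plain log-prob term with their clipped or group-normalized variants, but the step-level factorization is the same.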
Concat as a Special Case
The traditional "append everything" approach is simply one particular transition function: \(s_{t+1} = \text{concat}(s_t,\; a_t,\; \text{env\_output}_t)\). It is a valid choice, but by no means the only one. Agent-R1 supports it as a special case rather than a hard-wired constraint.
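Written as code, the concatenation transition is one line. This sketch mirrors the hypothetical `summarizing_transition` shape: same signature, different choice of \(T\).

```python
def concat_transition(state: str, action: str, env_output: str) -> str:
    """The traditional 'append everything' transition, expressed as just
    one particular choice of T(s_{t+1} | s_t, a_t)."""
    return state + action + env_output
```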
Why It Matters for Agent Tasks¶
This is the main reason Agent-R1 is built around multi-step agent behavior rather than single-step prompting. Once the environment owns the next observation, the framework can naturally support:
- tool calls and structured environment feedback
- state updates across multiple turns
- per-step rewards instead of only outcome rewards
- trajectory-level training for real agent tasks
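For instance, per-step rewards make it possible to credit intermediate behavior such as a well-formed tool call, not only the final answer. The tag and weights below are illustrative assumptions, not Agent-R1's reward specification:

```python
def per_step_reward(action: str, done: bool, answer_correct: bool) -> float:
    """Toy shaped reward: small credit for issuing a tool call at any step,
    outcome credit only on the final step."""
    reward = 0.0
    if "<tool_call>" in action:   # hypothetical tool-call tag
        reward += 0.1
    if done and answer_correct:
        reward += 1.0
    return reward
```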
In practice, this means the important unit in Agent-R1 is not just a token stream, but a sequence of environment-mediated interaction steps.