Agent Task Tutorial¶
This tutorial follows the main Agent-R1 path: a multi-step, tool-augmented agent task built on AgentEnvLoop and ToolEnv.
The example uses GSM8K, but the important part is not the benchmark itself. The goal is to show how Agent-R1 turns a dataset row into an environment-driven, multi-step rollout.
What You Will Run¶
This tutorial uses two existing files:
- dataset preprocessing:
examples/data_preprocess/gsm8k_tool.py - training script:
examples/run_qwen3-4b_gsm8k_tool.sh
1. Prepare the Agent Dataset¶
Generate the tool-augmented GSM8K dataset:
Compared with the single-step sanity-check dataset, this preprocessing script adds two fields that make the task agentic:
agent_name: "agent_env_loop"env_kwargswithenv_type: "tool"and the tool configuration
Conceptually, each sample says:
- use the
agent_env_looprollout logic - instantiate a
toolenvironment - expose the
calc_gsm8k_rewardtool inside that environment
2. Launch the Agent Task Training Script¶
Run:
This script switches the rollout from single-step generation to the agent loop:
actor_rollout_ref.rollout.agent.default_agent_flow=agent_env_loop \
actor_rollout_ref.rollout.agent.max_steps=5 \
It also points the trainer to the tool dataset:
data.train_files=$HOME/data/gsm8k_tool/train.parquet \
data.val_files=$HOME/data/gsm8k_tool/test.parquet \
3. What Happens During One Trajectory¶
At a high level, one sample follows this path:
graph TD
datasetRow["Dataset row"] --> agentFlow["AgentEnvLoop"]
agentFlow --> toolEnv["ToolEnv"]
toolEnv --> llmStep["LLM response"]
llmStep --> toolCall["Tool call parsing"]
toolCall --> toolExec["BaseTool execution"]
toolExec --> nextObs["Next observation"]
nextObs --> agentFlow
More concretely:
AgentEnvLoopreadsenv_kwargsfrom the dataset row.AgentEnv.from_config(env_type="tool", ...)creates aToolEnv.ToolEnv.reset()starts from the sample's prompt messages.- The LLM produces a response.
ToolEnv.step()parses tool calls from the response and executes the registered tool.- Tool output is appended to the conversation as the next observation.
- The loop continues until the environment returns
done=Trueormax_stepsis reached.
4. Where the Reward Comes From¶
The built-in GSM8K tool is registered as calc_gsm8k_reward in agent_r1/tool/tools/gsm8k.py.
Its role in this example is to:
- receive the model's proposed answer
- compare it with the sample's ground truth
- return tool text back into the conversation
This is what makes the tutorial useful for Agent-R1: the model is not just generating one final answer, it is interacting with an environment that can evaluate and feed back information across multiple steps.
5. Why This Tutorial Matters More Than the Single-Step Script¶
The single-step GSM8K script is still useful, but only as a setup check. This tutorial is closer to the actual design center of Agent-R1 because it demonstrates:
- a step-level environment transition
- a multi-step agent loop
- tool-augmented interaction
- reward signals attached to environment-mediated behavior
6. Where to Look Next¶
- Read
Step-level MDPto connect this tutorial to the core RL formulation. - Read
Layered Abstractionsto see why this example maps naturally toAgentEnvLoop + ToolEnv.