Agent Task Tutorial¶

This tutorial follows the main Agent-R1 path: a multi-step, tool-augmented agent task built on AgentEnvLoop and ToolEnv.

The example uses GSM8K, but the important part is not the benchmark itself. The goal is to show how Agent-R1 turns a dataset row into an environment-driven, multi-step rollout.

What You Will Run¶

This tutorial uses two existing files:

dataset preprocessing: examples/data_preprocess/gsm8k_tool.py
training script: examples/run_qwen3-4b_gsm8k_tool.sh

1. Prepare the Agent Dataset¶

Generate the tool-augmented GSM8K dataset:

python3 examples/data_preprocess/gsm8k_tool.py --local_save_dir ~/data/gsm8k_tool

Compared with the single-step sanity-check dataset, this preprocessing script adds two fields that make the task agentic:

agent_name: "agent_env_loop"
env_kwargs with env_type: "tool" and the tool configuration

Conceptually, each sample says:

use the agent_env_loop rollout logic
instantiate a tool environment
expose the calc_gsm8k_reward tool inside that environment

2. Launch the Agent Task Training Script¶

Run:

bash examples/run_qwen3-4b_gsm8k_tool.sh

This script switches the rollout from single-step generation to the agent loop:

actor_rollout_ref.rollout.agent.default_agent_flow=agent_env_loop \
actor_rollout_ref.rollout.agent.max_steps=5 \

It also points the trainer to the tool dataset:

data.train_files=$HOME/data/gsm8k_tool/train.parquet \
data.val_files=$HOME/data/gsm8k_tool/test.parquet \

3. What Happens During One Trajectory¶

At a high level, one sample follows this path:

graph TD
    datasetRow["Dataset row"] --> agentFlow["AgentEnvLoop"]
    agentFlow --> toolEnv["ToolEnv"]
    toolEnv --> llmStep["LLM response"]
    llmStep --> toolCall["Tool call parsing"]
    toolCall --> toolExec["BaseTool execution"]
    toolExec --> nextObs["Next observation"]
    nextObs --> agentFlow

More concretely:

AgentEnvLoop reads env_kwargs from the dataset row.
AgentEnv.from_config(env_type="tool", ...) creates a ToolEnv.
ToolEnv.reset() starts from the sample's prompt messages.
The LLM produces a response.
ToolEnv.step() parses tool calls from the response and executes the registered tool.
Tool output is appended to the conversation as the next observation.
The loop continues until the environment returns done=True or max_steps is reached.

4. Where the Reward Comes From¶

The built-in GSM8K tool is registered as calc_gsm8k_reward in agent_r1/tool/tools/gsm8k.py.

Its role in this example is to:

receive the model's proposed answer
compare it with the sample's ground truth
return tool text back into the conversation

This is what makes the tutorial useful for Agent-R1: the model is not just generating one final answer, it is interacting with an environment that can evaluate and feed back information across multiple steps.

5. Why This Tutorial Matters More Than the Single-Step Script¶

The single-step GSM8K script is still useful, but only as a setup check. This tutorial is closer to the actual design center of Agent-R1 because it demonstrates:

a step-level environment transition
a multi-step agent loop
tool-augmented interaction
reward signals attached to environment-mediated behavior

6. Where to Look Next¶

Read Step-level MDP to connect this tutorial to the core RL formulation.
Read Layered Abstractions to see why this example maps naturally to AgentEnvLoop + ToolEnv.