Agent Task Tutorial¶

This tutorial shows the simplest multi-step tool-calling path in Agent-R1: GSM8K + Tool built with the generic AgentEnvLoop, a recipe-local ToolEnv, and a recipe-local BaseTool.

The example uses GSM8K, but the important part is not the benchmark itself. The goal is to show how Agent-R1 turns a dataset row into an environment-driven, multi-step rollout.

What You Will Run¶

This tutorial uses two existing files:

dataset preprocessing: recipes/gsm8k/data_preprocess/process_gsm8k_tool.py
training script: examples/gsm8k/run_steppo_tool.sh

1. Prepare the Tool Dataset¶

Generate the tool-augmented GSM8K dataset:

python3 -m recipes.gsm8k.data_preprocess.process_gsm8k_tool --local_save_dir ~/data/gsm8k_tool

Compared with the single-step sanity-check dataset, this preprocessing script keeps structured task fields for the tool path:

agent_name: "gsm8k_tool"
question and ground_truth, plus a tool-calling prompt stored in prompt
env_kwargs for per-sample tool metadata such as the ground-truth answer

Conceptually, each sample says:

use the configured gsm8k_tool entry, which points to the generic AgentEnvLoop
instantiate the built-in tool environment
expose the calc_gsm8k_reward tool inside that environment

2. Launch the Agent Task Training Script¶

Run:

bash examples/gsm8k/run_steppo_tool.sh

This script switches the rollout from single-step generation to the generic agent-environment loop:

actor_rollout_ref.rollout.agent.default_agent_flow=gsm8k_tool \
actor_rollout_ref.rollout.agent.max_steps=5 \

It also points the trainer to the tool dataset:

data.train_files=$HOME/data/gsm8k_tool/train.parquet \
data.val_files=$HOME/data/gsm8k_tool/test.parquet \

3. What Happens During One Trajectory¶

At a high level, one sample follows this path:

graph TD
    datasetRow["Dataset row"] --> agentFlow["AgentEnvLoop"]
    agentFlow --> toolEnv["ToolEnv"]
    toolEnv --> llmStep["LLM response"]
    llmStep --> toolCall["Tool call parsing"]
    toolCall --> toolExec["BaseTool execution"]
    toolExec --> nextObs["Next observation"]
    nextObs --> agentFlow

More concretely:

AgentEnvLoop reads recipe defaults and per-sample env_kwargs.
AgentEnv.from_config(env_type="tool", ...) creates the built-in ToolEnv.
ToolEnv.reset() uses the chat prompt stored in the dataset row.
The LLM produces a response.
ToolEnv.step() parses tool calls from the response and executes the registered tool.
Tool output is appended to the conversation as the next observation.
The loop continues until the environment returns done=True or max_steps is reached.

4. Where the Reward Comes From¶

The GSM8K tool is registered as calc_gsm8k_reward in recipes/gsm8k/tool.py.

Its role in this example is to:

receive the model's proposed answer
compare it with the sample's ground truth
return tool text back into the conversation

This is what makes the tutorial useful for Agent-R1: the model is not just generating one final answer, it is interacting with an environment that can evaluate and feed back information across multiple steps.

5. Why This Tutorial Is Separate From the Single-Step Script¶

The single-step GSM8K script is still useful as a setup check. This tutorial is different: it is the minimal example for the lowest abstraction layer, where users only define tools and let ToolEnv + BaseTool handle standard multi-turn tool interaction. It demonstrates:

a step-level environment transition
a multi-step agent loop
tool-augmented interaction
reward signals attached to environment-mediated behavior

6. Where to Look Next¶

Read Step-level MDP to connect this tutorial to the core RL formulation.
Read Layered Abstractions to see why this example maps naturally to ToolEnv + BaseTool.
Read Recipes and Algorithms to see the other task recipes and launch scripts.