Agent Task Tutorial¶
This tutorial shows the simplest multi-step tool-calling path in Agent-R1: GSM8K + Tool built with the generic AgentEnvLoop, a recipe-local ToolEnv, and a recipe-local BaseTool.
The example uses GSM8K, but the important part is not the benchmark itself. The goal is to show how Agent-R1 turns a dataset row into an environment-driven, multi-step rollout.
What You Will Run¶
This tutorial uses two existing files:
- dataset preprocessing:
recipes/gsm8k/data_preprocess/process_gsm8k_tool.py - training script:
examples/gsm8k/run_steppo_tool.sh
1. Prepare the Tool Dataset¶
Generate the tool-augmented GSM8K dataset:
Compared with the single-step sanity-check dataset, this preprocessing script keeps structured task fields for the tool path:
agent_name: "gsm8k_tool"questionandground_truth, plus a tool-calling prompt stored inpromptenv_kwargsfor per-sample tool metadata such as the ground-truth answer
Conceptually, each sample says:
- use the configured
gsm8k_toolentry, which points to the genericAgentEnvLoop - instantiate the built-in tool environment
- expose the
calc_gsm8k_rewardtool inside that environment
2. Launch the Agent Task Training Script¶
Run:
This script switches the rollout from single-step generation to the generic agent-environment loop:
actor_rollout_ref.rollout.agent.default_agent_flow=gsm8k_tool \
actor_rollout_ref.rollout.agent.max_steps=5 \
It also points the trainer to the tool dataset:
data.train_files=$HOME/data/gsm8k_tool/train.parquet \
data.val_files=$HOME/data/gsm8k_tool/test.parquet \
3. What Happens During One Trajectory¶
At a high level, one sample follows this path:
graph TD
datasetRow["Dataset row"] --> agentFlow["AgentEnvLoop"]
agentFlow --> toolEnv["ToolEnv"]
toolEnv --> llmStep["LLM response"]
llmStep --> toolCall["Tool call parsing"]
toolCall --> toolExec["BaseTool execution"]
toolExec --> nextObs["Next observation"]
nextObs --> agentFlow
More concretely:
AgentEnvLoopreads recipe defaults and per-sampleenv_kwargs.AgentEnv.from_config(env_type="tool", ...)creates the built-inToolEnv.ToolEnv.reset()uses the chat prompt stored in the dataset row.- The LLM produces a response.
ToolEnv.step()parses tool calls from the response and executes the registered tool.- Tool output is appended to the conversation as the next observation.
- The loop continues until the environment returns
done=Trueormax_stepsis reached.
4. Where the Reward Comes From¶
The GSM8K tool is registered as calc_gsm8k_reward in recipes/gsm8k/tool.py.
Its role in this example is to:
- receive the model's proposed answer
- compare it with the sample's ground truth
- return tool text back into the conversation
This is what makes the tutorial useful for Agent-R1: the model is not just generating one final answer, it is interacting with an environment that can evaluate and feed back information across multiple steps.
5. Why This Tutorial Is Separate From the Single-Step Script¶
The single-step GSM8K script is still useful as a setup check. This tutorial is different: it is the minimal example for the lowest abstraction layer, where users only define tools and let ToolEnv + BaseTool handle standard multi-turn tool interaction. It demonstrates:
- a step-level environment transition
- a multi-step agent loop
- tool-augmented interaction
- reward signals attached to environment-mediated behavior
6. Where to Look Next¶
- Read
Step-level MDPto connect this tutorial to the core RL formulation. - Read
Layered Abstractionsto see why this example maps naturally toToolEnv + BaseTool. - Read
Recipes and Algorithmsto see the other task recipes and launch scripts.