Datasets, Data Processing, and Algorithms¶
Agent-R1 is designed as a framework rather than a single benchmark implementation. A task recipe only needs to provide the same training data contract, an optional environment/tool configuration, and an agent flow. Once those pieces are present, the same trainer can run multiple RL algorithms by changing Hydra overrides.
This page documents the datasets, preparation scripts, task recipes, and algorithm entry points included in this repository. StepPO is included as a first-class Agent-R1 recipe, while the remaining entries are generic baselines that can run on the same dataset and agent-flow contracts.
Current Runnable Scripts¶
| Dataset | Main StepPO Script | Baseline Scripts | Data Preparation |
|---|---|---|---|
| GSM8K Tool | examples/run_qwen3-4b_gsm8k_tool_steppo.sh |
GRPO, PPO, token GAE, RLOO, REINFORCE++ scripts under examples/ |
examples/data_preprocess/gsm8k_tool.py |
| HotpotQA | examples/run_hotpotqa_steppo.sh |
examples/hotpotqa/run_{grpo,gspo,token_adv,rloo,gigpo,reinforce_plus_plus}.sh |
recipe/hotpotqa/prepare_hotpotqa_agent_r1.py |
| Paper Search | examples/run_papersearch_steppo.sh |
examples/papersearch/run_{grpo,gspo,token_adv,rloo,gigpo,reinforce_plus_plus}.sh |
Built-in JSONL under recipe/paper_search/inference/datasets/ + recipe/paper_search/prepare_paper_search_agent_r1.py |
| ALFWorld | examples/run_alfworld_steppo.sh |
examples/alfworld/run_{grpo,gspo,token_adv,rloo,gigpo,reinforce_plus_plus}.sh |
recipe/alfworld/prepare_alfworld_agent_r1.py |
| WebShop | examples/run_webshop_steppo.sh |
examples/webshop/run_{grpo,gspo,token_adv,rloo,gigpo,reinforce_plus_plus}.sh |
recipe/webshop/prepare_webshop_agent_r1.py |
The repository includes the Paper Search raw JSONL files from the StepPO recipe:
recipe/paper_search/inference/datasets/AutoScholarQuery/{train,dev,test,test_lt_5}.jsonlrecipe/paper_search/inference/datasets/RealScholarQuery/test.jsonlrecipe/paper_search/inference/datasets/test_case.jsonl
HotpotQA, ALFWorld, and WebShop are larger benchmark/environment datasets. They are prepared through the included scripts but are not committed as generated parquet, retrieval indexes, environment caches, or copied game/product artifacts.
Data Preparation¶
# GSM8K Tool
python3 examples/data_preprocess/gsm8k_tool.py --local_save_dir ~/data/gsm8k_tool
# HotpotQA + retrieval corpus
python3 recipe/hotpotqa/prepare_hotpotqa_agent_r1.py \
--output_dir data/corpus/hotpotqa \
--corpus_output_path data/corpus/hotpotqa_corpus/hpqa_corpus.jsonl
# Optional HotpotQA FAISS assets
python3 recipe/hotpotqa/process_hotpotqa.py
# Paper Search from bundled AutoScholarQuery JSONL files
python3 recipe/paper_search/prepare_paper_search_agent_r1.py \
--input_dir recipe/paper_search/inference/datasets/AutoScholarQuery \
--output_dir data/pasa
# ALFWorld from local ALFWorld raw data
python3 recipe/alfworld/prepare_alfworld_agent_r1.py \
--input_dir alfworld_data/json_2.1.1 \
--output_dir data/alfworld
# WebShop small/full data
python3 recipe/webshop/prepare_webshop_agent_r1.py \
--dataset_mode small \
--input_dir webshop_data \
--output_dir data/webshop
Then choose any supported algorithm script. The StepPO entry points are:
bash examples/run_qwen3-4b_gsm8k_tool.sh
bash examples/run_qwen3-4b_gsm8k_tool_steppo.sh
bash examples/run_hotpotqa_steppo.sh
bash examples/run_papersearch_steppo.sh
bash examples/run_alfworld_steppo.sh
bash examples/run_webshop_steppo.sh
All scripts accept additional Hydra overrides through "$@", for example:
bash examples/run_hotpotqa_steppo.sh \
actor_rollout_ref.model.path=/path/to/model \
trainer.n_gpus_per_node=4
Training Data Contract¶
Agent-R1 uses parquet files compatible with the veRL trainer. For agent tasks, each row should include:
| Field | Required | Meaning |
|---|---|---|
data_source |
Yes | Dataset or benchmark name. |
prompt |
Yes | Chat messages passed to the tokenizer and rollout engine. |
ability |
Recommended | Task category used for logging and reward routing. |
reward_model |
Yes | Rule/model reward metadata, usually including ground_truth. |
extra_info |
Recommended | Split, index, raw question, raw answer, or task-specific metadata. |
agent_name |
Agent tasks | Agent flow name. agent_env_loop is the default runnable flow. |
env_kwargs |
Tool/env tasks | JSON string consumed by AgentEnvLoop._create_env. |
examples/data_preprocess/gsm8k_tool.py is the smallest complete reference. It stores the environment configuration in env_kwargs:
{
"env_type": "tool",
"tools": ["calc_gsm8k_reward"],
"tool_format": "hermes",
"tools_kwargs": {"ground_truth": "<answer>"}
}
This is the key extension point for new datasets. A HotpotQA recipe can point to a retrieval tool, a WebShop recipe can point to a web-shopping environment server, and an ALFWorld recipe can point to a text-world environment. The trainer does not need a dataset-specific rewrite as long as the recipe emits the same fields and the configured agent flow knows how to step the environment.
Dataset Recipe Pattern¶
Agent-R1-derived recipes generally use the following layout:
recipe/<task>/
base.yaml # task-specific Hydra defaults
prepare_<task>_agent_r1.py # raw data -> parquet
<task>_agent_flow.py # maps prompts/actions/observations into Agent-R1 steps
reward_fn.py # task reward
prompts.py # prompt templates
utils.py # task helpers
env/ # optional environment service or wrappers
This repository now includes the following concrete recipes:
| Dataset / Environment | Data Processing | Environment / Flow | Notes |
|---|---|---|---|
| HotpotQA | recipe/hotpotqa/prepare_hotpotqa_agent_r1.py, process_hotpotqa.py, build_retrieval_corpus.py, verify_dataset.py |
HotpotQAAgentFlow |
Multi-hop QA with local retrieval. |
| Paper Search | recipe/paper_search/prepare_paper_search_agent_r1.py, bundled AutoScholarQuery/RealScholarQuery JSONL files, inference utilities |
PaperSearchAgentFlow |
Academic search and citation expansion. |
| ALFWorld | recipe/alfworld/prepare_alfworld_agent_r1.py |
AlfworldAgentFlow, ALFWorld wrapper |
Text-world embodied household tasks. |
| WebShop | recipe/webshop/prepare_webshop_agent_r1.py, index/artifact builders, environment server |
WebShopAgentFlow |
Web shopping navigation and scoring. |
These recipes demonstrate that the Agent-R1 abstraction is not tied to GSM8K. They add task-specific data preparation and environment logic, while reusing the same rollout, reward-loop, and trainer interfaces.
Supported Algorithms¶
Agent-R1 supports StepPO as a composed recipe:
algorithm.adv_estimator=gae
actor_rollout_ref.actor.policy_loss.loss_mode=gspo
actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-mean
The first setting computes credit assignment over the agent step timeline. The GSPO policy loss then uses sequence-level importance ratios for the complete generated action at each step.
The Agent-R1 trainer also routes algorithm.adv_estimator to the following estimators:
algorithm.adv_estimator |
Granularity | Critic Required | Typical Use |
|---|---|---|---|
gae |
Step-level | Yes | PPO-style actor-critic over agent steps. |
token_gae |
Token-level | Yes | Token-level actor-critic baseline for multi-step rollouts. |
grpo |
Trajectory outcome | No | Group-relative outcome optimization. |
rloo |
Trajectory outcome | No | Leave-one-out baseline over multiple rollouts. |
reinforce_plus_plus |
Token return | No | REINFORCE++ baseline with KL in reward. |
reinforce_plus_plus_baseline |
Trajectory outcome | No | REINFORCE++ with prompt-level mean baseline. |
gigpo |
Trajectory + step group | No | Requires anchor_obs in the agent flow for step grouping. |
The policy objective is controlled separately by actor_rollout_ref.actor.policy_loss.loss_mode. Agent-R1 keeps this axis separate from advantage estimation so that a task recipe can compare different credit-assignment strategies without changing the environment.
For convenience, the examples/run_*_steppo.sh scripts set the StepPO combination directly. Other algorithms can still be selected by overriding algorithm.adv_estimator and, when needed, actor_rollout_ref.actor.policy_loss.loss_mode.