Datasets, Data Processing, and Algorithms¶

Agent-R1 is designed as a framework rather than a single benchmark implementation. A task recipe only needs to provide the same training data contract, an optional environment/tool configuration, and an agent flow. Once those pieces are present, the same trainer can run multiple RL algorithms by changing Hydra overrides.

This page documents the datasets, preparation scripts, task recipes, and algorithm entry points included in this repository. StepPO is included as a first-class Agent-R1 recipe, while the remaining entries are generic baselines that can run on the same dataset and agent-flow contracts.

Current Runnable Scripts¶

Dataset	Main StepPO Script	Baseline Scripts	Data Preparation
GSM8K Tool	`examples/run_qwen3-4b_gsm8k_tool_steppo.sh`	GRPO, PPO, token GAE, RLOO, REINFORCE++ scripts under `examples/`	`examples/data_preprocess/gsm8k_tool.py`
HotpotQA	`examples/run_hotpotqa_steppo.sh`	`examples/hotpotqa/run_{grpo,gspo,token_adv,rloo,gigpo,reinforce_plus_plus}.sh`	`recipe/hotpotqa/prepare_hotpotqa_agent_r1.py`
Paper Search	`examples/run_papersearch_steppo.sh`	`examples/papersearch/run_{grpo,gspo,token_adv,rloo,gigpo,reinforce_plus_plus}.sh`	Built-in JSONL under `recipe/paper_search/inference/datasets/` + `recipe/paper_search/prepare_paper_search_agent_r1.py`
ALFWorld	`examples/run_alfworld_steppo.sh`	`examples/alfworld/run_{grpo,gspo,token_adv,rloo,gigpo,reinforce_plus_plus}.sh`	`recipe/alfworld/prepare_alfworld_agent_r1.py`
WebShop	`examples/run_webshop_steppo.sh`	`examples/webshop/run_{grpo,gspo,token_adv,rloo,gigpo,reinforce_plus_plus}.sh`	`recipe/webshop/prepare_webshop_agent_r1.py`

The repository includes the Paper Search raw JSONL files from the StepPO recipe:

recipe/paper_search/inference/datasets/AutoScholarQuery/{train,dev,test,test_lt_5}.jsonl
recipe/paper_search/inference/datasets/RealScholarQuery/test.jsonl
recipe/paper_search/inference/datasets/test_case.jsonl

HotpotQA, ALFWorld, and WebShop are larger benchmark/environment datasets. They are prepared through the included scripts but are not committed as generated parquet, retrieval indexes, environment caches, or copied game/product artifacts.

Data Preparation¶

# GSM8K Tool
python3 examples/data_preprocess/gsm8k_tool.py --local_save_dir ~/data/gsm8k_tool

# HotpotQA + retrieval corpus
python3 recipe/hotpotqa/prepare_hotpotqa_agent_r1.py \
  --output_dir data/corpus/hotpotqa \
  --corpus_output_path data/corpus/hotpotqa_corpus/hpqa_corpus.jsonl

# Optional HotpotQA FAISS assets
python3 recipe/hotpotqa/process_hotpotqa.py

# Paper Search from bundled AutoScholarQuery JSONL files
python3 recipe/paper_search/prepare_paper_search_agent_r1.py \
  --input_dir recipe/paper_search/inference/datasets/AutoScholarQuery \
  --output_dir data/pasa

# ALFWorld from local ALFWorld raw data
python3 recipe/alfworld/prepare_alfworld_agent_r1.py \
  --input_dir alfworld_data/json_2.1.1 \
  --output_dir data/alfworld

# WebShop small/full data
python3 recipe/webshop/prepare_webshop_agent_r1.py \
  --dataset_mode small \
  --input_dir webshop_data \
  --output_dir data/webshop

Then choose any supported algorithm script. The StepPO entry points are:

bash examples/run_qwen3-4b_gsm8k_tool.sh
bash examples/run_qwen3-4b_gsm8k_tool_steppo.sh
bash examples/run_hotpotqa_steppo.sh
bash examples/run_papersearch_steppo.sh
bash examples/run_alfworld_steppo.sh
bash examples/run_webshop_steppo.sh

All scripts accept additional Hydra overrides through "$@", for example:

bash examples/run_hotpotqa_steppo.sh \
  actor_rollout_ref.model.path=/path/to/model \
  trainer.n_gpus_per_node=4

Training Data Contract¶

Agent-R1 uses parquet files compatible with the veRL trainer. For agent tasks, each row should include:

Field	Required	Meaning
`data_source`	Yes	Dataset or benchmark name.
`prompt`	Yes	Chat messages passed to the tokenizer and rollout engine.
`ability`	Recommended	Task category used for logging and reward routing.
`reward_model`	Yes	Rule/model reward metadata, usually including `ground_truth`.
`extra_info`	Recommended	Split, index, raw question, raw answer, or task-specific metadata.
`agent_name`	Agent tasks	Agent flow name. `agent_env_loop` is the default runnable flow.
`env_kwargs`	Tool/env tasks	JSON string consumed by `AgentEnvLoop._create_env`.

examples/data_preprocess/gsm8k_tool.py is the smallest complete reference. It stores the environment configuration in env_kwargs:

{
  "env_type": "tool",
  "tools": ["calc_gsm8k_reward"],
  "tool_format": "hermes",
  "tools_kwargs": {"ground_truth": "<answer>"}
}

This is the key extension point for new datasets. A HotpotQA recipe can point to a retrieval tool, a WebShop recipe can point to a web-shopping environment server, and an ALFWorld recipe can point to a text-world environment. The trainer does not need a dataset-specific rewrite as long as the recipe emits the same fields and the configured agent flow knows how to step the environment.

Dataset Recipe Pattern¶

Agent-R1-derived recipes generally use the following layout:

recipe/<task>/
  base.yaml                  # task-specific Hydra defaults
  prepare_<task>_agent_r1.py # raw data -> parquet
  <task>_agent_flow.py       # maps prompts/actions/observations into Agent-R1 steps
  reward_fn.py               # task reward
  prompts.py                 # prompt templates
  utils.py                   # task helpers
  env/                       # optional environment service or wrappers

This repository now includes the following concrete recipes:

Dataset / Environment	Data Processing	Environment / Flow	Notes
HotpotQA	`recipe/hotpotqa/prepare_hotpotqa_agent_r1.py`, `process_hotpotqa.py`, `build_retrieval_corpus.py`, `verify_dataset.py`	`HotpotQAAgentFlow`	Multi-hop QA with local retrieval.
Paper Search	`recipe/paper_search/prepare_paper_search_agent_r1.py`, bundled AutoScholarQuery/RealScholarQuery JSONL files, inference utilities	`PaperSearchAgentFlow`	Academic search and citation expansion.
ALFWorld	`recipe/alfworld/prepare_alfworld_agent_r1.py`	`AlfworldAgentFlow`, ALFWorld wrapper	Text-world embodied household tasks.
WebShop	`recipe/webshop/prepare_webshop_agent_r1.py`, index/artifact builders, environment server	`WebShopAgentFlow`	Web shopping navigation and scoring.

These recipes demonstrate that the Agent-R1 abstraction is not tied to GSM8K. They add task-specific data preparation and environment logic, while reusing the same rollout, reward-loop, and trainer interfaces.

Supported Algorithms¶

Agent-R1 supports StepPO as a composed recipe:

algorithm.adv_estimator=gae
actor_rollout_ref.actor.policy_loss.loss_mode=gspo
actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-mean

The first setting computes credit assignment over the agent step timeline. The GSPO policy loss then uses sequence-level importance ratios for the complete generated action at each step.

The Agent-R1 trainer also routes algorithm.adv_estimator to the following estimators:

`algorithm.adv_estimator`	Granularity	Critic Required	Typical Use
`gae`	Step-level	Yes	PPO-style actor-critic over agent steps.
`token_gae`	Token-level	Yes	Token-level actor-critic baseline for multi-step rollouts.
`grpo`	Trajectory outcome	No	Group-relative outcome optimization.
`rloo`	Trajectory outcome	No	Leave-one-out baseline over multiple rollouts.
`reinforce_plus_plus`	Token return	No	REINFORCE++ baseline with KL in reward.
`reinforce_plus_plus_baseline`	Trajectory outcome	No	REINFORCE++ with prompt-level mean baseline.
`gigpo`	Trajectory + step group	No	Requires `anchor_obs` in the agent flow for step grouping.

The policy objective is controlled separately by actor_rollout_ref.actor.policy_loss.loss_mode. Agent-R1 keeps this axis separate from advantage estimation so that a task recipe can compare different credit-assignment strategies without changing the environment.

For convenience, the examples/run_*_steppo.sh scripts set the StepPO combination directly. Other algorithms can still be selected by overriding algorithm.adv_estimator and, when needed, actor_rollout_ref.actor.policy_loss.loss_mode.