Recipes and Algorithms
Agent-R1 recipes are task-level integrations. A recipe usually owns the prompts, rollout-time agent flow, reward logic, environment wrapper, data preprocessing entry points, and task-specific dependency notes needed to run a benchmark without mixing that code into the core framework.
The launch scripts live under examples/. The intended pattern is:
examples/<task>/run_ppo.sh
examples/<task>/run_steppo.sh
examples/<task>/run_grpo.sh
examples/<task>/run_rloo.sh
examples/<task>/run_reinforce.sh
examples/<task>/run_gspo.sh
examples/<task>/run_gigpo.sh
Not every task needs every script. PPO and StepPO use their own launch configuration, while several other algorithm variants share the GRPO-style script structure with different Hydra overrides.
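Because not every task ships every script, it can help to check which launch scripts a task actually provides before picking an algorithm. The following is a minimal sketch, not part of Agent-R1: the helper name `available_scripts` and its `examples_dir` parameter are hypothetical, and the algorithm list is copied from the script pattern above.

```python
from pathlib import Path
from typing import Dict

# Algorithm suffixes taken from the run_<algorithm>.sh pattern listed above.
ALGORITHMS = ["ppo", "steppo", "grpo", "rloo", "reinforce", "gspo", "gigpo"]


def available_scripts(task: str, examples_dir: str = "examples") -> Dict[str, Path]:
    """Map each algorithm name to its launch script, skipping missing ones.

    Hypothetical helper for illustration; Agent-R1 does not provide it.
    """
    task_dir = Path(examples_dir) / task
    found: Dict[str, Path] = {}
    for algo in ALGORITHMS:
        script = task_dir / f"run_{algo}.sh"
        if script.is_file():
            found[algo] = script
    return found
```

Calling `available_scripts("gsm8k")` from the repository root would list only the scripts that match the `run_<algorithm>.sh` naming. Task-specific extras such as run_steppo_tool.sh are deliberately not matched by this sketch.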
GSM8K is intentionally kept as a lightweight test recipe. It provides only examples/gsm8k/run_steppo.sh for the single-turn sanity check and examples/gsm8k/run_steppo_tool.sh for the multi-turn ToolEnv check.
Recipe Layout
Most recipes follow this shape:
recipes/<task>/
├── README.md
├── requirements.txt
├── base.yaml
├── prompts.py
├── reward_fn.py
├── <task>_agent_flow.py
├── data_preprocess/
│   └── process_<task>.py
└── env/
    └── ...
reward_fn.py is present when the task has a standalone rule or model reward function. Some online environments compute rewards during interaction instead.
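To make the idea of a standalone rule reward concrete, here is a minimal sketch of what such a function could look like for a numeric-answer task. Everything in it is an assumption for illustration: the names `rule_reward` and `extract_last_number`, the `(response, ground_truth)` signature, and the "last number in the text" heuristic are hypothetical and do not describe the interface Agent-R1's reward_fn.py files actually use.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators such as 1,234.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text: str) -> Optional[str]:
    """Return the last number found in the text, with commas removed."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def rule_reward(response: str, ground_truth: str) -> float:
    """Hypothetical exact-match reward: 1.0 if the final numbers agree, else 0.0."""
    predicted = extract_last_number(response)
    expected = extract_last_number(ground_truth)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if float(predicted) == float(expected) else 0.0
```

A rule like this needs no model call, which is why it can live in a standalone file. Recipes whose environments score the trajectory during interaction would not need it.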
Datasets and Environments
| Recipe | What it covers | Main entry points |
|---|---|---|
| gsm8k | Grade-school math reasoning. Plain GSM8K is kept as a single-turn sanity check, while GSM8K + Tool is the minimal ToolEnv + BaseTool example with recipe-local calc_gsm8k_reward. | data_preprocess/process_gsm8k.py, data_preprocess/process_gsm8k_tool.py, tool.py |
| hotpotqa | Multi-hop question answering with a retrieval environment. The recipe keeps preprocessing separate from retrieval-index construction. | data_preprocess/process_hotpotqa.py, env/build_retrieval_corpus.py, env/build_index.py, hotpotqa_agent_flow.py |
| alfworld | Text-based household task completion through an ALFWorld-style environment wrapper and tool executor. | data_preprocess/process_alfworld.py, env/alfworld_wrapper.py, env/tool_executor.py, alfworld_agent_flow.py |
| webshop | Shopping-agent training with a local WebShop environment server and catalog artifacts. | data_preprocess/process_webshop.py, env/run_env_server.sh, env/full_catalog.py, webshop_agent_flow.py |
| paper_search | Academic paper-search agents that query a paper service, expand citation/reference neighborhoods, and use a selector service for scoring. | paper_search_agent_flow.py, runtime.py, env/paper_client.py, inference/run.py, inference/evaluation.py |
Processed datasets are available from the Agent-R1 data release on ModelScope. After downloading the release, place or symlink each task's files to the paths listed in its recipe README. Recipe preprocessing scripts are still kept for local regeneration or format inspection when public raw data is available.
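The symlink step can be scripted so that a stale link is replaced but real files are never overwritten. This is a minimal sketch under stated assumptions: the helper `link_dataset` is hypothetical, and the actual source and destination paths must come from the downloaded release and the recipe README, which this page does not list.

```python
from pathlib import Path


def link_dataset(downloaded: str, expected: str) -> Path:
    """Symlink a downloaded file or directory to the path a recipe expects.

    Hypothetical helper; both paths are supplied by the caller.
    """
    source = Path(downloaded).resolve()
    target = Path(expected)
    if not source.exists():
        raise FileNotFoundError(f"downloaded data not found: {source}")
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink():
        target.unlink()  # replace a stale link, never a real file
    elif target.exists():
        raise FileExistsError(f"refusing to overwrite existing path: {target}")
    target.symlink_to(source, target_is_directory=source.is_dir())
    return target
```

Linking rather than copying keeps a single copy of the release on disk, which matters when several recipes or checkouts share the same data.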
Algorithms
| Algorithm | Script | Notes |
|---|---|---|
| PPO | run_ppo.sh | Actor-critic baseline. |
| GRPO | run_grpo.sh | Group-relative policy optimization baseline used as the common script base for several variants. |
| StepPO | run_steppo.sh | Step-aligned policy optimization for multi-step agent trajectories. |
| RLOO | run_rloo.sh | Leave-one-out baseline variant. |
| REINFORCE | run_reinforce.sh | Step-level critic-free policy-gradient variant. |
| GSPO | run_gspo.sh | GRPO-family algorithm variant exposed through script overrides. |
| GiGPO | run_gigpo.sh | GRPO-family algorithm variant exposed through script overrides. |
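The difference between the group-relative and leave-one-out baselines is easiest to see on a single group of rollout rewards for one prompt. The sketch below uses the textbook formulations and is not Agent-R1's implementation, which may differ in normalization, in the standard-deviation estimator, or in how advantages are assigned to steps and tokens. The function names and the `eps` parameter are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: standardize each reward within its group.

    Uses the population standard deviation; other estimators are possible.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantage: subtract the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

Both are critic-free: the baseline comes from sibling rollouts of the same prompt rather than from a learned value model, which is what separates them from the actor-critic PPO row in the table.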
Where to Start
- Use recipes/<task>/README.md for task-specific dependencies, resource expectations, and environment setup.
- Download processed data from ModelScope before running the full recipe scripts.
- Use examples/<task>/run_*.sh for launch commands and Hydra overrides.
- Use recipes/<task>/base.yaml to see the recipe-local rollout configuration.
- Use data_preprocess/process_*.py only when you need to regenerate or inspect dataset formatting.