Configuration Reference
Claw-R1 uses Hydra for hierarchical configuration management. All YAML configs are located in claw_r1/config/.
Config Files
| File | Purpose |
|---|---|
| agent_ppo_trainer.yaml | Base PPO trainer config (extends veRL's ppo_trainer) |
| async_ppo_trainer.yaml | Async-specific overrides for fully-asynchronous training |
| overrides/rollout.yaml | Rollout worker settings (async mode, agent flow) |
async_ppo_trainer.yaml
defaults:
  - agent_ppo_trainer

async_trainer:
  # Off-policy tolerance: steps where policy_version lag > threshold
  # are treated as off-policy and corrected with importance sampling
  staleness_threshold: 0.1

  # Sync model weights from Trainer to Rollouter every N training steps
  trigger_parameter_sync_step: 4

  # Whether to use log-probs collected during rollout for IS ratio computation
  use_rollout_log_probs: true

  # Maximum number of prompt_uid groups held in DataPool
  # null = unlimited (use with caution on memory-constrained machines)
  max_queue_size: null
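As a rough illustration of how two of these settings could be read, the sketch below mirrors the async_trainer block in a small dataclass. The class `AsyncTrainerSettings` and the helpers `should_sync` and `queue_has_room` are hypothetical names introduced here, and the step-counting convention (steps start at 1, sync fires on every N-th step) is an assumption rather than Claw-R1's documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AsyncTrainerSettings:
    """Hypothetical mirror of the async_trainer block (not a Claw-R1 class)."""
    staleness_threshold: float = 0.1
    trigger_parameter_sync_step: int = 4
    use_rollout_log_probs: bool = True
    max_queue_size: Optional[int] = None  # None corresponds to YAML null


def should_sync(step: int, cfg: AsyncTrainerSettings) -> bool:
    # Assumption: steps are counted from 1 and a sync fires on every N-th step.
    return step > 0 and step % cfg.trigger_parameter_sync_step == 0


def queue_has_room(current_groups: int, cfg: AsyncTrainerSettings) -> bool:
    # A null (None) limit is treated as "unlimited".
    return cfg.max_queue_size is None or current_groups < cfg.max_queue_size


cfg = AsyncTrainerSettings()
sync_steps = [s for s in range(1, 13) if should_sync(s, cfg)]
print(sync_steps)  # [4, 8, 12]
```

With the default of 4, weights would be pushed to the Rollouter three times in the first twelve training steps under this convention.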
overrides/rollout.yaml
rollout:
  mode: async # "sync" or "async"

  agent_flow:
    num_workers: 8
    default_agent_flow: single_step_single_turn_agent

  # Custom async server configuration (optional)
  async_server:
    host: "0.0.0.0"
    port: 8000
Model Configuration (BaseModelConfig)
from dataclasses import dataclass


@dataclass
class BaseModelConfig:
    path: str  # HuggingFace model path or local directory
    trust_remote_code: bool = False
Set these fields with Hydra command-line overrides:
python -m claw_r1.async_main \
  trainer.model.path=/path/to/model \
  trainer.model.trust_remote_code=true
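To make the dotted-key syntax concrete, the following minimal sketch shows how overrides of the form `a.b.c=value` map onto a nested config. It is an illustration only: `apply_override` is a hypothetical helper, not Hydra's implementation, and it converts only `true`/`false` to booleans while leaving every other value as a string.

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' override to a nested dict (illustration only)."""
    dotted_key, _, raw_value = override.partition("=")
    keys = dotted_key.split(".")

    # Walk (and create) the intermediate levels: trainer -> model
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})

    # Minimal coercion: only booleans are converted in this sketch
    node[keys[-1]] = {"true": True, "false": False}.get(raw_value.lower(), raw_value)
    return config


config: dict = {}
for item in [
    "trainer.model.path=/path/to/model",
    "trainer.model.trust_remote_code=true",
]:
    apply_override(config, item)

print(config)
# {'trainer': {'model': {'path': '/path/to/model', 'trust_remote_code': True}}}
```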
Checkpoint Configuration (CheckpointConfig)
from dataclasses import dataclass


@dataclass
class CheckpointConfig:
    save_freq: int = 100  # save every N training steps
    save_path: str = "./checkpoints"
    load_path: str | None = None  # resume from checkpoint
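A short sketch of how these fields might drive a save schedule follows. The dataclass is restated so the block runs on its own; `should_save`, `checkpoint_dir`, and the `step_<N>` directory layout are assumptions made for illustration, not Claw-R1's actual checkpoint naming.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    save_freq: int = 100              # save every N training steps
    save_path: str = "./checkpoints"
    load_path: Optional[str] = None   # resume from checkpoint


def should_save(step: int, cfg: CheckpointConfig) -> bool:
    # Assumption: steps start at 1 and a checkpoint is written on every N-th step.
    return step > 0 and step % cfg.save_freq == 0


def checkpoint_dir(step: int, cfg: CheckpointConfig) -> str:
    # Hypothetical layout: one sub-directory per saved step.
    return f"{cfg.save_path.rstrip('/')}/step_{step}"


cfg = CheckpointConfig()
saved = [checkpoint_dir(s, cfg) for s in range(1, 301) if should_save(s, cfg)]
print(saved)
# ['./checkpoints/step_100', './checkpoints/step_200', './checkpoints/step_300']
```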
Reward Configuration
reward:
  type: rule # "rule", "disrm", or "genrm"

  # For rule-based rewards:
  # rule_fn: path.to.reward_function

  # For discriminative reward model:
  # model_path: /path/to/reward/model
  # batch_size: 32

  # For generative reward model:
  # reward_loop_manager: path.to.custom_reward.compute_reward
  # model_path: /path/to/eval/model
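Values such as `rule_fn: path.to.reward_function` are dotted import paths. The sketch below shows one common way such a path can be resolved to a callable, together with a toy rule-based reward. Both `resolve_callable` and `exact_match_reward` are hypothetical, and the `(response, ground_truth)` signature is an assumption; the signature Claw-R1 actually expects from a reward function is not specified here.

```python
import importlib
from typing import Callable


def resolve_callable(dotted_path: str) -> Callable:
    """Import 'package.module.attr' and return the attribute."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attr', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def exact_match_reward(response: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward: 1.0 on an exact match, else 0.0."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0


# A standard-library function stands in for a project-specific reward path.
sqrt = resolve_callable("math.sqrt")
print(sqrt(9.0))                        # 3.0
print(exact_match_reward(" 42 ", "42"))  # 1.0
```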
Gateway Command-line Arguments
The Gateway is configured entirely via CLI arguments (not Hydra), since it runs as an independent process:
# --data-pool-name: Ray actor name for the DataPool
# --vllm-addresses: comma-separated vLLM endpoints, load-balanced
python -m claw_r1.gateway.gateway \
  --data-pool-name data_pool \
  --vllm-addresses host1:8001,host2:8001 \
  --tokenizer-path /path/to/model \
  --prompt-length 4096 \
  --response-length 1024
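For readers who want to see how such flags are typically consumed, here is a minimal argparse sketch that accepts the same five options. It is not the Gateway's actual parser: marking every flag as required and cycling through the addresses round-robin are assumptions, since the text above only states that the addresses are load-balanced.

```python
import argparse
from itertools import cycle


def split_addresses(value: str) -> list:
    """Turn 'host1:8001,host2:8001' into ['host1:8001', 'host2:8001']."""
    return [addr.strip() for addr in value.split(",") if addr.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the Gateway flags")
    parser.add_argument("--data-pool-name", required=True)
    parser.add_argument("--vllm-addresses", required=True, type=split_addresses)
    parser.add_argument("--tokenizer-path", required=True)
    parser.add_argument("--prompt-length", required=True, type=int)
    parser.add_argument("--response-length", required=True, type=int)
    return parser


args = build_parser().parse_args([
    "--data-pool-name", "data_pool",
    "--vllm-addresses", "host1:8001,host2:8001",
    "--tokenizer-path", "/path/to/model",
    "--prompt-length", "4096",
    "--response-length", "1024",
])

# Assumed strategy: simple round-robin over the configured backends.
backends = cycle(args.vllm_addresses)
picked = [next(backends) for _ in range(3)]
print(picked)  # ['host1:8001', 'host2:8001', 'host1:8001']
```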
Multi-GPU Setup
# Separate GPU pools for trainer and rollouter
trainer:
  tensor_model_parallel_size: 8 # 8 GPUs for training
  pipeline_model_parallel_size: 1

rollout:
  tensor_model_parallel_size: 4 # 4 GPUs for inference (vLLM)
  n_gpus_per_node: 4
Resource Pool Separation
Claw-R1 uses Ray's resource group mechanism to ensure Trainer and Rollouter GPUs never overlap. This is configured automatically when using async_ppo_trainer.yaml. See Async Training for details.
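The GPU budget implied by the Multi-GPU example can be checked with a few lines of arithmetic. This sketch assumes a single model replica in each pool (data-parallel size 1), so a pool needs tensor-parallel size times pipeline-parallel size GPUs; the helper names and the GPU id lists are hypothetical.

```python
def gpus_per_replica(tensor_parallel: int, pipeline_parallel: int = 1) -> int:
    """GPUs needed by one model replica under tensor and pipeline parallelism."""
    return tensor_parallel * pipeline_parallel


def pools_are_disjoint(trainer_ids, rollout_ids) -> bool:
    """True when no GPU id appears in both pools."""
    return set(trainer_ids).isdisjoint(rollout_ids)


trainer_gpus = gpus_per_replica(tensor_parallel=8, pipeline_parallel=1)
rollout_gpus = gpus_per_replica(tensor_parallel=4)

# Because the pools never overlap, the totals simply add up.
total_gpus = trainer_gpus + rollout_gpus
print(trainer_gpus, rollout_gpus, total_gpus)  # 8 4 12

# Hypothetical id assignment: ids 0-7 for the Trainer, 8-11 for the Rollouter.
print(pools_are_disjoint(range(0, 8), range(8, 12)))  # True
```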