Reward System

The RewardLoopWorker is a Ray actor that assigns reward scores to trajectory steps. It turns raw agent interactions into reward signals that the Trainer can learn from.

Three Reward Sources

Claw-R1 supports three types of reward computation, which can be combined:

Type	Description	Best For
Rule-based	Deterministic function of step output	Verifiable tasks (math, code execution)
Discriminative RM	Binary classifier reward model	Preference learning, safety evaluation
Generative RM	LLM-based evaluator via custom scoring function	Complex quality assessment, nuanced feedback
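One way the three sources could be combined is a weighted average. The sketch below is illustrative only: `combine_rewards`, the scorer names, and the weights are hypothetical and not part of Claw-R1, and the two model-based scorers are placeholders that return fixed values.

```python
from typing import Callable, Mapping

# Hypothetical scorer type: takes a step dict, returns a score in [0.0, 1.0].
Scorer = Callable[[dict], float]


def combine_rewards(
    step: dict,
    scorers: Mapping[str, Scorer],
    weights: Mapping[str, float],
) -> float:
    """Weighted average of several reward sources, clamped to [0.0, 1.0]."""
    total_weight = sum(weights[name] for name in scorers)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    blended = sum(weights[name] * scorers[name](step) for name in scorers)
    return max(0.0, min(1.0, blended / total_weight))


def rule_based(step: dict) -> float:
    # Deterministic check against a known answer stored in metadata.
    return 1.0 if step["response"].strip() == step["metadata"]["expected"] else 0.0


def discriminative_rm(step: dict) -> float:
    # Placeholder for a classifier-style reward model.
    return 0.5


def generative_rm(step: dict) -> float:
    # Placeholder for an LLM-based evaluator.
    return 0.8


step = {"response": "42", "metadata": {"expected": "42"}}
score = combine_rewards(
    step,
    scorers={"rule": rule_based, "disc": discriminative_rm, "gen": generative_rm},
    weights={"rule": 0.5, "disc": 0.25, "gen": 0.25},
)
```

Giving the rule-based check the largest weight keeps the blended score anchored to the verifiable signal, while the model-based scorers only adjust it.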

Reward in Production vs. Research Settings

In research settings (white-box offline mode), rewards are computed from known ground truth:

Trajectory:   [user msg] → [agent think] → [tool call] → [tool result] → [final reply]
Reward:            0.0            0.3            0.7            0.9            0.8

Rule-based: is the final answer correct? does the code pass tests?
Model-based: is each step logically sound? is the tool use appropriate?

In production settings (online mode), rewards come from real user signals:

Signal	Type	Interpretation
User sends follow-up	Implicit positive	Agent answer was relevant but incomplete
User corrects the agent	Negative feedback	Factual or task error
User says "thanks"	Positive signal	Task completed satisfactorily
No follow-up after task	Neutral / estimated	Reward Model estimates step quality

Claw-R1 uses a Reward Model to convert these soft signals into scalar process rewards, filling the gap between verifiable task rewards and open-ended conversational rewards.
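As a rough sketch of how such signals might be turned into scalars, the snippet below maps the explicit signals from the table to fixed values and falls back to a reward model estimate when there is no follow-up. The `UserSignal` enum, the numeric values, and `signal_to_reward` are assumptions made for illustration; the source does not specify how Claw-R1 performs this conversion.

```python
from enum import Enum
from typing import Callable


class UserSignal(Enum):
    FOLLOW_UP = "follow_up"    # relevant but incomplete
    CORRECTION = "correction"  # factual or task error
    THANKS = "thanks"          # task completed satisfactorily
    NONE = "none"              # no follow-up after task


# Illustrative values only; real values would need tuning per task.
SIGNAL_REWARD = {
    UserSignal.FOLLOW_UP: 0.6,
    UserSignal.CORRECTION: 0.0,
    UserSignal.THANKS: 1.0,
}


def signal_to_reward(
    signal: UserSignal,
    step: dict,
    estimate: Callable[[dict], float],
) -> float:
    """Return a fixed reward for explicit signals, else a model estimate."""
    if signal in SIGNAL_REWARD:
        return SIGNAL_REWARD[signal]
    # No explicit signal: defer to the reward model's estimate of step quality.
    return estimate(step)


# A constant function stands in for the reward model here.
thanks_reward = signal_to_reward(UserSignal.THANKS, {}, lambda s: 0.5)
silent_reward = signal_to_reward(UserSignal.NONE, {}, lambda s: 0.5)
```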

RewardLoopWorker API

`compute_score_batch(steps: list[Step]) → list[float]`

Computes rewards for a batch of steps. This is the primary interface used by the Trainer.

# In AsyncTrainer
rewards = await reward_worker.compute_score_batch.remote(batch_steps)
for step, reward in zip(batch_steps, rewards):
    step.reward = reward
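To try the same pattern without a Ray cluster, a local stand-in can be used. In the sketch below, `StubRewardWorker`, the minimal `Step` dataclass, and `score_and_attach` are hypothetical; the stub is awaited directly rather than through `.remote()`, and the length check is an added safeguard, not documented behavior.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Step:
    # Minimal stand-in for the real Step type.
    response: str
    reward: float = 0.0


class StubRewardWorker:
    """Local stand-in for RewardLoopWorker; no Ray involved."""

    async def compute_score_batch(self, steps: list[Step]) -> list[float]:
        # Toy rule: a non-empty response scores 1.0, an empty one 0.0.
        return [1.0 if s.response else 0.0 for s in steps]


async def score_and_attach(worker: StubRewardWorker, steps: list[Step]) -> list[Step]:
    rewards = await worker.compute_score_batch(steps)
    # zip() silently truncates, so check the lengths match first.
    if len(rewards) != len(steps):
        raise ValueError("expected one reward per step")
    for step, reward in zip(steps, rewards):
        step.reward = reward
    return steps


batch = [Step("ok"), Step("")]
asyncio.run(score_and_attach(StubRewardWorker(), batch))
```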

Custom Reward Function

Define a custom generative reward model as a scoring function that matches the reward_loop_manager interface:

# custom_reward.py
def compute_reward(step: dict, model, tokenizer) -> float:
    """
    Args:
        step: dict with keys 'messages', 'response', 'metadata'
        model: loaded reward model
        tokenizer: model tokenizer
    Returns:
        scalar reward in [0.0, 1.0]
    """
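    # build_evaluation_prompt is a user-supplied helper (not shown here) that
    # renders the step into the evaluator's prompt format.
    prompt = build_evaluation_prompt(step)
    # Clamp so the return value honors the documented [0.0, 1.0] range.
    score = float(model.score(prompt))
    return max(0.0, min(1.0, score))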

Then register it in the configuration:

reward:
  type: genrm
  reward_loop_manager: path.to.custom_reward.compute_reward
  model_path: /path/to/reward/model
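A dotted path such as the `reward_loop_manager` value is commonly resolved with `importlib`. The sketch below shows that general technique; `load_callable` is a hypothetical helper, and the framework's own loader may work differently. It uses `math.sqrt` as the target so the snippet runs without a custom module.

```python
import importlib
from typing import Callable


def load_callable(dotted_path: str) -> Callable:
    """Resolve 'package.module.function' to the function object."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.attr', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    fn = getattr(module, attr)
    if not callable(fn):
        raise TypeError(f"{dotted_path!r} is not callable")
    return fn


# Standard-library target used purely for demonstration.
sqrt = load_callable("math.sqrt")
```

Resolving the path once at startup surfaces typos in the configuration early, before any training step depends on the function.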

Reward in the Training Loop

Reward computation is decoupled from the agent service:

1. The Gateway does not compute rewards before submitting steps to DataPool.
2. DataPool stores steps with reward=0.0 initially.
3. The Trainer calls RewardLoopWorker.compute_score_batch() before the PPO update.
4. Updated rewards are used for advantage computation.

This ensures that even slow generative reward models (which may call an external LLM) do not affect agent service latency.
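The decoupled flow can be sketched with in-memory stand-ins. Everything below is hypothetical: this `DataPool` and `Step` are toy classes, `slow_reward` is a placeholder scorer, and the mean-baseline advantage is a deliberate simplification rather than the estimator PPO normally uses.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    response: str
    reward: float = 0.0  # placeholder until the trainer scores the step


@dataclass
class DataPool:
    steps: list = field(default_factory=list)

    def submit(self, step: Step) -> None:
        # Called on the serving path: no reward computation happens here.
        self.steps.append(step)

    def take_batch(self) -> list:
        batch, self.steps = self.steps, []
        return batch


def slow_reward(step: Step) -> float:
    # Stand-in for an expensive reward model call.
    return min(1.0, len(step.response) / 10)


def trainer_iteration(pool: DataPool, score_fn: Callable[[Step], float]):
    batch = pool.take_batch()
    for step in batch:
        step.reward = score_fn(step)  # scoring happens on the trainer side
    rewards = [s.reward for s in batch]
    baseline = sum(rewards) / len(rewards) if rewards else 0.0
    # Simplified advantage: reward minus the batch mean.
    advantages = [r - baseline for r in rewards]
    return batch, advantages


pool = DataPool()
pool.submit(Step("hello"))
pool.submit(Step("hello world!"))
stored_rewards = [s.reward for s in pool.steps]
batch, advantages = trainer_iteration(pool, slow_reward)
```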

Reward Design

For new tasks, start with simple rule-based rewards (e.g., exact match, code execution pass rate). Generative reward models are more expressive but introduce variance and computational cost. Use discriminative models as a middle ground.
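The two rule-based rewards mentioned above might look like the following minimal sketch. The function names and the test-case format are assumptions; in practice, untrusted generated code should run in a sandbox rather than be called directly as it is here.

```python
from typing import Callable, Sequence


def exact_match_reward(response: str, expected: str) -> float:
    """1.0 if the answers match after trimming whitespace, else 0.0."""
    return 1.0 if response.strip() == expected.strip() else 0.0


def pass_rate_reward(func: Callable, test_cases: Sequence[tuple]) -> float:
    """Fraction of (args, expected) cases the candidate function passes."""
    if not test_cases:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply fails that case.
            pass
    return passed / len(test_cases)


def candidate_add(a, b):
    # Deliberately buggy candidate: wrong whenever a is negative.
    return a + b if a >= 0 else 0


cases = [((1, 2), 3), ((0, 0), 0), ((-1, 3), 2), ((-2, -3), -5)]
match_score = exact_match_reward(" 42\n", "42")
pass_rate = pass_rate_reward(candidate_add, cases)
```

Pass rate gives partial credit, which can provide a denser signal than all-or-nothing exact match on code tasks.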