Production Agent Scenario

The Hidden Assumption in Agentic RL

Almost every Agentic RL framework is built on an implicit assumption:

Training phase ≠ Deployment phase

The standard workflow: train on offline/simulated data → deploy a fixed model → retrain periodically.

This works in research settings, but in production it runs into four fundamental problems:

Problem	Manifestation
Distribution shift	Training data is synthetic; real user requests have a different distribution → capability degradation after deployment
Cold start	A newly deployed model knows nothing about a specific user's habits, tools, or workflows → long "warmup" period
Long-tail tasks	Benchmarks cover common tasks; real users' niche needs cannot be covered by offline training
Environment drift	Tool APIs update, user behavior changes → static models cannot self-adapt

Claw-R1's Core Scenario: Personal Agent Self-Evolution

Claw-R1's first validation scenario is the OpenClaw personal assistant:

Setup:
  User deploys OpenClaw on a Mac Mini, connected to Slack / WeChat / email.
  Every day they interact with OpenClaw via messages:
  scheduling, information retrieval, code assistance, etc.

Traditional approach:
  OpenClaw calls a fixed model (e.g., GPT-4o or Claude 3.5).
  Capabilities do not grow with usage.

Claw-R1 approach:
  1. User message → OpenClaw → Gateway (intercepts LLM call)
  2. Gateway logs each interaction step → DataPool (local)
  3. Reward Model scores each interaction (user satisfaction signals, task completion, etc.)
  4. Training Engine on a remote server continuously consumes DataPool, updates model weights
  5. Updated weights are synced to the model served behind the Gateway; the next call uses the improved model

Result:
  The OpenClaw running on the user's Mac Mini becomes progressively better
  at understanding this specific user over time.
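
The five steps of the Claw-R1 approach can be sketched as a small in-process loop. This is a toy illustration under stated assumptions, not Claw-R1's implementation: the class names mirror the components named above, but every field, method, and signature here (`Step`, `drain_scored`, `run_round`, the version counter standing in for model weights) is hypothetical, and the reward is a placeholder rather than a real Reward Model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One intercepted LLM call (hypothetical record layout)."""
    prompt: str
    response: str
    reward: Optional[float] = None


@dataclass
class DataPool:
    """In-memory stand-in for the local DataPool."""
    steps: List[Step] = field(default_factory=list)

    def add(self, step: Step) -> None:
        self.steps.append(step)

    def drain_scored(self) -> List[Step]:
        """Remove and return every step that already has a reward."""
        scored = [s for s in self.steps if s.reward is not None]
        self.steps = [s for s in self.steps if s.reward is None]
        return scored


class RolloutEngine:
    """Toy model server: its reply only reveals which weights version produced it."""

    def __init__(self, version: int = 1) -> None:
        self.version = version

    def generate(self, prompt: str) -> str:
        return f"v{self.version}: reply to {prompt!r}"


class Gateway:
    """Forwards each LLM call to the engine and logs it as a step."""

    def __init__(self, engine: RolloutEngine, pool: DataPool) -> None:
        self.engine = engine
        self.pool = pool

    def complete(self, prompt: str) -> str:
        response = self.engine.generate(prompt)   # step 1: intercept and forward
        self.pool.add(Step(prompt, response))     # step 2: log the step
        return response


def run_round(gateway: Gateway, prompts: List[str]) -> None:
    """Run one collect/score/train/sync cycle over the given prompts."""
    for prompt in prompts:
        gateway.complete(prompt)
    for step in gateway.pool.steps:               # step 3: placeholder reward
        step.reward = 1.0 if step.response else 0.0
    batch = gateway.pool.drain_scored()           # step 4: trainer consumes the pool
    if batch:
        # Step 5: stand-in for a real gradient update followed by a weight sync.
        gateway.engine.version += 1
```

After `run_round(gateway, ["book a meeting"])`, the pool is empty again and the next `gateway.complete(...)` call is answered by the engine at version 2, which is the "next call uses the improved model" property in miniature.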

Three Requirements Traditional RL Frameworks Cannot Meet

This scenario requires three capabilities that traditional frameworks lack:

① Service continuity

Model weight updates must not interrupt Gateway request handling. In Claw-R1:

The Trainer directly manages the lifecycle of Rollout Engine and Reward Model (wake_up / sleep / weight sync)
The Gateway is a pure HTTP proxy — it only forwards requests and submits steps; it does not manage any engine lifecycle
This guarantees continuous request forwarding and data collection even during weight updates
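
One way to picture this split of responsibilities is the sketch below. It assumes two engine replicas that the Trainer updates one at a time. That rolling-update scheme is an assumption made for illustration and is not described in this document. The hook names `sleep`, `wake_up`, and weight sync come from the text above, but their signatures and everything else here are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


class RolloutEngine:
    """Toy engine exposing the lifecycle hooks named above (signatures assumed)."""

    def __init__(self, name: str, version: int = 1) -> None:
        self.name = name
        self.version = version
        self.awake = True

    def sleep(self) -> None:
        self.awake = False

    def wake_up(self) -> None:
        self.awake = True

    def sync_weights(self, version: int) -> None:
        if self.awake:
            raise RuntimeError("sync_weights requires a sleeping engine")
        self.version = version

    def generate(self, prompt: str) -> str:
        if not self.awake:
            raise RuntimeError(f"{self.name} is asleep")
        return f"{self.name}@v{self.version}: {prompt}"


class Gateway:
    """Pure proxy: forwards requests and records steps, never calls lifecycle hooks."""

    def __init__(self, engines: List[RolloutEngine]) -> None:
        self.engines = engines
        self.steps: List[Tuple[str, str]] = []

    def forward(self, prompt: str) -> str:
        for engine in self.engines:
            try:
                response = engine.generate(prompt)
            except RuntimeError:
                continue                      # replica unavailable, try the next one
            self.steps.append((prompt, response))
            return response
        raise RuntimeError("no engine available")


class Trainer:
    """Owns the engine lifecycle and updates one replica at a time."""

    def __init__(self, engines: List[RolloutEngine]) -> None:
        self.engines = engines

    def rolling_update(
        self, version: int, during_update: Optional[Callable[[], None]] = None
    ) -> None:
        for engine in self.engines:
            engine.sleep()
            if during_update is not None:
                during_update()               # simulate traffic arriving mid-update
            engine.sync_weights(version)
            engine.wake_up()
```

Passing `during_update=lambda: gateway.forward("ping")` shows the point of the design: each request issued while one replica sleeps is still answered by the other and still recorded as a step, and the Gateway code never touches a lifecycle method.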

② No preset data

Traditional frameworks require training inputs prepared in advance: a pre-collected dataset (such as an SFT corpus) or a pre-built RL environment. Claw-R1's training data comes entirely from live user interactions:

What questions the user asked, how the agent answered, which tools were called — all of this becomes training data automatically
Zero data engineering; data accumulates naturally as the service runs
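
As a rough illustration of how one logged exchange could become a training record, the sketch below flattens a request/response pair into the three items listed above: the question, the answer, and the tools called. It assumes the Gateway sees OpenAI-style chat-completions payloads, which the `/v1/chat/completions` path used on this page suggests but does not confirm. The function name `extract_step` and the output field names are hypothetical.

```python
import json
from typing import Any, Dict, List


def extract_step(request: Dict[str, Any], response: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one chat-completions exchange into a training record."""
    message = response["choices"][0]["message"]
    # The most recent user turn is treated as "the question".
    user_turns = [m["content"] for m in request["messages"] if m["role"] == "user"]
    tool_calls: List[Dict[str, Any]] = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI-style format, arguments arrive as a JSON-encoded string.
        tool_calls.append({"name": fn["name"], "arguments": json.loads(fn["arguments"])})
    return {
        "question": user_turns[-1] if user_turns else None,
        "answer": message.get("content"),
        "tool_calls": tool_calls,
    }
```

Because the record is derived purely from traffic the Gateway already forwards, no separate collection step is needed. A real record would presumably also carry trajectory identifiers and timestamps, which are omitted here.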

③ Reward signals from the real environment

In traditional RLVR (reinforcement learning with verifiable rewards), rewards come from verifiable task outcomes (does the code run? is the math answer correct?). Production rewards are more nuanced:

User follows up → implicit positive signal
User corrects the agent → negative feedback
Task completed with no follow-up → Reward Model estimates quality of intermediate steps

Claw-R1 uses a Reward Model to convert these soft rewards into trainable process rewards, bridging the gap between "verifiable tasks" and "real conversational tasks".
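
A minimal sketch of that conversion follows, assuming the Reward Model returns one quality score in [0, 1] per step. The `Feedback` categories mirror the three cases above, but the numeric outcome values, the rescaling to [-1, 1], and the choice to credit the user's reaction to the final step are all placeholder assumptions, not Claw-R1's actual reward design.

```python
from enum import Enum
from typing import List, Optional


class Feedback(Enum):
    FOLLOW_UP = "follow_up"      # user follows up: implicit positive signal
    CORRECTION = "correction"    # user corrects the agent: negative feedback
    NONE = "none"                # task ended with no follow-up: no explicit signal


# Placeholder outcome values, chosen only for illustration.
OUTCOME = {Feedback.FOLLOW_UP: 1.0, Feedback.CORRECTION: -1.0, Feedback.NONE: None}


def process_rewards(feedback: Feedback, rm_scores: List[float]) -> List[float]:
    """Turn per-step Reward Model scores in [0, 1] into per-step rewards.

    Each score is rescaled to [-1, 1]. Any explicit user signal is then
    added to the final step, the one the user actually reacted to.
    """
    rewards = [2.0 * s - 1.0 for s in rm_scores]
    outcome: Optional[float] = OUTCOME[feedback]
    if rewards and outcome is not None:
        rewards[-1] += outcome
    return rewards
```

With no explicit signal, the result depends on the Reward Model's step estimates alone, matching the third case above. With a correction, a final step the Reward Model rated highly is pulled back down by the user's negative feedback.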

Three Operating Modes

Mode	Agent type	Data source	Notes
White-box offline	AgentFlow (Python)	Synthetic dataset or pre-collected trajectories	Fully implemented; recommended for research
Black-box offline	Any HTTP agent	Pre-collected logs	Gateway endpoint reserved
Black-box online	Any HTTP agent	Live user interactions	Gateway endpoint reserved; target production mode

Current Status

White-box offline mode is fully implemented. The black-box online endpoints (/complete_trajectory, /{traj_uid}/{prompt_uid}/v1/chat/completions) are designed and stubbed, and are under active development.
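
Because the trajectory and prompt identifiers sit in the URL path, an HTTP agent could in principle be pointed at the Gateway by changing only its base URL. The sketch below just builds the two URLs from the path templates quoted above. The Gateway address, the helper names, and the use of client-generated UUIDs are assumptions, and since the endpoints are still stubs, their request and response formats are deliberately not shown.

```python
import uuid
from urllib.parse import quote


def chat_completions_url(gateway_base: str, traj_uid: str, prompt_uid: str) -> str:
    """Fill in the /{traj_uid}/{prompt_uid}/v1/chat/completions path template."""
    base = gateway_base.rstrip("/")
    # Percent-encode the IDs so they cannot introduce extra path segments.
    return f"{base}/{quote(traj_uid, safe='')}/{quote(prompt_uid, safe='')}/v1/chat/completions"


def complete_trajectory_url(gateway_base: str) -> str:
    """Build the URL of the /complete_trajectory endpoint."""
    return f"{gateway_base.rstrip('/')}/complete_trajectory"


if __name__ == "__main__":
    GATEWAY = "http://localhost:8000"   # hypothetical local Gateway address
    traj_uid = uuid.uuid4().hex         # assumption: the client picks the IDs
    prompt_uid = uuid.uuid4().hex
    print(chat_completions_url(GATEWAY, traj_uid, prompt_uid))
    print(complete_trajectory_url(GATEWAY))
```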

Deployment = Training

Claw-R1 introduces a new paradigm:

┌─────────────────────────────────────────────────────┐
│         Traditional: Train → Deploy (fixed)          │
│                                                      │
│  [Synthetic data] → [Train] → [Fixed model] → Users  │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│         Claw-R1: Deploy = Train (continuous)         │
│                                                      │
│  Users ──► Agent ──► [Live data] ──► Train ──► Agent │
│            ▲___________________________________|      │
└─────────────────────────────────────────────────────┘

In this paradigm:

Every user interaction is a training sample
Every model update is driven by real usage and targets the agent's real-world performance
The longer the agent runs, the better it becomes for its specific users and environment