Quick Start
This guide walks you through two minimal examples: a white-box agent (an AgentFlow managed by Claw-R1) and a black-box agent (any HTTP-based agent, with no code changes beyond pointing its base_url at the Gateway).
White-box Mode
In white-box mode, Claw-R1 owns the agent loop via AgentFlowBase. The framework can observe token IDs, compute rewards internally, and submit steps directly.
1. Start the Gateway
python -m claw_r1.gateway.gateway \
    --data-pool-name data_pool \
    --vllm-addresses http://localhost:8001 \
    --tokenizer-path /path/to/your/model \
    --prompt-length 4096 \
    --response-length 1024
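Before launching training, it can help to confirm that the Gateway is accepting connections. The sketch below is a generic TCP readiness check and is not part of Claw-R1. `wait_for_port` is a hypothetical helper name, and the address you pass to it is an assumption: the black-box example in this guide talks to `localhost:8000`, but your Gateway may listen elsewhere.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8000)` blocks for up to 30 seconds and reports whether something is listening on that port. It only checks the socket, not that the process behind it is the Gateway.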
2. Launch Training
Use the provided Hydra configuration:
python -m claw_r1.main_agent_ppo \
    --config-name agent_ppo_trainer \
    trainer.model.path=/path/to/your/model \
    trainer.project_name=claw_r1_quickstart
3. Monitor
Once the Ray cluster is initialized, the Ray Dashboard is available at http://localhost:8265 by default.
Black-box Mode
In black-box mode, your existing agent needs only one change: redirect its base_url to the Gateway.
Any Python agent (OpenAI SDK)
from openai import OpenAI
import uuid

traj_uid = str(uuid.uuid4())    # unique per conversation
prompt_uid = str(uuid.uuid4())  # unique per prompt group (for GRPO)

# Before
# client = OpenAI(base_url="https://api.openai.com/v1")

# After: single-line change
client = OpenAI(
    base_url=f"http://localhost:8000/{traj_uid}/{prompt_uid}",
    api_key="not-used",
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello, what can you do?"}],
)
print(response.choices[0].message.content)
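The comment "unique per prompt group (for GRPO)" suggests that several rollouts of the same prompt share one prompt_uid while each keeps its own traj_uid. The sketch below builds the base URLs for such a group under that reading. `grpo_base_urls` is a hypothetical helper rather than a Claw-R1 function, and the only thing taken from this guide is the `/{traj_uid}/{prompt_uid}` path layout.

```python
import uuid


def grpo_base_urls(gateway: str, group_size: int) -> list[str]:
    """Build one base_url per rollout; all rollouts share a single prompt_uid."""
    prompt_uid = str(uuid.uuid4())  # shared by the whole group
    urls = []
    for _ in range(group_size):
        traj_uid = str(uuid.uuid4())  # fresh per conversation
        urls.append(f"{gateway.rstrip('/')}/{traj_uid}/{prompt_uid}")
    return urls


# Assumed Gateway address, matching the example above.
urls = grpo_base_urls("http://localhost:8000", group_size=4)
for url in urls:
    print(url)  # pass each one as base_url to a separate OpenAI client
```

Each URL would then be passed to its own `OpenAI(base_url=..., api_key="not-used")` client, so that every rollout in the group is recorded as a separate trajectory.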
OpenClaw
In your OpenClaw configuration file, change the LLM base URL so that it points at the Gateway.
That's it. Every LLM call OpenClaw makes will be transparently intercepted, logged to DataPool, and used for training, with no interruption to service.
Async Training Mode
For production scenarios where the agent serves requests continuously, use the fully-async entry point:
python -m claw_r1.async_main \
    --config-name async_ppo_trainer \
    trainer.model.path=/path/to/your/model
The async runner automatically:
- Initializes the DataPool Ray actor
- Starts the AsyncRollouter on the rollout GPU pool
- Starts the AsyncTrainer on the training GPU pool
- Starts a ParameterSynchronizer that periodically pushes updated weights to the rollout vLLM servers
- Starts the Gateway as an independent FastAPI process
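To make the "periodically pushes updated weights" step concrete, here is a toy simulation in plain Python. It is not Claw-R1 code: `ToyRolloutServer`, `ToySynchronizer`, and the `sync_every` parameter are hypothetical names, and the real ParameterSynchronizer may use a different trigger and transport. The sketch only illustrates that, with periodic syncing, rollout servers can lag the trainer by a few weight versions between pushes.

```python
from dataclasses import dataclass


@dataclass
class ToyRolloutServer:
    """Stand-in for a rollout server; it only remembers its weight version."""
    weight_version: int = 0


@dataclass
class ToySynchronizer:
    """Pushes the trainer's latest weight version every `sync_every` steps."""
    servers: list
    sync_every: int = 4
    pushes: int = 0

    def maybe_sync(self, step: int, trainer_version: int) -> bool:
        # Only push on steps that are a positive multiple of sync_every.
        if step == 0 or step % self.sync_every != 0:
            return False
        for server in self.servers:
            server.weight_version = trainer_version
        self.pushes += 1
        return True


servers = [ToyRolloutServer(), ToyRolloutServer()]
sync = ToySynchronizer(servers=servers, sync_every=4)

trainer_version = 0
for step in range(1, 11):
    trainer_version += 1  # one optimizer step produces new weights
    pushed = sync.maybe_sync(step, trainer_version)
    lag = trainer_version - servers[0].weight_version
    print(f"step {step}: pushed={pushed}, rollout lag={lag}")
```

In this toy, a smaller `sync_every` keeps rollouts closer to the current policy at the cost of more frequent weight transfers.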
No dataset required
In async online mode, the training data comes entirely from live agent interactions. No pre-collected dataset is needed — the DataPool fills automatically as the agent handles real requests.
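The idea of a pool that fills from live traffic, rather than from a dataset on disk, can be sketched with a simple in-memory queue. This is an illustration only: `ToyDataPool`, `submit`, and `take_batch` are hypothetical names, and the actual DataPool is a Ray actor whose interface is not shown in this guide.

```python
import queue


class ToyDataPool:
    """In-memory stand-in: steps arrive from live requests, the trainer drains them."""

    def __init__(self) -> None:
        self._steps: queue.Queue = queue.Queue()

    def submit(self, traj_uid: str, prompt: str, response: str) -> None:
        # Called once per intercepted LLM call in this toy model.
        self._steps.put({"traj_uid": traj_uid, "prompt": prompt, "response": response})

    def take_batch(self, batch_size: int) -> list:
        """Return up to batch_size steps without blocking; may return fewer."""
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(self._steps.get_nowait())
            except queue.Empty:
                break
        return batch


pool = ToyDataPool()
for i in range(3):
    pool.submit(traj_uid=f"traj-{i}", prompt="Hello", response=f"reply {i}")

first = pool.take_batch(2)   # two steps are available
second = pool.take_batch(2)  # only one step is left
print(len(first), len(second))
```

The trainer side never opens a dataset file here: it simply consumes whatever the serving side has recorded so far.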