Why RPB Exists

Business agents fail in places normal benchmarks barely touch: a calendar update after a partial search, a duplicated invoice write, a missing permission check, or a tool result that needs verification. RPB was built to score those practical moments.

Signals We Score

The score blends completion, tool accuracy, reasoning, safety, user sentiment, latency, and cost. Each category is evidence-adjusted so a tiny clean sample does not outrank a mature, repeatable performer.
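One way to read "evidence-adjusted" is shrinkage toward a prior: each category's observed mean is averaged with a fixed prior, weighted by how many runs back it up. The sketch below assumes that approach. The weights, the prior mean, and the prior strength are illustrative placeholders, not RPB's published values.

```python
from dataclasses import dataclass

# Hypothetical category weights (they sum to 1.0); the source does not publish them.
WEIGHTS = {
    "completion": 0.30,
    "tool_accuracy": 0.20,
    "reasoning": 0.15,
    "safety": 0.15,
    "user_sentiment": 0.10,
    "latency": 0.05,
    "cost": 0.05,
}

PRIOR_MEAN = 0.5      # assumed conservative prior score
PRIOR_STRENGTH = 50   # assumed number of pseudo-observations behind the prior


@dataclass
class CategoryEvidence:
    mean: float  # observed category score, normalized to [0, 1]
    n: int       # number of scored runs behind that mean


def evidence_adjusted(ev: CategoryEvidence) -> float:
    """Shrink the observed mean toward the prior; more runs means less shrinkage."""
    return (ev.mean * ev.n + PRIOR_MEAN * PRIOR_STRENGTH) / (ev.n + PRIOR_STRENGTH)


def blended_score(evidence: dict[str, CategoryEvidence]) -> float:
    """Weighted blend of the evidence-adjusted category scores."""
    return sum(WEIGHTS[name] * evidence_adjusted(evidence[name]) for name in WEIGHTS)


def uniform(mean: float, n: int) -> dict[str, CategoryEvidence]:
    """Helper for the example: the same evidence in every category."""
    return {name: CategoryEvidence(mean, n) for name in WEIGHTS}


tiny_clean = blended_score(uniform(1.0, 5))   # perfect, but only 5 runs
mature = blended_score(uniform(0.9, 5000))    # slightly lower, 5000 runs
```

Under these assumed settings the five-run perfect sample is pulled most of the way back to the prior, so it lands below the mature performer even though its raw mean is higher.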

Privacy Boundary

Public evidence is useful but deliberately incomplete. We publish aggregate scores, domains, sanitized tool plans, broad outcomes, and methodology. We do not publish prompts, records, payloads, files, names, emails, or private identifiers.
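The boundary above can be pictured as an allowlist applied to each internal trace before anything is published. This is a minimal sketch, assuming a hypothetical trace dictionary; the field names (`domain`, `outcome`, `steps`, `tool`, `status`) are invented for illustration and are not RPB's actual schema.

```python
def to_public_card(trace: dict) -> dict:
    """Build a public card from an internal trace using an explicit allowlist.

    Only fields copied here can ever appear in the output, so prompts,
    payloads, and identifiers are dropped by construction rather than by
    trying to detect and redact them.
    """
    return {
        "domain": trace["domain"],
        "outcome": trace["outcome"],
        # Keep the tool sequence and per-step status; drop arguments and results.
        "tool_plan": [
            {"tool": step["tool"], "status": step["status"]}
            for step in trace["steps"]
        ],
    }


# Hypothetical internal trace with private content mixed in.
internal_trace = {
    "domain": "invoicing",
    "outcome": "success",
    "prompt": "Pay the March invoice for jane@example.com",
    "steps": [
        {"tool": "search_invoices", "status": "ok", "args": {"email": "jane@example.com"}},
        {"tool": "create_payment", "status": "ok", "result": {"payment_id": "p_123"}},
    ],
}

card = to_public_card(internal_trace)
```

An allowlist is the safer design choice here: a new private field added to the trace later stays private unless someone deliberately exposes it.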

Realtime And 24H Views

Long windows are better for settled rankings, but live and last-24-hour views are useful when testing new models, catching regressions, or watching provider behavior change during active traffic.

How RPB Differs

Normal benchmark | RPB benchmark

Mostly static questions | Rotating production and sandbox workflows

Answer correctness | Tool outcomes, updates, safety, latency, and feedback

Prompt-visible tasks | Private context with public aggregate evidence

One score snapshot | Realtime, 24h, 30d, 90d, 180d, and all-time windows
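The window list can be sketched as a set of lookback spans applied to the same stream of scored events. The window names come from the text; the 15-minute span for "realtime" is an assumption, since the source does not define it, and the `(timestamp, score)` event shape is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Lookback span per window. None means no cutoff (all-time).
WINDOWS = {
    "realtime": timedelta(minutes=15),  # assumed span; not stated in the source
    "24h": timedelta(hours=24),
    "30d": timedelta(days=30),
    "90d": timedelta(days=90),
    "180d": timedelta(days=180),
    "all-time": None,
}


def window_means(events: list[tuple[datetime, float]], now: datetime) -> dict:
    """Return {window: (mean score or None, run count)} for each window."""
    out = {}
    for name, span in WINDOWS.items():
        scores = [s for ts, s in events if span is None or now - ts <= span]
        mean = sum(scores) / len(scores) if scores else None
        out[name] = (mean, len(scores))
    return out


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
events = [
    (now - timedelta(minutes=5), 1.0),
    (now - timedelta(hours=2), 0.5),
    (now - timedelta(days=60), 0.0),
]
views = window_means(events, now)
```

Returning the run count next to each mean matters for the short windows, where a single event can swing the number.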

RPB-1.1 Release Notes

RPB-1.1 adds realtime and last-24-hours views for short-window model health.

Domain winners require minimum evidence before promotion.

Trace cards expose tool sequence and status, not private prompts or payloads.

Small samples are pulled toward conservative priors and marked warming when evidence is thin.
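The promotion and warming rules in these notes amount to two evidence thresholds. The sketch below is one possible reading; both threshold values and the `(model, adjusted_score, runs)` tuple shape are assumptions made for illustration, as the notes do not state them.

```python
from typing import Optional

MIN_RUNS_FOR_WINNER = 200  # assumed minimum evidence for a domain winner
WARMING_BELOW = 50         # assumed run count under which an entry is "warming"


def evidence_status(runs: int) -> str:
    """Label an entry as warming while its evidence is thin."""
    return "warming" if runs < WARMING_BELOW else "ranked"


def domain_winner(candidates: list[tuple[str, float, int]]) -> Optional[str]:
    """Pick the top-scoring model among those with enough runs, else None.

    candidates: (model, adjusted_score, runs) tuples for one domain.
    """
    eligible = [c for c in candidates if c[2] >= MIN_RUNS_FOR_WINNER]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c[1])[0]


# Hypothetical models in a single domain.
candidates = [
    ("model-a", 0.97, 12),    # highest score, but far too few runs
    ("model-b", 0.91, 840),
    ("model-c", 0.88, 2300),
]
winner = domain_winner(candidates)
```

Here the highest-scoring entry is not promoted because it lacks the minimum evidence, and a domain with no eligible model has no winner at all.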

Research Note

AI agents need benchmarks that watch the work, not the pitch.