Business agents fail in places normal benchmarks barely touch: a calendar update after a partial search, a duplicated invoice write, a missing permission check, or a tool result that goes unverified. RPB was built to score those practical moments.
The score blends completion, tool accuracy, reasoning, safety, user sentiment, latency, and cost. Each category is evidence-adjusted so a tiny clean sample does not outrank a mature, repeatable performer.
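One way to picture the evidence adjustment is as shrinkage toward a neutral prior: each category score is pulled toward a baseline until enough runs have accumulated. The sketch below is illustrative only. The weights, the prior score, the prior strength, and the function names are all assumptions rather than RPB's published formula, and it uses a single run count for every category to stay short.

```python
# Hypothetical weights: the categories come from the text, the numbers do not.
WEIGHTS = {
    "completion": 0.25,
    "tool_accuracy": 0.20,
    "reasoning": 0.15,
    "safety": 0.15,
    "user_sentiment": 0.10,
    "latency": 0.10,
    "cost": 0.05,
}

PRIOR_SCORE = 0.5    # assumed neutral baseline on a 0..1 scale
PRIOR_STRENGTH = 50  # assumed number of pseudo-observations backing the prior


def evidence_adjusted(observed: float, n: int) -> float:
    """Shrink an observed category score toward the prior when n is small."""
    return (n * observed + PRIOR_STRENGTH * PRIOR_SCORE) / (n + PRIOR_STRENGTH)


def blended_score(category_scores: dict[str, float], n: int) -> float:
    """Weighted sum of evidence-adjusted category scores (all on a 0..1 scale)."""
    return sum(
        weight * evidence_adjusted(category_scores[name], n)
        for name, weight in WEIGHTS.items()
    )


# A newcomer with 5 flawless runs versus a veteran with 5000 strong runs.
newcomer = blended_score({name: 1.0 for name in WEIGHTS}, n=5)
veteran = blended_score({name: 0.9 for name in WEIGHTS}, n=5000)
```

Under these assumed constants the veteran ranks above the newcomer even though the newcomer's raw scores are perfect, which is the behavior the paragraph above describes.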
Public evidence is useful but deliberately incomplete. We publish aggregate scores, domains, sanitized tool plans, broad outcomes, and methodology. We do not publish prompts, records, payloads, files, names, emails, or private identifiers.
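A publishing rule like this is often easiest to enforce with an allowlist: name the fields that may leave the system and drop everything else by default. The sketch below assumes a run is stored as a dictionary; the field names (`score`, `domain`, `outcome`, `tool_plan`) and the shape of each tool step are hypothetical, not RPB's actual schema.

```python
# Hypothetical field names; only these are ever copied into the public record.
PUBLIC_FIELDS = ("score", "domain", "outcome")


def to_public_record(run: dict) -> dict:
    """Build a publishable record from a private run using an allowlist."""
    public = {key: run[key] for key in PUBLIC_FIELDS if key in run}
    # Sanitize the tool plan: keep the ordered tool names, drop the arguments,
    # since arguments may carry payloads, names, or private identifiers.
    public["tool_plan"] = [step["tool"] for step in run.get("tool_plan", [])]
    return public


private_run = {
    "score": 0.87,
    "domain": "invoicing",
    "outcome": "completed",
    "prompt": "Pay the March invoice for ...",
    "user_email": "someone@example.com",
    "tool_plan": [
        {"tool": "search_invoices", "args": {"customer_id": "C-1042"}},
        {"tool": "create_payment", "args": {"amount": 120.0}},
    ],
}

public_record = to_public_record(private_run)
```

An allowlist fails closed: a new private field added to the run later stays unpublished unless someone deliberately adds it to `PUBLIC_FIELDS`.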
Long windows are better for settled rankings, but live and last-24-hour views are useful when testing new models, catching regressions, or watching provider behavior change during active traffic.