MyTabulon RPB

Research Note

Why RPB scores the work agents actually perform, not the pitch around them.

AI agents need benchmarks that watch the work, not the pitch.

MyTabulon RPB is a public benchmark for model behavior inside real business workflows. It is designed for the messy middle: tools, permissions, partial context, latency, cost, and users who care whether the task actually got done.

Contents: Why RPB Exists, Signals We Score, Privacy Boundary, Realtime And 24H Views, Comparison, RPB-1.1

Why RPB Exists

Business agents fail in places normal benchmarks barely touch: a calendar update after a partial search, a duplicated invoice write, a missing permission check, or a tool result that needs verification. RPB was built to score those practical moments.

Signals We Score

The score blends completion, tool accuracy, reasoning, safety, user sentiment, latency, and cost. Each category is evidence-adjusted so a tiny clean sample does not outrank a mature, repeatable performer.
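
One way to picture the evidence adjustment is a weighted blend in which each category mean is first pulled toward a conservative prior, with the pull weakening as the number of scored runs grows. The sketch below is illustrative only: the weights, the prior mean, the prior strength, and the names `CategoryEvidence`, `adjust`, and `blended_score` are assumptions made for this example, not published RPB parameters.

```python
from dataclasses import dataclass

# Hypothetical weights: the note lists the categories but not how they are weighted.
WEIGHTS = {
    "completion": 0.30,
    "tool_accuracy": 0.20,
    "reasoning": 0.15,
    "safety": 0.15,
    "user_sentiment": 0.10,
    "latency": 0.05,
    "cost": 0.05,
}

PRIOR_MEAN = 0.5      # assumed conservative prior for every category
PRIOR_STRENGTH = 50   # assumed pseudo-count: how many runs the prior is "worth"


@dataclass
class CategoryEvidence:
    mean: float  # observed category score in [0, 1]
    runs: int    # number of scored runs behind that mean


def adjust(evidence: CategoryEvidence) -> float:
    """Shrink an observed mean toward the prior; thin evidence shrinks more."""
    total = evidence.runs + PRIOR_STRENGTH
    return (evidence.mean * evidence.runs + PRIOR_MEAN * PRIOR_STRENGTH) / total


def blended_score(evidence_by_category: dict[str, CategoryEvidence]) -> float:
    """Weighted sum of evidence-adjusted category scores."""
    return sum(
        weight * adjust(evidence_by_category[name])
        for name, weight in WEIGHTS.items()
    )


# A perfect score over 5 runs versus a strong score over 2000 runs.
tiny_clean = {name: CategoryEvidence(mean=1.0, runs=5) for name in WEIGHTS}
mature = {name: CategoryEvidence(mean=0.85, runs=2000) for name in WEIGHTS}

print(round(blended_score(tiny_clean), 3), round(blended_score(mature), 3))
```

Under these assumed numbers the five-run sample lands close to the prior while the mature performer keeps nearly all of its observed score, which is the ordering the paragraph above describes.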

Privacy Boundary

Public evidence is useful but deliberately incomplete. We publish aggregate scores, domains, sanitized tool plans, broad outcomes, and methodology. We do not publish prompts, records, payloads, files, names, emails, or private identifiers.
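
A boundary like this is easiest to keep when the public record is built by copying an explicit allowlist of fields, so anything not named stays private by default. The following is a minimal sketch under that assumption; the field names (`domain`, `outcome`, `steps`, `tool`, `status`) and the function `to_trace_card` are hypothetical and do not describe RPB's actual schema.

```python
# Assumed field names: the note describes what is published, not a schema.
PUBLIC_STEP_FIELDS = ("tool", "status")


def to_trace_card(private_trace: dict) -> dict:
    """Build a public card by copying allowlisted fields only."""
    return {
        "domain": private_trace["domain"],
        "outcome": private_trace["outcome"],
        "steps": [
            {field: step[field] for field in PUBLIC_STEP_FIELDS}
            for step in private_trace["steps"]
        ],
    }


# Made-up private trace containing a prompt, an email, and tool payloads.
private_trace = {
    "domain": "invoicing",
    "outcome": "completed",
    "prompt": "Pay the March invoice",
    "user_email": "person@example.com",
    "steps": [
        {"tool": "search_invoices", "status": "ok", "payload": {"query": "March"}},
        {"tool": "write_payment", "status": "ok", "payload": {"amount": 1200}},
    ],
}

card = to_trace_card(private_trace)
print(card)
```

An allowlist is chosen here over a blocklist because a newly added private field is excluded automatically instead of leaking until someone remembers to block it.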

Realtime And 24H Views

Long windows are better for settled rankings, but live and last-24-hour views are useful when testing new models, catching regressions, or watching provider behavior change during active traffic.
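
To illustrate why short windows catch regressions that a long window smooths over, the sketch below computes a completion rate over several windows from the same run log. It rests on assumptions: the run records are invented, `completion_rate` is a hypothetical helper, and "realtime" is taken to mean the last 15 minutes, which the note does not specify.

```python
from datetime import datetime, timedelta, timezone

# Assumed window lengths; the 15-minute "realtime" span is a guess for illustration.
WINDOWS = {
    "realtime": timedelta(minutes=15),
    "24h": timedelta(hours=24),
    "30d": timedelta(days=30),
}


def completion_rate(runs, window, now):
    """Share of completed runs that finished inside the window; None if there are none."""
    cutoff = now - WINDOWS[window]
    recent = [run for run in runs if cutoff <= run["finished_at"] <= now]
    if not recent:
        return None
    return sum(run["completed"] for run in recent) / len(recent)


now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)
runs = [
    {"finished_at": now - timedelta(days=20), "completed": True},
    {"finished_at": now - timedelta(days=10), "completed": True},
    {"finished_at": now - timedelta(hours=3), "completed": False},
    {"finished_at": now - timedelta(minutes=5), "completed": False},
]

for name in WINDOWS:
    print(name, completion_rate(runs, name, now))
```

In this invented log the 30-day rate is still 0.5 while the 24-hour and realtime rates have already dropped to 0.0, so the recent failures show up in the short views first.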

How RPB Differs

| Normal benchmark | RPB benchmark |
| --- | --- |
| Mostly static questions | Rotating production and sandbox workflows |
| Answer correctness | Tool outcomes, updates, safety, latency, and feedback |
| Prompt-visible tasks | Private context with public aggregate evidence |
| One score snapshot | Realtime, 24h, 30d, 90d, 180d, and all-time windows |

RPB-1.1 Release Notes

- RPB-1.1 adds realtime and last-24-hours views for short-window model health.
- Domain winners require minimum evidence before promotion.
- Trace cards expose tool sequence and status, not private prompts or payloads.
- Small samples are pulled toward conservative priors and marked warming when evidence is thin.
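
The promotion and warming rules in these release notes could be sketched as a simple gate on run count. Everything specific here is an assumption: the threshold of 200 runs, the `settled` label for entries that are not warming, the entry fields, and the helper names `status` and `pick_domain_winner`. Scores are treated as already adjusted toward the prior.

```python
from typing import Optional

MIN_RUNS_FOR_PROMOTION = 200  # hypothetical threshold; the release notes give no number


def status(runs: int) -> str:
    """Label an entry as warming while its evidence is below the threshold."""
    return "warming" if runs < MIN_RUNS_FOR_PROMOTION else "settled"


def pick_domain_winner(entries: list[dict]) -> Optional[str]:
    """Best-scoring model among entries with enough evidence, or None if there are none."""
    eligible = [e for e in entries if e["runs"] >= MIN_RUNS_FOR_PROMOTION]
    if not eligible:
        return None
    return max(eligible, key=lambda e: e["score"])["model"]


# Invented leaderboard rows for a single domain.
entries = [
    {"model": "model-a", "score": 0.91, "runs": 12},
    {"model": "model-b", "score": 0.84, "runs": 2400},
    {"model": "model-c", "score": 0.79, "runs": 900},
]

print(pick_domain_winner(entries))
print([(e["model"], status(e["runs"])) for e in entries])
```

With these rows the highest raw score belongs to an entry with only 12 runs, so it stays marked as warming and the winner is chosen from the two entries that clear the threshold.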