Benchmark FAQ

Plain answers for the public AI benchmark.

Use this page when someone asks what RPB means, why the benchmark differs from lab tests, how privacy works, or how to interpret model rankings without hype.

What does RPB mean?

RPB means Real Performance Benchmark. It is MyTabulon's evidence-adjusted public score for real AI agent performance across task completion, tool accuracy, reasoning, safety, user sentiment, latency, and cost efficiency.
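One way to picture a score that spans seven dimensions is a weighted average. The sketch below is illustrative only: the weights, the 0 to 100 scale, and the names ASSUMED_WEIGHTS and composite_score are assumptions, since this FAQ names the dimensions but not how they are combined.

```python
from typing import Dict

# Hypothetical weights: the FAQ lists these seven dimensions but does not
# publish how they are weighted. These values are placeholders that sum to 1.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "task_completion": 0.25,
    "tool_accuracy": 0.20,
    "reasoning": 0.15,
    "safety": 0.15,
    "user_sentiment": 0.10,
    "latency": 0.075,
    "cost_efficiency": 0.075,
}


def composite_score(dimension_scores: Dict[str, float]) -> float:
    """Return the weighted average of per-dimension scores (assumed 0 to 100)."""
    missing = set(ASSUMED_WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(ASSUMED_WEIGHTS.values())
    weighted = sum(
        weight * dimension_scores[name] for name, weight in ASSUMED_WEIGHTS.items()
    )
    return weighted / total_weight
```

Under these assumed weights, a model that is fluent but rarely finishes tasks loses more than one that is slightly slow, which matches the emphasis on completion and tool use elsewhere on this page.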

Is this the same as a normal model benchmark?

No. Many model benchmarks are lab tasks. MyTabulon RPB is built around business-agent behavior: using tools, reading allowed workspace context, making correct updates, respecting permissions, and recovering from tool failures.

Can AI labs game the benchmark?

The benchmark is designed to be difficult to game because it blends production signals, anonymized ledger outcomes, user feedback, and rotating golden tasks. A provider cannot simply memorize one public prompt set and call it solved.

Does the public benchmark expose customer data?

No. Public benchmark evidence excludes prompts, raw tool payloads, customer names, business names, record IDs, file contents, and private workspace data. It can show model name, domain, tool sequence, status, and broad evidence signals.
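The exclusion rule above reads naturally as an allowlist: only named public fields pass through, and everything else is dropped by default. This is a minimal sketch, not MyTabulon's actual pipeline; the field names and the function to_public_evidence are hypothetical.

```python
from typing import Any, Dict

# Hypothetical field names for the items the FAQ says may be shown publicly.
PUBLIC_FIELDS = ("model_name", "domain", "tool_sequence", "status", "evidence_signals")


def to_public_evidence(record: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted fields; prompts, payloads, and names never pass."""
    return {field: record[field] for field in PUBLIC_FIELDS if field in record}
```

An allowlist fails closed: a new private field added later stays private unless someone deliberately adds it to PUBLIC_FIELDS.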

What is tool call accuracy?

Tool call accuracy measures whether the model selected the right MyTabulon tool, passed usable arguments, handled errors, and verified the result. For agents, this matters more than a polished paragraph.
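The four checks above can be sketched as a per-call review. The all-or-nothing rule (a call counts only if every check passes) and the names ToolCallReview and tool_call_accuracy are assumptions made for illustration; the real scoring may award partial credit.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ToolCallReview:
    """One reviewed tool call, with the four checks named in the FAQ."""
    right_tool: bool
    usable_arguments: bool
    errors_handled: bool
    result_verified: bool

    def passed(self) -> bool:
        # Assumption: a call is accurate only when all four checks pass.
        return all(
            (self.right_tool, self.usable_arguments, self.errors_handled, self.result_verified)
        )


def tool_call_accuracy(reviews: Iterable[ToolCallReview]) -> float:
    """Fraction of reviewed calls that passed every check (0.0 if none reviewed)."""
    reviewed = list(reviews)
    if not reviewed:
        return 0.0
    return sum(review.passed() for review in reviewed) / len(reviewed)
```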

What is task completion?

Task completion measures whether the user’s actual goal was finished. A model can speak fluently and still fail completion if it never creates the invoice, checks the calendar, updates the lead, or returns the requested evidence.

Why can a model show as warming?

Warming means MyTabulon has seen the model but does not yet have enough evidence for a fair public rank. The score is pulled toward a conservative prior so a small clean sample does not look perfect.
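Pulling a score toward a conservative prior is commonly done with a shrinkage average. The sketch below shows that general technique; the prior mean of 50, the prior strength of 30, and the function name shrunk_score are assumed values for illustration, not MyTabulon's published parameters.

```python
def shrunk_score(
    observed_mean: float,
    sample_count: int,
    prior_mean: float = 50.0,      # assumed conservative prior
    prior_strength: float = 30.0,  # assumed weight of the prior, in pseudo-samples
) -> float:
    """Blend the observed mean with the prior, weighted by evidence volume."""
    if sample_count < 0:
        raise ValueError("sample_count must be non-negative")
    total = prior_strength + sample_count
    return (prior_strength * prior_mean + sample_count * observed_mean) / total
```

With these assumed values, a model with a perfect observed mean of 100 over 30 samples would display as 75, and the displayed score approaches the observed mean only as evidence accumulates.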

Why does confidence matter?

Confidence reflects sample strength. A model with a high RPB and weak confidence looks promising but is not yet proven, while a model with a high RPB and high confidence has more repeated evidence behind it. Confidence also keeps small samples from looking settled.
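A standard statistical way to see why sample strength matters is the Wilson score interval for a pass rate. This is a generic illustration, not a description of how MyTabulon computes confidence; the function name wilson_interval and the 95% level (z = 1.96) are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate; wider when trials are few."""
    if trials <= 0:
        return (0.0, 1.0)  # no evidence: the rate could be anything
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    rate = successes / trials
    denom = 1 + z * z / trials
    center = (rate + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(rate * (1 - rate) / trials + z * z / (4 * trials * trials)) / denom
    )
    return (max(0.0, center - half_width), min(1.0, center + half_width))
```

Two models can share the same 90% pass rate, yet the one with 10 trials has a much wider plausible range than the one with 1,000 trials.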

What is the realtime benchmark view?

The realtime view focuses on the freshest safe public signal from the last 15 minutes. It is useful for watching model behavior during active testing, but the default 90-day view is better for settled comparisons.

Why is there a last 24 hours view?

The last 24 hours view shows recent model health without waiting for a longer leaderboard window. It helps detect regressions, latency spikes, tool failures, and fresh provider changes.
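The realtime, 24-hour, and 90-day views can be thought of as the same evidence filtered by different time windows. The sketch below assumes each event is a (timestamp, score) pair; the view keys and the function events_in_window are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Window lengths taken from the FAQ; the dictionary keys are hypothetical labels.
WINDOWS = {
    "realtime": timedelta(minutes=15),
    "24h": timedelta(hours=24),
    "90d": timedelta(days=90),
}

Event = Tuple[datetime, float]  # assumed shape: (timestamp, score)


def events_in_window(events: Iterable[Event], view: str, now: datetime) -> List[Event]:
    """Keep events whose timestamp falls inside the chosen view's window."""
    cutoff = now - WINDOWS[view]
    return [event for event in events if cutoff <= event[0] <= now]
```

Shorter windows react faster to regressions but hold fewer events, which is why the longer default view suits settled comparisons.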

How are domains like CRM and accounting scored?

Domains are scored from the subset of tool-backed evidence that belongs to that work category. A model can be excellent at CRM, weaker at accounting, and still have a respectable overall RPB.
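Scoring a domain from its own subset of evidence amounts to grouping by category before averaging. This is a minimal sketch assuming each evidence item is a (domain, score) pair and that a plain mean is used; the function name domain_scores is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def domain_scores(evidence: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average scores per domain, using only that domain's evidence."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for domain, score in evidence:
        buckets[domain].append(score)
    return {domain: sum(scores) / len(scores) for domain, scores in buckets.items()}
```

Because each domain is averaged separately, a strong CRM result does not hide a weaker accounting result.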

Where should I send someone who wants the short version?

Send them to /benchmarks for the leaderboard, /benchmarks/methodology for the scoring explanation, and /benchmarks/faq for plain-language answers.
