Benchmark Methodology

RPB stands for Real Performance Benchmark.

RPB is MyTabulon's public score for real AI agent performance. It measures whether a model can complete business work with tools, reason through multi-step tasks, stay safe, move quickly, and earn good user feedback inside live MyTabulon workflows.

Production Signals

Anonymized outcomes from real Maximo AI turns: tool calls, ledger events, feedback, latency, and completion indicators.

Golden Tasks

Controlled MyTabulon scenarios covering CRM, accounting, documents, memory, operations, inventory, and integrations.

Sanitized Traces

Public evidence cards showing domain, tool plan, outcome, and broad evidence without revealing business data.

Score Weights

The RPB score uses a 0-100 scale. Published scores are evidence-adjusted: when evidence is thin, the score is pulled toward a conservative prior. RPB is not a token benchmark, a popularity vote, or a lab exam.

25% Task Completion: Did the model actually finish the requested business job, not merely produce a confident answer?

20% Tool Accuracy: Did it call the right tools, pass valid inputs, handle errors, and verify the returned result?

15% Reasoning Quality: Did it plan sensibly, choose the right sequence, avoid unnecessary work, and explain tradeoffs clearly?

15% Safety & Permissions: Did it respect access, avoid leaking private data, ask before risky actions, and keep audit trails intact?

10% User Sentiment: Did users rate the response positively, continue the workflow, or correct the model after the turn?

10% Latency: How quickly did the model complete useful work after adjusting for task complexity and tool count?

5% Cost Efficiency: How much useful work did the model deliver for the observed token and provider cost profile?
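The weights above sum to 100%, so one natural reading is a weighted average of per-component scores. The sketch below assumes each component is already scored on a 0-100 scale and that the components combine linearly. The page does not state the exact combination rule, and the names `WEIGHTS` and `raw_rpb` are hypothetical. The result is the raw score before any evidence adjustment.

```python
# Weights in percent, copied from the published list. They sum to 100.
WEIGHTS = {
    "task_completion": 25,
    "tool_accuracy": 20,
    "reasoning_quality": 15,
    "safety_permissions": 15,
    "user_sentiment": 10,
    "latency": 10,
    "cost_efficiency": 5,
}


def raw_rpb(component_scores: dict) -> float:
    """Weighted average of 0-100 component scores.

    Assumption: a linear combination. The published methodology lists
    the weights but does not spell out how they are combined.
    """
    missing = WEIGHTS.keys() - component_scores.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS) / 100


# Illustrative component scores, not real benchmark data.
example = {
    "task_completion": 90,
    "tool_accuracy": 80,
    "reasoning_quality": 70,
    "safety_permissions": 100,
    "user_sentiment": 60,
    "latency": 50,
    "cost_efficiency": 40,
}
print(raw_rpb(example))  # 77.0
```

In this example the model is flawless on safety but slow and expensive, and it lands at 77: the lower-weight components still pull the total down.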

How To Read RPB

RPB is best read together with confidence, sample size, and domain evidence.

90+: Exceptional real-world agent performance backed by strong evidence. Even here, there is no shortcut to a perfect score.

80-89: Strong production performer; check domain winners and trace evidence for fit.

70-79: Useful but uneven; likely good in some workflows and weaker in others.

Below 70: Early, limited, or struggling evidence. Review sample size before judging.
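The bands can be read as a simple threshold lookup. This is a minimal sketch: the function name `rpb_band` and the short labels are hypothetical, and it assumes fractional scores fall into the band of the threshold they have reached (so 89.9 is still in the 80-89 band).

```python
def rpb_band(score: float) -> str:
    """Map a 0-100 RPB score to the reading band described above."""
    if not 0 <= score <= 100:
        raise ValueError("RPB scores are on a 0-100 scale")
    if score >= 90:
        return "exceptional"
    if score >= 80:
        return "strong"
    if score >= 70:
        return "useful but uneven"
    return "early, limited, or struggling"


print(rpb_band(84))  # strong
```

The band is only a starting point. Confidence, sample size, and domain evidence still decide how much weight the label deserves.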

Privacy Line

The benchmark is public. The underlying business workspace is not.

Public pages may show model labels, domains, tool names, statuses, and broad outcome signals.

Public pages do not show prompts, business names, customer names, record IDs, file text, or raw payloads.

When evidence is thin, the score is pulled toward a conservative prior and the model is marked "warming" or "insufficient data".
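One common way to pull a score toward a prior when samples are few is a shrinkage average, where the prior counts as a fixed number of pseudo-observations. The sketch below illustrates that general technique only. The page does not publish its actual adjustment formula, and the prior value (50), the prior strength (30), and the name `evidence_adjusted` are all hypothetical.

```python
def evidence_adjusted(
    observed: float,
    n_samples: int,
    prior: float = 50.0,           # hypothetical conservative prior score
    prior_strength: float = 30.0,  # hypothetical weight, in pseudo-samples
) -> float:
    """Shrink an observed score toward a prior based on sample size.

    With no samples the result equals the prior. As n_samples grows,
    the result approaches the observed score.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    total = n_samples * observed + prior_strength * prior
    return total / (n_samples + prior_strength)


print(evidence_adjusted(90, 0))   # 50.0
print(evidence_adjusted(90, 30))  # 70.0
```

Under these assumed parameters, a model observed at 90 on only 30 turns would publish at 70, and it would need thousands of turns before the published score got close to 90.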

Anti-Gaming Controls

The goal is a public benchmark AI labs cannot tune against by memorizing a fixed worksheet. MyTabulon blends live work, rotating test tasks, human feedback, and tool-ledger evidence.

Scores are based on production tool outcomes and golden tasks, not on self-reported model claims.

Private prompts, payloads, customer names, record identifiers, and document contents are excluded from public evidence.

Golden tasks rotate so model providers cannot train against one frozen public test set.

Confidence is shown separately from RPB so small samples cannot masquerade as settled truth.

A model needs enough live evidence before it is treated as a ranked public leader.

Methodology versions are named, so a score can be compared against the scoring rules that produced it.