Run RPB

Evaluate a model where business-agent behavior actually happens.

RPB runs inside the MyTabulon agent environment, so model quality is measured through tool outcomes, safety, latency, usage, and verified task completion rather than through a static worksheet alone.

Choose a model track

Select a model configured inside Maximo AI, then decide whether the run is a realtime smoke test, a last-24-hour health check, or a longer leaderboard sample.
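The three run types can be sketched as a small planning helper. This is only an illustration: `RunKind`, `RunPlan`, and `plan_run` are hypothetical names, not part of any RPB API, and the leaderboard lookback is passed in by the caller because this page does not state how long that window is.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class RunKind(Enum):
    """Hypothetical labels for the three run types described above."""
    REALTIME_SMOKE = "realtime_smoke"
    HEALTH_24H = "health_24h"
    LEADERBOARD = "leaderboard"


@dataclass(frozen=True)
class RunPlan:
    model_name: str
    kind: RunKind
    lookback: Optional[timedelta]  # None means live traffic, no lookback window
    diagnostic_only: bool          # realtime and 24-hour views are diagnostic


def plan_run(model_name: str, kind: RunKind,
             leaderboard_lookback: timedelta) -> RunPlan:
    """Map a run type to a lookback window (assumed mapping, not RPB's own)."""
    if kind is RunKind.REALTIME_SMOKE:
        return RunPlan(model_name, kind, None, True)
    if kind is RunKind.HEALTH_24H:
        return RunPlan(model_name, kind, timedelta(hours=24), True)
    # Longer samples are the ones suited to ranking decisions.
    return RunPlan(model_name, kind, leaderboard_lookback, False)
```

For example, `plan_run("example-model", RunKind.HEALTH_24H, timedelta(days=30))` yields a 24-hour lookback flagged as diagnostic; the 30-day value is an arbitrary placeholder.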

Run scenario batches

Exercise CRM, accounting, operations, documents, inventory, memory, and integrations. Each scenario should include realistic context, ambiguity, and permission boundaries.

Collect evidence

The benchmark service reads assistant turns, tool events, action-ledger outcomes, feedback, usage, latency, and safe trace metadata.
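One way to picture the evidence gathered for a single scenario is a plain record type. The sketch below is an assumption about shape only: the class and field names (`EvidenceRecord`, `ToolEvent`, `scenario_label`, and so on) are hypothetical and are not taken from the benchmark service's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToolEvent:
    tool: str   # hypothetical tool name, e.g. "create_invoice"
    ok: bool    # whether the tool call succeeded


@dataclass
class EvidenceRecord:
    """Hypothetical container for the evidence types listed above."""
    scenario_label: str
    assistant_turns: int
    tool_events: List[ToolEvent] = field(default_factory=list)
    ledger_outcomes: List[str] = field(default_factory=list)
    feedback: Optional[str] = None
    total_tokens: int = 0          # usage
    latency_ms: float = 0.0
    trace_metadata: Dict[str, str] = field(default_factory=dict)  # safe fields only


def tool_success_rate(record: EvidenceRecord) -> Optional[float]:
    """Fraction of successful tool calls, or None when no tools were called."""
    if not record.tool_events:
        return None
    successes = sum(1 for event in record.tool_events if event.ok)
    return successes / len(record.tool_events)
```

Returning `None` for a record with no tool events keeps answer-only runs distinguishable from runs whose tool calls all failed.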

Review public output

Use the leaderboard, model page, data export, and methodology page to review score, confidence, domain evidence, and sanitized traces.

Scenario tracks

Run batches across multiple work surfaces so a model cannot overfit one task type.

CRM (required track): Lead qualification, client updates, deal movement, and duplicate avoidance.

Accounting (required track): Draft invoices, payments, expenses, totals, currencies, and approval boundaries.

Operations (required track): Tasks, projects, appointments, approvals, notes, and rescheduling.

Documents (required track): Files, PDFs, AI documents, extraction, generated assets, and redaction.

Integrations (required track): Google Workspace, MCP, Zapier, WhatsApp, Telegram, Mono, and external tools.

Memory (required track): Preference storage, retrieval, retention, conflict handling, and secret avoidance.
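A batch can be checked against the six required tracks before it runs. The following is a minimal sketch; the lowercase track identifiers and the `missing_tracks` helper are illustrative assumptions, not names used by RPB.

```python
from typing import Iterable, List

# The six required tracks listed above, as assumed lowercase identifiers.
REQUIRED_TRACKS = frozenset({
    "crm", "accounting", "operations",
    "documents", "integrations", "memory",
})


def missing_tracks(batch_tracks: Iterable[str]) -> List[str]:
    """Return the required tracks a batch does not cover, sorted by name."""
    seen = {track.strip().lower() for track in batch_tracks}
    return sorted(REQUIRED_TRACKS - seen)
```

A batch that only touches CRM and accounting scenarios would report the other four tracks as missing, which signals that the run exercises too few work surfaces.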

Acceptance rules

Runs are useful only when evidence is safe, realistic, and auditable.

Do not publish customer data, raw prompts, record IDs, emails, phone numbers, files, or private payloads.

A model must use tools correctly; answer-only summaries do not count as completed tool-backed workflows.

Risky actions need permission and verification, especially send, delete, payroll, signature, and external integration workflows.

Realtime and 24-hour views are diagnostic. Longer windows are better for ranking and procurement decisions.

Small samples stay marked "warming" or "insufficient data" until evidence volume is high enough.
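Several of these rules can be expressed as small checks. The sketch below is illustrative only: every function name is hypothetical, the regular expressions are rough approximations that would miss many real email and phone formats, the "ranked" label is an assumed name for a sample that has enough evidence, and the sample-size thresholds are supplied by the caller because this page does not state them.

```python
import re

# Rough patterns for illustration; real redaction needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Action types singled out above as needing permission and verification.
RISKY_ACTIONS = frozenset(
    {"send", "delete", "payroll", "signature", "external_integration"}
)


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)


def counts_as_completed(successful_tool_calls: int, verified: bool) -> bool:
    """Answer-only runs (no successful tool call) never count as completed."""
    return successful_tool_calls > 0 and verified


def risky_action_cleared(action: str, permitted: bool, verified: bool) -> bool:
    """Risky actions need both permission and verification; others pass."""
    if action in RISKY_ACTIONS:
        return permitted and verified
    return True


def sample_status(sample_count: int, warming_min: int, ranked_min: int) -> str:
    """Label a sample by evidence volume using caller-supplied thresholds."""
    if sample_count < warming_min:
        return "insufficient data"
    if sample_count < ranked_min:
        return "warming"
    return "ranked"  # assumed label for a sample with enough evidence
```

Keeping the thresholds as parameters avoids baking in numbers the methodology may define differently.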