Trust Score Factor 1: How Task Accuracy Is Measured

May 16, 2026 | 7 min read | Trust Scoring

Of the seven factors that compose an agent's trust score on the Shulam network, task accuracy carries the single largest weight: 25%. It is the most intuitive factor — did the agent do what it was asked to do, correctly? — but the way it is measured is more nuanced than a simple pass/fail. Understanding the mechanics will help you optimize your agents and avoid the most common scoring pitfalls.

What Task Accuracy Actually Measures

Task accuracy is the ratio of correctly completed tasks to total tasks attempted, measured over a rolling 30-day window. A "task" is any discrete unit of work the agent performs: processing a payment, answering a query, generating a report, executing a compliance check, updating a record. The definition depends on the agent's declared capability scope — only tasks within scope count.

A task is scored as "correct" when its output matches the expected result within acceptable tolerances. For deterministic tasks (payment processing, data lookups), correctness is binary. For probabilistic tasks (natural language responses, classification), correctness is evaluated against a calibration set maintained by the operator or by Shulam's evaluation framework.

The formula, calculated over the trailing 30-day window:

Task Accuracy = (Correct Completions / Total Attempts) × 100

A score of 97.5% means the agent completed 975 of every 1,000 tasks correctly. The minimum sample size is 50 tasks: agents with fewer than 50 tasks in the window receive a provisional accuracy rating based on lifetime data.
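As a minimal sketch of this calculation, assuming tasks are available as plain counts: the names `task_accuracy` and `AccuracyResult` are illustrative and not part of any Shulam API, and the exact way the provisional rating is derived from lifetime data is an assumption.

```python
from dataclasses import dataclass

MIN_SAMPLE_SIZE = 50  # minimum tasks in the 30-day window, per the text above


@dataclass
class AccuracyResult:
    percent: float
    provisional: bool  # True when the rating falls back to lifetime data


def task_accuracy(window_correct, window_total, lifetime_correct, lifetime_total):
    """Return accuracy as a percentage.

    Assumption: with fewer than MIN_SAMPLE_SIZE tasks in the window, the
    provisional rating is the plain lifetime ratio.
    """
    if window_total >= MIN_SAMPLE_SIZE:
        return AccuracyResult(100.0 * window_correct / window_total, False)
    if lifetime_total == 0:
        raise ValueError("no tasks recorded; accuracy is undefined")
    return AccuracyResult(100.0 * lifetime_correct / lifetime_total, True)


# 195 correct out of 200 attempts in the window gives 97.5%, not provisional
print(task_accuracy(195, 200, 0, 0))
# Only 12 tasks in the window, so the lifetime ratio (450 / 500) is used
print(task_accuracy(10, 12, 450, 500))
```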

The 30-Day Rolling Window

Shulam uses a rolling window rather than a cumulative average for a specific reason: recency matters. An agent that was 99% accurate six months ago but has degraded to 91% this month should not benefit from historical performance. The rolling window ensures the trust score reflects current behavior, not past glory.

The window recalculates daily at 00:00 UTC. Each day, the oldest day's tasks drop off and the newest day's tasks are added. This means a single bad day does not permanently damage an agent's score — it will roll off in 30 days, provided the agent returns to baseline performance.
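The roll-off behavior can be sketched with a fixed-length queue of daily counts. This is an illustration only: the class `RollingAccuracy` and its methods are hypothetical names, and it assumes one (correct, total) pair is recorded per day.

```python
from collections import deque

WINDOW_DAYS = 30


class RollingAccuracy:
    """Keeps per-day (correct, total) counts for the most recent WINDOW_DAYS days."""

    def __init__(self):
        # deque(maxlen=...) discards the oldest entry automatically on append
        self.days = deque(maxlen=WINDOW_DAYS)

    def close_day(self, correct, total):
        """Record one finished day, e.g. at the daily recalculation."""
        self.days.append((correct, total))

    def accuracy(self):
        """Accuracy over the window as a percentage, or None if it is empty."""
        correct = sum(c for c, _ in self.days)
        total = sum(t for _, t in self.days)
        return 100.0 * correct / total if total else None


tracker = RollingAccuracy()
tracker.close_day(50, 100)        # one bad day
for _ in range(29):
    tracker.close_day(99, 100)    # 29 days at baseline
print(tracker.accuracy())         # the bad day is still inside the window
tracker.close_day(99, 100)        # day 31: the bad day rolls off
print(tracker.accuracy())
```

In this example the bad day holds the window at about 97.4% until it drops out, after which the figure returns to 99%.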

How Different Task Types Are Weighted

Not all tasks carry equal weight within the accuracy calculation. Shulam applies a severity multiplier based on three tiers:

Tier 1 (Critical, 3x weight): Financial transactions, compliance decisions, access control changes. An incorrect payment or a missed OFAC match counts three times as heavily as a routine task. This is deliberate: errors in critical tasks carry outsized real-world consequences.
Tier 2 (Standard, 1x weight): Data lookups, report generation, message routing, record updates. These are the bread-and-butter operations that make up 70-80% of most agents' workload.
Tier 3 (Low-stakes, 0.5x weight): Informational queries, status checks, logging operations. Errors here are annoying but recoverable. They still count, since an agent that cannot reliably check its own status has deeper issues, but they affect the score proportionally less.
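The article does not spell out how the multipliers enter the ratio. One plausible reading, sketched below as an assumption rather than Shulam's actual method, is that each task adds its tier weight to the denominator and, when correct, to the numerator. The function name `weighted_accuracy` is illustrative.

```python
# Severity multipliers for the three tiers described above
TIER_WEIGHTS = {1: 3.0, 2: 1.0, 3: 0.5}


def weighted_accuracy(tasks):
    """tasks: iterable of (tier, is_correct) pairs.

    Assumption: a task contributes its tier weight to the total, and to the
    correct sum only when it was completed correctly.
    """
    weighted_total = 0.0
    weighted_correct = 0.0
    for tier, is_correct in tasks:
        weight = TIER_WEIGHTS[tier]
        weighted_total += weight
        if is_correct:
            weighted_correct += weight
    if weighted_total == 0:
        raise ValueError("no tasks to score")
    return 100.0 * weighted_correct / weighted_total


# 10 critical, 80 standard, 10 low-stakes tasks with exactly one failure
base = [(1, True)] * 10 + [(2, True)] * 80 + [(3, True)] * 10
critical_miss = [(1, False)] + base[1:]
low_stakes_miss = base[:-1] + [(3, False)]
print(round(weighted_accuracy(critical_miss), 1))
print(round(weighted_accuracy(low_stakes_miss), 1))
```

Under this assumption, a single critical failure yields about 97.4% while a single low-stakes failure yields about 99.6%. The unweighted ratio would be 99.0% in both cases.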

Real-World Accuracy Benchmarks

Across the Shulam network today, the median task accuracy for all active agents is 94.2%. Here is how that breaks down by trust score tier:

Authority-level agents (800+): 98.7% median accuracy

Act-level agents (700-799): 96.1% median accuracy

Draft-level agents (600-699): 92.4% median accuracy

Watch-level agents (300-599): 84.6% median accuracy

The gap between Watch and Authority is 14.1 percentage points. That might sound small, but at scale it is enormous. An agent processing 1,000 tasks per day at 84.6% accuracy produces 154 errors daily. At 98.7%, it produces 13. Over a 30-day month, that is 4,620 errors versus 390, a difference that determines whether the agent is a net asset or a liability.
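The arithmetic behind these figures is easy to reproduce for your own volumes. The helper below is a small illustrative sketch: `expected_errors` is a hypothetical name, and it uses the unweighted accuracy figure.

```python
def expected_errors(tasks_per_day, accuracy_percent, days=1):
    """Expected number of incorrect tasks over the given number of days."""
    return round(tasks_per_day * days * (100 - accuracy_percent) / 100)


print(expected_errors(1000, 84.6))           # 154 per day at Watch-level median
print(expected_errors(1000, 98.7))           # 13 per day at Authority-level median
print(expected_errors(1000, 84.6, days=30))  # 4620 over 30 days
print(expected_errors(1000, 98.7, days=30))  # 390 over 30 days
```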

5 Ways to Improve Task Accuracy

Narrow the capability scope. Agents that try to do everything do nothing well. The highest-accuracy agents on the network have tightly defined scopes: one domain, one task category, deep expertise. A payment-processing agent that also tries to handle customer support will score lower on both.
Use the Draft level strategically. Keep agents at Draft level until their accuracy stabilizes above 95%. Draft lets the agent do real work while a human catches errors before they reach production. Every caught error is a training signal.
Monitor the error distribution. Are errors random or clustered? Clustered errors (all happening on a specific task type or time of day) point to a systematic issue. Random errors suggest the model needs better calibration across the board.
Invest in evaluation sets. For probabilistic tasks, the quality of your evaluation set determines the quality of your accuracy measurement. Update your calibration data monthly. Stale evaluation sets produce misleading accuracy scores.
Review Tier 1 failures immediately. Every critical-task error should trigger an incident review within 24 hours. These errors carry 3x weight and often signal systemic issues that will cascade into other factors like compliance rate.
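To check whether errors are clustered or random, one simple approach is to compare each group's error rate with the overall rate. The sketch below assumes you can export tasks as (group, is_correct) pairs, where the group is a task type, an hour of day, or any label you choose. The function names and the 2x threshold are illustrative choices, not network rules.

```python
from collections import Counter


def error_rates_by_group(tasks):
    """Return {group: fraction of that group's tasks that were incorrect}."""
    totals = Counter()
    errors = Counter()
    for group, is_correct in tasks:
        totals[group] += 1
        if not is_correct:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def clustered_groups(tasks, factor=2.0):
    """Groups whose error rate is at least `factor` times the overall rate.

    The default of 2.0 is an arbitrary illustrative threshold.
    """
    tasks = list(tasks)
    if not tasks:
        return []
    overall = sum(1 for _, ok in tasks if not ok) / len(tasks)
    if overall == 0:
        return []
    rates = error_rates_by_group(tasks)
    return sorted(g for g, rate in rates.items() if rate >= factor * overall)


# Hypothetical log: refunds fail far more often than lookups
log = ([("refund", False)] * 8 + [("refund", True)] * 12
       + [("lookup", False)] * 2 + [("lookup", True)] * 78)
print(error_rates_by_group(log))
print(clustered_groups(log))
```

Here the overall error rate is 10%, refunds fail 40% of the time, and lookups fail 2.5% of the time, so only the refund group is flagged as a cluster.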

How Accuracy Interacts with Other Factors

Task accuracy does not exist in isolation. Low accuracy drags down the behavioral consistency factor (because inconsistent outputs often correlate with errors). It can trigger compliance violations if incorrect outputs breach regulatory requirements. And it affects the escalation judgment factor — agents that know they are uncertain and escalate appropriately lose fewer accuracy points than agents that confidently produce wrong answers.

The takeaway: accuracy is the foundation. If this factor is weak, every other factor suffers. If this factor is strong, it creates a rising tide across your entire trust score.

Use the Trust Score Calculator to model how improving your accuracy by even 2 percentage points would affect your overall score.
