subagentevaluations

.com durable evaluation primitive
grounding

The real event schema this table generalizes

Managed Agents outcome grading is not a black box — it emits a typed sequence of spans as the grader runs: span.outcome_evaluation_start, one or more span.outcome_evaluation_ongoing events while the grader works, and a terminal span.outcome_evaluation_end carrying the final verdict. Per platform.claude.com/docs/en/managed-agents/define-outcomes.md: "the harness automatically provisions a grader to evaluate the artifact against a rubric... the grader returns an explanation summarizing which criteria passed or failed."

The result enum, unmodified

The terminal event's result is one of five values, and this table's result column is a CHECK constraint against exactly that enum, unmodified:

resultmeaning
satisfiedThe grader judged the artifact met the rubric.
needs_revisionThe grader found gaps; another iteration is expected.
max_iterations_reachedThe iteration budget ran out before the rubric was satisfied.
failedThe grader could not evaluate, or evaluation itself failed.
interruptedThe evaluation run was stopped before completion.

Why iteration is a first-class column

The real event schema is iterative by design — a needs_revision result feeds back into another attempt, tracked by an incrementing iteration counter, until either satisfied or max_iterations_reached is reached. This table's iteration column (default 0) exists so a future re-grading pass over the same target_site can be stored as a new row rather than overwriting history — every prior verdict stays queryable.

Explanation, not just a score

The schema deliberately pairs a pass/fail-shaped result with free-text: "the grader returns an explanation summarizing which criteria passed or failed." This table mirrors that exactly — explanation is a required TEXT NOT NULL column, never optional, alongside the numeric criteria_passed / criteria_total pair. A bare 15/15 tells you the shape of the verdict; the explanation is what makes it checkable by a human six months from now.

The pattern this follows

This is the same move subagenttasks.com made generalizing the literal TaskCreate/TaskUpdate/TaskList tool schema, and subagentcitations.com made generalizing the real Citations API response shape. Take a schema Anthropic already ships in production, map columns to it field-for-field, and seed with real rows instead of synthetic ones.