Favur Evals
favur.dev

Multi-agent SDLC, evaluated.

We run the same statement of work through different LLM configurations, observing how each behaves inside Favur's agentic process. 13 runs so far, more coming.

🏆 TOP RUN

Highest composite across visible runs

Mimo Pro v2.5 · Jun 2026 · all-Xiaomi

Highest cache hit rate at 85.2%, driving cost efficiency.

1,060 reqs · 3h 13m · 0.0% failure rate

✨ SPECIALIZED LEADERS
Deepseek Flash v4 · composite 59.5

59.5 composite · 11.6× the cohort-median value

Avg across 2 runs

Glm 5.2 · composite 61.5

22.9 / 36 across Test + Code Quality

Claude Sonnet 4.6 · composite 46.5

11.9 / 12 Tool Discipline

Gemini Flash 3 (preview) · composite 60.1

6.9 / 12 Workflow & Reporting

Mimo Pro v2.5 · composite 67.9

0.0% failure rate across 1,060 requests

ALL RUNS

Ranked by composite score within circles · 9 runs visible

Run leaderboard, sorted by composite
Run | Composite | Volume | Configuration
Mimo Pro v2.5 | 67.9 out of 100 | 1,060 reqs · 3h 13m | all-Xiaomi
Glm 5.2 | 61.5 out of 100 | 1,251 reqs · 3h 47m | all-Z.ai
Deepseek Flash v4 | 60.3 out of 100 | 701 reqs · 1h 30m | all-DeepSeek
Gemini Flash 3 (preview) | 60.1 out of 100 | 809 reqs · 1h 5m | all-Google
Deepseek Flash v4 | 58.7 out of 100 | 917 reqs · 1h 36m | all-DeepSeek
Deepseek Pro v4 | 58.2 out of 100 | 1,041 reqs · 2h 35m | all-DeepSeek
Kimi k2.6 | 57.0 out of 100 | 1,091 reqs · 1h 43m | all-Moonshot AI
Gemini Pro 3.1 (preview) | 55.6 out of 100 | 505 reqs · 1h 11m | all-Google
Claude Sonnet 4.6 | 46.5 out of 100 | 1,260 reqs · 4h 14m | all-Anthropic

Bar length = composite score · Segment width = token share

SUBJECT BREAKDOWN

How each run earns its composite — subject by subject

[Chart: stacked bars showing each run's score per subject: Code /18, Test /18, Cost /14, Tools /12, Workflow /12, Effort /12, Process /8, Deliver /6.]
Subject breakdown (accessible mirror)
Run | Code /18 | Test /18 | Cost /14 | Tools /12 | Workflow /12 | Effort /12 | Process /8 | Deliver /6 | Composite
Mimo Pro v2.5 | 11.59 / 18 (64%) | 9.45 / 18 (53%) | 12.349 / 14 (88%) | 11.7 / 12 (98%) | 6.12 / 12 (51%) | 8.015 / 12 (67%) | 5.314 / 8 (66%) | 3.362 / 6 (56%) | 67.9
Deepseek Flash v4 | 8.017 / 18 (45%) | 10.549 / 18 (59%) | 8.875 / 14 (63%) | 9.989 / 12 (83%) | 6.587 / 12 (55%) | 7.123 / 12 (59%) | 4.565 / 8 (57%) | 4.529 / 6 (75%) | 60.2
Glm 5.2 | 13.77 / 18 (77%) | 9.12 / 18 (51%) | 6.581 / 14 (47%) | 11.785 / 12 (98%) | 6 / 12 (50%) | 6.903 / 12 (58%) | 5.549 / 8 (69%) | 1.785 / 6 (30%) | 61.5
Gemini Flash 3 (preview) | 7.898 / 18 (44%) | 10.11 / 18 (56%) | 5.756 / 14 (41%) | 10.303 / 12 (86%) | 7.327 / 12 (61%) | 5.768 / 12 (48%) | 4.427 / 8 (55%) | 3.721 / 6 (62%) | 55.3
Deepseek Pro v4 | 7.247 / 18 (40%) | 8.76 / 18 (49%) | 9.541 / 14 (68%) | 11.803 / 12 (98%) | 6 / 12 (50%) | 7.817 / 12 (65%) | 5.192 / 8 (65%) | 1.802 / 6 (30%) | 58.2
Kimi k2.6 | 10.631 / 18 (59%) | 9.101 / 18 (51%) | 7.044 / 14 (50%) | 9.878 / 12 (82%) | 6.218 / 12 (52%) | 6.961 / 12 (58%) | 5.239 / 8 (65%) | 1.976 / 6 (33%) | 57.0
Gemini Pro 3.1 (preview) | 7.667 / 18 (43%) | 8.1 / 18 (45%) | 7.702 / 14 (55%) | 11.251 / 12 (94%) | 6.491 / 12 (54%) | 7.125 / 12 (59%) | 5.6 / 8 (70%) | 1.712 / 6 (29%) | 55.6
Claude Sonnet 4.6 | 7.84 / 18 (44%) | 6.93 / 18 (39%) | 1.292 / 14 (9%) | 11.855 / 12 (99%) | 6 / 12 (50%) | 5.798 / 12 (48%) | 5.027 / 8 (63%) | 1.772 / 6 (30%) | 46.5

WHERE THE MONEY GOES

Across all visible runs — which agents and models drive the cost

Across the visible runs, spend concentrates in a handful of agents: Code (~26%), Code Review (~18%), Develop (~17%). The Code agent is the single largest cost driver in the current data.

Every Statement of Work runs the full Favur SDLC regardless of size — phase docs, sprint scoping, cycle reviews, code-review passes, document versioning — so the agents carrying those protocols dominate the bill, whatever the project.

Cost share aggregated across visible runs by agent and contributing model
Agent | Total % | Top contributing model | Top contribution %
Code | 25.893% | DeepSeek Deepseek Flash v4 | 12.675%
Code Review | 18.298% | DeepSeek Deepseek Pro v4 | 7.999%
Develop | 16.502% | DeepSeek Deepseek Flash v4 | 8.553%
Orchestrator | 15.858% | DeepSeek Deepseek Flash v4 | 5.848%
Sprint Review | 7.001% | DeepSeek Deepseek Flash v4 | 3.572%
Pseudocode | 3.87% | DeepSeek Deepseek Flash v4 | 1.774%
Sprint Plan | 2.808% | DeepSeek Deepseek Flash v4 | 1.075%
Scout | 2.52% | DeepSeek Deepseek Flash v4 | 1.702%
Test | 1.763% | Google Gemini Pro 3.1 (preview) | 0.746%
Build | 1.543% | DeepSeek Deepseek Flash v4 | 0.708%

BEHAVIOR FINGERPRINTS

How models behave when handling each agent role

Six axes per agent, each on an absolute scale: cache utilization, cost efficiency, throughput, responsiveness, tool intensity, and reasoning depth.

Behavior fingerprint (radar), code agent
Model | Cache Util | Cost Eff | Throughput | Speed | Tool Use | Reasoning
Anthropic Claude Sonnet 4.6 | 0.57 | 0.00 | 0.14 | 0.82 | 0.49 | 0.16
DeepSeek Deepseek Flash v4 | 0.72 | 0.83 | 1.00 | 0.77 | 0.59 | 0.31
DeepSeek Deepseek Pro v4 | 0.79 | 0.23 | 0.18 | 0.77 | 0.57 | 0.29
Google Gemini Flash 3 (preview) | 0.41 | 0.33 | 0.39 | 0.95 | 0.53 | 0.04
Google Gemini Pro 3.1 (preview) | 0.66 | 0.01 | 0.00 | 0.82 | 0.52 | 0.30
Moonshot AI Kimi k2.6 | 0.54 | 0.17 | 0.47 | 0.95 | 0.62 | 0.39
Xiaomi Mimo Pro v2.5 | 0.88 | 0.80 | 0.11 | 0.74 | 0.60 | 0.33
Z.ai Glm 5.2 | 0.70 | 0.09 | 0.22 | 0.58 | 0.59 | 0.31

QUICK STATS

Behavioral patterns across the cohort

Cache hit rate by run
Run | Cache Hit %
Xiaomi Mimo Pro v2.5 | 85.2%
DeepSeek Deepseek Pro v4 | 82.1%
DeepSeek Deepseek Flash v4 | 81.7%
DeepSeek Deepseek Flash v4 | 77.8%
DeepSeek Deepseek Flash v4 | 71.8%
DeepSeek Deepseek Flash v4 | 70.6%
Z.ai Glm 5.2 | 69.3%
DeepSeek Deepseek Flash v4 | 67.9%
Google Gemini Pro 3.1 (preview) | 54.1%
Moonshot AI Kimi k2.6 | 50.2%
Google Gemini Flash 3 (preview) | 50.0%
Anthropic Claude Sonnet 4.6 | 46.1%
Google Gemini Flash 3 (preview) | 35.4%
Reasoning depth by run
Run | Reasoning Tokens
Google Gemini Flash 3 (preview) | 2,623,402
DeepSeek Deepseek Flash v4 | 2,185,380
DeepSeek Deepseek Flash v4 | 1,716,085
DeepSeek Deepseek Flash v4 | 1,014,871
Z.ai Glm 5.2 | 265,221
Moonshot AI Kimi k2.6 | 251,819
Xiaomi Mimo Pro v2.5 | 201,003
DeepSeek Deepseek Pro v4 | 191,434
DeepSeek Deepseek Flash v4 | 151,585
Google Gemini Pro 3.1 (preview) | 124,400
DeepSeek Deepseek Flash v4 | 104,506
Anthropic Claude Sonnet 4.6 | 94,040
Google Gemini Flash 3 (preview) | 83,406
Tools per response by run
Run | Tools / Response
Moonshot AI Kimi k2.6 | 1.24
Xiaomi Mimo Pro v2.5 | 1.20
DeepSeek Deepseek Flash v4 | 1.20
DeepSeek Deepseek Flash v4 | 1.18
Z.ai Glm 5.2 | 1.18
DeepSeek Deepseek Flash v4 | 1.16
DeepSeek Deepseek Flash v4 | 1.16
DeepSeek Deepseek Flash v4 | 1.15
DeepSeek Deepseek Pro v4 | 1.13
Google Gemini Flash 3 (preview) | 1.07
Google Gemini Flash 3 (preview) | 1.05
Google Gemini Pro 3.1 (preview) | 1.04
Anthropic Claude Sonnet 4.6 | 0.98
Failure types by run
Run | Total % | HTTP | No-response | Empty
DeepSeek Deepseek Flash v4 | 1.2% | 1.2% | 0.0% | 0.0%
Google Gemini Pro 3.1 (preview) | 0.8% | 0.8% | 0.0% | 0.0%
DeepSeek Deepseek Flash v4 | 0.7% | 0.7% | 0.0% | 0.0%
DeepSeek Deepseek Flash v4 | 0.5% | 0.5% | 0.0% | 0.0%
Google Gemini Flash 3 (preview) | 0.4% | 0.4% | 0.0% | 0.0%
Moonshot AI Kimi k2.6 | 0.4% | 0.4% | 0.0% | 0.0%
DeepSeek Deepseek Flash v4 | 0.2% | 0.2% | 0.0% | 0.0%
DeepSeek Deepseek Flash v4 | 0.1% | 0.1% | 0.0% | 0.0%
Google Gemini Flash 3 (preview) | 0.0% | 0.0% | 0.0% | 0.0%
Xiaomi Mimo Pro v2.5 | 0.0% | 0.0% | 0.0% | 0.0%
Z.ai Glm 5.2 | 0.0% | 0.0% | 0.0% | 0.0%
DeepSeek Deepseek Pro v4 | 0.0% | 0.0% | 0.0% | 0.0%
Anthropic Claude Sonnet 4.6 | 0.0% | 0.0% | 0.0% | 0.0%

PER-AGENT LEADERBOARDS

Which model does each Favur agent role best?

Each card shows that role’s pooled mean with a 95% confidence interval. The pool is every appearance of the role across all models and SoWs, which is reasonable because roles behave consistently between SoWs. The interval is a Student-t bound, which stays honest at small samples, and it expresses how confident we are in the average; it tightens as more runs accrue. Click Score over time on any card to see the full per-run history.
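
As a rough illustration of the interval described above, the sketch below computes a pooled mean and a 95% Student-t half-width from a list of per-run role scores. The function name `pooled_mean_ci`, the sample scores, and the small critical-value lookup table are assumptions made for this example; the page does not publish the harness code.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom
# (standard table values; extend the table for larger samples).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
             7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179}


def pooled_mean_ci(scores):
    """Return (mean, half_width) of a 95% Student-t interval for the mean."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores for an interval")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("extend T_CRIT_95 for larger samples")
    mean = statistics.fmean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(n)  # sample std / sqrt(n)
    return mean, T_CRIT_95[df] * std_err


# Hypothetical role scores from three runs (not taken from the page).
mean, half = pooled_mean_ci([8.0, 9.0, 10.0])
print(f"Pooled mean {mean:.1f} ±{half:.2f}")
```

With n=13 the lookup uses 12 degrees of freedom, so the half-width is about 2.18 standard errors; the interval narrows as n grows because the standard error shrinks with the square root of n.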

Code

Writes the actual source files

Cohort median: 8.4 · 8 models tested

Pooled mean 8.5 ±0.14 · 95% CI, n=13

Test

Authors tests and validates the suite

Cohort median: 7.2 · 8 models tested

Pooled mean 7.2 ±0.18 · 95% CI, n=13

Build

Installs deps and runs the build pipeline

Cohort median: 8.4 · 8 models tested

Pooled mean 8.4 ±0.12 · 95% CI, n=13

Scout

Spot-checks the developing product

Cohort median: 6.9 · 7 models tested

Pooled mean 6.8 ±0.44 · 95% CI, n=12

Pseudocode

Plans implementation step-by-step before coding

Cohort median: 7.9 · 8 models tested

Pooled mean 7.8 ±0.19 · 95% CI, n=13

Sprint Plan

Decomposes the SoW into sprints and tasks

Cohort median: 6.7 · 8 models tested

Pooled mean 6.6 ±0.24 · 95% CI, n=13

Code Review

Reviews diffs against the sprint plan

Cohort median: 7.8 · 8 models tested

Pooled mean 7.9 ±0.14 · 95% CI, n=13

Develop

Coordinates the per-sprint development cycle

Cohort median: 8.3 · 8 models tested

Pooled mean 8.2 ±0.23 · 95% CI, n=13

code leaderboard, top models by Agent Performance Score
Rank | Model | Score
1 | Moonshot AI Kimi k2.6 | 8.8 of 10
2 | Xiaomi Mimo Pro v2.5 | 8.8 of 10
3 | DeepSeek Deepseek Flash v4 | 8.5 of 10

test leaderboard, top models by Agent Performance Score
Rank | Model | Score
1 | Moonshot AI Kimi k2.6 | 7.8 of 10
2 | Xiaomi Mimo Pro v2.5 | 7.3 of 10
3 | DeepSeek Deepseek Pro v4 | 7.3 of 10

build leaderboard, top models by Agent Performance Score
Rank | Model | Score
1 | Anthropic Claude Sonnet 4.6 | 8.6 of 10
2 | DeepSeek Deepseek Flash v4 | 8.6 of 10
3 | DeepSeek Deepseek Pro v4 | 8.5 of 10

scout leaderboard, top models by Agent Performance Score
Rank | Model | Score
1 | Xiaomi Mimo Pro v2.5 | 7.2 of 10
2 | Z.ai Glm 5.2 | 7.2 of 10
3 | DeepSeek Deepseek Pro v4 | 7.2 of 10

pseudocode leaderboard, top models by Agent Performance Score
Rank | Model | Score
1 | Google Gemini Pro 3.1 (preview) | 8.1 of 10
2 | Xiaomi Mimo Pro v2.5 | 8.0 of 10
3 | Moonshot AI Kimi k2.6 | 8.0 of 10

sprint-plan leaderboard, top models by Agent Performance Score
Rank | Model | Score
1 | Xiaomi Mimo Pro v2.5 | 7.1 of 10
2 | DeepSeek Deepseek Pro v4 | 7.1 of 10
3 | Google Gemini Pro 3.1 (preview) | 7.0 of 10

code-review leaderboard, top models by Agent Performance Score
Rank | Model | Score
1 | Xiaomi Mimo Pro v2.5 | 8.2 of 10
2 | DeepSeek Deepseek Pro v4 | 8.1 of 10
3 | DeepSeek Deepseek Flash v4 | 7.9 of 10

develop leaderboard, top models by Agent Performance Score
Rank | Model | Score
1 | DeepSeek Deepseek Pro v4 | 8.6 of 10
2 | Google Gemini Pro 3.1 (preview) | 8.6 of 10
3 | Xiaomi Mimo Pro v2.5 | 8.5 of 10

PER-MODEL PROFILES

How does each vendor's model perform on cost, tokens, cache, and reliability?

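ANTHROPIC

4.6

1 run · last seen 2026-06-28

Tokens total: 64.8M (64.1M in / 0.7M out)
Cache hit: 46.1%
Failure rate: 0.00%
Reasoning tokens: 94K
Tools per response: 0.98 (one tool/turn)

Best at: Build Agent (8.6)
Weakest: Orchestrator (5.9)

anthropic.com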

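DEEPSEEK

5 runs · last seen 2026-06-27

Tokens total: 1736.7M (1723.2M in / 13.5M out)
Cache hit: 70.0%
Failure rate: 0.27%
Reasoning tokens: 5172K
Tools per response: 1.17 (one tool/turn)

averaged across 5 runs

Best at: Code Agent (8.8)
Weakest: Sprint-Plan Agent (6.9)

deepseek.com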

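GOOGLE

2 runs · last seen 2026-06-28

Tokens total: 335.6M (333.6M in / 2.0M out)
Cache hit: 36.9%
Failure rate: 0.10%
Reasoning tokens: 2707K
Tools per response: 1.06 (one tool/turn)

averaged across 2 runs

Best at: Code Agent (8.6)
Weakest: Orchestrator (6.6)

ai.google.dev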

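MOONSHOT AI

k2.6

1 run · last seen 2026-06-27

Tokens total: 40.0M (39.5M in / 0.5M out)
Cache hit: 50.2%
Failure rate: 0.37%
Reasoning tokens: 252K
Tools per response: 1.24 (bundles tools)

Best at: Code Agent (8.8)
Weakest: Scout Agent (4.8)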

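XIAOMI

v2.5

1 run · last seen 2026-06-28

Tokens total: 49.3M (48.7M in / 0.5M out)
Cache hit: 85.2%
Failure rate: 0.00%
Reasoning tokens: 201K
Tools per response: 1.20 (bundles tools)

Best at: Code Agent (8.8)
Weakest: Orchestrator (6.9)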

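Z.AI

5.2

1 run · last seen 2026-06-28

Tokens total: 52.6M (51.8M in / 0.8M out)
Cache hit: 69.3%
Failure rate: 0.00%
Reasoning tokens: 265K
Tools per response: 1.18 (one tool/turn)

Best at: Develop Agent (8.3)
Weakest: Orchestrator (6.4)

z.ai
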
Anthropic 4.6 model profile
Metric | Value
Tokens total | 64,845,271
Cache hit % | 46.1%
Failure % | 0.00%
Reasoning tokens | 94,040
Tools per response | 0.98
Best at | Build Agent (8.6)
Weakest | Orchestrator (5.9)

DeepSeek Deepseek Flash v4 model profile
Metric | Value
Tokens total | 1,736,721,095
Cache hit % | 70.0%
Failure % | 0.27%
Reasoning tokens | 5,172,427
Tools per response | 1.17
Best at | Code Agent (8.8)
Weakest | Sprint-Plan Agent (6.9)

Google Gemini Flash 3 (preview) model profile
Metric | Value
Tokens total | 335,631,669
Cache hit % | 36.9%
Failure % | 0.10%
Reasoning tokens | 2,706,808
Tools per response | 1.06
Best at | Code Agent (8.6)
Weakest | Orchestrator (6.6)

Moonshot AI k2.6 model profile
Metric | Value
Tokens total | 40,006,693
Cache hit % | 50.2%
Failure % | 0.37%
Reasoning tokens | 251,819
Tools per response | 1.24
Best at | Code Agent (8.8)
Weakest | Scout Agent (4.8)

Xiaomi v2.5 model profile
Metric | Value
Tokens total | 49,272,732
Cache hit % | 85.2%
Failure % | 0.00%
Reasoning tokens | 201,003
Tools per response | 1.20
Best at | Code Agent (8.8)
Weakest | Orchestrator (6.9)

Z.ai 5.2 model profile
Metric | Value
Tokens total | 52,563,701
Cache hit % | 69.3%
Failure % | 0.00%
Reasoning tokens | 265,221
Tools per response | 1.18
Best at | Develop Agent (8.3)
Weakest | Orchestrator (6.4)

BROWSE THE CODE

Inspect what each model actually generated. Pick a Statement of Work, select a run, browse its files, and (on desktop) compare two runs side-by-side.


Methodology

How the score is calculated

Every number on this page is reproducible. A run’s composite is the sum of eight subject scores, each capped at a fixed weight; together the eight weights add up to 100. There is no gate, curve, or pass/fail line: a composite measures how well a run engineered its Statement of Work, not whether it “passed”.

Every subject is a weighted average of normalized component metrics. Each component is measured deterministically from artifacts the run already produced — static analysis of the generated code, the run’s own pytest results, and tool/transcript telemetry — then mapped onto a 0–1 scale where 1 is best. The subject’s 0–1 score is multiplied by its weight to give its point contribution, and the eight contributions are added.

If a component has no data for a run (e.g. no telemetry was attributed, or static analysis could not run), it drops out and the remaining component weights renormalize, so a run is never penalized for data the harness did not capture. Each subject records a coverage fraction — the share of its weight backed by real data — which the tooltips surface alongside the score.
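
The renormalization rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Favur's harness code: the function name `subject_score`, the dictionary shape, and the use of `None` for "no data" are all hypothetical.

```python
def subject_score(subject_weight, components):
    """Score one subject from its components.

    components maps a name to (component_weight, value), where value is a
    0..1 score or None when the harness captured no data (hypothetical shape).
    Returns (points, coverage); points is None if no component has data.
    """
    total_w = sum(w for w, _ in components.values())
    present = [(w, v) for w, v in components.values() if v is not None]
    present_w = sum(w for w, _ in present)
    if present_w == 0:
        return None, 0.0
    # Missing components drop out; the remaining weights renormalize to 1.
    normalized = sum(w * v for w, v in present) / present_w
    coverage = present_w / total_w  # share of weight backed by real data
    return subject_weight * normalized, coverage


# Example: Cost Efficiency (weight 14) with the caching term unavailable.
points, coverage = subject_score(14, {"cost": (0.6, 0.8), "cache": (0.4, None)})
print(round(points, 2), round(coverage, 2))
```

In this example the run keeps the 0.8 it earned on the cost term across the full 14 points, and the coverage fraction of 0.6 records that only 60% of the subject's weight was backed by data.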

Worked example: what does a 67.9 mean?

Take Mimo Pro v2.5 on circles, which scored 67.9. That number is the sum of these eight subject contributions. The contributions are shown rounded to one decimal, so the displayed figures add to 67.8; the unrounded values add to 67.9.

Subject | Contribution | Out of | Data coverage
Code Quality | 11.6 | 18 | 100%
Test Quality | 9.4 | 18 | 100%
Cost Efficiency | 12.3 | 14 | 100%
Tool Discipline | 11.7 | 12 | 100%
Workflow & Reporting | 6.1 | 12 | 100%
Effort Efficiency | 8.0 | 12 | 100%
Process Discipline | 5.3 | 8 | 100%
Deliverables | 3.4 | 6 | 100%
Composite | 67.9 | 100 |

Read it as: this run earned 11.6 of 18 for Code Quality, 9.4 of 18 for Test Quality, 12.3 of 14 for Cost Efficiency, 11.7 of 12 for Tool Discipline, 6.1 of 12 for Workflow & Reporting, 8.0 of 12 for Effort Efficiency, 5.3 of 8 for Process Discipline, and 3.4 of 6 for Deliverables. Before rounding, these add to 67.9 out of 100. “Data coverage” is the share of each subject’s weight that was backed by real measurements; where it is below 100%, the missing components renormalized away rather than counting as zero.
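
The arithmetic can be checked directly. The sketch below sums the unrounded Mimo Pro v2.5 contributions as they appear in the subject breakdown table on this page, and separately sums the one-decimal figures from the worked example; the variable names are illustrative.

```python
# Unrounded subject contributions for Mimo Pro v2.5 (subject breakdown table).
unrounded = {
    "Code Quality": 11.59,
    "Test Quality": 9.45,
    "Cost Efficiency": 12.349,
    "Tool Discipline": 11.7,
    "Workflow & Reporting": 6.12,
    "Effort Efficiency": 8.015,
    "Process Discipline": 5.314,
    "Deliverables": 3.362,
}
# The same contributions as displayed in the worked example (one decimal).
displayed = [11.6, 9.4, 12.3, 11.7, 6.1, 8.0, 5.3, 3.4]

composite = sum(unrounded.values())
displayed_sum = sum(displayed)

print(round(composite, 1))      # composite from unrounded contributions
print(round(displayed_sum, 1))  # sum of the rounded, displayed figures
```

The 0.1 gap between the two sums is a display-rounding effect only; the composite is computed from the unrounded contributions.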

Code Quality max 18

18 × (0.35·lint + 0.25·coverage + 0.20·complexity + 0.20·MI)

Measures the intrinsic quality of the code the run produced — independent of whether the tests happen to pass. It is the largest subject (tied with Test Quality) because durable, readable, low-defect code is the primary deliverable of an SDLC run.

Up to 18 points. Clean lint, high coverage, low complexity and a high maintainability index push this toward 18; messy or unmaintainable code pulls it toward 0.

Component | Weight | What it measures | Normalized to 0–1
Lint cleanliness | 35% | Number of ruff violations when the produced source is linted against one uniform ruleset (the same standard_ruff.toml for every run, so the bar is identical across models). | 0 errors → 1.0; 1–2 → 0.6; 3–5 → 0.2; 6–10 → 0.1; 11+ → 0.0. If the linter itself failed to launch, the component is treated as unavailable.
Test coverage | 25% | Line coverage percentage reported by the run’s own pytest run over the code it wrote. | coverage% ÷ 100, clamped to 0–1 (90% → 0.90).
Cyclomatic complexity | 20% | Average cyclomatic complexity of the produced source (radon): how branchy the average function is. Lower is better. | Banded: avg < 2 → 1.0; < 3 → 0.66; < 5 → 0.33; < 8 → 0.16; otherwise 0.0.
Maintainability index | 20% | Radon maintainability index (0–100) of the produced source, a blend of volume, complexity and lines of code. Higher is better. | MI ÷ 100, clamped to 0–1.
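
A minimal sketch of the Code Quality formula, using the bands and weights from the table above. The function names and the sample inputs are assumptions for illustration; the real harness obtains these inputs from ruff, pytest coverage and radon.

```python
def clamp01(x):
    """Clamp a value to the 0..1 range."""
    return max(0.0, min(1.0, x))


def lint_score(violations):
    """Band the ruff violation count: 0 -> 1.0, 1-2 -> 0.6, 3-5 -> 0.2, 6-10 -> 0.1."""
    if violations == 0:
        return 1.0
    if violations <= 2:
        return 0.6
    if violations <= 5:
        return 0.2
    if violations <= 10:
        return 0.1
    return 0.0


def complexity_score(avg_cc):
    """Band the average cyclomatic complexity (lower is better)."""
    for limit, score in ((2, 1.0), (3, 0.66), (5, 0.33), (8, 0.16)):
        if avg_cc < limit:
            return score
    return 0.0


def code_quality(violations, coverage_pct, avg_cc, maintainability_index):
    """18 x (0.35*lint + 0.25*coverage + 0.20*complexity + 0.20*MI)."""
    normalized = (0.35 * lint_score(violations)
                  + 0.25 * clamp01(coverage_pct / 100)
                  + 0.20 * complexity_score(avg_cc)
                  + 0.20 * clamp01(maintainability_index / 100))
    return 18 * normalized


# Hypothetical run: 1 lint violation, 92% coverage, avg complexity 2.5, MI 70.
print(round(code_quality(1, 92, 2.5, 70), 3))
```

Note how steep the lint band is: a single violation costs 0.4 of the lint component, which is 0.35 × 0.4 × 18 ≈ 2.5 points.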

Test Quality max 18

18 × (0.30·structure + 0.25·assertion_density + 0.20·test_density + 0.15·coverage_floor + 0.10·pass_rate)

Measures how good the run’s test suite actually is — not merely that it is green. Favur enforces an all-green suite and a ≥90% coverage floor, so raw pass-rate and coverage saturate near the top for every conformant run and cannot discriminate them. Test Quality therefore leans on structural signals parsed from the test code itself — how deeply each test asserts, how dense the suite is relative to the source, and how sophisticated its organisation is — and keeps coverage and pass-rate only as small, re-centred terms.

Up to 18 points. A large, deeply-asserting, well-structured suite with fixtures, parametrization, markers and integration tests approaches 18; a thin suite of a few shallow tests scores well below half even when it is fully green and over 90% covered.

Component | Weight | What it measures | Normalized to 0–1
Structural sophistication | 30% | Six good-practice signals extracted from the test code by AST parsing: reusable @pytest.fixture definitions, @pytest.mark.parametrize cases, organised grouping (test classes or many modules), a shared conftest.py, use of pytest markers, and dedicated integration/e2e test files. | Mean of the six axes (each 0–1): fixtures and parametrize ramp to 1.0 at 3+, organisation is 1.0 with ≥2 classes or ≥4 files (0.5 with ≥1 class or ≥2 files), and conftest / markers / integration are each 0 or 1.
Assertion depth | 25% | Average number of assertions per test (bare assert statements plus pytest.raises/warns context managers), counted across the suite. Deeper assertion per test means each test checks behaviour more thoroughly rather than smoke-testing that code runs. | Banded on assertions-per-test: <1 → 0.2, <1.5 → 0.5, <2 → 0.7, <3 → 0.9, 3+ → 1.0. A suite with no tests scores 0.
Test density | 20% | Number of test functions per 100 lines of source code: how broadly the suite exercises the codebase, independent of line coverage. | Banded on tests-per-100-LOC: <4 → 0.3, <7 → 0.5, <10 → 0.7, <14 → 0.85, 14+ → 1.0. Zero tests scores 0.
Coverage (90% floor) | 15% | The same line-coverage percentage used in Code Quality, but re-centred on Favur’s 90% floor so it rewards going beyond the requirement instead of saturating at it. | 0.5 + (coverage% − 90) / 20, clamped to 0–1: 90% is neutral (0.5), 100% earns the full 1.0, and 80% or below scores 0.0.
Test pass rate | 10% | Fraction of the run’s own pytest tests that pass, executed headlessly in a faithful sandbox copy of the run’s output. | Already 0–1; used directly (clamped). Low weight: a sanity floor so a broken suite still hurts, without dominating the subject.
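
The five components above can be combined as in the following sketch. It is an illustration only: the function names are hypothetical, the inputs are assumed to have been counted already (the real harness derives them by AST parsing), and "ramp to 1.0 at 3+" is read here as a linear ramp of count / 3, which the page does not spell out.

```python
def clamp01(x):
    return max(0.0, min(1.0, x))


def banded(value, steps, top):
    """Return the score of the first (upper_limit, score) step that value is below."""
    for limit, score in steps:
        if value < limit:
            return score
    return top


def structure_score(fixtures, parametrized, classes, files,
                    conftest, markers, integration):
    """Mean of six 0..1 axes. Assumption: fixtures/parametrize ramp as count / 3."""
    if classes >= 2 or files >= 4:
        organisation = 1.0
    elif classes >= 1 or files >= 2:
        organisation = 0.5
    else:
        organisation = 0.0
    axes = [min(fixtures / 3, 1.0), min(parametrized / 3, 1.0), organisation,
            float(conftest), float(markers), float(integration)]
    return sum(axes) / len(axes)


def score_test_quality(structure, assertions_per_test, tests, source_loc,
                       coverage_pct, pass_rate):
    """18 x weighted blend of the five components; assumes source_loc > 0."""
    if tests == 0:
        depth = density = 0.0  # a suite with no tests scores 0 on both
    else:
        depth = banded(assertions_per_test,
                       [(1, 0.2), (1.5, 0.5), (2, 0.7), (3, 0.9)], 1.0)
        density = banded(100 * tests / source_loc,
                         [(4, 0.3), (7, 0.5), (10, 0.7), (14, 0.85)], 1.0)
    coverage_floor = clamp01(0.5 + (coverage_pct - 90) / 20)
    normalized = (0.30 * structure + 0.25 * depth + 0.20 * density
                  + 0.15 * coverage_floor + 0.10 * clamp01(pass_rate))
    return 18 * normalized


# Hypothetical suite: 3 fixtures, 1 parametrize, 2 classes in 3 files, a conftest.
s = structure_score(3, 1, 2, 3, conftest=True, markers=False, integration=False)
print(round(score_test_quality(s, 2.2, 40, 500, 95, 1.0), 3))
```

Even this fully green, 95%-covered hypothetical suite lands at roughly three quarters of the available points, which matches the stated intent that pass rate and coverage alone cannot carry the subject.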

Cost Efficiency max 14

14 × (0.6·exp(−cost / 45) + 0.4·cache_savings_ratio)

Rewards delivering the SoW economically. It deliberately has no time term (wall-clock is dominated by external API latency, not the model’s skill) — it blends low absolute dollar cost with how effectively caching paid off: the net amount saved by cache reads at the provider’s discounted rate, less any cache-write premium, as a share of the no-cache bill.

Up to 14 points. The exponential decay means a very cheap run earns almost the full cost term, and a run whose caching meaningfully cut its bill earns the savings term on top — rewarding both spending little and caching well.

Component | Weight | What it measures | Normalized to 0–1
Run cost | 60% | Total cost of the run in USD, summed from the per-request token/cost aggregates in the run’s meta. (The site never renders the raw dollar figure, only the resulting score.) | Asymptotic exponential decay exp(−cost / 45): cost 0 → 1.0, and the score halves about every 31 cost units, approaching but never reaching 0.
Effective caching | 40% | Net dollars saved by caching as a share of the no-cache bill: cache reads priced at the provider’s discounted read rate save (input − read) per token, and any cache-write premium (write − input, e.g. on models that charge to populate the cache) is netted back out. Deepseek has no write charge, so its writes cost nothing. | cache_savings_usd ÷ nominal (no-cache) cost, clamped to 0–1: the fraction of the hypothetical un-cached bill that caching avoided.

The caching term needs per-model pricing + cache-token telemetry; a run without it drops the term and the cost term renormalizes to the full subject weight.
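
A rough sketch of the Cost Efficiency formula follows. The function names, parameter names and per-token rates are hypothetical; the page states the formula but not how the harness stores pricing, and the nominal (no-cache) cost is simply taken as an input here.

```python
import math

# exp(-cost / 45) halves every 45 * ln(2) cost units.
HALF_LIFE = 45 * math.log(2)


def net_cache_savings(read_tokens, write_tokens, input_rate, read_rate, write_rate):
    """Savings from cache reads minus any cache-write premium (rates per token)."""
    saved_on_reads = read_tokens * (input_rate - read_rate)
    write_premium = write_tokens * (write_rate - input_rate)
    return saved_on_reads - write_premium


def cost_efficiency(cost_usd, cache_savings_usd=None, nominal_cost_usd=None):
    """14 x (0.6*exp(-cost/45) + 0.4*cache_savings_ratio)."""
    cost_term = math.exp(-cost_usd / 45)
    if cache_savings_usd is None or not nominal_cost_usd:
        # No pricing or cache telemetry: the caching term drops out and the
        # cost term renormalizes to the full subject weight.
        return 14 * cost_term
    ratio = max(0.0, min(1.0, cache_savings_usd / nominal_cost_usd))
    return 14 * (0.6 * cost_term + 0.4 * ratio)


# Hypothetical pricing: 1M cached reads at a tenth of the input rate, free writes.
savings = net_cache_savings(1_000_000, 0, input_rate=1e-6, read_rate=1e-7, write_rate=1e-6)
print(round(HALF_LIFE, 2))
print(round(cost_efficiency(20.0, cache_savings_usd=savings, nominal_cost_usd=3.0), 2))
```

Setting `write_rate` equal to `input_rate` models a provider with no write premium, which is how the table describes Deepseek.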

Tool Discipline max 12

12 × (0.25·tool_err + 0.20·halluc + 0.20·retry + 0.20·first_try + 0.15·unauth)

Measures how cleanly the agents used their tools — the behavioral hygiene of the run. Frequent tool errors, hallucinated/invalid calls, retries and unauthorized-tool attempts all signal a model fighting its harness.

Up to 12 points. Clean, first-try, in-bounds tool use approaches 12; noisy or out-of-bounds tool use pulls it down.

Component | Weight | What it measures | Normalized to 0–1
Tool error rate | 25% | Share of tool calls that failed, including unauthorized-tool attempts ((failed + unauthorized) ÷ total tool calls). | Inverted: 1 − rate (fewer errors → higher score).
Hallucination rate | 20% | Share of tool calls that were schema-invalid (ToolValidationError: the model invoked an unknown tool or malformed arguments). Distinct from unauthorized attempts. | Inverted: 1 − rate.
Retry burden | 20% | Total tool-call retries ÷ total tool calls. | Inverted: 1 − rate.
First-try hit rate | 20% | Tool calls that succeeded on the first attempt (success and zero retries) ÷ total tool calls. | Already 0–1; used directly (higher → better).
Unauthorized-tool attempts | 15% | Count of attempts to call a tool the agent was not permitted to use, detected from error events. | 1 − 0.2 × attempts, floored at 0 (0 attempts → 1.0).

These are telemetry-derived. Today Favur emits tool-call events without an agent/role tag, so per-agent attribution is unavailable and these metrics are computed run-wide; when a metric has no data it renormalizes away. This is the one documented Favur-app data gap.
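
The five Tool Discipline components can be sketched from run-wide counters as below. The function name, the counter names, and the extra clamping of each rate to 0–1 (retries could in principle exceed calls) are assumptions for this example rather than documented harness behavior.

```python
def tool_discipline(total_calls, failed, unauthorized, schema_invalid,
                    retries, first_try_ok):
    """12 x (0.25*tool_err + 0.20*halluc + 0.20*retry + 0.20*first_try + 0.15*unauth).

    Returns None when there were no tool calls, i.e. no telemetry to score.
    """
    if total_calls == 0:
        return None

    def clamp01(x):
        return max(0.0, min(1.0, x))

    tool_err = clamp01(1 - (failed + unauthorized) / total_calls)
    halluc = clamp01(1 - schema_invalid / total_calls)
    retry = clamp01(1 - retries / total_calls)
    first_try = clamp01(first_try_ok / total_calls)
    unauth = max(0.0, 1 - 0.2 * unauthorized)  # each attempt costs 0.2
    return 12 * (0.25 * tool_err + 0.20 * halluc + 0.20 * retry
                 + 0.20 * first_try + 0.15 * unauth)


# Hypothetical run: 200 calls, 4 failures, 1 unauthorized attempt,
# 2 schema-invalid calls, 6 retries, 190 first-try successes.
print(round(tool_discipline(200, 4, 1, 2, 6, 190), 3))
```

The unauthorized-attempt term is a count rather than a rate, so five attempts zero it out regardless of how many calls the run made.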

Workflow & Reporting max 12

12 × (0.50·workflow_adherence + 0.50·report_quality)

Measures whether the agents followed Favur’s workflow contract and reported their work well — opening and closing work cycles correctly, and producing structured completion reports rather than terse or empty ones.

Up to 12 points, split evenly between following the process and documenting it.

Component | Weight | What it measures | Normalized to 0–1
Workflow adherence | 50% | Per agent, whether it both opened a work cycle (work_begin) and closed it with a terminal tool (work_end / strategy_done / completion_report), averaged across agents. | Already 0–1 (mean of per-agent adherence); used directly.
Report quality | 50% | Structure and substance of each agent’s best terminal report (presence and length of result/summary text plus citations, decisions and cautions), averaged across agents. | Already 0–1 (structured-report score); used directly.
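
As a hedged illustration of the adherence half of this subject, the sketch below checks each agent's tool calls for a `work_begin` and at least one terminal tool, then blends the result with a report-quality score. The `calls_by_agent` shape and function names are hypothetical, only presence (not ordering) is checked, and report quality is taken as a given 0–1 input because the page does not publish its rubric in detail.

```python
TERMINAL_TOOLS = {"work_end", "strategy_done", "completion_report"}


def workflow_adherence(calls_by_agent):
    """Mean over agents of: opened a cycle AND closed it with a terminal tool.

    calls_by_agent maps an agent name to the list of tool names it called.
    """
    flags = []
    for calls in calls_by_agent.values():
        opened = "work_begin" in calls
        closed = bool(TERMINAL_TOOLS & set(calls))
        flags.append(1.0 if opened and closed else 0.0)
    return sum(flags) / len(flags)


def workflow_reporting(adherence, report_quality):
    """12 x (0.50*workflow_adherence + 0.50*report_quality)."""
    return 12 * (0.50 * adherence + 0.50 * report_quality)


# Hypothetical transcript summary: the scout never closed its work cycle.
calls = {
    "code": ["work_begin", "write_file", "work_end"],
    "test": ["work_begin", "run_tests", "completion_report"],
    "scout": ["work_begin", "read_file"],
}
adherence = workflow_adherence(calls)
print(round(workflow_reporting(adherence, report_quality=0.5), 2))
```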

Effort Efficiency max 12

12 × (0.30·turns + 0.30·token_econ + 0.20·tool_breadth + 0.20·ctx_window)

Measures economy of effort per unit of output — reaching completion in few turns, with lean token spend per request, a healthy breadth of tools, and without crowding the context window.

Up to 12 points. Efficient, focused runs score high; runs that thrash, over-spend tokens, or rely on a single tool score low.

Component | Weight | What it measures | Normalized to 0–1
Turns to completion | 30% | Total agent sessions the run took (floored at 1). Fewer is better. | Band: 1 session → 1.0, 15 sessions → 0.0 (linear).
Token economy | 30% | Average tokens per request (total tokens ÷ requests). Fewer is better. | Band: 6,000 tok/req → 1.0, 50,000 → 0.0 (linear).
Tool breadth | 20% | Count of distinct tools the run used. Broader use signals fluency with the harness rather than hammering one tool. | Band: 12 distinct tools → 1.0, 1 → 0.0 (linear).
Context-window usage | 20% | Largest single-request prompt size as a share of the model’s context window (default 200k tokens). Lower means more headroom. | Band: 25% of window → 1.0, 90% → 0.0 (linear).

These are scale metrics with no natural 0–1 form, so each maps through a manifest normalization band (a “good” value scores 1.0, a “poor” value 0.0, linear between).
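
The linear bands described above reduce to one small helper. The sketch below is an illustration with hypothetical function and parameter names; it assumes values outside the band are clamped to 0 or 1, which the phrase "linear between" implies but does not state.

```python
def linear_band(value, good, poor):
    """Map value linearly so that good -> 1.0 and poor -> 0.0, clamped to 0..1.

    Works in both directions: good may be smaller or larger than poor.
    """
    return max(0.0, min(1.0, (poor - value) / (poor - good)))


def effort_efficiency(sessions, total_tokens, requests, distinct_tools,
                      largest_prompt_tokens, context_window=200_000):
    """12 x (0.30*turns + 0.30*token_econ + 0.20*tool_breadth + 0.20*ctx_window)."""
    turns = linear_band(max(sessions, 1), good=1, poor=15)
    token_econ = linear_band(total_tokens / requests, good=6_000, poor=50_000)
    breadth = linear_band(distinct_tools, good=12, poor=1)
    ctx = linear_band(largest_prompt_tokens / context_window, good=0.25, poor=0.90)
    return 12 * (0.30 * turns + 0.30 * token_econ + 0.20 * breadth + 0.20 * ctx)


# Hypothetical run: 8 sessions, 28k tokens per request, 12 tools,
# largest prompt at 57.5% of a 200k window.
print(round(effort_efficiency(8, 28_000_000, 1_000, 12, 115_000), 2))
```

Because the helper takes the "good" and "poor" endpoints explicitly, the same function covers lower-is-better metrics (sessions, tokens per request, window share) and the higher-is-better tool breadth.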

Process Discipline max 8

8 × (0.50·command_success_rate + 0.25·verification_density + 0.25·structured_workflow)

Measures how rigorously the run drove its own engineering workflow, read from the telemetry event stream: whether the commands it ran actually succeeded, how densely it verified its own work, and whether it used the structured sprint/phase/delegation process rather than running flat.

Up to 8 points. A run with a high command-success rate, dense verification and a full sprint/phase/delegation structure approaches 8; a run that thrashes failing commands and never verifies or structures its work scores low.

Component | Weight | What it measures | Normalized to 0–1
Command success rate | 50% | Share of shell commands the run executed that exited successfully (command successes ÷ commands run), from the telemetry command events. A genuinely discriminating signal: baseline runs sit at 34–41%. | Already 0–1; used directly (clamped).
Verification density | 25% | Verification events per 100 commands: how often the run checked its own work relative to how much it did. | Band: 20 verifications / 100 commands → 1.0, 0 → 0.0 (linear).
Structured workflow | 25% | Whether the run used the structured workflow at all, credited for opening at least one sprint, at least one phase, and at least one delegation (one third each). | Mean of the three 0/1 signals.

Telemetry-derived. A run with no `.favur/telemetry` event stream drops every component and the subject renormalizes away rather than scoring 0.
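
A minimal sketch of the Process Discipline formula, computed from event counts. The function and parameter names are hypothetical, and returning `None` for a run with no command events is this example's stand-in for the subject "renormalizing away".

```python
def process_discipline(commands, successes, verifications,
                       opened_sprint, opened_phase, delegated):
    """8 x (0.50*command_success_rate + 0.25*verification_density + 0.25*structured_workflow)."""
    if commands == 0:
        return None  # no command telemetry to score

    def clamp01(x):
        return max(0.0, min(1.0, x))

    success_rate = clamp01(successes / commands)
    per_100 = 100 * verifications / commands
    density = clamp01(per_100 / 20)  # 20 verifications per 100 commands -> 1.0
    structured = (int(opened_sprint) + int(opened_phase) + int(delegated)) / 3
    return 8 * (0.50 * success_rate + 0.25 * density + 0.25 * structured)


# Hypothetical run: 200 commands, 80 succeeded, 20 verifications,
# sprints and phases opened but no delegation.
print(round(process_discipline(200, 80, 20, True, True, False), 3))
```

With command success at the 40% level the page cites for baseline runs, half of the subject's weight yields only 1.6 of its 4 available points, which is why this component discriminates between runs.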

Deliverables max 6

6 × (0.40·completeness + 0.30·output_volume + 0.30·cost_per_loc)

Measures the shipped output recorded in the run’s deliverables manifest — whether it produced a complete set of artifacts, how much code it shipped, and how economically (cost per line of code).

Up to 6 points. A run that ships source, tests and docs in healthy volume at a low cost-per-LOC approaches 6; a thin or expensive-per-LOC run scores low.

Component | Weight | What it measures | Normalized to 0–1
Completeness | 40% | Whether the run shipped each of the three artifact kinds (source, tests and docs), counted as the fraction of the three present. | Fraction of {source, test, doc} kinds present (0, ⅓, ⅔, 1).
Output volume | 30% | Shipped lines of code: source + test LOC from the manifest, de-duplicated by content hash (docs are excluded so they cannot dominate the volume). | Band: 8,000 code LOC → 1.0, 200 → 0.0 (linear).
Cost per LOC | 30% | Run cost divided by shipped code LOC: how expensive each line of delivered code was. Lower is better. | Band: 0.001 / LOC → 1.0, 0.02 / LOC → 0.0 (linear).

A run without a `meta.deliverables` manifest drops every component and the subject renormalizes away. The cost-per-LOC term also drops when cost or code LOC is unavailable.
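
The Deliverables formula, including the case where the cost-per-LOC term drops out, can be sketched as follows. Function and parameter names are hypothetical, and the manifest is reduced here to a set of artifact kinds plus a code LOC count.

```python
def linear_band(value, good, poor):
    """Map value linearly so that good -> 1.0 and poor -> 0.0, clamped to 0..1."""
    return max(0.0, min(1.0, (poor - value) / (poor - good)))


def deliverables(kinds_present, code_loc, cost=None):
    """6 x (0.40*completeness + 0.30*output_volume + 0.30*cost_per_loc).

    The cost-per-LOC term is skipped when cost or code LOC is unavailable,
    and the remaining weights renormalize.
    """
    completeness = len({"source", "test", "doc"} & set(kinds_present)) / 3
    parts = [
        (0.40, completeness),
        (0.30, linear_band(code_loc, good=8_000, poor=200)),
    ]
    if cost is not None and code_loc:
        parts.append((0.30, linear_band(cost / code_loc, good=0.001, poor=0.02)))
    total_weight = sum(w for w, _ in parts)
    return 6 * sum(w * v for w, v in parts) / total_weight


# Hypothetical run: all three artifact kinds, 4,100 code LOC, cost of 43.05.
print(round(deliverables({"source", "test", "doc"}, 4_100, cost=43.05), 2))
print(round(deliverables({"source", "test", "doc"}, 4_100), 2))  # cost unknown
```

In the second call the subject is scored on completeness and volume alone, so the run is neither rewarded nor penalized for the missing cost data.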