FAVUR EVALS

Multi-agent SDLC, evaluated.

We run the same statement of work through different LLM configurations, observing how each behaves inside Favur's agentic process. 13 runs so far, more coming.

favur.dev

🏆 TOP RUN

Highest composite across visible runs

Mimo Pro v2.5 · Jun 2026 · all-Xiaomi

Highest cache hit rate at 85.2%, driving cost efficiency.

1,060 reqs · 3h 13m · 0.0% failure rate

✨ SPECIALIZED LEADERS

Deepseek Flash v4 · 59.5

59.5 composite · 11.6× the cohort-median value

Avg across 2 runs

Glm 5.2 · 61.5

22.9 / 36 across Test + Code Quality

Claude Sonnet 4.6 · 46.5

11.9 / 12 Tool Discipline

Gemini Flash 3 (preview) · 60.1

6.9 / 12 Workflow & Reporting

Mimo Pro v2.5 · 67.9

0.0% failure rate across 1,060 requests

ALL RUNS

Ranked by composite score within circles · 9 runs visible

Run leaderboard, sorted by composite
Run	Composite	Volume	Configuration
Mimo Pro v2.5	67.9 out of 100	1,060 reqs · 3h 13m	all-Xiaomi
Glm 5.2	61.5 out of 100	1,251 reqs · 3h 47m	all-Z.ai
Deepseek Flash v4	60.3 out of 100	701 reqs · 1h 30m	all-DeepSeek
Gemini Flash 3 (preview)	60.1 out of 100	809 reqs · 1h 5m	all-Google
Deepseek Flash v4	58.7 out of 100	917 reqs · 1h 36m	all-DeepSeek
Deepseek Pro v4	58.2 out of 100	1,041 reqs · 2h 35m	all-DeepSeek
Kimi k2.6	57.0 out of 100	1,091 reqs · 1h 43m	all-Moonshot AI
Gemini Pro 3.1 (preview)	55.6 out of 100	505 reqs · 1h 11m	all-Google
Claude Sonnet 4.6	46.5 out of 100	1,260 reqs · 4h 14m	all-Anthropic

SUBJECT BREAKDOWN

How each run earns its composite — subject by subject

Subject breakdown — accessible mirror
Run	Code /18	Test /18	Cost /14	Tools /12	Workflow /12	Effort /12	Process /8	Deliver /6	Composite
Mimo Pro v2.5	11.59 / 18 (64%)	9.45 / 18 (53%)	12.349 / 14 (88%)	11.7 / 12 (98%)	6.12 / 12 (51%)	8.015 / 12 (67%)	5.314 / 8 (66%)	3.362 / 6 (56%)	67.9
Deepseek Flash v4	8.017 / 18 (45%)	10.549 / 18 (59%)	8.875 / 14 (63%)	9.989 / 12 (83%)	6.587 / 12 (55%)	7.123 / 12 (59%)	4.565 / 8 (57%)	4.529 / 6 (75%)	60.2
Glm 5.2	13.77 / 18 (77%)	9.12 / 18 (51%)	6.581 / 14 (47%)	11.785 / 12 (98%)	6 / 12 (50%)	6.903 / 12 (58%)	5.549 / 8 (69%)	1.785 / 6 (30%)	61.5
Gemini Flash 3 (preview)	7.898 / 18 (44%)	10.11 / 18 (56%)	5.756 / 14 (41%)	10.303 / 12 (86%)	7.327 / 12 (61%)	5.768 / 12 (48%)	4.427 / 8 (55%)	3.721 / 6 (62%)	55.3
Deepseek Pro v4	7.247 / 18 (40%)	8.76 / 18 (49%)	9.541 / 14 (68%)	11.803 / 12 (98%)	6 / 12 (50%)	7.817 / 12 (65%)	5.192 / 8 (65%)	1.802 / 6 (30%)	58.2
Kimi k2.6	10.631 / 18 (59%)	9.101 / 18 (51%)	7.044 / 14 (50%)	9.878 / 12 (82%)	6.218 / 12 (52%)	6.961 / 12 (58%)	5.239 / 8 (65%)	1.976 / 6 (33%)	57.0
Gemini Pro 3.1 (preview)	7.667 / 18 (43%)	8.1 / 18 (45%)	7.702 / 14 (55%)	11.251 / 12 (94%)	6.491 / 12 (54%)	7.125 / 12 (59%)	5.6 / 8 (70%)	1.712 / 6 (29%)	55.6
Claude Sonnet 4.6	7.84 / 18 (44%)	6.93 / 18 (39%)	1.292 / 14 (9%)	11.855 / 12 (99%)	6 / 12 (50%)	5.798 / 12 (48%)	5.027 / 8 (63%)	1.772 / 6 (30%)	46.5

WHERE THE MONEY GOES

Across all visible runs — which agents and models drive the cost

Across the visible runs, spend concentrates in a handful of agents: Code (~26%), Code Review (~18%), Develop (~17%). The Code agent is the single largest cost driver in the current data.

Every Statement of Work runs the full Favur SDLC regardless of size — phase docs, sprint scoping, cycle reviews, code-review passes, document versioning — so the agents carrying those protocols dominate the bill, whatever the project.

See for yourself:

Cost share aggregated across visible runs by agent and contributing model
Agent	Total %	Top contributing model	Top contribution %
Code	25.893%	DeepSeek Deepseek Flash v4	12.675%
Code Review	18.298%	DeepSeek Deepseek Pro v4	7.999%
Develop	16.502%	DeepSeek Deepseek Flash v4	8.553%
Orchestrator	15.858%	DeepSeek Deepseek Flash v4	5.848%
Sprint Review	7.001%	DeepSeek Deepseek Flash v4	3.572%
Pseudocode	3.87%	DeepSeek Deepseek Flash v4	1.774%
Sprint Plan	2.808%	DeepSeek Deepseek Flash v4	1.075%
Scout	2.52%	DeepSeek Deepseek Flash v4	1.702%
Test	1.763%	Google Gemini Pro 3.1 (preview)	0.746%
Build	1.543%	DeepSeek Deepseek Flash v4	0.708%

BEHAVIOR FINGERPRINTS

How models behave when handling each agent role

Six axes per agent, each on an absolute scale: cache utilization, cost efficiency, throughput, responsiveness, tool intensity, and reasoning depth.

Behavior fingerprint — radar — code
Model	Cache Util	Cost Eff	Throughput	Speed	Tool Use	Reasoning
Anthropic Claude Sonnet 4.6	0.57	0.00	0.14	0.82	0.49	0.16
DeepSeek Deepseek Flash v4	0.72	0.83	1.00	0.77	0.59	0.31
DeepSeek Deepseek Pro v4	0.79	0.23	0.18	0.77	0.57	0.29
Google Gemini Flash 3 (preview)	0.41	0.33	0.39	0.95	0.53	0.04
Google Gemini Pro 3.1 (preview)	0.66	0.01	0.00	0.82	0.52	0.30
Moonshot AI Kimi k2.6	0.54	0.17	0.47	0.95	0.62	0.39
Xiaomi Mimo Pro v2.5	0.88	0.80	0.11	0.74	0.60	0.33
Z.ai Glm 5.2	0.70	0.09	0.22	0.58	0.59	0.31

QUICK STATS

Behavioral patterns across the cohort

Cache hit rate by run
Run	Cache Hit %
Xiaomi Mimo Pro v2.5	85.2%
DeepSeek Deepseek Pro v4	82.1%
DeepSeek Deepseek Flash v4	81.7%
DeepSeek Deepseek Flash v4	77.8%
DeepSeek Deepseek Flash v4	71.8%
DeepSeek Deepseek Flash v4	70.6%
Z.ai Glm 5.2	69.3%
DeepSeek Deepseek Flash v4	67.9%
Google Gemini Pro 3.1 (preview)	54.1%
Moonshot AI Kimi k2.6	50.2%
Google Gemini Flash 3 (preview)	50.0%
Anthropic Claude Sonnet 4.6	46.1%
Google Gemini Flash 3 (preview)	35.4%

Reasoning depth by run
Run	Reasoning Tokens
Google Gemini Flash 3 (preview)	2,623,402
DeepSeek Deepseek Flash v4	2,185,380
DeepSeek Deepseek Flash v4	1,716,085
DeepSeek Deepseek Flash v4	1,014,871
Z.ai Glm 5.2	265,221
Moonshot AI Kimi k2.6	251,819
Xiaomi Mimo Pro v2.5	201,003
DeepSeek Deepseek Pro v4	191,434
DeepSeek Deepseek Flash v4	151,585
Google Gemini Pro 3.1 (preview)	124,400
DeepSeek Deepseek Flash v4	104,506
Anthropic Claude Sonnet 4.6	94,040
Google Gemini Flash 3 (preview)	83,406

Tools per response by run
Run	Tools / Response
Moonshot AI Kimi k2.6	1.24
Xiaomi Mimo Pro v2.5	1.20
DeepSeek Deepseek Flash v4	1.20
DeepSeek Deepseek Flash v4	1.18
Z.ai Glm 5.2	1.18
DeepSeek Deepseek Flash v4	1.16
DeepSeek Deepseek Flash v4	1.16
DeepSeek Deepseek Flash v4	1.15
DeepSeek Deepseek Pro v4	1.13
Google Gemini Flash 3 (preview)	1.07
Google Gemini Flash 3 (preview)	1.05
Google Gemini Pro 3.1 (preview)	1.04
Anthropic Claude Sonnet 4.6	0.98

Failure types by run
Run	Total %	HTTP	No-response	Empty
DeepSeek Deepseek Flash v4	1.2%	1.2%	0.0%	0.0%
Google Gemini Pro 3.1 (preview)	0.8%	0.8%	0.0%	0.0%
DeepSeek Deepseek Flash v4	0.7%	0.7%	0.0%	0.0%
DeepSeek Deepseek Flash v4	0.5%	0.5%	0.0%	0.0%
Google Gemini Flash 3 (preview)	0.4%	0.4%	0.0%	0.0%
Moonshot AI Kimi k2.6	0.4%	0.4%	0.0%	0.0%
DeepSeek Deepseek Flash v4	0.2%	0.2%	0.0%	0.0%
DeepSeek Deepseek Flash v4	0.1%	0.1%	0.0%	0.0%
Google Gemini Flash 3 (preview)	0.0%	0.0%	0.0%	0.0%
Xiaomi Mimo Pro v2.5	0.0%	0.0%	0.0%	0.0%
Z.ai Glm 5.2	0.0%	0.0%	0.0%	0.0%
DeepSeek Deepseek Pro v4	0.0%	0.0%	0.0%	0.0%
Anthropic Claude Sonnet 4.6	0.0%	0.0%	0.0%	0.0%

PER-AGENT LEADERBOARDS

Which model does each Favur agent role best?

Each card shows that role’s pooled mean ± 95% confidence interval: every appearance of the role across all models and SoWs, since roles behave consistently between SoWs. The interval (a Student-t bound, honest at small samples) is how confident we are in the average — it tightens as more runs accrue. Click Score over time on any card to see the full per-run history.
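As a rough illustration of how a pooled mean with a Student-t 95% interval can be computed, here is a minimal sketch. The sample scores are invented for the example, the function name is hypothetical, and the hard-coded critical values cover only the sample sizes shown on this page (n=12 and n=13). It is not Favur's actual code.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
# Only the sample sizes seen on this page are included; extend as needed.
T_CRIT_95 = {11: 2.201, 12: 2.179}


def pooled_mean_ci(scores):
    """Return (mean, half_width) of a 95% Student-t interval for the mean."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    return mean, T_CRIT_95[n - 1] * sem


# Hypothetical per-run scores for one role (13 appearances), not real data.
scores = [8.4, 8.8, 8.5, 8.2, 8.6, 8.3, 8.7, 8.4, 8.5, 8.9, 8.1, 8.6, 8.5]
mean, half = pooled_mean_ci(scores)
print(f"Pooled mean {mean:.1f} ±{half:.2f} · 95% CI, n={len(scores)}")
```

The half-width shrinks with the square root of n, which is why the interval tightens as more runs accrue.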

Code

Writes the actual source files

Cohort median: 8.4 · 8 models tested

Pooled mean 8.5 ±0.14 · 95% CI, n=13

Test

Authors tests and validates the suite

Cohort median: 7.2 · 8 models tested

Pooled mean 7.2 ±0.18 · 95% CI, n=13

Build

Installs deps and runs the build pipeline

Cohort median: 8.4 · 8 models tested

Pooled mean 8.4 ±0.12 · 95% CI, n=13

Scout

Spot-checks the developing product

Cohort median: 6.9 · 7 models tested

Pooled mean 6.8 ±0.44 · 95% CI, n=12

Pseudocode

Plans implementation step-by-step before coding

Cohort median: 7.9 · 8 models tested

Pooled mean 7.8 ±0.19 · 95% CI, n=13

Sprint Plan

Decomposes the SoW into sprints and tasks

Cohort median: 6.7 · 8 models tested

Pooled mean 6.6 ±0.24 · 95% CI, n=13

Code Review

Reviews diffs against the sprint plan

Cohort median: 7.8 · 8 models tested

Pooled mean 7.9 ±0.14 · 95% CI, n=13

Develop

Coordinates the per-sprint development cycle

Cohort median: 8.3 · 8 models tested

Pooled mean 8.2 ±0.23 · 95% CI, n=13

code leaderboard, top models by Agent Performance Score
Rank	Model	Score
1	Moonshot AI Kimi k2.6	8.8 of 10
2	Xiaomi Mimo Pro v2.5	8.8 of 10
3	DeepSeek Deepseek Flash v4	8.5 of 10

test leaderboard, top models by Agent Performance Score
Rank	Model	Score
1	Moonshot AI Kimi k2.6	7.8 of 10
2	Xiaomi Mimo Pro v2.5	7.3 of 10
3	DeepSeek Deepseek Pro v4	7.3 of 10

build leaderboard, top models by Agent Performance Score
Rank	Model	Score
1	Anthropic Claude Sonnet 4.6	8.6 of 10
2	DeepSeek Deepseek Flash v4	8.6 of 10
3	DeepSeek Deepseek Pro v4	8.5 of 10

scout leaderboard, top models by Agent Performance Score
Rank	Model	Score
1	Xiaomi Mimo Pro v2.5	7.2 of 10
2	Z.ai Glm 5.2	7.2 of 10
3	DeepSeek Deepseek Pro v4	7.2 of 10

pseudocode leaderboard, top models by Agent Performance Score
Rank	Model	Score
1	Google Gemini Pro 3.1 (preview)	8.1 of 10
2	Xiaomi Mimo Pro v2.5	8.0 of 10
3	Moonshot AI Kimi k2.6	8.0 of 10

sprint-plan leaderboard, top models by Agent Performance Score
Rank	Model	Score
1	Xiaomi Mimo Pro v2.5	7.1 of 10
2	DeepSeek Deepseek Pro v4	7.1 of 10
3	Google Gemini Pro 3.1 (preview)	7.0 of 10

code-review leaderboard, top models by Agent Performance Score
Rank	Model	Score
1	Xiaomi Mimo Pro v2.5	8.2 of 10
2	DeepSeek Deepseek Pro v4	8.1 of 10
3	DeepSeek Deepseek Flash v4	7.9 of 10

develop leaderboard, top models by Agent Performance Score
Rank	Model	Score
1	DeepSeek Deepseek Pro v4	8.6 of 10
2	Google Gemini Pro 3.1 (preview)	8.6 of 10
3	Xiaomi Mimo Pro v2.5	8.5 of 10

PER-MODEL PROFILES

How does each vendor's model perform on cost, tokens, cache, and reliability?

ANTHROPIC

4.6

1 run · last seen 2026-06-28

Tokens total: 64.8M (64.1M in / 0.7M out)
Cache hit %: 46.1%
Failure %: 0.00%
Reasoning tokens: 94K
Tools per response: 0.98 (one tool/turn)

Best at: Build Agent (8.6)

Weakest: Orchestrator (5.9)

anthropic.com ↗

DEEPSEEK

5 runs · last seen 2026-06-27

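Tokens total: 1736.7M (1723.2M in / 13.5M out)
Cache hit %: 70.0%
Failure %: 0.27%
Reasoning tokens: 5172K
Tools per response: 1.17 (one tool/turn)

averaged across 5 runs

Best at: Code Agent (8.8)

Weakest: Sprint-Plan Agent (6.9)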

deepseek.com ↗

GOOGLE

2 runs · last seen 2026-06-28

Tokens total: 335.6M (333.6M in / 2.0M out)
Cache hit %: 36.9%
Failure %: 0.10%
Reasoning tokens: 2707K
Tools per response: 1.06 (one tool/turn)

averaged across 2 runs

Best at: Code Agent (8.6)

Weakest: Orchestrator (6.6)

ai.google.dev ↗

MOONSHOT AI

k2.6

1 run · last seen 2026-06-27

Tokens total: 40.0M (39.5M in / 0.5M out)
Cache hit %: 50.2%
Failure %: 0.37%
Reasoning tokens: 252K
Tools per response: 1.24 (bundles tools)

Best at: Code Agent (8.8)

Weakest: Scout Agent (4.8)

XIAOMI

v2.5

1 run · last seen 2026-06-28

Tokens total: 49.3M (48.7M in / 0.5M out)
Cache hit %: 85.2%
Failure %: 0.00%
Reasoning tokens: 201K
Tools per response: 1.20 (bundles tools)

Best at: Code Agent (8.8)

Weakest: Orchestrator (6.9)

Z.AI

5.2

1 run · last seen 2026-06-28

Tokens total: 52.6M (51.8M in / 0.8M out)
Cache hit %: 69.3%
Failure %: 0.00%
Reasoning tokens: 265K
Tools per response: 1.18 (one tool/turn)

Best at: Develop Agent (8.3)

Weakest: Orchestrator (6.4)

z.ai ↗

Anthropic 4.6 model profile
Metric	Value
Tokens total	64,845,271
Cache hit %	46.1%
Failure %	0.00%
Reasoning tokens	94,040
Tools per response	0.98
Best at	Build Agent (8.6)
Weakest	Orchestrator (5.9)

DeepSeek Deepseek Flash v4 model profile
Metric	Value
Tokens total	1,736,721,095
Cache hit %	70.0%
Failure %	0.27%
Reasoning tokens	5,172,427
Tools per response	1.17
Best at	Code Agent (8.8)
Weakest	Sprint-Plan Agent (6.9)

Google Gemini Flash 3 (preview) model profile
Metric	Value
Tokens total	335,631,669
Cache hit %	36.9%
Failure %	0.10%
Reasoning tokens	2,706,808
Tools per response	1.06
Best at	Code Agent (8.6)
Weakest	Orchestrator (6.6)

Moonshot AI k2.6 model profile
Metric	Value
Tokens total	40,006,693
Cache hit %	50.2%
Failure %	0.37%
Reasoning tokens	251,819
Tools per response	1.24
Best at	Code Agent (8.8)
Weakest	Scout Agent (4.8)

Xiaomi v2.5 model profile
Metric	Value
Tokens total	49,272,732
Cache hit %	85.2%
Failure %	0.00%
Reasoning tokens	201,003
Tools per response	1.20
Best at	Code Agent (8.8)
Weakest	Orchestrator (6.9)

Z.ai 5.2 model profile
Metric	Value
Tokens total	52,563,701
Cache hit %	69.3%
Failure %	0.00%
Reasoning tokens	265,221
Tools per response	1.18
Best at	Develop Agent (8.3)
Weakest	Orchestrator (6.4)

BROWSE THE CODE

Inspect what each model actually generated. Pick a Statement of Work, select a run, browse its files, and (on desktop) compare two runs side-by-side.

Methodology

How the score is calculated

Every number on this page is reproducible. A run’s composite is just the sum of eight subject scores, each capped at a fixed weight that together add up to 100. There is no gate, curve, or pass/fail line — a composite measures how well a run engineered its Statement of Work, not whether it “passed”.

Every subject is a weighted average of normalized component metrics. Each component is measured deterministically from artifacts the run already produced — static analysis of the generated code, the run’s own pytest results, and tool/transcript telemetry — then mapped onto a 0–1 scale where 1 is best. The subject’s 0–1 score is multiplied by its weight to give its point contribution, and the eight contributions are added.

If a component has no data for a run (e.g. no telemetry was attributed, or static analysis could not run), it drops out and the remaining component weights renormalize, so a run is never penalized for data the harness did not capture. Each subject records a coverage fraction — the share of its weight backed by real data — which the tooltips surface alongside the score.
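The weighted-average and renormalization rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Favur's implementation: the function name, the dictionary layout, and the choice to return `None` when no component has data are all hypothetical.

```python
from typing import Mapping, Optional, Tuple


def subject_points(
    max_points: float,
    weights: Mapping[str, float],
    components: Mapping[str, Optional[float]],
) -> Tuple[Optional[float], float]:
    """Return (points, coverage) for one subject.

    components maps each component name to its normalized 0-1 value,
    or None when the harness captured no data for it.
    """
    available = {
        name: w for name, w in weights.items() if components.get(name) is not None
    }
    total_weight = sum(weights.values())
    live_weight = sum(available.values())
    coverage = live_weight / total_weight  # share of weight backed by real data
    if live_weight == 0:
        return None, 0.0  # assumption: a subject with no data has no score
    # Renormalize: divide by the weight that actually has data behind it.
    score = sum(w * components[name] for name, w in available.items()) / live_weight
    return max_points * score, coverage


# Cost Efficiency weights from this page; the component values are made up.
weights = {"run_cost": 0.6, "caching": 0.4}
full = subject_points(14, weights, {"run_cost": 0.5, "caching": 1.0})
partial = subject_points(14, weights, {"run_cost": 0.5, "caching": None})
print(full, partial)
```

In the second call the caching term has no data, so the cost term carries the full 14 points and coverage falls to 0.6.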

Worked example: what does a 67.9 mean?

Take Mimo Pro v2.5 on circles, which scored 67.9. That single number is exactly the sum of these eight subject contributions:

Subject	Contribution	Out of	Data coverage
Code Quality	11.6	18	100%
Test Quality	9.4	18	100%
Cost Efficiency	12.3	14	100%
Tool Discipline	11.7	12	100%
Workflow & Reporting	6.1	12	100%
Effort Efficiency	8.0	12	100%
Process Discipline	5.3	8	100%
Deliverables	3.4	6	100%
Composite	67.9	100

Read it as: this run earned 11.6 of 18 for Code Quality, 9.4 of 18 for Test Quality, 12.3 of 14 for Cost Efficiency, 11.7 of 12 for Tool Discipline, 6.1 of 12 for Workflow & Reporting, 8.0 of 12 for Effort Efficiency, 5.3 of 8 for Process Discipline, and 3.4 of 6 for Deliverables, adding to 67.9 out of 100. The contributions in the table are rounded to one decimal place, so the displayed values sum to 67.8; the unrounded values from the subject breakdown sum to 67.9. “Data coverage” is the share of each subject’s weight that was backed by real measurements; where it is below 100%, the missing components renormalized away rather than counting as zero.
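The arithmetic of the worked example can be checked in a few lines. The sketch below uses the unrounded Mimo Pro v2.5 contributions from the subject breakdown table and the subject caps from this page; the variable names are only illustrative.

```python
# Subject caps (max points) as listed in the methodology; they sum to 100.
caps = {
    "Code Quality": 18, "Test Quality": 18, "Cost Efficiency": 14,
    "Tool Discipline": 12, "Workflow & Reporting": 12, "Effort Efficiency": 12,
    "Process Discipline": 8, "Deliverables": 6,
}

# Unrounded Mimo Pro v2.5 contributions from the subject breakdown table.
contributions = {
    "Code Quality": 11.59, "Test Quality": 9.45, "Cost Efficiency": 12.349,
    "Tool Discipline": 11.7, "Workflow & Reporting": 6.12,
    "Effort Efficiency": 8.015, "Process Discipline": 5.314,
    "Deliverables": 3.362,
}

composite = sum(contributions.values())
print(f"caps total: {sum(caps.values())}, composite: {composite:.1f}")
```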

Code Quality max 18

18 × (0.35·lint + 0.25·coverage + 0.20·complexity + 0.20·MI)

Measures the intrinsic quality of the code the run produced — independent of whether the tests happen to pass. It is the largest subject (tied with Test Quality) because durable, readable, low-defect code is the primary deliverable of an SDLC run.

Up to 18 points. Clean lint, high coverage, low complexity and a high maintainability index push this toward 18; messy or unmaintainable code pulls it toward 0.

Component	Weight	What it measures	Normalized to 0–1
Lint cleanliness	35%	Number of ruff violations when the produced source is linted against one uniform ruleset (the same standard_ruff.toml for every run, so the bar is identical across models).	0 errors → 1.0; 1–2 → 0.6; 3–5 → 0.2; 6–10 → 0.1; 11+ → 0.0. If the linter itself failed to launch, the component is treated as unavailable.
Test coverage	25%	Line coverage percentage reported by the run’s own pytest run over the code it wrote.	coverage% ÷ 100, clamped to 0–1 (90% → 0.90).
Cyclomatic complexity	20%	Average cyclomatic complexity of the produced source (radon) — how branchy the average function is. Lower is better.	Banded: avg < 2 → 1.0; < 3 → 0.66; < 5 → 0.33; < 8 → 0.16; otherwise 0.0.
Maintainability index	20%	Radon maintainability index (0–100) of the produced source — a blend of volume, complexity and lines of code. Higher is better.	MI ÷ 100, clamped to 0–1.
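A minimal sketch of the Code Quality formula, using the bands and weights from the table above. The function names and argument layout are assumptions made for illustration; the unavailable-linter case and renormalization are left out for brevity.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the 0-1 range."""
    return max(0.0, min(1.0, x))


def lint_score(violations: int) -> float:
    """Band the ruff violation count as listed in the table."""
    if violations == 0:
        return 1.0
    if violations <= 2:
        return 0.6
    if violations <= 5:
        return 0.2
    if violations <= 10:
        return 0.1
    return 0.0


def complexity_score(avg_cc: float) -> float:
    """Band the average cyclomatic complexity (lower is better)."""
    for upper, score in ((2, 1.0), (3, 0.66), (5, 0.33), (8, 0.16)):
        if avg_cc < upper:
            return score
    return 0.0


def code_quality_points(violations: int, coverage_pct: float,
                        avg_cc: float, mi: float) -> float:
    """18 x (0.35 lint + 0.25 coverage + 0.20 complexity + 0.20 MI)."""
    return 18 * (
        0.35 * lint_score(violations)
        + 0.25 * clamp01(coverage_pct / 100)
        + 0.20 * complexity_score(avg_cc)
        + 0.20 * clamp01(mi / 100)
    )


# Made-up inputs: 1 lint violation, 92% coverage, avg complexity 2.4, MI 70.
print(round(code_quality_points(1, 92.0, 2.4, 70.0), 2))
```

Because lint carries 35% of the weight and drops to 0.6 at the first violation, a single ruff error costs about 2.5 points on its own.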

Test Quality max 18

18 × (0.30·structure + 0.25·assertion_density + 0.20·test_density + 0.15·coverage_floor + 0.10·pass_rate)

Measures how good the run’s test suite actually is — not merely that it is green. Favur enforces an all-green suite and a ≥90% coverage floor, so raw pass-rate and coverage saturate near the top for every conformant run and cannot discriminate them. Test Quality therefore leans on structural signals parsed from the test code itself — how deeply each test asserts, how dense the suite is relative to the source, and how sophisticated its organisation is — and keeps coverage and pass-rate only as small, re-centred terms.

Up to 18 points. A large, deeply-asserting, well-structured suite with fixtures, parametrization, markers and integration tests approaches 18; a thin suite of a few shallow tests scores well below half even when it is fully green and over 90% covered.

Component	Weight	What it measures	Normalized to 0–1
Structural sophistication	30%	Six good-practice signals extracted from the test code by AST parsing: reusable @pytest.fixture definitions, @pytest.mark.parametrize cases, organised grouping (test classes or many modules), a shared conftest.py, use of pytest markers, and dedicated integration/e2e test files.	Mean of the six axes (each 0–1): fixtures and parametrize ramp to 1.0 at 3+, organisation is 1.0 with ≥2 classes or ≥4 files (0.5 with ≥1 class or ≥2 files), and conftest / markers / integration are each 0 or 1.
Assertion depth	25%	Average number of assertions per test — bare assert statements plus pytest.raises/warns context managers — counted across the suite. Deeper assertion per test means each test checks behaviour more thoroughly rather than smoke-testing that code runs.	Banded on assertions-per-test: <1 → 0.2, <1.5 → 0.5, <2 → 0.7, <3 → 0.9, 3+ → 1.0. A suite with no tests scores 0.
Test density	20%	Number of test functions per 100 lines of source code — how broadly the suite exercises the codebase, independent of line coverage.	Banded on tests-per-100-LOC: <4 → 0.3, <7 → 0.5, <10 → 0.7, <14 → 0.85, 14+ → 1.0. Zero tests scores 0.
Coverage (90% floor)	15%	The same line-coverage percentage used in Code Quality, but re-centred on Favur’s 90% floor so it rewards going beyond the requirement instead of saturating at it.	0.5 + (coverage% − 90) / 20, clamped to 0–1: 90% is neutral (0.5), 100% earns the full 1.0 bonus, and 80% or below is a 0.0 penalty.
Test pass rate	10%	Fraction of the run’s own pytest tests that pass, executed headlessly in a faithful sandbox copy of the run’s output.	Already 0–1; used directly (clamped). Low weight — a sanity floor so a broken suite still hurts, without dominating the subject.
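The Test Quality components can be sketched as below. Bands and weights are taken from the table above; the function names are hypothetical, and the linear shape of the fixture/parametrize ramp and the 0.0 organisation score below the listed thresholds are assumptions, since the page only states where each axis reaches 1.0 or 0.5.

```python
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def banded(value: float, bands, top: float) -> float:
    """Return the score of the first band whose upper bound exceeds value."""
    for upper, score in bands:
        if value < upper:
            return score
    return top


def structure_score(fixtures, parametrized, classes, files,
                    has_conftest, has_markers, has_integration) -> float:
    """Mean of the six structural axes, each 0-1."""
    if classes >= 2 or files >= 4:
        organisation = 1.0
    elif classes >= 1 or files >= 2:
        organisation = 0.5
    else:
        organisation = 0.0  # assumption: no credit below the listed thresholds
    axes = [
        clamp01(fixtures / 3),      # assumption: linear ramp to 1.0 at 3+
        clamp01(parametrized / 3),  # assumption: linear ramp to 1.0 at 3+
        organisation,
        float(has_conftest), float(has_markers), float(has_integration),
    ]
    return sum(axes) / len(axes)


def assertion_depth_score(assertions: int, tests: int) -> float:
    if tests == 0:
        return 0.0
    return banded(assertions / tests,
                  ((1, 0.2), (1.5, 0.5), (2, 0.7), (3, 0.9)), 1.0)


def density_score(tests: int, source_loc: int) -> float:
    if tests == 0:
        return 0.0
    return banded(100 * tests / source_loc,
                  ((4, 0.3), (7, 0.5), (10, 0.7), (14, 0.85)), 1.0)


def coverage_floor_score(coverage_pct: float) -> float:
    return clamp01(0.5 + (coverage_pct - 90) / 20)


def suite_quality_points(structure, assertion_depth, density,
                         coverage_floor, pass_rate) -> float:
    return 18 * (0.30 * structure + 0.25 * assertion_depth + 0.20 * density
                 + 0.15 * coverage_floor + 0.10 * clamp01(pass_rate))


# A thin but green suite (made-up numbers): 8 shallow tests over 400 LOC.
thin = suite_quality_points(
    structure_score(0, 0, 0, 1, False, False, False),
    assertion_depth_score(8, 8), density_score(8, 400),
    coverage_floor_score(91.0), 1.0)
print(round(thin, 2))
```

The thin suite in the usage example stays below half of the 18 points despite a 100% pass rate and coverage above the floor, which is the behaviour the subject is designed to produce.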

Cost Efficiency max 14

14 × (0.6·exp(−cost / 45) + 0.4·cache_savings_ratio)

Rewards delivering the SoW economically. It deliberately has no time term (wall-clock is dominated by external API latency, not the model’s skill) — it blends low absolute dollar cost with how effectively caching paid off: the net amount saved by cache reads at the provider’s discounted rate, less any cache-write premium, as a share of the no-cache bill.

Up to 14 points. The exponential decay means a very cheap run earns almost the full cost term, and a run whose caching meaningfully cut its bill earns the savings term on top — rewarding both spending little and caching well.

Component	Weight	What it measures	Normalized to 0–1
Run cost	60%	Total cost of the run in USD, summed from the per-request token/cost aggregates in the run’s meta. (The site never renders the raw dollar figure, only the resulting score.)	Asymptotic exponential decay exp(−cost / 45): cost 0 → 1.0, and the score halves about every 31 cost units (45 · ln 2 ≈ 31.2), approaching but never reaching 0.
Effective caching	40%	Net dollars saved by caching as a share of the no-cache bill: cache reads priced at the provider’s discounted read rate save (input − read) per token, and any cache-write premium (write − input, e.g. on models that charge to populate the cache) is netted back out. Deepseek has no write charge, so its writes cost nothing.	cache_savings_usd ÷ nominal (no-cache) cost, clamped to 0–1 — the fraction of the hypothetical un-cached bill that caching avoided.

The caching term needs per-model pricing + cache-token telemetry; a run without it drops the term and the cost term renormalizes to the full subject weight.
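A minimal sketch of the Cost Efficiency formula, including the fallback when the caching term is unavailable. The function names, the per-token price parameters, and the example numbers are assumptions; how Favur derives the nominal no-cache bill is not shown on this page, so it is passed in directly.

```python
import math
from typing import Optional


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def cost_term(cost_usd: float) -> float:
    """Exponential decay exp(-cost / 45): 1.0 at zero cost, never reaching 0."""
    return math.exp(-cost_usd / 45)


def cache_savings_ratio(cache_read_tokens: int, cache_write_tokens: int,
                        input_price: float, read_price: float,
                        write_price: float, nominal_cost_usd: float) -> float:
    """Net cache savings as a share of the no-cache bill (prices per token).

    For a provider with no write charge, pass write_price == input_price
    so the premium is zero.
    """
    saved = cache_read_tokens * (input_price - read_price)
    premium = cache_write_tokens * (write_price - input_price)
    return clamp01((saved - premium) / nominal_cost_usd)


def cost_efficiency_points(cost_usd: float,
                           savings_ratio: Optional[float] = None) -> float:
    """14 x (0.6 cost term + 0.4 savings); cost term alone if no cache data."""
    if savings_ratio is None:
        return 14 * cost_term(cost_usd)
    return 14 * (0.6 * cost_term(cost_usd) + 0.4 * savings_ratio)


half_life = 45 * math.log(2)  # cost at which the cost term halves
ratio = cache_savings_ratio(1_000_000, 0, 1e-6, 1e-7, 1e-6, 2.0)
print(round(half_life, 1), round(cost_efficiency_points(10.0, ratio), 2))
```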

Tool Discipline max 12

12 × (0.25·tool_err + 0.20·halluc + 0.20·retry + 0.20·first_try + 0.15·unauth)

Measures how cleanly the agents used their tools — the behavioral hygiene of the run. Frequent tool errors, hallucinated/invalid calls, retries and unauthorized-tool attempts all signal a model fighting its harness.

Up to 12 points. Clean, first-try, in-bounds tool use approaches 12; noisy or out-of-bounds tool use pulls it down.

Component	Weight	What it measures	Normalized to 0–1
Tool error rate	25%	Share of tool calls that failed, including unauthorized-tool attempts ((failed + unauthorized) ÷ total tool calls).	Inverted: 1 − rate (fewer errors → higher score).
Hallucination rate	20%	Share of tool calls that were schema-invalid (ToolValidationError — the model invoked an unknown tool or malformed arguments). Distinct from unauthorized attempts.	Inverted: 1 − rate.
Retry burden	20%	Total tool-call retries ÷ total tool calls.	Inverted: 1 − rate.
First-try hit rate	20%	Share of tool calls that succeeded on the first attempt (success and zero retries) ÷ total tool calls.	Already 0–1; used directly (higher → better).
Unauthorized-tool attempts	15%	Count of attempts to call a tool the agent was not permitted to use, detected from error events.	1 − 0.2 × attempts, floored at 0 (0 attempts → 1.0).

These are telemetry-derived. Today Favur emits tool-call events without an agent/role tag, so per-agent attribution is unavailable and these metrics are computed run-wide; when a metric has no data it renormalizes away. This is the one documented Favur-app data gap.
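The five Tool Discipline components can be combined roughly as follows. This sketch assumes run-wide counters are already available; the argument names are hypothetical, the clamp on the retry term (retries could in principle exceed calls) is an assumption, and renormalization for missing metrics is omitted.

```python
from typing import Optional


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def tool_discipline_points(total_calls: int, failed: int, unauthorized: int,
                           schema_invalid: int, retries: int,
                           first_try_ok: int) -> Optional[float]:
    """12 x weighted mix of the five normalized tool-hygiene components."""
    if total_calls == 0:
        return None  # assumption: no tool telemetry means no score
    tool_err = clamp01(1 - (failed + unauthorized) / total_calls)
    halluc = clamp01(1 - schema_invalid / total_calls)
    retry = clamp01(1 - retries / total_calls)  # assumption: clamped at 0
    first_try = clamp01(first_try_ok / total_calls)
    unauth = max(0.0, 1 - 0.2 * unauthorized)   # 5+ attempts floor at 0
    return 12 * (0.25 * tool_err + 0.20 * halluc + 0.20 * retry
                 + 0.20 * first_try + 0.15 * unauth)


# Made-up counters: 1,000 calls, 12 failures, 1 unauthorized attempt,
# 3 schema-invalid calls, 20 retries, 970 first-try successes.
print(round(tool_discipline_points(1000, 12, 1, 3, 20, 970), 2))
```

Note that the unauthorized-attempt term is count-based rather than rate-based, so one stray attempt costs the same 0.36 points in a 100-call run as in a 1,000-call run.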

Workflow & Reporting max 12

12 × (0.50·workflow_adherence + 0.50·report_quality)

Measures whether the agents followed Favur’s workflow contract and reported their work well — opening and closing work cycles correctly, and producing structured completion reports rather than terse or empty ones.

Up to 12 points, split evenly between following the process and documenting it.

Component	Weight	What it measures	Normalized to 0–1
Workflow adherence	50%	Per agent, whether it both opened a work cycle (work_begin) and closed it with a terminal tool (work_end / strategy_done / completion_report), averaged across agents.	Already 0–1 (mean of per-agent adherence); used directly.
Report quality	50%	Structure and substance of each agent’s best terminal report — presence and length of result/summary text plus citations, decisions and cautions — averaged across agents.	Already 0–1 (structured-report score); used directly.
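Workflow adherence as described above can be sketched like this. The tool names `work_begin`, `work_end`, `strategy_done` and `completion_report` come from the table; the per-agent call lists, the other tool names, and the report-quality value are invented for the example, and the structured-report scoring itself is treated as an input.

```python
TERMINAL_TOOLS = {"work_end", "strategy_done", "completion_report"}


def agent_adherence(tool_calls) -> float:
    """1.0 if the agent opened a cycle and called a terminal tool, else 0.0."""
    names = set(tool_calls)
    return 1.0 if "work_begin" in names and names & TERMINAL_TOOLS else 0.0


def workflow_points(calls_by_agent, report_quality: float) -> float:
    """12 x (0.50 workflow adherence + 0.50 report quality)."""
    adherence = sum(
        agent_adherence(calls) for calls in calls_by_agent.values()
    ) / len(calls_by_agent)
    return 12 * (0.50 * adherence + 0.50 * report_quality)


# Hypothetical run: one agent closes its cycle, the other never does.
calls_by_agent = {
    "code": ["work_begin", "write_file", "work_end"],
    "test": ["work_begin", "run_tests"],
}
print(workflow_points(calls_by_agent, report_quality=0.5))
```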

Effort Efficiency max 12

12 × (0.30·turns + 0.30·token_econ + 0.20·tool_breadth + 0.20·ctx_window)

Measures economy of effort per unit of output — reaching completion in few turns, with lean token spend per request, a healthy breadth of tools, and without crowding the context window.

Up to 12 points. Efficient, focused runs score high; runs that thrash, over-spend tokens, or rely on a single tool score low.

Component	Weight	What it measures	Normalized to 0–1
Turns to completion	30%	Total agent sessions the run took (floored at 1). Fewer is better.	Band: 1 session → 1.0, 15 sessions → 0.0 (linear).
Token economy	30%	Average tokens per request (total tokens ÷ requests). Fewer is better.	Band: 6,000 tok/req → 1.0, 50,000 → 0.0 (linear).
Tool breadth	20%	Count of distinct tools the run used. Broader use signals fluency with the harness rather than hammering one tool.	Band: 12 distinct tools → 1.0, 1 → 0.0 (linear).
Context-window usage	20%	Largest single-request prompt size as a share of the model’s context window (default 200k tokens). Lower means more headroom.	Band: 25% of window → 1.0, 90% → 0.0 (linear).

These are scale metrics with no natural 0–1 form, so each maps through a manifest normalization band (a “good” value scores 1.0, a “poor” value 0.0, linear between).
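The linear normalization band and the Effort Efficiency formula can be sketched as below. The `band` helper and argument names are hypothetical; the good/poor endpoints are the ones listed in the table, and the context-window share is passed as a fraction.

```python
def band(value: float, good: float, poor: float) -> float:
    """Linear map where good -> 1.0 and poor -> 0.0, clamped to 0-1.

    Works whether lower or higher values are better.
    """
    return max(0.0, min(1.0, (poor - value) / (poor - good)))


def effort_efficiency_points(sessions: int, tokens_per_request: float,
                             distinct_tools: int,
                             max_prompt_share: float) -> float:
    """12 x (0.30 turns + 0.30 token economy + 0.20 breadth + 0.20 context)."""
    turns = band(max(sessions, 1), good=1, poor=15)
    token_econ = band(tokens_per_request, good=6_000, poor=50_000)
    tool_breadth = band(distinct_tools, good=12, poor=1)
    ctx_window = band(max_prompt_share, good=0.25, poor=0.90)
    return 12 * (0.30 * turns + 0.30 * token_econ
                 + 0.20 * tool_breadth + 0.20 * ctx_window)


# Made-up run: 8 sessions, 28,000 tokens/request, 9 tools, 40% of the window.
print(round(effort_efficiency_points(8, 28_000, 9, 0.40), 2))
```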

Process Discipline max 8

8 × (0.50·command_success_rate + 0.25·verification_density + 0.25·structured_workflow)

Measures how rigorously the run drove its own engineering workflow, read from the telemetry event stream: whether the commands it ran actually succeeded, how densely it verified its own work, and whether it used the structured sprint/phase/delegation process rather than running flat.

Up to 8 points. A run with a high command-success rate, dense verification and a full sprint/phase/delegation structure approaches 8; a run that thrashes failing commands and never verifies or structures its work scores low.

Component	Weight	What it measures	Normalized to 0–1
Command success rate	50%	Share of shell commands the run executed that exited successfully (command successes ÷ commands run), from the telemetry command events. A genuinely discriminating signal — baseline runs sit at 34–41%.	Already 0–1; used directly (clamped).
Verification density	25%	Verification events per 100 commands — how often the run checked its own work relative to how much it did.	Band: 20 verifications / 100 commands → 1.0, 0 → 0.0 (linear).
Structured workflow	25%	Whether the run used the structured workflow at all — credited for opening at least one sprint, at least one phase, and at least one delegation (one third each).	Mean of the three 0/1 signals.

Telemetry-derived. A run with no `.favur/telemetry` event stream drops every component and the subject renormalizes away rather than scoring 0.
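A minimal sketch of the Process Discipline formula, assuming the telemetry has already been reduced to simple counters. The argument names are hypothetical, and returning `None` when no commands were recorded is an assumption standing in for the drop-out rule.

```python
from typing import Optional


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def process_discipline_points(commands_run: int, command_successes: int,
                              verifications: int, opened_sprint: bool,
                              opened_phase: bool,
                              delegated: bool) -> Optional[float]:
    """8 x (0.50 command success + 0.25 verification + 0.25 structure)."""
    if commands_run == 0:
        return None  # assumption: no command telemetry means no score
    success_rate = clamp01(command_successes / commands_run)
    # 20 verifications per 100 commands earns the full 1.0, linear below.
    per_100 = 100 * verifications / commands_run
    verification_density = clamp01(per_100 / 20)
    structured = (opened_sprint + opened_phase + delegated) / 3
    return 8 * (0.50 * success_rate + 0.25 * verification_density
                + 0.25 * structured)


# Made-up run: 250 commands, 180 succeeded, 30 verification events,
# with sprints and phases opened but no delegation.
print(round(process_discipline_points(250, 180, 30, True, True, False), 2))
```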

Deliverables max 6

6 × (0.40·completeness + 0.30·output_volume + 0.30·cost_per_loc)

Measures the shipped output recorded in the run’s deliverables manifest — whether it produced a complete set of artifacts, how much code it shipped, and how economically (cost per line of code).

Up to 6 points. A run that ships source, tests and docs in healthy volume at a low cost-per-LOC approaches 6; a thin or expensive-per-LOC run scores low.

Component	Weight	What it measures	Normalized to 0–1
Completeness	40%	Whether the run shipped each of the three artifact kinds — source, tests and docs — counted as the fraction of the three present.	Fraction of {source, test, doc} kinds present (0, ⅓, ⅔, 1).
Output volume	30%	Lines of shipped code: source + test LOC from the manifest, de-duplicated by content hash (docs are excluded so they cannot dominate the volume).	Band: 8,000 code LOC → 1.0, 200 → 0.0 (linear).
Cost per LOC	30%	Run cost divided by shipped code LOC — how expensive each line of delivered code was. Lower is better.	Band: 0.001 / LOC → 1.0, 0.02 / LOC → 0.0 (linear).

A run without a `meta.deliverables` manifest drops every component and the subject renormalizes away. The cost-per-LOC term also drops when cost or code LOC is unavailable.
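The Deliverables formula, including the case where the cost-per-LOC term drops out and the remaining weights renormalize, can be sketched as follows. Function and argument names are hypothetical, and treating zero code LOC as "unavailable" for the cost-per-LOC term is an assumption.

```python
from typing import Optional, Set


def band(value: float, good: float, poor: float) -> float:
    """Linear map where good -> 1.0 and poor -> 0.0, clamped to 0-1."""
    return max(0.0, min(1.0, (poor - value) / (poor - good)))


def deliverables_points(kinds_present: Set[str], code_loc: int,
                        cost: Optional[float]) -> float:
    """6 x (0.40 completeness + 0.30 volume + 0.30 cost per LOC)."""
    parts = {
        "completeness": (0.40, len(kinds_present & {"source", "test", "doc"}) / 3),
        "output_volume": (0.30, band(code_loc, good=8_000, poor=200)),
    }
    # The cost-per-LOC term drops when cost or code LOC is unavailable.
    if cost is not None and code_loc > 0:
        parts["cost_per_loc"] = (0.30, band(cost / code_loc, good=0.001, poor=0.02))
    live_weight = sum(w for w, _ in parts.values())
    score = sum(w * v for w, v in parts.values()) / live_weight
    return 6 * score


# Made-up run: all three artifact kinds, 2,150 code LOC, cost of 12.0.
print(round(deliverables_points({"source", "test", "doc"}, 2_150, 12.0), 2))
```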