We run the same statement of work through different LLM configurations, observing how each behaves inside Favur's agentic process. 13 runs so far, more coming.
Highest composite across visible runs
all-Xiaomi
Highest cache hit rate at 85.2%, driving cost efficiency.
1,060 reqs · 3h 13m · 0.0% failure rate
22.9 / 36 across Test + Code Quality
11.9 / 12 Tool Discipline
6.9 / 12 Workflow & Reporting
0.0% failure rate across 1,060 requests
Ranked by composite score
| Run | Composite | Volume | Configuration |
|---|---|---|---|
| Mimo Pro v2.5 | 67.9 out of 100 | 1,060 reqs · 3h 13m | all-Xiaomi |
| Glm 5.2 | 61.5 out of 100 | 1,251 reqs · 3h 47m | all-Z.ai |
| Deepseek Flash v4 | 60.3 out of 100 | 701 reqs · 1h 30m | all-DeepSeek |
| Gemini Flash 3 (preview) | 60.1 out of 100 | 809 reqs · 1h 5m | all-Google |
| Deepseek Flash v4 | 58.7 out of 100 | 917 reqs · 1h 36m | all-DeepSeek |
| Deepseek Pro v4 | 58.2 out of 100 | 1,041 reqs · 2h 35m | all-DeepSeek |
| Kimi k2.6 | 57.0 out of 100 | 1,091 reqs · 1h 43m | all-Moonshot AI |
| Gemini Pro 3.1 (preview) | 55.6 out of 100 | 505 reqs · 1h 11m | all-Google |
| Claude Sonnet 4.6 | 46.5 out of 100 | 1,260 reqs · 4h 14m | all-Anthropic |
How each run earns its composite — subject by subject
| Run | Code /18 | Test /18 | Cost /14 | Tools /12 | Workflow /12 | Effort /12 | Process /8 | Deliver /6 | Composite |
|---|---|---|---|---|---|---|---|---|---|
| Mimo Pro v2.5 | 11.59 / 18 (64%) | 9.45 / 18 (53%) | 12.349 / 14 (88%) | 11.7 / 12 (98%) | 6.12 / 12 (51%) | 8.015 / 12 (67%) | 5.314 / 8 (66%) | 3.362 / 6 (56%) | 67.9 |
| Deepseek Flash v4 | 8.017 / 18 (45%) | 10.549 / 18 (59%) | 8.875 / 14 (63%) | 9.989 / 12 (83%) | 6.587 / 12 (55%) | 7.123 / 12 (59%) | 4.565 / 8 (57%) | 4.529 / 6 (75%) | 60.2 |
| Glm 5.2 | 13.77 / 18 (77%) | 9.12 / 18 (51%) | 6.581 / 14 (47%) | 11.785 / 12 (98%) | 6 / 12 (50%) | 6.903 / 12 (58%) | 5.549 / 8 (69%) | 1.785 / 6 (30%) | 61.5 |
| Gemini Flash 3 (preview) | 7.898 / 18 (44%) | 10.11 / 18 (56%) | 5.756 / 14 (41%) | 10.303 / 12 (86%) | 7.327 / 12 (61%) | 5.768 / 12 (48%) | 4.427 / 8 (55%) | 3.721 / 6 (62%) | 55.3 |
| Deepseek Pro v4 | 7.247 / 18 (40%) | 8.76 / 18 (49%) | 9.541 / 14 (68%) | 11.803 / 12 (98%) | 6 / 12 (50%) | 7.817 / 12 (65%) | 5.192 / 8 (65%) | 1.802 / 6 (30%) | 58.2 |
| Kimi k2.6 | 10.631 / 18 (59%) | 9.101 / 18 (51%) | 7.044 / 14 (50%) | 9.878 / 12 (82%) | 6.218 / 12 (52%) | 6.961 / 12 (58%) | 5.239 / 8 (65%) | 1.976 / 6 (33%) | 57.0 |
| Gemini Pro 3.1 (preview) | 7.667 / 18 (43%) | 8.1 / 18 (45%) | 7.702 / 14 (55%) | 11.251 / 12 (94%) | 6.491 / 12 (54%) | 7.125 / 12 (59%) | 5.6 / 8 (70%) | 1.712 / 6 (29%) | 55.6 |
| Claude Sonnet 4.6 | 7.84 / 18 (44%) | 6.93 / 18 (39%) | 1.292 / 14 (9%) | 11.855 / 12 (99%) | 6 / 12 (50%) | 5.798 / 12 (48%) | 5.027 / 8 (63%) | 1.772 / 6 (30%) | 46.5 |
Across all visible runs — which agents and models drive the cost
Across the visible runs, spend concentrates in a handful of agents: Code (~26%), Code Review (~18%), Develop (~17%). The Code agent is the single largest cost driver in the current data.
Every Statement of Work runs the full Favur SDLC regardless of size — phase docs, sprint scoping, cycle reviews, code-review passes, document versioning — so the agents carrying those protocols dominate the bill, whatever the project.
| Agent | Total % | Top contributing model | Top contribution % |
|---|---|---|---|
| Code | 25.893% | DeepSeek Deepseek Flash v4 | 12.675% |
| Code Review | 18.298% | DeepSeek Deepseek Pro v4 | 7.999% |
| Develop | 16.502% | DeepSeek Deepseek Flash v4 | 8.553% |
| Orchestrator | 15.858% | DeepSeek Deepseek Flash v4 | 5.848% |
| Sprint Review | 7.001% | DeepSeek Deepseek Flash v4 | 3.572% |
| Pseudocode | 3.87% | DeepSeek Deepseek Flash v4 | 1.774% |
| Sprint Plan | 2.808% | DeepSeek Deepseek Flash v4 | 1.075% |
| Scout | 2.52% | DeepSeek Deepseek Flash v4 | 1.702% |
| Test | 1.763% | Google Gemini Pro 3.1 (preview) | 0.746% |
| Build | 1.543% | DeepSeek Deepseek Flash v4 | 0.708% |
How models behave when handling each agent role
Six axes per agent, each on an absolute scale: cache utilization, cost efficiency, throughput, responsiveness, tool intensity, and reasoning depth.
| Model | Cache Util | Cost Eff | Throughput | Speed | Tool Use | Reasoning |
|---|---|---|---|---|---|---|
| Anthropic Claude Sonnet 4.6 | 0.57 | 0.00 | 0.14 | 0.82 | 0.49 | 0.16 |
| DeepSeek Deepseek Flash v4 | 0.72 | 0.83 | 1.00 | 0.77 | 0.59 | 0.31 |
| DeepSeek Deepseek Pro v4 | 0.79 | 0.23 | 0.18 | 0.77 | 0.57 | 0.29 |
| Google Gemini Flash 3 (preview) | 0.41 | 0.33 | 0.39 | 0.95 | 0.53 | 0.04 |
| Google Gemini Pro 3.1 (preview) | 0.66 | 0.01 | 0.00 | 0.82 | 0.52 | 0.30 |
| Moonshot AI Kimi k2.6 | 0.54 | 0.17 | 0.47 | 0.95 | 0.62 | 0.39 |
| Xiaomi Mimo Pro v2.5 | 0.88 | 0.80 | 0.11 | 0.74 | 0.60 | 0.33 |
| Z.ai Glm 5.2 | 0.70 | 0.09 | 0.22 | 0.58 | 0.59 | 0.31 |
Behavioral patterns across the cohort
| Run | Cache Hit % |
|---|---|
| Xiaomi Mimo Pro v2.5 | 85.2% |
| DeepSeek Deepseek Pro v4 | 82.1% |
| DeepSeek Deepseek Flash v4 | 81.7% |
| DeepSeek Deepseek Flash v4 | 77.8% |
| DeepSeek Deepseek Flash v4 | 71.8% |
| DeepSeek Deepseek Flash v4 | 70.6% |
| Z.ai Glm 5.2 | 69.3% |
| DeepSeek Deepseek Flash v4 | 67.9% |
| Google Gemini Pro 3.1 (preview) | 54.1% |
| Moonshot AI Kimi k2.6 | 50.2% |
| Google Gemini Flash 3 (preview) | 50.0% |
| Anthropic Claude Sonnet 4.6 | 46.1% |
| Google Gemini Flash 3 (preview) | 35.4% |
| Run | Reasoning Tokens |
|---|---|
| Google Gemini Flash 3 (preview) | 2,623,402 |
| DeepSeek Deepseek Flash v4 | 2,185,380 |
| DeepSeek Deepseek Flash v4 | 1,716,085 |
| DeepSeek Deepseek Flash v4 | 1,014,871 |
| Z.ai Glm 5.2 | 265,221 |
| Moonshot AI Kimi k2.6 | 251,819 |
| Xiaomi Mimo Pro v2.5 | 201,003 |
| DeepSeek Deepseek Pro v4 | 191,434 |
| DeepSeek Deepseek Flash v4 | 151,585 |
| Google Gemini Pro 3.1 (preview) | 124,400 |
| DeepSeek Deepseek Flash v4 | 104,506 |
| Anthropic Claude Sonnet 4.6 | 94,040 |
| Google Gemini Flash 3 (preview) | 83,406 |
| Run | Tools / Response |
|---|---|
| Moonshot AI Kimi k2.6 | 1.24 |
| Xiaomi Mimo Pro v2.5 | 1.20 |
| DeepSeek Deepseek Flash v4 | 1.20 |
| DeepSeek Deepseek Flash v4 | 1.18 |
| Z.ai Glm 5.2 | 1.18 |
| DeepSeek Deepseek Flash v4 | 1.16 |
| DeepSeek Deepseek Flash v4 | 1.16 |
| DeepSeek Deepseek Flash v4 | 1.15 |
| DeepSeek Deepseek Pro v4 | 1.13 |
| Google Gemini Flash 3 (preview) | 1.07 |
| Google Gemini Flash 3 (preview) | 1.05 |
| Google Gemini Pro 3.1 (preview) | 1.04 |
| Anthropic Claude Sonnet 4.6 | 0.98 |
| Run | Total % | HTTP | No-response | Empty |
|---|---|---|---|---|
| DeepSeek Deepseek Flash v4 | 1.2% | 1.2% | 0.0% | 0.0% |
| Google Gemini Pro 3.1 (preview) | 0.8% | 0.8% | 0.0% | 0.0% |
| DeepSeek Deepseek Flash v4 | 0.7% | 0.7% | 0.0% | 0.0% |
| DeepSeek Deepseek Flash v4 | 0.5% | 0.5% | 0.0% | 0.0% |
| Google Gemini Flash 3 (preview) | 0.4% | 0.4% | 0.0% | 0.0% |
| Moonshot AI Kimi k2.6 | 0.4% | 0.4% | 0.0% | 0.0% |
| DeepSeek Deepseek Flash v4 | 0.2% | 0.2% | 0.0% | 0.0% |
| DeepSeek Deepseek Flash v4 | 0.1% | 0.1% | 0.0% | 0.0% |
| Google Gemini Flash 3 (preview) | 0.0% | 0.0% | 0.0% | 0.0% |
| Xiaomi Mimo Pro v2.5 | 0.0% | 0.0% | 0.0% | 0.0% |
| Z.ai Glm 5.2 | 0.0% | 0.0% | 0.0% | 0.0% |
| DeepSeek Deepseek Pro v4 | 0.0% | 0.0% | 0.0% | 0.0% |
| Anthropic Claude Sonnet 4.6 | 0.0% | 0.0% | 0.0% | 0.0% |
Which model does each Favur agent role best?
Each card shows that role’s pooled mean ± 95% confidence interval over every appearance of the role across all models and SoWs; pooling is reasonable because roles behave consistently between SoWs. The interval (a Student-t bound, which stays honest at small samples) expresses how confident we are in the average, and it tightens as more runs accrue.
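As a rough illustration of how a pooled mean with a Student-t 95% interval could be computed, here is a minimal sketch. The function name `pooled_mean_ci`, the hard-coded critical-value table, and the example scores are all assumptions made for illustration; the page does not publish the per-run scores or the exact routine Favur uses.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
             7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179}


def pooled_mean_ci(scores: list[float]) -> tuple[float, float]:
    """Return (mean, half-width of the 95% t-interval) for pooled role scores."""
    n = len(scores)
    if n < 2 or (n - 1) not in T_CRIT_95:
        raise ValueError("this sketch only covers 2 to 13 samples")
    mean = statistics.mean(scores)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    half_width = T_CRIT_95[n - 1] * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


# Hypothetical scores for one role across 13 appearances (not from the page).
example_scores = [8.4, 8.6, 8.3, 8.7, 8.5, 8.2, 8.8, 8.4, 8.6, 8.5, 8.3, 8.7, 8.5]
mean, half_width = pooled_mean_ci(example_scores)
print(f"Pooled mean {mean:.1f} ±{half_width:.2f} · 95% CI, n={len(example_scores)}")
```

Because the critical value grows quickly as n shrinks, the same spread of scores yields a much wider interval at n=3 than at n=13, which is the sense in which the bound is conservative at small samples.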
Writes the actual source files
Cohort median: 8.4 · 8 models tested
Pooled mean 8.5 ±0.14 · 95% CI, n=13
Authors tests and validates the suite
Cohort median: 7.2 · 8 models tested
Pooled mean 7.2 ±0.18 · 95% CI, n=13
Installs deps and runs the build pipeline
Cohort median: 8.4 · 8 models tested
Pooled mean 8.4 ±0.12 · 95% CI, n=13
Spot-checks the developing product
Cohort median: 6.9 · 7 models tested
Pooled mean 6.8 ±0.44 · 95% CI, n=12
Plans implementation step-by-step before coding
Cohort median: 7.9 · 8 models tested
Pooled mean 7.8 ±0.19 · 95% CI, n=13
Decomposes the SoW into sprints and tasks
Cohort median: 6.7 · 8 models tested
Pooled mean 6.6 ±0.24 · 95% CI, n=13
Reviews diffs against the sprint plan
Cohort median: 7.8 · 8 models tested
Pooled mean 7.9 ±0.14 · 95% CI, n=13
Coordinates the per-sprint development cycle
Cohort median: 8.3 · 8 models tested
Pooled mean 8.2 ±0.23 · 95% CI, n=13
| Rank | Model | Score |
|---|---|---|
| 1 | Moonshot AI Kimi k2.6 | 8.8 of 10 |
| 2 | Xiaomi Mimo Pro v2.5 | 8.8 of 10 |
| 3 | DeepSeek Deepseek Flash v4 | 8.5 of 10 |
| Rank | Model | Score |
|---|---|---|
| 1 | Moonshot AI Kimi k2.6 | 7.8 of 10 |
| 2 | Xiaomi Mimo Pro v2.5 | 7.3 of 10 |
| 3 | DeepSeek Deepseek Pro v4 | 7.3 of 10 |
| Rank | Model | Score |
|---|---|---|
| 1 | Anthropic Claude Sonnet 4.6 | 8.6 of 10 |
| 2 | DeepSeek Deepseek Flash v4 | 8.6 of 10 |
| 3 | DeepSeek Deepseek Pro v4 | 8.5 of 10 |
| Rank | Model | Score |
|---|---|---|
| 1 | Xiaomi Mimo Pro v2.5 | 7.2 of 10 |
| 2 | Z.ai Glm 5.2 | 7.2 of 10 |
| 3 | DeepSeek Deepseek Pro v4 | 7.2 of 10 |
| Rank | Model | Score |
|---|---|---|
| 1 | Google Gemini Pro 3.1 (preview) | 8.1 of 10 |
| 2 | Xiaomi Mimo Pro v2.5 | 8.0 of 10 |
| 3 | Moonshot AI Kimi k2.6 | 8.0 of 10 |
| Rank | Model | Score |
|---|---|---|
| 1 | Xiaomi Mimo Pro v2.5 | 7.1 of 10 |
| 2 | DeepSeek Deepseek Pro v4 | 7.1 of 10 |
| 3 | Google Gemini Pro 3.1 (preview) | 7.0 of 10 |
| Rank | Model | Score |
|---|---|---|
| 1 | Xiaomi Mimo Pro v2.5 | 8.2 of 10 |
| 2 | DeepSeek Deepseek Pro v4 | 8.1 of 10 |
| 3 | DeepSeek Deepseek Flash v4 | 7.9 of 10 |
| Rank | Model | Score |
|---|---|---|
| 1 | DeepSeek Deepseek Pro v4 | 8.6 of 10 |
| 2 | Google Gemini Pro 3.1 (preview) | 8.6 of 10 |
| 3 | Xiaomi Mimo Pro v2.5 | 8.5 of 10 |
How does each vendor's model perform on cost, tokens, cache, and reliability?
Anthropic Claude Sonnet 4.6 · 1 run · last seen 2026-06-28
| Metric | Value |
|---|---|
| Tokens total | 64,845,271 |
| Cache hit % | 46.1% |
| Failure % | 0.00% |
| Reasoning tokens | 94,040 |
| Tools per response | 0.98 |
| Best at | Build Agent (8.6) |
| Weakest | Orchestrator (5.9) |
DeepSeek Deepseek Flash v4 · 5 runs · last seen 2026-06-27 · averaged across 5 runs
| Metric | Value |
|---|---|
| Tokens total | 1,736,721,095 |
| Cache hit % | 70.0% |
| Failure % | 0.27% |
| Reasoning tokens | 5,172,427 |
| Tools per response | 1.17 |
| Best at | Code Agent (8.8) |
| Weakest | Sprint-Plan Agent (6.9) |
Google Gemini Flash 3 (preview) · 2 runs · last seen 2026-06-28 · averaged across 2 runs
| Metric | Value |
|---|---|
| Tokens total | 335,631,669 |
| Cache hit % | 36.9% |
| Failure % | 0.10% |
| Reasoning tokens | 2,706,808 |
| Tools per response | 1.06 |
| Best at | Code Agent (8.6) |
| Weakest | Orchestrator (6.6) |
Moonshot AI Kimi k2.6 · 1 run · last seen 2026-06-27
| Metric | Value |
|---|---|
| Tokens total | 40,006,693 |
| Cache hit % | 50.2% |
| Failure % | 0.37% |
| Reasoning tokens | 251,819 |
| Tools per response | 1.24 |
| Best at | Code Agent (8.8) |
| Weakest | Scout Agent (4.8) |
Xiaomi Mimo Pro v2.5 · 1 run · last seen 2026-06-28
| Metric | Value |
|---|---|
| Tokens total | 49,272,732 |
| Cache hit % | 85.2% |
| Failure % | 0.00% |
| Reasoning tokens | 201,003 |
| Tools per response | 1.20 |
| Best at | Code Agent (8.8) |
| Weakest | Orchestrator (6.9) |
Z.ai Glm 5.2 · 1 run · last seen 2026-06-28
| Metric | Value |
|---|---|
| Tokens total | 52,563,701 |
| Cache hit % | 69.3% |
| Failure % | 0.00% |
| Reasoning tokens | 265,221 |
| Tools per response | 1.18 |
| Best at | Develop Agent (8.3) |
| Weakest | Orchestrator (6.4) |
Methodology
Every number on this page is reproducible. A run’s composite is just the sum of eight subject scores, each capped at a fixed weight that together add up to 100. There is no gate, curve, or pass/fail line — a composite measures how well a run engineered its Statement of Work, not whether it “passed”.
Every subject is a weighted average of normalized component metrics. Each component is measured deterministically from artifacts the run already produced — static analysis of the generated code, the run’s own pytest results, and tool/transcript telemetry — then mapped onto a 0–1 scale where 1 is best. The subject’s 0–1 score is multiplied by its weight to give its point contribution, and the eight contributions are added.
If a component has no data for a run (e.g. no telemetry was attributed, or static analysis could not run), it drops out and the remaining component weights renormalize, so a run is never penalized for data the harness did not capture. Each subject records a coverage fraction — the share of its weight backed by real data — which the tooltips surface alongside the score.
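The drop-out rule can be sketched as a weighted average over only the components that have data, together with the coverage fraction. This is a minimal sketch under stated assumptions: the helper name `subject_points` and the example values are hypothetical, and returning `None` when nothing was measured is one way to represent a subject that renormalizes away.

```python
from typing import Optional


def subject_points(
    subject_weight: float,
    components: dict[str, tuple[float, Optional[float]]],
) -> tuple[Optional[float], float]:
    """components maps a name to (component weight, normalized 0-1 value or None).

    Returns (points earned, coverage fraction). Components with no data drop
    out and the remaining weights are renormalized.
    """
    total_weight = sum(w for w, _ in components.values())
    measured = [(w, v) for w, v in components.values() if v is not None]
    covered_weight = sum(w for w, _ in measured)
    if covered_weight == 0:
        return None, 0.0  # nothing measured: the subject itself drops out
    average = sum(w * v for w, v in measured) / covered_weight
    return subject_weight * average, covered_weight / total_weight


# Hypothetical Code Quality inputs where the complexity analysis could not run.
points, coverage = subject_points(
    18,
    {
        "lint": (0.35, 1.0),
        "coverage": (0.25, 0.9),
        "complexity": (0.20, None),
        "MI": (0.20, 0.7),
    },
)
print(points, coverage)
```

In this example the three measured components carry 80% of the subject's weight, so the coverage fraction is 0.8 and the score is computed as if those three weights summed to 1.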
Take Mimo Pro v2.5 on circles, which scored 67.9. That single number is exactly the sum of these eight subject contributions:
| Subject | Contribution | Out of | Data coverage |
|---|---|---|---|
| Code Quality | 11.6 | 18 | 100% |
| Test Quality | 9.4 | 18 | 100% |
| Cost Efficiency | 12.3 | 14 | 100% |
| Tool Discipline | 11.7 | 12 | 100% |
| Workflow & Reporting | 6.1 | 12 | 100% |
| Effort Efficiency | 8.0 | 12 | 100% |
| Process Discipline | 5.3 | 8 | 100% |
| Deliverables | 3.4 | 6 | 100% |
| Composite | 67.9 | 100 | |
Read it as: this run earned 11.6 of 18 for Code Quality, 9.4 of 18 for Test Quality, 12.3 of 14 for Cost Efficiency, 11.7 of 12 for Tool Discipline, 6.1 of 12 for Workflow & Reporting, 8.0 of 12 for Effort Efficiency, 5.3 of 8 for Process Discipline, 3.4 of 6 for Deliverables — adding to 67.9 out of 100. “Data coverage” is the share of each subject’s weight that was backed by real measurements; where it is below 100%, the missing components renormalized away rather than counting as zero.
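The same arithmetic can be checked in a few lines. The sketch below uses the unrounded per-subject values listed for Mimo Pro v2.5 in the subject-by-subject table; the variable names are illustrative only.

```python
# Subject weights (the caps), which together add up to 100.
weights = {
    "Code Quality": 18, "Test Quality": 18, "Cost Efficiency": 14,
    "Tool Discipline": 12, "Workflow & Reporting": 12,
    "Effort Efficiency": 12, "Process Discipline": 8, "Deliverables": 6,
}

# Unrounded contributions for Mimo Pro v2.5 from the subject-by-subject table.
contributions = {
    "Code Quality": 11.59, "Test Quality": 9.45, "Cost Efficiency": 12.349,
    "Tool Discipline": 11.7, "Workflow & Reporting": 6.12,
    "Effort Efficiency": 8.015, "Process Discipline": 5.314, "Deliverables": 3.362,
}

# Every contribution must stay within its subject's cap.
within_caps = all(0 <= contributions[name] <= weights[name] for name in weights)

# The composite is the plain sum, shown to one decimal place.
composite = round(sum(contributions.values()), 1)
print(sum(weights.values()), within_caps, composite)
```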
Code Quality
18 × (0.35·lint + 0.25·coverage + 0.20·complexity + 0.20·MI)
Measures the intrinsic quality of the code the run produced, independent of whether the tests happen to pass. It is the largest subject (tied with Test Quality) because durable, readable, low-defect code is the primary deliverable of an SDLC run.
Up to 18 points. Clean lint, high coverage, low complexity and a high maintainability index push this toward 18; messy or unmaintainable code pulls it toward 0.
| Component | Weight | What it measures | Normalized to 0–1 |
|---|---|---|---|
| Lint cleanliness | 35% | Number of ruff violations when the produced source is linted against one uniform ruleset (the same standard_ruff.toml for every run, so the bar is identical across models). | 0 errors → 1.0; 1–2 → 0.6; 3–5 → 0.2; 6–10 → 0.1; 11+ → 0.0. If the linter itself failed to launch, the component is treated as unavailable. |
| Test coverage | 25% | Line coverage percentage reported by the run’s own pytest run over the code it wrote. | coverage% ÷ 100, clamped to 0–1 (90% → 0.90). |
| Cyclomatic complexity | 20% | Average cyclomatic complexity of the produced source (radon) — how branchy the average function is. Lower is better. | Banded: avg < 2 → 1.0; < 3 → 0.66; < 5 → 0.33; < 8 → 0.16; otherwise 0.0. |
| Maintainability index | 20% | Radon maintainability index (0–100) of the produced source — a blend of volume, complexity and lines of code. Higher is better. | MI ÷ 100, clamped to 0–1. |
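The banded and clamped normalizations in the table can be sketched as follows. This is an illustration of the published formula, not Favur's own code; the function names are hypothetical, and the "unavailable" case (linter failed to launch) is left out for brevity.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the 0-1 range."""
    return max(0.0, min(1.0, x))


def lint_score(violations: int) -> float:
    """Band the ruff violation count as in the table."""
    if violations == 0:
        return 1.0
    if violations <= 2:
        return 0.6
    if violations <= 5:
        return 0.2
    if violations <= 10:
        return 0.1
    return 0.0


def complexity_score(avg_cc: float) -> float:
    """Band the average cyclomatic complexity (lower is better)."""
    if avg_cc < 2:
        return 1.0
    if avg_cc < 3:
        return 0.66
    if avg_cc < 5:
        return 0.33
    if avg_cc < 8:
        return 0.16
    return 0.0


def code_quality(violations: int, coverage_pct: float, avg_cc: float, mi: float) -> float:
    """18 x (0.35 lint + 0.25 coverage + 0.20 complexity + 0.20 MI)."""
    return 18 * (
        0.35 * lint_score(violations)
        + 0.25 * clamp01(coverage_pct / 100)
        + 0.20 * complexity_score(avg_cc)
        + 0.20 * clamp01(mi / 100)
    )


# Hypothetical run: 2 lint violations, 92% coverage, avg complexity 2.5, MI 70.
print(code_quality(2, 92, 2.5, 70))
```

Note how steep the lint band is: going from zero to a single violation costs 40% of the lint component, which is 14% of the whole subject.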
Test Quality
18 × (0.30·structure + 0.25·assertion_density + 0.20·test_density + 0.15·coverage_floor + 0.10·pass_rate)
Measures how good the run’s test suite actually is, not merely that it is green. Favur enforces an all-green suite and a ≥90% coverage floor, so raw pass-rate and coverage saturate near the top for every conformant run and cannot discriminate between them. Test Quality therefore leans on structural signals parsed from the test code itself (how deeply each test asserts, how dense the suite is relative to the source, and how sophisticated its organisation is) and keeps coverage and pass-rate only as small, re-centred terms.
Up to 18 points. A large, deeply-asserting, well-structured suite with fixtures, parametrization, markers and integration tests approaches 18; a thin suite of a few shallow tests scores well below half even when it is fully green and over 90% covered.
| Component | Weight | What it measures | Normalized to 0–1 |
|---|---|---|---|
| Structural sophistication | 30% | Six good-practice signals extracted from the test code by AST parsing: reusable @pytest.fixture definitions, @pytest.mark.parametrize cases, organised grouping (test classes or many modules), a shared conftest.py, use of pytest markers, and dedicated integration/e2e test files. | Mean of the six axes (each 0–1): fixtures and parametrize ramp to 1.0 at 3+, organisation is 1.0 with ≥2 classes or ≥4 files (0.5 with ≥1 class or ≥2 files), and conftest / markers / integration are each 0 or 1. |
| Assertion depth | 25% | Average number of assertions per test — bare assert statements plus pytest.raises/warns context managers — counted across the suite. Deeper assertion per test means each test checks behaviour more thoroughly rather than smoke-testing that code runs. | Banded on assertions-per-test: <1 → 0.2, <1.5 → 0.5, <2 → 0.7, <3 → 0.9, 3+ → 1.0. A suite with no tests scores 0. |
| Test density | 20% | Number of test functions per 100 lines of source code — how broadly the suite exercises the codebase, independent of line coverage. | Banded on tests-per-100-LOC: <4 → 0.3, <7 → 0.5, <10 → 0.7, <14 → 0.85, 14+ → 1.0. Zero tests scores 0. |
| Coverage (90% floor) | 15% | The same line-coverage percentage used in Code Quality, but re-centred on Favur’s 90% floor so it rewards going beyond the requirement instead of saturating at it. | 0.5 + (coverage% − 90) / 20, clamped to 0–1: 90% is neutral (0.5), 100% earns the full 1.0 bonus, and 80% or below is a 0.0 penalty. |
| Test pass rate | 10% | Fraction of the run’s own pytest tests that pass, executed headlessly in a faithful sandbox copy of the run’s output. | Already 0–1; used directly (clamped). Low weight — a sanity floor so a broken suite still hurts, without dominating the subject. |
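A compact sketch of these normalizations follows. The function names are hypothetical, and one detail is an assumption: the table says fixtures and parametrize "ramp to 1.0 at 3+", which is modelled here as a linear ramp (count ÷ 3).

```python
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def banded(value: float, bands: list[tuple[float, float]], top: float) -> float:
    """Return the score of the first band whose upper bound exceeds value."""
    for upper, score in bands:
        if value < upper:
            return score
    return top


def structure_score(fixtures: int, parametrize: int, classes: int, files: int,
                    conftest: bool, markers: bool, integration: bool) -> float:
    """Mean of the six structural axes (linear ramp to 3 is an assumption)."""
    if classes >= 2 or files >= 4:
        organisation = 1.0
    elif classes >= 1 or files >= 2:
        organisation = 0.5
    else:
        organisation = 0.0
    axes = [min(fixtures / 3, 1.0), min(parametrize / 3, 1.0), organisation,
            float(conftest), float(markers), float(integration)]
    return sum(axes) / len(axes)


def assertion_score(assertions: int, tests: int) -> float:
    if tests == 0:
        return 0.0
    return banded(assertions / tests, [(1, 0.2), (1.5, 0.5), (2, 0.7), (3, 0.9)], 1.0)


def density_score(tests: int, source_loc: int) -> float:
    if tests == 0 or source_loc == 0:
        return 0.0
    return banded(100 * tests / source_loc, [(4, 0.3), (7, 0.5), (10, 0.7), (14, 0.85)], 1.0)


def coverage_floor_score(coverage_pct: float) -> float:
    """Re-centre coverage on the 90% floor: 90 -> 0.5, 100 -> 1.0, <=80 -> 0.0."""
    return clamp01(0.5 + (coverage_pct - 90) / 20)


def score_test_quality(structure: float, assertion: float, density: float,
                       coverage_floor: float, pass_rate: float) -> float:
    return 18 * (0.30 * structure + 0.25 * assertion + 0.20 * density
                 + 0.15 * coverage_floor + 0.10 * clamp01(pass_rate))


# Hypothetical suite: 50 tests, 90 assertions, 600 source LOC, 95% coverage.
s = structure_score(3, 1, 1, 2, True, False, False)
print(score_test_quality(s, assertion_score(90, 50), density_score(50, 600),
                         coverage_floor_score(95), 1.0))
```

The example suite is fully green and 95% covered, yet it lands around two thirds of the available points because its structure and assertion depth are middling, which is the behaviour the subject is designed to produce.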
Cost Efficiency
14 × (0.6·exp(−cost / 45) + 0.4·cache_savings_ratio)
Rewards delivering the SoW economically. It deliberately has no time term, because wall-clock is dominated by external API latency rather than the model’s skill. Instead it blends low absolute dollar cost with how effectively caching paid off: the net amount saved by cache reads at the provider’s discounted rate, less any cache-write premium, as a share of the no-cache bill.
Up to 14 points. The exponential decay means a very cheap run earns almost the full cost term, and a run whose caching meaningfully cut its bill earns the savings term on top — rewarding both spending little and caching well.
| Component | Weight | What it measures | Normalized to 0–1 |
|---|---|---|---|
| Run cost | 60% | Total cost of the run in USD, summed from the per-request token/cost aggregates in the run’s meta. (The site never renders the raw dollar figure, only the resulting score.) | Asymptotic exponential decay exp(−cost / 45): cost 0 → 1.0, and the score halves about every 31 cost units (45 · ln 2 ≈ 31.2), approaching but never reaching 0. |
| Effective caching | 40% | Net dollars saved by caching as a share of the no-cache bill: cache reads priced at the provider’s discounted read rate save (input − read) per token, and any cache-write premium (write − input, e.g. on models that charge to populate the cache) is netted back out. Deepseek has no write charge, so its writes cost nothing. | cache_savings_usd ÷ nominal (no-cache) cost, clamped to 0–1 — the fraction of the hypothetical un-cached bill that caching avoided. |
The caching term needs per-model pricing + cache-token telemetry; a run without it drops the term and the cost term renormalizes to the full subject weight.
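The two terms and the fallback can be sketched as below. The function names, the per-token rates in the example, and the choice to floor the write premium at zero (so a provider with no write surcharge contributes nothing) are assumptions for illustration.

```python
import math
from typing import Optional


def cost_term(cost_usd: float) -> float:
    """exp(-cost / 45): 1.0 at zero cost, halving every 45 * ln 2 cost units."""
    return math.exp(-cost_usd / 45)


def cache_savings_usd(read_tokens: int, write_tokens: int,
                      input_rate: float, read_rate: float, write_rate: float) -> float:
    """Net dollars saved: read discount minus any cache-write premium."""
    saved = read_tokens * (input_rate - read_rate)
    premium = write_tokens * max(0.0, write_rate - input_rate)
    return saved - premium


def cache_savings_ratio(savings_usd: float, nominal_cost_usd: float) -> float:
    """Share of the hypothetical no-cache bill that caching avoided, clamped to 0-1."""
    if nominal_cost_usd <= 0:
        return 0.0
    return max(0.0, min(1.0, savings_usd / nominal_cost_usd))


def cost_efficiency(cost_usd: float, savings_ratio: Optional[float]) -> float:
    """14 x (0.6 cost + 0.4 caching); without caching data the cost term takes full weight."""
    if savings_ratio is None:
        return 14 * cost_term(cost_usd)
    return 14 * (0.6 * cost_term(cost_usd) + 0.4 * savings_ratio)


# Hypothetical per-token rates in USD (not any provider's real price list).
savings = cache_savings_usd(1_000_000, 100_000,
                            input_rate=3e-6, read_rate=0.3e-6, write_rate=3.75e-6)
ratio = cache_savings_ratio(savings, nominal_cost_usd=5.0)
print(cost_efficiency(10.0, ratio), cost_efficiency(10.0, None))
```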
Tool Discipline
12 × (0.25·tool_err + 0.20·halluc + 0.20·retry + 0.20·first_try + 0.15·unauth)
Measures how cleanly the agents used their tools: the behavioral hygiene of the run. Frequent tool errors, hallucinated/invalid calls, retries and unauthorized-tool attempts all signal a model fighting its harness.
Up to 12 points. Clean, first-try, in-bounds tool use approaches 12; noisy or out-of-bounds tool use pulls it down.
| Component | Weight | What it measures | Normalized to 0–1 |
|---|---|---|---|
| Tool error rate | 25% | Share of tool calls that failed, including unauthorized-tool attempts ((failed + unauthorized) ÷ total tool calls). | Inverted: 1 − rate (fewer errors → higher score). |
| Hallucination rate | 20% | Share of tool calls that were schema-invalid (ToolValidationError — the model invoked an unknown tool or malformed arguments). Distinct from unauthorized attempts. | Inverted: 1 − rate. |
| Retry burden | 20% | Total tool-call retries ÷ total tool calls. | Inverted: 1 − rate. |
| First-try hit rate | 20% | Share of tool calls that succeeded on the first attempt (success and zero retries) ÷ total tool calls. | Already 0–1; used directly (higher → better). |
| Unauthorized-tool attempts | 15% | Count of attempts to call a tool the agent was not permitted to use, detected from error events. | 1 − 0.2 × attempts, floored at 0 (0 attempts → 1.0). |
These are telemetry-derived. Today Favur emits tool-call events without an agent/role tag, so per-agent attribution is unavailable and these metrics are computed run-wide; when a metric has no data it renormalizes away. This is the one documented Favur-app data gap.
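Computed run-wide, the five components reduce to simple ratios over the tool-call counts. The sketch below assumes a hypothetical `ToolStats` record; the field names are not Favur's telemetry schema, and the clamp on the retry term is an added safeguard for runs where retries outnumber calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolStats:
    """Run-wide tool-call counts (hypothetical field names)."""
    total: int
    failed: int
    unauthorized: int
    schema_invalid: int
    retries: int
    first_try_successes: int


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def tool_discipline(s: ToolStats) -> Optional[float]:
    """12 x (0.25 tool_err + 0.20 halluc + 0.20 retry + 0.20 first_try + 0.15 unauth)."""
    if s.total == 0:
        return None  # no tool telemetry: nothing to score
    tool_err = clamp01(1 - (s.failed + s.unauthorized) / s.total)
    halluc = clamp01(1 - s.schema_invalid / s.total)
    retry = clamp01(1 - s.retries / s.total)
    first_try = clamp01(s.first_try_successes / s.total)
    unauth = max(0.0, 1 - 0.2 * s.unauthorized)  # five attempts zero the term
    return 12 * (0.25 * tool_err + 0.20 * halluc + 0.20 * retry
                 + 0.20 * first_try + 0.15 * unauth)


# Hypothetical run with 200 tool calls.
stats = ToolStats(total=200, failed=4, unauthorized=1,
                  schema_invalid=2, retries=6, first_try_successes=188)
print(tool_discipline(stats))
```

The unauthorized-attempt term is count-based rather than rate-based, so a single attempt costs the same 20% of that component whether the run made 50 tool calls or 5,000.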
Workflow & Reporting
12 × (0.50·workflow_adherence + 0.50·report_quality)
Measures whether the agents followed Favur’s workflow contract and reported their work well: opening and closing work cycles correctly, and producing structured completion reports rather than terse or empty ones.
Up to 12 points, split evenly between following the process and documenting it.
| Component | Weight | What it measures | Normalized to 0–1 |
|---|---|---|---|
| Workflow adherence | 50% | Per agent, whether it both opened a work cycle (work_begin) and closed it with a terminal tool (work_end / strategy_done / completion_report), averaged across agents. | Already 0–1 (mean of per-agent adherence); used directly. |
| Report quality | 50% | Structure and substance of each agent’s best terminal report — presence and length of result/summary text plus citations, decisions and cautions — averaged across agents. | Already 0–1 (structured-report score); used directly. |
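Workflow adherence can be sketched as a per-agent check over the tool names each agent called. The terminal tool names come from the table above; everything else is an assumption: the input shape, the tool names `write_file` and `read_file`, and treating adherence as mere presence of both calls without checking their order. Report quality is taken as a given 0–1 input because its scoring rubric is not spelled out here.

```python
TERMINAL_TOOLS = {"work_end", "strategy_done", "completion_report"}


def workflow_adherence(calls_by_agent: dict[str, list[str]]) -> float:
    """Mean over agents of: opened a cycle AND closed it with a terminal tool."""
    if not calls_by_agent:
        raise ValueError("no agents to score")
    flags = [
        1.0 if "work_begin" in calls and TERMINAL_TOOLS & set(calls) else 0.0
        for calls in calls_by_agent.values()
    ]
    return sum(flags) / len(flags)


def workflow_reporting(adherence: float, report_quality: float) -> float:
    """12 x (0.50 workflow_adherence + 0.50 report_quality)."""
    return 12 * (0.50 * adherence + 0.50 * report_quality)


# Hypothetical transcript: one agent closes its cycle, the other never does.
calls = {
    "code": ["work_begin", "write_file", "work_end"],
    "review": ["work_begin", "read_file"],
}
adherence = workflow_adherence(calls)
print(adherence, workflow_reporting(adherence, report_quality=0.5))
```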
Effort Efficiency
12 × (0.30·turns + 0.30·token_econ + 0.20·tool_breadth + 0.20·ctx_window)
Measures economy of effort per unit of output: reaching completion in few turns, with lean token spend per request, a healthy breadth of tools, and without crowding the context window.
Up to 12 points. Efficient, focused runs score high; runs that thrash, over-spend tokens, or rely on a single tool score low.
| Component | Weight | What it measures | Normalized to 0–1 |
|---|---|---|---|
| Turns to completion | 30% | Total agent sessions the run took (floored at 1). Fewer is better. | Band: 1 session → 1.0, 15 sessions → 0.0 (linear). |
| Token economy | 30% | Average tokens per request (total tokens ÷ requests). Fewer is better. | Band: 6,000 tok/req → 1.0, 50,000 → 0.0 (linear). |
| Tool breadth | 20% | Count of distinct tools the run used. Broader use signals fluency with the harness rather than hammering one tool. | Band: 12 distinct tools → 1.0, 1 → 0.0 (linear). |
| Context-window usage | 20% | Largest single-request prompt size as a share of the model’s context window (default 200k tokens). Lower means more headroom. | Band: 25% of window → 1.0, 90% → 0.0 (linear). |
These are scale metrics with no natural 0–1 form, so each maps through a manifest normalization band (a “good” value scores 1.0, a “poor” value 0.0, linear between).
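A linear band of this kind is a one-line interpolation with clamping. The sketch below applies it to the four bands in the table; the helper names and the example inputs are hypothetical.

```python
def band(value: float, good: float, poor: float) -> float:
    """Linear map: good -> 1.0, poor -> 0.0, clamped outside the band.

    Works in either direction, so "lower is better" metrics simply pass a
    good value that is smaller than the poor one.
    """
    t = (value - poor) / (good - poor)
    return max(0.0, min(1.0, t))


def effort_efficiency(sessions: int, total_tokens: int, requests: int,
                      distinct_tools: int, max_prompt_tokens: int,
                      context_window: int = 200_000) -> float:
    """12 x (0.30 turns + 0.30 token_econ + 0.20 tool_breadth + 0.20 ctx_window)."""
    turns = band(max(1, sessions), good=1, poor=15)
    token_econ = band(total_tokens / requests, good=6_000, poor=50_000)
    tool_breadth = band(distinct_tools, good=12, poor=1)
    ctx_window = band(max_prompt_tokens / context_window, good=0.25, poor=0.90)
    return 12 * (0.30 * turns + 0.30 * token_econ
                 + 0.20 * tool_breadth + 0.20 * ctx_window)


# Hypothetical run: 8 sessions, 28,000 tokens per request, 12 tools,
# largest prompt at 25% of a 200k window.
print(effort_efficiency(sessions=8, total_tokens=28_000_000, requests=1_000,
                        distinct_tools=12, max_prompt_tokens=50_000))
```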
Process Discipline
8 × (0.50·command_success_rate + 0.25·verification_density + 0.25·structured_workflow)
Measures how rigorously the run drove its own engineering workflow, read from the telemetry event stream: whether the commands it ran actually succeeded, how densely it verified its own work, and whether it used the structured sprint/phase/delegation process rather than running flat.
Up to 8 points. A run with a high command-success rate, dense verification and a full sprint/phase/delegation structure approaches 8; a run that thrashes failing commands and never verifies or structures its work scores low.
| Component | Weight | What it measures | Normalized to 0–1 |
|---|---|---|---|
| Command success rate | 50% | Share of shell commands the run executed that exited successfully (command successes ÷ commands run), from the telemetry command events. A genuinely discriminating signal — baseline runs sit at 34–41%. | Already 0–1; used directly (clamped). |
| Verification density | 25% | Verification events per 100 commands — how often the run checked its own work relative to how much it did. | Band: 20 verifications / 100 commands → 1.0, 0 → 0.0 (linear). |
| Structured workflow | 25% | Whether the run used the structured workflow at all — credited for opening at least one sprint, at least one phase, and at least one delegation (one third each). | Mean of the three 0/1 signals. |
Telemetry-derived. A run with no `.favur/telemetry` event stream drops every component and the subject renormalizes away rather than scoring 0.
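Put together, the three components look roughly like this. The function signature and the event counts are hypothetical; returning `None` when no commands were recorded is one way to model the subject dropping out.

```python
from typing import Optional


def process_discipline(commands_run: int, command_successes: int,
                       verifications: int, opened_sprint: bool,
                       opened_phase: bool, used_delegation: bool) -> Optional[float]:
    """8 x (0.50 success rate + 0.25 verification density + 0.25 structure)."""
    if commands_run == 0:
        return None  # no command telemetry to score
    success_rate = max(0.0, min(1.0, command_successes / commands_run))
    # Verifications per 100 commands, banded linearly: 20 -> 1.0, 0 -> 0.0.
    per_100 = 100 * verifications / commands_run
    verification_density = max(0.0, min(1.0, per_100 / 20))
    # One third of the structure credit for each signal that is present.
    structured = (opened_sprint + opened_phase + used_delegation) / 3
    return 8 * (0.50 * success_rate + 0.25 * verification_density
                + 0.25 * structured)


# Hypothetical run: 40% command success, 10 verifications per 100 commands,
# sprints and phases used but no delegation.
print(process_discipline(250, 100, 25, True, True, False))
```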
Deliverables
6 × (0.40·completeness + 0.30·output_volume + 0.30·cost_per_loc)
Measures the shipped output recorded in the run’s deliverables manifest: whether it produced a complete set of artifacts, how much code it shipped, and how economically (cost per line of code).
Up to 6 points. A run that ships source, tests and docs in healthy volume at a low cost-per-LOC approaches 6; a thin or expensive-per-LOC run scores low.
| Component | Weight | What it measures | Normalized to 0–1 |
|---|---|---|---|
| Completeness | 40% | Whether the run shipped each of the three artifact kinds — source, tests and docs — counted as the fraction of the three present. | Fraction of {source, test, doc} kinds present (0, ⅓, ⅔, 1). |
| Output volume | 30% | Shipped lines of code: source + test LOC from the manifest, de-duplicated by content hash (docs are excluded so they cannot dominate the volume). | Band: 8,000 code LOC → 1.0, 200 → 0.0 (linear). |
| Cost per LOC | 30% | Run cost divided by shipped code LOC — how expensive each line of delivered code was. Lower is better. | Band: 0.001 / LOC → 1.0, 0.02 / LOC → 0.0 (linear). |
A run without a `meta.deliverables` manifest drops every component and the subject renormalizes away. The cost-per-LOC term also drops when cost or code LOC is unavailable.
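The sketch below combines the three components and shows the cost-per-LOC term dropping out when cost is unavailable. Names and example numbers are hypothetical, and the cost is assumed to be in the same USD units as the run cost.

```python
from typing import Optional


def band(value: float, good: float, poor: float) -> float:
    """Linear map: good -> 1.0, poor -> 0.0, clamped outside the band."""
    return max(0.0, min(1.0, (value - poor) / (good - poor)))


def deliverables(has_source: bool, has_tests: bool, has_docs: bool,
                 code_loc: int, cost_usd: Optional[float]) -> float:
    """6 x (0.40 completeness + 0.30 output_volume + 0.30 cost_per_loc)."""
    parts = [
        (0.40, (has_source + has_tests + has_docs) / 3),
        (0.30, band(code_loc, good=8_000, poor=200)),
    ]
    # The cost-per-LOC term needs both a cost and a non-zero LOC count.
    if cost_usd is not None and code_loc > 0:
        parts.append((0.30, band(cost_usd / code_loc, good=0.001, poor=0.02)))
    covered = sum(w for w, _ in parts)
    return 6 * sum(w * v for w, v in parts) / covered


# Hypothetical run: all three artifact kinds, 4,100 code LOC, cost of 43.05.
print(deliverables(True, True, True, 4_100, 43.05))
# Same run with no cost data: the remaining two weights renormalize.
print(deliverables(True, True, True, 4_100, None))
```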