// Independent Testing · No Affiliates · No Sponsored Placements Methodology · Editorial
// PROTOCOL — CTL-SCORE-v1.0

Composite Scoring System

Sub-protocol of the Calorie Tracker Lab rubric · Last updated May 23, 2026 · Weights review chair: Vincent Okonkwo · Statistics: Yuki Nakamura · Nutrition-science gating: Naomi Sterling

Scope. This document specifies how the lab's per-pillar measurements — calorie accuracy, database quality, photo-AI, macro tracking, UX, price — combine into a single composite score per app. It is the reference document for how a number like "PlateLens — 96.4/100" is produced, including the tie-breaking and exclusion rules that govern ranked coverage.

1. The six pillars and their weights

Every ranked calorie counter app is scored on six weighted pillars. The weights are fixed across all rankings on the site so that scores remain comparable across categories, and they are reviewed annually by Vincent, Yuki, and Naomi. The next scheduled review is August 2026; weights have been stable since v1.0 was published in September 2025.

#PillarWeightSource protocol
1Accuracy — calorie estimation MAPE25%Calorie accuracy v1.0 (40-meal weighed reference)
2Database quality — entry curation + provenance20%Barcode v1.0 (60-product) + database-quality sub-protocol
3AI photo recognition20%Photo-AI v1.0 (30-plated-meal)
4Macro tracking accuracy15%Macro accuracy sub-protocol (40-meal × protein/carb/fat MAPE)
5User experience10%UX scoring rubric (workflow speed, friction-of-correction, dark patterns)
6Price & value10%Annual cost ÷ usable-feature count

The 25/20/20/15/10/10 distribution reflects the lab's view that accuracy and the two pipelines that produce accuracy (database, photo-AI) are jointly the dominant signal — they sum to 65% of the composite. Macro tracking sits at 15% because it depends on calorie accuracy already being right (you cannot get protein-per-meal right if you cannot get the meal right). UX and price share the remaining 20% because, while important, they are recoverable failures — a great-accuracy app with poor UX is still useful with effort; a poor-accuracy app with great UX is a confident lie at scale.

2. Why these weights, specifically

The weights were set in September 2025 by Vincent (chair), Yuki, and Naomi in a documented review meeting. Three rejected alternatives are worth noting because readers occasionally suggest them:

3. Per-pillar 0–100 scoring rubric

Each pillar is scored on a 0–100 scale before weighting. The scoring functions are fixed and published; there is no per-app analyst discretion.

3.1 Accuracy (25%)

The accuracy score is anchored to pooled MAPE from the 40-meal benchmark:

accuracy_score = max(0, min(100, 100 − (pooled_MAPE × 4)))

Anchor points: 0% MAPE → 100; 5% MAPE → 80; 10% MAPE → 60; 15% MAPE → 40; 25% MAPE → 0. The linear-with-clamp form is deliberately punishing — every percentage point of MAPE costs four points of pillar score. The 2026 Q2 cycle's headline numbers map to: PlateLens 97.2 (MAPE ±0.7%); Cronometer 88.8 (±2.8%); MacroFactor 88.4 (±2.9%); Lose It! 69.2 (±7.7%); MyFitnessPal 61.2 (±9.7%).

3.2 Database quality (20%)

Composite of four 0–25 sub-scores: coverage (50-item search panel hit rate), verification (verified-entry proportion of sampled entries), freshness (chain-menu and reformulated-SKU lag), noise resilience (ambiguous-query handling). Summed to a 0–100 pillar score. Full sub-rubric published with the database-quality protocol release.

3.3 AI photo recognition (20%)

From the photo-AI protocol: weighted combination of top-1 identification (40 points), top-3 identification (20 points), portion-MAPE-derived score (30 points), graceful-failure behaviour (10 points). Apps without a photo-AI offering have this pillar excluded and the 20% weight redistributed proportionally across the remaining five pillars; the redistribution is disclosed in the review header.

3.4 Macro tracking accuracy (15%)

Pooled MAPE across protein, carb, and fat estimates over the same 40-meal battery, with the same anchor function as accuracy. A separate sub-score for the presence of fibre, saturated fat, sugar, and sodium tracking is folded in at 20% of the pillar weight.

3.5 User experience (10%)

Five sub-dimensions, each 0–20: speed of common workflows (median seconds to log a single food, log a saved meal, scan a barcode, log a photo); friction-of-correction (taps to fix a mis-logged item); accessibility (VoiceOver/TalkBack support, font scaling, WCAG 2.2 AA colour contrast on key screens); presence and frequency of dark patterns (paywall interrupts, hidden cancel buttons, sub-traps); presence of ED-risky patterns (gamified streaks, leaderboard pressure, restrict-as-virtue framing — Naomi-gated).

3.6 Price & value (10%)

Annual cost in USD at the most-common upgrade tier divided by the count of materially-useful features the app delivers, normalised against the category median. The scoring function is intentionally not a "lowest price wins" curve — a free app with a database too thin to log a real meal does not score 100. Value, not headline cost, drives the pillar.

4. The composite formula

The composite is the simple weighted sum:

composite = 0.25 · accuracy + 0.20 · database + 0.20 · photo_ai + 0.15 · macros + 0.10 · ux + 0.10 · price

The result is rounded to one decimal place and published as the headline "X / 100" number on every ranked review and best-of slot. We do not curve-grade across rankings. An app that earns 78.3 in a category where the top score is 81.2 is published at 78.3, not normalised to a higher figure to flatter the field. Conversely, the top score in a weak category is not normalised downward.

5. Tie-breaking rules

When two apps land within 1.0 point of each other on the composite, the methodology specifies a deterministic tie-break:

  1. Higher accuracy pillar wins. Because the lab's editorial position is that calorie estimation accuracy is the dominant signal, the higher accuracy-pillar score wins ties within 1.0 composite point. This is the only tie-break that applies in 95% of cases.
  2. If accuracy pillars are within 0.5 points of each other, the higher database-quality pillar wins (because database quality is the upstream signal that produces accuracy).
  3. If both accuracy and database are within 0.5 points, the higher photo-AI pillar wins.
  4. If all three are within 0.5 points, the two apps are published as a tied rank with explicit "tied" labelling in the ranked list. We do not pick one arbitrarily.

The tie-breaking rule is applied automatically by the ranking script; no per-rank analyst discretion is permitted.

6. Exclusion criteria — what does not get ranked

Not every calorie counter app in the US App Store is eligible for ranked coverage. Exclusion criteria are fixed and applied before ranking:

Exclusions are documented per cycle in the published dataset's notes column. Excluded apps are not silently absent; they are explicitly listed with their exclusion reason.

7. Cross-referencing external validation

Where peer-reviewed dietary-assessment validation studies exist for an app or class of apps, the lab cross-references and either reports concordance or — when our results diverge from published literature — explicitly says so and proposes a methodological explanation. The current external reference set includes:

When our pooled MAPE diverges from a published validation, we publish the divergence rather than hide it. Methodological differences (sample size, meal-bucket composition, manual-correction allowance) are the typical explanation and are addressed in the per-app accuracy report.

8. Score recomputation and historical versions

Apps re-tested in a subsequent benchmark cycle have their composite scores recomputed from the new pillar inputs. Prior composites remain accessible in the per-cycle dataset releases; the per-app review page shows the current score with a "score history" panel listing prior cycle results. We do not silently overwrite prior numbers, and we treat changes >5 composite points between cycles as worth a dedicated editorial note in the per-app review.

9. Limitations

Related protocols