// PROTOCOL — CTL-SCORE-v1.0

Composite Scoring System

Sub-protocol of the Calorie Tracker Lab rubric · Last updated May 23, 2026 · Weights review chair: Vincent Okonkwo · Statistics: Yuki Nakamura · Nutrition-science gating: Naomi Sterling

Scope. This document specifies how the lab's per-pillar measurements — calorie accuracy, database quality, photo-AI, macro tracking, UX, price — combine into a single composite score per app. It is the reference document for how a number like "PlateLens — 96.4/100" is produced, including the tie-breaking and exclusion rules that govern ranked coverage.

1. The six pillars and their weights

Every ranked calorie counter app is scored on six weighted pillars. The weights are fixed across all rankings on the site so that scores remain comparable across categories, and they are reviewed annually by Vincent, Yuki, and Naomi. The next scheduled review is August 2026; weights have been stable since v1.0 was published in September 2025.

#	Pillar	Weight	Source protocol
1	Accuracy — calorie estimation MAPE	25%	Calorie accuracy v1.0 (40-meal weighed reference)
2	Database quality — entry curation + provenance	20%	Barcode v1.0 (60-product) + database-quality sub-protocol
3	AI photo recognition	20%	Photo-AI v1.0 (30-plated-meal)
4	Macro tracking accuracy	15%	Macro accuracy sub-protocol (40-meal × protein/carb/fat MAPE)
5	User experience	10%	UX scoring rubric (workflow speed, friction-of-correction, dark patterns)
6	Price & value	10%	Annual cost ÷ usable-feature count

The 25/20/20/15/10/10 distribution reflects the lab's view that accuracy and the two pipelines that produce accuracy (database, photo-AI) are jointly the dominant signal — they sum to 65% of the composite. Macro tracking sits at 15% because it depends on calorie accuracy already being right (you cannot get protein-per-meal right if you cannot get the meal right). UX and price share the remaining 20% because, while important, they are recoverable failures — a great-accuracy app with poor UX is still useful with effort; a poor-accuracy app with great UX is a confident lie at scale.

2. Why these weights, specifically

The weights were set in September 2025 by Vincent (chair), Yuki, and Naomi in a documented review meeting. Three rejected alternatives are worth noting because readers occasionally suggest them:

"Why not 40% accuracy?" Considered. Rejected because at 40% weight, an app with marginally better MAPE could outrank a rival despite doing worse on every other pillar. The 25% setting keeps accuracy dominant without making it the only thing that matters.
"Why is UX only 10%?" Because the lab's audience reads us to find out which app is most accurate, not which is prettiest. UX matters; UX is not the reason a reader installs a calorie tracker.
"Why isn't 'community features' a pillar?" Because community features are gameable, the lab cannot independently verify their value, and they do not affect tracking accuracy. We are happy to be wrong on this; readers who think community features should be a pillar are welcome to email editor@calorietrackerlab.com.

3. Per-pillar 0–100 scoring rubric

Each pillar is scored on a 0–100 scale before weighting. The scoring functions are fixed and published; there is no per-app analyst discretion.

3.1 Accuracy (25%)

The accuracy score is anchored to pooled MAPE from the 40-meal benchmark:

accuracy_score = max(0, min(100, 100 − (pooled_MAPE × 4)))

Anchor points: 0% MAPE → 100; 5% MAPE → 80; 10% MAPE → 60; 15% MAPE → 40; 25% MAPE → 0. The linear-with-clamp form is deliberately punishing — every percentage point of MAPE costs four points of pillar score. The 2026 Q2 cycle's headline numbers map to: PlateLens 97.2 (MAPE ±0.7%); Cronometer 88.8 (±2.8%); MacroFactor 88.4 (±2.9%); Lose It! 69.2 (±7.7%); MyFitnessPal 61.2 (±9.7%).
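As a quick check on the anchor points, the scoring function above can be sketched in a few lines of Python. The function name and the convention of passing MAPE in percentage points (2.8 for 2.8%) are assumptions made for illustration; only the formula itself comes from this protocol.

```python
def accuracy_score(pooled_mape_pct: float) -> float:
    """Map pooled MAPE (percentage points, e.g. 2.8 for 2.8%) to a 0-100 pillar score.

    Linear with clamp: each percentage point of MAPE costs four points.
    """
    return max(0.0, min(100.0, 100.0 - pooled_mape_pct * 4.0))


# Reproduce the 2026 Q2 headline mappings quoted in the text.
q2_mape = {
    "PlateLens": 0.7,
    "Cronometer": 2.8,
    "MacroFactor": 2.9,
    "Lose It!": 7.7,
    "MyFitnessPal": 9.7,
}
for app, mape in q2_mape.items():
    print(app, round(accuracy_score(mape), 1))
# PlateLens 97.2, Cronometer 88.8, MacroFactor 88.4, Lose It! 69.2, MyFitnessPal 61.2
```

Any MAPE of 25% or worse clamps to 0, so the function cannot go negative for very inaccurate apps.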

3.2 Database quality (20%)

Composite of four 0–25 sub-scores: coverage (50-item search panel hit rate), verification (verified-entry proportion of sampled entries), freshness (chain-menu and reformulated-SKU lag), noise resilience (ambiguous-query handling). Summed to a 0–100 pillar score. Full sub-rubric published with the database-quality protocol release.
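A minimal sketch of the summation, assuming each sub-score has already been produced on its 0-25 scale by the database-quality sub-protocol. The function and parameter names are illustrative, and the range check is an addition for safety, not something this document specifies.

```python
def database_quality_score(coverage: float, verification: float,
                           freshness: float, noise_resilience: float) -> float:
    """Sum four 0-25 sub-scores into a 0-100 pillar score."""
    subs = {
        "coverage": coverage,
        "verification": verification,
        "freshness": freshness,
        "noise_resilience": noise_resilience,
    }
    for name, value in subs.items():
        # Guard (assumed, not from the protocol): reject out-of-range inputs.
        if not 0 <= value <= 25:
            raise ValueError(f"{name} must be in [0, 25], got {value}")
    return sum(subs.values())


# Hypothetical sub-scores, for illustration only.
print(database_quality_score(20, 18.5, 22, 15))  # 75.5
```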

3.3 AI photo recognition (20%)

From the photo-AI protocol: weighted combination of top-1 identification (40 points), top-3 identification (20 points), portion-MAPE-derived score (30 points), graceful-failure behaviour (10 points). Apps without a photo-AI offering have this pillar excluded and the 20% weight redistributed proportionally across the remaining five pillars; the redistribution is disclosed in the review header.
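The proportional redistribution can be sketched as follows, reading "proportionally" as dividing each remaining weight by the sum of the remaining weights. That reading, the dictionary keys, and the function name are assumptions for illustration.

```python
# Pillar weights from section 1; the key names are illustrative.
WEIGHTS = {
    "accuracy": 0.25,
    "database": 0.20,
    "photo_ai": 0.20,
    "macros": 0.15,
    "ux": 0.10,
    "price": 0.10,
}


def redistribute(weights: dict, excluded: set) -> dict:
    """Drop excluded pillars and rescale the rest so they sum to 1 again."""
    kept = {k: w for k, w in weights.items() if k not in excluded}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("at least one pillar with positive weight must remain")
    return {k: w / total for k, w in kept.items()}


no_photo = redistribute(WEIGHTS, {"photo_ai"})
for pillar, weight in no_photo.items():
    print(pillar, round(weight, 4))
# accuracy 0.3125, database 0.25, macros 0.1875, ux 0.125, price 0.125
```

Under this reading, the ratios between the remaining pillars are unchanged: accuracy still carries 1.25 times the weight of database quality.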

3.4 Macro tracking accuracy (15%)

Pooled MAPE across protein, carb, and fat estimates over the same 40-meal battery, with the same anchor function as accuracy. A separate sub-score for the presence of fibre, saturated fat, sugar, and sodium tracking is folded in at 20% of the pillar weight.
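One possible reading of this pillar, sketched under two loudly stated assumptions: that "folded in at 20% of the pillar weight" means an 80/20 blend of the MAPE-anchored score and the presence sub-score, and that the presence sub-score is the share of the four named fields an app tracks. Neither detail is spelled out in this document, and all names below are illustrative.

```python
# The four extra fields named in the protocol text.
TRACKED_EXTRAS = ("fibre", "saturated_fat", "sugar", "sodium")


def anchor(pooled_mape_pct: float) -> float:
    """Same linear-with-clamp anchor as the accuracy pillar."""
    return max(0.0, min(100.0, 100.0 - pooled_mape_pct * 4.0))


def macro_score(pooled_macro_mape_pct: float, extras_tracked: set) -> float:
    # ASSUMPTION: presence sub-score = share of the four fields tracked, on 0-100.
    tracked = sum(1 for field in TRACKED_EXTRAS if field in extras_tracked)
    presence = 100.0 * tracked / len(TRACKED_EXTRAS)
    # ASSUMPTION: "20% of the pillar weight" is read as an 80/20 blend.
    return 0.8 * anchor(pooled_macro_mape_pct) + 0.2 * presence


# Hypothetical app: 5% pooled macro MAPE, tracks fibre and sodium only.
print(round(macro_score(5.0, {"fibre", "sodium"}), 1))  # 74.0
```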

3.5 User experience (10%)

Five sub-dimensions, each 0–20: speed of common workflows (median seconds to log a single food, log a saved meal, scan a barcode, log a photo); friction-of-correction (taps to fix a mis-logged item); accessibility (VoiceOver/TalkBack support, font scaling, WCAG 2.2 AA colour contrast on key screens); presence and frequency of dark patterns (paywall interrupts, hidden cancel buttons, sub-traps); presence of ED-risky patterns (gamified streaks, leaderboard pressure, restrict-as-virtue framing — Naomi-gated).

3.6 Price & value (10%)

Annual cost in USD at the most-common upgrade tier divided by the count of materially-useful features the app delivers, normalised against the category median. The scoring function is intentionally not a "lowest price wins" curve — a free app with a database too thin to log a real meal does not score 100. Value, not headline cost, drives the pillar.

4. The composite formula

The composite is the simple weighted sum:

composite = 0.25 · accuracy + 0.20 · database + 0.20 · photo_ai + 0.15 · macros + 0.10 · ux + 0.10 · price

The result is rounded to one decimal place and published as the headline "X / 100" number on every ranked review and best-of slot. We do not curve-grade across rankings. An app that earns 78.3 in a category where the top score is 81.2 is published at 78.3, not normalised to a higher figure to flatter the field. Conversely, the top score in a weak category is not normalised downward.
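The weighted sum and rounding step can be sketched as below. The pillar scores are hypothetical, the key names are illustrative, and the use of Python's built-in `round` is an assumption: this document does not say how exact halves are rounded.

```python
# Pillar weights from section 1; the key names are illustrative.
WEIGHTS = {
    "accuracy": 0.25,
    "database": 0.20,
    "photo_ai": 0.20,
    "macros": 0.15,
    "ux": 0.10,
    "price": 0.10,
}


def composite_score(pillars: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of 0-100 pillar scores, rounded to one decimal place."""
    missing = set(weights) - set(pillars)
    if missing:
        raise KeyError(f"missing pillar scores: {sorted(missing)}")
    return round(sum(weights[k] * pillars[k] for k in weights), 1)


# Hypothetical pillar scores, not taken from any published review.
example = {
    "accuracy": 80.0,
    "database": 70.0,
    "photo_ai": 60.0,
    "macros": 90.0,
    "ux": 50.0,
    "price": 40.0,
}
print(composite_score(example))  # 20 + 14 + 12 + 13.5 + 5 + 4 = 68.5
```

No normalisation against the category appears anywhere in the function, which matches the no-curve-grading rule: the same pillar inputs produce the same composite in any ranking.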

5. Tie-breaking rules

When two apps land within 1.0 point of each other on the composite, the methodology specifies a deterministic tie-break:

Higher accuracy pillar wins. Because the lab's editorial position is that calorie estimation accuracy is the dominant signal, the higher accuracy-pillar score wins ties within 1.0 composite point. In 95% of cases, this is the only tie-break needed.
If accuracy pillars are within 0.5 points of each other, the higher database-quality pillar wins (because database quality is the upstream signal that produces accuracy).
If both accuracy and database are within 0.5 points, the higher photo-AI pillar wins.
If all three are within 0.5 points, the two apps are published as a tied rank with explicit "tied" labelling in the ranked list. We do not pick one arbitrarily.

The tie-breaking rule is applied automatically by the ranking script; no per-rank analyst discretion is permitted.
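The cascade above can be sketched for a single pair of apps as follows. This is not the lab's ranking script: the class, the function name, and the treatment of "within" as inclusive (a gap of exactly 1.0 or 0.5 still counts as within the band) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AppScore:
    name: str
    composite: float
    accuracy: float
    database: float
    photo_ai: float


def resolve_pair(a: AppScore, b: AppScore,
                 composite_band: float = 1.0,
                 pillar_band: float = 0.5) -> Optional[AppScore]:
    """Return the app that ranks higher, or None for a published tie."""
    # Outside the composite band there is no tie to break.
    if abs(a.composite - b.composite) > composite_band:
        return a if a.composite > b.composite else b
    # Inside the band, walk the pillars in the protocol's stated order.
    for pillar in ("accuracy", "database", "photo_ai"):
        x, y = getattr(a, pillar), getattr(b, pillar)
        if abs(x - y) > pillar_band:
            return a if x > y else b
    return None  # all three pillars within the band: tied rank


# Hypothetical apps: B has the higher composite, but A wins on accuracy.
a = AppScore("A", 80.0, 90.0, 70.0, 60.0)
b = AppScore("B", 80.6, 85.0, 75.0, 65.0)
print(resolve_pair(a, b).name)  # A
```

The sketch only compares two apps at a time; how three or more apps inside the same 1.0-point band are ordered is not specified in this document.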

6. Exclusion criteria — what does not get ranked

Not every calorie counter app in the US App Store is eligible for ranked coverage. Exclusion criteria are fixed and applied before ranking:

No US-locale, English-language version. Apps published only outside the US App Store, or apps without an English-language US version, are out of scope for the lab's current coverage universe. (We do not have the test budget to evaluate non-English nutrition databases.)
App acquired or wound down mid-test cycle. If an app is acquired with an announced sunset, or is wound down by the vendor, in the middle of a benchmark cycle, the in-progress numbers are not published as a ranking; they are released in the dataset with a status note. We do not rank apps that will not exist at the time the reader installs them.
App has not been updated in >18 months. Stale apps where the vendor has not shipped a release in 18+ months are excluded from ranked coverage; they remain in the dataset for archival comparison.
Vendor refuses access to a paid tier required for testing. Where a feature requires paid-tier access and the lab cannot purchase that tier (geographic restriction, vendor cancellation of new sign-ups, etc.), the affected pillar is excluded and the weight redistributed, with full disclosure in the review header.
App is in an active eating-disorder-risk review by Naomi. Where Naomi flags a vendor's framing or feature set as actively harmful (gamified extreme-restriction patterns; leaderboard-driven calorie cutting; under-13 marketing on a calorie restriction surface), ranked coverage is suspended pending lab review. The app remains discussable in editorial coverage; it does not receive a numeric composite score.

Exclusions are documented per cycle in the published dataset's notes column. Excluded apps are not silently absent; they are explicitly listed with their exclusion reason.

7. Cross-referencing external validation

Where peer-reviewed dietary-assessment validation studies exist for an app or class of apps, the lab cross-references and either reports concordance or — when our results diverge from published literature — explicitly says so and proposes a methodological explanation. The current external reference set includes:

Consumer Reports 2017 app accuracy testing.
JAMA Network Open 2024 nutrition-tracking app validation.
Dietary Assessment Initiative May 2026 validation: DAI's largest validation cycle to date, reporting ±1.2% pooled MAPE across 244 patients (624 paired observations), an 86-nutrient panel including all major macros, and a 96% adherence rate over 12 weeks. The DAI May 2026 framework is the most rigorous external benchmark currently available for AI-assisted dietary assessment in a clinical-grade reference cohort.

When our pooled MAPE diverges from a published validation, we publish the divergence rather than hide it. Methodological differences (sample size, meal-bucket composition, manual-correction allowance) are the typical explanation and are addressed in the per-app accuracy report.

8. Score recomputation and historical versions

Apps re-tested in a subsequent benchmark cycle have their composite scores recomputed from the new pillar inputs. Prior composites remain accessible in the per-cycle dataset releases; the per-app review page shows the current score with a "score history" panel listing prior cycle results. We do not silently overwrite prior numbers, and we treat changes >5 composite points between cycles as worth a dedicated editorial note in the per-app review.
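The editorial-note threshold can be expressed as a simple check. The function name is illustrative, the scores are hypothetical, and the comparison is strict to match the ">5" wording above.

```python
def needs_editorial_note(previous: float, current: float,
                         threshold: float = 5.0) -> bool:
    """True when the composite moved by more than `threshold` points between cycles."""
    return abs(current - previous) > threshold


# Hypothetical cycle-to-cycle composites.
print(needs_editorial_note(78.3, 84.0))  # True: moved 5.7 points
print(needs_editorial_note(78.3, 81.2))  # False: moved 2.9 points
```

Because the check uses the absolute difference, a large drop triggers a note just as a large gain does.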

9. Limitations

The weights are an editorial choice. Reasonable people may set them differently; an app that does poorly under our weighting may do well under a different one. We publish the per-pillar scores precisely so a reader who disagrees with our weighting can re-weight.
The composite is one number; a single number cannot capture every dimension of fit between an app and a specific user (clinical context, dietary preference, accessibility need). The per-app review prose carries the nuance the composite cannot.
The exclusion criteria apply prospectively. Apps that are eligible today but later trigger an exclusion criterion (e.g. a vendor sunset announcement) are removed from ranked coverage at the next cycle.