// PROTOCOL — CTL-PHOTO-v1.0

AI Food-Photo Logging Methodology

Sub-protocol of the Calorie Tracker Lab rubric · Last updated May 23, 2026 · Lead: Vincent Okonkwo · Statistics: Yuki Nakamura

Scope. This document specifies the 30-plated-meal photo-AI benchmark used to score every app on Calorie Tracker Lab that offers a photo-based logging workflow. It produces the AI-photo-recognition sub-score that feeds the composite. Photo-AI accuracy is measured independently of the broader calorie accuracy protocol because photo-AI is its own pipeline with its own failure modes.

1. Why a separate photo-AI protocol

Photo-AI logging is the workflow most vulnerable to silent, confident error. A barcode mis-resolution can be caught when the user notices the wrong package on screen. A manual entry can be caught when the user types and reviews. A photo-AI estimate is, by design, the workflow with the lowest user vigilance — the user took a picture and accepted what the app said. When the app says "Caesar salad with grilled chicken: 480 kcal" and the plate is actually "fettuccine alfredo with shrimp: 1,140 kcal," the user never sees the error. Three weeks of these silent errors and a 500 kcal/day deficit becomes a 200 kcal/day surplus.

The photo-AI benchmark therefore separates three measurable failure modes — identification, portion estimation, and final calorie estimation — and scores each independently rather than collapsing them into a single number. A photo-AI app can identify a dish correctly and still mis-portion it badly; another can portion accurately but mis-identify the dish; we want to see both.

2. The 30-plated-meal sample

The benchmark battery is 30 plated meals composed and weighed in-lab. The 30-meal count is a practical compromise: n = 30 gives a workable confidence interval on the pooled MAPE while keeping the cost of standardising plating and lighting for each meal within the test budget for a monthly retest cadence.
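To give a feel for what a "workable" interval at n = 30 looks like, the sketch below computes a percentile bootstrap confidence interval on a pooled MAPE. The protocol does not state which interval method the lab uses, so the bootstrap is an assumption here, and the APE values are illustrative placeholders rather than lab data.

```python
import random
import statistics


def mape(apes):
    """Pooled MAPE: the mean of per-meal absolute percentage errors (in %)."""
    return statistics.fmean(apes)


def bootstrap_ci(apes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the pooled MAPE (assumed method, not the lab's)."""
    rng = random.Random(seed)
    n = len(apes)
    # Resample the meals with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(apes, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Illustrative per-meal APE values in percent (hypothetical, 30 meals).
example_apes = [4.0, 7.5, 2.1, 12.0, 9.3, 5.5] * 5
point = mape(example_apes)
low, high = bootstrap_ci(example_apes)
print(f"MAPE {point:.1f}% (95% CI {low:.1f}% to {high:.1f}%)")
```

A wider spread of per-meal errors widens the interval at a fixed n, which is the trade-off the 30-meal count is balancing against plating cost.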

Difficulty tier	n	Examples	What it stress-tests
Tier 1 — single principal item	10	6 oz grilled chicken breast on white plate; medium banana on white plate; 1 cup cooked white rice in bowl; whole avocado halved; 100 g almonds in bowl	Baseline dish recognition under near-laboratory conditions. An app that misses Tier 1 has structural recognition problems.
Tier 2 — composed plate, separable components	10	Chicken-rice-broccoli plate (components visually distinct); turkey sandwich + side salad; salmon + roasted potatoes + green beans; oatmeal bowl with sliced strawberries, almond butter dollop, chia sprinkle	Multi-item recognition, per-item portion judgement, summation logic.
Tier 3 — composite dish, ingredients fused	10	Lasagna (hidden ricotta, hidden béchamel); chicken tikka masala over basmati (cream-based sauce); vegetable stir-fry (oil load not visible); Caesar salad (dressing volume not visible); shakshuka (hidden olive oil)	Inferential reasoning about hidden fat, sauce, oil, and cooking-method calorie load — the workflow where photo-AI typically fails hardest.

The full 30-meal photo log — each meal's weighed component breakdown, USDA-anchored reference kcal, and reference photo — is published as an open dataset (CC BY 4.0) alongside the per-app photo-AI results.

3. Standardised plating, distance, lighting

Photo-AI performance depends heavily on the input image. To isolate model performance from input variability, every test photo is captured under fixed conditions. (Real-world degradation under varying conditions is characterised in a separate "field condition" sub-benchmark, summarised in §7.)

Fixture	Spec
Plate	10" round matte white ceramic, edge-to-edge unbordered. Same plate for every Tier 1 and Tier 2 meal. Bowls (matte white, 6.5") for bowl-format meals.
Background	Matte white photography sweep, no surrounding objects, no utensils in frame unless a utensil is part of the meal-component analysis.
Lighting	Aputure Amaran 60d daylight-balanced LED panel, 5600K, 80% diffuser, positioned 1.2 m above the plate at a 75° angle from horizontal. Light meter reads 850 ± 50 lux at the plate surface.
Camera distance	35 cm from lens to plate centre. Phone mounted on Manfrotto Pixi mini tripod with extension arm; phone is not hand-held to remove tester-side framing variability.
Camera angle	Top-down, 90° to plate plane (overhead). A separate "user-realistic" 45° angle pass is captured for the field-condition sub-benchmark.
Device	iPhone 15 Pro, iOS 18.3, native camera resolution, no zoom, HDR on (default user behaviour).
Plate composition	Each meal's components are weighed individually before plating; plating arrangement is documented in the dataset's per-meal reference photo so retests can reproduce the exact arrangement.
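The fixed conditions above lend themselves to a pre-capture check. The following is a minimal sketch, assuming each capture's conditions are recorded in a small record. `CaptureConditions`, `out_of_spec`, and the distance and angle tolerances are hypothetical; only the ±50 lux tolerance comes from the fixture table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureConditions:
    lux: float
    colour_temp_k: int
    lens_to_plate_cm: float
    camera_angle_deg: float


# Targets taken from the fixture table.
STUDIO_LUX = 850.0
STUDIO_LUX_TOL = 50.0  # stated in the table
STUDIO_COLOUR_TEMP_K = 5600
STUDIO_DISTANCE_CM = 35.0
STUDIO_ANGLE_DEG = 90.0


def out_of_spec(c, distance_tol_cm=1.0, angle_tol_deg=2.0):
    """Return the names of fields outside the studio spec.

    The distance and angle tolerances are assumptions; the protocol
    gives a tolerance only for illuminance.
    """
    problems = []
    if abs(c.lux - STUDIO_LUX) > STUDIO_LUX_TOL:
        problems.append("lux")
    if c.colour_temp_k != STUDIO_COLOUR_TEMP_K:
        problems.append("colour_temp_k")
    if abs(c.lens_to_plate_cm - STUDIO_DISTANCE_CM) > distance_tol_cm:
        problems.append("lens_to_plate_cm")
    if abs(c.camera_angle_deg - STUDIO_ANGLE_DEG) > angle_tol_deg:
        problems.append("camera_angle_deg")
    return problems


studio = CaptureConditions(860.0, 5600, 35.0, 90.0)
kitchen = CaptureConditions(400.0, 4000, 35.0, 45.0)
print(out_of_spec(studio))   # empty list: within spec
print(out_of_spec(kitchen))  # lists the fields that deviate
```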

4. Per-app workflow

Each app is tested on its single-photo native workflow: open the app's photo logging surface, capture (or upload — see below) one image, accept the app's first portion-estimate suggestion without manual correction. The benchmark explicitly does not use multi-photo workflows or correction loops, because the point is to measure the workflow a typical user actually runs — one photo, accept, log.

Mechanical details:

Photo capture vs upload. Apps that offer in-app camera capture are tested via in-app capture. Apps offering only photo-library upload are tested via upload of the same canonical reference photo. The choice is dictated by what the app supports natively; we do not penalise an app for not supporting in-app capture.
Multiple-suggestion lists. Where the app returns a list of candidate dishes, the tester accepts the top suggestion (position 1). The "any-of-top-3" measurement is recorded separately for the database-quality sub-score but does not contribute to the headline photo-AI identification score.
Portion-estimate suggestion. The app's first suggested portion size is accepted as-logged. Where the app offers a slider or stepper to adjust portion, the slider is left at its default suggested position.
No manual override. The fallback rule from the broader accuracy protocol (§4.1 of the calorie accuracy methodology) applies: the tester does not correct the app's output, because corrected output measures the wrong thing.

5. Per-meal scoring

For each (app × meal) pair, three independent sub-scores are recorded:

Sub-score	Definition	Pass criterion
Identification accuracy	Did the app correctly name the principal dish (Tier 1, Tier 2) or correctly name the composite dish (Tier 3)?	Top-1 returned dish name matches the canonical dish name (case-insensitive, allowing common synonyms — "salmon" ≈ "grilled salmon"; "chicken tikka masala" ≠ "butter chicken"). Adjudicated against a fixed synonym list published in the dataset.
Portion accuracy	Is the app's estimated portion weight within ±20% of the weighed truth?	|estimated_g − weighed_g| / weighed_g ≤ 0.20. The ±20% threshold matches the FDA manufacturer-tolerance benchmark and is the conventional pass threshold in academic dietary-assessment validation literature.
Calorie accuracy	How far does the app's final logged kcal deviate from the USDA-anchored reference?	Reported as a continuous absolute percentage error (APE) per meal; pooled across the 30 meals as photo-AI MAPE; no per-meal pass/fail threshold.
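The identification criterion in the table can be sketched as a normalised lookup against a synonym list. The structure of the published synonym list is not described here, so the `SYNONYMS` mapping below is a hypothetical excerpt built from the two examples in the table, and the function names are illustrative.

```python
def normalise(name):
    """Lower-case and collapse whitespace so matching is case-insensitive."""
    return " ".join(name.lower().split())


# Hypothetical excerpt: canonical dish name -> accepted returned names.
SYNONYMS = {
    "salmon": {"salmon", "grilled salmon"},
    "chicken tikka masala": {"chicken tikka masala"},  # "butter chicken" is not accepted
}


def is_match(canonical, returned):
    key = normalise(canonical)
    accepted = SYNONYMS.get(key, {key})
    return normalise(returned) in accepted


def score_identification(canonical, suggestions):
    """Score top-1 (headline) and any-of-top-3 (recorded separately)."""
    top1 = bool(suggestions) and is_match(canonical, suggestions[0])
    top3 = any(is_match(canonical, s) for s in suggestions[:3])
    return {"top1": top1, "any_of_top3": top3}


print(score_identification("Salmon", ["Grilled Salmon"]))
print(score_identification("chicken tikka masala",
                           ["butter chicken", "Chicken Tikka Masala"]))
```

Keeping the two measurements in one result makes it harder to accidentally feed the any-of-top-3 value into the headline identification score.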

The three sub-scores are deliberately not collapsed into a single per-meal pass/fail because they describe different failure modes. An app that identifies a chicken-and-rice plate correctly, estimates the rice portion within 5%, and still misses calories by 25% (because its USDA mapping for "rice, cooked" is wrong) tells you something different from an app that mis-identifies the dish as "fried rice" and is wrong by 25% as a result.
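The portion and calorie criteria reduce to two small formulas. The sketch below implements them as defined in the table; the function names are illustrative, and the 480 kcal versus 1,140 kcal pair reuses the mis-identification example from §1.

```python
def portion_pass(estimated_g, weighed_g, tolerance=0.20):
    """Portion accuracy: relative error against the weighed truth within tolerance."""
    return abs(estimated_g - weighed_g) / weighed_g <= tolerance


def ape(logged_kcal, reference_kcal):
    """Absolute percentage error of the logged kcal against the reference, in %."""
    return 100.0 * abs(logged_kcal - reference_kcal) / reference_kcal


def pooled_mape(pairs):
    """Mean APE over (logged_kcal, reference_kcal) pairs."""
    return sum(ape(logged, ref) for logged, ref in pairs) / len(pairs)


print(portion_pass(170, 150))       # relative error of about 13%
print(portion_pass(190, 150))       # relative error of about 27%
print(round(ape(480, 1140), 1))     # the Caesar-vs-alfredo example from section 1
print(pooled_mape([(90, 100), (220, 200)]))
```

Because APE is an absolute value, over- and under-estimates do not cancel when pooled, so a low MAPE cannot hide large errors of opposite sign.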

6. Composite-meal sub-score (Tier 3)

Tier 3 meals (lasagna, tikka masala, stir-fry, Caesar, shakshuka, and the remaining five composite dishes in the battery) are scored on an additional composite-meal sub-score that captures the photo-AI pipeline's reasoning about hidden ingredients:

Hidden-fat detection. Did the app's estimate reflect the dish's known cream/butter/oil content? A lasagna estimate at 280 kcal/serving (the marinara-only inference) versus 540 kcal/serving (béchamel + ricotta + mozzarella inference) is a category-level reasoning failure, not a portion error.
Sauce volume. For sauce-based dishes (tikka masala, alfredo, Caesar dressing): is the implied sauce volume within ±30% of the weighed sauce? (The ±30% band is wider than portion accuracy because sauce is genuinely harder to judge visually.)
Cooking-method inference. For dishes where oil is the hidden calorie load (stir-fries, sautéed vegetables): did the estimate reflect the visually-implied oil load?

The composite-meal sub-score is reported separately in the per-app photo-AI accuracy report and does not pool into the headline photo-AI MAPE; it captures qualitative reasoning failure modes that the pooled MAPE statistic does not surface.
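Two of the Tier 3 checks can be sketched numerically. The sauce check follows the ±30% band stated above. The hidden-fat check is only one possible operationalisation, since the protocol does not define a decision rule: it flags an estimate that sits closer to the lean inference than to the full-fat reference. Both function names are hypothetical.

```python
def sauce_volume_pass(implied, weighed, tolerance=0.30):
    """Sauce check: implied sauce amount within the band around the weighed sauce.

    Both arguments must use the same unit.
    """
    return abs(implied - weighed) / weighed <= tolerance


def hidden_fat_missed(estimate_kcal, lean_kcal, full_kcal):
    """Assumed rule: True when the estimate is closer to the lean inference
    than to the full-fat reference for the same serving."""
    return abs(estimate_kcal - lean_kcal) < abs(estimate_kcal - full_kcal)


# Lasagna figures from the text: 280 kcal (marinara-only) vs 540 kcal (full-fat).
print(hidden_fat_missed(300, lean_kcal=280, full_kcal=540))
print(hidden_fat_missed(500, lean_kcal=280, full_kcal=540))
print(sauce_volume_pass(100, 120))
```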

7. Field-condition sub-benchmark

Real users do not photograph their dinner under studio lighting. A parallel field-condition sub-benchmark captures the same 30 meals hand-held at a 45° angle, to simulate user behaviour, under three additional condition sets: bright daylight (window-side, 11 am, north-facing), restaurant dim (250 lux, warm 3000K overhead), and kitchen overhead (typical 4000K LED, 400 lux). Field-condition results are reported separately to characterise photo-AI degradation; to keep the headline signal clean, they do not contribute to the headline benchmark.
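One simple way to express degradation is the difference, in percentage points, between each field condition's MAPE and the studio MAPE for the same app. The sketch below assumes that definition, which the protocol does not specify; the condition keys mirror the three condition sets named above, and the APE values are illustrative only.

```python
from statistics import fmean


def degradation_points(studio_apes, field_apes_by_condition):
    """Field MAPE minus studio MAPE per condition, in percentage points."""
    baseline = fmean(studio_apes)
    return {
        condition: fmean(apes) - baseline
        for condition, apes in field_apes_by_condition.items()
    }


# Hypothetical per-meal APE values (percent) for one app.
studio = [6.0, 10.0]
field = {
    "bright_daylight": [7.0, 11.0],
    "restaurant_dim": [12.0, 18.0],
    "kitchen_overhead": [8.0, 10.0],
}
deltas = degradation_points(studio, field)
print(deltas)
```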

8. App version pinning + retest cadence

Photo-AI apps ship model updates frequently — sometimes monthly, occasionally weekly via server-side model swaps that do not bump the app's version string. This creates a measurement problem the lab handles two ways:

App version captured per-meal. Every per-meal log records the app's full version string and the date/time of capture. Where vendors disclose server-side model versions in their changelogs, those are also captured.
Monthly retest mandate for the leading photo-AI apps. Apps whose photo-AI is the primary advertised workflow (PlateLens, MyFitnessPal Meal Scan, Lifesum) are re-run monthly. Apps where photo-AI is a secondary workflow are re-run quarterly with the broader benchmark. A vendor-announced model update triggers an out-of-cycle retest within 14 days of the announcement.

The dataset preserves all prior monthly releases; we do not silently overwrite published photo-AI numbers when a new model ships.
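The cadence rules can be sketched as a due-date calculation. The sketch makes several assumptions: "monthly" is taken as 30 days and "quarterly" as 91 days, the 14-day rule is read as a deadline counted from the vendor's announcement, and `next_retest_due` is a hypothetical helper rather than part of the lab's tooling.

```python
from datetime import date, timedelta

MONTHLY = timedelta(days=30)              # assumed length of a monthly cycle
QUARTERLY = timedelta(days=91)            # assumed length of a quarterly cycle
OUT_OF_CYCLE_WINDOW = timedelta(days=14)  # from the protocol


def next_retest_due(last_test, photo_ai_primary, update_announced=None):
    """Return the date by which the app must next be retested."""
    scheduled = last_test + (MONTHLY if photo_ai_primary else QUARTERLY)
    # An announced model update after the last test can pull the deadline forward.
    if update_announced is not None and update_announced > last_test:
        return min(scheduled, update_announced + OUT_OF_CYCLE_WINDOW)
    return scheduled


last = date(2026, 5, 10)
print(next_retest_due(last, photo_ai_primary=True))
print(next_retest_due(last, photo_ai_primary=False))
print(next_retest_due(last, photo_ai_primary=False,
                      update_announced=date(2026, 5, 20)))
```

Server-side model swaps that are never announced would not trigger the out-of-cycle branch, which is why the per-meal version and timestamp capture matters alongside the schedule.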

9. Current cycle: CTL-PHOTO-2026-Q2 (May release)

The current photo-AI benchmark cycle (CTL-PHOTO-2026-Q2, May 2026 release) ran the standardised studio battery against the four apps with active photo-AI offerings in the US App Store as of 10 May 2026. Headline pooled photo-AI MAPE values:

PlateLens: 0.7% pooled MAPE across 30 meals (matches the lab's headline accuracy figure because PlateLens's photo-AI is its primary logging surface).
MyFitnessPal Meal Scan: 9.7% pooled MAPE.
Lifesum Snap: pooled MAPE not reported in headline due to high refusal rate on Tier 3 meals (the app declines to estimate ~30% of composite dishes); per-tier breakdown published.
Lose It! Snap: 7.7% pooled MAPE (note Snap is a paid add-on and was tested on the paid tier).

These figures align with the broader CTL-BENCH-2026-Q2 accuracy benchmark (40-meal mixed-workflow), which finds the same rank order under the lab's no-manual-correction protocol.
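Refusals complicate pooling, because a MAPE computed only over the meals an app agreed to estimate can look better than the app's real coverage. The sketch below shows one way to produce a per-tier breakdown and withhold the headline figure. The `max_refusal_rate` threshold of 10% is purely hypothetical, since the protocol does not state the cut-off the lab applies, and the sample data are illustrative.

```python
from collections import defaultdict
from statistics import fmean


def summarise(results, max_refusal_rate=0.10):
    """results: list of (tier, ape) pairs; ape is None when the app declined.

    The threshold is a hypothetical value, not the lab's rule.
    """
    by_tier = defaultdict(list)
    for tier, ape in results:
        by_tier[tier].append(ape)

    per_tier = {}
    for tier, apes in sorted(by_tier.items()):
        answered = [a for a in apes if a is not None]
        per_tier[tier] = {
            "refusal_rate": 1 - len(answered) / len(apes),
            "mape": fmean(answered) if answered else None,
        }

    # Withhold the headline if any tier exceeds the refusal threshold.
    suppress = any(t["refusal_rate"] > max_refusal_rate for t in per_tier.values())
    all_answered = [a for _, a in results if a is not None]
    headline = None if suppress or not all_answered else fmean(all_answered)
    return {"headline_mape": headline, "per_tier": per_tier}


with_refusals = summarise([(1, 5.0), (1, 7.0), (3, 20.0), (3, None)])
no_refusals = summarise([(1, 5.0), (3, 15.0)])
print(with_refusals)
print(no_refusals)
```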

10. Limitations

Studio conditions only for the headline benchmark. Field-condition results are published separately to characterise degradation without contaminating the headline signal.
Single-photo workflow only. Multi-photo and correction-loop workflows are real user behaviour but measure something different (UX-aided accuracy, not pure photo-AI accuracy) and are scored under the UX pillar.
iOS-primary. Android photo-AI cross-check is performed quarterly; Android-specific photo-AI quality (where it diverges) is noted in per-app reviews.
US-cuisine bias in the battery. Composite-dish tier is weighted toward dishes common in US grocery / restaurant culture; multi-cuisine extension is planned for the 2026 Q3 cycle.