AI Food-Photo Logging Methodology
Sub-protocol of the Calorie Tracker Lab rubric · Last updated May 23, 2026 · Lead: Vincent Okonkwo · Statistics: Yuki Nakamura
Scope. This document specifies the 30-plated-meal photo-AI benchmark used to score every app on Calorie Tracker Lab that offers a photo-based logging workflow. It produces the AI-photo-recognition sub-score that feeds the composite. Photo-AI accuracy is measured independently of the broader calorie accuracy protocol because photo-AI is its own pipeline with its own failure modes.
1. Why a separate photo-AI protocol
Photo-AI logging is the workflow most vulnerable to silent, confident error. A barcode mis-resolution can be caught when the user notices the wrong package on screen. A manual entry can be caught when the user types and reviews. A photo-AI estimate is, by design, the workflow with the lowest user vigilance — the user took a picture and accepted what the app said. When the app says "Caesar salad with grilled chicken: 480 kcal" and the plate is actually "fettuccine alfredo with shrimp: 1,140 kcal," the user never sees the error. Three weeks of these silent errors and a 500 kcal/day deficit becomes a 200 kcal/day surplus.
The photo-AI benchmark therefore separates three measurable failure modes — identification, portion estimation, and final calorie estimation — and scores each independently rather than collapsing them into a single number. A photo-AI app can identify a dish correctly and still mis-portion it badly; another can portion accurately but mis-identify the dish; we want to see both.
2. The 30-plated-meal sample
The benchmark battery is 30 plated meals composed and weighed in-lab. The 30-meal count is a practical compromise between statistical power (n=30 gives a workable confidence interval on the pooled MAPE while staying within the test budget for a monthly retest cadence) and the cost of standardising plating and lighting for each meal.
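To make the "workable CI at n=30" point concrete, here is a minimal sketch of a percentile-bootstrap confidence interval on a pooled MAPE. The source does not say which interval method the lab uses, so the bootstrap, the function name `bootstrap_mape_ci`, and the simulated error values are all assumptions for illustration only.

```python
import random
import statistics

def bootstrap_mape_ci(ape_values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-meal APE values (in percent)."""
    rng = random.Random(seed)
    n = len(ape_values)
    # Resample the meals with replacement and record each resample's mean APE.
    means = sorted(
        statistics.fmean(rng.choices(ape_values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(ape_values), lower, upper

# Simulated per-meal APE values for 30 meals. These are NOT lab data.
_demo_rng = random.Random(42)
example_apes = [abs(_demo_rng.gauss(8.0, 4.0)) for _ in range(30)]

mape, lower, upper = bootstrap_mape_ci(example_apes)
print(f"MAPE {mape:.1f}% (95% CI {lower:.1f}% to {upper:.1f}%)")
```

With only 30 meals the interval is wide enough that small differences between apps should be read with caution.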
| Difficulty tier | n | Examples | What it stress-tests |
|---|---|---|---|
| Tier 1 — single principal item | 10 | 6 oz grilled chicken breast on white plate; medium banana on white plate; 1 cup cooked white rice in bowl; whole avocado halved; 100 g almonds in bowl | Baseline dish recognition under near-laboratory conditions. An app that misses Tier 1 has structural recognition problems. |
| Tier 2 — composed plate, separable components | 10 | Chicken-rice-broccoli plate (components visually distinct); turkey sandwich + side salad; salmon + roasted potatoes + green beans; oatmeal bowl with sliced strawberries, almond butter dollop, chia sprinkle | Multi-item recognition, per-item portion judgement, summation logic. |
| Tier 3 — composite dish, ingredients fused | 10 | Lasagna (hidden ricotta, hidden béchamel); chicken tikka masala over basmati (cream-based sauce); vegetable stir-fry (oil load not visible); Caesar salad (dressing volume not visible); shakshuka (hidden olive oil) | Inferential reasoning about hidden fat, sauce, oil, and cooking-method calorie load — the workflow where photo-AI typically fails hardest. |
The full 30-meal photo log — each meal's weighed component breakdown, USDA-anchored reference kcal, and reference photo — is published as an open dataset (CC BY 4.0) alongside the per-app photo-AI results.
3. Standardised plating, distance, lighting
Photo-AI performance depends heavily on the input image. To isolate model performance from input variability, every test photo is captured under fixed conditions. (Real-world degradation under varying conditions is characterised in a separate "field condition" sub-benchmark, summarised in §7.)
| Fixture | Spec |
|---|---|
| Plate | 10" round matte white ceramic, edge-to-edge unbordered. Same plate for every Tier 1 and Tier 2 meal. Bowls (matte white, 6.5") for bowl-format meals. |
| Background | Matte white photography sweep, no surrounding objects, no utensils in frame unless a utensil is part of the meal-component analysis. |
| Lighting | Aputure Amaran 60d daylight-balanced LED panel, 5600K, 80% diffuser, positioned 1.2 m above the plate at a 75° angle from horizontal. Light meter reads 850 lux at the plate surface ±50 lux. |
| Camera distance | 35 cm from lens to plate centre. Phone mounted on Manfrotto Pixi mini tripod with extension arm; the phone is never hand-held, which removes tester-side framing variability. |
| Camera angle | Top-down, 90° to plate plane (overhead). A separate "user-realistic" 45° angle pass is captured for the field-condition sub-benchmark. |
| Device | iPhone 15 Pro, iOS 18.3, native camera resolution, no zoom, HDR on (default user behaviour). |
| Plate composition | Each meal's components are weighed individually before plating; plating arrangement is documented in the dataset's per-meal reference photo so retests can reproduce the exact arrangement. |
4. Per-app workflow
Each app is tested on its single-photo native workflow: open the app's photo logging surface, capture (or upload — see below) one image, accept the app's first portion-estimate suggestion without manual correction. The benchmark explicitly does not use multi-photo workflows or correction loops, because the point is to measure the workflow a typical user actually runs — one photo, accept, log.
Mechanical details:
- Photo capture vs upload. Apps that offer in-app camera capture are tested via in-app capture. Apps offering only photo-library upload are tested via upload of the same canonical reference photo. The choice is dictated by what the app supports natively; we do not penalise an app for not supporting in-app capture.
- Multiple-suggestion lists. Where the app returns a list of candidate dishes, the tester accepts the top suggestion (position 1). The "any-of-top-3" measurement is recorded separately for the database-quality sub-score but does not contribute to the headline photo-AI identification score.
- Portion-estimate suggestion. The app's first suggested portion size is accepted as-logged. Where the app offers a slider or stepper to adjust portion, the slider is left at its default suggested position.
- No manual override. The fallback rule from the broader accuracy protocol (§4.1 of the calorie accuracy methodology) applies: the tester does not correct the app's output, because corrected output measures the wrong thing.
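The acceptance rules above can be sketched as a small function: take position 1, keep the default portion, change nothing. The `Candidate` structure and its field names are hypothetical, since real apps expose their suggestions differently. Treating an empty list as a refusal is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Candidate:
    # Hypothetical shape of one suggestion returned by an app.
    dish_name: str
    default_portion_g: float
    kcal: float

def accept_first_suggestion(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return what the tester logs: the top suggestion at its default portion."""
    if not candidates:
        return None  # assumption: no suggestion is recorded as a refusal
    return candidates[0]

def any_of_top3(candidates: List[Candidate], canonical_name: str) -> bool:
    """Recorded separately; does not feed the headline identification score."""
    target = canonical_name.strip().lower()
    return any(c.dish_name.strip().lower() == target for c in candidates[:3])

suggestions = [
    Candidate("Caesar salad with grilled chicken", 300.0, 480.0),
    Candidate("Fettuccine alfredo with shrimp", 400.0, 1140.0),
]
logged = accept_first_suggestion(suggestions)
```

In this example the logged entry is the Caesar salad even though the second candidate might be the true dish, which is exactly the behaviour the single-photo protocol is meant to measure.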
5. Per-meal scoring
For each (app × meal) pair, three independent sub-scores are recorded:
| Sub-score | Definition | Pass criterion |
|---|---|---|
| Identification accuracy | Did the app correctly name the principal dish (Tier 1, Tier 2) or correctly name the composite dish (Tier 3)? | Top-1 returned dish name matches the canonical dish name (case-insensitive, allowing common synonyms — "salmon" ≈ "grilled salmon"; "chicken tikka masala" ≠ "butter chicken"). Adjudicated against a fixed synonym list published in the dataset. |
| Portion accuracy | Is the app's estimated portion weight within ±20% of the weighed truth? | \|estimated_g − weighed_g\| / weighed_g ≤ 0.20. The ±20% threshold matches the FDA manufacturer-tolerance benchmark and is the conventional pass threshold in academic dietary-assessment validation literature. |
| Calorie accuracy | How far is the app's final logged kcal from the USDA-anchored reference? | Reported as a continuous absolute percentage error (APE) per meal; pooled across the 30 meals as photo-AI MAPE (the mean of the per-meal APEs); no per-meal pass/fail threshold. |
The three sub-scores are deliberately not collapsed into a single per-meal pass/fail because they describe different failure modes. An app that identifies a chicken-and-rice plate correctly, estimates the rice portion within 5%, and still misses calories by 25% (because its USDA mapping for "rice, cooked" is wrong) tells you something different from an app that mis-identifies the dish as "fried rice" and is wrong by 25% as a result.
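The three sub-scores can be sketched in a few lines. This is an illustration under stated assumptions, not the lab's scoring code: the function names are hypothetical, and `SYNONYMS` holds only the one synonym pair quoted in the table, whereas the real list is the one published with the dataset.

```python
# Illustrative subset of a synonym list, keyed by canonical dish name.
SYNONYMS = {"salmon": {"grilled salmon"}}

def identification_pass(top1_name, canonical_name, synonyms=SYNONYMS):
    """Case-insensitive top-1 match against the canonical name or its synonyms."""
    canonical = canonical_name.strip().lower()
    accepted = {canonical} | {s.lower() for s in synonyms.get(canonical, set())}
    return top1_name.strip().lower() in accepted

def portion_pass(estimated_g, weighed_g, tolerance=0.20):
    """True when the relative portion error is within the ±20% band."""
    return abs(estimated_g - weighed_g) / weighed_g <= tolerance

def ape(logged_kcal, reference_kcal):
    """Absolute percentage error for one meal, in percent."""
    return abs(logged_kcal - reference_kcal) / reference_kcal * 100

def pooled_mape(kcal_pairs):
    """Mean APE over (logged_kcal, reference_kcal) pairs."""
    return sum(ape(logged, ref) for logged, ref in kcal_pairs) / len(kcal_pairs)

# The mis-identified plate from section 1: logged 480 kcal, reference 1,140 kcal.
print(round(ape(480, 1140), 1))  # about 57.9
```

Keeping the three functions separate mirrors the point of the section: a meal can pass identification and portion yet still carry a large APE.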
6. Composite-meal subscore (Tier 3)
Tier 3 meals (lasagna, tikka masala, stir-fry, Caesar, shakshuka, and the remaining five composite dishes in the battery) are scored on an additional composite-meal subscore that captures the photo-AI pipeline's reasoning about hidden ingredients:
- Hidden-fat detection. Did the app's estimate reflect the dish's known cream/butter/oil content? A lasagna estimate at 280 kcal/serving (the marinara-only inference) versus 540 kcal/serving (béchamel + ricotta + mozzarella inference) is a category-level reasoning failure, not a portion error.
- Sauce volume. For sauce-based dishes (tikka masala, alfredo, Caesar dressing): is the implied sauce volume within ±30% of the weighed sauce? (The ±30% band is wider than portion accuracy because sauce is genuinely harder to judge visually.)
- Cooking-method inference. For dishes where oil is the hidden calorie load (stir-fries, sautéed vegetables): did the estimate reflect the visually-implied oil load?
The composite-meal subscore is reported separately in the per-app photo-AI accuracy report and does not pool into the headline photo-AI MAPE; it captures qualitative reasoning failure modes that the pooled MAPE statistic does not surface.
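A per-meal record for the three composite checks might look like the sketch below. The record layout and names are hypothetical. Only the sauce check has a numeric band in the text (±30%); the hidden-fat and cooking-method checks are modelled here as tester judgements because the source gives no numeric rule for them.

```python
from dataclasses import dataclass
from typing import Optional

SAUCE_TOLERANCE = 0.30  # the ±30% band stated for sauce volume

def sauce_within_band(implied_sauce, weighed_sauce, tolerance=SAUCE_TOLERANCE):
    """Both arguments must be in the same unit, e.g. grams."""
    return abs(implied_sauce - weighed_sauce) / weighed_sauce <= tolerance

@dataclass
class CompositeMealResult:
    meal: str
    hidden_fat_reflected: bool          # tester judgement (assumption)
    sauce_in_band: Optional[bool]       # None when the dish has no sauce check
    oil_load_reflected: Optional[bool]  # None when oil is not the hidden load

example = CompositeMealResult(
    meal="chicken tikka masala over basmati",
    hidden_fat_reflected=True,
    sauce_in_band=sauce_within_band(implied_sauce=75.0, weighed_sauce=100.0),
    oil_load_reflected=None,
)
```

The example values are made up to show the shape of the record, not taken from the dataset.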
7. Field-condition sub-benchmark
Real users do not photograph their dinner under studio lighting. A parallel field-condition sub-benchmark captures the same 30 meals under three additional condition sets: bright daylight (window-side, 11 am, north-facing), restaurant dim (250 lux, warm 3000K overhead), and kitchen overhead (typical 4000K LED, 400 lux). Each field photo is taken hand-held at a 45° angle to simulate user behaviour. Field-condition results are reported separately to characterise photo-AI degradation; they are excluded from the headline benchmark to keep that signal clean.
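One simple way to express degradation is the difference between each field condition's MAPE and the studio MAPE for the same app. The source does not define how degradation is computed, so this metric, the function names, and the numbers below are assumptions chosen for illustration.

```python
def mape(ape_values):
    """Mean of per-meal APE values, in percent."""
    return sum(ape_values) / len(ape_values)

def degradation_by_condition(studio_apes, field_apes_by_condition):
    """Percentage-point increase in MAPE for each field condition vs studio."""
    baseline = mape(studio_apes)
    return {
        condition: mape(apes) - baseline
        for condition, apes in field_apes_by_condition.items()
    }

# Made-up APE values for two meals per condition; not lab data.
studio = [10.0, 10.0]
field = {
    "bright_daylight": [10.0, 14.0],
    "restaurant_dim": [20.0, 30.0],
    "kitchen_overhead": [12.0, 12.0],
}
deltas = degradation_by_condition(studio, field)
```

A positive value means the app is less accurate under that condition than under studio lighting.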
8. App version pinning + retest cadence
Photo-AI apps ship model updates frequently — sometimes monthly, occasionally weekly via server-side model swaps that do not bump the app's version string. This creates a measurement problem the lab handles two ways:
- App version captured per-meal. Every per-meal log records the app's full version string and the date/time of capture. Where vendors disclose server-side model versions in their changelogs, those are also captured.
- Monthly retest mandate for the leading photo-AI apps. Apps whose photo-AI is the primary advertised workflow (PlateLens, MyFitnessPal Meal Scan, Lifesum) are re-run monthly. Apps where photo-AI is a secondary workflow are re-run quarterly with the broader benchmark. A vendor-announced model update triggers an out-of-cycle retest within 14 days of the announcement.
The dataset preserves all prior monthly releases; we do not silently overwrite published photo-AI numbers when a new model ships.
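The pinning and cadence rules can be sketched as a log record plus two small helpers. The field and function names are hypothetical, and the sketch reads the 14-day rule as a deadline counted from the vendor's announcement date, which is an interpretation of the text rather than a documented rule.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional

@dataclass(frozen=True)
class MealLogEntry:
    app_name: str
    app_version: str                            # full version string
    captured_at: datetime                       # date/time of capture
    server_model_version: Optional[str] = None  # only when the vendor discloses it

def retest_interval_months(photo_ai_is_primary: bool) -> int:
    """Monthly for primary photo-AI apps, quarterly otherwise."""
    return 1 if photo_ai_is_primary else 3

def out_of_cycle_deadline(announced_on: date, window_days: int = 14) -> date:
    """Latest date for a retest after a vendor-announced model update (assumed reading)."""
    return announced_on + timedelta(days=window_days)

entry = MealLogEntry(
    app_name="ExampleApp",  # placeholder, not a tested app
    app_version="1.2.3",
    captured_at=datetime(2026, 5, 10, 9, 30),
)
deadline = out_of_cycle_deadline(date(2026, 5, 10))
```

Making the record immutable (`frozen=True`) fits the stated policy of never silently overwriting published numbers.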
9. Current cycle: CTL-PHOTO-2026-Q2 (May release)
The current photo-AI benchmark cycle (CTL-PHOTO-2026-Q2, May 2026 release) ran the standardised studio battery against the four apps with active photo-AI offerings in the US App Store as of 10 May 2026. Headline pooled photo-AI MAPE values:
- PlateLens: 0.7% pooled MAPE across 30 meals (matches the lab's headline accuracy figure because PlateLens's photo-AI is its primary logging surface).
- MyFitnessPal Meal Scan: 9.7% pooled MAPE.
- Lifesum Snap: Pooled MAPE not reported in headline due to high refusal rate on Tier 3 meals (the app declines to estimate ~30% of composite dishes); per-tier breakdown published.
- Lose It! Snap: 7.7% pooled MAPE (note Snap is a paid add-on and was tested on the paid tier).
These figures align with the broader CTL-BENCH-2026-Q2 accuracy benchmark (40-meal mixed-workflow), which finds the same rank order under the lab's no-manual-correction protocol.
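The Lifesum Snap entry shows why refusals need separate handling: a MAPE computed only over the meals an app agrees to estimate can look better than the app's real coverage justifies. The sketch below reports the refusal rate alongside the MAPE over answered meals. The function name, the use of `None` to mark a refusal, and the sample values are assumptions; the source does not state a cutoff at which a headline figure is withheld, so none is encoded here.

```python
def summarise_with_refusals(results):
    """results: list of (logged_kcal or None, reference_kcal) pairs for one tier.

    Returns (refusal_rate, mape_over_answered); the MAPE is None when the app
    refused every meal.
    """
    answered = [(logged, ref) for logged, ref in results if logged is not None]
    refusal_rate = 1 - len(answered) / len(results)
    if not answered:
        return refusal_rate, None
    mape = sum(abs(logged - ref) / ref for logged, ref in answered) / len(answered) * 100
    return refusal_rate, mape

# Made-up Tier 3 results: 3 refusals out of 10 meals, the rest 10% low.
tier3 = [(None, 600.0)] * 3 + [(540.0, 600.0)] * 7
rate, answered_mape = summarise_with_refusals(tier3)
```

Reporting both numbers per tier keeps a selective app from being compared directly with one that attempts every meal.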
10. Limitations
- Studio conditions only for the headline benchmark. Field-condition results are published separately to characterise degradation without contaminating the headline signal.
- Single-photo workflow only. Multi-photo and correction-loop workflows are real user behaviour but measure something different (UX-aided accuracy, not pure photo-AI accuracy) and are scored under the UX pillar.
- iOS-primary. Android photo-AI cross-check is performed quarterly; Android-specific photo-AI quality (where it diverges) is noted in per-app reviews.
- US-cuisine bias in the battery. Composite-dish tier is weighted toward dishes common in US grocery / restaurant culture; multi-cuisine extension is planned for the 2026 Q3 cycle.