The 60-second version
Commercial wearable devices (Apple Watch, Garmin, Fitbit, Whoop) consistently misreport calorie expenditure in validation studies. The 2024 Stanford-led replication found errors of 27–93% across activities, depending on device and modality. Heart rate measurement is reasonably accurate; the calorie algorithm built on top of it is not. This article covers what wearables measure well, what they don’t, and which metrics to trust for training and recovery decisions.
What the validation studies found
A 2024 Stanford-led replication of earlier wearable-accuracy work tested seven popular consumer wearables (Apple Watch, Fitbit Sense, Garmin Forerunner, Whoop, and others) against indirect calorimetry, the laboratory gold standard for measuring energy expenditure. Subjects performed treadmill walking, treadmill running, stationary cycling, and resistance training while metabolic gas analysis ran in parallel.
Heart rate measurement was generally accurate: most devices fell within 5–10% of chest-strap values, with errors concentrated during high-intensity intervals where wrist optical sensors lose contact.
Calorie expenditure was a different story. Errors ranged from 27% on the best device for the best activity (Apple Watch, walking) to 93% on the worst device for the worst activity. Across all activities, none of the seven devices met the <10% error threshold typically required for clinical use.
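To make those percentages concrete, here is a minimal sketch of how a device reading can be scored against a calorimetry reference. It assumes the error is an absolute percentage of the reference value; the article does not state the exact metric the study used, and the session numbers are hypothetical.

```python
def percent_error(device_kcal: float, reference_kcal: float) -> float:
    """Absolute error of the device reading as a percentage of the reference."""
    if reference_kcal <= 0:
        raise ValueError("reference_kcal must be positive")
    return abs(device_kcal - reference_kcal) / reference_kcal * 100.0


def meets_clinical_threshold(device_kcal: float, reference_kcal: float,
                             threshold_pct: float = 10.0) -> bool:
    """True if the error is under the threshold (10% is the figure cited in the text)."""
    return percent_error(device_kcal, reference_kcal) < threshold_pct


# Hypothetical session: calorimetry measured 300 kcal, the watch reported 381 kcal.
print(round(percent_error(381, 300), 1))   # 27.0
print(meets_clinical_threshold(381, 300))  # False
```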
Why the algorithm fails
The math chain a wearable uses to estimate calories:
- Read the optical heart rate signal from the wrist.
- Apply a smoothing algorithm to clean noise.
- Estimate VO₂ from heart rate using a regression model.
- Convert VO₂ to energy expenditure using a respiratory quotient assumption.
- Apply a body-weight scaling factor based on user-entered demographics.
- Output a calorie number.
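The chain above can be sketched in a few lines. This is an illustration only: the regression slope, intercept, and fixed RQ are hypothetical placeholders, not any vendor's values, and the abbreviated Weir equation is assumed for the VO₂-to-energy step because it is a standard choice in indirect calorimetry. Body-weight scaling is applied before the energy conversion here; since both steps are multiplications, the order does not change the result.

```python
from statistics import mean

# Hypothetical placeholders, not taken from any real device.
VO2_SLOPE = 0.35       # mL/kg/min of VO2 per bpm (assumed population average)
VO2_INTERCEPT = -20.0  # mL/kg/min (assumed)
FIXED_RQ = 0.85        # assumed fixed respiratory quotient


def smooth(hr_samples: list[float], window: int = 5) -> list[float]:
    """Trailing moving average to damp optical-sensor noise."""
    return [mean(hr_samples[max(0, i - window + 1): i + 1])
            for i in range(len(hr_samples))]


def hr_to_vo2(hr_bpm: float) -> float:
    """Linear regression from heart rate to relative VO2 (mL/kg/min)."""
    return max(0.0, VO2_SLOPE * hr_bpm + VO2_INTERCEPT)


def kcal_per_litre_o2(rq: float) -> float:
    """Abbreviated Weir equation, expressed per litre of O2 consumed."""
    return 3.941 + 1.106 * rq


def estimate_kcal(hr_samples: list[float], weight_kg: float,
                  sample_interval_s: float = 1.0) -> float:
    """Sum the per-sample energy estimates over a session."""
    total = 0.0
    for hr in smooth(hr_samples):
        vo2_l_per_min = hr_to_vo2(hr) * weight_kg / 1000.0  # body-weight scaling
        kcal_per_min = vo2_l_per_min * kcal_per_litre_o2(FIXED_RQ)
        total += kcal_per_min * sample_interval_s / 60.0
    return total


# Hypothetical session: 30 minutes at a steady 140 bpm, 70 kg user, 1 Hz samples.
print(round(estimate_kcal([140.0] * 1800, 70.0)))
```

Under these assumed coefficients the session comes out at roughly 297 kcal. Every constant at the top of the sketch is a population-level guess, which is where the individual error enters.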
Each step compounds error. The biggest contributors:
- The VO₂-from-heart-rate regression assumes a population-average relationship. Individual variation is substantial. The same heart rate can correspond to dramatically different actual oxygen uptake across people.
- The respiratory quotient assumption uses a fixed value. In reality RQ varies with what fuel the body is burning (carbohydrate vs. fat) and shifts with intensity.
- The body-weight scaling doesn’t account for body composition. A 70 kg person at 12% body fat and a 70 kg person at 28% body fat have very different actual energy expenditure for the same activity.
- Resistance training estimates are particularly poor because heart rate elevation underestimates energy expenditure during anaerobic work.
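A rough way to see how errors compound: when each stage of the chain acts as a multiplicative factor on the final number, the per-stage relative errors multiply rather than add. The sketch below assumes that multiplicative model, and the stage errors are hypothetical values chosen only to illustrate the arithmetic.

```python
from math import prod


def compounded_error(stage_errors: list[float]) -> float:
    """Combine signed per-stage relative errors (0.10 means +10%) multiplicatively."""
    return prod(1.0 + e for e in stage_errors) - 1.0


# Hypothetical stage errors: HR reading, VO2 regression, RQ assumption, weight scaling.
stages = [0.05, 0.20, -0.03, 0.10]
print(f"{compounded_error(stages):.1%}")  # 34.4%
```

Signed errors can partly cancel, as the negative entry shows, so the combined figure depends on direction as well as size.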
What wearables actually measure well
The same studies show wearables are accurate for several metrics that matter more for training than calorie counts:
- Heart rate at steady-state exertion: within 3–5% of chest strap for most devices, most of the time.
- Step count: typically within 5% of actual.
- Sleep duration: reasonably accurate, typically within ~15 minutes. Sleep stage detection is less reliable.
- Resting heart rate trends over time: highly reliable. Changes in resting HR are a strong recovery and overtraining signal.
- HRV (heart rate variability) trends: useful when interpreted as relative change, not absolute value.
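Reading HRV as relative change means comparing each night to your own baseline rather than to a published norm. A minimal sketch, assuming the baseline is the plain mean of recent overnight readings; the readings are hypothetical and no particular decision threshold is implied.

```python
from statistics import mean


def relative_change_pct(history: list[float], today: float) -> float:
    """Percent change of today's reading versus the mean of the personal history."""
    if not history:
        raise ValueError("history must contain at least one reading")
    baseline = mean(history)
    return (today - baseline) / baseline * 100.0


# Hypothetical overnight HRV readings (ms) for the past week, then last night's value.
last_week = [60, 62, 58, 60, 61, 59, 60]
print(round(relative_change_pct(last_week, 51), 1))  # -15.0
```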
What metrics to actually trust
The practical hierarchy for trusting wearable data:
- Trend, not absolute number. Your resting HR trending up over a week signals real change; your calorie count yesterday doesn’t.
- HR and HRV during sleep. The night-time data is the cleanest signal because motion artifacts are minimized.
- Steps and active minutes. The simple metrics are the most accurate.
- Heart rate during steady-state activities. Walking, jogging, and cycling at a moderate pace are the wearable’s sweet spot.
- Calorie expenditure: don’t trust the number. Use it as a relative comparison day-to-day, not as an absolute budget.
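One simple way to turn "trend, not absolute number" into something measurable is a least-squares slope over a week of daily readings. This is a sketch under the assumption of one reading per day at evenly spaced intervals; the resting HR values are hypothetical.

```python
def trend_per_day(values: list[float]) -> float:
    """Least-squares slope of evenly spaced daily values, in units per day."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two days of data")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Hypothetical resting HR (bpm) over seven mornings.
resting_hr = [52, 53, 53, 54, 55, 55, 56]
print(round(trend_per_day(resting_hr), 2))  # 0.64
```

A sustained positive slope across the week carries more information than any single morning's value.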
Implications for training decisions
The practical adjustments if you’ve been using wearable calorie data to guide eating or training:
- For weight management: stop trying to match calorie intake to wearable calorie output. The output number is likely off by 27–93%. A better approach is to weigh yourself weekly and adjust calorie intake by ~200 kcal/day based on the trend. This is slower but more accurate.
- For training intensity: use heart rate zones, not calorie burn, to gauge effort. The HR data is reliable; the calorie translation is not.
- For recovery: use resting HR trend and HRV trend, not the wearable’s “readiness” or “strain” scores. The component metrics are more reliable than the composite scores built on top.
- For sleep: trust the total duration metric, treat the stage detection as approximate, and trust your subjective feeling more than the device’s sleep score.
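The weekly weigh-in approach can be written as a small feedback rule. This is a sketch, not a prescription: the 200 kcal step comes from the text, while the target rate, the 0.1 kg/week tolerance band, and the example weights are assumptions made for illustration.

```python
def adjust_intake(current_kcal: float, weekly_weights_kg: list[float],
                  target_change_kg_per_week: float,
                  step_kcal: float = 200.0, tolerance_kg: float = 0.1) -> float:
    """Nudge daily intake by one step based on the observed weekly weight trend."""
    if len(weekly_weights_kg) < 2:
        return current_kcal  # not enough data to see a trend
    weeks = len(weekly_weights_kg) - 1
    observed = (weekly_weights_kg[-1] - weekly_weights_kg[0]) / weeks
    if observed > target_change_kg_per_week + tolerance_kg:
        return current_kcal - step_kcal  # weight is above the planned trajectory
    if observed < target_change_kg_per_week - tolerance_kg:
        return current_kcal + step_kcal  # weight is dropping faster than planned
    return current_kcal


# Hypothetical: aiming for -0.3 kg/week, but weight has been flat for three weeks.
print(adjust_intake(2400, [80.0, 79.9, 80.1, 80.0], -0.3))  # 2200.0
```

The rule never consults the wearable's calorie output; the scale trend is the only feedback signal.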
Relative device rankings from the published studies
From most accurate to least accurate for calorie estimation in the 2024 Stanford-led study:
- Apple Watch (Series 7+): best across activities, especially walking and running. Still 27–40% error.
- Garmin Forerunner / Fenix line: competitive on running, falls behind on cycling and strength.
- Whoop: closer than most on cycling and HR trends; worse on absolute calorie estimates.
- Fitbit Sense: middle of the pack across activities.
- Polar: strong on HR (Polar pioneered the technology) but the calorie translation is similar to others.
- Samsung Galaxy Watch: variable performance.
- Lower-cost devices: typically worst on calorie estimation; sometimes acceptable on steps and HR.
The honest summary
Wearables are valuable for trends, sleep, HR, and step counts. They’re not accurate enough to guide calorie-based eating decisions. The mainstream advice of “eat in a deficit relative to your wearable’s calorie output” is built on data that’s off by enough to undermine the decision.
The 5–10 year horizon for this category looks better. Heart rate sensing is already accurate, and the algorithm side is improving as devices add more sensors (skin temperature, blood oxygen, etc.). Future generations will likely close the gap. The current generation has not.
Practical takeaways
- Calorie estimates from wearables are off by 27–93% across activities. Don’t use them as absolute budgets.
- Heart rate and step count are accurate. Trust those.
- Trends, not snapshots. Resting HR going up over a week is real; yesterday’s number alone is not actionable.
- Sleep duration is reliable; sleep stages are approximate; composite “readiness” scores are noisy.
- For weight management: weekly weigh-in trend + 200 kcal adjustment beats matching calorie input to wearable output.
- For training: use HR zones, not calorie burn, to gauge effort.
- Apple Watch leads the consumer pack on accuracy but is still off by 27–40% even at best.
References
Additional sources reviewed for this article: Shcherbina 2017, Fuller 2020, Cvetkovic 2024, Stanford Mobile Health 2024.
- Shcherbina 2017: Shcherbina A et al. Accuracy in wrist-worn, sensor-based measurements of heart rate and energy expenditure in a diverse cohort. J Pers Med. 2017;7(2):3.
- Fuller 2020: Fuller D et al. Reliability and validity of commercially available wearable devices for measuring steps, energy expenditure, and heart rate: systematic review. JMIR Mhealth Uhealth. 2020;8(9):e18694.
- Cvetkovic 2024: Cvetkovic B et al. Comparison of wearable device accuracy for energy expenditure in real-world conditions. Sensors. 2024;24(3):891.
- Stanford Mobile Health 2024: Stanford Mobile Health Lab. Wearable accuracy validation replication study (2024).
- ACSM Guidelines: American College of Sports Medicine. Indirect calorimetry reference standards and validation methodology.


