Personalized Form Coaching

Introduction

In 2016, Under Armour embarked on a journey to understand how to best leverage data from our UA HOVR™ Connected running shoes to coach runners for optimal performance. This journey was motivated by an early discovery that runners at the 2016 Boston Marathon “bonked” differently. In other words: runners managed their fatigue differently, and apparently employed unique running form strategies.

This small investigation led to the exploration of millions of workouts logged within MapMyFitness and blossomed into a rigorous scientific study of running form at scale. Along the way, we improved the reliability and precision of run data and created novel methods to hone our investigation and focus our data exploration. Ultimately, we made several discoveries that would better inform runners about their current running form, and provide the guidance intended to help them achieve a more optimal running form.

Released in March 2018 to owners of UA HOVR Connected running shoes who use MapMyFitness (the MapMyRun, MapMyWalk, MapMyFitness, and MapMyHike apps), this data-driven running form coaching program has had a positive impact on runners globally[13]. To share the research that powered this feature, we’ve organized our discoveries into three narratives:

  1. Run Workout Data Exploration: A Data Scientist’s considerations for analyzing time series workout data
  2. Elevating Elevation: A Data Engineer’s considerations for reliable data pipelines and accurate and precise data
  3. Running Form and Strategy Insights: Definition and prevalence of a stable running form

Part 1 – Run Workout Data Exploration: A Data Scientist’s considerations for analyzing time series workout data

Background 

Running is an accessible form of exercise that enables a high percentage of runners to maintain their health[1]. Over 18.3 million Americans completed a road race in 2017, clearly indicating that running is popular[2]. While popular and accessible, at a mechanical level, running can be complicated. Running with suboptimal form negatively impacts the energy cost of running, a metric that may impact performance[3,4,5]. This matters because 1 in 4 runners cite competition as motivation to keep running[6]. Suboptimal form may also increase injury risk[7], which is notable given that 30–50% of all runners experience a running-related injury each year[8].

In short, running mechanics are shaped by a number of variables. The role and relationship of some variables have been well established by the scientific community, while others have not. Several controllable variables, such as experience, have been studied extensively[5,7,8,9,10]. Environmental variables, such as elevation, have also been studied[11,12], but their impact on running mechanics is less clear. Meanwhile, physical characteristics such as height and weight presumably have some impact on running mechanics, but again it is not entirely clear how. Significant exploration is still needed to understand what makes humans run the way they run.

Given that running is popular and accessible, and that runners are motivated to run to get faster and maintain their health, it is safe to assume that a significant number of people would benefit from a solution designed to help them optimize their running form. To be effective, this system would have to appropriately consider variables intrinsic to the runner, like physical characteristics, as well as understand and incorporate environmental variables, such as elevation changes.

Data Overview & Key Assumptions 

Millions of runners log and track runs with MapMyFitness each year. These runs (both individual training sessions and group races) are logged by novice to experienced runners, and each run can be matched with anonymized, user-reported biometric data. We take this data from our mobile applications, save detailed timestamped versions to our servers, and then aggregate and enhance this data to provide insights to each individual runner about their performance during that workout, via a quick post-workout tip in our mobile apps.

To deliver meaningful insights about running form, we set out to fill in gaps in the current academic literature on running. Armed with in-house running expertise, data science, software engineering, and data analytics, we determined that running pace, cadence, and stride length are each significant indicators of a runner’s form and something we’re calling running form strategy.

We define a running form strategy as the changes a runner already makes in cadence and stride length to deal with changes along their path, such as elevation. For these investigations, we focused on active runners—those who logged a run workout at least once a week for 8 consecutive weeks—and run workouts that we consider a successful performance—on race day, when a runner’s pace is faster than their average training pace leading up to that day. For these workouts, we captured latitude and longitude, speed, cadence, and stride length time series data, which was either collected directly from our UA HOVR Connected running shoes, or computed from the runner’s height, cadence and speed. We also looked at elevation-intensive workouts, and interval vs. steady-state workouts.
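
The active-runner criterion above (at least one run per week for 8 consecutive weeks) can be sketched as follows. This is a minimal illustration under stated assumptions, not the production filter: the function name `is_active_runner` is hypothetical, and treating a "week" as a Monday-to-Sunday calendar week is an assumption the text does not specify.

```python
from datetime import date, timedelta

def is_active_runner(workout_dates, min_weeks=8):
    """Return True if at least one run was logged in each of `min_weeks`
    consecutive weeks (weeks start on Monday; an assumption)."""
    # Collapse each workout date to the Monday that starts its week.
    week_starts = sorted({d - timedelta(days=d.weekday()) for d in workout_dates})
    streak = 0
    previous = None
    for week in week_starts:
        if previous is not None and week - previous == timedelta(weeks=1):
            streak += 1          # this week directly follows the last one
        else:
            streak = 1           # first week, or a gap: restart the count
        if streak >= min_weeks:
            return True
        previous = week
    return False
```

A runner with a single missed week in the middle of an otherwise regular schedule would not qualify under this sketch, because the streak restarts after any gap.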

Data Exploration: Inconsistent Cadence to Stride Length Ratio

Early in our explorations, the team reviewed the relationship between cadence, stride length and running performance over a multi-week training program. When specifically exploring the ratio of cadence to stride length for runners who trained for and completed a marathon or half-marathon race, a trend was discovered. Runners who maintained a consistent cadence-to-stride-length ratio performed at or above their average training pace. Conversely, runners with high levels of variability in their cadence-to-stride-length ratio performed worse on race day. 
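
To make "consistent" versus "variable" concrete, one possible measure is the coefficient of variation of the cadence-to-stride-length ratio across a runner's samples. The sketch below is illustrative only: the text does not say which variability statistic was used, so the coefficient of variation is an assumption, as are the function names. The helper `stride_length_m` uses the kinematic identity speed = step rate × step length and ignores height, which the original computation also took into account.

```python
from statistics import mean, pstdev

def stride_length_m(speed_m_per_s, cadence_steps_per_min):
    """Distance covered per step: speed divided by steps per second."""
    return speed_m_per_s / (cadence_steps_per_min / 60.0)

def ratio_variability(cadences, stride_lengths):
    """Coefficient of variation (std / mean) of the cadence-to-stride-length
    ratio. Lower values mean a more consistent running form strategy."""
    ratios = [c / s for c, s in zip(cadences, stride_lengths) if s > 0]
    return pstdev(ratios) / mean(ratios)

# A steady runner versus one whose form drifts from sample to sample.
steady = ratio_variability([170, 170, 170], [1.0, 1.0, 1.0])
erratic = ratio_variability([150, 170, 190], [1.3, 1.0, 0.8])
```

Because the measure is normalized by the mean, it allows runners with different typical cadences and stride lengths to be compared on the same scale.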

When looking at cadence, stride length, and pace values among runners, we found a direct correlation between these metrics and a number of physiological characteristics: gender, age, height, and weight. For example, when holding all other variables constant, age alone impacted the typical cadence employed by runners. Next, we investigated whether these values and relationships were independent of experience (number of workouts logged) and work load (frequency, duration, and distance of workouts).

Those marathon and half-marathon runners provided a good set of high work load training plans. To be included in our data set, runners must have completed at least two workouts per week for 15 weeks prior to their race (typical marathon training spans 14 to 20 weeks). We isolated races by looking for workouts that matched the goal race distance ±5%, and that were completed within ±3 days of the end of their training plan. Runners were then grouped based on the average running pace of their training runs.
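
The race-isolation rule described above reduces to two tolerance checks. The following is a small sketch of that rule; the function and parameter names are hypothetical, and distances are assumed to be in kilometers.

```python
from datetime import date

def is_goal_race(distance_km, workout_date, goal_km, plan_end,
                 distance_tol=0.05, day_tol=3):
    """True if a workout matches the goal distance within ±5% and falls
    within ±3 days of the training plan's end date."""
    close_distance = abs(distance_km - goal_km) <= distance_tol * goal_km
    close_date = abs((workout_date - plan_end).days) <= day_tol
    return close_distance and close_date

MARATHON_KM = 42.195
plan_end = date(2017, 4, 17)
on_target = is_goal_race(42.6, date(2017, 4, 17), MARATHON_KM, plan_end)
long_run = is_goal_race(32.0, date(2017, 4, 17), MARATHON_KM, plan_end)
```

The distance tolerance absorbs GPS drift and runners who start or stop recording slightly off the line, while the date tolerance allows for plans whose end date does not fall exactly on race day.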

In this exploration, we noticed that runners who trained at a relatively slower average pace also employed a running form strategy that resulted in a more erratic cadence-to-stride-length ratio. It was especially erratic among runners with paces slower than 10 minutes per mile. Conversely, we noticed that relatively faster runners displayed two behaviors: first, their cadence-to-stride-length ratio was more stable; and second, they utilized a running form strategy with a comparatively higher cadence-to-stride-length ratio near the end of their training program.

There was one other interesting finding from our analysis: those runners who employed a more consistent cadence-to-stride-length ratio were more likely to complete their race at a pace that was faster than their average training pace. This was not the case for runners who employed a less consistent cadence-to-stride-length ratio.

So, although runners who maintain a consistent cadence-to-stride-length ratio are more likely to be experienced runners and have good race performances, the converse may also hold: choosing to employ a consistent cadence-to-stride-length ratio may itself lead to better performance. We thought that if we could help a runner intentionally modify their running form strategy, we could help them have a better running experience. But we needed to explore this hypothesis on a bigger data set.

Method Discussion: Race Identification

It was advantageous to study runners following a MapMyFitness training plan for two reasons: first, we were confident runners were training for a race; and second, we were able to easily identify race day. We were then able to easily grade a performance (via race day) and quantify a training load for a period of time when a runner was training with a higher work load. However, we wanted a larger sample size for this analysis, to go beyond those intentional marathon trainers, so we expanded our search. We believed if we could identify races “in the wild” it would enable us to look back at the period of training preceding that race to further investigate work load, and the associated running form strategies each runner employed.

Organized race events, such as the Boston Marathon, are great for identifying a bunch of people running an identical route within a relatively small timeframe. Races are tricky to identify, though, and comprise less than 1% of our logged workouts—given 1 million workouts, we’d expect to find a few thousand organized races logged. So we set about finding those needles in our giant haystack.

Each run workout includes a bunch of data, but particularly relevant were these: date, start time of workout, starting latitude, starting longitude, distance traveled, and some identifier for the runner. Intuitively, we assumed that organized races start in a similar location at a similar time, and that they draw a large number of participants (see e.g., USPN 10068004).

Based on this intuition, we proposed that race events could be modeled as connected graphs of closely related data points, which would function as clusters, and could then be used to identify a race. It is important to note that all of this happens without specifically targeting a particular race—instead, we did it in reverse, by identifying a race from the connected graph data, and then determining which race it was by using the date and location. We made sure to identify the official organized race, so we could ensure the race distance was certified (e.g. by USATF).

Because we were only concerned with data points that are closely related, we investigated using a density-based local clustering algorithm. When applied to our data, this algorithm matched our intuition of what a race looks like: a connected graph of points. If our assumptions and intuition were correct, then our problem was reduced to simply identifying an appropriate radius and a minimum number of points.
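
To illustrate the idea, here is a compact, standard-library sketch of density-based clustering in the style of DBSCAN, applied to workout start points. It is not the algorithm used in this study. The function names, the tuple layout `(lat, lon, start_minute)`, and the default radius, time window, and minimum point count are all assumptions chosen for the example.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two points given as (lat, lon, ...)."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))

def cluster_starts(workouts, radius_km=0.5, max_minutes=15, min_points=4):
    """DBSCAN-style clustering of (lat, lon, start_minute) tuples.
    Returns one label per workout: a cluster index, or -1 for noise."""
    def neighbors(i):
        # Points that started close by in both space and time (includes i).
        return [j for j, w in enumerate(workouts)
                if haversine_km(workouts[i], w) <= radius_km
                and abs(workouts[i][2] - w[2]) <= max_minutes]

    labels = [None] * len(workouts)
    cluster = -1
    for i in range(len(workouts)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = -1            # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_points:
                queue.extend(more)    # j is a core point: keep expanding
    return labels

# Four runners starting together, plus two unrelated workouts.
race = [(28.3700, -81.5500, 330), (28.3702, -81.5501, 331),
        (28.3699, -81.5498, 329), (28.3701, -81.5502, 332)]
solo = [(40.7128, -74.0060, 330), (28.3700, -81.5500, 900)]
labels = cluster_starts(race + solo)
```

The neighbor search here is quadratic in the number of workouts, which is acceptable for a small sample but would need a spatial index at the scale of millions of workouts.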

Figure: Density-based clustering illustration

We started with a random sample of 11,000 workouts from 2016, and the clustering algorithm detected 27 clusters (races) within these 11,000 workouts.

In the example above, the clustering algorithm found four runners who ran a similar distance (a marathon) at the same time and place. Manually checking the outputs revealed what we expected: Hello, Disney!

Figure: Location of first cluster – Disney Marathon

Each of the 27 clusters we identified corresponded to an organized race event, including the Disney Marathon, the Los Angeles Marathon, the 3M Half Marathon in Austin, Texas, and the Vitality Half Marathon in Brighton, England.

At this point, we felt relatively confident in identifying race events and using them as anchors to analyze runners’ training preceding each race. (Patent pending.)

Exploration: Elevation Effect 

We also knew that running up or down a slope can impact a runner’s typical form[9], so the team wanted to quantify that effect in order to build a robust form coaching experience for our runners. Elevation changes are an obvious point at which a runner would change their running form, so we started by looking at data from devices that provide time series elevation data, usually from a barometer.

Along with pace, cadence, and stride length, we identified workouts with that elevation time series data and selected a representative group of active runners with a range of ability and experience levels. Then we visually inspected workouts with elevation changes on graphs, and compared them to changes in the other variables. We also quickly realized that we needed to ignore interval workouts and focus on steady-state run workouts, but more on the sorting of interval workouts later.

Figure: Plot of stride cadence compared to elevation and gradient—elevation points are colored by severity of the gradient.
Figure: Plot of stride length compared to elevation and gradient—elevation points are colored by binary scheme: uphill (red) and downhill (blue)

While exploring the elevation data, we discovered that cross-cutting population patterns did not emerge across runners when comparing flat (±2% gradient), uphill, and downhill portions of running courses. Instead, it looked as if runners employ their own individual strategy when running over hills. So we decided to compare runners only to themselves, using their running form on a flat course as their baseline.

Next, we considered each runner’s average pace, cadence, and stride length across bands of gradient, both uphill and downhill: 2–4%, 4–6%, 6–8%, 8–10%, and 10%+. This revealed an interesting pattern: around ±4–6% gradient is where most runners change their pace and cadence, and in both uphill and downhill scenarios, they slow down. We were unable to find any similar pattern for stride length. More significant changes in running form occur above 6% gradient, but the amount of change is not uniform across runners, so we were unable to pull out any general trends.
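
The per-runner comparison described above can be sketched as a binning step followed by a difference against that runner's own flat-ground baseline. This is an illustrative sketch only: the function names, the bin labels, and the handling of boundary values (a gradient of exactly 4% falls in the 2–4% band) are assumptions.

```python
from collections import defaultdict
from statistics import mean

BIN_EDGES = [2, 4, 6, 8, 10]  # percent gradient bands, as listed in the text

def gradient_bin(gradient_pct):
    """Label a gradient as 'flat', or by direction and band, e.g. 'up 4-6%'."""
    size = abs(gradient_pct)
    if size <= 2:
        return "flat"
    direction = "up" if gradient_pct > 0 else "down"
    for low, high in zip(BIN_EDGES, BIN_EDGES[1:]):
        if size <= high:
            return f"{direction} {low}-{high}%"
    return f"{direction} 10%+"

def cadence_change_vs_flat(samples):
    """Given (gradient_pct, cadence) samples for ONE runner, return the mean
    cadence in each non-flat bin minus that runner's own flat-ground mean."""
    by_bin = defaultdict(list)
    for gradient_pct, cadence in samples:
        by_bin[gradient_bin(gradient_pct)].append(cadence)
    baseline = mean(by_bin["flat"])   # raises StatisticsError if no flat samples
    return {name: mean(values) - baseline
            for name, values in by_bin.items() if name != "flat"}

delta = cadence_change_vs_flat([(0, 170), (1, 172), (5, 165), (-5, 168)])
```

The same structure applies to pace and stride length: substitute the metric of interest for cadence and compare each band against the runner's flat baseline rather than against a population average.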

The key result from this data exploration was learning that running form is personal and the strategy employed by runners is uniquely adjusted to their physical attributes, athletic ability, and running experience level.

In summary, during Part 1, we made substantial progress towards establishing a personalized form-coaching feature, created useful tools and methods, and discovered a few key results.

  1. We performed the foundational work necessary to rapidly study time-series fitness data.
  2. We discovered that running form consistency may be a signature of ability, and glimpsed the first signs of a personalized, form-coaching feature.
  3. We created an algorithm to identify race events outside of the context of a MapMyFitness training plan—an important breakthrough in scaling our research.
  4. We discovered that individuals adjust their running form when they encounter uphill and downhill grades over 4–6%. However, the way runners changed their form was idiosyncratic, not systematic at a population level.

In Part 2 of this series we’ll explain how we created a proprietary software data service to enable this study, and how we attempted to improve the quality of our elevation service to further our quest of creating a better form coaching system.

Footnotes

[1] – Running USA. 2015 National Runner Survey. 2015.
[2] – Running USA. Annual Report. 2017.
[3] – Cavanagh and Williams. The effect of stride length variation on oxygen uptake during distance running. Med Sci Sports Exerc. 1982.
[4] – Hunter and Smith. Preferred and optimal stride frequency, stiffness and economy: changes with fatigue during a 1-h high-intensity run. Eur J Appl Physiol. 2007.
[5] – Folland, et al. Running Technique is an Important Component of Running Economy and Performance. Med Sci Sports Exerc. 2017.
[6] – Cogsdill. National Run Survey For Runners (Internal Survey). January 2016.
[7] – Edwards, et al. Effects of Stride Length and Running Mileage on a Probabilistic Stress Fracture Model. Med Sci Sports Exerc. 2010.
[8] – Nielsen, et al. A prospective study on time to recovery in 254 injured novice runners. PLoS One. 2014.
[9] – Warr, et al. Characterization of Foot-Strike Patterns: Lack of an Association With Injuries or Performance in Soldiers. Mil Med. 2015.
[10] – Crowell and Davis. Gait Retraining to Reduce Lower Extremity Loading in Runners. Clin Biomech. 2011.
[11] – Padulo, et al. A Paradigm of Uphill Running. PLoS One. 2013.
[12] – Snyder and Farley. Energetically optimal stride frequency in running: the effects of incline and decline. Journal of Experimental Biology. 2011.
[13] – Customer reviews, emails, and other interactions.
