Statistics in Coaching: A Conversation

I had an interesting conversation with a fellow coach the other day about the basics of using statistics to improve coaching.  I thought others might find the discussion valuable, so with the generous permission of Coach K, I’ve transcribed the conversation, editing only slightly for clarity. [Content in square brackets added for further clarification]

Note: This is not a rigorous treatment of statistics.  If you want that, go check out the Khan Academy Statistics and Probability course or the Crash Course Statistics series of videos.  Feel free to let me know if I made a glaring mistake, but I play fast and loose here, so consider yourself warned.


<K> You're good at stats, right? I would love to see someone write an article putting this presentation into plain English. I don't understand stats well enough to do it.

[You don’t have to read the linked presentation, but it does provide some context.  The abstract may be enough for most.]

<P> I'll take a look….

The short translation is that single data points are very limited in the information they provide. Specific to performance testing (e.g. FTP or Wingate) we don't know if a change of + or - X Watts is real change or just testing variance [AKA noise]. The author is going through mathematical methods of providing more context for determining the information provided by performance test results and provides fast and loose guidelines for coaches.

<K> Some of the terminology is confusing to me. Stats terms like typical error, why do you divide SD by sqrt2, what are Pearson and intraclass [correlations], etc. The first few slides and last few slides are pretty easy for me, but I always get a nagging feeling I'm missing something in the middle.

<P> Ah. Some of that’s just the nitpicking of stats. Most statistical tests require a few specific assumptions about the features of the data. Some of those terms and techniques are measures of how much the data fit those assumptions or tests for that.
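[Since K asked about the sqrt2: "typical error" is usually computed from test-retest data. Each test carries its own independent error, so the difference between two tests spreads out by a factor of sqrt(2); dividing the SD of the differences by sqrt(2) recovers the per-test error. A minimal sketch, using made-up FTP numbers:]

```python
import statistics
from math import sqrt

# Test-retest FTP results (watts) for six athletes -- made-up numbers.
test1 = [250, 265, 281, 240, 259, 272]
test2 = [255, 262, 285, 236, 264, 270]

# Each test carries an independent error of size TE, so the difference of
# two tests has a spread of TE * sqrt(2). Dividing the SD of the
# differences by sqrt(2) recovers the per-test error.
diffs = [b - a for a, b in zip(test1, test2)]
typical_error = statistics.stdev(diffs) / sqrt(2)
print(f"Typical error: {typical_error:.1f} W")  # about 3.0 W here
```

[With a typical error of ~3W, a change of 2W between two FTP tests tells you nothing; a change of 10W starts to look real.]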

<K> Ohhh I gotcha…

So like if the bell curve has a skew or is bimodal it might not fit some tests, or if its SD is too wide you can't be certain something had an effect?

<P> Correct. Terms like Kurtosis, Skewness, and Modality are all characteristics of data. [Similar to height, weight, or eye color being characteristics of individual humans.]  Different statistical tests are relatively more valid if the characteristics of the data closely fit the test assumptions.

Like chi-squared tests are only valid on data that is categorical (yes/no or a/b/c/etc. choice responses). If you tried to use a chi-squared test on continuous data (e.g. wattage or time), the results would be meaningless because the assumptions were violated.

But since stats are not perfect, we have a bunch of tests to approximate how closely the data fit the assumptions, and thus how much we can argue any particular interpretation of the results of those statistical tests (and the data as a whole).

<K> Ohhhh I gotcha.

<P> Also, as a more general note: most of the statistical tests we use today were designed for group-level analysis, not individual. Applying them to individuals immediately moves us away from those base assumptions.  A worthwhile listen is “The End of Average” by Todd Rose.

<K> You mean like t-tests?

<P> Yes. It still provides some information, but don’t let it pull you away from what you’re actually witnessing in the athletes. In other words, Stats are guidelines. Not rules.

<K> Seems like being well versed in stats would be super helpful then. It's only been 7 years since I took stats, how hard could it be 😂

<P> I come from a biased position of knowing maths like the back of my hand, but I’d say it’s marginal gains at best. [Sometimes arrogant? (✓) Kinda a math nerd? (✓✓)]

Know this: data is noisy.  Small changes in individual data are likely all noise. Big changes are either testing failure or real change. Once you eliminate likely test failure modes, you can confidently argue moderate to large observed changes are probably real.

<K> Is there a practical way to look at variation in an athlete's testing data then? Besides power meter (PM) accuracy.

<P> Yes. Stats IS practical. It’s just not written in stone law.

<K> Okay, let's talk sprint data…

Peak power of any one ride for 1 second... 1% error on 1500W is 15W.

So what we're saying is I need to see more than 1515w to be certain it's a positive change? But... is it better to look at average through a week? I guess I'm trying to apply what this presentation is saying over the course of weeks or months.

<P> Two things:

First, the 1% guideline is above and beyond any PM accuracy limits. So if the PM has ±5% accuracy and the guideline says you need a minimum 1% change to get beyond noise, that's ±75W for the PM variance plus ±1% on top of that range [±6% total, or ±90W on 1500W]. So you need to see over 1590W or under 1410W.

Second: Averaging over time would provide more precision but also introduce noise due to variations in fatigue/nutrition/etc. Comparing to history is better most of the time IF you can increase the likelihood of repeating the testing conditions.  At least that way you can suggest that any noise introduced is natural and random. [In contrast to non-random noise, which would bias observed differences]

<K> Ahhhhhh…

5% is a good guideline? Even for PMs rated at 1% such as SRM and Power2Max?

<P> I just picked 5% as an example.  A 1% PM accuracy would result in a range of over 1530W or under 1470W.

<K> Where's the extra 1% from?

<P> For the fast and loose purposes we’re discussing, PM accuracy stacks with whatever guideline percentage the paper gives. Thus, 1% from the PM plus 1% from the noise suggestion of the paper (or 5%, or whatever).
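[The fast-and-loose additive stacking P describes can be sketched in a few lines. The function name and the additive approach are just this conversation's guideline, not a formal method:]

```python
def change_thresholds(baseline_w, pm_accuracy_pct, noise_pct):
    """Return (lower, upper) watt thresholds a new result must cross
    before we'd call it a real change, stacking PM accuracy and the
    noise guideline additively (the fast-and-loose approach above)."""
    total_pct = pm_accuracy_pct + noise_pct
    lower = round(baseline_w * (1 - total_pct / 100), 1)
    upper = round(baseline_w * (1 + total_pct / 100), 1)
    return lower, upper

# 5% PM accuracy + 1% noise guideline on a 1500W sprint:
print(change_thresholds(1500, 5, 1))  # → (1410.0, 1590.0)
# 1% PM (e.g. SRM / Power2Max) + 1% noise guideline:
print(change_thresholds(1500, 1, 1))  # → (1470.0, 1530.0)
```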

<K> Ohhh I gotcha…

So even within the noise if I'm looking for sprint to sprint variability… let's say I do 5 sprints, each has a peak power of 1870-1930w, it's impossible to know the actual variability within the PM accuracy? For all intents and purposes it was the same each time?

<P> Yes…

You may argue that the highest and lowest values are different.  But that would be splitting hairs and have a high chance of being wrong.  If you said they were different and in reality the observed difference was just due to noise, you would have committed what's called a Type 1 error in stats.  If you ever come across a P value in a paper (e.g. p<.05), that is, loosely speaking, the probability (in decimal notation) of committing a Type 1 error in reporting a significant difference between observed values.

<K> So type 1 error is a risk of a false positive vs type 2 for false negative?

<P> Correct.
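[A quick simulation makes the Type 1 risk concrete. The numbers are made up: an athlete whose true sprint power never changes, test-to-test noise of 20W, and the naive 15W (1% of 1500W) cutoff from earlier in the conversation:]

```python
import random

random.seed(42)

TRUE_POWER = 1900   # the athlete's actual, unchanged sprint power (W)
NOISE_SD = 20       # combined test-to-test noise (W) -- made-up value
THRESHOLD = 15      # the naive "1% of 1500W" cutoff discussed above

trials = 10_000
false_positives = 0
for _ in range(trials):
    before = random.gauss(TRUE_POWER, NOISE_SD)
    after = random.gauss(TRUE_POWER, NOISE_SD)
    # Nothing actually changed, so calling any difference beyond the
    # cutoff a "real change" is a Type 1 error.
    if abs(after - before) > THRESHOLD:
        false_positives += 1

print(f"Type 1 error rate: {false_positives / trials:.0%}")
```

[With these made-up numbers, roughly six calls in ten would be false positives, which is why small differences relative to the noise shouldn't be trusted.]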

<K> Looking at something more accurate though, I could gain better insight in effort-to-effort variability by looking at variation in something electronically timed over a set distance?

<P> Yes and no.  Yes in a perfect world.  No in our world where there are infinite tiny sources of noise like friction, nutrition, fatigue, and wind.

<K> Type 1 vs Type 2 error. In most setups of measurement which is better to risk having?

<P> Oh that's tough. [The scientist in me disagrees with the coach in me here, but this is for coaching]

For age groupers, you would rather have Type 2 [False Negative: Stats say no effect, reality is some effect].  Get them doing the things that have BIG effects and get them to respond less to the little differences that waste time and energy.

For world class level... I'd probably say Type 1 [False Positive: Stats say some effect, reality is no effect].  And for a weird reason that I’ll probably get feedback for. When you're competing at that level, TINY performance differences can be the difference between podium and ‘also ran’.  And the placebo effect is well known to produce that level of difference in performance.

So I'd rather* be telling athletes that what they're doing is working, to induce placebo, than fill their environment with failure-related discussion (*ONLY IF the data is in the range of higher chance of Type 1 error).  Even then, depending on the context, I'd probably just tell them that whatever that thing is they're doing is a waste of effort because the numbers aren't moving. I’d prefer to be honest and direct with an athlete because I don’t want to waste their time.

Keeping perspective about all of this, at the end of the day, talking stats to an athlete is not a very value-adding activity**.  So do it for yourself, but have clear and concise information for them. Keep it to the point so they can specialize.

<K> Lots of stuff to chew on here.  Thanks!

<P> Thanks for asking!


**Post Note: At the end of the day an athlete needs to know if they should do something or not.  If it will move them toward their valued outcomes, or away. They don’t need to know the minutiae of how you, the coach, came to that decision.  Your value-adding service is taking quantitative and qualitative data about the athlete and converting it into a go/no-go decision that they can act on.

Was this useful? Confusing?  What questions did it bring up? How do you use statistics in your coaching business?  Post below, contact me directly, or if you know someone else who might benefit from this, share it.