Debunking Exponent’s Methodology in the Wells Report

Let’s set aside the smaller hand-waving techniques, like deciding to ignore one of the measurements for the Colt balls because it “looked strange,” or the possibility that there is compelling evidence in the report that Walt Anderson did indeed use the Logo gauge in his pre-game measurements. Instead, let’s just focus on the two major methods Exponent used to reach its faulty conclusion in Deflate Gate’s Wells Report.

  1. The “p-value” without controlling for time
  2. The use of a “visual proof” that is biased toward dry balls measured later in the locker-room period

To show exactly what’s so silly (and biased) about this methodology, imagine the following thought experiment:

  • All footballs, for whatever reason, are slightly below where we’d expect them at halftime, about 0.4 PSI below expected
  • The Patriots balls are dry instead of wet
  • The Patriots balls are measured second instead of first
  • The Colts balls are wet instead of dry
  • Finally, imagine that the Patriots deliberately released 0.15-0.3 PSI from some of their footballs after referee Walt Anderson approved them

Doing that produces a table with the following hypothetical halftime measurements:

Fake Exponent Table

As you can see, the Patriot balls have a much higher halftime average measurement because they were dry, were inflated to 13.0 PSI in the pre-game, and were measured after the Colt balls, allowing them more time to recalibrate toward pre-game PSI levels. Now let’s use Exponent’s two main methodologies to see if we can detect the tampering that we’ve built into the hypothetical.

Exponent Method #1: p-value independent of time

First, let’s run a statistical test (an independent t-test) to see how likely it is that these two groups of footballs come from the same (untampered) population. If we do what Exponent did, which is ignore time of measurement and compare the pre-game level of the Patriot balls (13.0 PSI in our hypothetical) and Colt balls (12.5 PSI) to where they measured at halftime, we get a p-value of 0.0097. That’s almost exactly what Exponent came up with in the real-life scenario for the Patriots.

Only in our thought experiment, it’s the Colts who are flagged as the likely tamperers. Exponent’s method literally picks out the wrong team because it ignores major confounding variables like time. Pause for a second and appreciate how bad this is: I’ve created a scenario where the Patriots cheat, and Exponent’s methodology points to the Colts…because Exponent’s methodology is biased in favor of any team that was measured second in the locker-room period (and also had dry footballs).

In our hypothetical, the Colt balls were measured first, and thus had less time to recalibrate. Furthermore, the Colt balls were treated as the wet group in this scenario, so they drop further in PSI compared to the dry Patriot balls (the opposite of the real-life AFC title game). So, using Exponent’s nonsensical test, it’s fairly easy to demonstrate “statistical significance” that one team tampered with the balls…even though it was not the team that actually tampered. It was just the team that had its wet footballs measured first.
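To see the confound in miniature, here is a rough Python sketch of the thought experiment. It is not Exponent’s model and not the data behind the table above: the warm-up curve is a simple exponential with a made-up 10-minute time constant (`TAU_MIN`), the wet-ball penalty (`WET_OFFSET`) is a made-up constant, the measurement times are invented, and the readings are noise-free, so it compares averages only instead of running a t-test. All names are hypothetical.

```python
import math
from statistics import mean

ATM = 14.636        # psi, atmospheric pressure quoted in the experiment parameters
TAU_MIN = 10.0      # ASSUMPTION: warm-up time constant in minutes (not from the report)
WET_OFFSET = 0.3    # ASSUMPTION: extra drop for a wet ball, in psi (not from the report)

def cold_psig(pregame_psig, warm_f=71.0, cold_f=48.0):
    """Gauge pressure after cooling at constant volume (Ideal Gas Law)."""
    ratio = (cold_f + 459.67) / (warm_f + 459.67)
    return (pregame_psig + ATM) * ratio - ATM

def expected_psig(t_min, pregame_psig, wet):
    """Expected reading t_min minutes after the ball comes indoors (toy curve)."""
    cold = cold_psig(pregame_psig)
    warmed = cold + (pregame_psig - cold) * (1 - math.exp(-t_min / TAU_MIN))
    return warmed - (WET_OFFSET if wet else 0.0)

# Hypothetical clocks: wet Colt balls gauged first, dry Patriot balls gauged later.
colt_times = [2.0 + i * 25 / 60 for i in range(4)]
pat_times = [8.0 + i * 25 / 60 for i in range(11)]

# Every ball sits 0.4 psi under expectation; seven Patriot balls lose 0.2 psi more.
colts = [expected_psig(t, 12.5, wet=True) - 0.4 for t in colt_times]
pats = [expected_psig(t, 13.0, wet=False) - 0.4 - (0.2 if i < 7 else 0.0)
        for i, t in enumerate(pat_times)]

# Time-blind comparison: how far each group fell from its pre-game level.
naive_colt_drop = mean(12.5 - p for p in colts)
naive_pat_drop = mean(13.0 - p for p in pats)

# Time-aware comparison: how far each ball sits from where it should be at its own clock time.
adj_colt_gap = mean(p - expected_psig(t, 12.5, True) for p, t in zip(colts, colt_times))
adj_pat_gap = mean(p - expected_psig(t, 13.0, False) for p, t in zip(pats, pat_times))

print(f"time-blind drop : Colts {naive_colt_drop:.2f}, Patriots {naive_pat_drop:.2f}")
print(f"time-aware gap  : Colts {adj_colt_gap:.2f}, Patriots {adj_pat_gap:.2f}")
```

Under these assumptions the time-blind drop is larger for the Colts, while the time-aware gap is larger (more negative) for the Patriots, which is the team that was actually tampered with in the setup.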

Exponent Method #2: a picture that lies

Exponent also attempts to use a “visual proof” of sorts to demonstrate that something is wrong with the Patriot footballs and not the Colt footballs. This approach is Exponent’s acknowledgement that time (“transient curves”) is a relevant variable in the measurement process, but their demonstration is simply incorrect.

Without getting into the nitty-gritty, we can simply draw the exact same picture that Exponent draws (starting with Figure 24, pg. 210) to completely disprove their “proof.” (Note: they draw a picture because if they ran an actual statistical test on the data using a transient curve, they would reach the opposite conclusion.) I’ve taken the same data and drawn an Exponent-like picture below. According to Exponent’s “logic,” if the “window” between a dry and wet football doesn’t intersect with the band of what was actually measured (within +/- 2 standard errors of the mean), then it demonstrates something outside the physical boundaries of what is possible.

Here’s our hypothetical data presented Exponent style:

Fake Exponent Graph

Lo and behold…the Colt band does not intersect with a 12.5 PSI projection curve and the Patriot band does intersect with its 13.0 PSI projection curve. According to Exponent, this shows that the Colt balls must have had something additional done to them. Except that in this scenario, it was the Patriot balls that were tampered with!

This is possible because this “visual proof” approach is biased toward a dry ball that was measured later in the locker room. Just like their p-value is biased toward a dry ball measured later in the locker room period. In the case of the visual, if normal variance (from any non-tampering factor) moves dry-ball measurements slightly below what is expected — as we’ve done in this hypothetical for the Patriots — it simply shifts the team’s band below the dry-ball upper boundary but still above the wet-ball lower boundary. If a team actually had a wet ball, they are instead shifted below the lower band, outside of what Exponent considers physically acceptable.

With regard to time, the Patriot balls in our hypothetical comfortably intersect with the acceptable region. This is because they were measured later in the locker-room period; despite dropping in PSI from the fake deflation of seven balls, there is still a band with which to intersect earlier in the locker-room period. Conceptually, this is simply taking the point where the hypothetical region and bounded region (the two red regions) intersect and “shifting it left.” The team measured later in the locker-room period has room to “shift left.”

The Colts, however, by virtue of being measured first in this hypothetical, can only “shift left” for about 2 minutes, because that’s when their balls were measured. The Colts can’t “shift left” 4 minutes, because they would no longer be in the locker-room period. The team measured later can “shift left” 4 minutes, because it just means their “possible scenario” occurred earlier in the locker-room period.
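The band-versus-window check described in this section can be mimicked with a toy Python sketch. To be clear about what is assumed: the dry and wet curves below are a made-up exponential warm-up (10-minute time constant, constant 0.3 PSI wet penalty), the readings are invented, and `overlap_times` is my own hypothetical helper, not anything from the report. The only pieces taken from the text are the +/- 2 standard error band and the idea that a team can only “shift left” as far back as the start of the locker-room period (taken here as 2 minutes).

```python
import math
from statistics import mean, stdev

ATM = 14.636        # psi, from the experiment parameters
TAU_MIN = 10.0      # ASSUMPTION: warm-up time constant in minutes
WET_OFFSET = 0.3    # ASSUMPTION: constant wet-ball penalty in psi

def dry_curve(t_min, pregame_psig):
    """Toy projection for a dry ball t_min minutes after coming indoors."""
    cold = (pregame_psig + ATM) * (48 + 459.67) / (71 + 459.67) - ATM
    return cold + (pregame_psig - cold) * (1 - math.exp(-t_min / TAU_MIN))

def wet_curve(t_min, pregame_psig):
    """Toy projection for a wet ball: the dry curve shifted down."""
    return dry_curve(t_min, pregame_psig) - WET_OFFSET

def band(readings):
    """Mean +/- 2 standard errors, the band drawn in the Exponent-style figure."""
    m = mean(readings)
    se = stdev(readings) / math.sqrt(len(readings))
    return m - 2 * se, m + 2 * se

def overlap_times(lo, hi, pregame_psig, earliest_min, latest_min, step=0.25):
    """Clock times at which the wet-to-dry window touches the band [lo, hi]."""
    times = []
    t = earliest_min
    while t <= latest_min + 1e-9:
        if wet_curve(t, pregame_psig) <= hi and dry_curve(t, pregame_psig) >= lo:
            times.append(round(t, 2))
        t += step
    return times

# Invented readings: dry Patriot balls gauged around minute 10, roughly 0.5 psi low.
pat_readings = [dry_curve(10, 13.0) - d for d in (0.45, 0.50, 0.55, 0.60, 0.50, 0.45, 0.55)]
# Invented readings: wet Colt balls gauged around minute 2.5, roughly 0.4 psi low.
colt_readings = [wet_curve(2.5, 12.5) - d for d in (0.35, 0.40, 0.45, 0.40)]

pat_hits = overlap_times(*band(pat_readings), 13.0, earliest_min=2.0, latest_min=10.0)
colt_hits = overlap_times(*band(colt_readings), 12.5, earliest_min=2.0, latest_min=2.5)

print("Patriot band touches the window at minutes:", pat_hits)
print("Colt band touches the window at minutes:", colt_hits)
```

In this toy setup the Patriot band finds a stretch of earlier minutes where it sits inside the window even though those balls are further below their own curve, while the Colt band never touches the window at all.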

I’ve tampered with the Patriot balls only, but both of Exponent’s major methods strongly suggest it’s the Colt balls that have been tampered with! For the record, if we use a proper methodology as shown in the last post (one that accounts for time), a t-test will produce a statistically significant result (p-value of 0.046) that correctly identifies the Patriot balls as being tampered with.

Conclusion

There are other peripheral weirdnesses in Exponent’s methods, but we don’t need to move beyond the two major core issues here that lead them to their conclusions. In our fake scenario, in which the Patriots deflated seven balls, both of Exponent’s methods would find the wrong team guilty of tampering with the balls! The method used in the last post that controls for time — specifically, taking the difference of each ball at the time it is measured and seeing how far it is from the projected PSI — instead correctly identified the tampered balls with statistical significance.

Yes, a proper method can demonstrate this despite a sample size of just four footballs from the control group. This is possible because of the consistency of the measurements in our hypothetical. You know what set of data did not show the same consistency? The real Colt balls, measured at halftime of the AFC Championship game. That Exponent jumps through hoops to try and demonstrate a lack of variability in measurements is fine and dandy…but as Rasheed Wallace once said, “ball don’t lie.” And the four Colt balls don’t lie — there is enough variability in the data set that, unsurprisingly, a 0.2 PSI difference in expected measurements at halftime is not statistically significant, and in many cases, not even close.

A proper analysis from Exponent, given the real halftime data presented in the Wells Report, would have found this.

 

 

The Statistical Improbability of Deflate Gate

On Sunday I broke down some of the common misconceptions surrounding the Wells Report, including the social science involved, the statistical misinterpretations and the lack of coherence in the NFL’s story based on its own evidence. Then, on Wednesday, I provided a time-based visualization of all the measurements presented in the Report, based on where we’d expect them to be at a given time as the balls warmed up in the locker room. Visually, it’s fairly clear that the Colt balls and Patriot balls have similar issues, as many are “under-inflated” by similar degrees. But what does this mean in terms of probability if we actually run some statistical tests on the data?

To reiterate, time is a major variable in this case because the PSI of the balls was increasing with every minute that they were in the locker room at halftime. Thus, the time that each ball was measured becomes critical in trying to analyze the discrepancy between where a ball was measured and where a ball “should” be using Ideal Gas Law parameters. Below is one such scenario presented in the previous post. The blue line is where we’d expect a Colt ball to measure given the time indoors and the gold line where we’d expect a Patriot ball to be (based on Fig. 22 of Wells Report):

Deflate Gate Logo Scenario

What we want to calculate is the difference between each ball at a given point in time (a circle) and where we’d expect the ball to be based on how long it’s been in the locker room (the solid line). For instance, in the above graph, Patriot ball #1 is about 0.25 PSI above where we’d expect. Ball #2 is about 0.4 PSI below where we’d expect. These values will be different depending on when the balls were measured, so our parameters for simulating the actual measuring circumstances are (assuming the report is correct in that the balls were correctly recorded in order, and were set to 12.5 and 13.0 PSI respectively in the pre-game):

  • Set Up time (2-4 minutes according to accounts)
  • Measurement time (21.8 – 27.3 seconds per ball)
  • Inflation time (2-5 minutes)
  • Packing time (unstated, but assumed to require some small degree of time between last measurement and re-emergence from locker room)
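To make these timing parameters concrete, here is a minimal Python sketch that turns them into a clock time for each ball. It rests on my reading of the scenario, not on anything spelled out this way in the report: the 11 Patriot balls are gauged first, then re-inflated, then the 4 Colt balls are gauged, and each reading is stamped at the start of its slot. The function name and defaults are hypothetical, and packing time is left out because it falls after the last measurement. The example values are the “medium” ones used in this post (3 minutes of set up, about 25 seconds per ball, 3.5 minutes of inflation).

```python
def measurement_times(setup_min, sec_per_ball, inflate_min, n_pats=11, n_colts=4):
    """Return (patriot_times, colt_times) in minutes since the balls came indoors.

    ASSUMED order: Patriot balls gauged, Patriot balls re-inflated, Colt balls gauged.
    """
    step = sec_per_ball / 60.0
    pats = [setup_min + i * step for i in range(n_pats)]
    colts_start = setup_min + n_pats * step + inflate_min
    colts = [colts_start + i * step for i in range(n_colts)]
    return pats, colts

# "Medium" scenario: 3 min set up, 25 s per ball, 3.5 min inflation.
pats, colts = measurement_times(3.0, 25.0, 3.5)
print(f"Patriot balls gauged from minute {pats[0]:.2f} to {pats[-1]:.2f}")
print(f"Colt balls gauged from minute {colts[0]:.2f} to {colts[-1]:.2f}")
```

Swapping in 2 or 4 minutes of set up, 21.8 or 27.3 seconds per ball, and 2 or 5 minutes of inflation reproduces the other timing combinations in the same way.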

If we use Exponent’s Ideal Gas Law calculations, which assume 71 degrees pre-game (possibly slightly low, as noted in the last post), and add a small “wetness” factor per their report, we can then simulate a bunch of these scenarios to see what was likely and unlikely. The scenario above attempts to average all the accounts: set-up time is “medium” (3 minutes, halfway between 2 and 4), measurement time is “medium” (~25 seconds) and inflation time is “medium” (3.5 minutes). But we can also examine other scenarios: instances where the Patriot balls were tested after 2 minutes or 4 minutes, quickly or slowly re-inflated, etc. If we do that, we’re left with a number of basic permutations we can study:

Deflate Gate p-value

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 22 of the Wells Report.
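For reference, the Ideal Gas Law step behind these parameters is short enough to sketch in Python. This is a hedged illustration, not Exponent’s actual calculation: it assumes the ball’s volume stays constant, converts gauge pressure to absolute pressure with the 14.636 PSI atmospheric figure quoted above, and scales by the ratio of absolute temperatures. It ignores wetness and any warming indoors, so it only gives the starting point as a ball comes off the field. The function names are mine.

```python
ATM_PSI = 14.636  # atmospheric pressure from the experiment parameters

def fahrenheit_to_rankine(deg_f):
    """Absolute temperature; the gas law needs ratios of absolute temperatures."""
    return deg_f + 459.67

def cooled_gauge_psi(pregame_psig, pregame_f=71.0, field_f=48.0, atm=ATM_PSI):
    """Gauge pressure after the ball cools from pregame_f to field_f at constant volume."""
    absolute = pregame_psig + atm
    cooled = absolute * fahrenheit_to_rankine(field_f) / fahrenheit_to_rankine(pregame_f)
    return cooled - atm

for start in (12.5, 13.0):
    print(f"{start:.1f} PSI at 71F -> {cooled_gauge_psi(start):.2f} PSI at 48F")
# 12.5 PSI at 71F -> 11.32 PSI at 48F
# 13.0 PSI at 71F -> 11.80 PSI at 48F
```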

So what do the numbers in the table mean? The “Patriot-Colt mean difference from expected” column calculates where each ball should be based on the time it’s measured, averages those gaps for the Patriot balls and for the Colt balls, and takes the difference between the two averages. If we take the mean of all six hypothetical scenarios, the average Patriot ball is about 0.003 PSI below where it should be at the time of measurement relative to the Colt balls (i.e. using the Colt balls as a control group). The p-value is the probability of seeing a gap at least that large if both sets of balls came from the same population; the smaller it is, the stronger the suggestion that one set of balls had something done to them that the other set didn’t.
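Here is a small Python sketch of that calculation, for readers who want to rerun it on their own numbers. Several things are assumptions on my part: the post only says “independent t-test,” so this uses the pooled-variance (Student) version; the p-value is the usual two-sided one, computed by numerically integrating the t density so that no outside library is needed; and the gap values are invented placeholders, not the Wells Report measurements.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * central)

def pooled_t_test(a, b):
    """Return (mean difference a - b, t statistic, two-sided p-value)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    diff = mean(a) - mean(b)
    t_stat = diff / se
    return diff, t_stat, two_sided_p(t_stat, df)

# HYPOTHETICAL gaps (measured minus expected at each ball's clock time), in PSI.
pat_gaps = [0.25, -0.40, -0.15, -0.30, -0.05, -0.35, 0.10, -0.20, -0.45, -0.10, -0.25]
colt_gaps = [-0.05, -0.30, 0.10, -0.35]

diff, t_stat, p = pooled_t_test(pat_gaps, colt_gaps)
print(f"Patriot-Colt mean difference from expected: {diff:.3f} PSI")
print(f"t = {t_stat:.3f}, two-sided p = {p:.3f}")
```

The gaps fed into the test already have the time of measurement baked in, which is the whole point: the comparison is between how far each group sits from its own expected curve, not between raw halftime readings.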

  • The best-case statistical scenario for the Patriots is that Walt Anderson used the Logo Gauge pre-game, that the balls were measured at 2 minutes, each took about 22 seconds to measure and that the officials took 5 minutes to re-inflate the balls (labeled “Early Start, Fast Measure, Long Inflate” above). That produces a mean where the Patriots balls are higher than the Colts, meaning it’s impossible for the Patriots balls to come from a population that is inherently lower than the Colts balls.
  • Three of the six scenarios in which the Logo Gauge was used pre-game completely exonerate the Patriots.
  • The worst-case scenario for the Patriots is that Walt Anderson used the Non-Logo Gauge in the pre-game, and that the balls were measured at 4 minutes, each took 27 seconds to measure and that the officials took 2 minutes to re-inflate the balls (labeled “Late Start, Slow Measure, Quick Inflate” above). That produces a p-value of 0.247, which means that if our assumptions are true, there is a 75.3% chance the Patriots balls come from a different population.

Although it’s far below “statistical significance,” 75.3% might sound like a lot. But what is that number actually saying? For that, we have to look at the observed difference in the averages to put this into perspective: there’s a 75% chance that the 0.3 PSI difference is not simply from variance and is part of a different population (i.e. tampered balls).

Depending on the distribution, 0.3 PSI could easily be 99.99% likely to come from a different sample…which would suggest, what? There’s a 99.99% chance that the Patriots systematically released an average of 0.3 PSI per football? And that’s the worst-case scenario? That strains common sense.

In total, the results of the independent t-tests show that it was incredibly unlikely that the Patriot balls behaved any differently from the Colt balls using the assumptions presented in the Wells Report. Additionally, in the unlikely event that they are different (roughly a 15% chance if Anderson used the Logo Gauge in the pre-game and 57% if he used the Non-Logo gauge), the degree to which the balls are different is nonsensically small. We would expect a “small” degree of deflation to be something like 1.0-2.0 PSI; the initial reports were “11 more than 2 PSI below regulation,” with another ball falsely labeled by the NFL itself at 10.1. But the data presents a completely different story: the Patriot balls are sometimes higher than the Colt balls relative to what we’d expect, and the worst-case scenario for New England suggests a “non-significant” likelihood of tampering to a degree that is so small it’s equivalent to the variance seen between the two gauges used to measure the balls.

The next post details exactly how Exponent used faulty methods to reach faulty conclusions in the Wells Report. 

PS What happens if the Colts balls were measured immediately after the Patriots balls and then the Patriot balls were re-inflated? This still produces results that strongly suggest non-tampering, as shown below. If the Colt 4-ball group is indeed treated as a control group, we would expect to see the Patriot measurements 13.6% of the time in their “worst-case scenario” — not an occurrence considered “significant” in the scientific community. If Anderson used the Logo Gauge in the pre-game, Patriot measurements are roughly 0.1 to 0.2 PSI below the Colt measurements and far from statistically significant. 

Deflate Gate Fast Recal Patriots Inflate Last

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 22 of the Wells Report.

June 19, 2015 Update: Joe Arthur has asked me to run the calculation using a flatter transient curve. In other words, what happens if we expect the footballs to heat up more slowly than projected in Fig. 22 of the Wells Report? For a “slower” expected recalibration, we can use Fig. 24 of the Wells Report. If any physics experts out there can tell us at what exact rate the balls are expected to recalibrate, it would be much appreciated. In the meantime, the aforementioned numbers are derived with a “fast” recalibration curve, and the below with a “slow” recalibration curve. Based on other amateur experiments I’ve found, it seems likely recalibration takes place somewhere between these two extremes. Below are the results of a “slower” expected recalibration. Worth noting is that on the Logo Gauge, the Colts balls are almost exactly where we’d expect.

Slow Recalibration Model Deflate Gate

Experiment parameters: Footballs were gauged at 71 degrees pre-game and were 48 degrees coming off the field with an atmospheric pressure of 14.636. Transient recalibration curve based on Fig. 24 of the Wells Report.