Let’s set aside the smaller hand-waving techniques, like deciding to ignore one of the measurements for the Colt balls because it “looked strange,” or brushing past what might be compelling evidence in the report that Walt Anderson did indeed use the Logo gauge in his pre-game measurements. Instead, let’s just focus on the two major methods Exponent used to reach its faulty conclusion in Deflate Gate’s Wells Report:
- The “p-value” without controlling for time
- The use of a “visual proof” that is biased toward dry balls measured later in the locker-room period
To show exactly what’s so silly (and biased) about this methodology, imagine the following thought experiment:
- All footballs, for whatever reason, are slightly below where we’d expect them at halftime, about 0.4 PSI below expected
- The Patriots balls are dry instead of wet
- The Patriots balls are measured second instead of first
- The Colts balls are wet instead of dry
- Finally, imagine that the Patriots deliberately released 0.15-0.3 PSI from some of their footballs after referee Walt Anderson approved them
Doing that produces a table with the following hypothetical halftime measurements:
As you can see, the Patriot balls have a much higher halftime average measurement because they were dry, were inflated to 13.0 PSI in the pre-game, and were measured after the Colt balls, allowing them more time to recalibrate to pre-game PSI levels. Now let’s use Exponent’s two main methodologies to see if we can detect the tampering that we’ve built into the hypothetical.
Exponent Method #1: p-value independent of time
First, let’s run a statistical (independent t) test to see how likely it is that these two groups of footballs come from the same (untampered) population. If we do what Exponent did, which is ignore time of measurement and simply compare the pre-game level of the Patriot balls (13.0 PSI in our hypothetical) and Colt balls (12.5 PSI) to where they measured at halftime, we get a p-value of 0.0097. That’s almost exactly what Exponent came up with in the real-life scenario for the Patriots.
Only in our thought experiment, it’s the Colts who are flagged as the likely tamperers. Exponent’s method literally picks out the wrong team because it ignores major confounding variables like time. Pause for a second and appreciate how bad this is: I’ve created a scenario where the Patriots cheat, and Exponent’s methodology points to the Colts…because Exponent’s methodology is biased against any team that was measured first in the locker-room period (and also had wet footballs).
In our hypothetical, the Colt balls were measured first, and thus had less time to recalibrate. Furthermore, the Colt balls were treated as the wet group in this scenario, so they drop further in PSI compared to the dry Patriot balls (the opposite of the real-life AFC title game). So, using Exponent’s nonsensical test, it’s fairly easy to demonstrate “statistical significance” that one team tampered with the balls…even though it was not the team that actually tampered. It was just the team that had its wet footballs measured first.
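To make the mechanics concrete, here is a minimal sketch of that time-blind comparison. The halftime readings are made-up illustrative numbers, not the table from this post and not the Wells Report data, and `pooled_t` is a hypothetical helper that assumes the classic equal-variance form of the independent t-test. Only the 13.0 and 12.5 PSI pre-game levels come from the hypothetical above.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Classic two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2

# Pre-game levels from the thought experiment.
PATRIOTS_PREGAME, COLTS_PREGAME = 13.0, 12.5

# ASSUMPTION: invented halftime readings in PSI, for illustration only.
patriots_halftime = [12.30, 12.20, 12.35, 12.10, 12.25, 12.15,
                     12.05, 12.30, 12.20, 12.10, 12.25]
colts_halftime = [11.20, 11.30, 11.25, 11.15]

# The time-blind step: drop = pre-game level minus halftime reading,
# with no adjustment for when (or how wet) each ball was when measured.
patriots_drop = [PATRIOTS_PREGAME - psi for psi in patriots_halftime]
colts_drop = [COLTS_PREGAME - psi for psi in colts_halftime]

t_stat, dof = pooled_t(colts_drop, patriots_drop)
print(f"mean drop: Colts {mean(colts_drop):.3f}, Patriots {mean(patriots_drop):.3f}")
print(f"t = {t_stat:.2f} on {dof} degrees of freedom")
```

A large positive t here says only that the Colt balls dropped more than the Patriot balls. Nothing in the calculation can tell whether that gap came from tampering or simply from being wet and measured earlier, which is the whole problem.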
Exponent Method #2: a picture that lies
Exponent also attempts to use a “visual proof” of sorts to demonstrate that something is wrong with the Patriot footballs and not the Colt footballs. This approach is Exponent’s acknowledgement that time (“transient curves”) is a relevant variable in the measurement process, but their demonstration is simply incorrect.
Without getting into the nitty gritty, we can simply draw the exact same picture that Exponent draws (starting with Figure 24, pg. 210) to completely disprove their “proof.” (Note: they draw a picture because if they ran an actual statistical test on the data using a transient curve, they would reach the opposite conclusion.) I’ve taken the same data and drawn an Exponent-like picture below. According to Exponent’s “logic,” if the “window” between a dry and wet football doesn’t intersect with the band of what was actually measured (within +/- 2 standard errors from the mean), then it demonstrates something outside the physical boundaries of what is possible.
Here’s our hypothetical data presented Exponent style:
Lo and behold…the Colt band does not intersect with a 12.5 PSI projection curve and the Patriot band does intersect with its 13.0 PSI projection curve. According to Exponent, this shows that Colt balls must have had something additional done to them. Except that in this scenario, it was the Patriot balls that were tampered with!
This is possible because this “visual proof” approach, just like their p-value, is biased toward a dry ball that was measured later in the locker-room period. In the case of the visual, if normal variance (from any non-tampering factor) moves dry-ball measurements slightly below what is expected, as we’ve done in this hypothetical for the Patriots, it simply shifts the team’s band below the dry-ball upper boundary but still above the wet-ball lower boundary. If a team actually had wet balls, its band is instead shifted below the lower boundary, outside of what Exponent considers physically acceptable.
With regard to time, the Patriot balls in our hypothetical comfortably intersect with the acceptable region. This is because they were measured later in the locker-room period; despite dropping in PSI from the fake deflation of seven balls, there is still a band with which to intersect earlier in the locker-room period. Conceptually, this is simply taking the point where the hypothetical region and bounded region (the two red regions) intersect and “shifting it left.” The team measured later in the locker-room period has room to “shift left.”
The Colts, however, by virtue of being measured first in this hypothetical, can only “shift left” for about 2 minutes, because that’s when their balls were measured. The Colts can’t “shift left” 4 minutes, because they would no longer be in the locker-room period. The team measured later can “shift left” 4 minutes, because it just means their “possible scenario” occurred earlier in the locker-room period.
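The “shift left” idea can be sketched as a simple scan. This is my own simplified reading of the picture, not Exponent’s procedure: it assumes an exponential warm-up curve for a dry ball, a wet-ball curve sitting a fixed gap below it, and invented starting pressures and bands. The names `dry_psi` and `latest_overlap` are hypothetical. The scan looks backward from the time a team was measured for any moment at which that team’s band overlaps the window between the wet and dry curves.

```python
from math import exp

def dry_psi(minutes, start_psi, pregame_psi, tau=8.0):
    """ASSUMED exponential warm-up of a dry ball from start_psi toward pregame_psi."""
    return pregame_psi - (pregame_psi - start_psi) * exp(-minutes / tau)

def latest_overlap(band, start_psi, pregame_psi, measured_at, wet_gap=0.3, step=0.25):
    """Latest time in [0, measured_at] at which the measured band overlaps
    the wet-to-dry window, or None if it never does."""
    band_lo, band_hi = band
    for i in range(int(measured_at / step), -1, -1):
        t = i * step
        dry = dry_psi(t, start_psi, pregame_psi)
        wet = dry - wet_gap  # ASSUMPTION: wet curve is a fixed gap below the dry curve
        if band_lo <= dry and band_hi >= wet:
            return t
    return None

# ASSUMPTION: invented bands (mean +/- 2 standard errors) and starting pressures.
patriots = latest_overlap((11.83, 12.00), start_psi=11.8, pregame_psi=13.0, measured_at=6.0)
colts = latest_overlap((10.80, 10.93), start_psi=11.3, pregame_psi=12.5, measured_at=2.0)

print("Patriots overlap at minute:", patriots)
print("Colts overlap at minute:", colts)
```

With these made-up numbers the Patriot band sits below the window at the moment of measurement but finds an overlap earlier in the period, while the Colt band, with only two minutes to search, never finds one. The later-measured team gets more chances to look “physically possible” no matter what was actually done to its footballs.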
I’ve tampered with the Patriot balls only, but both of Exponent’s major methods strongly suggest it’s the Colt balls that have been tampered with! For the record, if we use a proper methodology as shown in the last post (one that accounts for time), a t-test will produce a statistically significant result (p-value of 0.046) that correctly identifies the Patriot balls as being tampered with.
There are other peripheral weirdnesses in Exponent’s methods, but we don’t need to move beyond the two major core issues here that lead them to their conclusions. In our fake scenario, in which the Patriots deflated seven balls, both of Exponent’s methods would find the wrong team guilty of tampering with the balls! The method used in the last post that controls for time — specifically, taking the difference of each ball at the time it is measured and seeing how far it is from the projected PSI — instead correctly identified the tampered balls with statistical significance.
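Here is a rough sketch of what controlling for time looks like in practice. Everything numeric in it is an assumption for illustration: the exponential `projected_psi` curve, the time constant, the starting pressures, the measurement times, and the noise are all invented, and the readings are synthetically generated from that same assumed curve rather than taken from this post’s table. The 0.4 PSI common shortfall and the 0.15-0.3 PSI release on seven Patriot balls mirror the thought experiment.

```python
from math import exp, sqrt
from statistics import mean, variance

TAU = 8.0  # ASSUMED warm-up time constant, in minutes

def projected_psi(minutes, start_psi, pregame_psi):
    """ASSUMED projection: exponential return from start_psi toward pregame_psi."""
    return pregame_psi - (pregame_psi - start_psi) * exp(-minutes / TAU)

def pooled_t(a, b):
    """Classic two-sample t statistic (equal variances assumed) and degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb)), na + nb - 2

def residuals(readings, start_psi, pregame_psi):
    """Each ball's reading minus its projected PSI at the minute it was measured."""
    return [psi - projected_psi(t, start_psi, pregame_psi) for t, psi in readings]

SHORTFALL = 0.4  # every ball reads about 0.4 PSI below expectation

# Synthetic Colt balls: wet (lower assumed start), measured first, untampered.
colts_noise = [0.02, -0.03, 0.01, 0.00]
colts = [(t, projected_psi(t, 11.0, 12.5) - SHORTFALL + n)
         for t, n in zip([2.0, 2.5, 3.0, 3.5], colts_noise)]

# Synthetic Patriot balls: dry, measured later, seven of eleven released 0.15-0.3 PSI.
release = [0.15, 0.20, 0.25, 0.30, 0.20, 0.15, 0.30, 0.0, 0.0, 0.0, 0.0]
patriots = [(6.0 + 0.4 * i, projected_psi(6.0 + 0.4 * i, 11.8, 13.0) - SHORTFALL - r)
            for i, r in enumerate(release)]

colts_res = residuals(colts, 11.0, 12.5)
patriots_res = residuals(patriots, 11.8, 13.0)
t_stat, dof = pooled_t(patriots_res, colts_res)

print(f"mean residual: Colts {mean(colts_res):.3f}, Patriots {mean(patriots_res):.3f}")
print(f"t = {t_stat:.2f} on {dof} degrees of freedom")
```

Because each ball is compared with its own projection at its own measurement time, the shared 0.4 PSI shortfall and the measurement order both wash out of the comparison. What remains is the extra drop in the Patriot group, so the sign of the test now points at the team that was actually tampered with.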
Yes, a proper method can demonstrate this despite a sample size of just four footballs from the control group. This is possible because of the consistency of the measurements in our hypothetical. You know what set of data did not show the same consistency? The real Colt balls, measured at halftime of the AFC Championship game. That Exponent jumps through hoops to try and demonstrate a lack of variability in measurements is fine and dandy…but as Rasheed Wallace once said, “ball don’t lie.” And the four Colt balls don’t lie — there is enough variability in the data set that, unsurprisingly, a 0.2 PSI difference in expected measurements at halftime is not statistically significant, and in many cases, not even close.
A proper analysis from Exponent, given the real halftime data presented in the Wells Report, would have found this.