Data is funny.
We use it to tell us all sorts of things. We call it empirical. We talk about how the data doesn’t lie. We look at numbers, look at trends, and we draw conclusions – not just the data scientists in the crowd, but everybody. How much money did you make last year? How profitable is the latest Captain America movie? Who is the most successful batter of all time? The data will tell us.
But … will it?
Let’s look at a couple of baseball players and examine their batting averages. This is real data I’m using here, and the math is pretty easy. For the sake of conversation, let’s try to determine who was the better batter – Derek Jeter or David Justice. To keep things simple, let’s examine a data set of just two years, 1995 and 1996, and let’s talk about each player’s batting average – that’s the fraction of at-bats in which the batter gets a hit.
Derek Jeter’s batting average was .250 for 1995 and .314 for 1996.
David Justice’s batting average was .253 for 1995 and .321 for 1996.
What does the data tell us? It’s pretty clear, right? If you’re gonna pick a better batter for 1995 and 1996, you’d choose David Justice. He was a more successful batter than Derek Jeter was. He hit the ball with more reliability. That’s not my opinion – the data says so!
Not so fast. Let’s combine the two years:
For the two-year period combined, Derek Jeter’s batting average was .310
For the same period, David Justice’s batting average was .270
That’s not a typo, that’s Simpson’s Paradox in action. Edward Simpson first described his statistical finding this way: “Trends which appear in groups of data may disappear or reverse when the groups are combined.” Seems unbelievable, right? It’s not. It’s just math.
Let’s look at the raw data. I marked the “winner” in each data set with an asterisk.

1995              Hits   At Bats   Average
Derek Jeter         12        48      .250
David Justice      104       411      .253 *

1996              Hits   At Bats   Average
Derek Jeter        183       582      .314
David Justice       45       140      .321 *

Combined          Hits   At Bats   Average
Derek Jeter        195       630      .310 *
David Justice      149       551      .270
The data doesn’t lie. David Justice had a more successful percentage of at-bats in 1995 and a more successful percentage of at-bats in 1996 … and when you combine the two years, Derek Jeter is the better batter. Sorry, David; when you aggregate data, sometimes there’s just no justice.
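If you’d like to check the arithmetic yourself, here is a minimal Python sketch. The hits and at-bats are copied straight from the tables above; the names SEASONS, batting_average, and combined_average are just labels made up for this illustration, not part of any library.

```python
# Hits and at-bats per season, copied from the tables above.
# SEASONS, batting_average and combined_average are illustrative names only.
SEASONS = {
    "Derek Jeter":   {1995: (12, 48),   1996: (183, 582)},
    "David Justice": {1995: (104, 411), 1996: (45, 140)},
}


def batting_average(hits, at_bats):
    """Hits divided by at-bats."""
    return hits / at_bats


def combined_average(seasons):
    """Pool hits and at-bats across all seasons, then divide once."""
    total_hits = sum(hits for hits, _ in seasons.values())
    total_at_bats = sum(at_bats for _, at_bats in seasons.values())
    return batting_average(total_hits, total_at_bats)


for player, seasons in SEASONS.items():
    for year, (hits, at_bats) in seasons.items():
        avg = batting_average(hits, at_bats)
        print(f"{player} {year}: {avg:.3f} over {at_bats} at-bats")
    print(f"{player} combined: {combined_average(seasons):.3f}")
```

Printing the at-bats next to each average hints at why the flip happens. The combined figure is a weighted average, and the two players’ weights are lopsided in opposite directions: 582 of Jeter’s 630 at-bats came in his strong 1996, while 411 of Justice’s 551 came in his weaker 1995.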
I’m not saying data can’t be trusted – that’s not the point at all. Data can always be trusted. It’s empirical, remember. Data doesn’t lie. The paradox is that both cases are true. David Justice had a higher batting average than Derek Jeter in both 1995 and 1996. This is a fact. Derek Jeter’s combined 1995–1996 batting average is higher. This is also true. It seems like these things can’t both be true, but they are.
And that’s the point.
The world isn’t binary. We think if A is true then B must be false, and that’s almost never the case. We think, if we’re right about something, then others must be wrong. We think if what the data tells us is true, then what the data doesn’t tell us is surely false.
All too often, we’re wrong.
Let’s talk about movies for a second. Which movie was more successful, The Avengers or Furious 7? Let me give you some data to help you figure this out:
Movie           Worldwide Gross
The Avengers     $1,517,557,910
Furious 7        $1,516,045,991
The answer is obvious. The Avengers was more successful, right? The data says so. The math is clear. The Avengers made about $1.5 million more than Furious 7. Box office numbers don’t lie! But there’s more to the data than that. Dig a bit deeper and look at what each movie cost to make:
Movie                 Budget
The Avengers    $220,000,000
Furious 7        $85,000,000
So The Avengers cost $135 million more to make than Furious 7 did, and made only about $1.5 million more at the box office. Doesn’t that mean Furious 7 was more successful?
I guess it depends on how you define successful. And that brings us closer to something you can take away and think about. If you define a movie’s success as tickets sold (and dollars earned) at the box office, you are correct in asserting that The Avengers is more successful. If you define a movie’s success as its box office receipts less its budget, you are correct in asserting that Furious 7 is more successful.
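Here is a small Python sketch of those two definitions side by side. It uses the gross and budget figures exactly as quoted above; MOVIES, by_gross, and by_gross_less_budget are hypothetical names chosen for this example, and “gross less budget” is only the rough yardstick from this post, not a full accounting of profit.

```python
# Worldwide gross and budget in US dollars, as quoted in the text above.
# MOVIES and the two ranking functions are illustrative names only.
MOVIES = {
    "The Avengers": {"gross": 1_517_557_910, "budget": 220_000_000},
    "Furious 7":    {"gross": 1_516_045_991, "budget": 85_000_000},
}


def by_gross(title):
    """Success defined as box office receipts alone."""
    return MOVIES[title]["gross"]


def by_gross_less_budget(title):
    """Success defined as box office receipts less the budget."""
    return MOVIES[title]["gross"] - MOVIES[title]["budget"]


# Same data, two definitions of "success", two different winners.
print("Winner by gross:", max(MOVIES, key=by_gross))
print("Winner by gross less budget:", max(MOVIES, key=by_gross_less_budget))
```

Nothing about the data changes between the two print statements. Only the key function does, which is the whole argument in two lines.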
Your binary instincts tell you that only one or the other can be true, but the data confirms that both are.
It’s all about how you look at it.
Consider this the next time you find yourself in a disagreement with someone about something. What if the fact that you’re right doesn’t mean the other person is wrong? What if you’re facing Simpson’s Paradox? What if you’re both right?
It’s not always about who is right.
Sometimes, everybody is.