Data Doesn’t Lie.

Data is funny.

We use it to tell us all sorts of things. We call it empirical. We talk about how the data doesn’t lie. We look at numbers, look at trends, and we draw conclusions – not just the data scientists in the crowd, but everybody. How much money did you make last year? How profitable is the latest Captain America movie? Who is the most successful batter of all time? The data will tell us.

But … will it?

Let’s look at a couple of baseball players and examine their batting averages. This is real data I’m using here, and the math is pretty easy. For the sake of conversation, let’s try to determine who was a better batter – Derek Jeter or David Justice. To make things simple, let’s examine a data set of just two years, 1995 and 1996, and let’s talk about each player’s batting average – that’s hits divided by at-bats, or the fraction of his at-bats in which a batter gets a hit.

Derek Jeter’s batting average for 1995 was .250 and for 1996 was .314
David Justice’s batting average for 1995 was .253 and for 1996 was .321

What does the data tell us? It’s pretty clear, right? If you’re gonna pick a better batter for 1995 and 1996, you’d choose David Justice. He was a more successful batter than Derek Jeter was. He hit the ball with more reliability. That’s not my opinion – the data says so!

Not so fast. Let’s combine the two years:

For the two-year period combined, Derek Jeter’s batting average was .310
For the same period, David Justice’s batting average was .270

Wait, what?

That’s not a typo, that’s Simpson’s Paradox in action. Edward Simpson first described his statistical finding this way: “Trends which appear in groups of data may disappear or reverse when the groups are combined.” Seems unbelievable, right? It’s not. It’s just math.

Let’s look at the raw data. I marked the “winner” in each data set with an asterisk.

1995:             Hits    At Bats    Average

Derek Jeter         12         48       .250
David Justice      104        411       .253 *

1996:             Hits    At Bats    Average

Derek Jeter        183        582       .314
David Justice       45        140       .321 *

Combined:         Hits    At Bats    Average

Derek Jeter        195        630       .310 *
David Justice      149        551       .270

The data doesn’t lie. David Justice had a more successful percentage of at-bats in 1995 and a more successful percentage of at-bats in 1996 … and when you combine the two years, Derek Jeter is the better batter. Sorry, David; when you aggregate data, sometimes there’s just no justice.
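
If you want to check the arithmetic yourself, here is a minimal Python sketch. The hits and at-bats are copied from the tables above; the names `seasons`, `batting_average`, and `combined_average` are just ones I made up for illustration. The thing to notice is that the combined average is total hits over total at-bats, not the average of the two yearly averages. Most of Jeter’s at-bats came in 1996, when both players hit well, and most of Justice’s came in 1995, when both hit worse, so each player’s combined number is pulled toward a different year.

```python
# (hits, at_bats) per season, copied from the tables above.
seasons = {
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582)},
    "David Justice": {1995: (104, 411), 1996: (45, 140)},
}


def batting_average(hits, at_bats):
    """Batting average is simply hits divided by at-bats."""
    return hits / at_bats


def combined_average(player):
    """Pool every season: total hits over total at-bats."""
    total_hits = sum(hits for hits, _ in seasons[player].values())
    total_at_bats = sum(at_bats for _, at_bats in seasons[player].values())
    return batting_average(total_hits, total_at_bats)


for year in (1995, 1996):
    for player, by_year in seasons.items():
        print(f"{year}  {player:<14} {batting_average(*by_year[year]):.3f}")

for player in seasons:
    print(f"Both  {player:<14} {combined_average(player):.3f}")
```

Run it and Justice comes out ahead in each single year while Jeter comes out ahead once the seasons are pooled, using exactly the same numbers.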

I’m not saying data can’t be trusted – that’s not the point at all. Data can always be trusted. It’s empirical, remember. Data doesn’t lie. The paradox is that both cases are true. David Justice had a higher batting average than Derek Jeter in both 1995 and 1996. This is a fact. Derek Jeter’s combined 1995/1996 batting average is higher. This is also a fact. It seems like these things can’t both be true, but they are.

And that’s the point.

The world isn’t binary. We think if A is true then B must be false, and that’s almost never the case. We think, if we’re right about something, then others must be wrong. We think if what the data tells us is true, then what the data doesn’t tell us is surely false.

All too often, we’re wrong.

Let’s talk about movies for a second. Which movie was more successful, The Avengers or Furious 7? Let me give you some data to help you figure this out:

Movie             Worldwide Gross

The Avengers      $1,517,557,910
Furious 7         $1,516,045,991

The answer is obvious. The Avengers was more successful, right? The data says so. The math is clear. The Avengers made $1.5 million more than Furious 7. Box office numbers don’t lie! But there’s more to the data than that. Dig a bit deeper and look at what each movie cost to make:

Movie             Budget

The Avengers      $220,000,000
Furious 7         $85,000,000

So The Avengers cost $135 million more to make than Furious 7 did, and only made $1.5 million more than Furious 7 did. Doesn’t that mean Furious 7 was more successful?

I guess it depends on how you define successful. And that brings us closer to something you can take away and think about. If you define a movie’s success to be a measure of tickets sold (and dollars earned) at the box office, you are correct in asserting that The Avengers is more successful. If you define a movie’s success as the function of the movie’s box office receipts less the movie’s budget, you are correct in asserting that Furious 7 is more successful.
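
To make the two definitions concrete, here is a small Python sketch that ranks the same two movies both ways. The figures are the ones quoted above; the function names `gross` and `gross_less_budget` are my own labels for the two definitions of success, not any kind of industry standard.

```python
# Worldwide gross and budget in dollars, as quoted in the tables above.
movies = {
    "The Avengers": {"gross": 1_517_557_910, "budget": 220_000_000},
    "Furious 7": {"gross": 1_516_045_991, "budget": 85_000_000},
}


def gross(title):
    """Definition 1: success is box office receipts."""
    return movies[title]["gross"]


def gross_less_budget(title):
    """Definition 2: success is box office receipts less the budget."""
    return movies[title]["gross"] - movies[title]["budget"]


# Same data, two different yardsticks.
winner_by_gross = max(movies, key=gross)
winner_by_gross_less_budget = max(movies, key=gross_less_budget)

print("Most successful by gross:", winner_by_gross)
print("Most successful by gross less budget:", winner_by_gross_less_budget)
```

Nothing about the data changes between the two print statements. Only the question does.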

Your binary instincts tell you that only one or the other can be true, but the data confirms that both are.

It’s all about how you look at it.

Consider this the next time you find yourself in a disagreement with someone about something. What if the fact that you’re right doesn’t mean the other person is wrong? What if you’re facing Simpson’s Paradox? What if you’re both right?

It’s not always about who is right.
Sometimes, everybody is.
