As a researcher, I always get annoyed when I see the wanton abuse of data. Mashable’s article “Google+ Users Are Nearly All Male” is a great example of data abuse.
The data abuse starts by reporting data without noting anything about the methodology behind it. Data is meaningless if you don’t know how it was collected. For this article, they report on two different websites which claim to have analyzed Google+ user profiles. Neither of these websites says that it has looked at all of the profiles, and neither notes what method it used to sample the profiles that it did analyze.
The data abuse continues by ignoring the major differences in the data returned by the two sites. One says that 86.8% of sampled profiles are male, the other says that 73.7% are male. What explains a gap of 13.1 percentage points? I can come up with possibilities, but I don’t know if any of my potential explanations are correct. Each of them would tell us a very different story. For example, if the difference is one of time (that is, one set of data was collected earlier than the other), then we’d learn something about the early-adopter curve. If the difference is one of sampling method, then we might learn about the relative strengths of each of those sampling methods for this type of dataset.
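To make that point concrete, here is a rough sketch of the first check I would want to run: could the gap between the two sites be nothing more than sampling noise? It uses a standard two-proportion z-test. The sample sizes are purely hypothetical, because neither site reports how many profiles it sampled, and the test assumes simple random samples, which is exactly what we have no evidence for. The function and variable names are mine, for illustration only.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Return (z, two-sided p-value) for the difference between two sample proportions."""
    # Pooled proportion under the null hypothesis that both samples
    # come from the same underlying population.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Male shares as reported in the article.
SHARE_SITE_A = 0.868
SHARE_SITE_B = 0.737

# Hypothetical sample sizes: the real ones are not published.
for n in (50, 500, 5000):
    z, p = two_proportion_z(SHARE_SITE_A, n, SHARE_SITE_B, n)
    print(f"n = {n:>4} per site: z = {z:.2f}, two-sided p = {p:.3g}")
```

Under these assumptions, the answer depends heavily on the one number nobody reported. With a few dozen profiles per site, a gap this size is not clearly distinguishable from chance; with several hundred or more, it almost certainly reflects a real difference in timing or sampling method. Without the sample sizes, we cannot even tell which situation we are in.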
What really bothers me about this breathless repeating of such statistics is that there is no attempt at analysis. If we accept that the current Google+ users skew male, is this any different from the usual early-adopter curve? Or the early-adopter curve for social media? Or the early-adopter curve for new Google applications? Data without analysis is meaningless. Reporting on the data suggests that we should care, that there is something different here. But it appears that no one has bothered to answer such basic questions about the data.
We can do better than this. Let’s stop the blind reporting of data, and instead expend some effort on analyzing the data.