In social science, we talk a lot about replication, transparency and open data. We’re rightly concerned when researchers publish and publicize dramatic claims without sharing their raw data. Recent examples include Michael LaCour’s faked study on attitudes toward same-sex marriage and a controversial paper on “air rage” in which the data were withheld for business reasons and the findings were then questioned by an aviation journalist.
Today we have a story that goes in the other direction: a researcher who has gathered data that he is willing to share but is concerned about its quality.
Here’s Tom Slee, who wrote to me in a private email that he has given me permission to post:
For the last couple of years I’ve collected data on Airbnb listings in a wide variety of cities, and I now have well over 2.5 million data points. It’s been a useful exercise, and it’s led to some interesting journalistic stories.
Now I’m getting an increasing number of queries from academics for data and for the code that collects it. I’m happy to comply as and when I can, but there are obvious problems with the verifiability of the data itself. Individual surveys of cities on or around a particular date can be compared with partial data releases that Airbnb occasionally makes public, but those are always aggregated in one way or another, and have a definite slant to them. The method itself is not rocket science, but even if it were I could not vouch for it, as its success or failure depends on the changing form and practices of the Airbnb web site itself. Most of my surveys are, I believe, accurate for public policy discussion purposes but there are times (for all kinds of reasons) when some of the data may be inaccurate.
Academics now tend to ask me (a) how to refer to the data set, and (b) how they can validate it for publication purposes, to which I generally reply (a) I have no idea, and (b) I have no idea.
This is almost the opposite of the “private data set” problem. I have data that I believe to be accurate and even useful, but it can’t (by its nature) be well validated because that would require access to a private data set, and if we had access to Airbnb’s data then mine would be superfluous. Does this shut down academic analysis of topics like “Airbnb impact on Affordable Housing in X” or “Airbnb and the tourism industry in Y”?
The code is here.
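To make the fragility Slee describes more concrete, here is a minimal sketch of what one dated city survey might look like. This is not Slee’s collector (that is the code linked above); SEARCH_URL and parse_listings() are placeholders I’ve introduced for illustration, and the extraction step is exactly the part that depends on the site’s changing markup.

```python
# Illustrative sketch only -- not the linked collector. SEARCH_URL and
# parse_listings() are hypothetical placeholders: the real extraction
# depends on the Airbnb site's current structure, which changes over time.
import csv
import datetime

import requests

SEARCH_URL = "https://www.airbnb.com/s/{city}"  # placeholder; a real survey needs paging, headers, throttling


def parse_listings(html):
    """Hypothetical: pull (room_id, room_type, price, ...) rows out of a results page."""
    raise NotImplementedError("depends on the site's current markup")


def survey_city(city, pages=20):
    """Collect one dated snapshot ('survey') of listings for a city."""
    rows = []
    for page in range(pages):
        resp = requests.get(SEARCH_URL.format(city=city),
                            params={"page": page}, timeout=30)
        resp.raise_for_status()
        rows.extend(parse_listings(resp.text))
    return rows


def save_survey(city, rows):
    """Write the snapshot to a dated CSV so surveys can be compared over time."""
    stamp = datetime.date.today().isoformat()
    with open(f"{city}_{stamp}.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)
```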
It seems to me that if Slee is open about how the data were gathered, it should be fine for him to share them, as people can then analyze and validate them however they’d like.
But Slee points out:
I think for most of my users validation is difficult because they don’t have much to compare it to. They can validate (within a short time frame) that the listings I tabulate appear on the site, but they cannot easily see whether I have missed many listings.
And in particular there is no accessible “reality” to check against (except for inside Airbnb’s own black box).
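To illustrate how limited that kind of check is, here is a minimal sketch of a presence spot check, again with a placeholder URL pattern and a hypothetical survey-file layout: sample listing IDs from a survey and confirm they still resolve on the live site shortly after the survey date. This verifies that tabulated listings exist; it says nothing about listings the survey may have missed.

```python
# Minimal sketch of a presence spot check, not a completeness check.
# ROOM_URL is an assumed public per-listing page pattern, and the survey
# file is assumed to have the listing ID in its first column.
import csv
import random

import requests

ROOM_URL = "https://www.airbnb.com/rooms/{room_id}"


def spot_check(survey_csv, sample_size=100):
    """Return the fraction of sampled listing IDs that still resolve on the live site."""
    with open(survey_csv, newline="") as f:
        room_ids = [row[0] for row in csv.reader(f)]
    sample = random.sample(room_ids, min(sample_size, len(room_ids)))
    hits = sum(requests.get(ROOM_URL.format(room_id=r), timeout=30).status_code == 200
               for r in sample)
    return hits / len(sample)
```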
More generally, Slee expresses the concern that, on the one hand, black-box companies can now provide their data for unreproducible (but probably friendly) research that nevertheless seems to qualify for academic publication. On the other hand, academic investigations that are critical of these companies are difficult because the available data do not and cannot meet academic standards of validation. It’s an asymmetry that is politically troubling.
Here’s another example, regarding research on the environmental impact of ride-sharing apps, based on data supplied by . . . the ride-sharing app Uber. As Sarah Emerson discusses in this linked article, there are reasons to be concerned about the representativeness of these data, and, as a result, a consortium organized by the University of California is launching a new, more comprehensive data-collection effort on ride-sharing.