When should we trust polls from non-probability samples?

After my post the other day on tracking public opinion with biased polls, someone pointed me to this 2011 article by David Yeager, Jon Krosnick, LinChiat Chang, Harold Javitz, Matthew Levendusky, Alberto Simpser, and Rui Wang, who wrote:

This study assessed the accuracy of telephone and Internet surveys of probability samples and Internet surveys of non-probability samples of American adults by comparing aggregate survey results against benchmarks. The probability sample surveys were consistently more accurate than the non-probability sample surveys, even after post-stratification with demographics. The non-probability sample survey measurements were much more variable in their accuracy, both across measures within a single survey and across surveys with a single measure. . . .

Yeager et al. concluded:

These results are consistent with the conclusion that non-probability samples yield data that are neither as accurate as nor more accurate than data obtained from probability samples.

This got me a bit worried: we just wrote an article making the case that non-representative, non-random polls can be just fine, as long as you do a sensible adjustment using multilevel regression and post-stratification. And here’s this paper saying that non-probability samples are no good! What gives?
So I sent this to my colleague David Rothschild, who took a careful look at the Yeager et al. paper. Rothschild reports:

The study compares three types of polls: random digit dial (RDD) telephone, internet probability, and internet non-probability. The authors checked the accuracy of primary demographics used to weight the samples, secondary demographics not used to weight the samples, and answers to questions about the users' daily health habits (which are similar to the secondary demographics). Estimates are compared to benchmarks without post-stratification and then with post-stratification. The benchmarks are government surveys. There was one RDD, one probability internet, and seven non-probability internet polls.
1) It should be no surprise that probability samples have more accurate primary demographics before weighting. There is no difference post-weighting, almost by definition. At this point I will stop commenting on the before-weighting results: they would never be used in practice, so they are not meaningful to any academic or practitioner report.
2) With post-weighting, the secondary demographics and answers are slightly, but statistically significantly, better for the probability samples. The average absolute percentage-point error is 2.9 and 3.4 for the two probability samples and ranges from 4.5 to 6.6 for the non-probability samples, with an average and median of 5.2. The largest errors for the probability samples were 9.0 and 8.4; for the non-probability samples they ranged from 10.0 to 17.8, with an average of 13.5 and a median of 13. This seems to be the strongest basis for Yeager et al.’s statement that probability samples are better.
Small Issues: (a) The benchmark for secondary demographics is fine (ACS and CPS), but is it clear that the probability sample is not built for these demographics? (b) More questionable is the idea that ground truth for the other answers comes from the NHIS; all of the answers they were looking at were about health. But the NHIS is itself just a survey, which is likely to match the probability survey better. (c) Also, most standard methods for dealing with non-probability samples use a regression model prior to post-stratification. Non-probability samples may benefit more from the regression prior to post-stratification, due to the extremely bad selection issues in some demographic cells.
Big Issues: (a) Even if I took their results at face value, the errors may be statistically significantly worse, but would the non-probability polls still be worth it at 1% of the cost and a fraction of the time? Academic publications miss some key variables when they compare survey designs on accuracy alone: you must consider cost and speed. (b) The idea of showing non-weighted answers was to show that raw selection is not an issue in probability polling, despite its low response rate. While this paper was published in 2011, the surveys are from 2004-5. Unfortunately, in the time it took from survey to publication, the response rate for RDD probability polling plummeted from 25% to 9%: http://www.people-press.org/2012/05/15/assessing-the-representativeness-of-public-opinion-surveys/.
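
For readers who have not seen the weighting step written out, here is a minimal sketch of plain post-stratification: take the sample mean within each demographic cell, then average those cell means using known population shares. This is only an illustration, not the procedure used by Yeager et al. or in our own work. The cell names, the sample, and the population shares are all made up, and the sketch uses raw cell means where a multilevel regression would supply model-based cell estimates.

```python
from collections import defaultdict


def poststratify(responses, population_shares):
    """Post-stratified estimate of a mean.

    responses: iterable of (cell, outcome) pairs from the sample.
    population_shares: dict mapping each cell to its known population share.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell, outcome in responses:
        sums[cell] += outcome
        counts[cell] += 1

    # Raw cell means are undefined for empty cells; a regression model
    # fitted before post-stratification is one way to handle sparse cells.
    empty = [cell for cell in population_shares if counts[cell] == 0]
    if empty:
        raise ValueError(f"no respondents in cells: {empty}")

    # Weight each cell's sample mean by that cell's share of the population.
    return sum(
        share * sums[cell] / counts[cell]
        for cell, share in population_shares.items()
    )


# Hypothetical sample that over-represents younger respondents (80 of 100).
sample = (
    [("under_45", 1)] * 60 + [("under_45", 0)] * 20
    + [("45_plus", 1)] * 5 + [("45_plus", 0)] * 15
)
# Hypothetical benchmark shares, standing in for a census-type source.
shares = {"under_45": 0.4, "45_plus": 0.6}

raw = sum(outcome for _, outcome in sample) / len(sample)
adjusted = poststratify(sample, shares)
print(f"raw: {raw:.2f}, post-stratified: {adjusted:.2f}")
```

In this toy example the raw estimate is 0.65 and the post-stratified estimate is 0.45, because the over-sampled group answers "yes" far more often than the under-sampled one. The adjustment is only as good as the cells: if respondents differ from non-respondents within a cell, the weighting cannot fix that.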

The bottom line
No survey is truly a probability sample. Lists for sampling people are not perfect, and even more important, non-response rates are huge. When pollsters use probabilistic methods to try to get representative samples, they can do pretty well but they still need to do some adjustments to correct for known differences between sample and population. Non-probability samples typically require more adjustment. When you do the adjustment carefully, you can do just about as well with a non-probability sample as with a probability sample. But results from non-probability samples are generally more dependent on making sure the adjustment was done well. Meanwhile, response rates continue to decline and adjustment methods continue to improve (I know that last part because my colleagues and I are working hard on such methods).
So I think there’s room for both sorts of surveys. Rather than thinking in a binary way of probability vs. non-probability sampling, perhaps it’s better to think of a continuum, as follows: We want to gather a sample to learn about a specified population. We can put effort into data collection or into analysis. The more effort we put into data collection, the less effort needs to go into analysis. In some settings we can throw a lot of resources into data collection and try to get something close to a representative sample. In other settings (such as our Xbox poll and various “big data” problems), it’s hard to do much with the sampling, but we can put effort into collecting enough background data (for example, the Xbox respondents told us who they voted for in the previous election) to allow an accurate adjustment. A modern poll will always require serious effort in data collection and in analysis, and different surveys will put greater effort into different aspects of this mix. The only surveys I really don’t want any involvement in are those robo-polls, which really seem evil to me.
David Rothschild is an economist at Microsoft Research in New York.