
After the surprise of 2016, here’s how pollsters can do better in predicting election results

May 31, 2017
President-elect Donald Trump, alongside his wife, Melania, and their son, Barron, pumps his fist after giving his acceptance speech on election night. (John Locher/AP)

By now, it is widely agreed that bringing more — and more diverse — people into the policymaking process can have major payoffs in opening governance. The underlying argument is straightforward. Different people will have different understandings of a given situation. Assuming that these people are reasonably competent, each of these understandings likely captures some significant aspect of the truth. If we can somehow elicit and combine the insights contained in these different understandings, we can arrive at a broader understanding that is more complete and more likely to be accurate than any individual's single perspective.

Of course, this basic argument can be understood in many different ways, leading to interesting disagreement between people like Scott Page and Philip Tetlock over the extent to which ordinary people with diverse perspectives can systematically do better than experts who mostly share the same understanding of a problem. One question that has yet to receive enough attention, however, is how to combine different perspectives so as to extract as much information from them as possible.


On the one hand, scholars like Page provide broad proofs that different perspectives, if combined, can have great benefits. On the other, scholars like Tetlock investigate problem-solving through practical experimentation. There is room to supplement these accounts with a different approach that, like Page's, would try to provide a general mathematical understanding of what we can or cannot do, but that, like Tetlock's, would provide more specific insights into the techniques we use to aggregate information from individuals.

This is what my research does. Like Tetlock and other scholars, I am interested in the question of “collective forecasting.” In other words, I want to know how we can combine different people's judgments about an outcome (demand for a new product, GDP growth, rain or no rain tomorrow, whether a particular candidate will be elected, and so on) to arrive at a single consensus that represents the combined information of the group. An aggregator — i.e., a way of combining and summarizing everyone's predictions — that does not leave any information unused is called “efficient.”


To emphasize the real-world importance of this objective, consider Leon Panetta, who in early 2011 consulted key analysts from the U.S. intelligence community about the chances that Osama bin Laden was living in a specific compound in Pakistan. These analysts (presumably drawing on different perspectives and sets of information) gave wildly different estimates of the probability that bin Laden was there; their answers ranged from 30 percent to 90 percent. How should Panetta have combined such divergent predictions into a single good answer?

This is generally a hard problem. Below is what my research has shown so far.

Focusing on a single “best” analyst ignores a lot of information

Traditional organizations may not combine forecasts at all. Instead they try to figure out who the single best analyst is and then base their decision on that analyst’s prediction alone. This approach has several problems. First, it is often not possible to figure out who the “best” analyst is until the outcome of interest is observed. Then, for obvious reasons, it is too late. Second, even the very best analyst in the group doesn’t know everything that the rest of the group knows. Thus, focusing on a single expert is going to leave a lot of potentially valuable information on the table.

Averaging doesn’t work well, either

The simplest and most popular way to combine predictions is to take the average of everyone's predictions and use it as the group's consensus. This uses all the forecasts and hence is at least a step in the right direction. Unfortunately, however, the simple average is not a good aggregator; it does not aggregate information efficiently.
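As a minimal sketch (in Python, with made-up numbers purely for illustration), this is all that simple averaging does; note that the result is forced to sit between the lowest and highest forecast, a property that will matter below.

```python
from statistics import mean

# Probability forecasts from five hypothetical analysts for the same yes/no event
# (made-up numbers, for illustration only).
forecasts = [0.30, 0.45, 0.60, 0.75, 0.90]

# The simple average treats every forecast the same and always lands
# between the smallest and the largest forecast.
consensus = mean(forecasts)
print(f"Simple average: {consensus:.2f}")  # 0.60
```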

Are there better measures than the simple average?

Of course, simple averaging is not the only way to combine multiple predictions. There are a variety of “measures of central tendency” — summary statistics like the average, median, mode, geometric average, midrange and so on. Many of these statistics can also weight the forecasters differently, for instance, by assigning higher weight to any forecaster who is believed to be more knowledgeable. Could any of these standard summary statistics, equally weighted or not, aggregate information efficiently?
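The short Python sketch below (with arbitrary example forecasts and weights of my own, not data from any study) computes several of these summary statistics for the same set of forecasts, including a weighted average; every one of them stays inside the range spanned by the individual forecasts.

```python
import statistics

forecasts = [0.30, 0.45, 0.60, 0.75, 0.90]   # made-up probability forecasts
weights   = [1.0, 1.0, 2.0, 2.0, 1.0]        # hypothetical weights reflecting presumed expertise

summaries = {
    "mean":           statistics.mean(forecasts),
    "median":         statistics.median(forecasts),
    "geometric mean": statistics.geometric_mean(forecasts),
    "midrange":       (min(forecasts) + max(forecasts)) / 2,
    "weighted mean":  sum(w * f for w, f in zip(weights, forecasts)) / sum(weights),
}

for name, value in summaries.items():
    print(f"{name:>14}: {value:.3f}")
```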


My work has highly pessimistic implications for these ways of extracting information: They don’t work well as aggregators unless the forecasters’ information sets — the information about the problem that they have access to — are infinitely large. Since real-life problem-solvers are limited beings operating in a physical universe with hard limits on information processing, this is another way of saying that none of these standard measures is going to be efficient. They always leave some information on the table. Most generally, this inefficiency applies to any aggregator that always places the consensus forecast between the smallest and largest forecasts.
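To see why this is restrictive, consider a stylized example of my own (under standard textbook assumptions, not a calculation from my published results): two forecasters start from a 50 percent baseline and independently observe different pieces of evidence, each of which on its own justifies a 70 percent forecast. Combining the two independent pieces of evidence in log-odds space yields a consensus of roughly 84 percent, above both individual forecasts, so any aggregator trapped between the smallest and largest forecast would understate the combined evidence.

```python
import math

def log_odds(p: float) -> float:
    return math.log(p / (1 - p))

def inv_log_odds(z: float) -> float:
    return 1 / (1 + math.exp(-z))

prior = 0.5               # both forecasters start from a 50/50 baseline
forecasts = [0.70, 0.70]  # each sees independent evidence worth 70 percent on its own

# If the two pieces of evidence are conditionally independent, their
# log-odds shifts away from the prior simply add up.
combined = inv_log_odds(log_odds(prior) + sum(log_odds(p) - log_odds(prior) for p in forecasts))

print(f"Each individual forecast: 0.70, combined evidence: {combined:.3f}")  # about 0.845
```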

To illustrate the real-world implications of these results, consider the 2016 U.S. presidential election. Many polling companies were not able to predict the final outcome. In particular, a few days before Election Day, Nate Silver's highly celebrated polling analysis site, FiveThirtyEight, gave Republican Donald Trump only about a 30 percent chance of beating Democratic rival Hillary Clinton. Why was that estimate so low? A close look at the user's guide to Silver's prediction algorithm shows that state-specific polls are aggregated in two steps: a) each poll's prediction is adjusted to correct for known biases (convention bounce, likely-voter effects and so on), and b) a weighted average of the adjusted predictions is taken.
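A deliberately simplified sketch of that two-step recipe might look like the following (the polls, bias adjustments and weights are invented placeholders, not FiveThirtyEight's actual numbers or code):

```python
# Step a): adjust each poll for suspected biases; step b): take a weighted average.
# All numbers below are invented placeholders for a single hypothetical state.
polls = [
    {"clinton_share": 0.52, "bias_adjustment": -0.01, "weight": 1.5},
    {"clinton_share": 0.49, "bias_adjustment":  0.00, "weight": 1.0},
    {"clinton_share": 0.51, "bias_adjustment": -0.02, "weight": 0.5},
]

adjusted = [p["clinton_share"] + p["bias_adjustment"] for p in polls]     # step a)
weights = [p["weight"] for p in polls]
consensus = sum(w * a for w, a in zip(weights, adjusted)) / sum(weights)  # step b)

print(f"Weighted average of bias-adjusted polls: {consensus:.3f}")  # 0.500
```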


My work shows that this does not use the polls’ information efficiently. It is likely that an efficient aggregator would have assigned Trump a higher chance of winning. Would this have changed the way the candidates or the voters behaved and hence led to a different result? We will never know.

One should note, however, that there is a big difference between saying that these measures are inefficient and saying that they are not useful at all. Even simple averaging generally outperforms the typical individual’s forecast. What my general results suggest is that even if these measures are better than nothing, they are relatively inefficient and hence can be improved upon.

Are there other approaches that might work better?

Yes. The right approach, however, depends on the real-world context. If relevant historical data is available (e.g., past forecasts of GDP growth along with the actual realized growth), a simple approach called “extremization” often performs very well. In short, this approach moves the simple average further away from the noninformative, “naive” forecast (for probability forecasts, typically pushing it away from 50 percent and toward 0 or 100 percent), by an amount that is typically estimated from the historical data. If historical data is not available, a top-down approach can be applied instead: first choose an appropriate statistical model for the problem and then apply the best aggregator under that model. One can choose among existing generic models, such as the “Gaussian model,” but the best results are obtained by relying on field expertise and tailoring the model to the particular problem at hand.
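As one simple version of extremization for probability forecasts (a sketch under my own assumptions, not the exact method of any particular study), the snippet below averages the forecasts in log-odds space and then scales the result by a factor greater than one, pushing the consensus away from the noninformative 50 percent forecast; in practice that factor would be tuned on historical forecast-outcome pairs, which is not shown here.

```python
import math

def log_odds(p: float) -> float:
    return math.log(p / (1 - p))

def inv_log_odds(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def extremized_average(forecasts, alpha: float = 2.0) -> float:
    """Average probability forecasts in log-odds space, then scale by
    alpha > 1 to push the consensus away from the noninformative 50
    percent forecast. In practice alpha would be estimated from
    historical forecast-outcome data; 2.0 here is just a placeholder."""
    avg = sum(log_odds(p) for p in forecasts) / len(forecasts)
    return inv_log_odds(alpha * avg)

forecasts = [0.55, 0.65, 0.70]  # made-up probability forecasts
print(f"Simple average:     {sum(forecasts) / len(forecasts):.3f}")  # 0.633
print(f"Extremized average: {extremized_average(forecasts):.3f}")    # about 0.752
```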

In general, finding better aggregators than the simple average and the like is a wide-open research area. This also has huge potential implications for opening governance. If we want to take advantage of the new kinds of knowledge and expertise that might be attracted to problem-solving, we need new systematic and efficient ways of collecting information from the experts. One of the aims of my research is to encourage others to develop new and better ways of doing that.

Ville Satopää is an assistant professor of technology and operations management at INSEAD global business school.

This article is one in a series supported by the MacArthur Foundation Research Network on Opening Governance that seeks to work collaboratively to increase our understanding of how to design more effective and legitimate democratic institutions using new technologies and new methods. Neither the MacArthur Foundation nor the Network is responsible for the article’s specific content. Other posts in the series can be found here.