A simple Bayesian analysis of surveys

What share of UK adults gave the right answer to a statistical question?

Anthony B. Masters
2 min read · Mar 27, 2022

--

Classical statistics thinks of probability as the long-run frequency of events. The Bayesian approach considers probability as a degree of belief. The language of probability then describes uncertainty in unknown quantities.

The Royal Statistical Society commissioned a survey of the public, asking this question:

  • Question 1: If you toss a fair coin twice, what is the probability of getting two heads?

Opinium conducted this online survey, gathering the views of 2,001 UK adults. Researchers weighted responses by gender, age, region, social grade and employment status.

(Image: R Pubs)

Surveys provide estimates, which can differ from true values for many reasons. Researchers use distinct wordings and response options, trying to measure the same concept. Different survey modes, sampling frames, and weights may produce different estimates.

What about the uncertainty in those estimates? Bayesian analyses start with the prior distribution. This distribution represents knowledge before data collection. There were seven response options for the question: 15%, 25%, 40%, 50%, 75%, Other, and Don’t Know.

A non-informative (uniform) prior distribution treats every possible combination of shares as equally plausible. We can encode that belief in a Dirichlet distribution, with all seven of its parameters equal to 1.
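To make the prior concrete, here is a minimal sketch of drawing from a Dirichlet distribution with all parameters equal to 1. It uses only base R, building each draw from normalised Gamma variables rather than calling a package function. The names `n_draws`, `prior_alpha` and `prior_draws`, along with the seed and the number of draws, are assumptions made for illustration.

```r
# Sketch: draw from a Dirichlet(1, ..., 1) prior using base R only.
set.seed(2022)
n_options <- 7      # 15%, 25%, 40%, 50%, 75%, Other, Don't Know
n_draws <- 10000    # hypothetical number of simulations
prior_alpha <- rep(1, n_options)

# A Dirichlet draw is a vector of independent Gamma(alpha_i, 1) draws,
# divided by their sum. Column j of the matrix uses shape prior_alpha[j].
gamma_draws <- matrix(
  rgamma(n_draws * n_options, shape = rep(prior_alpha, each = n_draws)),
  nrow = n_draws, ncol = n_options)
prior_draws <- gamma_draws / rowSums(gamma_draws)

# Each row is one possible set of seven shares, and sums to 1.
# Under this prior, every option has a prior mean share of 1/7.
colMeans(prior_draws)
```

Before seeing any data, this prior gives no option a head start: each simulated share averages about one in seven.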

Next, the data update that belief. There are seven response options, so the counts follow a Multinomial distribution. We need the number of respondents giving each answer. For example, 496 of the 2,001 sampled adults gave the right answer of 25%.

The goal of Bayesian analyses is to find the posterior distribution. Both the prior and posterior distributions are in the same family. This is because we chose a conjugate prior for the likelihood. Our choice of the prior distribution means there is a simple updating rule.

For each of the parameters, sum together the ‘prior’ value and the ‘data’ value. The parameter relating to the option of 25% is then 497, as it is 496 plus 1.
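The updating rule can be written out in symbols. This is the standard Dirichlet-Multinomial result. The notation is an assumption added here: p_k is the true share choosing option k, and n_k is the number of respondents choosing it.

```latex
% Prior: uniform over all combinations of the seven shares
(p_1, \dots, p_7) \sim \mathrm{Dirichlet}(1, \dots, 1)

% Likelihood: counts for the seven response options
(n_1, \dots, n_7) \mid p \sim \mathrm{Multinomial}(n;\, p_1, \dots, p_7)

% Posterior: add each count to its prior parameter
(p_1, \dots, p_7) \mid n_1, \dots, n_7 \sim \mathrm{Dirichlet}(1 + n_1, \dots, 1 + n_7)

% For the option of 25%: 1 + 496 = 497
```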

# Simulate from the posterior: weighted counts plus the prior value of 1
DirichletReg::rdirichlet(
  n = number_sims,
  alpha = opinium_survey_df$weighted + 1)

We have our posterior distribution, so we can express uncertainty from sampling:

  • Correct answer of 25%: 25% (23% to 27%).

These credible intervals have a natural interpretation. Given the model and the data, there is a 95% probability that the share of UK adults choosing 25% lies between 23% and 27%.
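As a rough check on that interval, we can skip simulation for a single option. The share for one category of a Dirichlet distribution follows a Beta distribution, with the other six parameters added together. The sketch below assumes the seven counts sum to 2,001, so the other six options total 1,505. The variable names are illustrative, not taken from the published code.

```r
# Sketch: 95% credible interval for the share answering 25%,
# using the Beta marginal of the Dirichlet posterior.
correct_count <- 496
total_count <- 2001   # assumption: the seven counts sum to this total
n_options <- 7

# Posterior parameter for the 25% option: count plus prior value of 1.
alpha_correct <- correct_count + 1
# The other six options combined: their counts plus six prior values of 1.
alpha_others <- (total_count - correct_count) + (n_options - 1)

posterior_mean <- alpha_correct / (alpha_correct + alpha_others)
interval <- qbeta(c(0.025, 0.975), alpha_correct, alpha_others)

# Central estimate, lower bound and upper bound, as rounded percentages.
round(100 * c(posterior_mean, interval))
```

Under these assumptions, the rounded figures agree with the 25% (23% to 27%) reported above.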

More complex models could incorporate uncertainty from survey weights and other sources. Using a uniform prior distribution and the weighted counts is a simple method. If we had an older survey, we could encode its results in our choice of prior distribution. Different choices lead to different models with different credible intervals.
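To illustrate how a different prior changes the result, here is a sketch that encodes an older survey in the prior. The older survey's numbers and the `discount` weight are entirely hypothetical, invented for this example. Only the 496 out of 2,001 comes from the survey above. As before, the seven options collapse into two groups: the 25% option and everything else.

```r
# HYPOTHETICAL older survey: these numbers are invented for illustration.
old_correct <- 240
old_total <- 1000
discount <- 0.5   # hypothetical weight given to the older survey

# Prior: the uniform values (1 and 6) plus the discounted older counts.
prior_correct <- 1 + discount * old_correct
prior_others <- 6 + discount * (old_total - old_correct)

# Data from the survey above: 496 of 2,001 chose 25%.
post_correct <- prior_correct + 496
post_others <- prior_others + (2001 - 496)

informed_interval <- qbeta(c(0.025, 0.975), post_correct, post_others)
uniform_interval <- qbeta(c(0.025, 0.975), 1 + 496, 6 + 1505)

# Compare the widths of the two 95% credible intervals.
diff(informed_interval)
diff(uniform_interval)
```

With these made-up inputs, the informed prior acts like extra respondents, so its interval is narrower than the one from the uniform prior. How much weight an older survey deserves is a modelling judgement.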

The code is available on GitHub and R Pubs.

--

Anthony B. Masters

This blog looks at the use of statistics in Britain and beyond. It is written by RSS Statistical Ambassador and Chartered Statistician @anthonybmasters.