Statistics is the science of data and uncertainty. Uncertainty is inherent in statistical work, affecting analyses and interpretations, and it is unavoidable when seeking to predict the future.
Communicating that uncertainty to others is often crucial. If people are aware of how uncertain estimates are, that could affect their decisions.
Uncertainty and why it matters
The Government Statistical Service recently hosted a webinar on how we communicate uncertainty. The speakers were:
- Charles Lound (Office for National Statistics)
- Cathryn Blair (Northern Ireland Statistics and Research Agency)
- Dr Sarah Dryhurst (Winton Centre for Risk and Evidence Communication)
- Jonathan Tecwyn (Department for Education)
As the Government Statistical Service guidance says:
Uncertainty is an inherent aspect of statistics, but the term is often misinterpreted, possibly implying that the statistics are unusable, or simply wrong. As a result, statistics producers might have understandable concern that pointing out limitations in the statistics could reduce users’ confidence in the published figures. This should not be the case.
How can analysts communicate in ways that:
- promote awareness and understanding of uncertainty;
- maintain trust in the numbers and in those who produce statistics?
Why does uncertainty matter?
In 1997, the US National Weather Service predicted the Red River would crest at around 49 feet (about 15m), assuming average precipitation. A lower forecast of 47.5 feet assumed no further rain.
The prediction came six weeks before the Spring 1997 floods. City officials prepared for a crest of 52 feet (15.8m), with sandbags and levees.
There was no quantitative statement of uncertainty.
Some officials took the two figures as a plausible range. Others thought 49 feet was the greatest plausible crest. Repetition of the 49-foot figure over-emphasised that point estimate.
Disaster struck: by 21st April, the Red River had risen above 54.3 feet (16.6m). The flood devastated Grand Forks and East Grand Forks, displacing thousands of people. Total estimated damages were about $1–2bn.
What kinds of uncertainty are there?
Uncertainty can take many forms:
- Aleatory uncertainty: the natural randomness in a process. If I had a fair die, the rolls would vary with known probabilities.
- Epistemic uncertainty: the scientific uncertainty in our model of a process. Imagine I did not know if the die was fair. I am uncertain of the chance of rolling each value.
In practice, this distinction can blur. The UK government toolkit for analysts offers a third kind:
- Ontological uncertainty: unrecognised ignorance in our model, or failure to comprehend unprecedented circumstances. Someone could be swapping fair dice for weighted ones without my knowledge.
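The die example can be sketched in code. The following is a minimal Python illustration (the number of rolls and random seed are arbitrary choices, not from the webinar): aleatory uncertainty is the irreducible randomness of rolls from a die with known probabilities; epistemic uncertainty is our imperfect estimate of those probabilities from limited data.

```python
import random

random.seed(42)

# Aleatory uncertainty: a fair die has known probabilities (1/6 each),
# yet individual rolls still vary at random.
fair_probs = [1 / 6] * 6
rolls = [random.randint(1, 6) for _ in range(600)]

# Epistemic uncertainty: if we do not know whether the die is fair,
# we can only estimate each face's probability from observed rolls.
counts = [rolls.count(face) for face in range(1, 7)]
estimated_probs = [c / len(rolls) for c in counts]

for face, (p_true, p_hat) in enumerate(zip(fair_probs, estimated_probs), start=1):
    print(f"face {face}: true p = {p_true:.3f}, estimated p = {p_hat:.3f}")

# More rolls shrink the epistemic uncertainty (the estimates converge
# on the true probabilities); the aleatory uncertainty never shrinks.
```

Collecting more data narrows the epistemic gap but leaves the aleatory spread untouched, which is why the two are worth distinguishing even when the boundary is fuzzy.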
Where does uncertainty come from?
There are three main sources of uncertainty in analyses:
- Data: data can come from random processes, or have missing entries.
- Assumptions: model assumptions could take a range of plausible values.
- Models: the particular techniques or models we choose to use. Different analysts may choose different models, which affects the estimates produced.
Sometimes, you may see statements like this accompanying statistics:
Care should be taken when using these statistics.
This caution raises more questions than it answers. Why do I need to take care? How should I do so? Left unanswered, readers may misinterpret the figures or look elsewhere.
As Cathryn Blair says, there is often a balance between concision and giving detail. This problem is acute for social media posts.
The webinar puts forward four ways of communicating uncertainty:
- Show the process: say where the uncertainty comes from;
- Describe the uncertainty: say how large that uncertainty is;
- Illustrate the uncertainty: show people what that uncertainty means;
- Tell them what you can and cannot do: help readers to interpret the statistics.
Showing the process
By showing the process, readers can understand where uncertainty comes from. Charles Lound highlighted the ONS infection pilot survey of private English households:
This estimate is based on swab tests collected from 25,662 participants, of which eight individuals from eight different households tested positive for COVID-19.
A later paragraph also shows what this household survey excludes:
As this is a household survey, our figures do not include people staying in hospitals, care homes or other institutional settings.
Describing the uncertainty
Dr Sarah Dryhurst presented on testing uncertainty communications. We need to understand: who communicates what, in what form, to whom, and to what effect.
A 2020 PNAS paper by van der Bles and others studied the effects of different wordings across five experiments. The third experiment was a randomised trial with eight variants of wording describing UK unemployment statistics.
Adding “estimated” (a verbal cue) did not appear to convey uncertainty well.
Two treatments did not show a significant decrease in trust in the numbers:
- A numerical range with the point estimate: “by 116,000 (range between 17,000 and 215,000)…”
- An implicit verbal statement: “by 116,000 compared with the same period last year, although there is a range around this figure: could be somewhat higher or lower…”
There was no major difference between the control and the other variants in trust in civil servants. A field experiment using a BBC News article drew similar conclusions.
Illustrating the uncertainty
Dr Dryhurst’s presentation also discussed different ways of illustrating uncertainty.
For example, error lines added to bar charts can create ‘within the bar’ bias, where values inside the bar seem more likely than values outside it. Violin plots are an alternative, showing the shape of the whole distribution.
Presenting uncertainty around trends did not appear to compromise understanding. It also did not undermine trust in the data or source.
Dr Dryhurst indicated that readers showed greater nuance in their interpretations. That suggests we could be bolder in showing uncertainty.
We need to test graphical representations. Some formats, such as hurricane path cones, may fail to convey uncertainty.
Telling readers what they can and cannot do
Analysts can show what readers can and cannot do with published statistics.
Charles Lound drew attention to the publication of faster indicators of economic activity. The January 2020 article opens:
It should be noted that these indicators are not intended to be an early measure or predictor of gross domestic product (GDP) and their potential relationship with headline GDP should be interpreted with caution. Instead they provide an early picture of a range of activities that are likely to have an impact on the economy, supplementing official economic statistics.
Imperfect knowledge means uncertain decisions. The UK government toolkit shows different ways of analysing uncertainty for decision-makers, including:
- Monte Carlo techniques: define distributions for all inputs and their correlations, generate random values for each input accounting for those correlations, then compute the outputs and analyse their distribution. This produces a full probability profile, but the results are subject to simulation error and the method takes time and computing power.
- Convolution: This technique combines distributions to give one distribution. This approach avoids simulation error. Convolution can be difficult to do with many variables.
- Summation in quadrature: suppose the uncertainties are all independent and Normally distributed. The combined uncertainty is then the square root of the sum of the squared individual uncertainties. This is easy to calculate, but the assumptions may be inappropriate.
- Using past errors: We compare repeated forecasts to actual values, then calculate errors. Those errors can then form an estimate for potential error in future forecasts. This method captures all forms of uncertainty, but requires long-run stability.
- Scenario analysis: Analysts identify the main sources of uncertainty, and find plausible ranges. We can then construct a set of coherent scenarios. This is simple in computation. It is subjective, with possible biases.
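Two of the techniques above can be compared on a toy model. The following is a minimal Python sketch (the input means, standard deviations, and sample size are hypothetical, not from the toolkit): we add two independent, Normally distributed inputs, estimate the output's spread by Monte Carlo simulation, and check it against summation in quadrature.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical inputs: two independent, Normally distributed quantities.
mean_a, sd_a = 100.0, 3.0
mean_b, sd_b = 50.0, 4.0

# Monte Carlo: sample each input, compute the output, and inspect its spread.
samples = [
    random.gauss(mean_a, sd_a) + random.gauss(mean_b, sd_b)
    for _ in range(100_000)
]
mc_sd = statistics.stdev(samples)

# Summation in quadrature: for independent Normal inputs, the output's
# standard deviation is the square root of the sum of squared input SDs.
quad_sd = math.sqrt(sd_a**2 + sd_b**2)  # = 5.0

print(f"Monte Carlo SD: {mc_sd:.2f}")
print(f"Quadrature SD:  {quad_sd:.2f}")
# The two agree up to simulation error, which shrinks as the sample grows.
```

For this simple additive model the quadrature answer is exact and instant; Monte Carlo earns its computing cost only when the model is non-linear, the inputs are correlated, or the distributions are non-Normal.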
After that, the real work of communicating uncertainty begins.