You keep using that word. I do not think it means what you think it means.

This is a very famous line spoken by the character Inigo Montoya in The Princess Bride. Statistics used in public debate and research often have a similar difficulty: a statistic may not mean what some people think it means.

This article will consider a few examples.

‘Most children in poverty are in working families’

Many people are trapped in low paid, insecure work and 70% of children in poverty now live in working families.

In the Department for Work & Pensions' Households Below Average Income report, there are five definitions of poverty (or low income).

A household is defined as being in poverty if its net disposable household income lies below a specified threshold: 60% of the median (the middle value). This threshold is given for both relative low income, where the threshold moves as median income changes, and absolute low income, where the threshold is 60% of the 2010/11 median, uprated in line with inflation. Weekly net equivalised household income is measured both before and after housing costs.
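
As a rough sketch of those two thresholds, here is a minimal Python illustration. The function names and example figures are invented, and the inflation uprating is simplified to a single factor:

```python
def in_relative_low_income(weekly_income, current_median):
    """Relative low income: below 60% of the current median income."""
    return weekly_income < 0.6 * current_median

def in_absolute_low_income(weekly_income, median_2010_11, inflation_factor):
    """Absolute low income: below 60% of the 2010/11 median, uprated by inflation."""
    return weekly_income < 0.6 * median_2010_11 * inflation_factor

# Illustrative (made-up) weekly net equivalised incomes, before housing costs
print(in_relative_low_income(weekly_income=250, current_median=500))   # True
print(in_absolute_low_income(weekly_income=250, median_2010_11=430,
                             inflation_factor=1.15))                   # True
```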

Additionally, there is a measure of children in low income and material deprivation: a household below 70% of the relative median income, whose family also scores 25 or higher (out of 100) on questions about access to 21 goods and services.

For 2017/18, 56% of children living in households without a working adult were estimated to be in poverty (below 60% of the median relative income, before housing costs). The figure for children in households with a working adult (‘working households’) is 17%.

That gap between poverty rates is contrary to the further claim by Peter Stefanovic and others that “work is no longer a route out of poverty”.

The DWP will also report on the Social Metrics Commission definition of poverty in the near future. (Source: HBAI report 2017/18)

In the latest year, 88% of children were estimated to be in working families. Although the poverty rate is much higher in workless households, working families are far more numerous. Consequently, children in poverty in working families were estimated to outnumber those in workless households. This has been true since 2004/05.

In 2017/18, 69% of children in poverty were in working households. (70% is the figure for relative income after housing costs.)
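
To see how those headline rates fit together, here is a minimal arithmetic sketch in Python, using the before-housing-costs figures quoted above (the variable names are mine):

```python
# Before-housing-costs figures quoted above, 2017/18
share_working = 0.88     # share of children in working families
share_workless = 0.12    # share of children in workless households
rate_working = 0.17      # poverty rate among children in working families
rate_workless = 0.56     # poverty rate among children in workless households

poor_working = share_working * rate_working     # ~0.15 of all children
poor_workless = share_workless * rate_workless  # ~0.07 of all children

# Share of children in poverty who live in working families
print(poor_working / (poor_working + poor_workless))  # ~0.69
```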

‘A Kimberly-Clark digital campaign increased their sales by 50%’

An Adobe CMO article in 2013 — entitled ‘15 mind-blowing stats about re-targeting’ — claimed that:

11. CPG company Kimberly-Clark relies on re-targeting, saying it is seeing 50 to 60 percent higher conversion rates among consumers who have been re-targeted.

Following the links reveals something different. In a 2012 interview for Digiday, Jeff Jarrett (VP of Global Digital Marketing) said:

Consumers who visit the brand site are 20 percent more likely to act on a message than a consumer who has not expressed this interest. Furthermore, we are seeing stronger conversion rates among these consumers: 50–60 percent conversion rates.

The VP’s statement suggests that, among potential consumers who have visited the brand website, Kimberly-Clark sees 50–60% conversion rates through its digital re-targeting campaigns.

This is not remotely the same as the claim that Kimberly-Clark’s digital campaigning in general has increased their sales by 50%.

‘Half of all marriages end in divorce’

This claim is often based on comparing the number of divorces in a year with the number of marriages in the same year. However, how many marriages and divorces there were in a year does not tell you how many of those marriages end in divorce. Unless the marriage is exceptionally short, the people getting married in one year are not the same couples getting divorced in that same year.

We should instead look at couples who wed in specified years in England and Wales. For marriages conducted in 1970, 22% of couples had ended their marriage in divorce by their 15th wedding anniversary. For couples wed in 1990, that figure was 33%.

This shows the proportion of marriages that had ended in divorce by each anniversary. (Graph: ONS Archive)

For England and Wales, the estimated percentage of marriages ending in divorce (assuming 2010 divorce and mortality rates throughout) is 42%. There are some signs that couples wed more recently have lower divorce rates than earlier cohorts. After five years, 10% of marriages conducted in 2000 in England and Wales had ended in divorce. For marriages conducted in 2005, the five-year figure was 8%.
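
That 42% estimate comes from a life-table style calculation: apply duration-specific divorce rates to a notional cohort of marriages and accumulate the share that has divorced. Here is a minimal Python sketch, with made-up rates rather than the ONS 2010 rates, and ignoring mortality for simplicity:

```python
# Hypothetical divorce rates by year of marriage: the probability of divorcing
# in that year, given the marriage is still intact. Illustrative numbers only.
annual_divorce_rates = [0.005, 0.015, 0.025, 0.030, 0.028] + [0.020] * 45

still_married = 1.0   # share of the notional cohort still married
divorced = 0.0        # cumulative share whose marriage ended in divorce

for rate in annual_divorce_rates:
    divorced += still_married * rate
    still_married *= 1 - rate

print(f"Estimated share of marriages ending in divorce: {divorced:.0%}")
```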

‘The p-value is the probability this result happened by chance’

In seeking truth, empirical science pursues replication of results — after one experiment has concluded, other researchers run the same experiment, aiming to reproduce the result.

Many published results do not replicate when other researchers try. At the heart of this problem lies a calculation in statistical testing called the p-value. Statistical testing starts with a null hypothesis (usually, that there is no effect) and an alternative hypothesis (typically, that there is such an effect).

The p-value answers this question: under the modelling assumption that the null hypothesis is true, what is the probability that we would observe the test data, or something more extreme?
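
As an illustration of that definition, here is a minimal sketch with an invented coin-flipping example: the (one-sided) p-value is the probability, assuming a fair coin, of a result at least as extreme as the one observed.

```python
from scipy.stats import binom

# Invented example: we observe 60 heads in 100 flips.
# Null hypothesis: the coin is fair (p = 0.5).
# One-sided p-value: probability of 60 or more heads under the null.
p_value = binom.sf(59, n=100, p=0.5)
print(p_value)  # ~0.028
```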

The statistician Ronald Fisher sought to use the p-value for deductive reasoning: if the observed result would be sufficiently improbable under the null hypothesis, we reject that null hypothesis. The p-value is sometimes confused with the probability that the null hypothesis is true. Frequentist statistics does not attach probabilities to the hypotheses themselves.

This type of statistical testing has been used in many areas of scientific research. It has been commonplace to use 0.05 as a threshold. This value is entirely a convention.

Results with a p-value of less than 0.05 are labelled ‘significant’ and those above ‘non-significant’. Hundreds of statisticians have called for this practice to end.

Two studies could find the same effect, but label their results differently because of study design. (Image: Nature)

The distinction between significant and non-significant results comes from study design and how uncertain we are. The p-value is not the ‘probability that the result happened by chance’, nor the false discovery rate. It is a calculation made under modelling assumptions, one of which is that the null hypothesis is true. The p-value is a statement about how the observed data relate to the model.

False discoveries are an important part of diagnostic screening — and highlight the difference between that risk and p-values.

To use Prof David Colquhoun’s example, suppose a mild cognitive impairment affects 1% of the population. We also have a test which wrongly labels 5% of people who are free of the condition as having it, and which correctly detects the condition in 80% of sufferers.

Out of 1,000 people, we would then have 10 people with the condition, 8 of whom are correctly identified by our test. Of the other 990 people, around 49 are falsely found by the test to have the condition. Consequently, only around 14% of people who test positive for the condition actually have it.

Our false discovery risk is 86%, but the p-value equivalent is only 5%.
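
A minimal Python sketch of that screening arithmetic, using the prevalence, false positive rate, and detection rate quoted above (the variable names are mine):

```python
population = 1_000
prevalence = 0.01           # 1% of people have the condition
false_positive_rate = 0.05  # 5% of unaffected people wrongly test positive
detection_rate = 0.80       # 80% of sufferers correctly test positive

sufferers = population * prevalence                               # 10 people
true_positives = sufferers * detection_rate                       # 8 people
false_positives = (population - sufferers) * false_positive_rate  # ~49.5 people

positive_tests = true_positives + false_positives
print(true_positives / positive_tests)   # ~0.14: positives who have the condition
print(false_positives / positive_tests)  # ~0.86: the false discovery risk
```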

Empirical research should be concerned about uncertainty of estimates and differences between methods, rather than calling results ‘significant’ and ‘non-significant’.

This blog looks at the use of statistics in Britain and beyond. It is written by RSS Statistical Ambassador and Chartered Statistician @anthonybmasters.
