# The Princess Bride Problem

You keep using that word. I do not think it means what you think it means.

This is a very famous line spoken by the character Inigo Montoya in *The Princess Bride*. Statistics used in public debate and research often have a similar difficulty: a statistic may not mean what some people think it means.

This article will consider a few examples.

# ‘Most children in poverty are in working families’

This is a statistic commonly cited in response to Government ministers and spokespersons citing the latest ONS labour statistics. For instance, Mike Amesbury MP (Labour, Weaver Vale), the Shadow Minister for Employment, stated in a press release:

Many people are trapped in low paid, insecure work and 70% of children in poverty now live in working families.

In the Department for Work and Pensions' Households Below Average Income report, there are five definitions of poverty (or low income).

A household is defined as being in poverty if its net disposable household income lies below a specified threshold: 60% of the median (the middle value). This is given for both relative low income, where the threshold moves as the median income changes, and absolute low income, where the threshold is fixed at 60% of the 2010/11 median income and then moved in line with inflation. Weekly net equivalised household income is expressed both before and after housing costs.
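As a rough sketch of how the two thresholds differ, the snippet below computes a relative and an absolute poverty line. The income list, the base-year median and the price index are all made-up values for illustration, and `low_income_threshold` is a hypothetical helper, not the Department's method.

```python
from statistics import median

def low_income_threshold(median_income, fraction=0.6):
    """Return the poverty line as a fraction of a given median income."""
    return fraction * median_income

# Hypothetical weekly equivalised incomes in pounds (illustrative only).
incomes = [150, 220, 310, 400, 480, 520, 610, 750, 900]

# Relative low income: the line follows this year's median.
relative_line = low_income_threshold(median(incomes))

# Absolute low income: the line is fixed at a base-year median,
# then uprated by inflation. Both values here are assumptions.
base_year_median = 420
price_index_since_base = 1.10
absolute_line = low_income_threshold(base_year_median) * price_index_since_base

below_relative = [x for x in incomes if x < relative_line]
print(relative_line, absolute_line, below_relative)
```

In this toy data, the relative line sits at 60% of the current median of 480, so only the two lowest incomes fall below it.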

Additionally, there is a measure of children in material deprivation: a child is counted if their household income is below 70% of the relative median income and their family scores 25 or higher (out of 100) on questions about access to 21 goods and services.

For 2017/18, 56% of children living in households without a working adult were estimated to be in poverty (below 60% of the median in relative income, before housing costs). The figure for children in households with a working adult (‘working households’) is 17%.

This is contrary to the further claim by Peter Stefanovic and others that “work is no longer a route out of poverty”.

In the latest year, 88% of children were estimated to be in working families. Because so many more children live in working families, the lower poverty rate among them still produces a larger number: children in poverty in working families were estimated to outnumber those in workless households. This has been true since 2004/05.

In 2017/18, 69% of children in poverty were in working households. (70% is the figure for relative income after housing costs.)
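The arithmetic behind that share can be sketched in a few lines. This assumes the 88%, 17% and 56% figures quoted above all refer to the same year and the same measure (relative income, before housing costs); the function name is hypothetical.

```python
def share_of_poor_in_working(share_working, rate_working, rate_workless):
    """Share of all children in poverty who live in working households."""
    poor_working = share_working * rate_working
    poor_workless = (1 - share_working) * rate_workless
    return poor_working / (poor_working + poor_workless)

# 88% of children in working households, with poverty rates of 17% and 56%.
share = share_of_poor_in_working(0.88, 0.17, 0.56)
print(f"{share:.0%}")  # 69%
```

A low rate applied to a large group (17% of 88%) gives more children than a high rate applied to a small group (56% of 12%).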

# ‘A Kimberly-Clark digital campaign increased their sales by 50%’

The claim that Kimberly-Clark’s digital campaigning increased their sales by 50% has gone through, not one, but *two* phases of misinterpretation.

An Adobe CMO article in 2013 — entitled ‘15 mind-blowing stats about re-targeting’ — claimed that:

11. CPG company Kimberly-Clark relies on re-targeting, saying it is seeing 50 to 60 percent higher conversion rates among consumers who have been re-targeted.

Following the links reveals something different. In a 2012 interview for Digiday, Jeff Jarrett (VP of Global Digital Marketing) said:

Consumers who visit the brand site are 20 percent more likely to act on a message than a consumer who has not expressed this interest. Furthermore, we are seeing stronger conversion rates among these consumers: 50–60 percent conversion rates.

The VP’s statement is suggesting that — among potential consumers who visit the brand website — Kimberly-Clark see a 50–60% conversion through their digital re-targeting campaigns.

This is not remotely the same as claiming that Kimberly-Clark’s digital campaigning, in general, has increased their sales by 50%.

# ‘Half of all marriages end in divorce’

It is commonly asserted, in the United States and elsewhere, that half of all marriages end in divorce. This claim appears to rest on the observation that, in a given year, the number of divorces is approximately half the number of marriages. As an example, in England and Wales in 2009 there were 113,949 divorces and 232,443 marriages.

However, how many marriages and divorces there were in a year does not tell you about how many of those marriages end through divorce. Unless the marriage is exceptionally short, people getting married in one year are not the same pairs getting divorced in that same year.
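A toy calculation shows how the same-year ratio can mislead. All numbers here are hypothetical: it assumes that 30% of every cohort eventually divorces, that every divorce happens exactly ten years after the wedding, and that the number of weddings has fallen over the decade.

```python
# Hypothetical numbers, chosen only to illustrate the point.
COHORT_DIVORCE_RATE = 0.30  # share of each year's marriages that ever end in divorce
YEARS_TO_DIVORCE = 10       # simplification: every divorce comes after exactly 10 years

# Made-up marriage counts for a period of falling marriage numbers.
marriages = {2000: 400_000, 2010: 240_000}

# Divorces granted in 2010 come from couples who married in 2000.
divorces_2010 = COHORT_DIVORCE_RATE * marriages[2010 - YEARS_TO_DIVORCE]
same_year_ratio = divorces_2010 / marriages[2010]

print(f"{same_year_ratio:.0%}")  # 50%
```

Under these assumptions only 30% of marriages ever end in divorce, yet divorces in 2010 equal half of that year's marriages, simply because fewer couples are marrying.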

We should look instead at couples who wed in specified years, in England and Wales. Of marriages conducted in 1970, 22% had ended in divorce within 15 years. For couples wed in 1990, that figure was 33%.

For England and Wales, the estimated percentage of marriages ending in divorce — assuming 2010 divorce and mortality rates throughout — is 42%. There are some signs that couples wed more recently have lower divorce rates than earlier cohorts. After five years, 10% of marriages conducted in 2000 in England and Wales had finished in divorce. For 2005, the five-year divorce rate was 8%.

# ‘The p-value is the probability this result happened by chance’

Empirical science seeks to establish truth, as accurately as we can. We are often interested in whether something we have observed is real or just happened by chance. Are our discoveries false?

In seeking truth, empirical science pursues replication of results — after one experiment has concluded, other researchers run the same experiment, aiming to reproduce the result.

At the heart of this problem lies a calculation in statistical testing called the p-value. Statistical testing starts with a null hypothesis (usually, that there is no effect) and an alternative hypothesis (typically, that there is such an effect).

The p-value answers the following question: assuming the model holds and the null hypothesis is true, what is the probability that we would observe the test data or something more extreme?
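To make the definition concrete, here is a minimal sketch for a made-up experiment: 100 coin flips giving 60 heads, with the null hypothesis that the coin is fair. The scenario and the function name are assumptions for illustration, and the test is one-sided (only results with at least as many heads count as ‘more extreme’).

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) when the null hypothesis holds."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: 60 heads in 100 flips, null hypothesis of a fair coin.
p = binomial_p_value(60, 100)
print(f"{p:.3f}")
```

The result is a probability about the data given the null hypothesis. It says nothing directly about how probable the null hypothesis is.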

The statistician Ronald Fisher sought to use the p-value for deductive reasoning: if the observed result would have a very low probability were the null hypothesis true, then we reject that null hypothesis. The p-value is sometimes confused with the probability that the null hypothesis is true. Frequentist statistics does not attach probabilities to the hypotheses themselves.

This type of statistical testing has been used in many areas of scientific research. It has been commonplace to use 0.05 as a threshold. This value is entirely a convention.

Results with a p-value of less than 0.05 are labelled ‘significant’ and those above as ‘non-significant’. Hundreds of statisticians have called for this practice to end.

The distinction between significant and non-significant results depends on study design and on how uncertain we are. The p-value is not the ‘probability that the result happened by chance’, nor is it the false discovery rate. It is computed under modelling assumptions, including that the null hypothesis is true. The p-value is a statement about how the observed data relate to the model.

False discoveries are an important concern in diagnostic screening, and they highlight the difference between the false discovery risk and the p-value.

To use Prof David Colquhoun’s example, say there was a mild cognitive impairment suffered by 1% of the population. We also have a test, which wrongly labels 5% of the people who are free of this condition as having it. Additionally, the test correctly detects the condition for 80% of sufferers.

Out of 1,000 people, we then have 10 people with the condition, 8 of whom are correctly identified by our test. Of the other 990 people, about 49 are falsely found by the test to have this condition. Consequently, only about 14% of people who test positive for the condition (8 of roughly 57) actually have it.

Our false discovery risk is 86%, but the p-value equivalent is only 5%.
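The screening arithmetic can be reproduced directly from the three rates in the example. The function and parameter names below are illustrative choices, not taken from the original example.

```python
def false_discovery_risk(prevalence, sensitivity, false_positive_rate):
    """Share of positive test results that are wrong."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return false_positives / (true_positives + false_positives)

# 1% prevalence, 80% of sufferers detected, 5% of healthy people wrongly flagged.
risk = false_discovery_risk(prevalence=0.01, sensitivity=0.80, false_positive_rate=0.05)
print(f"{risk:.0%}")  # 86%
```

The 5% rate is conditional on a person being healthy, while the 86% is conditional on a positive test. The condition is rare, so the two numbers are far apart.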

Empirical research should be concerned about uncertainty of estimates and differences between methods, rather than calling results ‘significant’ and ‘non-significant’.