Cumulative sums are not independent

Are China’s coronavirus statistics “too perfect”?

A financial news site said Chinese coronavirus figures were “too perfect to mean much”.

A severe statistical misunderstanding underlies this complaint. Cumulative sums are, by definition, not independent observations. This article shows how cumulative sums of pseudo-random data are also ‘too perfect’ by the same measure.

It all adds up

The financial investment magazine Barron's ran the headline:

China’s Coronavirus Figures Don’t Add Up. ‘This Never Happens With Real Data.’

The Chinese government submits statistics about the coronavirus to the World Health Organisation. The article asserts that “a simple mathematical formula” describes cumulative deaths, and that this simple model has “very high accuracy”.

‘Cumulative’ means adding up as you go along. Imagine there were three deaths on the first day, and five on the second day. The cumulative total deaths after two days is eight. On the third day, two more people die. The cumulative total becomes ten.
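The running total in that example can be reproduced in a few lines. This is a minimal Python sketch (the article's own graphs use R), and the variable names are purely illustrative.

```python
from itertools import accumulate

# Daily deaths from the example above: three, then five, then two.
daily_deaths = [3, 5, 2]

# accumulate() yields the running total after each day.
cumulative_deaths = list(accumulate(daily_deaths))

print(cumulative_deaths)  # [3, 8, 10]
```

Each cumulative value contains every earlier day's count, which is why neighbouring values move together.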

Imagine you wanted to model deaths in relation to the average temperature in Wuhan. We could calculate how much of the variance in deaths is explained by the varying temperature.

In the jargon, statisticians call this value the coefficient of determination, or ‘R-squared’. The value goes from 0 to 1 (or 0% to 100%). If it is 1, the model fits exactly: explaining all the variance.

An R-squared value of 0 means the model does not explain any of the variance. (Image: Stephanie Glen)
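As a rough sketch of the calculation, R-squared can be written as one minus the ratio of the residual sum of squares to the total sum of squares. The Python function below is illustrative: the name `r_squared` and the toy numbers are assumptions, not taken from the article's R code.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    # Total variation of the observations around their own mean.
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    # Variation left over after the model's predictions are subtracted.
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total

observed = [1, 2, 3]
perfect_fit = r_squared(observed, [1, 2, 3])  # model matches every point
mean_only = r_squared(observed, [2, 2, 2])    # model always predicts the mean

print(perfect_fit, mean_only)  # 1.0 0.0
```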

Questioning

The Barron's article quotes biostatistician Prof Goodman (NYU):

I have never in my years seen an r-squared of 0.99. As a statistician, it makes me question the data.

The coefficient being reported here was for cumulative deaths. The fundamental problem is that cumulative sums are not independent of one another. The cumulative sum on the second day is carried into the third day's total, and so on.

We can get a very high coefficient with cumulative sums of pseudo-random data. Imagine we have 100 independent ‘rolls’ of a six-sided die:

The seed is set. (Image: ggplot2)

We can then add up the values after each roll. The model is pre-set: it is the true mean (3.5) multiplied by the number of rolls.

The R-squared value is 0.998.

An estimated central 95% of the probability distribution is within the shaded area. (Image: ggplot2)
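The die-roll experiment can be sketched in Python as follows. The seed here is arbitrary and is not the one behind the article's graphs, so the exact R-squared will differ from the 0.998 reported above. For contrast, the sketch also scores the same pre-set mean of 3.5 against the raw, independent rolls.

```python
import random

random.seed(2020)  # arbitrary seed, not the one used for the article's graphs

n_rolls = 100
rolls = [random.randint(1, 6) for _ in range(n_rolls)]

# Cumulative sum after each roll.
cumulative = []
total = 0
for roll in rolls:
    total += roll
    cumulative.append(total)

# Pre-set model: the true mean (3.5) multiplied by the number of rolls so far.
model = [3.5 * n for n in range(1, n_rolls + 1)]

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total

r2_cumulative = r_squared(cumulative, model)
r2_raw = r_squared(rolls, [3.5] * n_rolls)

print(f"R-squared, cumulative sums: {r2_cumulative:.3f}")
print(f"R-squared, raw rolls:       {r2_raw:.3f}")
```

The cumulative version should score close to 1, while the raw rolls score at or just below 0: predicting 3.5 for every roll can never beat the sample mean, so with a pre-set model the formula can return a slightly negative value.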

As the number of rolls increases, the mean value tends towards its true value (of 3.5). The cumulative sum of roll values tends towards 3.5 times the number of rolls.

This is the Law of Large Numbers in glasses.
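To see the convergence numerically, the short Python sketch below tracks the running mean of simulated die rolls at a few checkpoints. The seed and the checkpoint sizes are arbitrary choices for illustration.

```python
import random

random.seed(2020)  # arbitrary seed, chosen only for reproducibility

n_rolls = 100_000
checkpoints = {10, 100, 1_000, 10_000, 100_000}

running_means = {}
total = 0
for n in range(1, n_rolls + 1):
    total += random.randint(1, 6)
    if n in checkpoints:
        # Mean of the first n rolls; equivalently, cumulative sum divided by n.
        running_means[n] = total / n

for n, mean in running_means.items():
    print(f"{n:>7} rolls: mean = {mean:.3f}")
```

The later checkpoints should sit closer to 3.5 than the early ones, which is the same behaviour that pulls the cumulative sum towards 3.5 times the number of rolls.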

The R code for the graphs is available on GitHub.

This blog looks at the use of statistics in Britain and beyond. It is written by RSS Statistical Ambassador and Chartered Statistician @anthonybmasters.
