Cumulative sums are not independent
A financial news site said Chinese coronavirus figures were “too perfect to mean much”.
A severe statistical misunderstanding underlies this complaint. Cumulative sums are, by definition, not independent observations. This article shows how cumulative sums of pseudo-random data are also ‘too perfect’ by the same measure.
It all adds up
The financial magazine Barron's ran the headline:
China’s Coronavirus Figures Don’t Add Up. ‘This Never Happens With Real Data.’
The Chinese government submits statistics about the coronavirus to the World Health Organisation. The article asserts that “a simple mathematical formula” describes cumulative deaths, and that this simple model has “very high accuracy”.
‘Cumulative’ means adding up as you go along. Imagine there were three deaths on the first day, and five on the second day. The cumulative total deaths after two days is eight. On the third day, two more people die. The cumulative total becomes ten.
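The running total in this example can be reproduced in a few lines of Python. This is only an illustrative sketch: the variable names are made up here, and the daily figures are the imagined ones from the paragraph above.

```python
from itertools import accumulate

# Imagined daily deaths from the example: day 1, day 2, day 3
daily_deaths = [3, 5, 2]

# Each cumulative total is the previous total plus that day's count
cumulative_deaths = list(accumulate(daily_deaths))

print(cumulative_deaths)  # [3, 8, 10]
```

Every entry after the first is built from the entry before it, which is why the totals cannot be treated as separate observations.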
Suppose we wanted to model deaths as a function of average temperature in Wuhan. We would then calculate how much of the variance in deaths is explained by the varying temperature.
Statisticians call this value the coefficient of determination, or ‘R-squared’. It usually ranges from 0 to 1 (or 0% to 100%). A value of 1 means the model fits exactly, explaining all the variance.
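As a rough sketch, the coefficient can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the toy numbers below are assumptions for illustration, not taken from the Barron's article.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    # Variation the model fails to explain
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total variation of the observations around their own mean
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total

observed = [1, 2, 3, 4]

# A model that reproduces every observation explains all the variance
print(r_squared(observed, [1, 2, 3, 4]))  # 1.0

# A model that only ever predicts the mean explains none of it
print(r_squared(observed, [2.5, 2.5, 2.5, 2.5]))  # 0.0
```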
The Barron's article quotes biostatistician Prof Goodman (NYU):
I have never in my years seen an r-squared of 0.99. As a statistician, it makes me question the data.
The coefficient quoted for this data was calculated on cumulative deaths. The fundamental problem is that cumulative sums are not independent of one another: each day's total contains the previous day's total. The cumulative sum on the second day affects the third, and so on.
We can get a very high coefficient with cumulative sums of pseudo-random data. Imagine we have 100 independent ‘rolls’ of a six-sided die:
We can then add up the values after each roll. The model is pre-set: it is the true mean (3.5) multiplied by the number of rolls.
The R-squared value is 0.998.
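The experiment can be sketched in Python. This is not the author's R code: the seed, the function names and the use of Python's random module are all assumptions made here, so the exact figure will differ from 0.998, although it should typically come out similarly close to 1. For contrast, the sketch also scores the raw, independent rolls against the constant model of 3.5.

```python
import random
from itertools import accumulate

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total

def simulate(n_rolls=100, seed=2020):
    """Roll a fair die n_rolls times; score cumulative and raw values."""
    rng = random.Random(seed)  # arbitrary seed, chosen only for repeatability
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]

    # Cumulative totals against the pre-set model: 3.5 times the roll number
    totals = list(accumulate(rolls))
    model = [3.5 * n for n in range(1, n_rolls + 1)]
    r2_cumulative = r_squared(totals, model)

    # The same rolls, not accumulated, against the pre-set mean of 3.5
    r2_raw = r_squared(rolls, [3.5] * n_rolls)
    return r2_cumulative, r2_raw

r2_cumulative, r2_raw = simulate()
print(f"cumulative sums: R-squared = {r2_cumulative:.3f}")
print(f"raw rolls:       R-squared = {r2_raw:.3f}")
```

The raw rolls can never score above zero against the constant 3.5, because no constant fits a sample better than its own mean. Adding up the very same numbers is what produces the near-perfect fit.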
As the number of rolls increases, the sample mean tends towards the true mean (of 3.5). The cumulative sum of roll values therefore stays proportionally close to 3.5 times the number of rolls.
This is the Law of Large Numbers in glasses.
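The convergence can be checked numerically with a short Python sketch. The seed and the checkpoints are arbitrary choices made for this illustration; no particular output values are claimed.

```python
import random

def running_means(n_rolls, seed=1):
    """Return the mean of the first n rolls, for n = 1 .. n_rolls."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    total = 0
    means = []
    for n in range(1, n_rolls + 1):
        total += rng.randint(1, 6)  # one roll of a fair six-sided die
        means.append(total / n)     # cumulative sum divided by roll count
    return means

means = running_means(100_000)

# The running mean should settle near 3.5 as the roll count grows
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} rolls: mean = {means[n - 1]:.3f}")
```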
The R code for the graphs is available on GitHub.