Correlations and Time Series
Over 1,700 users shared a series of Twitter posts, which claimed “proof” that:
The vast majority of COVID deaths in England since July have been mislabelled false positive deaths.
The suggestion is that we have more COVID-19 deaths due to greater testing. The posts assert increased testing led to lots of incorrect diagnoses. Their “proof” was high correlation between two time series.
This article focuses on the statistical problems with measuring correlation between time series. For these reasons, among others, the posts' conclusion is false.
Dr Craig, a pathologist, starts with this graph:
The graph cites no data source; the source should be the PHE Coronavirus Dashboard. Dr Craig writes:
You will notice that the shape of the two curves are very similar. We can test this. The chart below demonstrates that since August 93% of the rise in deaths can be accounted for by the rise in the number of tests done in hospitals over the 28 days preceding.
I have never seen such a tight correlation in my career. Biology just isn’t like that. But there it is — 93%.
People should not analyse time series in this way.
Pearson’s correlation is a statistic measuring the linear association between two variables. Two variables with high positive correlation tend to increase together and decrease together.
Observations in a time series are not independent: time connects the points.
The usual interpretation of Pearson’s correlation assumes independent observations, which is why it is not appropriate to apply it directly to time series.
Values over time can have internal structure, such as trends, seasonality, or autocorrelation. Time series analysis needs to account for these phenomena.
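The effect of this internal structure can be illustrated with a small simulation. The sketch below uses synthetic data only (not the figures from the thread) and assumes NumPy is available. It generates pairs of independent random walks, which are unrelated by construction, and counts how often their Pearson correlation looks strong. It then repeats the calculation on the day-to-day changes, which are independent observations. The series count, series length, and 0.5 threshold are arbitrary choices for illustration.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(seed=0)
n_series, n_days = 1000, 100  # arbitrary sizes, chosen for illustration

# Independent daily changes: series A and series B share nothing.
steps_a = rng.normal(size=(n_series, n_days))
steps_b = rng.normal(size=(n_series, n_days))

# Cumulative sums turn the independent changes into random walks,
# which are strongly autocorrelated.
walks_a = steps_a.cumsum(axis=1)
walks_b = steps_b.cumsum(axis=1)

# Correlation of the levels (the autocorrelated series).
level_r = np.array([pearson(a, b) for a, b in zip(walks_a, walks_b)])
# Correlation of the daily changes (independent observations).
diff_r = np.array([pearson(a, b) for a, b in zip(steps_a, steps_b)])

# Share of unrelated pairs whose correlation exceeds 0.5 in magnitude.
share_level = float(np.mean(np.abs(level_r) > 0.5))
share_diff = float(np.mean(np.abs(diff_r) > 0.5))

print(f"|r| > 0.5 for levels:  {share_level:.1%} of pairs")
print(f"|r| > 0.5 for changes: {share_diff:.1%} of pairs")
```

A sizeable share of the unrelated random walks show a seemingly strong correlation, while the differenced series almost never do. A high correlation between two trending series is therefore weak evidence of any relationship between them.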
There is a clear mechanism through which testing and COVID-19 deaths both increase. The virus spreads: we test more people and more people die.
The graph implies there would be no COVID-19 deaths if there were only around 1.1m tests in a 28-day period. This is absurd.
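The implication comes from where a fitted straight line crosses zero. As a sketch, write $d$ for deaths, $t$ for the number of tests in the preceding 28 days, and $a$ and $b$ for the fitted intercept and slope. These symbols are introduced here for illustration; the thread does not report the fitted values.

```latex
\hat{d} = a + b\,t
\quad\Longrightarrow\quad
\hat{d} = 0 \ \text{ when } \ t_0 = -\frac{a}{b}
```

Read off the chart in the thread, $t_0$ is around 1.1m tests: the line predicts zero deaths at that level of testing, and negative deaths below it.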
Reduced testing can lead to under-counting of COVID-19 deaths — as it did in the early parts of the pandemic.
False positives do not explain rising lab-confirmed cases
The thrust of Dr Craig’s posts is that lab-confirmed cases are “full of false positives”. When those people die, these are “mislabelled false positive deaths”.
A false positive is when someone who does not have the virus gets a positive test result. A false negative occurs when someone who has the virus receives a negative result.
Diagnostic PCR tests work by detecting the virus’s genetic material in the sample. A swab may fail to collect any viral material, leading to a false negative result.
Because the test targets genetic sequences specific to the virus, a positive result without the virus present is rare: the tests have a very low false positive rate. The ONS methods article states:
For example, in the most recent six-week period (31 July to 10 September), 159 of the 208,730 total samples tested positive. Even if all these positives were false, specificity would still be 99.92%.
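The arithmetic behind the quoted figure can be checked directly. This is a minimal sketch using only the two counts in the quotation. It takes the worst case, in which every positive result is a false positive and every sample is truly negative; the variable names are illustrative.

```python
# Counts quoted from the ONS methods article (31 July to 10 September).
positives = 159
samples = 208_730

# Worst case: every positive is false and every sample is truly negative.
# Specificity = true negatives / (true negatives + false positives).
false_positives = positives
true_negatives = samples - positives
min_specificity = true_negatives / (true_negatives + false_positives)

print(f"Specificity is at least {min_specificity:.2%}")
```

Under that worst case, the false positive rate is at most 159 in 208,730, which is fewer than 8 in 10,000 tests.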
The weekly positive proportion in England has increased in both testing pillars:
As a consequence, most positive test results are correct. False positives do not explain rising lab-confirmed cases.
This is not an exhaustive list of problems in this Twitter thread.
The thread gives the wrong definition of correlation
What Dr Craig calls a “tight correlation” is a different number: the Excel graph in the thread gives the coefficient of determination, not a correlation coefficient. The Twitter thread also defines it incorrectly. The correct definition is:
- the share of variance in the dependent variable accounted for by independent variables.
The coefficient of determination (R²) shows how well a model fits. For a simple linear model, it is the square of the Pearson correlation coefficient.
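The relationship between the two numbers can be checked on synthetic data. The sketch below assumes NumPy is available and uses made-up data, not the series from the thread. It fits a simple linear model, computes R² from the residuals, and compares it with the squared Pearson coefficient. It also shows the correlation coefficient implied by an R² of 93%.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic data: a linear relationship plus noise.
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Pearson correlation coefficient.
r = float(np.corrcoef(x, y)[0, 1])

# Simple linear regression; polyfit returns the slope first for deg=1.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Coefficient of determination: share of variance in y accounted for by x.
r_squared = float(1 - residuals.var() / y.var())

print(f"r = {r:.4f}, r^2 = {r**2:.4f}, R^2 = {r_squared:.4f}")

# An R^2 of 93% corresponds to a correlation coefficient of about 0.96.
implied_r = 0.93 ** 0.5
print(f"R^2 = 0.93 implies |r| = {implied_r:.3f}")
```

The equality holds only for a simple linear model with one independent variable and an intercept.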
Death certificate mentions
Dr Craig states:
In July and August, for a third of patients dying with a COVID diagnosis, the doctors could not bring themselves to put that as the underlying cause of death.
This is a misunderstanding. Death certificates mention diseases as believed causes or contributory factors. These clinical judgements do not need a positive test result.
This differs from the daily measures of deaths, which depend on positive test results.
Cases and new infections
Dr Craig asserts:
Testing has ramped up so much that for 18th-24th Sept 66% of predicted cases were diagnosed.
This is likely to be misleading.
First, the comparison is between lab-confirmed cases and new daily infections. These are different concepts.
Second, the ONS survey excludes institutional settings, like hospitals. It is not an estimated number of all SARS-CoV-2 cases. This is explicit in their analysis:
The data cannot be used for:
measuring the number of cases and infections in care homes, hospitals and other institutional settings
Third, the number of new daily infections is a survey estimate. People should not use the central estimate as if it were a precise number.
As I wrote for World Statistics Day:
Misinformation can be viral. As we take precautions to not spread this virus, we must act to avoid sharing misinformation. This is important: mistaken beliefs could damage people’s health.