Correlations and Time Series

A high correlation does not mean what you think it means.

Over 1,700 users shared a series of Twitter posts, which claimed “proof” that:

The vast majority of COVID deaths in England since July have been mislabelled false positive deaths.

The suggestion is that we have more COVID-19 deaths due to greater testing. The posts assert increased testing led to lots of incorrect diagnoses. Their “proof” was high correlation between two time series.

Their conclusion is false, for several reasons. This article focuses on the statistical problems with measuring correlation between time series.

Correlation coefficients

Dr Craig, a pathologist, starts with this graph:

No data source is given, though it should be the PHE Coronavirus Dashboard. Dr Craig writes:

You will notice that the shape of the two curves are very similar. We can test this. The chart below demonstrates that since August 93% of the rise in deaths can be accounted for by the rise in the number of tests done in hospitals over the 28 days preceding.

The Pearson correlation coefficient is around 0.96.

I have never seen such a tight correlation in my career. Biology just isn’t like that. But there it is — 93%.

People should not analyse time series in this way.

Pearson’s correlation coefficient is a statistic that measures the linear association between two variables. Two variables with high positive correlation tend to increase together and decrease together.

Time series are not independent observations. Time connects those points.

Two time series which share the same trend will show high correlation. This is true even for random series and unrelated series:

The R² statistic here is 0.92. (Image: Tyler Vigen)

This is why such methods are not appropriate for time series.

Values over time could have an internal structure. That could include trends, seasonality, or auto-correlation. Time series analysis needs to account for these phenomena.
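
The effect of a shared trend is easy to reproduce. The sketch below is an illustration only: it uses simulated data, not the testing or deaths figures. It builds two series independently, each with an upward trend and its own noise, then compares the Pearson correlation of the raw levels with the correlation of the period-to-period changes. Differencing is one common way to remove a trend. It is an assumption here, not a method used in the thread or the source data.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length arrays."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(seed=1)
n = 500
t = np.arange(n)

# Two series generated independently of each other.
# Each is a linear trend plus its own random noise (simulated values).
series_a = 0.5 * t + rng.normal(0, 5, n)
series_b = 0.3 * t + rng.normal(0, 5, n)

# Correlation of the raw levels: dominated by the shared upward trend.
r_levels = pearson(series_a, series_b)

# Correlation of the changes between consecutive points: trend removed.
r_diffs = pearson(np.diff(series_a), np.diff(series_b))

print(f"levels: r = {r_levels:.2f}")
print(f"differences: r = {r_diffs:.2f}")
```

The levels correlate strongly even though neither series has any influence on the other. Once the trend is removed, the correlation falls close to zero.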

There is a clear mechanism through which testing and COVID-19 deaths both increase. The virus spreads: we test more people and more people die.

The graph implies there would be no COVID-19 deaths if there were only around 1.1m tests in a 28-day period. This is absurd.

Reduced testing can lead to under-counting of COVID-19 deaths — as it did in the early parts of the pandemic.

False positives do not explain rising lab-confirmed cases

The thrust of Dr Craig’s posts is that lab-confirmed cases are “full of false positives”. When those people die, these are “mislabelled false positive deaths”.

A false positive is when someone who does not have the virus gets a positive test result. A false negative occurs when someone who has the virus receives a negative result.

Diagnostic PCR tests work by detecting the virus’s genetic material in the sample. If a swab does not pick up enough viral material, the result can be a false negative.

False positives, by contrast, are rare: the tests have a very low false positive rate. The ONS methods article states:

For example, in the most recent six-week period (31 July to 10 September), 159 of the 208,730 total samples tested positive. Even if all these positives were false, specificity would still be 99.92%.
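
The ONS figure can be checked with one line of arithmetic. This sketch reproduces it. The function name is my own, and the calculation treats every sample as coming from an uninfected person, which is the worst case the quote describes.

```python
def minimum_specificity(positives, total_samples):
    """Lower bound on specificity if every positive result were false.

    Assumes all samples came from uninfected people, so every positive
    counts as a false positive and the rest count as true negatives.
    """
    return 1 - positives / total_samples

bound = minimum_specificity(positives=159, total_samples=208_730)
print(f"{bound:.2%}")  # 99.92%
```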

The weekly positive proportion in England has increased in both testing pillars:

The two graphs have different scales. (Image: Public Health England/Weekly COVID-19 Surveillance Report, Week 40)

As a consequence, most positive test results are correct. False positives do not explain rising lab-confirmed cases.
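
The reasoning can be made concrete. If at most a fixed fraction of uninfected people test positive, then false positives can make up at most that fraction divided by the overall positive proportion. The sketch below applies this bound. It assumes the ONS worst-case rate carries over to other testing, and the positivity values are hypothetical, chosen only to show the pattern.

```python
def max_false_positive_share(false_positive_rate, positivity):
    """Upper bound on the fraction of positive results that are false.

    false_positive_rate: chance that an uninfected person tests positive.
    positivity: fraction of all tests that come back positive.
    """
    if not 0 < positivity <= 1:
        raise ValueError("positivity must be in (0, 1]")
    # False positives per test cannot exceed false_positive_rate,
    # so their share of all positives is at most rate / positivity.
    return min(1.0, false_positive_rate / positivity)

# ONS worst case: every one of the 159 positives was false.
worst_case_rate = 159 / 208_730

# Hypothetical positivity values, for illustration only.
for positivity in (0.001, 0.01, 0.05):
    share = max_false_positive_share(worst_case_rate, positivity)
    print(f"positivity {positivity:.1%}: at most {share:.1%} of positives are false")
```

A fixed false positive rate cannot produce a rising positive proportion. As positivity climbs, the largest possible share of false positives shrinks.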

More errors

This is not an exhaustive list of problems in this Twitter thread.

The thread gives the wrong definition of correlation

The 93% that Dr Craig calls a “tight correlation” is not a correlation coefficient. The Excel graph reports the coefficient of determination. The Twitter thread also defines that statistic incorrectly. The correct definition is:

  • the share of variance in the dependent variable accounted for by independent variables.

The coefficient of determination (R²) shows how well a model fits. For a simple linear model, it is the square of the Pearson correlation coefficient.
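
This relationship is simple to verify. The sketch below fits a straight line to simulated data (not the data from the thread), computes R² from the residuals, and compares it with the squared Pearson coefficient. It also takes the square root of the quoted 0.93 to recover the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated data for illustration: y depends linearly on x, plus noise.
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

# Least-squares straight line; polyfit returns the highest degree first.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Coefficient of determination: 1 - residual variation / total variation.
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_squared, r**2))  # True

# An R² of 0.93 corresponds to a correlation of about 0.96.
implied_r = 0.93**0.5
print(round(implied_r, 2))  # 0.96
```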

Death certificate mentions

Dr Craig states:

In July and August, for a third of patients dying with a COVID diagnosis, the doctors could not bring themselves to put that as the underlying cause of death.

This is a misunderstanding. Death certificates mention diseases as believed causes or contributory factors. These clinical judgements do not need a positive test result.

This is different to the daily measure of deaths, which depends on positive test results.

The CEBM article bases its calculation on Public Health England weekly reports. It could instead use the Office for National Statistics monthly mortality analyses.

Cases and new infections

Dr Craig asserts:

Testing has ramped up so much that for 18th-24th Sept 66% of predicted cases were diagnosed.

This is likely to be misleading.

First, the comparison is between lab-confirmed cases and new daily infections. These are different concepts.

Second, the ONS survey excludes institutional settings, like hospitals. It is not an estimated number of all SARS-CoV-2 cases. This is explicit in their analysis:

The data cannot be used for:

measuring the number of cases and infections in care homes, hospitals and other institutional settings

The positive proportion continues to increase in English private households. (Image: ONS COVID-19 Infection Survey)

Third, the number of new daily infections is a survey estimate. People should not use the central estimate as if it were a precise number.

As I wrote for World Statistics Day:

Misinformation can be viral. As we take precautions to not spread this virus, we must act to avoid sharing misinformation. This is important: mistaken beliefs could damage people’s health.

This blog looks at the use of statistics in Britain and beyond. It is written by RSS Statistical Ambassador and Chartered Statistician @anthonybmasters.
