A Spreadsheet of Errors

How can misusing Microsoft Excel induce analytical errors?

Using Microsoft Excel can cause major problems in statistical reporting and analysis. This article looks at several recent incidents of Excel errors.

The missing cases

Public Health England gathered SARS-CoV-2 swab test results from commercial firms. These results were in a list-based format — comma-separated values (CSV) files.

An old version of Excel pulled together different text files. (Image: BBC)

In an automated process, the agency used Excel to pull together these text-based files. For each SARS-CoV-2 test, there were several rows in the file.

The agency used an old version of Excel (XLS format). That meant the collated files could only hold around 65,000 rows. Microsoft superseded that format in 2007. The row limit in the latest version of Excel is about one million.

When the rows breached their limit, the extra lines were missing. There was temporary under-reporting of lab-confirmed cases. These reported cases feed into the NHS Test and Trace system. As a result, this problem delayed attempts to control the virus.

This was not a “glitch”, but the inevitable outcome of the automated process.

Spreadsheet problems

This is not the first time that major problems lie in the shadow of an Excel spreadsheet. The European Spreadsheet Risks Interest Group identifies some horror stories of Excel errors.

In genomics, genes have shortened names. For example, Septin 2 was SEPT2. The problem is that Excel treated that name as a date. The default is: SEPT2 becomes 2-Sep. There are others which become floating point numbers.

That inadvertent conversion affects later analysis. A 2016 Genome Biology paper found:

Of the selected journals, the proportion of published articles with Excel files containing gene lists that are affected by gene name errors is 19.6%.

In response, the HUGO Gene Nomenclature Committee renamed genes. SEPT2 became SEPTIN2, to stop this date-reading issue.

A major economics paper by Prof Reinhart and Prof Rogoff had an Excel error. Their Excel formula was too short, missing 5 out of 20 countries in their key calculation. This coding error, as well as choices of years and weights, influenced the paper’s main claim.

The average should be for L30:L49. (Image: Roosevelt Institute)

In 2012, a misdirected reference resulted in a $25m accounting error in Utah’s education budget. Two people resigned because of this problem:

Shumway said the error was the result of a mathematical formula in an Excel spreadsheet referencing the wrong cell to calculate the state’s weighted pupil units. That inaccurate figure was then used for a per-pupil funding estimate that resulted in public education being underfunded by the Utah Legislature.

There are other general problems with Microsoft Excel, including:

  • Leading point arithmetic: all calculations use the 15 leading non-zero digits. That method can result in strange inequalities.
  • Missing values: the tool treats missing values in inconsistent ways.
  • No optimal arrangement: in Excel, it is easier to copy data into the right format for each analysis.
  • No bug list: there is no comprehensive list of Microsoft Excel’s bugs or unexpected features.

Statistics with Excel

Microsoft Excel is a powerful spreadsheet. Its main competitor is Google Sheets, which clones its main functionality.

The spreadsheet tool is Turing complete: meaning it can perform all calculations. Because you can do everything in Excel, it does not mean you should.

There are major barriers to overcoming the Excel addiction in data analysis. Firms may stop people from downloading a new browser — let alone a statistics tool. Since it is part of the standard Office suite, companies may only allow analysts to use Excel.

Nor is access to proper tools uniform across an organisation. Some analysts may get pricey licenses. Others have to make do with Excel.

The limitations of Excel need care to mitigate. There are good practices for spreadsheet testing, design, and documentation. Proficient organisation needs consistency, one date format, data dictionaries, rectangular frames, and more.

Microsoft Excel is not a statistics packages for data analysis. Good practices can help some issues arising from the overuse of spreadsheets. Analysts need freedom and funding to use appropriate tools.

This blog looks at the use of statistics in Britain and beyond. It is written by RSS Statistical Ambassador and Chartered Statistician @anthonybmasters.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store