Data wrangling and name linkages

Answering a question about MPs leads to a lot of wrangling.

Anthony B. Masters
4 min read · Dec 14, 2024

The initial question was simple. How many Conservative MPs entered the ballot for the Private Members’ Bill?

I put this question to the House of Commons Enquiry Service. They responded promptly and directed me towards a PDF document. The file listed the names of all MPs who had entered the ballot, along with their assigned numbers. Unfortunately, there was no summary by party.

Wrangling a list

PDFs can also be difficult for many data analysis tools to read.

Can we wrangle this data in R? Yes. There is a package called ‘pdftools’.

Looking at the file itself reveals the challenges. There are three columns, each with a number and the MP’s name. Sometimes, the full name spills over onto the row below.

How many Conservative MPs are in this list? (Image: UK Parliament Documents)
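To make that layout concrete, here is a minimal sketch in base R of pulling number and name pairs out of a single row of text. The row is made up for illustration (placeholder names, not real ballot entries), and the rule that columns are separated by two or more spaces is an assumption about how the text comes out of the PDF, not something confirmed from the file.

```r
# Made-up row in the shape described above: three number-name pairs,
# separated by runs of spaces. These are placeholder names, not real entries.
example_row <- "1 Example Member A      2 Example Member B      3 Example Member C"

# Assumption: columns are separated by two or more spaces,
# while words within a name are separated by a single space.
cells <- strsplit(trimws(example_row), " {2,}")[[1]]

# A cell that starts with digits holds a ballot number followed by a name.
has_number <- grepl("^[0-9]+ ", cells)

example_df <- data.frame(
  number = as.integer(sub("^([0-9]+) .*$", "\\1", cells[has_number])),
  name   = sub("^[0-9]+ +", "", cells[has_number]),
  stringsAsFactors = FALSE
)

# Cells without a leading number would be spilled-over name fragments.
# They need joining to the entry above, which this sketch does not attempt.
spilled <- cells[!has_number]
```

The spill-over case is the harder part: once a fragment is split away from its column position, extra bookkeeping is needed to reattach it to the right name.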

With one function, R can read each page of the document as a character string.

pdf_text <- pdftools::pdf_text(pdf_path)
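As a rough sketch of what that call returns: `pdftools::pdf_text()` gives a character vector with one element per page, with the rows of text inside each element separated by `\n`. The example below uses a made-up two-page vector (`example_pages`, with placeholder names rather than real entries) and base R only, so it runs without the PDF itself.

```r
# Hypothetical stand-in for the output of pdftools::pdf_text():
# one string per page, with rows separated by "\n".
example_pages <- c(
  "1 Example Member A      2 Example Member B      3 Example Member C\n4 Example Member D      5 Example Member E      6 Example Member F\n",
  "7 Example Member G      8 Example Member H      9 Example Member I\n"
)

# Split each page into its rows, then flatten into one character vector.
example_rows <- unlist(strsplit(example_pages, "\n", fixed = TRUE))

length(example_pages)  # number of pages
length(example_rows)   # number of text rows across all pages
```

Working row by row like this is one common starting point for a table that a PDF has flattened into text.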

From this mess, we can restore order and define a function to produce a list of MPs. Let’s look at this custom function and break down why we need each step.

hoc_table_converter <- function(input_pdf_text, input_pdf_rows){
output_df <…

Written by Anthony B. Masters

This blog looks at the use of statistics in Britain and beyond. It is written by RSS Statistical Ambassador and Chartered Statistician @anthonybmasters.