Data wrangling and name linkages
Answering a question about MPs leads to a lot of wrangling.
The initial question was simple. How many Conservative MPs entered the ballot for the Private Members’ Bill?
I put this question to the House of Commons Enquiry Service. They responded promptly and directed me towards a PDF (Portable Document Format) file. This file had the names of all MPs who had entered the ballot and their assigned number. Unfortunately, there was no summary by party.
Wrangling a list
PDFs can also be difficult for many data analysis tools to read.
Can we wrangle this data in R? Yes. There is a package called ‘pdftools’.
Looking at the file itself reveals the challenges. There are three columns, each with a number and the MP’s name. Sometimes, a full name spills over onto the row below.
With a function, R can read each page of the document as a single character string. The result is a character vector with one element per page.
pdf_text <- pdftools::pdf_text(pdf_path)
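To see why the raw output needs tidying, here is a minimal sketch in base R. It is not the custom function described in this article, only an illustration of the layout problem. The page text, names and numbers are made up, and the object names (example_page, example_rows, entries) are hypothetical. The sketch assumes that entries on the same row are separated by two or more spaces.

```r
# Made-up page text, standing in for one element of the pdftools::pdf_text() output.
# Three columns per row, with one name spilling onto the row below.
example_page <- paste(
  "1   Alice Example        2   Bob Sample          3   Carol Placeholder",
  "4   Dan Illustration     5   Eve Long-Double-    6   Frank Mock",
  "                             Barrelled",
  sep = "\n"
)

# Split the single page string into one string per row
example_rows <- strsplit(example_page, "\n", fixed = TRUE)[[1]]

# Rows that do not start with a number are spill-over rows from a long name
is_continuation <- !grepl("^\\s*[0-9]", example_rows)

# Extract each "number + name" pair: digits, spaces, then text up to
# a gap of two or more spaces (assumed column separator) or the end of the row
pattern <- "[0-9]+\\s+[^0-9]+?(?=\\s{2,}|$)"
matches <- regmatches(example_rows, gregexpr(pattern, example_rows, perl = TRUE))
entries <- trimws(unlist(matches))

print(entries)
print(is_continuation)
```

In this sketch, the spill-over row is only flagged, not reattached to its name. Joining it back to the right entry is one of the steps a fuller conversion function has to handle.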
From this mess, we can restore order and define a function to produce a list of MPs. Let’s look at this custom function and break down why we need each step.
hoc_table_converter <- function(input_pdf_text, input_pdf_rows){
output_df <…