# Grade Estimates and Statistical Models

## Without exams, how do we estimate what grades students get?

The Coronavirus pandemic caused great social and economic disruption.

Educational institutions around the world closed and began mass online learning. Teachers and students are adapting to exceptional circumstances. The question is: what grades do students receive when there are no exams?

This article discusses the statistical model used by the qualifications regulator in England.

# The Direct Centre Performance model

The Office of Qualifications and Examinations Regulation (Ofqual) regulates qualifications in England. After school and college closures in March 2020, Gavin Williamson MP (Education Secretary) wrote:

Ofqual should ensure, as far as is possible, that qualification standards are maintained and the distribution of grades follows a similar profile to that in previous years.

In simple terms, the Ofqual model has three stages for each subject:

- Calculate the grade distribution at each centre based on the prior three years. That calculation accounts for improvement at centres.
- Adjust for the prior attainment of the 2020 cohort: their GCSE performance is compared with that of past cohorts at each school and college.
- Use the relative rankings given by teachers. Place each student in the adjusted grade distribution.

For centres with low numbers in a subject, student grades depend more on the teachers.
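The third stage can be sketched in a few lines. This is a minimal illustration, not Ofqual's implementation: the function name `assign_grades_by_rank`, the student labels and the 80%/20% distribution are all hypothetical, and the sketch assumes cumulative shares are rounded to the nearest whole student.

```python
import math

def assign_grades_by_rank(ranked_students, grade_shares):
    """Slot a ranked list of students (best first) into a grade distribution.

    grade_shares: list of (grade, share) pairs, best grade first,
    with shares summing to 1. Hypothetical helper for illustration.
    """
    n = len(ranked_students)
    grades = {}
    start = 0
    cumulative = 0.0
    for grade, share in grade_shares:
        cumulative += share
        # Number of students at or above this grade, rounded half up.
        end = min(n, math.floor(cumulative * n + 0.5))
        for student in ranked_students[start:end]:
            grades[student] = grade
        start = max(start, end)
    # Anyone left over through floating-point drift gets the lowest grade.
    for student in ranked_students[start:]:
        grades[student] = grade_shares[-1][0]
    return grades

# Hypothetical centre: five students in the teacher's rank order, best first.
students = ["S1", "S2", "S3", "S4", "S5"]
# Hypothetical adjusted distribution for this centre and subject.
shares = [("A*", 0.8), ("A", 0.2)]

result = assign_grades_by_rank(students, shares)
print(result)
```

In this hypothetical centre, the four highest-ranked students receive an A* and the fifth receives an A, whatever grade the teacher assessed for them.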

The Royal Statistical Society and others have identified six main problems:

- **Validity:** do teacher assessments and exams measure the same thing?
- **Model assessments:** Ofqual use actual results from 2019, not teacher predictions.
- **Uncertainty in relative rankings:** the model treats rank orders as immutable.
- **Subject covariance:** student performance is separate in different subjects.
- **The ungraded problem:** some students *must* receive an ungraded mark.
- **Point estimates and prediction intervals:** students get the estimated grade. They do not receive the interval estimate.

# Validity

Do teacher assessments and examinations measure the same thing?

Teacher assessments reflect things such as performance in mock examinations and coursework. There are also qualitative judgements, such as motivation, interest and the quality of daily work.

Examinations measure the performance of a student against a particular paper. Some students can have a bad day. Exam papers can have varying difficulty. Exams may coalesce parts of the syllabus in unexpected ways. There was an infamous question in the 2015 Edexcel Maths GCSE exam. Marking can be too harsh or too lenient.

Teacher-predicted grades can reflect student *potential*. That leads to general overestimation of *actual* student performance in the exam.

Smaller classes also benefit, because their grades rely more on teacher assessments than on modelled estimates.

Suppose a teacher assesses five A* grades for students in their subject. Based on historical performance, the estimate is for four A* grades in their class. How does the model know *which* student would not get that grade in an exam?

**If two methods are measuring the same thing, expect high correlation.** This paragraph is from the Ofqual interim report:

The correlation between teacher estimates and actual grades is relatively strong with correlations between 0.76 and 0.85. This means that teachers can estimate the relative performance of their students within their class with high accuracy, even if the absolute accuracy is lower.

Correlation measures the strength of the relationship between two variables.

**It does not measure their agreement.**

Correlation coefficients depend on the mapping from the graded letters to numbers. The choice of which numbers is arbitrary. It may not reflect underlying relationships between estimated and achieved grades.
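A small sketch makes both points concrete. The class below is invented for illustration: every teacher estimate sits exactly one grade above the exam result. The two grade-to-number mappings (`linear` and `stretched`) are assumptions too; neither comes from the Ofqual report.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def agreement(estimated, actual):
    """Share of students whose estimated grade equals their actual grade."""
    return sum(e == a for e, a in zip(estimated, actual)) / len(actual)

# Hypothetical class: each estimate is exactly one grade above the result.
estimated = ["A*", "A", "A", "B", "C", "D"]
actual = ["A", "B", "B", "C", "D", "E"]

# Two arbitrary ways of turning letters into numbers.
linear = {"A*": 6, "A": 5, "B": 4, "C": 3, "D": 2, "E": 1}
stretched = {"A*": 10, "A": 6, "B": 4, "C": 3, "D": 2, "E": 1}

r_linear = pearson([linear[g] for g in estimated], [linear[g] for g in actual])
r_stretched = pearson([stretched[g] for g in estimated],
                      [stretched[g] for g in actual])
exact = agreement(estimated, actual)

print(r_linear, r_stretched, exact)
```

With the evenly spaced mapping, the correlation is perfect (up to floating-point rounding) while not a single estimate matches the actual grade. Swapping to the stretched mapping changes the coefficient, even though the grades themselves are identical.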

# Modelled assessments

How is the model assessed?

Ofqual's choice of model depends on relationships between historical years and 2019 *actual* results. Is historical performance good at predicting what teachers *estimate* in a future year?

Across large subjects, predictive accuracy for the same grade was between 50% and 60%. That is the accuracy when using the actual results in 2019.

Even if teacher rankings were perfect, over a third of students would receive the wrong grade.

# Uncertainty in relative rankings

As Prof Guy Nason (Imperial) says, the model allows for no uncertainty in relative rankings. It treats the rank order as sacrosanct. There *should* be uncertainty.

The model treats predicted grades from teachers as uncertain and open to change. It treats rank orders from the same teachers as unchangeable:

We proposed an approach that placed more weight on statistical expectations — determining the most likely distribution of grades for each centre based on the previous performance of the centre and the prior attainment profile of this year’s students. Then using the submitted rank order to assign grades to individual students in line with this expected grade distribution.

Return to our hypothetical teacher. They now have to rank their students. There are no ties, even when the teacher considers their abilities to be the same. Four students may get A*. The lowest ranked one gets an A. There are similar problems across all grade boundaries.

For larger schools and colleges, teachers can rank students in groups of 10. This reflects how relative ranking becomes more difficult in larger groups.

# No modelled subject covariance

This is an institutional model. Schools and colleges are set grade distributions for each subject. By rank orders, students are then slotted into those grades.

**Subjects have separate models.** There are separate rank orders for each subject. This is a problem because student performance across subjects is not independent.

A-Level students generally get a better grade in Maths than Further Maths. As the name suggests, Further Maths is harder than Maths. It is a course for people who are *really* interested in maths and consider univariate calculus too easy.

In 2019, 0.2% of students achieved a higher grade in Further Maths than Maths. In 2020, 3.1% had estimated higher grades in Further Maths.

Ofqual recognises this issue with Further Maths:

Given that there are a higher proportion of centres with small cohorts in further maths — and that this is related to more lenient outcomes — this is not surprising.

# The ungraded problem

Dave Thomson (FFT Education Data Lab) highlights the following issue.

Suppose a school receives a prior attainment change for their 2020 cohort. That change suggests 2.3% of students should receive a U (ungraded).

The class for the subject is 27 students.

The allocation method is stated as:

This is performed by overlaying the rank order provided by the centre onto each centre’s predicted cumulative percentage grade distribution such that the proportion of students awarded each grade within the centre matches the predicted distribution as closely as possible.

Since there are 27 students, each one is worth 3.7%. Though 2.3% is less than one student's share, it is more than half of it.

In this example, one student *must* receive a U. Not one student got a U in that subject for the past three years.
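The arithmetic can be checked directly. This sketch assumes that matching the predicted distribution "as closely as possible" means choosing the whole number of U grades whose share of the class is nearest to 2.3%. Ofqual's procedure works on the cumulative distribution and may differ in detail; the variable names are illustrative.

```python
class_size = 27
predicted_u_share = 0.023  # 2.3% of students predicted to receive a U

# Each student accounts for about 3.7% of the class.
share_per_student = 1 / class_size

# Gap between the awarded and predicted share for zero or one U grade.
error_if_zero = abs(0 / class_size - predicted_u_share)
error_if_one = abs(1 / class_size - predicted_u_share)

# Whole number of U grades that lands closest to the predicted share.
u_count = min(
    range(class_size + 1),
    key=lambda k: abs(k / class_size - predicted_u_share),
)

print(round(share_per_student * 100, 1), error_if_zero, error_if_one, u_count)
```

Awarding no U grades misses the predicted share by 2.3 percentage points. Awarding one misses it by about 1.4 points, so one U is the closer match.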

# Point estimates and prediction intervals

**Uncertainty matters.** In large subjects, the 2019 model had estimated same-grade accuracy from 50% to 60%. Within one grade of the prediction, accuracy was over 95%. One smaller subject had accuracy within one grade of under 65%.

There is inherent uncertainty in the prediction model. Students receive the central estimate. **Students are not shown the prediction interval.**
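To show the difference, here is a rough sketch of reporting both a point estimate and an interval. The grade probabilities are invented for one hypothetical student and do not come from the Ofqual model. The helper `point_and_interval` is also an assumption: it returns the most likely grade and the narrowest run of adjacent grades covering at least 95% of the probability.

```python
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]

def point_and_interval(probabilities, coverage=0.95):
    """Return the most likely grade and the narrowest contiguous grade
    range whose total probability reaches the requested coverage."""
    point = max(range(len(GRADES)), key=lambda i: probabilities[i])
    best = None
    for lo in range(len(GRADES)):
        total = 0.0
        for hi in range(lo, len(GRADES)):
            total += probabilities[hi]
            # Small tolerance guards against floating-point rounding.
            if total >= coverage - 1e-12:
                if best is None or hi - lo < best[1] - best[0]:
                    best = (lo, hi)
                break
    return GRADES[point], (GRADES[best[0]], GRADES[best[1]])

# Hypothetical probabilities for one student, in the order of GRADES.
probs = [0.02, 0.20, 0.55, 0.20, 0.03, 0.00, 0.00]

point, interval = point_and_interval(probs)
print(point, interval)
```

For this invented student, the point estimate is a B, while the 95% interval runs from A to C. Only the first of those would appear on a results slip.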

# Hello, World

Statistical models and algorithms should not have omnipotent authority.

As Prof Hannah Fry (UCL) writes in her book:

In my view, the best algorithms are the ones that take the human into account at every stage. The ones that recognise our habit of over-trusting the output of a machine, while embracing their own flaws and wearing their uncertainty proudly front and centre.

There is a need for transparency over the use of statistical models. Methodological choices need justification. Unfair outcomes need intervention.

This is an institutional model. Based on past results, it is fair to schools and colleges. That is not the same as being fair to individual students.