Changing grade weights can close grade gaps in introductory physics

Title: Grades, grade component weighting, and demographic disparities in introductory physics

Authors: Amber B. Simmons and Andrew F. Heckler

Institution: Ohio State University

Journal: Physical Review Physics Education Research 16, 020125 (2020) [open access]

When I took my first math course in college, I was shocked by the grading criteria. A 32% was considered passing, and what would normally have been around an “F” or “D” was considered a B-. As a result, I constantly felt like I was struggling in the course based on the 50% and 60% (and sometimes lower) grades I earned on exams. Unsurprisingly, I debated dropping the course.

Image showing the grading policy for the intro math course I took. The final grade was your best 5 midterms plus homework plus a final. Passing was 32%, an A- was 80% or above, and an A was 90% or above.
This is the grading scale for my first college math course. If the values aren’t scary enough, the set notation surely is.

Many students have similar experiences. Students use grades to decide whether to stay in their major and to revise their beliefs about their ability in a field. As such, it is not unreasonable to expect that grading practices may affect the demographics of a discipline.

The goal of today’s paper was to see how different grade weighting schemes affected passing and “A” rates and whether such schemes could close grade gaps between students. Sure enough, reducing the amount of the grade dependent on exams eliminates the gender grade gap and reduces the racial grade gap.

To reach their conclusion, Ohio State University researchers collected grades from over 20,000 students enrolled in algebra-based and calculus-based introductory physics over a six-year period. In addition, they obtained individual grade components for 6,500 of those students. These components included exams and quizzes (midterm exams, in-class quizzes, and the final exam) as well as online homework, in-class labs, and online skill-building practice.

In both courses, grade components (and not overall grades) were curved so that each had a median of at least 77% (roughly a C+/B- based on the grading scale).
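To make the idea of curving a single grade component concrete, here is a minimal sketch. It assumes an additive shift capped at 100%, which is only one possible way to curve; the post does not say which procedure the courses actually used, and the function name and scores are hypothetical.

```python
from statistics import median

def curve_to_median(scores, target_median=77.0, cap=100.0):
    """Shift every score up by a constant so the median reaches target_median.

    Assumption: an additive shift, capped at `cap`. The courses' actual
    curving procedure is not described in the post and may differ.
    """
    shift = max(0.0, target_median - median(scores))
    return [min(cap, s + shift) for s in scores]

# Hypothetical raw exam percentages, not data from the paper.
raw_exam = [48.0, 55.0, 62.0, 70.0, 91.0]
curved_exam = curve_to_median(raw_exam)
```

A component whose median is already at or above the target is left unchanged, which matches the "at least 77%" wording.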

The researchers also had access to the university student records system and were able to get the students’ ACT scores to provide a measure of “prior preparation” of the students in the introductory physics courses.

When looking at the overall grades, the researchers found the expected results based on previous work. Women and men had similar grades, but white, Asian, and continuing-generation students had higher grades than their Black, Latinx, Native American, and first-generation counterparts.

Next, the researchers modeled the final grades using the ACT scores to account for differences in preparation coming into the course. Ideally, students with similar ACT scores should have earned similar grades in the course.

Yet, that wasn’t the case for students from underrepresented racial groups in physics. Instead, students from overrepresented racial groups earned higher grades than students from underrepresented racial groups for identical ACT scores, except for the highest ACT scores (Figure 1).

Figure 1: Course grade by gender (top) and underrepresented race in physics status (bottom). Notice that in both cases, there are regions where the lines do not overlap and hence differences in performance. (Based on Figure 2 in the paper.)

When looking at the grade component data, the researchers found that ACT scores explained 20% of the variance in the exam component of the final grade, but only 2% of the variance in the non-exam component. That is, the proxy for prior preparation wasn’t necessarily measuring only prior preparation but may be measuring something else too (such as test-taking skills).
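"Variance explained" here can be read as the R² of a model predicting a grade component from ACT score. The sketch below assumes a simple one-predictor linear fit, where R² equals the squared correlation. The researchers' actual model may be more elaborate, and the data are made up purely for illustration.

```python
def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical values, not data from the paper.
act = [20.0, 24.0, 28.0, 32.0]
exam = [60.0, 70.0, 72.0, 86.0]       # tracks ACT closely
non_exam = [88.0, 92.0, 85.0, 90.0]   # barely related to ACT

exam_r2 = r_squared(act, exam)
non_exam_r2 = r_squared(act, non_exam)
```

In this toy data the exam scores have a much higher R² than the non-exam scores, mirroring the direction (though not the size) of the 20% versus 2% result.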

From this analysis, the researchers also found that grade gaps tended to be due to differences in the grades on exams and quizzes rather than non-exam components.

Changing the final grade weighting

Inspired by their finding that non-exam components showed smaller disparities than exam components, the researchers then performed a thought experiment involving how the final grade is calculated.

Currently, the final grade is a weighted average of the exam components and non-exam components where exams contribute 70% of the final grade and the non-exam components contribute 30%.

What if instead, exam and non-exam components were given equal weight (50% and 50%)?

In the first experiment, the weights are changed to 50/50, but students’ grades are adjusted so that the average grade under this scheme is the same as in the actual course.

In the second experiment, all student grades are curved so that the new average grade equals the average final grade under a 50/50 weighting.

In the third experiment, the weights are changed to 50/50 and the average grade is not fixed, meaning the new average final grade is equivalent to that in the second experiment.
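The three experiments can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: the "adjustment" and "curve" are modeled as a constant additive shift, the second experiment is read as curving the actual 70/30 grades, and the student scores are hypothetical.

```python
def weighted_grade(exam, non_exam, exam_weight):
    """Final grade as a weighted average of exam and non-exam percentages."""
    return exam_weight * exam + (1.0 - exam_weight) * non_exam

def mean(values):
    return sum(values) / len(values)

def shift_to_mean(grades, target_mean):
    """Add a constant to every grade so the class mean equals target_mean.

    Assumption: a uniform additive shift stands in for the curve.
    """
    shift = target_mean - mean(grades)
    return [g + shift for g in grades]

# Hypothetical (exam %, non-exam %) pairs, not data from the paper.
students = [(60.0, 90.0), (70.0, 80.0), (80.0, 90.0), (50.0, 100.0)]

actual = [weighted_grade(e, n, 0.7) for e, n in students]  # 70/30 weighting
equal = [weighted_grade(e, n, 0.5) for e, n in students]   # 50/50 weighting

experiment_1 = shift_to_mean(equal, mean(actual))  # 50/50 weights, original mean
experiment_2 = shift_to_mean(actual, mean(equal))  # curved up to the 50/50 mean
experiment_3 = equal                               # 50/50 weights, mean not fixed
```

Comparing `experiment_2` with `experiment_3` separates the effect of simply raising everyone's grade from the effect of changing which work counts.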

Given that students tend to do better on non-exam parts of the course than exam parts, it is not surprising that all three experiments lowered the number of “D”s and “F”s, and that experiments II and III, which allowed the mean grade to increase, increased the number of “A”s.

However, only the third experiment decreased the size of the grade gaps. For gender, the grade gap was eliminated.

Further, the size of the changes is impressive. Under the third experiment, 20% fewer women and students from underrepresented racial groups would have earned a “D” or “F,” while the number of “A”s would have increased 50%. Men and students from overrepresented racial groups also would have benefited from the grading scheme changes, though to a smaller degree (Figures 2 and 3).

Figure 2: Relative difference in D and F percentage based on various demographics. Numbers in the right margin correspond to the experiment number. Notice that for the equal weighting scheme, all groups earn fewer Ds and Fs. (Fig 4 in paper)
Figure 3: Relative difference in “A” percentage based on various demographics. Numbers in the right margin correspond to the experiment number. Again, for the equal weighting scheme, all groups earn more “A”s. (Fig 5 in paper)
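The relative differences in Figures 2 and 3 are changes measured against the original rate, not percentage-point changes. A minimal sketch of that arithmetic follows. The rates are hypothetical and chosen only to reproduce the 20% and 50% figures quoted above.

```python
def relative_change(old_rate, new_rate):
    """Fractional change from old_rate to new_rate (negative means a decrease)."""
    if old_rate == 0:
        raise ValueError("old_rate must be nonzero")
    return (new_rate - old_rate) / old_rate

# Hypothetical rates chosen only to illustrate the arithmetic.
df_change = relative_change(0.25, 0.20)  # D/F rate falls from 25% to 20%
a_change = relative_change(0.10, 0.15)   # "A" rate rises from 10% to 15%
```

With these made-up rates, a drop of five percentage points in the D/F rate is a 20% relative decrease, and a rise of five percentage points in the “A” rate is a 50% relative increase.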

Takeaways and Recommendations

The results of this study provide more evidence that the system we use to grade our students can have a large impact on course outcomes. When using a grade weighting scheme that treated exams and non-exam activities equally, more students earned “A”s, fewer earned “D”s or “F”s, and the course outcomes were more equitable. The best part is that these outcomes could be achieved with minimal work from the instructor, as non-exam activities are already part of the standard physics curriculum.

The authors acknowledge that a key assumption of their thought experiments is that students and instructors wouldn’t behave differently if the grade weighting were different. For example, if homework were worth more, students may feel pressured to misuse online resources to ensure they are getting full points, and if participation were valued more in the final grade, instructors may feel that students need to meet a higher threshold to earn full points.

Nevertheless, this study calls into question whether our grading system reflects what we value. If our course learning goals include developing physics knowledge, problem solving skills, laboratory skills, professional skills, and group work skills, does our grade weighting reflect those as well?

Figures used under CC BY 4.0. Header image used under CC BY-NC-ND 4.0.
