Buy the pizza, have the grading party

Title: TA Marking Parties: Worth the Price of Pizza
Authors: Brian Harrington, Marzieh Ahmadzadeh, Nick Cheng, Eric Heqi Wang, Vladimir Efimov
First author institution: University of Toronto Scarborough
Source: Proceedings of the 2018 Conference on International Computing Education Research (Peer-Reviewed, closed access)

Enrollment in physics classes is often on the rise, as introductory physics courses are required for many STEM majors and some students take them as electives to meet general degree requirements. As a result, it is not uncommon to have over 1000 students enrolled in an introductory course. With so many students, grading exams becomes a time-consuming endeavor and is sometimes done as a “grading party,” in which the teaching assistants (TAs) and professors for the course meet in a room to grade together. The idea is that by grading together, the TAs and professors can discuss answers to ensure more consistent grading across the exams. However, large courses typically have a large number of TAs and professors, making it difficult to find a time when everyone can meet and grade. Today’s paper asks if the hassle is worth it: do grading parties actually provide benefits (such as increased consistency) compared to just allowing the graders to take a stack of exams and grade on their own time?

To answer this question, the researchers selected an introductory computer science course with 564 students and 20 TAs at a large Canadian university. For grading the final exam, the 20 TAs were randomly assigned to grade exams either individually or as part of a grading party. Prior to the final, the TAs had experience grading both individually and as part of a group. Each TA was then given 30 final exams to grade as well as a rubric. Unknown to the TAs, 2 of the exams in each of their piles were “fake” exams, that is, exams with solutions created by the researchers rather than by students in the class. All 40 fake exams (2 fake exams for each of the 20 TAs) had similar, subtle errors that would lose points if the grader were following the rubric carefully but would be easily missed otherwise. Two TAs who had previously taught the course many times and were not among the 20 TAs in the study had reviewed the fake exams beforehand to ensure they would all lose points for the same reasons according to the rubric and hence earn the same overall score. After completing the grading, the 20 TAs completed a survey asking about their preferences for grading, how fair they thought the grading process was, and how long the grading took.

So what happened? Well, not exactly what the researchers had hoped. As the 10 TAs who were not assigned to the grading party were free to grade wherever and whenever they wanted, 3 of them decided to meet up to grade together. Coincidentally, they chose to meet in the same room where the other 10 TAs were holding the grading party. The original excuse for not having a grading party of all 20 TAs (and thus allowing the study to be conducted) was that not all of them were able to attend, so the researchers allowed the 3 TAs to join the grading party to avoid having to disclose anything about the study. This left 13 TAs in the grading party and 7 TAs grading individually.

So what did the researchers find? First, the grading party TAs self-reported spending less time grading than the solo-grading TAs did. In addition, TAs from both the grading party and solo-grading groups preferred grading parties and felt they were fairer to the students in terms of consistency of grades.

To see if this was actually the case, the researchers counted the total number of grading errors on the fake exams, where a grading error was either awarding points for an incorrect answer or deducting points that the rubric did not call for. The TAs who solo-graded averaged 17.14 errors on the fake exams while the TAs who graded as part of the grading party averaged 11.30 errors. While the difference was statistically significant, the researchers note that in reality, some of the mistakes would have cancelled out, as incorrectly awarding a point on one part and incorrectly deducting a point on another would produce no change in the overall grade. Thus, using the number of grading errors instead of the number of incorrectly awarded or deducted points likely overestimates any differences in the final grades.
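To see why counting errors can overstate the effect on final grades, here is a minimal Python sketch. The per-item deviations are invented purely for illustration (they are not data from the paper), and the function names are hypothetical. Each value is the points a TA awarded on one rubric item minus the points the rubric calls for.

```python
def count_errors(deviations):
    """Number of rubric items where the awarded points differ from the rubric."""
    return sum(1 for d in deviations if d != 0)


def net_point_change(deviations):
    """Total change in the exam score caused by those grading errors."""
    return sum(deviations)


# Hypothetical deviations (awarded minus rubric) for one graded exam.
# +1: a point awarded that the rubric would not award.
# -1: a point taken off that the rubric would not take off.
deviations = [+1, 0, -1, 0, +1, -1, 0, 0]

errors = count_errors(deviations)
net = net_point_change(deviations)
print(f"grading errors: {errors}, net change in score: {net}")
```

In this made-up case the TA makes four grading errors, yet the exam score is unchanged because the errors offset each other. An error count and a net point change can therefore tell quite different stories about the same grader.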

Next, the researchers looked at how consistent each grader was by comparing the mistakes each TA made on their two fake exams. Remember that the fake exams were constructed so that they should receive the same score. Yet, when comparing the two fake exams, the TAs who graded solo averaged 8.57 differences between their two fake exams while the grading party TAs averaged 6.38 differences. Because the variance was large, this difference was not statistically significant. When looking at each of the four questions on the exam individually, the grading party TAs averaged one fewer error than the solo-grading TAs.
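One way to picture this consistency measure is as a count of the rubric items on which a TA's marks for the two fake exams disagree. The sketch below assumes that interpretation, which may not match the paper's exact procedure; the marks and the function name are hypothetical.

```python
def count_differences(marks_a, marks_b):
    """Count rubric items where the marks on two exams disagree."""
    if len(marks_a) != len(marks_b):
        raise ValueError("both exams must be marked on the same rubric items")
    return sum(1 for a, b in zip(marks_a, marks_b) if a != b)


# Hypothetical per-item marks one TA gave to two fake exams that,
# by construction, should have received identical marks.
fake_exam_1 = [2, 1, 0, 3, 1, 2]
fake_exam_2 = [2, 0, 0, 3, 2, 2]

differences = count_differences(fake_exam_1, fake_exam_2)
print(f"differences between the two fake exams: {differences}")
```

A perfectly consistent grader would score zero here, regardless of whether their marks actually match the rubric, which is why consistency is measured separately from the error count.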

As there were only 20 TAs, not all of the results were statistically significant, even when they showed differences in the number of errors between the grading party and solo-grading TAs. However, the results suggest that further study of how grading in groups affects consistency is warranted. In addition, having 3 TAs join the grading party may have influenced the results, as the groups were no longer randomly assigned.

So what can we take away from this paper? First, the TAs preferred to grade in a “grading party” and believed that the grading was quicker this way. When comparing the TAs who graded in groups with those who graded solo, the solo graders made more errors than the group graders. The solo graders were also less consistent on the set of exams they graded. While not all the results are statistically significant, they suggest that grading parties do in fact have benefits. Thus, even though the scheduling can be a hassle, it seems to be worth it.

Header images from Flickr used under CC BY 2.0.

Nick Young

I am a postdoc in education data science at the University of Michigan and the founder of PERbites. I’m interested in applying data science techniques to analyze educational datasets and improve higher education for all students.
