
AI Can Bring Fairness to Assessments but Are We Ready for It?

Okan Bulut
Published in Towards AI
5 min read · Feb 4, 2021


Image by Gerd Altmann from Pixabay

Last September, I came across an article on The Verge about how a group of seventh-grade students figured out an easy way to cheat on a history exam graded by artificial intelligence (AI). The article focuses on the story of a young student who receives a low score on a history assignment on Edgenuity but later figures out, with the help of her mother (who happens to be a history professor), how to trick the AI grading system and gets 100% on the rest of her assignments. A very intriguing and eye-catching story, right?

There is always more to the story than you think.

After reading the whole article, I realized that there were some gaps in the story. For example, the so-called “AI” grading system mentioned in the article turns out to be a simple keyword-matching engine, not a real AI system at all. On the Edgenuity platform, each short-answer question on the test appears to be associated with several keywords. The more of these keywords a student’s answer includes, the higher the grade becomes. So, this grading system should not have been called AI in the first place. Interestingly, another article published by The Verge in 2019 points out the misuse of the term “AI” by the press and social media. Ironically, the two articles seem to contradict each other.
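To see how easy such a system is to game, here is a minimal sketch in Python of a keyword-matching “grader.” The question, keywords, and scoring scheme below are my own hypothetical illustrations, not Edgenuity’s actual implementation:

```python
# A minimal keyword-matching "grader" in the spirit of the system described
# above. The keywords and scoring scheme are hypothetical illustrations,
# NOT Edgenuity's actual implementation.

def keyword_match_score(response: str, keywords: list[str]) -> float:
    """Return a 0-100 score based on how many expected keywords appear."""
    words = set(response.lower().split())
    hits = sum(1 for keyword in keywords if keyword in words)
    return 100 * hits / len(keywords)

keywords = ["constantinople", "1453", "ottoman", "trade"]

# A thoughtful answer and a bare keyword dump earn the same perfect score,
# which is exactly the loophole the students in the story exploited.
answer = "The Ottoman capture of Constantinople in 1453 disrupted trade routes."
dump = "constantinople 1453 ottoman trade"
print(keyword_match_score(answer, keywords))  # 100.0
print(keyword_match_score(dump, keywords))    # 100.0
```

A system like this rewards word lists rather than understanding, so appending a string of relevant terms to any answer is enough to earn full credit.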

Automated Grading with AI

At this point, one might wonder whether AI can actually be used to grade students’ written responses accurately and objectively. The short answer is yes. In my previous article on Towards AI, I talked about automated essay scoring (AES) as a promising solution for automatically grading both short-answer questions and long essays with a high degree of precision. Using natural language processing, AES derives a set of linguistic scoring rules from how human raters (e.g., teachers) grade students’ responses and then applies these rules to grade the responses of a new group of students.

Photo by Brett Jordan on Unsplash

Human raters grade responses based on their relevance to the question, their organization, and lower-level errors (e.g., grammar mistakes, typos, and punctuation errors) [1]. Unlike human raters, AES does not try to “understand” the content of a response. Instead, it looks for consistent patterns across the responses that human raters have scored similarly.
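As a rough illustration of that workflow, here is a minimal sketch in Python: surface features extracted from human-scored responses are used to fit a model, and the fitted model then applies the same learned rules to every new response. Real AES engines rely on far richer linguistic features; the toy data, features, and model below are assumptions for illustration only.

```python
# A minimal AES-style sketch: learn to reproduce human raters' scores from
# surface patterns in text, then apply the learned rules to new responses.
# Real engines use far richer linguistic features; this is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical training data: responses and the scores human raters gave them.
responses = [
    "The war began because of rising tensions between the two empires.",
    "idk it just happened",
    "Economic rivalry and alliance systems pushed both sides toward war.",
    "war bad",
]
human_scores = [4, 1, 5, 1]

# TF-IDF patterns stand in for the "consistent patterns" AES looks for.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(responses, human_scores)

# A new response is scored by the exact same learned rules, every time.
print(model.predict(["Tensions between the empires escalated into open war."]))
```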

Several testing organizations are using AES to grade writing assessments, typically those designed for English language learners. For example, Educational Testing Service (ETS) has several applications of automated scoring in low- and high-stakes assessments. Similarly, the Australian Council for Educational Research uses AES to score the Online Writing Assessment (OWA) for adults.

Objectivity and Fairness

When it comes to ease of scoring and objectivity, multiple-choice testing is often considered the most effective and enduring form of assessment [2]. Short-answer and essay-type questions, by contrast, are more difficult to grade because human raters must go through every response one by one. Such questions are also more prone to subjectivity in grading for several reasons. For example, grading consistency may change depending on the mood or energy level of a human rater, and when two or more human raters are involved, each may grade the responses with a different level of leniency.

A significant benefit of using AES to grade responses to short-answer and essay-type questions is the ability to maintain objectivity and fairness in grading. AES can consistently apply the same scoring rules to the responses of all students, regardless of their socioeconomic status, gender, and ethnic or racial background. Also, AES is not affected by the halo effect: the tendency to assign significantly higher (or lower) scores based on prior experience with a student [3].

It is worth noting that applying the same scoring rules consistently across all students does not mean that grades assigned by AES are entirely bias-free. For AES to learn the correct scoring rules, it requires a large data set in which human raters have graded students’ responses consistently.

You’re only as good as your data set. [4]

If human raters’ grading involves negative or positive bias towards a particular group of students, then grading with AES will involve the same bias, too. That is, AES, or more generally AI, cannot automatically eliminate unfairness and discrimination caused by human biases.
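The point can be shown with a small simulation; this is a sketch under fabricated assumptions, not real scoring data. If human raters’ scores carry a systematic penalty against one group, a model trained on those scores learns the penalty and passes it on to new students:

```python
# A toy simulation of bias propagation: human scores penalize group B by one
# point; a model trained on those scores reproduces the gap, even though it
# is never told group membership. All numbers here are fabricated.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
group_b = rng.integers(0, 2, n)        # 0 = group A, 1 = group B
quality = rng.normal(3.0, 1.0, n)      # true quality of each response

# Two features the grader can see: one tracks quality, the other is a style
# marker that merely correlates with group and says nothing about quality.
feat_quality = quality + rng.normal(0, 0.3, n)
feat_style = group_b + rng.normal(0, 0.1, n)

# Human scores carry a one-point penalty against group B.
human_scores = quality - 1.0 * group_b + rng.normal(0, 0.3, n)

X = np.column_stack([feat_quality, feat_style])
model = LinearRegression().fit(X, human_scores)
pred = model.predict(X)

# Same average true quality, yet predicted scores differ by about 1 point.
print(pred[group_b == 0].mean() - pred[group_b == 1].mean())
```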

Cultural Resistance

Image by Gerd Altmann from Pixabay

Currently, there is cultural resistance, some of it unconscious, to the use of AES in practice.

  • Students and teachers: Giving students the opportunity to use their own words when answering questions helps create more tailored assessments. At the same time, it creates the expectation that human raters (e.g., teachers) will read between the lines of every written response and award at least partial credit. Students might therefore worry that AES will fail to capture the depth of their responses, especially when ideas are implied rather than stated explicitly. Some teachers and instructors are also skeptical of grading with AES based on the legitimate concern that it might encourage students to use formulaic writing.
  • General public: When people hear about an intriguing AI system, there is a general tendency to look for ways to trick or deceive it. In assessments where human factors are heavily influential, it is even harder to convince the general public of the utility of AI. Dr. Les Perelman, a former MIT professor, has been highly vocal about inaccurate grading with AES, which he calls “robo-grading”. In a recent article, he talks about the BABEL Generator, a software program capable of creating complex but nonsensical text that receives perfect scores from most AES engines (a toy sketch of the trick follows this list). The news media also continue to influence public opinion on this matter. Most newspaper and magazine articles argue that AI’s understanding of human language is still shallow and that AI systems therefore fail to grasp the meaning of written text.
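To get a feel for why BABEL-style text fools scoring engines, here is a toy sketch. It is emphatically not the actual BABEL Generator, only a minimal illustration of the underlying trick: grammatical templates filled with impressive vocabulary yield fluent text with no meaning at all.

```python
# A toy illustration of the trick behind BABEL-style text: fill grammatical
# templates with impressive vocabulary to produce fluent nonsense. This is
# NOT the actual BABEL Generator, only a minimal sketch of the idea.
import random

nouns = ["paradigm", "epistemology", "assertion", "dichotomy", "postulate"]
adjectives = ["quintessential", "multifaceted", "egregious", "salient"]
verbs = ["catalyzes", "promulgates", "engenders", "obfuscates"]

def nonsense_sentence(rng: random.Random) -> str:
    """Grammatical and vocabulary-rich, yet entirely meaningless."""
    return (f"The {rng.choice(adjectives)} {rng.choice(nouns)} "
            f"{rng.choice(verbs)} the {rng.choice(adjectives)} "
            f"{rng.choice(nouns)} of our epoch.")

rng = random.Random(7)
print(" ".join(nonsense_sentence(rng) for _ in range(3)))
```

An engine that rewards vocabulary, sentence complexity, and length without checking factual content can rate paragraphs like these highly, and that is precisely the weakness Perelman’s tool exposes.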

Final Thoughts

Sooner or later, we will have to recognize that every field, including education and human assessment, is becoming increasingly automated through the use of advanced AI systems. Thus, we need to adopt a human-centered approach to build the trust of different stakeholders (e.g., students, teachers, parents, and policymakers) in the utility of AI. Promising applications such as AES foreshadow how AI might bring fairness to human assessment in the near future, but more progress is still needed.

References

[1] Madnani, N., & Cahill, A. (2018, August). Automated scoring: Beyond natural language processing. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1099–1109).

[2] Gierl, M. J., Bulut, O., Guo, Q., & Zhang, X. (2017). Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review. Review of Educational Research, 87(6), 1082–1116.

[3] Malouff, J. M., Stein, S. J., Bothma, L. N., Coulter, K., & Emmerton, A. J. (2014). Preventing halo bias in grading the work of university students. Cogent Psychology, 1(1).

[4] Dr. Vincent Ng develops AI essay grading program. The University of Texas at Dallas, Department of Computer Science. https://cs.utdallas.edu/dr-vincent-ng-develops-ai-essay-grading-program/


Professor of data science and psychometrics | Interested in learning analytics, data mining, and ML | Twitter: @drokanbulut | Website: www.okanbulut.com