Enhancing Automated Grading with Human Insight: The GradeHITL Framework
LLM-based Automated Grading with Human-in-the-Loop

Today’s academic environment is as challenging as ever, and providing teachers with tools that facilitate education is more important than ever before. Grading assignments, especially open-ended text responses, is a time-consuming task that artificial intelligence could significantly alleviate.
One promising application is automatic short answer grading (ASAG), which aims to reduce the heavy workload of educators by automatically evaluating open-ended textual responses.
While LLMs have shown remarkable progress in ASAG, even surpassing traditional methods in complex scenarios like rubric-based evaluation, they still struggle to consistently achieve human-level accuracy when relying solely on automated processes.
This research delves into the potential of leveraging the interactive capabilities of LLMs through a human-in-the-loop (HITL) approach, proposing a novel framework named GradeHITL to bridge this gap and bring automated grading closer to the nuanced judgment of human experts.
Key Takeaways:
- GradeHITL significantly improves the accuracy of LLM-based automatic short answer grading by incorporating human expert feedback to refine grading rubrics.
- By enabling LLMs to actively pose questions about rubrics and grading errors to human experts, GradeHITL adaptively optimizes the grading process.
- The framework utilizes a reinforcement learning (RL)-based Q&A selection method to filter out low-quality questions, ensuring that human input is focused and impactful.
- Experimental results on a pedagogical dataset demonstrate that GradeHITL outperforms existing fully automated prompt optimization methods and traditional ASAG baselines.
- The RL-based Q&A retriever within GradeHITL proves more effective in selecting valuable human feedback compared to heuristic methods like random selection and semantic similarity.
- GradeHITL's iterative optimization process, inspired by the prior GradeOpt framework but enhanced with HITL, allows for continuous improvement of grading accuracy and rubric reliability.
Overview
The advent of artificial intelligence (AI) technologies, particularly large language models (LLMs), is revolutionizing numerous fields, with education being a prominent one. Within education, the application of LLMs holds immense potential, ranging from personalized learning experiences to providing teaching assistance.
A particularly impactful area is automatic short answer grading (ASAG), which focuses on the evaluation of open-ended textual responses. Traditional ASAG methods relied on handcrafted features and pattern-matching techniques using statistical algorithms.
The rise of deep learning and advanced natural language processing (NLP) techniques, including transformer networks and pre-trained language models (PLMs) such as BERT and RoBERTa, enabled semantic-based grading: rather than engineering features by hand, a pre-trained model is fine-tuned on graded examples, substantially reducing the amount of labeled data required.
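To make this concrete, here is a minimal sketch of such a PLM-based grader: fine-tuning RoBERTa as a three-class classifier over (question, response) pairs with Hugging Face transformers. The checkpoint, label count, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: fine-tuning RoBERTa for semantic short-answer grading.
# Assumes a three-point grading scale (labels 0-2); hyperparameters are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()

def grade_step(question: str, response: str, label: int) -> float:
    """One training step on a (question, response, score) example."""
    # Encode the question and the student response as a sentence pair.
    inputs = tokenizer(question, response, truncation=True, return_tensors="pt")
    out = model(**inputs, labels=torch.tensor([label]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```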
However, despite these advancements, fully automated LLM-powered ASAG systems face challenges in achieving human-level grading performance, especially in rubric-based assessments. Rubric texts often contain jargon or domain-specific terms that lack clear explanations, hindering the ability of purely automated methods to accurately interpret them based solely on labeled examples.
The inherent complexity and variability of language expression can lead to inconsistencies in grading outcomes, where minor variations in input can cause significant changes in output. These limitations highlight the necessity of integrating human feedback and domain knowledge into the ASAG process.
This research introduces GradeHITL, a novel LLM-powered ASAG framework that incorporates a human-in-the-loop (HITL) design to address these issues.
Unlike fully automated approaches, GradeHITL leverages the interactive features of LLMs, enabling them not only to output grades but also to actively ask questions about the rubrics or their own grading errors.
By incorporating answers from human experts, the framework adaptively optimizes the grading rubrics, leading to substantial improvements in grading performance and ensuring more controllable grading standards, which is crucial in educational applications. The framework consists of three key components: Grading, Inquiring, and Optimizing, as illustrated in Figure 1.
Why it’s Important
The development of accurate and reliable automated short answer grading systems holds significant implications for the education domain and beyond. By reducing the time and effort required for manual grading, ASAG can alleviate the workload of educators, allowing them to dedicate more time to other crucial aspects of teaching, such as curriculum development and personalized student support.
The integration of human expertise into the automated grading process, as proposed by GradeHITL, is particularly important for ensuring fairness, consistency, and interpretability of assessments, especially in complex domains requiring nuanced understanding.
Within the education field, accurate ASAG can facilitate more frequent assessments, providing timely feedback to students and instructors, which can enhance the learning process.
In high-stakes examinations or large-scale educational assessments, reliable automated grading can improve efficiency and potentially reduce grading biases. Moreover, the insights gained from human-LLM interactions during the rubric refinement process in GradeHITL can contribute to the development of clearer and more effective grading guidelines.
Beyond education, the principles of HITL-enhanced automated text evaluation can be applied to other domains involving subjective assessments of textual data, such as evaluating open-ended survey responses, providing feedback on creative writing, or even in legal document review where nuanced understanding and interpretation are critical.
The framework's ability to learn from expert feedback and refine evaluation criteria iteratively could be valuable in any scenario where automated systems need to align with human judgment on complex textual tasks.
Summary of Results
The effectiveness of GradeHITL was evaluated through comprehensive experiments on a pedagogical dataset designed to assess teachers’ knowledge of mathematics education.
This dataset included three types of questions:
- knowledge of mathematics teaching (C1),
- knowledge of students’ mathematical thinking (C2), and
- knowledge of tasks (C3).
Two representative questions were selected from each category, and responses were graded on a three-point scale by expert annotators. The dataset was split into training (80%) and testing (20%) sets for each question.
Table 1. Comparison of GradeHITL with baseline models. The best-performing model for each metric is marked in bold; the second best is underlined.
Table 1 presents a comparison of GradeHITL's performance against several baseline models, including non-LLM methods (SBERT with logistic regression and RoBERTa with fine-tuning) and LLM-based grading models (naive prompting, APO, and GradeOpt).
The results clearly show that all LLM-based algorithms outperformed the non-LLM methods, highlighting the inherent advantages of leveraging LLMs for grading tasks.
Notably, optimized prompting methods (APO, GradeOpt, and GradeHITL) exhibited higher average performance metrics and lower variance compared to naive prompting, underscoring the importance of rubric optimization.
Most importantly, GradeHITL consistently achieved the highest performance across most metrics, particularly Cohen’s Kappa (κc) and Quadratic Weighted Kappa (κw), demonstrating the significant positive impact of incorporating human-in-the-loop interactions into the rubric optimization process.
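For context, the stronger non-LLM baseline in Table 1 pairs SBERT embeddings with logistic regression. A minimal sketch of that pipeline, together with scikit-learn computations of the two kappa metrics, follows; the SBERT checkpoint and the toy data are placeholders of mine, not the paper's configuration.

```python
# Sketch of the SBERT + logistic regression baseline and the two agreement metrics.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score

# Toy placeholder data; in practice these are the graded teacher responses.
train_texts = ["explains the misconception", "partially correct idea", "off-topic answer"]
train_scores = [2, 1, 0]
test_texts = ["clearly explains the misconception", "unrelated answer"]
test_scores = [2, 0]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative SBERT checkpoint
clf = LogisticRegression(max_iter=1000).fit(encoder.encode(train_texts), train_scores)
preds = clf.predict(encoder.encode(test_texts))

kappa_c = cohen_kappa_score(test_scores, preds)                       # Cohen's Kappa
kappa_w = cohen_kappa_score(test_scores, preds, weights="quadratic")  # Quadratic Weighted Kappa
```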
To assess the effectiveness of the RL-based Q&A selection component (the paper’s second research question, RQ2), ablation studies compared it with two heuristic approaches: random selection and semantic-similarity selection.
Table 2. Comparison of different Q&A Retrievers: Random (RD), Semantic Similarity (SS), and Reinforcement Learning (RL). The best-performing model for each metric is highlighted in bold, while the second-best is underlined.
Table 2 presents the performance comparison of these methods when integrated into GradeHITL. The semantic similarity-based retrieval outperformed random selection in most cases, indicating the value of selecting relevant human feedback.
However, the RL-based retriever consistently outperformed both heuristic methods across all six questions. This suggests that by learning from the rewards associated with different Q&A pairs, the RL-based approach can more effectively identify and prioritize the most valuable human input for rubric refinement, leading to improved grading accuracy.
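This summary does not spell out the exact RL formulation, but the core idea of learning which Q&A pairs are worth retrieving can be sketched as a multi-armed bandit: each Q&A pair is an arm, and the reward is the change in grading accuracy observed after the pair is used to refine the rubric. The epsilon-greedy policy and reward definition below are illustrative assumptions, not the paper's method.

```python
# Hedged sketch of an RL-style Q&A retriever modeled as a multi-armed bandit.
import random
from collections import defaultdict

class BanditQARetriever:
    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.value = defaultdict(float)  # running mean reward per Q&A pair
        self.count = defaultdict(int)

    def select(self, qa_ids: list[str]) -> str:
        """Pick a Q&A pair: explore at random, otherwise exploit the best estimate."""
        if random.random() < self.epsilon:
            return random.choice(qa_ids)
        return max(qa_ids, key=lambda q: self.value[q])

    def update(self, qa_id: str, reward: float) -> None:
        """Incremental mean update with the observed grading-accuracy gain."""
        self.count[qa_id] += 1
        self.value[qa_id] += (reward - self.value[qa_id]) / self.count[qa_id]
```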
GradeHITL Mechanism
The GradeHITL framework operates through an iterative optimization process involving Grading, Inquiring, and Optimizing components.
- The Grading component uses Chain-of-Thought (CoT) prompting to make the LLM's reasoning transparent.
- The Inquiring component generates questions about the rubric based on the LLM’s uncertainties and prioritizes them by grading confidence, with human experts (or, in some experimental settings, another LLM acting as an Answerer) providing answers.
- A validation step filters out ineffective Q&A pairs based on their impact on grading accuracy.
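To make the Grading and Inquiring steps concrete, here is a hedged sketch of what the two prompts might look like. `call_llm` is a hypothetical stand-in for whatever chat-completion client is used, and the score/confidence output format is my assumption rather than the paper's exact prompt.

```python
# Hypothetical sketch of the Grading (CoT) and Inquiring prompts.
def call_llm(prompt: str) -> str:
    """Hypothetical stub: plug in your preferred LLM client here."""
    raise NotImplementedError

def grade_with_cot(rubric: str, question: str, response: str) -> str:
    """Grading component: CoT prompting makes the model's reasoning visible."""
    prompt = (
        f"Rubric:\n{rubric}\n\nQuestion:\n{question}\n\nResponse:\n{response}\n\n"
        "Think step by step: explain how each rubric criterion applies to this "
        "response, then output a score from 0 to 2 and a confidence from 0 to 1."
    )
    return call_llm(prompt)

def generate_inquiry(rubric: str, graded_example: str) -> str:
    """Inquiring component: surface rubric ambiguities as questions for an expert."""
    prompt = (
        f"Rubric:\n{rubric}\n\nA low-confidence grading case:\n{graded_example}\n\n"
        "List the rubric terms or criteria you are uncertain about, phrased as "
        "questions a human expert could answer."
    )
    return call_llm(prompt)
```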
The Optimizing component, inspired by GradeOpt, employs a multi-agent system (Retriever, Reflector, and Refiner) to refine the rubric using the validated human-computer Q&A interactions.
- The Retriever selects valuable Q&A pairs using the RL-based policy.
- The Reflector analyzes grading errors and proposes rubric improvements.
- The Refiner generates a new, optimized rubric by appending Adaptation Rules (Gar).
This iterative process, with outer, middle, and inner loops, aims to continuously enhance grading accuracy and rubric consistency, employing a beam-search strategy to avoid getting trapped in local optima.
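Putting the pieces together, the optimization loop with beam search might look roughly like the sketch below, reusing the bandit retriever from the earlier sketch. `grade_accuracy`, `reflect`, and `refine` are hypothetical stubs for the Grading, Reflector, and Refiner components, and the beam width and iteration budget are illustrative.

```python
# Hedged sketch of rubric optimization with beam search over candidate rubrics.
def grade_accuracy(rubric: str, dataset) -> float:
    """Stub: run the Grading component over `dataset` and return accuracy."""
    raise NotImplementedError

def reflect(rubric: str, qa_pair: str, dataset) -> str:
    """Stub: Reflector agent analyzes grading errors in light of the Q&A pair."""
    raise NotImplementedError

def refine(rubric: str, critique: str) -> str:
    """Stub: Refiner agent appends adaptation rules to produce a new rubric."""
    raise NotImplementedError

def optimize_rubric(rubric, retriever, qa_pool, train_set,
                    beam_width=3, iterations=5):
    beam = [rubric]
    for _ in range(iterations):
        candidates = list(beam)
        for r in beam:
            qa = retriever.select(list(qa_pool))      # RL-based Retriever
            critique = reflect(r, qa, train_set)      # Reflector: error analysis
            new_rubric = refine(r, critique)          # Refiner: adaptation rules
            # Reward the Q&A pair by the accuracy gain it produced.
            reward = grade_accuracy(new_rubric, train_set) - grade_accuracy(r, train_set)
            retriever.update(qa, reward)
            candidates.append(new_rubric)
        # Keep the top-k rubrics so a single bad refinement cannot trap the search.
        beam = sorted(set(candidates),
                      key=lambda r: grade_accuracy(r, train_set),
                      reverse=True)[:beam_width]
    return beam[0]
```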
Conclusion
This research successfully introduces GradeHITL, a novel human-in-the-loop framework that significantly enhances the performance of LLM-based automatic short answer grading.
By effectively integrating human expert feedback through an interactive questioning process and leveraging reinforcement learning for optimal feedback selection, GradeHITL overcomes limitations of purely automated approaches, bringing ASAG closer to human-level evaluation.
The experimental results on a challenging pedagogical dataset provide compelling evidence for the effectiveness of this approach, demonstrating improvements in both grading accuracy and rubric alignment compared to existing state-of-the-art methods.
The findings underscore the critical role of incorporating human intelligence into automated assessment systems to achieve robust and reliable grading outcomes. Future work could explore the application of GradeHITL to other types of assessments and in different educational contexts.