Transforming Automated Grading Through Human Expertise
Automated grading powered by large language models (LLMs) has emerged as a breakthrough for educators, especially when it comes to evaluating open-ended student responses. Yet even advanced AI systems struggle to fully capture the nuanced judgment of human graders, particularly in complex, rubric-based assessments. Enter GradeHITL: a framework that brings human insight directly into the AI grading loop, setting a new standard for accuracy and reliability.
Key Innovations of GradeHITL
- Integrates human expertise: GradeHITL incorporates feedback from human experts, allowing LLMs to refine grading rubrics and achieve greater alignment with expert judgment.
- Interactive Q&A mechanism: The system empowers LLMs to pose clarifying questions about rubrics and grading uncertainties—questions that experts answer, driving targeted improvements.
- Reinforcement learning optimization: A reinforcement learning (RL)-based Q&A selection mechanism ensures that only high-value questions are prioritized, making human involvement both efficient and impactful (a sketch follows this list).
- Superior performance: Experiments on a challenging pedagogical dataset show that GradeHITL consistently outperforms both traditional and fully automated ASAG (Automatic Short Answer Grading) methods.
- Iterative rubric refinement: Inspired by the GradeOpt framework but enhanced with human-in-the-loop feedback, GradeHITL iteratively optimizes grading rubrics for even greater reliability and fairness.
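This summary does not spell out GradeHITL's exact RL formulation, but the selection problem can be pictured as a multi-armed bandit: each candidate Q&A pair is an arm, and the reward is the accuracy gain observed after using that pair in rubric refinement. Below is a minimal UCB1-style sketch under that assumption; `QASelector`, the reward definition, and all names are illustrative, not the framework's actual implementation.

```python
import math

class QASelector:
    """Minimal UCB1 bandit over candidate Q&A pairs.

    Hypothetical stand-in for an RL-based Q&A retriever: each
    Q&A pair is an "arm", and the reward is the change in grading
    accuracy observed after feeding that pair into rubric refinement.
    """

    def __init__(self, qa_pairs):
        self.qa_pairs = qa_pairs
        self.counts = [0] * len(qa_pairs)    # times each pair was selected
        self.values = [0.0] * len(qa_pairs)  # running mean reward per pair

    def select(self):
        """Return the index of the Q&A pair with the highest UCB1 score."""
        # Try every arm once before applying the UCB formula.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        total = sum(self.counts)
        scores = [
            v + math.sqrt(2 * math.log(total) / c)
            for v, c in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        """Fold a new reward into the running mean for pair i."""
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]
```

In use, one would call `select()`, refine the rubric with the chosen pair, measure the change in grading accuracy on a held-out set, and feed that delta back through `update()`.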
Why Human-in-the-Loop Matters
Automated grading systems offer enormous promise by saving educators time and enabling more frequent, consistent assessment. However, rubrics often contain domain-specific language and ambiguities that purely automated systems struggle to interpret. By actively involving human experts, GradeHITL not only improves grading accuracy but also ensures assessments are fair, interpretable, and aligned with educational standards. This approach helps reduce grading bias, offers timely feedback, and supports the development of clearer grading guidelines—benefitting both students and instructors.
The implications stretch beyond education. The human-in-the-loop (HITL) model can be adapted for other fields requiring subjective text evaluation, such as survey analysis, creative writing feedback, or legal review, wherever nuanced understanding is essential.
How GradeHITL Works: The Iterative Optimization Process
- Grading: The LLM grades responses using Chain-of-Thought (CoT) prompting to make its reasoning transparent.
- Inquiring: When uncertain, the LLM generates questions about rubrics or its grading decisions. These are prioritized based on confidence, and domain experts provide answers.
- Validation: The impact of each Q&A pair on grading accuracy is assessed, filtering out less useful interactions.
- Optimizing: A multi-agent system, comprising a Retriever (which uses RL to select valuable Q&A pairs), a Reflector (which analyzes errors and suggests improvements), and a Refiner (which updates the rubric), iteratively refines the grading standards. This loop continues, leveraging both AI and expert feedback, until optimal performance is achieved (a sketch of the loop follows this list).
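To make the control flow concrete, here is a minimal sketch of one possible shape for this grade-inquire-validate-optimize loop. Every callable (`grade`, `inquire`, `expert_answer`, `validate`, `retrieve`, `reflect`, `refine`) is a hypothetical stand-in for the corresponding GradeHITL component, and the stopping rule is simplified to "stop when accuracy no longer improves."

```python
from dataclasses import dataclass, field

@dataclass
class OptimizationState:
    rubric: str
    qa_pool: list = field(default_factory=list)  # validated Q&A pairs
    best_accuracy: float = 0.0

def optimize_rubric(state, grade, inquire, expert_answer,
                    validate, retrieve, reflect, refine,
                    max_iters=10):
    """Illustrative shape of the iterative loop (all names hypothetical).

    grade(rubric)          -> (predictions, accuracy) on a labeled dev set
    inquire(rubric, preds) -> clarifying questions where the LLM is uncertain
    expert_answer(q)       -> a human expert's answer to question q
    validate(qa, rubric)   -> True if the Q&A pair improves grading accuracy
    retrieve(pool)         -> RL-selected subset of high-value Q&A pairs
    reflect(preds, qa)     -> analysis of grading errors plus suggestions
    refine(rubric, r)      -> a revised rubric incorporating the reflection
    """
    for _ in range(max_iters):
        # Grading: score responses with the current rubric.
        predictions, accuracy = grade(state.rubric)
        if accuracy <= state.best_accuracy:
            break  # simplified stopping rule: no further improvement
        state.best_accuracy = accuracy

        # Inquiring: collect expert answers to the LLM's questions.
        for question in inquire(state.rubric, predictions):
            qa = (question, expert_answer(question))
            # Validation: keep only Q&A pairs that help accuracy.
            if validate(qa, state.rubric):
                state.qa_pool.append(qa)

        # Optimizing: Retriever -> Reflector -> Refiner.
        selected = retrieve(state.qa_pool)
        reflection = reflect(predictions, selected)
        state.rubric = refine(state.rubric, reflection)
    return state.rubric
```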
This design ensures that each cycle brings the system closer to human-level evaluation, while a beam search over candidate rubrics helps the optimizer avoid settling on a locally suboptimal revision.
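Beam search here means keeping several candidate rubrics alive at each refinement step rather than greedily committing to a single revision path. A minimal sketch under that assumption, with hypothetical `propose` and `score` callables:

```python
def beam_search_rubrics(seed_rubric, propose, score, beam_width=3, depth=4):
    """Hypothetical beam search over rubric revisions.

    propose(rubric) -> list of candidate revisions (e.g., from the Refiner)
    score(rubric)   -> grading accuracy (or Cohen's Kappa) on a dev set
    """
    beam = [(score(seed_rubric), seed_rubric)]
    for _ in range(depth):
        # Keep parents in the pool so the beam never regresses.
        candidates = list(beam)
        for _, rubric in beam:
            for revised in propose(rubric):
                candidates.append((score(revised), revised))
        # Retain only the beam_width highest-scoring rubrics.
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]  # best rubric found
```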
Evidence from Real-World Testing
In rigorous testing on a pedagogical dataset assessing teachers' mathematics education knowledge, GradeHITL demonstrated clear advantages. LLM-based models already outperformed traditional, non-LLM methods, but GradeHITL delivered the highest performance, especially on key agreement metrics such as Cohen's Kappa. The RL-based Q&A retriever also proved more effective than random or semantic-similarity-based selection, ensuring that the most relevant human feedback was prioritized for rubric refinement.
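Cohen's Kappa measures agreement between two raters corrected for agreement expected by chance, which makes it a stricter yardstick than raw accuracy for comparing model grades against expert grades. The snippet below shows how the metric is computed with scikit-learn on made-up grades; the dataset's actual numbers are not reproduced here.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels: expert grades vs. model grades for the same responses.
human_grades = [2, 1, 0, 2, 1, 1, 0, 2]
model_grades = [2, 1, 0, 1, 1, 1, 0, 2]

# Cohen's Kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
# agreement and p_e is the chance agreement implied by each rater's
# label frequencies.
kappa = cohen_kappa_score(human_grades, model_grades)
print(f"Cohen's Kappa: {kappa:.3f}")
```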
Conclusion and Future Impact
GradeHITL marks a significant leap forward in automated grading, blending the efficiency of AI with the discernment of human experts. The framework’s interactive and reinforcement learning-driven approach leads to more accurate, consistent, and interpretable assessments, ultimately benefiting educators and learners alike. Looking ahead, this human-in-the-loop model could transform automated evaluation across education and other domains requiring subjective judgment, making AI tools smarter and more aligned with real-world expertise.