It’s well-established that “learning by doing” is more effective than passive study. At VitalSource, we’re leveraging artificial intelligence to enhance this method by providing scalable formative practice through our Bookshelf CoachMe tool. As students engage with AI-generated questions, they not only practice their skills but also provide valuable feedback on the quality of the questions. Recently, VitalSource's learning science team conducted a large-scale study to analyze this feedback and uncover insights into how students interact with AI-powered learning tools.
Analyzing Student Ratings of AI-Generated Questions
Our study, presented at the 17th International Conference on Educational Data Mining and recognized with the Best Paper Award, focused on analyzing student ratings of over 800,000 automatically generated (AG) questions answered across more than 9,000 textbooks. Students rated these questions using a simple thumbs up or thumbs down feature. By combining these ratings with other clickstream data and features of the questions themselves, we created a groundbreaking data set that provided insights into how students perceive AI-generated questions. This data set is publicly available here.
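For readers who want to explore a data set of this kind, here is a minimal Python/pandas sketch of a first pass over the ratings. The file name and column names (a `rating` column holding "up"/"down" values) are placeholders for illustration, not the published schema, so adjust them to match the actual files.

```python
import pandas as pd

# Hypothetical file and column names; the public data set's actual schema may differ.
ratings = pd.read_csv("ag_question_ratings.csv")

# Each row represents one answered AG question, with an optional thumbs-up/down rating.
rated = ratings[ratings["rating"].notna()]

print("Questions answered:", len(ratings))
print("Share of answers that were rated:", f"{len(rated) / len(ratings):.1%}")
print("Thumbs-up share among rated questions:",
      f"{(rated['rating'] == 'up').mean():.1%}")
```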
The overarching research question was: What relationships does the explanatory model suggest between student ratings and AG question features?
The explanatory model includes 10 features, each corresponding to a hypothesis about how that feature might affect student ratings. Our 10 hypotheses were as follows (a rough modeling sketch appears after the list):
H1: Answering a question correctly on the first attempt will increase the chance of a thumbs up and decrease the chance of a thumbs down.
H2: As a student answers more questions, the chance of giving a rating (thumbs up or down) will decrease.
H3: Receiving a spelling correction suggestion for an answer will increase the chance of a thumbs up and decrease the chance of a thumbs down.
H4: Questions created from more important sentences in the textbook will receive more thumbs up and fewer thumbs down.
H5: Questions with answer words that are more important in the textbook will receive more thumbs up and fewer thumbs down.
H6: Questions with noun and adjective answer words will receive more thumbs up and fewer thumbs down than questions with verb and adverb answer words.
H7: Questions with rarer words as the answer will receive more thumbs up and fewer thumbs down than questions with more common words as the answer.
H8: Questions where the answer blank occurs early in the sentence will receive fewer thumbs up and more thumbs down.
H9: Questions that give elaborative feedback after an incorrect answer will receive more thumbs up and fewer thumbs down than questions that give only outcome feedback.
H10: Questions that have been reviewed by a human reviewer before inclusion will receive more thumbs up and fewer thumbs down than questions that did not have human review.
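To make the idea of an explanatory model concrete, the sketch below fits a simple logistic regression relating question features to the chance of a thumbs up. The feature column names loosely mirror H1 through H10 but are invented for illustration, and the study's actual model and analysis may differ from this simplified approach.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature columns loosely mirroring H1-H10; names are illustrative only.
FEATURES = [
    "correct_first_attempt",        # H1
    "questions_answered_so_far",    # H2
    "spelling_suggestion_shown",    # H3
    "sentence_importance",          # H4
    "answer_word_importance",       # H5
    "answer_is_noun_or_adj",        # H6
    "answer_word_rarity",           # H7
    "blank_position_in_sentence",   # H8
    "elaborative_feedback",         # H9
    "human_reviewed",               # H10
]

# Keep only answered questions that actually received a rating.
df = pd.read_csv("ag_question_ratings.csv").dropna(subset=["rating"])
X = df[FEATURES]
y = (df["rating"] == "up").astype(int)  # 1 = thumbs up, 0 = thumbs down

# Standardize features so the fitted coefficients are roughly comparable in scale.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# The sign and size of each coefficient hint at how a feature is associated
# with the chance of a thumbs up in this simplified setup.
coefs = pd.Series(model[-1].coef_[0], index=FEATURES).sort_values()
print(coefs)
```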
So what did we find?
Key Findings
Improving the Future of AI-Generated Learning
The relationships this study uncovered between question features and student ratings offer practical insights for refining AI-generated questions. By focusing on factors associated with negative ratings, such as the importance of the answer word or sentence and the type of answer term used, we can further develop our automatic question generation (AQG) system to better meet student needs and reduce the likelihood of negative feedback.
As we continue to explore the intersection of AI and education, the feedback provided by students themselves becomes an invaluable resource. Their ratings help us better understand which features drive engagement and which can be improved to support more effective learning.
To learn more about VitalSource’s ongoing efforts to advance learning science, visit our Learning Science Research Center.