It’s well-established that “learning by doing” is more effective than passive study. At VitalSource, we’re leveraging artificial intelligence to enhance this method by providing scalable formative practice through our Bookshelf CoachMe tool. As students engage with AI-generated questions, they not only practice their skills but also provide valuable feedback on the quality of the questions. Recently, VitalSource's learning science team conducted a large-scale study to analyze this feedback and uncover insights into how students interact with AI-powered learning tools.
Analyzing Student Ratings of AI-Generated Questions
Our study, presented at the 17th International Conference on Educational Data Mining and recognized with the Best Paper Award, focused on analyzing student ratings of over 800,000 automatically generated (AG) questions answered across more than 9,000 textbooks. Students rated these questions using a simple thumbs up or thumbs down feature. By combining these ratings with other clickstream data and features of the questions themselves, we created a groundbreaking data set that provided insights into how students perceive AI-generated questions. This data set is publicly available here.
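For readers who want to explore a data set of this kind, here is a minimal Python/pandas sketch of a first pass over the ratings. The file name and column names (a `rating` column holding "up"/"down" values) are placeholders for illustration, not the published schema, so adjust them to match the actual files.

```python
import pandas as pd

# Hypothetical file and column names; the public data set's actual schema may differ.
ratings = pd.read_csv("ag_question_ratings.csv")

# Each row represents one answered AG question, with an optional thumbs-up/down rating.
rated = ratings[ratings["rating"].notna()]

print("Questions answered:", len(ratings))
print("Share of answers that were rated:", f"{len(rated) / len(ratings):.1%}")
print("Thumbs-up share among rated questions:",
      f"{(rated['rating'] == 'up').mean():.1%}")
```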
The overarching research question was: What relationships does the explanatory model suggest between student ratings and AG question features?
The explanatory model includes 10 features, each corresponding to a hypothesis about how that feature might affect student ratings. Our 10 hypotheses were as follows (a rough modeling sketch appears after the list):
H1: Answering a question correctly on the first attempt will increase the chance of a thumbs up and decrease the chance of a thumbs down.
H2: As a student answers more questions, the chance of giving a rating (thumbs up or down) will decrease.
H3: Receiving a spelling correction suggestion for an answer will increase the chance of a thumbs up and decrease the chance of a thumbs down.
H4: Questions created from more important sentences in the textbook will receive more thumbs up and fewer thumbs down.
H5: Questions with answer words that are more important in the textbook will receive more thumbs up and fewer thumbs down.
H6: Questions with noun and adjective answer words will receive more thumbs up and fewer thumbs down than questions with verb and adverb answer words.
H7: Questions with rarer words as the answer will receive more thumbs up and fewer thumbs down than questions with more common words as the answer.
H8: Questions where the answer blank occurs early in the sentence will receive fewer thumbs up and more thumbs down.
H9: Questions that give elaborative feedback after an incorrect answer will receive more thumbs up and fewer thumbs down than questions that give only outcome feedback.
H10: Questions that have been reviewed by a human reviewer before inclusion will receive more thumbs up and fewer thumbs down than questions that did not have human review.
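To make the idea of an explanatory model concrete, the sketch below fits a simple logistic regression relating question features to the chance of a thumbs up. The feature column names loosely mirror H1 through H10 but are invented for illustration, and the study's actual model and analysis may differ from this simplified approach.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature columns loosely mirroring H1-H10; names are illustrative only.
FEATURES = [
    "correct_first_attempt",        # H1
    "questions_answered_so_far",    # H2
    "spelling_suggestion_shown",    # H3
    "sentence_importance",          # H4
    "answer_word_importance",       # H5
    "answer_is_noun_or_adj",        # H6
    "answer_word_rarity",           # H7
    "blank_position_in_sentence",   # H8
    "elaborative_feedback",         # H9
    "human_reviewed",               # H10
]

# Keep only answered questions that actually received a rating.
df = pd.read_csv("ag_question_ratings.csv").dropna(subset=["rating"])
X = df[FEATURES]
y = (df["rating"] == "up").astype(int)  # 1 = thumbs up, 0 = thumbs down

# Standardize features so the fitted coefficients are roughly comparable in scale.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# The sign and size of each coefficient hint at how a feature is associated
# with the chance of a thumbs up in this simplified setup.
coefs = pd.Series(model[-1].coef_[0], index=FEATURES).sort_values()
print(coefs)
```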
So what did we find?
Key Findings
Improving the Future of AI-Generated Learning
The relationships this study uncovered between question features and student ratings offer practical insights for refining AI-generated questions. By focusing on factors associated with negative ratings, such as the importance of the answer word or sentence and the type of answer term used, we can further develop our automatic question generation (AQG) system to better meet student needs and reduce the likelihood of negative feedback.
As we continue to explore the intersection of AI and education, the feedback provided by students themselves becomes an invaluable resource. Their ratings help us better understand which features drive engagement and which can be improved to support more effective learning.
To learn more about VitalSource’s ongoing efforts to advance learning science, visit our Learning Science Research Center.