Psychometric Properties of ChatGPT-4o-Generated Test Items in Social Science Education: A Multi-Dimensional Evaluation in Secondary Schools in Kaduna State Metropolis, Nigeria

Authors

  • Dr. Bashir Mamman, Kaduna State University, Kaduna-Nigeria

Keywords:

Psychometric evaluation, Social Science education, item difficulty, item discrimination, higher-order thinking, ChatGPT-4o

Abstract

The integration of artificial intelligence (AI) tools, specifically large language models 
(LLMs), into educational assessment has introduced new possibilities and challenges for test 
development in Nigerian secondary schools. This study evaluated the psychometric properties 
of test items generated by ChatGPT-4o in Social Science subjects in secondary schools in 
Kaduna State Metropolis, Nigeria, thereby contributing an empirically grounded, context-specific 
baseline for evidence-informed AI-assisted assessment practice in the Nigerian 
secondary school system. A descriptive survey research design was adopted, with a sample 
of 120 Social Science teachers selected through stratified random sampling from 30 public 
secondary schools in Kaduna Metropolis. Three validated instruments were utilised: a 
Teachers' Perception Questionnaire on AI-Generated Test Items (TPQAGTI; S-CVI = .84), 
an AI-Generated Item HOTS Classification Protocol (AIHCP), and an Item Difficulty 
Analysis Record (IDAR). A pool of 200 AI-generated test items was produced using 
ChatGPT-4o and administered to 240 Senior Secondary School (SSS) III students across four 
selected schools in Kaduna Metropolis. Data were analysed using descriptive statistics, one-way 
Analysis of Variance (ANOVA), chi-square goodness-of-fit test, and Kruskal-Wallis H 
test. Results revealed that teachers held generally positive perceptions of AI-generated test 
items (grand mean = 3.55), with significant inter-group differences across experience levels 
[F(3, 116) = 3.42, p = .020, η² = .08]. The majority of AI-generated items (71.0%) operated 
at lower-order thinking levels, with a statistically significant non-uniform cognitive 
distribution (χ² = 48.85, df = 5, p < .001). Item difficulty indices were predominantly 
moderate (52.5%), with an overall mean discrimination index of D = .26, indicating generally 
marginal discriminatory power. Findings are limited to Kaduna State Metropolis and to items 
generated by a single LLM (ChatGPT-4o); generalisability to other geographic contexts or 
AI platforms requires further investigation. The study concludes that while ChatGPT-4o-generated 
test items demonstrate acceptable difficulty calibration, they exhibit marginal 
discrimination and require deliberate HOTS-aligned prompt engineering and professional 
human oversight before deployment in the Nigerian secondary school context. 
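
The difficulty and discrimination indices reported above can be illustrated with a minimal sketch of classical item analysis. It assumes dichotomously scored items (1 = correct, 0 = incorrect) and the common upper-lower 27% grouping convention; the abstract does not state which grouping the study used, so the `fraction` parameter, the function names, and the toy data are illustrative assumptions, not the study's procedure or data. The last helper recovers eta squared from a reported F ratio and its degrees of freedom, which allows a quick consistency check of the ANOVA result quoted in the abstract.

```python
from typing import Sequence


def item_difficulty(item: Sequence[int]) -> float:
    """Difficulty index p: proportion of examinees answering correctly."""
    return sum(item) / len(item)


def discrimination_index(
    item: Sequence[int],
    totals: Sequence[float],
    fraction: float = 0.27,  # assumed grouping; not stated in the abstract
) -> float:
    """Upper-lower discrimination index D = p_upper - p_lower.

    Examinees are ranked by total test score; the top and bottom
    `fraction` of examinees form the upper and lower groups.
    """
    n = len(item)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: totals[i])  # ascending by total
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item[i] for i in upper) / k
    p_lower = sum(item[i] for i in lower) / k
    return p_upper - p_lower


def eta_squared_from_f(f: float, df_between: int, df_within: int) -> float:
    """Eta squared for a one-way ANOVA, recovered from F and its dfs."""
    return (f * df_between) / (f * df_between + df_within)


# Hypothetical toy data: ten examinees, one item, and their total scores.
toy_item = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
toy_totals = [95, 90, 85, 80, 70, 60, 55, 40, 30, 20]

p = item_difficulty(toy_item)
d = discrimination_index(toy_item, toy_totals)

# Consistency check on the reported F(3, 116) = 3.42.
eta_sq = eta_squared_from_f(3.42, 3, 116)
```

Applied to the reported F(3, 116) = 3.42, the last helper gives approximately .08, which agrees with the η² value stated in the abstract.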

Author Biography

  • Dr. Bashir Mamman, Kaduna State University, Kaduna-Nigeria

    Department of Educational Foundations, Faculty of Education, KASU

Published

2026-06-08