Psychometric Properties of ChatGPT-4o-Generated Test Items in Social Science Education: A Multi-Dimensional Evaluation in Secondary Schools in Kaduna State Metropolis, Nigeria
Keywords:
Psychometric evaluation, Social Science education, item difficulty, item discrimination, higher-order thinking, ChatGPT-4o
Abstract
The integration of artificial intelligence (AI) tools, specifically large language models
(LLMs), into educational assessment has introduced new possibilities and challenges for test
development in Nigerian secondary schools. This study evaluated the psychometric properties
of test items generated by ChatGPT-4o in Social Science subjects in secondary schools in
Kaduna State Metropolis, Nigeria, thereby contributing an empirically grounded,
context-specific baseline for evidence-informed AI-assisted assessment practice in the Nigerian
secondary school system. A descriptive survey research design was adopted, with a sample
of 120 Social Science teachers selected through stratified random sampling from 30 public
secondary schools in Kaduna Metropolis. Three validated instruments were utilised: a
Teachers' Perception Questionnaire on AI-Generated Test Items (TPQAGTI; S-CVI = .84),
an AI-Generated Item HOTS Classification Protocol (AIHCP), and an Item Difficulty
Analysis Record (IDAR). A pool of 200 AI-generated test items was produced using
ChatGPT-4o and administered to 240 Senior Secondary School (SSS) III students across four
selected schools in Kaduna Metropolis. Data were analysed using descriptive statistics,
one-way Analysis of Variance (ANOVA), a chi-square goodness-of-fit test, and the Kruskal-Wallis H
test. Results revealed that teachers held generally positive perceptions of AI-generated test
items (grand mean = 3.55), with significant inter-group differences across experience levels
[F(3, 116) = 3.42, p = .020, η² = .08]. The majority of AI-generated items (71.0%) operated
at lower-order thinking levels, with a statistically significant non-uniform cognitive
distribution (χ² = 48.85, df = 5, p < .001). Just over half of the items (52.5%) fell in the
moderate difficulty range, and the overall mean discrimination index was D = .26, indicating generally
marginal discriminatory power. Findings are limited to Kaduna State Metropolis and to items
generated by a single LLM (ChatGPT-4o); generalisability to other geographic contexts or
AI platforms requires further investigation. The study concludes that while
ChatGPT-4o-generated test items demonstrate acceptable difficulty calibration, they exhibit marginal
discrimination and require deliberate HOTS-aligned prompt engineering and professional
human oversight before deployment in the Nigerian secondary school context.
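
To make the two item statistics reported above concrete, the following is a minimal sketch of how an item difficulty index (proportion of students answering correctly) and a discrimination index D might be computed from dichotomously scored responses. It assumes the classical upper-lower group method with 27% groups; the abstract does not state which procedure the study used, so the function name `item_analysis`, the `group_fraction` parameter, and the toy data are all hypothetical.

```python
def item_analysis(responses, group_fraction=0.27):
    """Classical item analysis on 0/1 scored responses.

    responses: list of rows, one row per student, one column per item
               (1 = correct, 0 = incorrect).
    group_fraction: share of students placed in each of the upper and
               lower groups (0.27 is a common convention, assumed here).
    Returns (difficulty, discrimination), one value per item.
    """
    n_students = len(responses)
    n_items = len(responses[0])

    # Difficulty index p: proportion of all students answering the item correctly.
    difficulty = [
        sum(row[j] for row in responses) / n_students for j in range(n_items)
    ]

    # Rank students by total score, then take the bottom and top groups.
    ranked = sorted(responses, key=sum)
    k = max(1, round(group_fraction * n_students))
    lower, upper = ranked[:k], ranked[-k:]

    # Discrimination index D: proportion correct in the upper group
    # minus proportion correct in the lower group.
    discrimination = [
        sum(row[j] for row in upper) / k - sum(row[j] for row in lower) / k
        for j in range(n_items)
    ]
    return difficulty, discrimination


# Toy data (hypothetical): 4 students, 3 items.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
p, d = item_analysis(toy, group_fraction=0.5)
print(p)  # [0.75, 0.5, 0.25]
print(d)  # [0.5, 1.0, 0.5]
```

Under this method, a mean D of .26 means that the upper group answered a typical item correctly about 26 percentage points more often than the lower group. With very small samples the upper and lower groups can overlap, so the sketch is only meaningful when the number of students is reasonably large.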