Item response theory (IRT) is a widely used framework for the design, scoring, and scaling of measurement instruments. IRT models are typically applied to dichotomously scored questions, which have only two score points (e.g., multiple-choice items). However, with the growing use of instruments whose questions have multiple response categories, such as surveys, questionnaires, and psychological scales, polytomous IRT models are increasingly used in education and psychology. This study demonstrates the application of explanatory IRT models to polytomous item responses in order to explain common variability in item clusters, person groups, and interactions between item clusters and person groups. Explanatory forms of several IRT models, such as the Partial Credit Model and the Rating Scale Model, are demonstrated, and the estimation procedures for these models are explained. The findings suggest that explanatory IRT models can be more parsimonious than traditional IRT models for polytomous data when items and persons share common characteristics. By estimating fewer item parameters, explanatory forms of polytomous IRT models can provide more information about response patterns in item responses.
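As a rough illustration of the idea summarized above, the following Python sketch computes category probabilities under the Partial Credit Model and then shows one possible explanatory variant in which step difficulties are built from item features rather than estimated separately for each item. The function names, the feature coding, and all numeric values are assumptions made for this example; they are not taken from the study, and the article's own estimation approach is not reproduced here.

```python
import math


def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the Partial Credit Model.

    theta: person ability (logit scale).
    step_difficulties: one difficulty per step between adjacent categories,
        so an item with m steps has m + 1 response categories.
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


def explanatory_steps(item_features, feature_effects, step_offsets):
    """Hypothetical explanatory form (an assumption for illustration).

    The item location is a weighted sum of item features, and the step
    offsets are shared across items, as in a rating-scale structure.
    Items with the same features therefore share the same parameters.
    """
    location = sum(x * b for x, b in zip(item_features, feature_effects))
    return [location + tau for tau in step_offsets]


# Illustrative values only: two binary item features, three categories.
steps = explanatory_steps([1, 0], [0.4, -0.2], [-0.5, 0.5])
probs = pcm_probabilities(0.5, steps)
print([round(p, 3) for p in probs])
```

In this sketch, a test with many items needs only one effect per item feature plus one set of shared step offsets, which is the sense in which an explanatory model can be more parsimonious than estimating every item's steps separately.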
International Journal of Assessment Tools in Education