11. Cross-cultural validity and comparability in assessments of complex constructs

Kadriye Ercikan
Educational Testing Service
Han-Hui Por
Educational Testing Service
Hongwen Guo
Educational Testing Service

Assessments of 21st Century competencies, such as complex problem solving, are expected to be engaging, resemble real life tasks, draw upon multidisciplinary knowledge and skills, and provide individuals with feedback on their progress towards solving problems. Several testing innovations are used in large-scale digital assessments to meet these goals (see Chapters 5 and 7 of this report). These innovations include adaptivity based on performance on segments of the assessment, interactivity that modifies the assessment as determined by student actions during test-taking, and the use of multimedia tools and digital features of assessment environments. Interactivity and adaptivity in assessments, as well as the need for producing more engaging assessments, immediate and effective feedback to learners, and cost efficiency, can also utilise artificial intelligence (AI)-based tools such as automated item generation and AI-based automated scoring. While these innovations enhance assessments and help meet the demands placed on them, we must consider the new sources of threats to cross-cultural validity and comparability of assessment inferences that they generate. This chapter discusses these implications.

Throughout the chapter, we use assessment to describe educational and psychological measurement instruments, including surveys and questionnaires, and related documents and procedures such as instructions, scoring guidelines and administration procedures. The term test refers specifically to a test form, or is used when it is part of a common phrase such as "test equivalence" or "test adaptation". In describing the mode of assessments, we use the terms digital or innovative to refer to the broad classes of technology-based assessments (TBAs) and technology-enhanced assessments (TEAs). Furthermore, measurement/score comparability and measurement equivalence are used interchangeably and refer both to the comparability of score interpretation and use and to the statistical notions of measurement equivalence.

Students' participation in assessments has been viewed typically through a cognitive perspective which focuses on teaching, learning and performance on assessments without taking the sociocultural context into account (Pellegrino, Chudowsky and Glaser, 2001[1]). This perspective is in contrast to the sociocultural lens, also referred to as the "situated perspective" (Gee, 2008[2]), which considers learning and performance on assessment in terms of the relationship between individuals and the social environment and the context in which they think, feel, act and interact (Moss et al., 2008[3]).

There is extensive research evidence that social and cultural contexts can affect learning and world views, including how success is perceived, how students are taught and how achievement is defined in education systems. For example, Yup'ik children in rural Alaska learn critical community practices such as fishing and navigation from observing and participating in these activities with experienced adults. Because verbal interactions are part of this key learning process, a school system that expects passive listening with little contextual interaction may disadvantage these students (Lipka and McCarty, 1994[4]).

Sociocultural context plays a particularly critical role in the development of complex 21st Century constructs such as creativity, critical thinking, problem solving or collaborative skills. These constructs are defined in terms of how students think, feel, empathise, act and interact with others and their social environment, and are grounded in social and cultural contexts (Suzuki and Ponterotto, 2007[5]). Research points to evidence of differences in the conceptual definitions and applicability of such constructs in different cultures (Ercikan and Lyons-Thomas, 2013[6]; Ercikan and Oliveri, 2016[7]; Niu and Sternberg, 2001[8]; Suzuki and Ponterotto, 2007[5]; Lubart, 1990[9]).

The central role that social and cultural context plays in learning extends to assessments. When students engage with assessments, their experiences, language and sociocultural backgrounds interact with the knowledge and skills targeted by the assessment, which can in turn impact their performances (Liu, Wu and Zumbo, 2006[10]; Solano-Flores and Nelson-Barber, 2001[11]). Responses to test items reflect what students know and can do as well as a complex interaction of how they feel about the assessment situation, understand assessment questions and formulate their responses, an interaction that is shaped by their social environment and by cultural and language practices outside of school.

The nature and frequency of access to cultural practices outside of school can affect students' understanding of assessment items and hence their performance. For example, students who attend French schools as linguistic minority students in Ontario, Canada, where English is the dominant societal language, consistently performed worse in mathematics, reading and science than French students in Quebec, Canada, where French is the dominant language (Ercikan et al., 2014[12]). Similar performance trends have been observed for other ethnic, racial and linguistic minority groups in other countries (Ercikan et al., 2014[12]; Ercikan and Elliott, 2015[13]). In each occurrence there are likely multiple factors contributing to these patterns, but students' social and cultural contexts, as well as their access to the language of schooling, need to be considered as important factors for learning and assessment outcomes.

Parental involvement and expectations – another contributor to the social aspect of learning – have also been found to play a role in academic performance (Cooper et al., 2009[14]; Lee and Stankov, 2018[15]). For example, students in Asian countries, such as China, view their education and examination systems as the main route out of poverty. This context, combined with being praised and rewarded for good performance by teachers and parents, is associated with higher motivation to perform well on the assessment (Rotberg, 2006[16]; Zhang and Luo, 2020[17]). Students from Asian cultural groups tend to have higher engagement levels with low-stakes assessments than students from some Western countries (Ercikan, Guo and He, 2020[18]; Guo and Ercikan, 2021[19]), also pointing to higher motivation levels.

An important context that can affect students' assessment performance is the opportunities available to study a topic, learn how to solve the type of problems included on an assessment, and engage with similar kinds of assessments (Ercikan, Roth and Asil, 2015[20]; National Academies of Sciences, Engineering, and Medicine, 2019[21]). The opportunity to access the curricular content which is subsequently assessed, develop test-taking strategies and become familiar with a given assessment technology can all contribute to students' ability to engage with an assessment.

The role of opportunity to learn (OTL) in performance on assessment and in the validity of interpretation and use of assessment results has been widely recognised (Moss et al., 2008[3]; National Academies of Sciences, Engineering, and Medicine, 2019[21]). Large-scale assessments like the National Assessment of Educational Progress (NAEP) and the Programme for International Student Assessment (PISA) include measures of OTL to facilitate better interpretation and use of performance results (National Center for Educational Statistics, 2021[22]; OECD, 2016[23]; Schleicher, 2019[24]). Indicators of OTL often include school poverty rates, access to high-quality preparatory (e.g. kindergarten) programmes and coursework, access to high-quality teaching, curricular breadth and academic support (National Academies of Sciences, Engineering, and Medicine, 2019[21]). As OTL tends to measure access to curricular programmes or academic support, many of the common OTL indicators have demonstrated association with academic performance such as applied mathematics problems (Schmidt, Zoido and Cogan, 2014[25]) and advanced mathematical concepts (Cogan, Schmidt and Wiley, 2001[26]; Fuchs and Wößmann, 2007[27]; Schmidt et al., 2015[28]).

While many of these indicators, which primarily capture opportunities to learn in school, have clear and demonstrated associations with learning outcomes, their relationship with competencies like problem solving, decision making or collaborative skills (Kurlaender and Yun, 2005[29]; 2007[30]) is less definitive, as the development of such constructs is not as closely connected to the specific subjects, concepts and procedures that are taught at school. Therefore, OTL indicators for these types of constructs may need to include opportunities outside of schools, such as in the home and in broader societal contexts. Additionally, while within-group differences in OTL create challenges for score interpretation at the group level, measuring the OTL of a complex construct remains a valuable exercise. For example, variations within and across cultural groups in opportunities to learn and practise problem solving can provide important insights to guide policy and practice in reducing such disparities.

Sociocultural context plays an important role in shaping learning and performance on assessment, which therefore has clear implications for assessment validity (and consequently for establishing validity evidence to support measurement equivalence). Validity is defined as the "degree to which evidence and theory support the interpretations of test scores for proposed uses of tests" (AERA, APA, NCME, 2014, p. 11[31]). Similarly, cross-cultural validity refers to the degree to which evidence and theory support the interpretations and uses of test scores for different cultural groups and comparisons across groups. Comparative inferences require equivalence of measurement and comparability of scores when tests are administered in multiple languages or when students from different cultural groups take tests in the same language. Cross-cultural validity and comparability issues have particular relevance to assessments of complex constructs in multicultural and multilingual contexts, such as in international assessments, and assessments in countries with culturally diverse populations.

As described in Figure 11.1, measurement equivalence requires evidence that tests capture equivalent constructs across subgroups (construct equivalence), have similar measurement characteristics and properties (test equivalence), and are administered in equivalent conditions (equivalence of testing conditions) (Ercikan and Lyons-Thomas, 2013[6]; Perie, 2020[32]). All three aspects of measurement equivalence are inherently influenced by sociocultural context.

The equivalence of constructs is the degree to which construct definitions are similar for populations targeted by the assessment, whether individuals are expected to develop and progress on these constructs in similar ways, and whether they are accessible in similar ways for all populations. It is critical to all assessments intended for multicultural and multilingual groups but it takes on specific relevance to large-scale assessments of complex and multidimensional constructs (Ercikan and Oliveri, 2016[7]). Constructs such as creativity, intelligence, critical thinking and collaboration are not uniformly taught in schools and are conceptualised and defined differently in different cultures. For example, how creativity develops and how creative behaviours are manifested differ across cultural groups (Lubart, 1990[9]; Niu and Sternberg, 2001[8]). Other researchers have also argued that the concepts of intelligence are grounded in cultural contexts and, as such, the constructs have different definitions in these contexts (Sternberg, 2013[35]).

Given that complex skills are embedded within social contexts and are characteristically shaped by cultural norms and expectations, we can expect their manifestations and the value of student outputs to vary across cultures. Because of these differences across cultural groups, there is a need to balance measurement validity with score comparability (see Box 11.1 for an example in the context of a large-scale assessment).

Test equivalence requires equivalence of test versions in different languages and for different cultural groups in multilingual assessment contexts. In innovative, performance-based assessments, another key aspect of test equivalence is the degree to which students, including those from different language and cultural backgrounds, perceive and engage with the tasks in the same way. Test designs – especially those featuring adaptivity or AI tools in item creation and scoring – have direct implications for test content and scores, meaning they are also important to consider for test equivalence. In this section, we review various factors related to test equivalence including linguistic comparability, equivalence of test engagement, adaptivity and AI-based item creation and test scoring.

In examining cross-cultural validity and comparability, particular attention has been given to the comparability of different language versions of assessments. In multilingual assessment contexts, test adaptation facilitates the administration of assessments to students in their language of instruction to provide a valid measurement of the targeted constructs.

The task of test adaptation goes beyond the literal translation of the assessment content (Ercikan and Por, 2020[36]; Perie, 2020[32]; van de Vijver and Tanzer, 1997[37]). An abundance of research has shown that, due to key differences between languages, a literal translation does not necessarily produce equivalent measurement and therefore comparable scores (Allalouf, Hambleton and Sireci, 1999[38]; El Masri, Baird and Graesser, 2016[39]; Ercikan and Koh, 2005[40]; Hambleton, Merenda and Spielberger, 2005[41]). Languages vary in the frequency of compound words, word length, sentence length and information density (Bergqvist, Theens and Österholm, 2018[42]). Moreover, grammatical forms in one language may not have equivalent forms in other languages or may have many of them. There is also the difficulty of adapting syntactical style from one language to another, and languages may also differ in form (alphabet versus character-based) and direction of writing (left-to-right, right-to-left or top-to-bottom).

In order to support measurement equivalence, test adaptation goes beyond literal translation to reflect equivalent meaning, format, relevance, intrinsic interest, engagement and familiarity of the item content (Ercikan, 1998[43]; Hambleton, Merenda and Spielberger, 2005[41]). Subtle differences in meaning, difficulty of vocabulary or complexity of sentences between different language versions of tests can affect the difficulty levels of items differentially and lead to incomparability of performances of examinees from different language groups (Allalouf, 2003[44]; Ercikan et al., 2004[45]).

In addition to potential linguistic differences between different language versions of assessments, the administration of assessments on digital platforms can create sources of non-equivalence in measurement including students’ digital literacy (Bennett et al., 2008[46]) and their familiarity with digital platforms (Tate, Warschauer and Abedi, 2016[47]). Digital literacy varies across cultural or social student groups who may have differential levels of access to and experiences with digital devices, which may then affect the extent to which performances may be accounted for by their abilities to use digital devices and navigate through the assessment effectively. Research has demonstrated that digital demands in assessments can affect performance differentially for different cultural or language groups (Ercikan, Asil and Grover, 2018[48]; Fishbein et al., 2018[49]; Zehner et al., 2020[50]).

In particular, requirements for interactivity in assessments heighten the importance of digital literacy in engaging with assessment tasks in ways that support performance. Students from different sociocultural backgrounds may lack familiarity with certain digital platforms or may not use the specific tools and capabilities made available by the assessment similarly (Jackson et al., 2008[51]). In addition, the use of multimedia formats in some new item types, such as hot-spot items (i.e. identification of correct or incorrect zones) and drag-and-drop image matching, requires that images and videos represent the same meaning and relevance to students from different countries and sociocultural backgrounds (Solano-Flores and Nelson-Barber, 2001[11]). Differential engagement with assessments due to these factors may create sources of incomparability and jeopardise cross-cultural comparability.

Adaptivity entails tailoring the assessment to students' performance levels in ways that provide measurement efficiency and increase student engagement with the assessment. However, adapting to students’ performance levels is also associated with some limitations and challenges (Kingsbury, Freeman and Nesterak, 2014[52]; Yamamoto, Shin and Khorramdel, 2018[53]; Zenisky and Hambleton, 2016[54]). Adaptivity can be at the item level, known as computerised adaptive testing (CAT), or after item sets and at different stages of the assessment, referred to as multi-stage adaptive testing (MST). In both cases, adaptivity in testing relies on the measurement equivalence assumption, that is, that the same constructs are measured by the assessment for different groups and that items have the same order of difficulty for these groups.

Complex constructs such as creativity, communication and collaboration often involve multiple components and measurement dimensions. This multidimensionality not only creates challenges for simple adaptivity designs based on a single dimension but can also be a source of measurement incomparability. Individuals from different social and cultural backgrounds do not necessarily develop across the construct components in uniform ways. In other words, the dimensionality structure may be different for individuals who may have differential opportunities to develop in different dimensions of the construct. This can result in somewhat different constructs being measured and the ordering of difficulty of items differing for these cultural groups. The assumptions of measurement equivalence, dimensionality and consistency of item ordering are fundamental to establishing cross-cultural comparability in adaptive assessments and need to be evaluated when designing comparable assessments involving different cultural groups. Violations of these assumptions can have critical implications; in particular, the adaptive design may fail to meet its intended goals of measurement efficiency and improved engagement.
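One simple way to probe the consistency of item difficulty ordering across groups can be sketched as follows. This is only an illustration, not a procedure prescribed in this chapter: it assumes dichotomously scored (0/1) responses, uses proportion correct as a rough difficulty index, and summarises agreement in item ordering between two groups with a Spearman rank correlation. The function names and the example values are hypothetical.

```python
from statistics import mean


def proportion_correct(responses):
    """Per-item proportion correct from rows of 0/1 scored responses."""
    n_items = len(responses[0])
    return [mean(row[i] for row in responses) for i in range(n_items)]


def ranks(values):
    """Average ranks (1 = smallest value), with ties sharing a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks.

    Undefined (division by zero) if all values in x or in y are tied.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical proportion-correct values for the same four items in two groups.
group_a = [0.9, 0.7, 0.5, 0.3]
group_b = [0.8, 0.4, 0.6, 0.2]
print(spearman(group_a, group_b))
```

A correlation well below 1 would only flag that item ordering deserves a closer look; it does not by itself identify which items behave differently or why, and it is no substitute for formal analyses of measurement equivalence.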

There is an additional concern in multi-stage testing (MST) adaptive designs related to the appropriateness of the routing blocks of items and routing decisions for different cultural groups with large degrees of variation in their performance levels. MST adaptivity may start with a block of items determined to have medium-level difficulty across all groups. For groups with a much lower performance distribution, a block identified as medium difficulty may in practice be difficult or very difficult. This misalignment may also result in MST not meeting its intended goals of advancing measurement efficiency or improving test engagement, in turn affecting the cross-cultural validity and comparability of scores. Therefore, in addition to examining measurement equivalence, the appropriateness of an MST design needs to be evaluated for, or adapted to, cultural groups with varying performance distributions.
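The routing concern can be illustrated with a minimal sketch. It assumes a Rasch model for item responses and a simple number-correct routing rule with two cut scores; the item difficulties, cut scores and ability values below are hypothetical and are not taken from any operational assessment.

```python
import math


def p_correct(theta, difficulty):
    """Rasch probability that a test taker with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_correct(theta, difficulties):
    """Expected number-correct score on a block of items."""
    return sum(p_correct(theta, b) for b in difficulties)


def route(num_correct, low_cut, high_cut):
    """Number-correct routing rule: choose the next-stage block by score."""
    if num_correct < low_cut:
        return "easy"
    if num_correct >= high_cut:
        return "hard"
    return "medium"


# Hypothetical routing block judged to be of medium difficulty overall.
ROUTING_BLOCK = [-0.5, -0.25, 0.0, 0.25, 0.5]

# Compare a test taker near the overall mean with one from a group whose
# performance distribution is centred two logits lower.
for label, theta in [("ability 0.0", 0.0), ("ability -2.0", -2.0)]:
    print(label, round(expected_correct(theta, ROUTING_BLOCK), 2))
```

Under these assumptions, test takers from the lower-performing group are expected to answer few routing items correctly, so the routing block offers them little differentiation and a potentially discouraging first stage. This is the kind of misalignment that an evaluation of the MST design for each group would look for.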

As educational assessments capture more complex data on student and computer interactions, analyses using machine learning (ML) and AI algorithms have been developed to support automated inferences of student performances (DiCerbo, 2020[55]). AI-based algorithms for scoring, which make use of features generated from natural language processing (NLP) of text, image recognition of visual data or speech recognition of audio data, are critically dependent on the data sources used in the creation of such algorithms (Baker and Hawn, 2021[56]; Manyika, Silberg and Presten, 2019[57]).

In particular, if data sources for these algorithms are restricted to specific cultural and language groups, resulting scores may not have equivalent validity and accuracy for all groups. For example, on an English-speaking test, the AI-based score engine may produce biased scores for students who have different accents and dialects if the model was trained on standard English pronunciations (Benzeghiba et al., 2007[58]). Similar bias may be observed in the automated scoring of text. Previous research has shown that an automated scoring algorithm developed using data from mainstream student groups resulted in less accurate scores for certain racial and ethnic student groups even if responding in the same language (Bridgeman, Trapani and Attali, 2012[59]). This means that automated scoring models need to be trained, calibrated and adjusted using appropriate samples of responses from all target populations (Zhang, 2013[60]), especially when AI-identified features are used in prediction.

As with many AI-based applications, unintended biases can be introduced due to construct-irrelevant features that are correlated with human scores or due to inadequate representation of features that are uniquely observed for minority language or culture groups (Feldman et al., 2015[61]). The ethical use of AI requires that automated scoring systems treat all test takers fairly regardless of language or population groups by providing equivalent meaning of scores for individuals from different social and cultural groups.
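A basic check on whether an automated scoring engine performs comparably across groups can be sketched as follows. The sketch assumes that each response has both a human score and a machine score on the same integer scale, together with a group label; the record layout, function name and example data are hypothetical. It reports exact agreement and the mean machine-minus-human difference within each group.

```python
from collections import defaultdict


def agreement_by_group(records):
    """Summarise human-machine score agreement separately for each group.

    records: iterable of (group, human_score, machine_score) tuples.
    Returns a dict mapping group -> n, exact agreement rate and mean
    signed difference (machine minus human).
    """
    grouped = defaultdict(list)
    for group, human, machine in records:
        grouped[group].append((human, machine))

    summary = {}
    for group, pairs in grouped.items():
        n = len(pairs)
        exact = sum(1 for h, m in pairs if h == m) / n
        mean_diff = sum(m - h for h, m in pairs) / n
        summary[group] = {"n": n, "exact_agreement": exact, "mean_diff": mean_diff}
    return summary


# Hypothetical scored responses: (group, human score, machine score).
example = [
    ("A", 3, 3), ("A", 2, 2), ("A", 4, 3),
    ("B", 3, 2), ("B", 4, 2), ("B", 2, 2),
]
for group, stats in agreement_by_group(example).items():
    print(group, stats)
```

Marked differences between groups in agreement rates or in the direction of the mean difference would suggest that scores do not carry the same meaning for all groups and would warrant closer investigation. With real data, sample sizes and sampling error would also need to be taken into account before drawing conclusions.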

An important consideration in the use of automated scoring systems is whether the scoring systems are similarly effective in detecting aberrant responses across languages and cultural groups. Aberrant responses are defined as atypical responses that are not amenable to scoring by algorithms based on the most typical response patterns, such as responses that have unusually creative content (e.g. highly metaphorical), exhibit unexpected response organisation (e.g. poem) or have off-topic content (Higgins, Burnstein and Attali, 2006[62]; Zhang, Chen and Ruan, 2015[63]). In multicultural and multilingual assessment contexts, these differences are magnified when students respond in different languages.

Similar to AI-based automated scoring, automated item generation (AIG) uses a variety of algorithms to automatically create test items. AIG can potentially deliver large numbers of items that cover a variety of content and knowledge, accelerate content updates and test creation, and significantly reduce the cost of assessments by replacing highly labour-intensive and costly item development by humans. AIG in interactive digital assessments can also be used to generate items "on the fly", in other words, to create and deliver items in interactive and adaptive assessments tailored to students' responses, performance levels and potentially their sociocultural differences.

AIG relies on two components: an item cognitive model and a computer algorithm that automatically generates items according to the item model. Some risks of using algorithms for generating items have been identified by researchers, such as questionable cognitive models, ill-structured problems that produce multiple correct answers, and implausible or irrelevant distractors (Gierl, Lai and Turner, 2012[64]; Royal et al., 2018[65]). There are additional challenges for multilingual AIG items that are intended to be administered to students from different cultures, where translation quality (awkward phrases, for example) and psychometric properties of the AIG items may not be invariant across cultural groups (Gierl et al., 2016[66]; Higgins, Futagi and Deane, 2005[67]). Hence, two issues require evaluation of the appropriateness of AIG in assessments designed for multicultural and multilingual groups: the first is the degree to which the item cognitive model used for generating items can be assumed to be equivalent for students from different sociocultural contexts; and the second is the linguistic and psychometric equivalence of AIG items generated from different language models trained in different languages.
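The two components can be illustrated with a deliberately simple, template-based sketch. It is not how any particular AIG system works: the item model here is a single hypothetical stem with two variable slots, and the distractors are modelled on assumed common errors. The sketch also shows a basic guard against one of the risks noted above, namely generated items in which a distractor coincides with the key or with another distractor.

```python
from itertools import product

# Hypothetical item model: a stem with two variable slots.
STEM = ("A train travels {distance} km in {hours} hours. "
        "What is its average speed in km/h?")


def generate_items(distances, hours_options):
    """Generate multiple-choice items from the stem for all slot combinations."""
    items = []
    for distance, hours in product(distances, hours_options):
        if distance % hours != 0:
            continue  # keep only combinations with a whole-number key
        key = distance // hours
        # Distractors based on assumed common errors (illustrative only).
        distractors = {distance * hours, distance - hours, distance + hours}
        distractors.discard(key)
        if len(distractors) < 3:
            continue  # a distractor duplicated the key or another distractor
        items.append({
            "stem": STEM.format(distance=distance, hours=hours),
            "key": key,
            "distractors": sorted(distractors),
        })
    return items


for item in generate_items([120, 150], [2, 3, 4]):
    print(item["stem"], item["key"], item["distractors"])
```

Even in this toy case, the context of the stem (trains, kilometres) and the assumed error patterns behind the distractors reflect choices that may not be equally familiar or plausible for students in different sociocultural contexts, which is why the equivalence of the item model itself needs to be evaluated.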

The final aspect of measurement equivalence is test condition equivalence, which refers to the similarity of test administration conditions such as test instructions, mode and format of the test, timing and advance preparation of the test administrators. Large-scale assessments often entail administering the assessments in multiple test sites and across geographical boundaries. The large variability in testing environments and sociocultural norms of learning and assessments therefore increases the threats to score comparability.

Score comparability rests on the premise that valid generalisations can be made about students' performances across test administration sites and cultural and language groups. In multicultural and multilingual assessment contexts, test administrators should be drawn from the local communities so that they are familiar with the culture, language and local dialects to respond to and forestall administration deviations. Training of test administrators is central to the standardisation of testing conditions and should be provided to all administrators in different cultural settings to ensure understanding of the importance of standardised procedures and to provide the needed test administration skills. In the case that the assessments are delivered remotely with live proctors, those proctors need to be trained adequately as well.

When interpreting scores of students from diverse cultural and language backgrounds, knowledge of the broader testing conditions that exist outside assessment settings can enhance understanding of assessment outcomes. These conditions include societal context for testing such as the emphasis given to testing, which may affect how students perceive the testing situation and its role, in turn impacting students’ motivation to perform and how they engage with the assessment.

Despite the recognition that sociocultural context plays an important role in shaping learning and performance on assessment, this context is often neglected in assessment design. In order for assessments to provide equivalent measurement of targeted constructs, the targeted cognition and construct models, task designs and interpretation models (i.e. the three key models of Evidence-Centred Design) need to consider the sociocultural context of learning from the beginning of the design process.

Ensuring cross-cultural validity and comparability needs to start with an evaluation of the equivalence of constructs. This involves identifying behaviours, conceptual understanding and characteristics associated with a construct in a specific culture and time (including how different levels of the competencies involved are differentiated) through surveys or expert judgements (van de Vijver and Tanzer, 2004[68]). The International Test Commission (2017[69]) guidelines emphasise that adequate empirical evidence should be collected to demonstrate that the construct assessed is understood in the same way across language and cultural groups in large-scale and/or international assessments. In assessments with technological elements, further consideration should be given to how the measurement and scoring of the constructs will be impacted by technology (International Test Commission and Association of Test Publishers, 2022[70]). Gathering such evidence then helps to determine what aspects of the construct can be expected to be common (and what aspects can be expected to differ) across the cultural groups considered, and whether a common assessment can provide scores with consistent meaning across these groups.

Once the evidence described above has been established, the next step is to develop items for the common and culture-specific elements of the target construct. The common elements will facilitate score comparability and the culture-specific elements will optimise validity of score interpretation and use in different cultural groups. Task models therefore need to consider: 1) whether the definition of the targeted construct varies for different groups; 2) whether there are expected differences in learning progressions and knowledge structures; 3) whether variations in sociocultural context are expected to lead to different cognitive processes for students from different backgrounds; 4) whether students from different contexts are expected to engage with different features of tasks differently; and 5) what kinds of variations of features of tasks might be needed to optimise performance for students from different sociocultural backgrounds. Multiple versions of tasks may be necessary to obtain equivalent evidence of the targeted construct from different student groups or to redirect the assessment focus to components of the targeted construct where more similarities can be expected for these groups.

Differences in OTL also need to be considered when developing assessment tasks and interpreting and using responses to assessment tasks. In particular, it is important to determine whether students from different population groups can be expected to have had similar opportunities to learn and develop the targeted constructs, what types of tasks might be more closely aligned with their learning experiences, and whether students can be expected to engage with the test environment effectively given their access to similar digital devices, applications and tools. The potential for variations in digital literacy, in particular, to be a source of incomparability can be addressed by intentionally designing assessment tasks for students with the least access to and familiarity with digital resources. When more advanced technology is necessary for an assessment, accessible and effective tutorials can also be provided to help acquaint students with navigating the assessment interface and entering their responses. If possible, practice tasks can also be developed as part of the assessment and distributed to help familiarise students with the digital environment. Ercikan, Asil and Grover (2018[48]) also recommended examining how students from different backgrounds engage with digital assessments using cognitive labs or other response process analyses (see Chapter 12 of this report for a more detailed discussion of the uses of process data for validation purposes).

Adapting assessments into different language versions involves trade-offs between comparability and cultural authenticity: while concurrent/parallel/simultaneous development (Solano-Flores, Trumbull and Nelson-Barber, 2002[71]) of items in multiple languages prioritises cultural authenticity, successive development (Tanzer and Sim, 1999[72]) of items in one language and then adaptation to other languages prioritises comparability. Having experts evaluate language equivalence is a necessary step as they can identify differences in language, content, format and other aspects of items in the comparison languages. Documenting changes and the rationale for changes between the language versions of assessments is critical for informing test users about the potential impact on comparability.

Following task design, the next step is interpreting student responses to tasks. There are two components to interpretation models: the scoring models used to extract evidence from responses and the measurement models used for accumulating evidence across tasks. Scoring models are created based on cognitive theories; however, different world views, knowledge structures and learning progressions might impact student responses. It is therefore important to investigate whether variations in scoring models are needed for obtaining equivalent quality of evidence of the constructs in different student groups.

In general, even when the assessment design integrates all the considerations discussed above, tasks measuring complex skills can remain susceptible to influences from cultural exposure and learning experiences. One possible psychometric solution is the use of universal or anchor items in the test design. Test developers can judiciously identify a set of items that are universally recognised to measure the construct of interest and that can be carefully selected to represent the assessment in terms of assessment content (see Box 11.1 for an example), item statistical characteristics (Kolen and Brennan, 2014[73]) and item formats (Livingston, 2014[74]). These items, to be used as linking items, can be administered to all students. The performance on these universal items can then be used to anchor the differences in performance on the remaining tasks that are more susceptible to influences by cultural norms and expectations. In this way, universal items can facilitate score comparability and culture-specific item sets can optimise cultural validity of score interpretation and use in different cultural groups.
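How a set of anchor items can place culture-specific items on a common scale can be sketched with a simple linear linking step. The example below is a minimal illustration that assumes the mean-sigma method from the equating literature; the chapter does not prescribe a particular linking method. It also assumes that item difficulties have already been estimated separately in a reference group and in a new group. The function name and all difficulty values are hypothetical.

```python
from statistics import mean, pstdev


def mean_sigma_link(anchor_ref, anchor_new):
    """Return (slope, intercept) mapping the new group's scale onto the reference scale.

    anchor_ref and anchor_new hold difficulty estimates for the SAME anchor
    items, calibrated separately in the reference and the new group.
    """
    slope = pstdev(anchor_ref) / pstdev(anchor_new)
    intercept = mean(anchor_ref) - slope * mean(anchor_new)
    return slope, intercept


# Hypothetical anchor-item difficulties (invented for illustration only).
anchor_ref = [-1.0, -0.2, 0.4, 1.2]
anchor_new = [-0.5, 0.3, 0.9, 1.7]  # same items, shifted by 0.5 on the new scale

slope, intercept = mean_sigma_link(anchor_ref, anchor_new)

# Culture-specific items calibrated only in the new group are then
# re-expressed on the reference scale with the same transformation.
culture_specific_new = [0.0, 1.1]
culture_specific_linked = [slope * b + intercept for b in culture_specific_new]
print(slope, intercept, culture_specific_linked)
```

The quality of such a link depends entirely on the anchor items behaving equivalently across groups, which is why their selection requires the careful judgement described above.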

Further considerations are required for any AI-based automated scoring algorithms, which should be based on construct understanding and evaluated through the lens of validity of interpretation and use (Attali, 2013[75]; Bejar, 2011[76]; Bennett and Bejar, 1998[77]; Powers et al., 2000[78]; Williamson, Xi and Breyer, 2012[79]). This requires systematic evaluation of the consistency of interpretation and use of the AI-based automated scoring engines for individuals from different gender, cultural and language groups, and test developers must ensure that there is an adequate representation of students from all relevant cultural and language groups when training the algorithms. In particular, the following considerations are important for investigating potential bias in AI-based automated scores (Baker and Hawn, 2021[56]; Benzeghiba et al., 2007[58]; DiCerbo, 2020[55]; ETS, 2021[80]; Kearns et al., 2018[81]):

  • Possible human biases in data coding (particularly in supervised learning algorithms).

  • Experimenting with different models used in algorithms and paying attention to differential fairness for different cultural groups in the models.

  • Cross-validating the models with new and different datasets.

  • Investigating potential bias by comparing human versus machine scores for different cultural groups.
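The last consideration in the list above, comparing human and machine scores by group, can be illustrated with a short sketch. It assumes that each response has both a human and a machine score and that a standardised mean difference is an acceptable summary. The group labels, scores and function name are hypothetical, and no flagging threshold is implied.

```python
from statistics import mean, stdev


def standardized_mean_difference(human, machine):
    """(mean machine score - mean human score) divided by the pooled SD."""
    pooled_sd = ((stdev(human) ** 2 + stdev(machine) ** 2) / 2) ** 0.5
    return (mean(machine) - mean(human)) / pooled_sd


# Hypothetical scores for the same responses, split by cultural group.
scores = {
    "group_a": {"human": [2, 3, 3, 4, 4, 5], "machine": [2, 3, 3, 4, 4, 5]},
    "group_b": {"human": [2, 3, 3, 4, 4, 5], "machine": [1, 2, 3, 3, 4, 4]},
}

smd_by_group = {
    group: standardized_mean_difference(s["human"], s["machine"])
    for group, s in scores.items()
}

for group, smd in smd_by_group.items():
    # A value near zero means machine and human scores agree on average;
    # a clearly non-zero value in one group only is a signal to investigate.
    print(group, round(smd, 2))
```

In this invented example the machine scores match the human scores in one group but are systematically lower in the other, which is the kind of pattern that would warrant further review of the scoring model.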

In addition, it is necessary to evaluate the equivalency of NLP features that feed into AI algorithms across languages and investigate whether the same, different or hierarchical AI-based scoring models should be used for different cultural and language groups (see McCaffrey et al. (2022[82]) for a more detailed discussion of these issues). The involvement of subject experts from different cultural and language groups in developing AI-based scoring (and automated item generation) is necessary for minimising poor quality item development and incomparability across language and cultural groups.

While several steps should be taken throughout the assessment design process to establish evidence to support construct and test equivalence, these issues also need to be evaluated by large-scale psychometric studies using various methodologies. The most commonly used methodology for examining measurement invariance at the scale or test level is confirmatory factor analysis (CFA), which compares test data structures across comparison groups (Ercikan and Koh, 2005[40]; Oliveri and Ercikan, 2011[83]). At the item level, differential item functioning (DIF) analysis evaluates whether the probability of a correct response among equally able students is the same for comparison groups (Guo and Dorans, 2019[84]; 2020[85]; Dorans and Holland, 1993[86]; Holland and Thayer, 1988[87]); it has been the standard psychometric approach for examining measurement equivalence across groups since the 1980s (Dorans and Holland, 1993[86]; Holland and Thayer, 1988[87]; Holland and Wainer, 1993[88]; Rogers and Swaminathan, 2016[89]; Shealy and Stout, 1993[90]). Research indicates that measurement incomparability identified at the item level does not necessarily result in observable test or scale level differences (Ercikan and Gonzalez, 2008[91]; Zumbo, 2003[92]). This highlights the importance of examining equivalence at both the test and item levels to provide complementary evidence for a full evaluation of measurement invariance.
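To make the item-level comparison concrete, the following is a minimal sketch of the Mantel-Haenszel procedure, one of the DIF methods cited above. Students are matched on a score level, a 2x2 table of group by correct/incorrect is formed at each level, and the tables are combined into a common odds ratio. The counts are hypothetical, and the sketch omits the significance test and any classification rules used in operational practice.

```python
import math


def mantel_haenszel_dif(strata):
    """Return (common odds ratio, MH D-DIF) for one item.

    Each stratum is a tuple (ref_correct, ref_incorrect, focal_correct,
    focal_incorrect) for students at the same matched score level.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = numerator / denominator
    # Delta-scale transformation; negative values indicate that the item
    # is relatively harder for the focal group.
    return alpha, -2.35 * math.log(alpha)


# Hypothetical counts at two matched score levels.
item_without_dif = [(30, 20, 30, 20), (40, 10, 40, 10)]
item_with_dif = [(30, 20, 20, 30), (40, 10, 30, 20)]

alpha_none, delta_none = mantel_haenszel_dif(item_without_dif)
alpha_dif, delta_dif = mantel_haenszel_dif(item_with_dif)
print(alpha_none, delta_none)
print(alpha_dif, delta_dif)
```

When matched students in both groups have the same odds of success, the common odds ratio is 1 and the delta-scale value is 0; the second invented item yields an odds ratio above 1 and a negative delta value.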

In large-scale assessments that involve large numbers of language and cultural groups, various other methods may be used to examine measurement invariance across groups. In international assessments in particular, measurement invariance is determined by the item-by-language (country) interaction in the item parameter estimates (see Chapters 9 and 16 in OECD (2017[93]), for example). Group-specific parameters (i.e. country item parameters) for items exhibiting group-level DIF in the international calibration are estimated to reduce potential bias introduced by these interactions. If multiple language-adapted assessments are produced, then a linking study may also be needed to create comparable scales with measurement unit equivalence. Research on comparability (Sireci, 1997[94]; 2005[95]) indicates that, in the absence of sufficient evidence for measurement equivalence across groups, score scales should be based on separate language/country calibrations and comparability should be established through a linking procedure.

Several considerations must be kept in mind when interpreting statistical findings in the context of measurement invariance research. First, some level of incomparability exists whenever measurement is compared across assessment groups, and the statistical significance of a violation of measurement invariance may not be a useful indicator of its practical consequences, especially when sample sizes are large. Effect size measures provide a better indication of the level of incomparability (Nye and Drasgow, 2011[96]) and can facilitate decisions about the exclusion of items or the revision of scales to establish comparability.
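The first point can be illustrated numerically. The sketch below holds a small group difference in proportion correct fixed and varies only the sample size. It deliberately uses a simple two-proportion z-test rather than a full invariance analysis with ability matching, and all values and the function name are hypothetical.

```python
import math


def two_proportion_z(p_ref, p_focal, n_per_group):
    """Two-sided z-test for a difference in proportions (equal group sizes)."""
    p_pool = (p_ref + p_focal) / 2
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n_per_group)
    z = (p_ref - p_focal) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value


# The same 2-percentage-point difference, evaluated at two sample sizes.
effect_size = 0.62 - 0.60
z_small, p_small = two_proportion_z(0.62, 0.60, 500)
z_large, p_large = two_proportion_z(0.62, 0.60, 50_000)

print("difference in proportion correct:", round(effect_size, 2))
print("n = 500 per group:    z =", round(z_small, 2), " p =", p_small)
print("n = 50,000 per group: z =", round(z_large, 2), " p =", p_large)
```

The difference is not statistically significant with 500 students per group but is highly significant with 50 000, although its magnitude, and therefore its practical importance, is identical in both cases.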

Second, statistical results do not always guide what actions should be taken if there is evidence of measurement variance and incomparability. Recent studies examining and confirming sources of DIF have advocated for the use of mixed methods approaches that integrate quantitative results from DIF analysis and qualitative findings from expert appraisal to uncover sources of DIF across comparison groups (Benítez and Padilla, 2014[97]; Benítez et al., 2016[98]). Ercikan et al. (2010[99]) demonstrated that Think Aloud Protocols (TAPs) could be used as an approach for examining and confirming sources of DIF in multiple language versions of assessments. Digital assessment environments also provide opportunities for examining measurement equivalence in different ways by providing information about student behaviour and cognitive processes in the data logs that can be used for examining the comparability of response processes and patterns for students from different cultural groups (Ercikan and Pellegrino, 2017[100]; Guo and Ercikan, 2021[19]) – see also Chapter 12 of this report for more detail.

Third, most current DIF methodologies are designed for the comparison of pre-specified focal and reference groups (i.e. observable manifest groups characterised, for example, by gender or ethnicity). In other words, DIF methods do not identify hidden bias for latent heterogeneity groups that might nonetheless exist. For instance, the assumption of within-group homogeneity often neglects the actual heterogeneity that exists in subgroups (Cohen and Bolt, 2005[101]; Ercikan and Por, 2020[36]; Grover and Ercikan, 2017[102]). Oliveri, Ercikan and Zumbo (2014[103]) demonstrated in a simulation study that an increase in heterogeneity from 0 to 80 percent within the focal groups decreased the accuracy of DIF detection.

Different approaches have been used to account for the heterogeneity in focal and reference groups used in DIF analyses. One such approach, referred to as “melting pot” DIF (Dorans and Holland, 1993[86]) or the DIF dissection approach (Zhang, Dorans and Matthews-López, 2005[104]), focused on crossing two manifest groups (e.g. gender and ethnicity) to create more specific subgroups for analysis. Other approaches focused on identifying latent homogeneous groups through statistical analyses (Cohen and Bolt, 2005[101]; Strobl, Kopf and Zeileis, 2015[105]). Ercikan and Oliveri (2013[106]) also proposed a two-step approach in conducting DIF using latent class analysis within manifest groups: the first step involves a latent class analysis to identify heterogeneous groupings in the considered populations, and the second step involves applying DIF methodologies to the identified latent classes rather than manifest groups as a whole.
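The idea of crossing manifest groups can be sketched as follows. The example forms gender-by-language subgroups and compares each against a single reference subgroup using a standardised difference in proportion correct, weighted by the focal subgroup's counts at each matched score level. The choice of reference subgroup, the index used, the helper names and all records are assumptions made for illustration and are not taken from the studies cited above.

```python
from collections import defaultdict


def tabulate(records):
    """Count [n_correct, n_total] per crossed subgroup and matched score level."""
    table = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for gender, language, level, correct in records:
        cell = table[(gender, language)][level]
        cell[0] += correct
        cell[1] += 1
    return table


def std_p_dif(focal, reference):
    """Focal-weighted mean difference in proportion correct across score levels."""
    weighted_diff, weight = 0.0, 0
    for level, (f_correct, f_total) in focal.items():
        if level not in reference:
            continue  # no matched reference students at this level
        r_correct, r_total = reference[level]
        weighted_diff += f_total * (f_correct / f_total - r_correct / r_total)
        weight += f_total
    return weighted_diff / weight if weight else float("nan")


def make(gender, language, level, n_correct, n_total):
    """Build hypothetical response records for one subgroup and score level."""
    return ([(gender, language, level, 1)] * n_correct
            + [(gender, language, level, 0)] * (n_total - n_correct))


records = (
    make("F", "L1", 1, 6, 10) + make("F", "L1", 2, 8, 10)
    + make("M", "L1", 1, 6, 10) + make("M", "L1", 2, 8, 10)
    + make("F", "L2", 1, 6, 10) + make("F", "L2", 2, 8, 10)
    + make("M", "L2", 1, 3, 10) + make("M", "L2", 2, 5, 10)
)

table = tabulate(records)
reference = table[("F", "L1")]  # hypothetical choice of reference subgroup
dif_by_subgroup = {
    group: std_p_dif(cells, reference)
    for group, cells in table.items() if group != ("F", "L1")
}
print(dif_by_subgroup)
```

In these invented data only one crossed subgroup differs from the reference; an analysis by language alone would average that subgroup with an unaffected one and understate the difference.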

Assessing complex constructs using engaging tasks, often on digital platforms, is critical for promoting learning and development of high value skills, knowledge and competencies, and necessary for advancing assessment methodologies. In this chapter we argued for the recognition of the complex sociocultural context in which assessments are conducted and the importance of cross-cultural validity and comparability. In particular, assessment designers need to take the complex sociocultural context into account in deciding what to assess, how to assess it and how assessment results need to be interpreted and used. We highlighted key measurement equivalence issues that arise specifically in digital assessments of complex constructs in multicultural populations. Many of these issues can be mitigated through a principled assessment design process that examines sociocultural influences at the onset of defining the assessment constructs, designing tasks, and developing scoring and measurement models. However, even when all these are taken into account, empirical investigations and supporting empirical evidence are necessary for establishing the validity and comparability of assessment results for individuals from different cultural and language groups.


[31] AERA, APA, NCME (2014), Standards for Educational and Psychological Testing, American Educational Research Association, Washington, D.C., https://www.testingstandards.net/uploads/7/6/6/4/76643089/9780935302356.pdf.

[44] Allalouf, A. (2003), “Revising translated differential item functioning items as a tool for improving cross-lingual assessment”, Applied Measurement in Education, Vol. 16/1, pp. 55-73, https://doi.org/10.1207/s15324818ame1601_3.

[38] Allalouf, A., R. Hambleton and S. Sireci (1999), “Identifying the causes of DIF in translated verbal items”, Journal of Educational Measurement, Vol. 36/3, pp. 185-198, https://doi.org/10.1111/j.1745-3984.1999.tb00553.x.

[75] Attali, Y. (2013), “Validity and reliability of automated essay scoring”, in Shermis, M. and J. Burstein (eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions, Routledge, New York, https://doi.org/10.4324/9780203122761.

[56] Baker, R. and A. Hawn (2021), “Algorithmic bias in education”, International Journal of Artificial Intelligence in Education, Vol. 32/4, pp. 1052-1092, https://doi.org/10.1007/s40593-021-00285-9.

[76] Bejar, I. (2011), “A validity-based approach to quality control and assurance of automated scoring”, Assessment in Education: Principles, Policy & Practice, Vol. 18/3, pp. 319-341, https://doi.org/10.1080/0969594x.2011.555329.

[97] Benítez, I. and J. Padilla (2014), “Analysis of nonequivalent assessments across different linguistic groups using a mixed methods approach: Understanding the causes of differential item functioning by cognitive interviewing”, Journal of Mixed Methods Research, Vol. 8/1, pp. 52-68, https://doi.org/10.1177/1558689813488245.

[98] Benítez, I. et al. (2016), “Using mixed methods to interpret differential item functioning”, Applied Measurement in Education, Vol. 29/1, pp. 1-16, https://doi.org/10.1080/08957347.2015.1102915.

[77] Bennett, R. and I. Bejar (1998), “Validity and automated scoring: It’s not only the scoring”, Educational Measurement: Issues and Practice, Vol. 17/4, pp. 9-17, https://doi.org/10.1111/j.1745-3992.1998.tb00631.x.

[46] Bennett, R. et al. (2008), “Does it matter if I take my mathematics test on computer? A second empirical study of mode effects in NAEP”, Journal of Technology, Learning, and Assessment, http://www.jtla.org (accessed on 4 March 2023).

[58] Benzeghiba, M. et al. (2007), “Automatic speech recognition and speech variability: A review”, Speech Communication, Vol. 49/10-11, pp. 763-786, https://doi.org/10.1016/j.specom.2007.02.006.

[42] Bergqvist, E., F. Theens and M. Österholm (2018), “The role of linguistic features when reading and solving mathematics tasks in different languages”, The Journal of Mathematical Behavior, Vol. 51, pp. 41-55, https://doi.org/10.1016/j.jmathb.2018.06.009.

[59] Bridgeman, B., C. Trapani and Y. Attali (2012), “Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country”, Applied Measurement in Education, Vol. 25/1, pp. 27-40, https://doi.org/10.1080/08957347.2012.635502.

[26] Cogan, L., W. Schmidt and D. Wiley (2001), “Who takes what math and in which track? Using TIMSS to characterize U.S. students’ eighth-grade mathematics learning opportunities”, Educational Evaluation and Policy Analysis, Vol. 23/4, pp. 323-341, https://doi.org/10.3102/01623737023004323.

[101] Cohen, A. and D. Bolt (2005), “A mixture model analysis of differential item functioning”, Journal of Educational Measurement, Vol. 42/2, pp. 133-148, https://doi.org/10.1111/j.1745-3984.2005.00007.

[14] Cooper, C. et al. (2009), “Poverty, race, and parental involvement during the transition to elementary school”, Journal of Family Issues, Vol. 31/7, pp. 859-883, https://doi.org/10.1177/0192513x09351515.

[55] DiCerbo, K. (2020), “Assessment for learning with diverse learners in a digital world”, Educational Measurement: Issues and Practice, Vol. 39/3, pp. 90-93, https://doi.org/10.1111/emip.12374.

[86] Dorans, N. and P. Holland (1993), “DIF detection and description: Mantel-Haenszel and standardization”, in Holland, P. and H. Wainer (eds.), Differential Item Functioning, Lawrence Erlbaum, Hillsdale.

[39] El Masri, Y., J. Baird and A. Graesser (2016), “Language effects in international testing: The case of PISA 2006 science items”, Assessment in Education: Principles, Policy & Practice, Vol. 23/4, pp. 427-455, https://doi.org/10.1080/0969594x.2016.1218323.

[43] Ercikan, K. (1998), “Translation effects in international assessments”, International Journal of Educational Research, Vol. 29/6, pp. 543-553, https://doi.org/10.1016/s0883-0355(98)00047-0.

[99] Ercikan, K. et al. (2010), “Application of think aloud protocols for examining and confirming sources of differential item functioning identified by expert reviews”, Educational Measurement: Issues and Practice, Vol. 29/2, pp. 24-35, https://doi.org/10.1111/j.1745-3992.2010.00173.x.

[48] Ercikan, K., M. Asil and R. Grover (2018), “Digital divide: A critical context for digitally based assessments”, Education Policy Analysis Archives, Vol. 26/51, pp. 1-24, https://doi.org/10.14507/epaa.26.3817.

[13] Ercikan, K. and S. Elliott (2015), “Assessment as a tool for communication and improving educational equity”, A white paper for the Smarter Balanced Assessment Consortium.

[45] Ercikan, K. et al. (2004), “Comparability of bilingual versions of assessments: Sources of incomparability of English and French versions of Canada’s national achievement tests”, Applied Measurement in Education, Vol. 17/3, pp. 301-321, https://doi.org/10.1207/s15324818ame1703_4.

[91] Ercikan, K. and E. Gonzalez (2008), “Score scale comparability in international assessments”, Paper presented at the National Council on Measurement in Education, New York.

[18] Ercikan, K., H. Guo and Q. He (2020), “Use of response process data to inform group comparisons and fairness research”, Educational Assessment, Vol. 25/3, pp. 179-197, https://doi.org/10.1080/10627197.2020.1804353.

[40] Ercikan, K. and K. Koh (2005), “Examining the construct comparability of the English and French versions of TIMSS”, International Journal of Testing, Vol. 5/1, pp. 23-35, https://doi.org/10.1207/s15327574ijt0501_3.

[6] Ercikan, K. and J. Lyons-Thomas (2013), “Adapting tests for use in other languages and cultures”, in Geisinger, K. et al. (eds.), APA Handbook of Testing and Assessment in Psychology, Vol. 3: Testing and Assessment in School Psychology and Education, American Psychological Association, Washington, D.C., https://doi.org/10.1037/14049-026.

[7] Ercikan, K. and M. Oliveri (2016), “In search of validity evidence in support of the interpretation and use of assessments of complex constructs: Discussion of research on assessing 21st century skills”, Applied Measurement in Education, Vol. 29/4, pp. 310-318, https://doi.org/10.1080/08957347.2016.1209210.

[106] Ercikan, K. and M. Oliveri (2013), “Is fairness research doing justice? A modest proposal for an alternative validation approach in differential item functioning (DIF) investigations”, in Chatterji, M. (ed.), Validity, Fairness and Testing of Individuals in High Stakes Decision-Making Context, Emerald Publishing, Bingley.

[100] Ercikan, K. and J. Pellegrino (2017), “Validation of score meaning using examinee response processes for the next generation of assessments”, in Ercikan, K. and J. Pellegrino (eds.), Validation of Score Meaning for the Next Generation of Assessments, Routledge, New York, https://doi.org/10.4324/9781315708591.

[36] Ercikan, K. and H. Por (2020), “Comparability in multilingual and multicultural assessment contexts”, in Berman, A., E. Haertel and J. Pellegrino (eds.), Comparability of Large-Scale Educational Assessments: Issues and Recommendations, National Academy of Education, Washington, D.C., https://naeducation.org/wp-content/uploads/2020/04/8-Comparability-in-Multilingual-and-Multicultural-Assessment-Contexts.pdf.

[20] Ercikan, K., W. Roth and M. Asil (2015), “Cautions about inferences from international assessments: The case of PISA 2009”, Teachers College Record, Vol. 117/1, pp. 1-28, https://doi.org/10.1177/016146811511700107.

[12] Ercikan, K. et al. (2014), “Inconsistencies in DIF detection for sub-groups in heterogeneous language groups”, Applied Measurement in Education, Vol. 27/4, pp. 273-285, https://doi.org/10.1080/08957347.2014.944306.

[80] ETS (2021), Best Practices for Constructed-Response Scoring, Educational Testing Service, https://www.ets.org/pdfs/about/cr_best_practices.pdf (accessed on 4 March 2023).

[61] Feldman, M. et al. (2015), “Certifying and removing disparate impact”, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259-268, https://doi.org/10.1145/2783258.2783311.

[49] Fishbein, B. et al. (2018), “The TIMSS 2019 Item Equivalence Study: Examining mode effects for computer-based assessment and implications for measuring trends”, Large-scale Assessments in Education, Vol. 6/1, https://doi.org/10.1186/s40536-018-0064-z.

[33] Foster, N. and A. Schleicher (2021), “Assessing creative skills”, Creative Education, Vol. 13, pp. 1-29, https://doi.org/10.4236/ce.2022.131001.

[27] Fuchs, T. and L. Wößmann (2007), “What accounts for international differences in student performance? A re-examination using PISA data”, Empirical Economics, Vol. 32, pp. 433-464, https://doi.org/10.1007/s00181-006-0087-0.

[2] Gee, J. (2008), “A sociocultural perspective on opportunity to learn”, in Moss, P. et al. (eds.), Assessment, Equity, and Opportunity to Learn, Cambridge University Press, Cambridge, https://doi.org/10.1017/CBO9780511802157.004.

[66] Gierl, M. et al. (2016), “Using technology-enhanced processes to generate test items in multiple languages”, in Drasgow, F. (ed.), Technology and Testing: Improving Educational and Psychological Measurement, Routledge, New York, https://doi.org/10.4324/9781315871493.

[64] Gierl, M., H. Lai and S. Turner (2012), “Using automatic item generation to create multiple-choice test items”, Medical Education, Vol. 46/8, pp. 757-765, https://doi.org/10.1111/j.1365-2923.2012.04289.x.

[102] Grover, R. and K. Ercikan (2017), “For which boys and which girls are reading assessment items biased against? Detection of differential item functioning in heterogeneous gender populations”, Applied Measurement in Education, Vol. 30/3, pp. 178-195, https://doi.org/10.1080/08957347.2017.1316276.

[85] Guo, H. and N. Dorans (2020), “Using weighted sum scores to close the gap between DIF practice and theory”, Journal of Educational Measurement, Vol. 57/4, pp. 484-510, https://doi.org/10.1111/jedm.12258.

[84] Guo, H. and N. Dorans (2019), “Observed scores as matching variables in differential item functioning under the one‐ and two‐parameter logistic models: Population results”, ETS Research Report Series, Vol. 2019/1, pp. 1-27, https://doi.org/10.1002/ets2.12243.

[19] Guo, H. and K. Ercikan (2021), “Differential rapid responding across language and cultural groups”, Educational Research and Evaluation, Vol. 26/5-6, pp. 302-327, https://doi.org/10.1080/13803611.2021.1963941.

[41] Hambleton, R., P. Merenda and C. Spielberger (eds.) (2005), Adapting Educational and Psychological Tests for Cross-Cultural Assessment, Psychology Press, New York, https://doi.org/10.4324/9781410611758.

[62] Higgins, D., J. Burstein and Y. Attali (2006), “Identifying off-topic student essays without topic-specific training data”, Natural Language Engineering, Vol. 12/2, pp. 145-159, https://doi.org/10.1017/s1351324906004189.

[67] Higgins, D., Y. Futagi and P. Deane (2005), “Multilingual generalization of the ModelCreator software for math item generation”, ETS Research Report Series, Vol. 2005/1, pp. i-38, https://doi.org/10.1002/j.2333-8504.2005.tb01979.x.

[87] Holland, P. and D. Thayer (1988), “Differential item performance and the Mantel-Haenszel procedure”, in Wainer, H. and H. Braun (eds.), Test Validity, Lawrence Erlbaum, Hillsdale.

[88] Holland, P. and H. Wainer (1993), Differential Item Functioning, Lawrence Erlbaum, Hillsdale.

[69] International Test Commission (2017), “ITC guidelines for translating and adapting tests (Second edition)”, International Journal of Testing, Vol. 18/2, pp. 101-134, https://doi.org/10.1080/15305058.2017.1398166.

[70] International Test Commission and Association of Test Publishers (2022), Guidelines for Technology-Based Assessment, https://www.intestcom.org/page/16 (accessed on 4 March 2023).

[51] Jackson, L. et al. (2008), “Race, gender, and information technology use: The new digital divide”, CyberPsychology & Behavior, Vol. 11/4, pp. 437-442, https://doi.org/10.1089/cpb.2007.0157.

[81] Kearns, M. et al. (2018), “Preventing fairness gerrymandering: Auditing and learning for subgroups fairness”, in Dy, J. and A. Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, PMLR 80, http://proceedings.mlr.press/v80/kearns18a/kearns18a.pdf.

[52] Kingsbury, G., E. Freeman and M. Nesterak (2014), “The potential of adaptive assessment”, Educational Leadership, Vol. 71/6, pp. 12-18.

[73] Kolen, M. and R. Brennan (2014), Test Equating, Scaling, and Linking: Methods and Practices, Springer, New York, https://doi.org/10.1007/978-1-4939-0317-7.

[30] Kurlaender, M. and J. Yun (2007), “Measuring school racial composition and student outcomes in a multiracial society”, American Journal of Education, Vol. 113/2, pp. 213-243, https://doi.org/10.1086/510166.

[29] Kurlaender, M. and J. Yun (2005), “Fifty years after Brown: New evidence of the impact of school racial composition on student outcomes”, International Journal of Educational Policy, Research, and Practice: Reconceptualizing Childhood Studies, Vol. 6/1, pp. 51-78.

[15] Lee, J. and L. Stankov (2018), “Non-cognitive predictors of academic achievement: Evidence from TIMSS and PISA”, Learning and Individual Differences, Vol. 65, pp. 50-64, https://doi.org/10.1016/j.lindif.2018.05.009.

[4] Lipka, J. and T. McCarty (1994), “Changing the culture of schooling: Navajo and Yup’ik cases”, Anthropology & Education Quarterly, Vol. 25/3, pp. 266-284, https://doi.org/10.1525/aeq.1994.25.3.04x0144n.

[10] Liu, Y., A. Wu and B. Zumbo (2006), “The relation between outside of school factors and mathematics achievement: A cross-country study among the US and five top-performing Asian countries”, Journal of Educational Research & Policy Studies, Vol. 6, pp. 1-35.

[74] Livingston, S. (2014), Equating Test Scores (Without IRT), https://www.ets.org/Media/Research/pdf/LIVINGSTON2ed.pdf (accessed on 4 March 2023).

[9] Lubart, T. (1990), “Creativity and cross-cultural variation”, International Journal of Psychology, Vol. 25/1, pp. 39-59, https://doi.org/10.1080/00207599008246813.

[57] Manyika, J., J. Silberg and B. Presten (2019), What do we do about the biases in AI?, https://hbr.org/2019/10/what-do-we-do-about-the-biases-in-ai (accessed on 4 March 2023).

[82] McCaffrey, D. et al. (2022), “Best practices for constructed‐response scoring”, ETS Research Report Series, Vol. 2022/1, pp. 1-58, https://doi.org/10.1002/ets2.12358.

[3] Moss, P. et al. (eds.) (2008), Assessment, Equity, and Opportunity to Learn, Cambridge University Press, Cambridge, https://doi.org/10.1017/cbo9780511802157.

[21] National Academies of Sciences, Engineering, and Medicine (2019), Monitoring Educational Equity, National Academies Press, Washington, D.C., https://doi.org/10.17226/25389.

[22] National Center for Educational Statistics (2021), Survey questionnaires: Questionnaires for students, teachers, and school administrators, https://nces.ed.gov/nationsreportcard/experience/survey_questionnaires.aspx (accessed on 4 March 2023).

[8] Niu, W. and R. Sternberg (2001), “Cultural influences on artistic creativity and its evaluation”, International Journal of Psychology, Vol. 36/4, pp. 225-241, https://doi.org/10.1080/00207590143000036.

[96] Nye, C. and F. Drasgow (2011), “Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups”, Journal of Applied Psychology, Vol. 96/5, pp. 966-980, https://doi.org/10.1037/a0022955.

[34] OECD (2022), Thinking Outside the Box: The PISA 2022 Creative Thinking Assessment, https://issuu.com/oecd.publishing/docs/thinking-outside-the-box (accessed on 4 March 2023).

[93] OECD (2017), PISA 2015 Technical Report, OECD Publishing, Paris, https://www.oecd.org/pisa/data/2015-technical-report/ (accessed on 4 March 2023).

[23] OECD (2016), Equations and Inequalities: Making Mathematics Accessible to All, OECD Publishing, Paris, https://doi.org/10.1787/9789264258495-en.

[83] Oliveri, M. and K. Ercikan (2011), “Do different approaches to examining construct comparability in multilanguage assessments lead to similar conclusions?”, Applied Measurement in Education, Vol. 24/4, pp. 349-366, https://doi.org/10.1080/08957347.2011.607063.

[103] Oliveri, M., K. Ercikan and B. Zumbo (2014), “Effects of population heterogeneity on accuracy of DIF detection”, Applied Measurement in Education, Vol. 27/4, pp. 286-300, https://doi.org/10.1080/08957347.2014.944305.

[1] Pellegrino, J., N. Chudowsky and R. Glaser (eds.) (2001), Knowing What Students Know, National Academies Press, Washington, D.C., https://doi.org/10.17226/10019.

[32] Perie, M. (2020), “Comparability across different assessment systems”, in Berman, A., E. Haertel and J. Pellegrino (eds.), Comparability of Large-Scale Educational Assessments: Issues and Recommendations, National Academy of Education, Washington, D.C., https://doi.org/10.31094/2020/1.

[78] Powers, D. et al. (2000), “Comparing the validity of automated and human essay scoring”, ETS Research Report Series, Vol. 2000/2, pp. i-23, https://doi.org/10.1002/j.2333-8504.2000.tb01833.x.

[89] Rogers, H. and H. Swaminathan (2016), “Concepts and methods in research on differential functioning of test items: Past, present, and future”, in Wells, C. and M. Faulkner-Bond (eds.), Educational Measurement: From Foundations to Future, The Guilford Press, New York.

[16] Rotberg, I. (2006), “Assessment around the world”, Educational Leadership, Vol. 64/3, pp. 58-63, https://neqmap.bangkok.unesco.org/wp-content/uploads/2019/08/Assessment-Around-the-World.pdf.

[65] Royal, K. et al. (2018), “Automated item generation: The future of medical education assessment”, EMJ Innovations, Vol. 2/1, pp. 88-93, https://doi.org/10.33590/emjinnov/10313113.

[24] Schleicher, A. (2019), PISA 2018: Insights and Interpretations, https://www.oecd.org/pisa/PISA%202018%20Insights%20and%20Interpretations%20FINAL%20PDF.pdf (accessed on 4 March 2023).

[28] Schmidt, W. et al. (2015), “The role of schooling in perpetuating educational inequality”, Educational Researcher, Vol. 44/7, pp. 371-386, https://doi.org/10.3102/0013189x15603982.

[25] Schmidt, W., P. Zoido and L. Cogan (2014), “Schooling matters: Opportunity to learn in PISA 2012”, OECD Education Working Papers No. 95, https://doi.org/10.1787/19939019.

[90] Shealy, R. and W. Stout (1993), “A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF”, Psychometrika, Vol. 58/2, pp. 159-194, https://doi.org/10.1007/BF02294572.

[95] Sireci, S. (2005), “Using bilinguals to evaluate the comparability of different language versions of a test”, in Hambleton, R., P. Merenda and C. Spielberger (eds.), Adapting Educational and Psychological Tests for Cross-Cultural Assessment, Lawrence Erlbaum, Hillsdale.

[94] Sireci, S. (1997), “Problems and issues in linking assessments across languages”, Educational Measurement: Issues and Practice, Vol. 16/1, pp. 12-19, https://doi.org/10.1111/j.1745-3992.1997.tb00581.x.

[11] Solano-Flores, G. and S. Nelson-Barber (2001), “On the cultural validity of science assessments”, Journal of Research in Science Teaching, Vol. 38/5, pp. 553-573, https://doi.org/10.1002/tea.1018.

[71] Solano-Flores, G., E. Trumbull and S. Nelson-Barber (2002), “Concurrent development of dual language assessments: An alternative to translating tests for linguistic minorities”, International Journal of Testing, Vol. 2/2, pp. 107-129, https://doi.org/10.1207/s15327574ijt0202_2.

[35] Sternberg, R. (2013), “Intelligence”, in Freedheim, D. and I. Weiner (eds.), Handbook of Psychology: History of Psychology, John Wiley & Sons, Hoboken.

[105] Strobl, C., J. Kopf and A. Zeileis (2015), “Rasch trees: A new method for detecting differential item functioning in the Rasch model”, Psychometrika, Vol. 80/2, pp. 289-316, https://doi.org/10.1007/s11336-013-9388-3.

[5] Suzuki, L. and J. Ponterotto (2007), Handbook of Multicultural Assessment: Clinical, Psychological, and Educational Applications, John Wiley & Sons, Hoboken.

[72] Tanzer, N. and C. Sim (1999), “Adapting instruments for use in multiple languages and cultures: A review of the ITC guidelines for test adaptations”, European Journal of Psychological Assessment, Vol. 15/3, pp. 258-269, https://doi.org/10.1027//1015-5759.15.3.258.

[47] Tate, T., M. Warschauer and J. Abedi (2016), “The effects of prior computer use on computer-based writing: The 2011 NAEP writing assessment”, Computers & Education, Vol. 101, pp. 115-131, https://doi.org/10.1016/j.compedu.2016.06.001.

[68] van de Vijver, F. and N. Tanzer (2004), “Bias and equivalence in cross-cultural assessment: An overview”, European Review of Applied Psychology, Vol. 54/2, pp. 119-135, https://doi.org/10.1016/j.erap.2003.12.004.

[37] van de Vijver, F. and N. Tanzer (1997), “Bias and equivalence in cross-cultural assessment: An overview”, European Review of Applied Psychology, Vol. 47/4, pp. 263-280.

[79] Williamson, D., X. Xi and F. Breyer (2012), “A framework for evaluation and use of automated scoring”, Educational Measurement: Issues and Practice, Vol. 31/1, pp. 2-13, https://doi.org/10.1111/j.1745-3992.2011.00223.x.

[53] Yamamoto, K., H. Shin and L. Khorramdel (2018), “Multistage adaptive testing design in international large-scale assessments”, Educational Measurement: Issues and Practice, Vol. 37/4, pp. 16-27, https://doi.org/10.1111/emip.12226.

[50] Zehner, F. et al. (2020), “PISA reading: Mode effects unveiled in short text responses”, Psychological Test and Assessment Modeling, Vol. 62/1, pp. 85-105, https://doi.org/10.25656/01:20354.

[54] Zenisky, A. and R. Hambleton (2016), “Multi-stage test design: Moving research results into practice”, in Yan, D., A. von Davier and C. Lewis (eds.), Computerized Multistage Testing, Chapman and Hall/CRC, New York.

[17] Zhang, H. and F. Luo (2020), “The development of psychological and educational measurement in China”, Chinese/English Journal of Educational Measurement and Evaluation | 教育测量与评估双语季刊, https://www.ce-jeme.org/journal/vol1/iss1/7 (accessed on 4 March 2023).

[60] Zhang, M. (2013), “The impact of sampling approach on population invariance in automated scoring of essays”, ETS Research Report Series, Vol. 2013/1, pp. i-33, https://doi.org/10.1002/j.2333-8504.2013.tb02325.x.

[63] Zhang, M., J. Chen and C. Ruan (2015), “Evaluating the detection of aberrant responses in automated essay scoring”, in van der Ark, L. et al. (eds.), Quantitative Psychology Research. Springer Proceedings in Mathematics & Statistics, Springer, Cham, https://doi.org/10.1007/978-3-319-19977-1_14.

[104] Zhang, Y., N. Dorans and J. Matthews-López (2005), “Using DIF dissection method to assess effects of item deletion”, ETS Research Report Series, Vol. 2005/2, pp. i-11, https://doi.org/10.1002/j.2333-8504.2005.tb02000.x.

[92] Zumbo, B. (2003), “Does item-level DIF manifest itself in scale-level analyses? Implications for translating language tests”, Language Testing, Vol. 20/2, pp. 136-147, https://doi.org/10.1191/0265532203lt248oa.

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

© OECD 2023

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at https://www.oecd.org/termsandconditions.