Introduction: Arguments in support of innovating assessments

James W. Pellegrino
University of Illinois Chicago

This introduction sets forth the main arguments for innovating assessments that are elaborated across the chapters in this report. The first argument is that educational policy and practice need to (re)consider what is important to measure and better define the various components of what are often complex constructs and the authentic contexts in which we engage them. In education we need to measure what matters, not simply what is easy to measure. The second argument follows from the first – to assess constructs that matter we need to innovate the ways in which we design assessments and the technologies we use to assist in this process – all while bearing in mind the goal of generating useful evidence about what students know and can do with respect to these constructs. The third argument follows from the first two – for the results of any such assessments to be useful to the intended audiences, be they teachers, administrators or policy makers, they must be valid (i.e. assess those competencies that they purport to measure and not others) and they must be comparable (i.e. assess those competencies reliably across assessment contexts and socio-cultural groups). Furthermore, the particular user group(s) need to be able to make sense of the results. Thus, reporting of the forms of evidence generated by innovative assessments must be done in ways that accurately reflect the complexity of the constructs being assessed and the intended uses of the information.

The three interconnected arguments noted above broadly motivate the division of this report into three distinct parts: 1) the ‘what’ of assessment; 2) the ‘how’ of assessment; and 3) the interpretation and use of results from innovative assessments, including considerations of reliability and comparability. To develop and elaborate these arguments we begin with a brief discussion of a fundamental conception about assessment, namely that it constitutes a process of reasoning from evidence guided by theory and research on critical aspects of knowledge and skill. This fundamental principle provides a basis for developing each of the three arguments noted above, including their elaboration in subsequent chapters of this report. We conclude this chapter with an additional argument of consequence for educational policy and practice – to achieve innovation in assessment and effect positive impact on educational outcomes, more coherent systems of assessment are needed. Such systems better connect assessments to one another given their intended interpretive uses and their relationship to curriculum and instruction.

Educators assess students to learn about what they know and can do, but assessments do not offer a direct pipeline into a student’s mind. Assessing educational outcomes is not as straightforward as measuring height or weight; the attributes to be measured are mental representations and processes that are not outwardly visible. Thus, an assessment is a tool designed to observe students’ behaviour and produce data that can be used to draw reasonable inferences about what students know. Deciding what to assess and how to do so is not as simple as it might appear.

The process of collecting evidence to support inferences about what students know and can do represents a chain of reasoning from evidence about student competence that characterises all assessments, from classroom quizzes and standardised tests, to computerised tutoring programmes, to the conversation a student has with her teacher as they work through a math problem or discuss the meaning of a text. The first question in the assessment reasoning process is: “evidence about what?” Data become evidence in an analytic problem only when one has established their relevance to a conjecture being considered (Schum, 1987, p. 16[1]). Data do not provide their own meaning; their value as evidence can arise only through some interpretational framework. What a person perceives visually, for example, depends not only on the data she receives as photons of light striking her retinas but also on what she thinks she might see. In the present context, educational assessments provide data such as written essays, marks on answer sheets, presentations of projects or students’ explanations of their problem solutions. These data become evidence only with respect to conjectures about how students acquire knowledge and skill.

In the Knowing What Students Know report (Pellegrino, Chudowsky and Glaser, 2001[2]), the process of reasoning from evidence was portrayed as a triad of interconnected elements: the assessment triangle. The vertices of the assessment triangle represent the three key elements underlying any assessment (see Figure A): a model of student cognition and learning in the domain of the assessment; a set of assumptions and principles about the kinds of observations that will provide evidence of students’ competencies; and an interpretation process for making sense of the evidence in light of the assessment purpose and student understanding. These three elements may be explicit or implicit, but an assessment cannot be designed and implemented, or evaluated, without consideration of each. The three are represented as vertices of a triangle because each is connected to and dependent on the other two. A major tenet of the Knowing What Students Know report is that for an assessment to be effective and valid, the three elements must be in synchrony. The assessment triangle provides a useful framework for analysing the underpinnings of current assessments to determine how well they accomplish the goals we have in mind, as well as for designing future assessments and establishing their validity (Marion and Pellegrino, 2007[3]; Pellegrino, DiBello and Goldman, 2016[4]).

The cognition corner of the triangle refers to theory, data and a set of assumptions about how students represent knowledge and develop competence in an intellectual domain (e.g. fractions, Newton’s laws or thermodynamics). In any particular assessment application, a theory of competence in the domain is needed to identify the set of knowledge and skills that is important to measure for the intended context of use, whether that be to characterise the competencies students have acquired at some point in time to make a summative judgment or to make formative judgments to guide subsequent instruction so as to maximise future learning. A central premise is that the cognitive theory should represent the most scientifically credible understanding of typical ways in which learners represent knowledge and develop expertise in a domain.

Every assessment is also based on a set of assumptions and principles about the kinds of tasks or situations that will prompt students to say, do or create something that demonstrates important knowledge and skills. The tasks to which students are asked to respond on an assessment are not arbitrary; they must be carefully designed to provide evidence that is linked to the cognitive model of learning and to support the kinds of inferences and decisions that will be made on the basis of the assessment results. The observation vertex of the assessment triangle represents a description or set of specifications for assessment tasks that will elicit illuminating responses from students. In assessment, one has the opportunity to structure some small corner of the world to make observations. The assessment designer can use this capability to maximise the value of the data collected, as seen through the lens of the underlying assumptions about how students learn in the domain.

Every assessment is also based on certain assumptions and models for interpreting the evidence collected from observations. The interpretation vertex of the triangle encompasses all the methods and tools used to reason from fallible observations. It expresses how the observations derived from a set of assessment tasks constitute evidence about the knowledge and skills being assessed. In the context of large-scale assessment, the interpretation method is usually a statistical model, which is a characterisation or summarisation of patterns one would expect to see in the data given varying levels of student competency. In the context of classroom assessment, the interpretation is often made less formally by the teacher and is often based on an intuitive or qualitative model rather than a formal statistical one. Even informally, teachers make coordinated judgments about what aspects of students’ understanding and learning are relevant, how a student has performed one or more tasks, and what the performances mean about the student’s knowledge and understanding.
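To make the idea of a statistical interpretation model concrete, the sketch below uses the Rasch (one-parameter logistic) model, a widely used item response model. It is offered only as an illustration of how such a model characterises the response patterns one would expect at varying levels of competency; it is not a model prescribed in this report, and the item names and difficulty values are hypothetical.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model.

    Both arguments are on the same logit scale; the probability is 0.5
    when ability equals difficulty and rises as ability exceeds it.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical item difficulties (assumed values, for illustration only).
item_difficulties = {"item_easy": -1.0, "item_medium": 0.0, "item_hard": 1.5}


def expected_pattern(ability: float) -> dict:
    """Expected probability of success on each item for a given ability."""
    return {
        item: rasch_probability(ability, difficulty)
        for item, difficulty in item_difficulties.items()
    }


# Expected patterns for a lower-ability and a higher-ability student.
low_pattern = expected_pattern(-1.0)
high_pattern = expected_pattern(1.0)
```

Comparing an observed response pattern with these expected patterns is one simple way in which observations become evidence about competence; operational large-scale assessments typically rely on considerably richer models.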

A crucial point is that each of the three elements of the assessment triangle not only must make sense on its own, but also must connect to each of the other two elements in a meaningful way to lead to an effective assessment and sound inferences. Thus, to have a valid and effective assessment, all three vertices of the triangle must work together in synchrony.

Education research has firmly established that teachers, students, and local and national policy makers take their cues about the goals for instruction and learning from the types of tasks found on state, national and international assessments. Thus, what we choose to assess in areas such as science, mathematics, literacy, problem solving, collaboration and critical thinking is what will end up being the focus of instruction. It is therefore critical that our assessments best represent the forms of knowledge and competency and the kinds of learning we want to emphasise in our classrooms if students are to achieve the complex, multidimensional proficiencies needed for the worlds of today and tomorrow. Doing so, however, requires that we move from measuring what is easy to measuring what matters.

There is an increasing push to encourage students to develop “21st Century skills” that combine habits of mind and that include social and affective competencies (Bellanca, 2014[5]; Pellegrino and Hilton, 2012[6]). The European Commission's Rethinking Education (2012[7]) reform effort emphasises the need to promote transversal skills in education, such as critical thinking and problem solving. Additionally, the Programme for International Student Assessment (PISA) – the international assessment of student abilities administered by the OECD – has begun testing broader competencies that go beyond the disciplinary areas of mathematics, reading and science, such as problem solving and collaborative problem solving. Such 21st Century skills – or 21st Century competencies, as referred to throughout this report – are deemed necessary to prepare a global workforce to succeed in a new information-driven economy. Individuals must have the problem solving, critical thinking, and collaboration and communication skills to evaluate and make sense of new information and to act upon this information in a range of settings.

Business leaders, educational organisations and researchers have begun to call for new education policies that target the development of such broad, transferable skills and knowledge. For example, the US-based Partnership for 21st Century Skills (2010[8]) argues that student success in college and careers requires four essential skills: critical thinking and problem solving, communication, collaboration, and creativity and innovation. The report Education for Life and Work: Developing Transferable Knowledge and Skills in the 21st Century (Pellegrino and Hilton, 2012[6]) argued that the various sets of terms associated with the “21st Century skills” label reflect important dimensions of human competence that have been valuable for many centuries, rather than skills that are suddenly new, unique and valuable today. The important difference across time may lie in society’s desire for all students to attain levels of mastery – across multiple areas of skill and knowledge – that were previously unnecessary for individual success in education and the workplace. At the same time, the pervasive use of new digital technologies has increased the pace of communication and information exchange throughout society with the consequence that all individuals may need to be competent in processing multiple forms of information to accomplish tasks that may be distributed across contexts that include home, school, the workplace and social networks.

In order to shift from policy into practice, assessments need to be able to measure these skills and competencies. To do that we need to have clear conceptions and definitions of the constructs to be assessed (i.e. the cognition), the forms of evidence associated with those constructs (i.e. the observations), and ways to make sense of that evidence for the purposes of reporting and use (i.e. the interpretation).

This report’s first four chapters explicitly focus on the ‘what’ of educational assessment – the key constructs that we should be interested in assessing, why those constructs are important, and where we stand with respect to assessing them given the current educational assessment landscape. The bulk of the argument across Chapters 1-4 is that we should be focused on complex cognitive and socio-cognitive constructs, both within and across disciplinary domains. The chapters discuss what we mean by these constructs and the types of tasks and situations where individuals would be required to exercise the requisite competencies, thereby providing the types of evidence that would be valid, interpretable and useful whether the intended use is at the classroom level to guide learning and instruction or in a large-scale educational monitoring context. Each of the chapters illuminates ways in which we might conceptualise and operationalise these constructs, as well as some of the challenges in doing so. They set the stage for chapters that follow on moving from conceptualisation of what we may want and need to assess as part of the advancement of 21st Century education, to the details of the design process and ways in which technology can enable the creation of situations that will provide the evidence we need while also assisting in the process of making sense of that evidence.

While it is especially useful to conceptualise assessment as a process of reasoning from evidence, the design of an actual assessment is a challenging endeavour that needs to be guided by theory and research about cognition as well as practical prescriptions regarding the processes that lead to a productive and potentially valid assessment for a particular context of use. As in any design activity, scientific knowledge provides direction and constrains the set of possibilities, but it does not prescribe the exact nature of the design nor does it preclude ingenuity to achieve a final product. Design is always a complex process that applies theory and research to achieve near-optimal solutions under a series of multiple constraints, some of which are outside the realm of science. In the case of educational assessment, the design is influenced in important ways by variables such as its purpose (e.g. to assist learning, to measure individual attainment or to evaluate a programme), the context in which it will be used (e.g. classroom or large scale), and practical constraints (e.g. resources and time).

Recognising that assessment is an evidentiary reasoning process, it has proven useful to be more systematic in framing the process of assessment design as an Evidence-Centred Design (ECD) process (Mislevy and Haertel, 2007[9]; Mislevy and Riconscente, 2006[10]). The process starts by defining the claims that one wants to be able to make about student knowledge and the ways in which students are supposed to know and understand some particular aspect of a content domain. Examples might include aspects of algebraic thinking, ratio and proportion, force and motion, heat and temperature, etc. The most critical aspect of defining the claims one wants to make for the purposes of assessment is to be as precise as possible about the elements that matter and to express these in the form of verbs of cognition that are more precise than high-level, superordinate cognitive verbs such as know and understand. Example verbs might include compare, describe, analyse, compute, elaborate, explain, predict, justify, etc. Guiding this process of specifying the claims is theory and research on the nature of domain-specific knowing and learning.

While the claims one wishes to make or verify are about the student, they are linked to the forms of evidence that would provide support for those claims – the warrants in support of each claim. The evidence statements associated with given sets of claims capture the features of work products or performances that would give substance to the claims. This includes which features need to be present and how they are weighted in any evidentiary scheme, i.e. what matters most and what matters least or not at all. For example, if the evidence in support of a claim about a student’s knowledge of the laws of motion is that the student can analyse a physical situation in terms of the forces acting on all the bodies, then the evidence might be a free body diagram that is drawn with all the forces labelled including their magnitudes and directions.

The precision that comes from elaborating the claims and evidence statements associated with a domain of knowledge and skill pays off when one turns to the design of tasks or situations that can provide the requisite evidence. In essence, tasks are not designed or selected until it is clear what forms of evidence are needed to support the range of claims associated with a given assessment situation. The tasks need to provide all the necessary evidence and they should allow students to show what they know in ways that are as unambiguous as possible with respect to what the task performance implies about student knowledge and skill, i.e. the inferences about student cognition that are permissible and sustainable from a given set of assessment tasks or items.
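The relationship among claims, evidence statements and tasks described above can be sketched as a small data structure. The following is a minimal illustration only, not part of any ECD tooling: all class names, field names, weights and the example task are hypothetical, and the check simply reports which evidence statements no task yet elicits.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class EvidenceStatement:
    description: str  # feature of a work product that supports the claim
    weight: float     # assumed relative importance within the evidentiary scheme


@dataclass
class Claim:
    statement: str    # phrased with a precise verb, e.g. "analyse"
    evidence: List[EvidenceStatement] = field(default_factory=list)


@dataclass
class Task:
    name: str
    elicits: Set[str] = field(default_factory=set)  # evidence descriptions the task can produce


def uncovered_evidence(claims: List[Claim], tasks: List[Task]) -> Dict[str, List[str]]:
    """Return, per claim, the evidence statements that no task elicits."""
    elicited = set().union(*(task.elicits for task in tasks))
    gaps = {}
    for claim in claims:
        missing = [e.description for e in claim.evidence if e.description not in elicited]
        if missing:
            gaps[claim.statement] = missing
    return gaps


# Hypothetical example based on the laws-of-motion illustration in the text.
claim = Claim(
    statement="Analyse a physical situation in terms of the forces acting on all bodies",
    evidence=[
        EvidenceStatement("free body diagram with all forces labelled", 0.6),
        EvidenceStatement("magnitude and direction given for each force", 0.4),
    ],
)
task = Task(
    name="draw_free_body_diagram",
    elicits={"free body diagram with all forces labelled"},
)

gaps = uncovered_evidence([claim], [task])
```

In this hypothetical case the check flags that no task yet asks for magnitudes and directions, which mirrors the point that tasks should be designed or selected only once the required forms of evidence are clear.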

In the Knowing What Students Know report (Pellegrino, Chudowsky and Glaser, 2001[2]), many of the affordances of technology for advancing assessment design and practice were discussed in terms of the three interconnected components of the assessment triangle. The brief discussion that follows focuses on the constructs that could be represented in innovative assessment frameworks (cognition), the ways in which those constructs could be realised in the assessment environment (observations), and some of the interpretive challenges and solutions associated with doing so for purposes of measurement and reporting (interpretation).

What matters in assessment is what we are trying to reason about – the contemporary conception of student cognition in a domain that matters to domain experts, educators and society. As the conception of student cognition changes and expands in terms of what students are supposed to know and be able to do, as has been the case for many domains, technology affords opportunities for substantially changing and extending the observation and interpretation components of the assessment triangle to more adequately represent and provide evidence about the constructs of interest. Doing so enhances the entire evidentiary reasoning process and the validity of an assessment given its intended interpretive use.

Technology provides opportunities for the presentation of dynamic stimuli (e.g. videos, graphics, 2- and 3-D simulations) that can be interacted with in the service of eliciting relevant sets of responses from students. Simultaneously, technology enables the generation and capture of a variety of response products, including situations in which students generate responses using multiple modalities (e.g. drawing and writing). Technology-enhanced assessments enable engagement with a variety of content and practices by opening the door to interactive stimulus environments and response formats that better match the intended reasoning and response processes that form the basis for desired claims about student proficiency (Gorin and Mislevy, 2013[11]).

Students’ interactions with these technology-enhanced assessments can be logged to provide data on how they engage in particular processes. For various 21st Century competencies, the process by which one completes the activity can be as important a piece of information about knowledge and skill as the final product. In these cases, understanding the operations that students performed in the process of creating the final product may be critical to evaluating students’ proficiency. Log data offer the opportunity to reveal these actions, including where and how students spend their time and what choices they make in situations like using a simulation. Such applications offer the potential to provide large volumes of “click-stream” and other forms of response process data that might be useful for making inferences about student thinking (Ercikan and Pellegrino, 2017[12]).

Technology offers significant opportunities to enhance the reasoning-from-evidence process given the types of observations described above. Collecting these types of data makes little sense unless there are ways to reliably and meaningfully interpret them. This can evolve through mechanisms such as automated scoring of responses and application of complex parsing, statistical and inferential models for response process data (Ercikan and Pellegrino, 2017[12]). Critical data to consider include the time taken to perform various actions, the actual activities chosen, and their sequence and organisation. The potential exists for examining the global and local strategies students use while solving assessment problems and their implications, including how such strategies relate to the accuracy or appropriateness of final responses. Although capturing such data in a digital environment is relatively easy, making sense of the data is far more complicated. The same can be said for capturing responses to constructed-response questions where students may be expressing in written and/or graphical form an argument or explanation about some social, economic or scientific problem or phenomenon, describing the design of an investigation, or representing a model of some structure or process.
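As a simple illustration of the kinds of process data mentioned above (time taken for actions, the activities chosen and their sequence), the sketch below summarises a small click-stream log. The log format, action names and timings are hypothetical, and the summary is only a first descriptive step; it is not a scoring or inferential model.

```python
from collections import defaultdict

# Hypothetical click-stream log: (seconds since task start, action name).
events = [
    (0.0, "open_simulation"),
    (4.5, "set_parameter"),
    (9.0, "run_trial"),
    (21.0, "set_parameter"),
    (24.0, "run_trial"),
    (40.0, "submit_answer"),
]


def summarise_process(log):
    """Return the ordered action sequence and the total time spent per action.

    The time attributed to an action is the gap until the next logged event,
    so the final event in the log has no duration.
    """
    ordered = sorted(log)
    sequence = [action for _, action in ordered]
    time_per_action = defaultdict(float)
    for (start, action), (next_start, _) in zip(ordered, ordered[1:]):
        time_per_action[action] += next_start - start
    return sequence, dict(time_per_action)


sequence, time_per_action = summarise_process(events)
```

Descriptive summaries of this kind are the easy part; deciding what a particular sequence or time allocation implies about a student's strategy is the interpretive problem the paragraph above describes.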

The data capture contexts described above are challenging regarding scoring and interpretation. It is here that artificial intelligence and machine learning may play a significant role in future innovative assessments (Zhai et al., 2020a[13]; 2020b[14]). Developments in machine learning may also allow researchers to analyse complex response process data and to reveal patterns that provide important insights into students’ cognitive processes in problem solving (Zhai et al., 2020a[13]; 2020b[14]; 2021[15]; Zhai, 2021[16]; Zhai, Krajcik and Pellegrino, 2021[17]). Such data may prove to be especially informative about student thinking and reasoning and thus add greatly to the knowledge gained about student competence from large-scale assessments like PISA. An interesting example was provided in a recent report by Pohl et al. (2021) who showed that differences in student response processes, when combined with scoring methods, can significantly change the interpretation of a country’s performance in PISA.

In summary, digital technologies hold great promise for helping to bring about the changes in assessment that many believe are necessary. Technologies available today and innovations on the immediate horizon can be used to access information, create simulations and scenarios, allow students to engage in learning games and other activities, and enable collaboration among students. Such activities make it possible to observe, document and assess students’ work as they are engaged in natural activities – perhaps reducing the need to separate formal, external assessments from learning in the moment (Behrens, DiCerbo and Foltz, 2019[18]). Technologies will certainly make possible the greater use of formative assessment that in turn has been shown to significantly impact student achievement. Digital activities may also provide information about abilities such as persistence, creativity and teamwork that current testing approaches cannot. Juxtaposed with this promise is the need for considerable work to be done on issues of scoring and interpretation of evidence before such embedded assessment can be useful for these varied purposes.

Developing assessments of complex cognitive competencies requires being explicit about all three elements of the assessment triangle and their inter-relationships. While Chapters 1-4 of this report primarily focus on Argument 1 concerns regarding the cognition element of the assessment triangle, Chapters 5-10 address various aspects of Argument 2 regarding the observation and interpretation elements of the assessment triangle, with an emphasis on how technology can be exploited through and within a principled design process to create assessments of the complex cognitive and socio-cognitive performances that matter. Through a combination of argument and specific examples, Chapters 5-10 provide support for the claim that next-generation assessments are possible but can only be generated through a highly principled design process that makes explicit the evidentiary chain of reasoning at the core of valid assessment. The chapters also reveal the complexities that accrue in designing such assessments and then making sense of the multiple forms of evidence they can produce.

The joint American Educational Research Association (AERA), American Psychological Association (APA) and National Council on Measurement in Education (NCME) Standards (1999[19]; 2014[20]) frame validity largely in terms of “the concept or characteristic that a test is designed to measure” (2014, p. 11[20]). In Messick’s construct-centred view of validity, the theoretical construct the test score is purported to represent is the foundation for interpreting the validity of any given assessment (Messick, 1994[21]). For Messick, validity is “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores” (1989, p. 13[22]). Important work has been done to refine and advance views of validity in educational measurement (Haertel and Lorie, 2004[23]; Kane, 1992[24]; 2001[25]; 2006[26]; 2013[27]; Mislevy, Steinberg and Almond, 2003[28]). Contemporary perspectives call for an interpretive validity argument that “specifies the proposed interpretations and uses of test results by laying out the network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on the performances” (Kane, 2006, p. 23[26]).

Kane (2006[26]) and others (Haertel and Lorie, 2004[23]; Mislevy, Steinberg and Almond, 2003[28]) distinguish between: 1) the interpretive argument, i.e. the propositions that underpin test score interpretation; and 2) the evidence and arguments that provide the necessary warrants for the propositions or claims of the interpretive argument. In essence, this view identifies two essential components of a validity argument: the claims being made about the focus of an assessment and how the results can be used (the interpretive argument), and the evidence and arguments in support of those claims. Appropriating this approach, contemporary educational measurement theorists have framed test validity as a reasoned argument backed by evidence (Kane, 2006[26]). An argument and evidence framing of validity supports investigations for a broad scope of assessment designs and purposes, including many that go beyond typical large-scale tests of academic achievement or aptitude and move one into the arena of innovative and instructionally supportive assessments (Pellegrino, DiBello and Goldman, 2016[4]).

Given the nature of the constructs of interest, including their inherent complexity and multidimensionality, we must acknowledge from the outset the challenges that will be faced in establishing validity arguments for innovative assessments of 21st Century competencies, including the reporting of results for various intended use cases. Validity arguments will depend on well-developed interpretive arguments that include: 1) clear specifications of the constructs of interest and their associated conceptual backing; 2) the forms of evidence associated with those constructs; and 3) the methods for interpretation and reporting of that evidence. Such interpretive arguments are essential to guide assessment design processes, including carefully thought-out applications of technology and data analytics to support the observational and inferential aspects of the overall reasoning-from-evidence process. As noted above, carefully developed and articulated claims about what is being assessed and reported then need to be supported by empirical evidence. Such evidence can be derived from multiple forms of data involving variations in human performance and is essential to establishing an assessment’s validity argument.

In pursuing innovative assessments of 21st Century competencies, of paramount concern are issues of equity and fairness as part of the validity argument. Of particular concern is the comparability of results and validity of inferences derived from performance obtained across different modes of assessment, especially for varying groups of students (Berman, Haertel and Pellegrino, 2020[29]). As large-scale assessment has moved from paper-and-pencil formats to digitally-based assessment, the general focus has been on mode comparability and concerns about student familiarity and differential access to the hardware and software used (Way and Strain-Seymour, 2021[30]). However, as the digital assessment world advances, a significant issue for large-scale innovative assessment is determining how student background characteristics including language, culture and educational experience influence performance on different types of tasks and innovative assessment designs that leverage the power of technology. As the assessment environments and tasks become more innovative, equity and fairness concerns become even more important than general mode comparability effects. Thus, a key part of the validity argument for any innovative assessment will be establishing the socio-cultural boundaries related to equitable and fair interpretations and uses of the assessment results.

Much of this report focuses on critical aspects of design and development as part of establishing the validity of next-generation assessments for 21st Century competencies. More specifically, Chapters 5-10 focus on the validity evidence that would be derived through the application of a principled design process that forces one to articulate, in varying degrees of detail, the connections between and among the cognition, observation and interpretation components of the assessment. Such evidence contributes to the assessment’s overall validity argument but needs to be complemented by various forms of empirical data on how the assessment performs. Chapters 11-13 extend the validity evidence and argument discussion by considering comparability and fairness concerns in large-scale, technology-rich assessments, as well as considering the valid interpretation and use of results derived from innovative analytic approaches. These chapters discuss methodologies and principles for examining validity issues throughout assessment design and once assessment data have been collected.

No single assessment can evaluate all of the forms of knowledge and skill that we value for students; nor can a single instrument meet all of the goals held by parents, practitioners and policymakers. As argued below, it is important to envision a coordinated system of assessments in which different tools are used for different purposes – for example, formative and summative, or diagnostic vs. large-scale reporting. Within such systems, however, all assessments should faithfully represent the constructs of interest and all should model good teaching and learning practice.

At least four major features define the elements of assessment systems that can fully reflect rigorous standards and support the evaluation of deeper learning (see Darling-Hammond et al. (2013[31]) for an elaboration of the relevance, meaning and salient features of each of these criteria):

  • Assessment of higher-order cognitive skills through most of the tasks that students encounter – in other words, tasks that tap the skills that support transferable learning rather than emphasising only those that tap rote learning and the use of basic procedures. While there is a necessary place for basic skills and procedural knowledge, it must be balanced with attention to critical thinking and applications of knowledge to new contexts.

  • High-fidelity assessment of critical abilities, as articulated in the standards – such as communication (speaking, reading, writing and listening in multi-media forms), collaboration, modelling, complex problem solving and research, in addition to key subject matter concepts. Tasks should measure these abilities directly as they will be used in the real world rather than through a remote proxy.

  • Use of items that are instructionally sensitive and educationally valuable – in other words, tasks should be designed so that the underlying concepts can be taught and learned, distinguishing between students who have been well- or badly-taught rather than reflecting students' differential access to outside-of-school experiences (frequently associated with their socio-economic status or cultural context) or interpretations that mostly reflect test-taking skills. Preparing for (and sometimes engaging in) the assessments should engage students in instructionally valuable activities, and results from the tests should provide instructionally useful information.

  • Assessments that are valid, reliable and fair for a range of learners, such that they measure well what they purport to measure, are accurate in evaluating students' abilities and do so reliably across testing contexts and scorers. They should also be unbiased and accessible and used in ways that support positive outcomes for students and instructional quality.

A major challenge is determining the conditions and resources needed to create coherent systems of assessments that work across contexts ranging from the classroom to larger organisational units such as districts, states and countries, as well as the international level. Regardless of their context of implementation, assessments in such systems must support the ambitious goals we have for the educational system, meet the information needs of different stakeholders, and align with the criteria above. Aspects of this assessment system design and implementation challenge are taken up in the Conclusion chapter of this report.

Innovation and change are always challenging no matter the context. They have been especially challenging in education systems given long-standing and entrenched histories of educational policy and practice. Many have argued that education has changed little over the last 50-100 years in terms of how it is organised and delivered, what is taught and how it is assessed. Yes, there have been changes in the subject matter learned, in the pedagogies employed and, most recently, in the uses of technology, but those changes have been evolutionary and not revolutionary. Not surprisingly, much the same can be argued about educational assessment regarding what we assess and how we do so, including applications of technology to the practice of assessment – evolutionary, but not revolutionary.

This report is focused on an alternative and perhaps revolutionary vision that starts with the complex cognitive competencies that are deemed critical for citizens of the 21st Century. The report’s chapters provide a vision of what they are by characterising how we might create environments and situations where the competencies of interest would necessarily be expressed in addition to describing the evidence that those environments could provide about those competencies. Some might find it curious that a vision for the future of education starts with assessment rather than curriculum and instruction. One of the benefits of thinking first about the outcomes we desire from the educational system, with a particular focus on what they would look like, is that this information provides the basis for a ‘Backwards Design’ process regarding the design of curriculum and instruction that can lead to those outcomes (Wiggins and McTighe, 2011[32]).

As you read the chapters in this report, we hope they help you consider the costs and benefits of innovative educational assessment. These considerations include the competencies described, the types of environments for assessing them, conceptual and operational design and implementation challenges, and the value of the information derived in terms of its utility for classroom teaching and learning and for education more broadly. We also suggest that you consider what it might take to move in the directions highlighted by this report given the many entrenched assumptions, policies and practices that have come to dominate the educational assessment landscape. These and other process-of-change issues are taken up in the concluding chapter of this report.


[20] AERA, APA, NCME (2014), Standards for Educational and Psychological Testing, American Educational Research Association, Washington, D.C.

[19] AERA, APA, NCME (1999), Standards for Educational and Psychological Testing, American Educational Research Association, Washington, D.C.

[18] Behrens, J., K. DiCerbo and P. Foltz (2019), “Assessment of complex performances in digital environments”, The ANNALS of the American Academy of Political and Social Science, Vol. 683/1, pp. 217-232.

[5] Bellanca, J. (2014), Deeper Learning: Beyond 21st Century Skills, Solution Tree Press, Bloomington.

[29] Berman, A., E. Haertel and J. Pellegrino (eds.) (2020), Comparability of Large-Scale Educational Assessments: Issues and Recommendations, National Academy of Education, Washington, D.C.

[31] Darling-Hammond, L. et al. (2013), Criteria for High-Quality Assessment, Stanford Center for Opportunity Policy in Education, Stanford.

[12] Ercikan, K. and J. Pellegrino (eds.) (2017), Validation of Score Meaning for the Next Generation of Assessments, Routledge, New York.

[7] European Commission (2012), Rethinking Education: Investing in Skills for Better Socio-Economic Outcomes, European Commission, Strasbourg.

[11] Gorin, J. and R. Mislevy (2013), “Inherent measurement challenges in the Next Generation Science Standards for both formative and summative assessment”, Paper presented at the Invitational Research Symposium on Science Assessment, Washington, D.C.

[23] Haertel, E. and W. Lorie (2004), “Validating standards-based test score interpretations”, Measurement: Interdisciplinary Research & Perspective, Vol. 2/2, pp. 61-103.

[27] Kane, M. (2013), “Validating the interpretations and uses of test scores”, Journal of Educational Measurement, Vol. 50/1, pp. 1-73.

[26] Kane, M. (2006), “Validation”, in Brennan, R. (ed.), Educational Measurement, American Council on Education/Praeger, Westport.

[25] Kane, M. (2001), “Current concerns in validity theory”, Journal of Educational Measurement, Vol. 38/4, pp. 319-342.

[24] Kane, M. (1992), “An argument-based approach to validity”, Psychological Bulletin, Vol. 112/3.

[3] Marion, S. and J. Pellegrino (2007), “A validity framework for evaluating the technical quality of alternate assessments”, Educational Measurement: Issues and Practice, Vol. 25/4, pp. 47-57.

[21] Messick, S. (1994), “The interplay of evidence and consequences in the validation of performance assessments”, Educational Researcher, Vol. 23/2, pp. 13-23.

[22] Messick, S. (1989), “Meaning and values in test validation: The science and ethics of assessment”, Educational Researcher, Vol. 18/2, pp. 5-11.

[9] Mislevy, R. and G. Haertel (2007), “Implications of evidence-centered design for educational testing”, Educational Measurement: Issues and Practice, Vol. 25/4, pp. 6-20.

[10] Mislevy, R. and M. Riconscente (2006), “Evidence-centered assessment design: Layers, concepts, and terminology”, in Downing, S. and T. Haladyna (eds.), Handbook of Test Development, Lawrence Erlbaum, Mahwah.

[28] Mislevy, R., L. Steinberg and R. Almond (2003), “On the structure of educational assessments”, Measurement: Interdisciplinary Research and Perspectives, Vol. 1/1, pp. 3-67.

[8] Partnership for 21st Century Skills (2010), 21st Century Readiness for Every Student: A Policymaker’s Guide, (accessed on 4 March 2023).

[2] Pellegrino, J., N. Chudowsky and R. Glaser (eds.) (2001), Knowing What Students Know: The Science and Design of Educational Assessment, National Academy Press, Washington, D.C.

[4] Pellegrino, J., L. DiBello and S. Goldman (2016), “A framework for conceptualizing and evaluating the validity of instructionally relevant assessments”, Educational Psychologist, Vol. 51/1, pp. 59-81.

[6] Pellegrino, J. and M. Hilton (eds.) (2012), Education for Life and Work: Developing Transferable Knowledge and Skills in the 21st Century, The National Academies Press, Washington, D.C.

[1] Schum, D. (1987), Evidence and Inference for the Intelligence Analyst, University Press of America, Lanham.

[30] Way, D. and E. Strain-Seymour (2021), “A framework for considering device and interface features that may affect student performance on the National Assessment of Educational Progress”, Paper commissioned by the NAEP Validity Studies Panel.

[32] Wiggins, G. and J. McTighe (2011), The Understanding by Design Guide to Creating High-Quality Units, ASCD.

[16] Zhai, X. (2021), “Practices and theories: How can machine learning assist in innovative assessment practices in science education”, Journal of Science Education and Technology, Vol. 30/2, pp. 139-149.

[13] Zhai, X. et al. (2020a), “From substitution to redefinition: A framework of machine learning-based science assessment”, Journal of Research in Science Teaching, Vol. 57/9, pp. 1430-1459.

[15] Zhai, X. et al. (2021), “A framework of construct-irrelevant variance for contextualized constructed response assessment”, Frontiers in Education, Vol. 6, pp. 1-13.

[17] Zhai, X., J. Krajcik and J. Pellegrino (2021), “On the validity of machine learning-based Next Generation Science Assessments: A validity inferential network”, Journal of Science Education and Technology, Vol. 30/2, pp. 298-312.

[14] Zhai, X. et al. (2020b), “Applying machine learning in science assessment: A systematic review”, Studies in Science Education, Vol. 56/1, pp. 111-151.

Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

© OECD 2023
