14. Conclusions and implications

James W. Pellegrino
University of Illinois Chicago
Natalie Foster
Mario Piacentini

This report argues for the need to seriously examine current assessment practices and pursue significant innovation in educational assessment. Such innovation encompasses: 1) the types of educational outcomes we assess; 2) how to do so by capitalising on the affordances of technology to systematically design assessment situations that provide rich and meaningful sources of data; and 3) the considerations needed to ensure that those assessments are valid and cross-culturally comparable given various possible interpretive uses to guide educational practice and policy.

This concluding chapter is divided into two sets of comments. Part I briefly restates key concerns expressed in the Introduction chapter for purposes of summarising and reflecting on the content of Chapters 1-13 of this report. The focus is on relevant theory and capabilities we now have while also considering the landscape of where we need to travel to advance the theory, science and practice of educational assessment. The chapters also remind us that assessment development is an application of scientific knowledge within constraints dictated by context and circumstances of use. Assessment is a “design science” like engineering, drawing upon foundational knowledge in the cognitive and learning sciences and in the measurement and data sciences. Part II of this concluding chapter attempts to look ahead and considers implications for what needs to be done to advance this design science and ensure its utility for education well into this century.

The Introduction chapter of this report set forth three main arguments for innovating assessments that are elaborated and addressed across the following thirteen chapters in this publication. The first argument is that educational policy and practice need to consider what is important to measure and better define the components of what are often complex constructs and the authentic contexts in which we engage them. In education, we need to measure what matters, not simply what is easy to measure. The second argument follows from the first: to assess constructs that matter we need to innovate the ways in which assessments are designed, including the technologies used to assist in this process, bearing in mind the goal of generating useful evidence about what students know and can do with respect to these complex constructs. The third argument follows from the first two: for the results of any such assessments to be useful to the intended audiences – be they teachers, administrators or policy makers – they must be valid, i.e. assess those competencies that they purport to measure and not others. Thus the evidence generated by innovative assessments must accurately reflect the complexity of the constructs assessed while taking into consideration the diversity of the individuals assessed as well as the intended uses of the information.

The process of collecting evidence to support inferences about what students know and can do represents a chain of reasoning from evidence about student competence that characterises all assessments. This process has been portrayed as a triad of interconnected elements – the assessment triangle – whose vertices represent the three key elements underlying any assessment: 1) a model of student cognition and learning in the domain of the assessment; 2) a set of assumptions and principles about the kinds of observations that will provide evidence of students’ competencies; and 3) an interpretation process for making sense of the evidence considering the assessment’s purpose and intended interpretive use (Pellegrino, Chudowsky and Glaser, 2001[1]). These three elements may be explicit or implicit, but an assessment cannot be designed and implemented, nor evaluated for its validity, without consideration of each.

Several of this report’s chapters explicitly focus on the “What” of educational assessment – the key constructs that we should be interested in assessing, why those constructs are important and where we stand with respect to assessing them given the current educational assessment landscape. The bulk of the argument across Chapters 1-4 of this report is that we should be focused on complex cognitive and socio-cognitive constructs, the labels for which fall under the broader heading of “21st Century competencies.” The challenge is carefully defining what we mean by these constructs so we can develop tasks and situations where individuals exercise the requisite competencies allowing us to obtain evidence that is valid, interpretable and useful whether the intended use is at the classroom level to guide learning and instruction or in a large-scale educational monitoring context such as the OECD’s Programme for International Student Assessment (PISA).

Multiple reasons are offered for the need to assess such complex competencies. Among them is the fact that assessments have value beyond the data they provide. They can send key policy and aspirational signals by illustrating the types of performance we want students to master and, in so doing, they can become a driver of innovation rather than an obstacle to it. Preparing for (and sometimes engaging in) such assessments should engage students in instructionally valuable activities and results from the assessments should provide instructionally meaningful information. The tasks that students encounter should tap “higher-order” cognitive skills that support transferable learning rather than primarily emphasising skills that tap rote learning and the use of basic procedures. While there is a necessary place for basic skills and procedural knowledge in educational assessment, it must be balanced with attention to critical thinking and applications of knowledge to new contexts.

Chapter 1 of this report provides an important review of different conceptual schemes that have been proposed regarding the complex cognitive and socio-cognitive competencies of interest, noting their family resemblances. The chapter also notes that while there is broad agreement and a strong narrative on the need to develop so-called 21st Century competencies, translating this vision into educational practice requires that curriculum, pedagogy and practice are aligned. In that regard, given its signalling power, assessment can be a powerful driver of alignment. The chapter is also realistic about the challenge of assessing 21st Century competencies with respect to defining constructs and learning progressions, the role of knowledge, task design, scoring, evidence interpretation, reporting and validation.

The challenge for task designers is to identify suitable test situations that call for students to engage 21st Century competencies as well as develop environments that allow students to respond authentically and generate interpretable evidence about their ways of thinking and doing. Along these lines, Chapter 2 of this report reinforces the importance of developing and using more interactive, complex and authentic assessment tasks, and argues that the theory and research in the learning sciences can provide important directions on how to do this. The authors propose some design innovations to develop new assessments that are complementary to existing ones. These include using extended performance tasks that have “low floors and high ceilings”; situating new assessments in specific knowledge domains; including opportunities for exploration, discovery and invention; and embedding intelligent feedback and learning scaffolds. If assessments are able to engage the same type of cognitive and socio-emotional processes as well-designed, deep learning educational experiences, then time spent on assessment need not take time away from learning.

Chapter 3 of this report takes the argument about the construct issue a step further regarding the cognition element of the assessment triangle and its implications for the assessment development process. While many agree that “higher-order” 21st Century competencies are important, the authors argue that identifying a long list of competencies and creating single assessment instruments for each one is probably not the most productive way to go. Such competencies do not exist in a vacuum and they are often used in combination in situations of real-life learning and problem solving. The chapter offers a simple framework to orient decision making on what to assess, guided by focusing on students’ capacity to solve important problems. The framework includes: 1) deciding the type of activity students will work on (e.g. finding and evaluating information to make a decision; understanding how something works; designing a new product or process; building and communicating an argument, etc.); 2) deciding the context/knowledge required by the activity (disciplinary or cross-disciplinary knowledge); and 3) deciding whether students will work on their own or collaboratively.

Elaborating on the close links between complex competencies and domain-specific knowledge and skills, Chapter 4 of this report argues that assessments and educational experiences must be re-evaluated to provide students with opportunities to engage in and practise the type of decision making and problem solving that practitioners in a given domain face in the real world, e.g. learning how to think and reason like a mathematician, a scientist or an historian. What this means is that assessment problems should require learners to choose strategies and make decisions that reflect authentic situations but nonetheless be constrained by requiring the knowledge expected only of students at a particular level. To make this possible, the authors argue, the key is to have a good understanding of the decisions practitioners face and define learning trajectory levels that are appropriate for the individuals assessed (cognition vertex of the assessment triangle), in turn using this knowledge to inform the design of tasks and scoring methods. The chapter operationalises this approach for an assessment of complex problem solving in science and engineering.

While Chapters 1 through 4 of this report focus on aspects of the cognition element of the assessment triangle, Chapters 5 through 10 address various aspects of the observation and interpretation elements of the assessment triangle with an emphasis on how technology can be exploited through and within a principled design process to create robust assessment instruments.

Chapter 5 of this report emphasises the point that digital technologies significantly expand our assessment capabilities. The author notes that innovative technology-enhanced assessments have the potential to expand the competencies it is possible to assess, enhance the way that tasks are designed and presented, as well as generate new sources of data and methods for analysing such evidence. Technology also introduces the possibility of embedding assessments in learning environments capable of providing unobtrusive and more contextualised data on problem solving and decision-making processes to complement data from stand-alone assessment scenarios. Regardless of the context for data capture, the author underlines that the use of technology to assess complex constructs must follow a coherent and principled assessment design process.

Chapter 6 reminds us that every new assessment needs to be guided by a theory of knowing and learning in the domain of interest and that anchoring the design of tasks and the evidence model to a well-defined theoretical framework is essential for generating valid inferences about performance – especially in the context of assessing complex constructs. The author emphasises that this process is an exercise in design science; one requiring close collaboration among potential users of the results, domain experts, psychometricians, software designers and UI/UX experts from the beginning of the design process. Evidence-Centred Design (ECD) provides an architecture to structure the process and products of such a collaborative design activity. The author notes that the main challenges in this process lie in interpreting the data that innovative and open tasks provide. We have much to learn about modelling the complex data associated with multidimensional and dynamic constructs including in situations where students can “learn” by interacting with the assessment environment. Some of these modelling and interpretive challenges are taken up in Chapters 8 and 13 of this report.

Chapter 7 of this report reminds us that the features of many educational assessments, particularly large-scale standardised tests, have been designed and formalised given various “practical” constraints (e.g. administration and scoring costs, time) and “technical” constraints (e.g. psychometric standards of reliability, validity, comparability and fairness). Today, in large part thanks to technological and data analytic advances, more is possible in terms of designing task formats, test features and sources of evidence for assessments. Notably, innovative assessments allow us to move beyond static tasks to present individuals with interactive and immersive problems. Such situations can also be designed to be adaptive to the test taker, to include resources for on-the-fly learning and to capture process data in addition to product data.

Chapter 8 of this report addresses how innovative assessments might capitalise on these technology affordances to generate defensible measurement claims that allow us to make inferences about respondents or the groups they represent, given that typical psychometric models used operationally in large-scale assessment programmes do not readily accommodate such complex data. While newer “data analytic” techniques can handle large and complex data sets that include process data, they do not yet have mature machinery appropriate to meeting the challenges of making measurement and inferential claims. The chapter discusses new analytical methods where existing models in psychometrics and data-mining techniques borrow strength from each other directly. It exemplifies a mIRT-Bayes hybrid approach that integrates scores generated by Bayes nets into an Item Response Theory (IRT) model, generating sizable measurement precision gains. The author argues that these approaches exploit the suitability and flexibility of Bayes nets for describing construct-relevant patterns from process data in technology-enhanced tasks while preserving the robust statistical properties of latent variable methods.

Chapters 5-8 of this report collectively make the case that processes of principled assessment design can take advantage of the affordances of technology to expand the space of the possible for design and implementation of next-generation assessments, and describe how the cognition, observation and interpretation components of the assessment triangle can be linked together to enable the reasoning from evidence process that must accompany any valid assessment effort. Chapters 9 and 10 of this report then provide more concrete cases of the possible, illustrating some of what can be done with technology while presenting some of the challenges inherent in working in complex design and interpretive spaces.

Chapter 9 argues that as educational goals increasingly focus on students’ capacity to learn, assessment should also enable and evaluate that capacity. Accordingly, assessment situations should invite students to engage in scenarios that can help elucidate the processes of learning. The chapter provides an example of designing feedback and resource affordances embedded in assessment/learning scenarios to serve two purposes: 1) to support authentic learning; and 2) to provide evidence about learning processes and how learners regulate their learning. The authors note that the design and use of resources introduce many challenges regarding the validity of inferences – perhaps most notably regarding the role of prior knowledge. Typically, the knowledge that learners bring to an assessment is the target construct of the assessment; in the example discussed in the chapter, the assessment is the context in which learning takes place. Interpreting learning activities in light of prior knowledge is a major undertaking, as learners’ activities and strategy choices are contingent on their knowledge. A related challenge is that of generalisability and transfer of students’ ability across resource types, learning opportunities and tasks. The authors conclude that inferences about the use of tools should combine top-down (justified by theory) and bottom-up (visible in data) arguments and evidence.

In Chapter 10 of this report, the authors turn to Intelligent Tutoring Systems (ITS) to illustrate how advances in technology and data analytics (e.g. natural language processing, speech recognition software, etc.) can enable the sorts of innovative designs argued for in previous chapters. ITS, the authors contend, exemplify how digital environments can already provide learners with dynamic learning tasks, interactivity and constant feedback loops – and hence innovative assessments have much to learn from them. With a number of examples from ITS, the chapter emphasises how artificial intelligence (AI)-based applications can support task design and scoring, from automating intelligent feedback tailored to the actions of examinees to producing indicators of learners’ collaboration with others in open scenarios.

Validity is the single most important property of any assessment, yet an assessment is never valid in and of itself: an assessment that may be valid for one interpretive use (e.g. a classroom teaching situation) may be invalid for a very different interpretive use (e.g. a cross-national comparison), and vice versa. As such, an assessment’s validity depends on arguments and evidence about its specific interpretive use. To be valid for a wide range of learners, assessments of complex constructs should measure well what they purport to measure, be accurate in evaluating students' abilities, and do so reliably across assessment contexts and scorers. They should also be unbiased and accessible and used in ways that support positive outcomes for students and educational systems. Principles associated with establishing validity are especially important and deserve careful attention and investigation for technology-rich assessments that target the measurement of complex performances.

Much of this report focuses on one of the most critical aspects of establishing the validity of next-generation assessments for 21st Century competencies: validity arguments and evidence derived through the application of a principled design process. For example, Chapter 6 discusses the key decisions that need to be considered and addressed at the beginning of an ECD process to guide development of valid assessments of complex constructs, starting with the definition of the construct(s) of interest. Evidence from the design process would then be complemented by various forms of empirical data on how the assessment performs, and the entire complex of evidence would constitute the elements of an assessment’s validity argument. Chapters 11 and 12 of this report address particular issues of validity and comparability in large-scale, technology-rich assessments including methodologies and principles for examining validity issues throughout assessment design and once data have been collected.

Chapter 11 notes that complex constructs like creativity, critical thinking, problem solving or collaborative skills are characteristically shaped by cultural norms and expectations. As a result, challenges arise in balancing measurement validity with score comparability in multilingual or multicultural assessment contexts. Therefore assessment developers should consider construct equivalence, test equivalence and testing condition equivalence during the assessment design process. Use of digital assessments, especially for assessing complex skills, also necessitates evaluating students’ digital literacy and examining potential biases against cultural subgroups in AI-based methodologies such as in test adaptivity, automated scoring and item generation engines.

Chapter 12 of this report expands on the very important point that process data, as mentioned in Chapters 7 and 8, can serve as important evidence regarding the processes of reasoning and problem solving that individuals employ when they work on complex assessment tasks irrespective of whether those tasks are stand-alone or are embedded in broader learning environments. Such process data can function in two ways related to assessment validity. Studies using process data in complex tasks have shown their value in validating assumptions about the cognitive constructs involved in assessment performance and as such they can constitute critical data during assessment design and initial validation efforts before assessments become fully operational. For tasks where prior validation of performance has been done, process data obtained during task execution may enrich score meaning and reporting and constitute a part of the interpretive process and evaluation of performance that goes beyond scores based solely on response accuracy. For example, differential engagement with an assessment task or situation is a potentially important index for both practitioners and policy makers. Students’ performance on large-scale assessments may not be a pure reflection of what they actually know and can do because of differences in prior knowledge, cultural norms, familiarity with technologies, attitudes and differences in educational experiences.

Finally, one of the implications throughout this report regarding the design of complex tasks and performances is that the interpretation of the evidence provided will not be simple. Undoubtedly it will require models and interpretive schemes that go well beyond the psychometric models and methods that have been the mainstay of most large-scale assessments. Chapter 13 of this report discusses validity implications of using the results of “predictions” from data mining and machine learning methods in reporting, given that these analyses are not supported by validity evidence that is deemed central in educational measurement (e.g. on reliability and precision, check of data-model fit, differential functioning and invariance). The authors argue that there is an important and critical intersection emerging between the fields of educational measurement and learning analytics, issues of vocabulary and definitions notwithstanding. These broad fields, which are really many fields, can meaningfully learn from each other when making claims or inferences about the complex constructs represented in innovative assessments. By engaging in solving the measurement and inferential issues that currently exist, both fields will likely advance the science and practice of educational assessment.

No single assessment can evaluate all the forms of knowledge and skill that we value for students, nor can a single instrument meet all the goals and information needs held by parents, practitioners and policy makers. As argued in the Introduction chapter, we need coordinated systems of assessments in which different tools are used for different inferential and reporting purposes – for example, formative and summative, or diagnostic vs. large-scale monitoring. Such assessment tools would operate at different levels of the educational system from the classroom on up to school, district, state, national and/or international levels of application. Within and across these levels, all assessments should faithfully represent the constructs of interest and reflect and reinforce desired outcomes that arise from good instructional practices and effective learning processes.

As noted in the Introduction chapter, the following features define the elements of assessments that operate within and across such systems of assessment: 1) the assessment of higher-order cognitive skill; 2) high-fidelity assessment of critical abilities; 3) items that are instructionally sensitive and educationally valuable; and 4) assessments that are valid, reliable and fair. A major challenge is determining a way forward whereby we can create coherent systems of assessments that meet the goals we have for the educational system, satisfy the information needs of different stakeholders, and that align with these criteria. The chapters in this report reveal progress that has been made in conceptualising and operationalising critical aspects of the assessments needed within such systems. The report provides a vision of what next-generation assessments should focus on, what they might look like and how they should function. As such we have the beginnings of a map of the terrain we need to move through to get there and some destinations along the way. The map includes the constructs of interest, the innovations and practices needed to make progress, as well as many of the conceptual and technical obstacles to overcome along the way.

A journey of the type envisioned by this report’s body of work cannot be undertaken nor will it succeed without an investment of multiple forms of capital. In the discussion that follows we consider three particular forms of capital that are needed and expand on why each is critical to the success of such an endeavour. They include intellectual capital, fiscal capital and political capital. Each is necessary but insufficient on its own – yet collectively they provide the capital needed to advance the theory and practice of educational assessment and maximise its societal benefit in the 21st Century.

The collective work described in this report illustrates that no single discipline or area of expertise will be sufficient to accomplish what needs to be done to innovate assessment. Advances to date reveal that next-generation assessment development is inherently a multidisciplinary enterprise: different communities of experts need to work together to find solutions to the many conceptual and technical challenges already noted as well as those yet to be uncovered as part of the journey. Enlisting creative people from multiple backgrounds and perspectives in the enterprise of assessment design and use, and facilitating collaboration among them, is critical. Synergies need to be fostered between assessment designers, technology developers, learning scientists, domain experts, measurement experts, data scientists, educational practitioners and policy makers.

Given that learning is embedded within social contexts and is characteristically shaped by cultural norms and expectations, we can expect performance to vary across cultures. Designing valid assessments for different student groups, particularly those for complex skills, requires multidisciplinary teams and expertise. Therefore it is necessary to consider the complex sociocultural context in deciding what to assess, how to assess it, and how assessment results will be interpreted and used. The PISA 2022 assessment of creative thinking (OECD, 2022[2]) exemplifies comparability challenges related to assessing a complex construct across language and cultural groups (see Box 11.1 in Chapter 11 of this report for more). Systematic evaluations of measurement comparability can provide the basis for future assessments of complex skills.

In addition to design and validation concerns arising from context and culture, the assessment development community writ large will need to grapple with complex issues including how to design tasks that can simulate authentic contexts and elicit relevant behaviours and evidence, how to interpret and accumulate the numerous sources of data that technology-enhanced assessments can generate, and how to compare students meaningfully in increasingly dynamic and open test environments. To address these and related issues, considerable research will need to focus on modelling and validating complex technology-enabled performances that yield multifaceted data sets. This includes modelling dependencies and non-random missing data in open and extended assessment tasks.

Emerging studies have shown that by working with experts from different disciplines, machine learning and AI techniques can help researchers better understand and model learning processes (Kleinman et al., 2022[3]) and can assist content experts in efficiently and effectively annotating students’ entire problem-solving processes at scale (Guo et al., 2022[4]). Work of this type is needed to supplement evidence derived from small-scale cognitive lab studies, advance learning science and have an impact on large-scale assessment.

At a pragmatic level, Schwartz and Arena (2013[5]) argue that we need to “democratise” assessment design in the same way the design of videogames has become more accessible with the proliferation of online communities. Crowdsourcing platforms, such as the Platform for Innovative Learning Assessments (PILA) at the OECD, provide developers with model tasks they can iterate on and embed data collection instruments that simplify researchers’ work on validation and measurement. Such environments and testbeds could make it far easier to engage in some of the multidisciplinary intellectual work noted above.

In summary, there are multiple intellectual and pragmatic challenges in merging learning science, data science and measurement science to understand how the sources of evidence we can obtain from complex tasks can best be analysed and interpreted using models and methods from AI, machine learning, statistics and psychometrics. Collaborative engagement with these concerns by learning scientists, data scientists, measurement experts, assessment designers, technology experts, experts in user interfaces and educational practitioners could yield a new discipline of Learning Assessment Engineering.

The development of assessments for application and use at any reasonable level of scale is a time-consuming and costly enterprise, especially for innovative assessment of the types envisioned in this report. The bulk of the substantial funds currently expended at national and international levels on assessment programmes is for the design and execution of large-scale assessments focused on traditional disciplinary domains like mathematics, literacy and science (e.g. the National Assessment of Educational Progress (NAEP) programme in the United States and the OECD’s PISA programme). Most such assessments fall within conventional parameters for task development, delivery, data capture, scoring and reporting. This has been true for quite some time despite the fact that most large-scale assessment programmes have moved to technology-based task presentation, data capture and reporting. Capitalising on many of the affordances of technology as described in this report has not been a distinct feature of those assessment programmes.

Developing and validating technology-rich assessment tasks and environments of the type advocated for in this report is a much more costly activity than updating current assessments by generating traditional items using standard task designs and specifications and presenting them via technology rather than paper-and-pencil. Such new instruments require considerable research and development regarding task design, implementation, data analysis, scoring, reporting and validation. As noted above, that scope of work needs to be executed by interdisciplinary groups representing domain experts, problem developers, psychometricians, UI designers and programmers. Sustained funding for the type of research and development needed is a key element in advancing next-generation assessment.

A significant roadblock to achieving assessment of 21st Century competencies is the paucity of examples of assessment instruments of complex cognitive and socio-cognitive constructs, especially examples that have been built following systematic design principles and then validated in the field. Those cases where the work has advanced to the point where validity arguments can be offered, including evidence of feasibility for implementation at scale, have seldom moved beyond the research and development labs where they were prototyped. This is true even for cases that have achieved a high level of visibility within the assessment research and development technical community. Regrettably, this body of work has not managed to change the way assessment is conceptualised and executed at scale. To advance the field of 21st Century innovative assessment, considerably larger capital investments need to be made of two types as argued below.

Substantial fiscal capital is required to assemble and support the multidisciplinary teams needed to conduct research and development supporting the creation of innovative next-generation assessments. The amount currently invested in multidisciplinary assessment research and development (R&D) work is but a tiny fraction of what is spent on more conventional large-scale assessment development and implementation. Government funding agencies, private foundations, testing companies and governmental assessment agencies have all been unwilling to make the systematic and sustained investments required. Funding at fiscal levels representing a small percentage of the total fiscal expenditures on educational assessment would make a significant difference in what could be done and how quickly it could be done. Without sustained and increased investments in the types of work required it will prove difficult, if not impossible, to accumulate the knowledge required to solve the conceptual and technical problems that remain and generate the solutions required for valid and useful assessment of challenging constructs.

Of equal need is investment in bringing existing innovative assessment efforts to full maturity by scaling up their implementation when evidence exists that they can effectively address the challenge of measuring the constructs that matter. Current and future innovative assessment solutions are likely to languish within the R&D laboratory unless funding can be provided to move them out of the laboratory and into the space of large-scale implementation, where their efficacy and utility can be properly evaluated. Only then will the possibility exist of using them to replace current ways of doing business.

As currently practised, educational assessment is a highly entrenched enterprise, particularly the use of large-scale standardised assessments for educational monitoring and policy decisions. Standardisation includes what is assessed, how it is assessed, how the data are collected and then analysed, and how the results are interpreted and then reported. This is not an accident but the product of many years of operating within a particular perspective on what we want and need to know about the knowledge, skills and abilities of individuals, coupled with a highly refined technology of test development and administration that is further coupled with an epistemology of interpretation about the mental world rooted in a measurement metaphor derived from the physical world.

It is hard to make major changes within existing systems when there are well-established operational programmes that are entrenched in practice and policy. Change of the type needed requires strong political will and vision to encourage people to think beyond what is possible now or even in the near future. Without political will, it will be impossible to generate sufficient fiscal capital to assemble the intellectual capital required to pursue next-generation assessment development and implementation and achieve meaningful change in educational assessment.

The political capital needed is not limited to policy makers. It encompasses multiple segments of the educational assessment development community, the measurement and psychometric community, and the educational practice community. Each of these communities has entrenched assumptions and practices when it comes to assessment. Thus, each community needs to buy into a vision of transformation that may well yield outcomes at variance with aspects of current standard operating procedure. For example, if students’ knowledge and skills are no longer seen as discrete and independent, assessing them requires examining the entire interactive process in adaptive learning environments that mimic real-world scenarios. Regardless of where the process may lead, these communities must work together to generate the amount of political will and capital needed to organise, support and sustain a transformation process for educational assessment in ways envisioned in this report.

It should be obvious that much is needed to advance the agenda for innovation in assessment along the lines outlined throughout this report. One of the biggest challenges in making change happen is that scale is needed to show what is possible. As noted earlier, scaling up promising ideas is critical for testing how flexible or brittle those ideas and assessment approaches may be, in addition to what it takes to put them into practice at scale. Fortunately we have some examples of efforts to do so, which in turn have taught us much with respect to what is possible as well as where challenges remain.

International assessments generally serve as tools for monitoring performance on contemporary disciplinary standards. As such these programmes make statements about what is valued globally and provide information about student proficiency at scale. They also illustrate an operational example of the pooling of intellectual, fiscal and political capital required to move an innovative large-scale assessment agenda forward. For example, in addition to its ongoing regular assessment programme in mathematics, reading and science, the OECD’s PISA Programme has embarked on including one “innovative” assessment in each of its assessment cycles. Through this effort, the OECD has signalled the important forms of 21st Century knowledge and skill that should be assessed as a part of monitoring broader educational goals and aims. We will briefly consider one recent example from that programme to illustrate some of what has been learned through attempts to put innovative ideas about the assessment of learning into practice.

In its 2025 cycle, PISA will include an assessment of Learning in the Digital World (LDW). When the PISA Governing Board embarked on this new development back in 2020, there were clear expectations about the added value it should bring: countries were interested in comparable data on students’ readiness to learn and problem solve with digital tools. Even before the COVID-19 global pandemic, it was clear to stakeholders that digital technologies are significantly impacting education, yet there is not enough information on whether students have the necessary skills to learn with these new tools and on whether schools are equipped to support these new ways of learning.

This policy demand oriented several design decisions. As already discussed, an assessment of learning skills has different requirements from an assessment of knowledge. To distinguish more effective learners from less effective learners, the assessment had to provide opportunities for students to engage in some type of knowledge construction activities. In other words, the assessment designers had to structure the assessment as a learning experience where it would be possible to evaluate how students’ knowledge changed over the course of the assessment. Consequently, the structure of the assessment units has diverged from the traditional PISA format with a series of stimuli and independent questions to a new format that is structured as a series of connected lessons (Figure 14.1).

A virtual tutor guides the students through the test, explaining how they can solve relatively complex problems using digital tools that include block-based coding, simulations, data collection and modelling interfaces. An interactive tutorial with videos is embedded in each unit to help students understand how to use these tools and mitigate differences in students’ familiarity with particular digital tools or learning environments. Students then solve a series of tasks that progress from easier to more difficult, introducing them to the concepts and practices they are expected to learn in the unit and that they will need to apply to the later, more complex “challenge” task.

Part of the assessment construct relates to students’ capacity to engage in self-regulated learning, therefore requiring the development of measures such as monitoring and adapting to feedback and evaluating knowledge and performance. In order to generate observables for these self-regulated learning processes, a number of affordances were embedded in the assessment environment. Over the course of the test, students can receive feedback by testing whether they achieve the expected outcomes by asking the tutor to check their work. They can choose to see the solutions to the training tasks after they submit their answers, and for each task they can access hints and worked examples to help them solve the problem. At the end of each unit, students are asked to evaluate their performance and report the effort they invested while working through the unit and the emotions they felt as they worked. The assessment thus integrates the idea that we can better measure complex socio-cognitive constructs by giving students choice in the assessment and monitoring not just how well students solve problems but also how they go about learning to do so.

These innovations represent responses to well-defined evidentiary needs. As further elaborated in Chapter 6 of this report, the assessment has been designed to provide responses to three interconnected questions: 1) What types of problems in the domain of computational design and modelling can students solve? 2) To what extent are they able to learn new concepts in this domain by solving sequences of connected, scaffolded tasks? 3) To what extent is this learning supported by productive behaviours, such as decisions to use learning affordances when needed or monitor progress towards their learning goals? These questions have defined the cognition model of the assessment, have oriented the design of tasks needed to elicit the necessary observations, and are guiding analysis plans to interpret the data in a way that is consistent with the reporting purposes of the assessment and that accounts for the complex nature of the data.

The expectation is to produce multidimensional reports of student performance on this test including measures of: 1) students’ overall performance on the tasks (represented in a scale, as in other PISA assessments); 2) learning gains, i.e. how much students’ knowledge of given concepts and their capacity to complete specific operations increase following the training; and 3) students’ capacity to self-regulate their learning and manage their affective states. These different measures will be triangulated in the analysis, for example explaining part of the variation in learning gains with the indicators of self-regulated learning behaviours. The goal is to provide policy makers with actionable information that is not limited to one score and a position in an international ranking but that includes more nuanced descriptions of what students can do and indicates what aspects of their performance deserve more attention.
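To make the idea of triangulation concrete, the following is a minimal sketch of how part of the variation in learning gains might be related to a single indicator of self-regulated learning. It is an illustration under stated assumptions, not the analytical model used in PISA: the learning gain is taken here as a simple post-minus-pre difference, the indicator is a hypothetical count of productive behaviours, the student records are invented, and the names `learning_gain` and `share_of_variance_explained` are introduced only for this example.

```python
from statistics import mean


def learning_gain(pre_score: float, post_score: float) -> float:
    """Raw gain: score after the training minus score before it (an assumption)."""
    return post_score - pre_score


def share_of_variance_explained(x: list[float], y: list[float]) -> float:
    """R-squared of a one-predictor least-squares line of y on x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary across students")
    return (sxy ** 2) / (sxx * syy)


# Invented records: (pre-training score, post-training score,
# hypothetical count of self-regulated learning behaviours).
students = [
    (2.0, 3.0, 0.0),
    (1.0, 3.0, 1.0),
    (2.0, 4.0, 2.0),
    (0.0, 4.0, 3.0),
]

gains = [learning_gain(pre, post) for pre, post, _ in students]
srl_indicator = [srl for _, _, srl in students]
explained = share_of_variance_explained(srl_indicator, gains)
print(f"Share of gain variance explained: {explained:.2f}")
```

An operational analysis would need to handle measurement error in the gain scores, several indicators at once and the nested structure of students within schools and countries, all of which this sketch leaves out.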

The development of the PISA 2025 Learning in the Digital World assessment was only possible because of the convergence of the different types of capital described in this chapter. The political backing of a research and development agenda by PISA participating countries has been strong. The innovative assessment included in each PISA cycle is now seen as a safe space to test important innovations in task design and analytical models that can then be transferred to the trend domains of reading, mathematics and science or that can provide inspiration for the development of national assessments once their value is proven. Acknowledging the need for multiple iterations in the design of tasks and for extensive validation processes for design and analytical choices through cognitive laboratories and pilot studies, the PISA Governing Board provided the financial and political support needed to start the development of the test five years before the main data collection. Further resources were made available by research foundations that recognised the value of innovating assessments.

The development of the assessment has also been steered by a group of experts with different disciplinary backgrounds: subject matter experts worked side-by-side with psychometricians, scholars in learning analytics and experts in UI/UX design. This cross-fertilisation was important to make space for new methods of evidence identification in digital learning environments while keeping in mind the core objective to achieve comparable metrics that result in valid interpretations of performance differences across countries and student groups.

This new PISA test represents only an initial foray into the enterprise of innovating assessments. As argued in this report, we need many new disciplinary and cross-disciplinary assessments to provide an exhaustive description of the quality of educational experiences across countries. Several challenges also remain, particularly in the interpretation vertex of the assessment triangle. International fora like PISA have a role to play in coordinating policy demands and facilitating a consensus on what pieces of the puzzle we need to work on and what the priorities should be for the near term and beyond. There is more than ample evidence that innovative assessment of educationally and socially significant competencies is both desirable and possible. The evidence also suggests that cooperation and collaboration on a global scale may well be the best and only way to achieve such advances.


[4] Guo, H. et al. (2022), “Understanding students’ test performance and engagement”, Invited session.

[3] Kleinman, E. et al. (2022), “Analyzing students’ problem-solving sequences”, Journal of Learning Analytics, Vol. 9/2, pp. 1-23, https://doi.org/10.18608/jla.2022.7465.

[2] OECD (2022), Thinking Outside the Box: The PISA 2022 Creative Thinking Assessment, OECD Publishing, Paris, https://issuu.com/oecd.publishing/docs/thinking-outside-the-box (accessed on 16 April 2023).

[6] OECD (forthcoming), PISA 2025 Learning in the Digital World Assessment Framework.

[1] Pellegrino, J., N. Chudowsky and R. Glaser (eds.) (2001), Knowing What Students Know, National Academies Press, Washington, D.C., https://doi.org/10.17226/10019.

[5] Schwartz, D. and D. Arena (2013), Measuring What Matters Most: Choice-Based Assessments for the Digital Age, The MIT Press, Cambridge.

© OECD 2023