Chapter 18. How the science of learning is changing science assessment

Barbara Means
Digital Promise
Britte Haugan Cheng
Menlo Education Research
Christopher J. Harris

Mandated tests exert a strong influence both on what gets taught in schools and on how it gets taught. In the past, conventional multiple-choice tests have inclined educators towards broad content coverage rather than deeper learning in science as well as in other academic subjects. But over the last decade, the prerequisites for a major shift in the nature of science assessments have emerged. New frameworks and standards for science learning, such as the Next Generation Science Standards, are based on conceptions of science proficiency coming out of learning sciences research. In combination with advances in interactive, adaptive digital systems and psychometric modelling, this learning sciences conception of what it means to develop science proficiency is stimulating new ways to capture students’ science ideas, concepts and practices simultaneously in the context of rich, extended scientific investigations.


The assessment of student learning is fundamental both to improving instruction (Black and Wiliam, 1998[1]) and to making judgments about the effectiveness of school systems (OECD, 2013[2]). In the past, different kinds of assessments have been used for these two purposes, and neither classroom assessments nor mandated large-scale assessments have rendered a full picture of what we now understand to be needed science proficiencies. An understanding of the latter, based on learning sciences research, has provided the basis for new science education standards in the United States and has influenced international assessments, such as the Programme for International Student Assessment (OECD, 2016[3]). In this chapter, we argue that rich technology-based environments will be necessary to assess the science proficiencies described in those standards and that such technology-based assessments have the potential to resolve the challenge that many countries face in trying to align and reconcile classroom-based and national learning assessments.

Origins of testing practices

While national tests of educational achievement are relatively new to many countries (OECD, 2013[2]), they have a long history in the United States. The US approach to large-scale testing of elementary and secondary school students evolved from 20th-century efforts to identify those who were and those who were not fit for certain tasks – originally, serving in the armed forces during World War I and later, entering various professions or going to college. The methods for developing and interpreting assessments were attuned to the goal of discriminating among levels of a hypothesised construct, such as general intelligence or maths skill. The logic underlying classical test development is that a sample of items drawn from the universe of all possible items will elicit examinee responses that justify inferences about how the test taker would have performed if every possible test item had been administered. To administer assessments to large numbers of examinees in as little time as possible and at low cost, test developers rely on questions with one right answer and multiple-choice formats. Such tests produce reliable scores and can be scored automatically. Unfortunately, the multiple-choice format has a serious downside: it tends to lure the test item writer into focusing on discrete bits of knowledge and highly structured problems – a far cry from the model of science proficiency and expertise that has now emerged from learning sciences research.
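The item-sampling logic described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not a model taken from the testing literature cited here: the function name, the `mastery` parameter and the idea that every item in the universe is equally difficult are all hypothetical choices made for illustration.

```python
import random


def simulate_domain_sampling(universe_size=1000, test_length=40, mastery=0.7, seed=0):
    """Compare a sampled-test score with the score on the full item universe.

    Assumption: the examinee answers each item correctly with the same
    probability `mastery`, which real item pools do not guarantee.
    """
    rng = random.Random(seed)
    # Simulated right (1) / wrong (0) outcome for every item in the universe.
    universe = [1 if rng.random() < mastery else 0 for _ in range(universe_size)]
    universe_score = sum(universe) / universe_size
    # A test form is a random sample of items drawn from that universe.
    form = rng.sample(universe, test_length)
    sample_score = sum(form) / test_length
    return universe_score, sample_score


universe_score, sample_score = simulate_domain_sampling()
print(f"full universe: {universe_score:.3f}, sampled form: {sample_score:.3f}")
```

Under these assumptions the sampled score approximates the full-universe score, and the approximation tightens as `test_length` grows. The sketch also shows the limit of the approach: it treats proficiency as a single proportion correct over interchangeable items.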

Learning sciences perspective on the nature of science expertise

Everyday conceptions of expertise regard the individual who has retained large numbers of science facts and data points as a science expert. But cognitive studies of expertise starting in the 1980s and more recent fine-grained analyses of the inter-personal and social nature of learning and competence suggest that this layperson’s view of expertise is fundamentally misleading (Bransford, Brown and Cocking, 2000[4]; Chi, Glaser and Farr, 1988[5]; Sawyer, 2005[6]). Expertise lies not so much in the number of facts an individual can state but in the way in which that knowledge is organised and the ability to apply it flexibly and appropriately in new situations. Moreover, the long-admired virtues of critical thinking and problem solving are no longer viewed as generalised capacities for abstract thinking, but rather as forms of thinking within particular domains that are necessarily manifested in combination with content knowledge (Bransford, Brown and Cocking, 2000[4]; Chi, Glaser and Farr, 1988[5]; Sawyer, 2005[6]). In addition, we now understand that scientific standards of evidence and forms of argument are socially constructed norms that are maintained through collaboration and communication (Lemke, 2001[7]).

These learning sciences insights into the nature of expertise exerted a major influence on A Framework for K-12 Science Education, developed by the National Research Council in 2012 (National Research Council, 2012[8]). In contrast to the earlier National Science Education Standards, which treated science inquiry processes and core concepts in the various domains as separate learning goals, the Framework makes a strong statement that science practices and science content must be integrated in order for students to become proficient in science. The Framework describes proficiency in science as the integration of (1) concepts (such as causality) that cut across many fields of science, (2) the practices that scientists and engineers use (such as evidence-based argument), and (3) core ideas in particular science disciplines (such as natural selection). Science proficiency lies in the ability to orchestrate all three dimensions (practices, crosscutting concepts and core ideas in the domain). Science learning is a trajectory in which these inter-related dimensions of proficiency emerge over time, with expected progressions of increasingly sophisticated understandings, not just within an instructional unit, but over multiple experiences and years (National Research Council, 2012[8]).

Research on learning progressions with respect to understanding a number of core science ideas and crosscutting concepts (Alonzo and Gotwals, 2012[9]; Corcoran, Mosher and Rogat, 2009[10]; Molnár et al., 2017[11]) had a major influence on the vision articulated in the Framework. Cognitive and science education researchers participated along with scientists in the fields of physical, life, earth/space, and engineering sciences in developing the Framework. When US state representatives and other stakeholders subsequently organised to lay out the Next Generation Science Standards (NGSS) based on the Framework, their descriptions of performance expectations for different grade bands were shaped by the learning progressions research. We note that this conceptualisation of science competencies in the Next Generation Science Standards, now widely influential in the United States, treats science proficiency as crossing multiple specific scientific domains (e.g. ecology and genetics) rather than as an expression of domain-independent or general problem-solving skills, as conceptualised by many European researchers (Csapó and Funke, 2017[12]; Molnár et al., 2017[11]; Zoanetti and Griffin, 2017[13]).

Implications for assessing science proficiency

Assessing progress towards attainment of science proficiency as set forth in the Framework and the NGSS requires assessing students’ application of practices, crosscutting concepts and core disciplinary ideas all at the same time within some larger problem context (National Research Council, 2014[14]). These three dimensions of proficiency will need to be assessed through multiple tasks that vary in what they ask students to do, including tasks calling on students to develop and use models, construct explanations, and evaluate the merit of others’ ideas and methods. Science assessments will also need to call on students to make connections between different science ideas and crosscutting concepts. Finally, they will need to be designed to provide information at multiple points in time about students’ progress with respect to the levels in the learning progressions incorporated in the NGSS.

In addition, the statistical models developed to produce scores and interpretations of test results for assessments of a single dimension are not sufficient for multi-dimensional assessment tasks consistent with the Framework. These models assume that all the items on a test are sampled from a single achievement domain and that responses to different items are independent of each other. Assessments meeting these criteria could not possibly capture the multiple, inter-related facets and complexities of the proficiencies defined by the three dimensions in the Framework.
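The two assumptions named above can be written compactly. The following is the standard textbook statement for a unidimensional item response model, offered as an illustration rather than as notation used in the Framework; the symbols are introduced here, with \theta a single latent proficiency and X_i the scored response to item i on an n-item test.

```latex
% Unidimensionality: one scalar proficiency underlies all n items.
\theta \in \mathbb{R}

% Local independence: given \theta, item responses are mutually independent.
P(X_1 = x_1, \dots, X_n = x_n \mid \theta) \;=\; \prod_{i=1}^{n} P(X_i = x_i \mid \theta)
```

A task that draws on practices, crosscutting concepts and core ideas at once calls for a vector of proficiencies rather than a scalar \theta, and a multi-part task in which one response constrains the next breaks the product form on the right-hand side.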

The importance of technology advances

To assess such combinations of the three Framework dimensions, we need more open-ended, multi-part problem contexts, and we must also allow for a student’s response to one portion of the problem to constrain responses to other portions (with the consequence that items are not independent). Recent advances in the capability of interactive learning technologies to present virtual environments, models of complex systems, digital workspaces and simulations are supporting efforts to address the challenge of providing rich problem contexts to all of the students being assessed. And, simultaneously, advances in psychometric theory and modelling have been such that we can now build measurement models for such complex, multi-part assessment tasks (Mislevy, Steinberg and Almond, 2003[15]; National Research Council, 2001[16]; Shute et al., 2016[17]). It is no longer unrealistic to think about measuring the kinds of thinking that scientists and engineers do in different domains using computer-based assessment tasks that are both engaging and systematically presented and scored.

Example of a technology-based science assessment task

An example of how assessment tasks can elicit the three intertwined dimensions of the NGSS comes from the ongoing work of the Next Generation Science Assessment project, a multi-institutional research and development collaboration to design, develop and validate sets of technology-enhanced assessment tasks for teachers to use formatively in classroom settings. The assessments are designed to help teachers gain insights into their students’ progress towards achieving the NGSS performance expectations for middle school science. The research team is using an evidence-centred design approach (Mislevy, Steinberg and Almond, 2003[15]; National Research Council, 2014[14]) to create computer-based, instructionally supportive assessment tasks with accompanying rubrics that integrate the three NGSS dimensions (Harris et al., 2016[18]).

The developers systematically deconstruct each NGSS performance expectation into a coherent set of learning performances, which can guide formative assessment design. Learning performances are statements that incorporate aspects of disciplinary core ideas, science practices and crosscutting concepts that students need to attain as they progress towards achieving an NGSS performance expectation. Each set of learning performances helps identify important formative assessment opportunities for teachers aligned with the proficiencies in a performance expectation. Learning performances are akin to learning goals that take on the three-dimensional structure of the performance expectations – they articulate and integrate assessable aspects of performance that build towards the more comprehensive performance expectation.

The learning performances then guide the design of assessment tasks that integrate the NGSS dimensions and collectively align with the performance expectation. Table 18.1 shows a set of learning performances derived from an NGSS performance expectation for the middle school topic of matter and its interactions.

Table 18.1. A middle grades NGSS performance expectation and related set of learning performances

Performance Expectation:

MS-PS1-4. Develop a model that predicts and describes changes in particle motion, temperature and state of a pure substance when thermal energy is added or removed.

Related Learning Performances

LP 1: Evaluate a model that uses a particle view of matter to explain how states of matter are similar and/or different from each other.

LP 2: Develop a model that explains how particle motion changes when thermal energy is transferred to or from a substance without changing state.

LP 3: Develop a model that includes a particle view of matter to predict the change in the state of a substance when thermal energy is transferred from or to a sample.

LP 4: Construct a scientific explanation about how the average kinetic energy and the temperature of a substance change when thermal energy is transferred from or to a sample, based on evidence from a model.

LP 5: Develop a model that includes a particle view of matter to predict how the average kinetic energy and the temperature of a substance change when thermal energy is transferred from or to a sample.

Source: Next Generation Science Assessment Collaborative.

Each of the learning performances developed for a performance expectation then becomes the learning outcome for a multi-part technology-based assessment task. Figure 18.1 shows an example of a task addressing all three NGSS dimensions for a learning performance aligned to the NGSS performance expectation for thermal energy and particle motion. In this task, students watch a short video of what happens when dye-coated candies are placed into water at different temperatures. Students then develop models and write a description of what is happening as the dye spreads differently at the different temperatures. The task assesses disciplinary core ideas around temperature and the kinetic energy of particles, the science practice of developing models, and the crosscutting concept of cause and effect. It requires students to integrate knowledge about particles, temperature and kinetic energy and the underlying mechanism linking cause (water temperature) and effect (spread of the dye) with the ability to develop a model of a phenomenon using drawings and written descriptions.

Figure 18.1. Assessment task example: Thermal energy and particle motion

Source: Next Generation Science Assessment Collaborative.

Role of technology-based learning environments for supporting learning and assessment

Even though we now have appropriate measurement models to assess multi-dimensional science proficiencies, the practical problem of finding time to administer multiple complex assessment tasks persists. Fortunately, the use of learning technologies in classrooms is increasing, and students’ daily instructional activities often involve the use of digital resources and instructional software (e.g. simulations and serious games). As a result, the distinction between “learning” and “assessment” is becoming blurred. Digital learning environments can be designed to elicit student thinking and proficiencies as a natural by-product of interacting with the system.

In the River City virtual environment, for example, students seek the source of an infectious disease by exploring various parts of the virtual city and its environs, performing environmental tests (which reveal the hypotheses they are entertaining) and making online journal entries with their findings and interpretations (revealing how students reason from data and put different pieces of information together to form inferences). These online activities are opportunities both for learning and for assessment. With a learning system like River City that is continually gathering information relevant to student proficiencies, there is no need to stop learning activities to see what students know and can do. It is possible to bring an assessment perspective and modern measurement models to this endeavour so that assessment becomes something that is “always on” during learning rather than a special event with different materials, formats and rules than either classroom learning or normal functioning in the world. The log file traces from River City, for example, have been used as assessment data, providing a basis for making inferences about students’ science inquiry processes (Ketelhut and Dede, 2006[19]).

Other powerful examples of technology-based formative assessments of multi-dimensional competencies are being developed that capitalise on ongoing assessment approaches. Researchers in Luxembourg have analysed log files from students using Genetics Lab to assess students’ competencies in systematically exploring relationships among variables determining genotypes and phenotypes (Csapó and Funke, 2017[12]). ChemVLab simulates a chemistry stockroom and workbench for carrying out a wide array of investigations, providing students with practical, simulated exposure to wet-lab work, data collection and interpretation, problem solving and sense making (Davenport et al., 2012[20]). Again, reports built from student log data give students and teachers ongoing progress monitoring and allow teachers to adjust their instruction accordingly.
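To make the idea of turning log files into assessment evidence concrete, the sketch below computes one simple indicator of systematic exploration: the share of consecutive trials in which a student changed exactly one input variable (often called a "vary one thing at a time" strategy). The log format, the variable names and the indicator itself are hypothetical and chosen for illustration; this is not the scoring method used by Genetics Lab, ChemVLab or River City.

```python
def votat_rate(trials):
    """Share of consecutive trials in which exactly one input variable changed.

    `trials` is a list of dicts mapping input-variable names to the values
    a student set on each trial (a hypothetical, simplified log format).
    """
    if len(trials) < 2:
        return 0.0
    controlled = 0
    for previous, current in zip(trials, trials[1:]):
        # Names of the variables whose value differs from the previous trial.
        changed = [name for name in current if current[name] != previous.get(name)]
        if len(changed) == 1:
            controlled += 1
    return controlled / (len(trials) - 1)


# Hypothetical log: four trials with two input variables.
log = [
    {"gene_a": 0, "gene_b": 0},
    {"gene_a": 1, "gene_b": 0},  # one variable changed
    {"gene_a": 1, "gene_b": 1},  # one variable changed
    {"gene_a": 0, "gene_b": 0},  # two variables changed
]
print(votat_rate(log))  # two of the three transitions change a single variable
```

An indicator like this is only one piece of evidence. In an evidence-centred design it would feed a measurement model alongside other observations rather than stand alone as a score.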

Implications for large-scale testing and assessment systems

To date, most of the research and development around rich technology-based assessment environments such as River City, Genetics Lab and ChemVLab has focused on low-stakes uses for formative purposes within classrooms. In part, this is because the learning sciences-based view of the multi-dimensional nature of science proficiencies described above is at odds with the nature of large-scale testing practised in the United States and in many other countries as part of educational accountability systems. In the United States, most states conduct state-wide end-of-year assessments in science just three times during a student’s K-12 schooling: once in elementary school, once in middle school, and once in high school, as required by federal education legislation.

These large-scale test administrations have been characterised as “drop in from the sky” assessments because they come from outside the classroom, interrupt the flow of classroom learning activities, and vary in the degree to which they relate to the curriculum students have been studying. The total amount of time US classrooms are being required to spend on drop-in-from-the-sky assessments has been a source of contention in recent years, but even so, it should be realised that the state assessment in any one subject area is quite limited in duration. For example, state science assessments in the United States typically are completed in 60-90 minutes on a single day. It can be argued that the combinations of practices, crosscutting concepts, and core ideas that comprise the proficiencies in the Framework and NGSS cannot possibly be exhibited within such a constrained time-period (Pellegrino, 2016[21]). Rather, these proficiencies come into play when learners work with complex problems and challenges over an extended timeframe.

A strategy for addressing the challenge of obtaining a meaningful assessment of domain-embedded science proficiencies as now understood, without unduly burdening teachers and students with many hours of testing, is to supplement (or replace) state testing activities with information about student proficiencies gleaned from formative assessments done within classrooms as part of instruction (see Csapó and Funke, 2017[12]; Zoanetti and Griffin, 2017[13] for a European view of this alternative). Because such classroom assessments are part of the learning process rather than an isolated, unrelated activity, they naturally occur over time and generate much more information about students’ science thinking and performance than any single drop-in-from-the-sky test could. Such ongoing, curriculum-embedded assessments also provide many more opportunities for examining the emergence of conceptual understanding and proficiency over time, consistent with the trajectories and learning progressions emphasised in the Framework and NGSS.

Challenges remain

Clearly, there are many challenges to using data from classroom-based assessment activities within accountability systems. Different classrooms use different curricula and sequence the treatment of science topics differently both within and between grade levels. District and state assessment data systems cannot accommodate a hodgepodge of different kinds of assessment data gathered differently in every classroom. Maintaining data over time and formatting it in some standard way for submission to the district or state would place a significant burden on teachers and schools. More importantly, meaningful comparisons over time or across schools could not be made if the testing content and conditions varied in unknown and drastic ways.

It is unlikely that these barriers could be overcome without using technology-based assessments. In the case of science learning, technology-based activities within microworlds and simulations that are geared to grade-level science performance expectations and combine learning and assessment in a seamless whole could provide some traction. As noted above, if a principled approach is applied to learning system design, the log file of a student’s actions while working with that system can be analysed automatically to yield assessment data. Now that a significant number of states have adopted the NGSS, investment of the resources needed to develop high-quality digital learning and assessment systems becomes more attractive because there are more potential users of such systems. To a large extent, the technology can provide for standard assessment conditions across classrooms and for standard data outputs that can be aggregated across classrooms, schools and districts.
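As a rough illustration of what "standard data outputs that can be aggregated" could mean in practice, the sketch below averages a single indicator at different levels of a reporting hierarchy. The record fields (`district`, `school`, `classroom`, `score`) and the sample values are hypothetical; no existing district or state data system is being described.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records, level):
    """Average the `score` field of standardised records at one hierarchy level."""
    groups = defaultdict(list)
    for record in records:
        groups[record[level]].append(record["score"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical records, one per student task attempt, in a common format.
records = [
    {"district": "D1", "school": "S1", "classroom": "C1", "score": 0.5},
    {"district": "D1", "school": "S1", "classroom": "C2", "score": 1.0},
    {"district": "D1", "school": "S2", "classroom": "C3", "score": 0.0},
]
print(aggregate(records, "school"))
print(aggregate(records, "district"))
```

The point of the sketch is the shared record format: once every classroom system emits the same fields, roll-ups to school or district level need no classroom-specific handling.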

While some researchers envision a time when externally mandated tests are replaced entirely by such classroom technology-based assessments (DiCerbo and Behrens, 2012[22]; Shute et al., 2016[17]), others suggest that some combination of data from externally mandated summative assessments and more detailed information from classroom technology-based assessments is the best path forward (Pellegrino, 2016[21]).

New opportunities

The influence of large-scale testing and accountability systems on classroom instruction has been well researched. Teachers, especially those in schools serving students from low-income backgrounds, have a tendency to narrow what they teach to the content that will be on mandated tests (Dee, Jacob and Schwartz, 2013[23]; Koretz, 2009[24]; Shepard, 2000[25]). Further, teachers often tend to model their classroom assessments on the item types and formats used in large-scale testing (Shepard, 2000[25]). These very natural reactions to testing and accountability regimes have resulted in science instruction featuring shallow treatment of a large number of topics rather than investigation and formation of connections among core ideas and crosscutting themes in science.

But the zeitgeist is shifting. As described above, the Framework and NGSS were crafted based on a research-informed perspective on the goals for science education and the nature of science proficiency. State adoption of the new science standards is approaching critical mass, and states are looking for new assessment instruments that measure student performance against standards-aligned performance expectations. The standards also are inspiring the development of new science curricula incorporating interactive digital learning resources such as simulations, models and virtual environments (Harris et al., 2015[26]; Roseman et al., 2015[27]).

Policy implications

If properly designed, new digital learning systems can capture a rich set of data documenting students’ science proficiencies and conceptual understanding as portrayed in the NGSS. As discussed above, modern statistical modelling techniques, principled approaches to assessment design, and technology affordances will be critical supports for the creation of assessments embedded in digital learning systems. State and national consortia should invest in research and development on the use of science assessments embedded in digital learning systems at scale. New federal education legislation (replacing No Child Left Behind with the Every Student Succeeds Act) has reduced pressure on states to show continually rising test scores, and this factor too helps to create a window for innovation in assessment practices. If states, districts, research and development labs, and commercial developers take advantage of this propitious set of circumstances to design and implement assessments reflecting learning science findings as embodied in the Framework, we can indeed improve not only our understanding of how well students are prepared for the science challenges of our century but also the quality of the instruction we offer them. The creation of such assessment systems would constitute a tremendous contribution of basic learning sciences research to educational practice.


[9] Alonzo, A. and A. Gotwals (2012), Learning Progressions in Science: Current Challenges and Future Directions, SensePublishers.

[1] Black, P. and D. Wiliam (1998), “Assessment and classroom learning”, Assessment in Education: Principles, Policy & Practice, Vol. 5/1, pp. 7-74.

[4] Bransford, J., A. Brown and R. Cocking (2000), How People Learn: Brain, Mind, Experience, and School, National Academy Press, Washington, DC.

[5] Chi, M., R. Glaser and M. Farr (1988), The Nature of Expertise, Lawrence Erlbaum Associates, Hillsdale, NJ.

[10] Corcoran, T., F. Mosher and A. Rogat (2009), Learning Progressions in Science: An Evidence-Based Approach to Reform, Consortium for Policy Research in Education, New York, NY.

[12] Csapó, B. and J. Funke (eds.) (2017), “Assessing complex problem solving in the classroom: Meeting challenges and opportunities”, in The Nature of Problem Solving: Using Research to Inspire 21st Century Learning, OECD Publishing, Paris.

[20] Davenport, J. et al. (2012), “ChemVLab+: Evaluating a virtual lab tutor for high school chemistry”, in Proceedings of the 2012 International Conference of the Learning Sciences, Academic Press.

[23] Dee, T., B. Jacob and N. Schwartz (2013), “The effects of NCLB on school resources and practices”, Educational Evaluation and Policy Analysis, Vol. 35/2, pp. 252-279.

[22] DiCerbo, K. and J. Behrens (2012), “From technology-enhanced assessments to assessment-enhanced technology”, paper presented at the annual meeting of the National Council on Measurement in Education.

[18] Harris, C. et al. (2016), “Constructing assessment items that blend core ideas, crosscutting concepts and science practices for classroom formative applications”, in Educational Measurement: Issues and Practice, Menlo Park.

[26] Harris, C. et al. (2015), “Impact of project-based curriculum materials on student learning in science: Results of a randomized controlled trial”, Journal of Research in Science Teaching, Vol. 52/10, pp. 1362-1385.

[19] Ketelhut, D. and C. Dede (2006), “Assessing inquiry learning”, paper presented at the annual meeting of the National Association for Research in Science Teaching.

[24] Koretz, D. (2009), Measuring Up: What Educational Testing Really Tells Us, Harvard University Press, Cambridge, MA.

[7] Lemke, J. (2001), “Articulating communities: Sociocultural perspectives on science education”, Journal of Research in Science Teaching, Vol. 38/3, pp. 296-316.

[15] Mislevy, R., L. Steinberg and R. Almond (2003), “On the structure of educational assessments”, Measurement: Interdisciplinary Research and Perspective, Vol. 1/1, pp. 3-62.

[11] Molnár, G. et al. (2017), “Empirical study of computer-based assessment of domain-general complex problem-solving skills”, in Csapó, B. and J. Funke (eds.), The Nature of Problem Solving: Using Research to Inspire 21st Century Learning, OECD Publishing, Paris.

[14] National Research Council (2014), “Developing assessments for the Next Generation Science Standards”, in Pellegrino, W. (ed.), Committee on Developing Assessments of Science Proficiency in K-12, National Academies Press, Washington, DC.

[8] National Research Council (2012), A Framework for K-12 Science Education, National Academies Press, Washington, DC.

[3] OECD (2016), PISA 2015 Assessment and Analytical Framework: Science, Reading, Mathematic and Financial Literacy, PISA, OECD Publishing, Paris.

[2] OECD (2013), Synergies for Better Learning: An International Perspective on Evaluation and Assessment, OECD Reviews of Evaluation and Assessment in Education, OECD Publishing, Paris.

[21] Pellegrino, J. (2016), “21st century science assessment: The future is now”, paper commissioned by SRI Education on behalf of the National Science Foundation.

[16] Pellegrino, J., N. Chudowsky and R. Glaser (eds.) (2001), Knowing What Students Know, National Academies Press, Washington, DC.

[27] Roseman, J. et al. (2015), “Curriculum materials for next generation science standards: what the science education research community can do”, paper presented at the annual meeting of the National Association for Research in Science Teaching.

[6] Sawyer, R. (ed.) (2005), The Cambridge Handbook of the Learning Sciences, Cambridge University Press, Cambridge.

[25] Shepard, L. (2000), “The role of assessment in a learning culture”, Educational Researcher, Vol. 29/7, pp. 4-14.

[17] Shute, V. et al. (2016), “Advances in the science of assessment”, Educational Assessment, Vol. 21/1, pp. 34-59.

[13] Zoanetti, N. and P. Griffin (2017), “Log-file data as indicators for problem-solving processes”, in Csapó, B. and J. Funke (eds.), The Nature of Problem Solving: Using Research to Inspire 21st Century Learning, OECD Publishing, Paris.
