Chapter 3. Methodology for assessing computer capabilities using the Survey of Adult Skills (PIAAC)

This chapter describes the motivation and methodology for carrying out an exploratory assessment of computer capabilities to answer questions in the Survey of Adult Skills (PIAAC). The goal for this exercise was to develop a measure of computer capabilities that would be meaningful to educators and education researchers and also provide a credible basis for economic analysis. To achieve this goal, the OECD worked with a group of computer scientists to assess the difficulty of the PIAAC questions for computers. After setting out how the experts were chosen, the chapter describes the challenges they overcame to develop a methodology to carry out the assessment. A summary of the scope and limitations of the methodology is offered, as well as suggestions for possible future improvements.

  

The Survey of Adult Skills (PIAAC) measures a set of general cognitive skills that are developed during formal education and widely used at work. The survey includes tests of skills in literacy, numeracy and problem solving with computers.1 To provide a way of anticipating the changes that technology may bring to the use of these skills in the future, the OECD asked a group of computer scientists to assess the capabilities of computers related to answering the questions in the three skill areas included in the survey.

As this chapter sets out, the experts developed a common approach for rating computer capabilities after extensive discussion of a series of methodological issues. The approach involved rating the ability of current computer techniques to answer each test question, assuming a one-year development period costing no more than USD 1 million and using the same visual materials that were presented to the adults who took the test. For each question, the experts rated computer capabilities as Yes, No or Maybe. The analyses of the ratings that resulted from this exploratory work are discussed in Chapter 4.

Objective for the exploratory assessment of computer capabilities

The goal for the exploratory assessment was to develop a way of obtaining information about computer capabilities in a form that would be meaningful to educators and education researchers. Educators and education researchers are usually familiar with the types of skills assessed on tests like the Survey of Adult Skills. They are also familiar with the ways those skills are developed in education and potentially used at work and in daily life. PIAAC was specifically designed to provide this type of information across countries. Other tests also provide such information for particular types of skills and particular groups of individuals. However, educators and education researchers usually have little familiarity with the kinds of capabilities currently being demonstrated by computer science. This makes it difficult for the education community to understand the kinds of changes computers are likely to bring to work and skill demand over the next several decades.

The OECD’s analysis of computer capabilities was carried out to help the education community begin to analyse how computers are likely to change the skill requirements for future jobs. If computers have demonstrated some of the general cognitive capabilities assessed by PIAAC, then it is likely that employers will begin to use that technology to perform some of the tasks requiring general cognitive skills. This will ultimately shift workers to a different set of tasks, resulting in job destruction, creation and transformation. The shift is likely to take place slowly, probably over a decade or more. However, it is useful for the education system to anticipate these changes since schools often help students acquire skills that are believed to be useful one or more decades ahead. If technology is likely to substantially change the work skills that will be useful in several decades, then the education community needs to begin anticipating this change.

The exploratory study was also carried out to develop a more credible approach to assessing the capabilities of computers than has been achieved to date within economics. PIAAC allows an assessment of computer capabilities at a much more specific level of detail than prior work discussed in Chapter 1. This prior work has involved general descriptions of occupations or occupational tasks. Such descriptions are too coarse for computer scientists to be able to understand exactly what behaviour is included. As a result, when experts provide judgments about whether or not computers can carry out these tasks, it is generally not clear exactly what tasks they have in mind. It is almost certain that different experts are thinking of different tasks when responding to the same descriptions. For example, a task description such as “reads reports” could be used in describing many different occupations. The difficulty of the relevant task could vary widely between and within occupations. A computer scientist responding to such a description could therefore have many different possible tasks in mind when considering possible computer performance. By contrast, the PIAAC test questions involve precisely defined tasks. This allows computer scientists to closely analyse the specific information provided and the necessary information processing to answer a specific question. The PIAAC test questions provide a much more credible basis for assessing the capabilities of computers, just as they provide a more credible basis for assessing the skills of adults than simply asking adults whether they are able to “read reports”.

In order to assess the capabilities of computers using PIAAC, the plan was to ask a group of computer scientists to review the test questions in PIAAC’s three skill areas, and identify the questions that could be answered by machines today. The expectation was that computer scientists who work in areas related to language understanding and reasoning would be able to make these judgments based on their expertise about the capabilities and limitations of existing techniques. Their assessments would then be used to help educators and education researchers understand the capabilities of computers with respect to these three general cognitive skills and to help economists develop a comprehensive programme for credibly assessing computer capabilities across the full range of work skills.

This study was approached as an exploratory effort, with an expectation that it would take several additional attempts to refine a methodology for comparing machine capabilities to human skills. A relevant comparison is that it took several decades to develop and refine the approaches for comparing the skills of diverse individuals, including people from different cultures, people who speak different languages, and people with disabilities (e.g., National Research Council, 2002, 2004). With each of these expansions in the group of tested individuals, it was necessary to think carefully about which skills were being tested and why. When an existing test is given to a new group, it often becomes clear that some questions are unexpectedly hard or easy for the new group, for reasons that have nothing to do with the skill being assessed. For example, a test of arithmetic may be difficult for a non-native speaker because of the language used to give instructions and describe the problems, rather than because of the mathematical difficulty of the problems. It is reasonable to expect similar challenges when using tests to compare machine capabilities with human skills. It may therefore take time to develop appropriate ways of addressing them.

It is expected that the assessment in this study will be only the first step in the development of an approach to regularly monitor the increase in such computer capabilities. As such, lessons learned in conducting the assessment are as important at this stage as the findings themselves.

Identifying a group of computer scientists

Over a period of 10 months, approximately 60 computer scientists were contacted to provide input into the project. Initial recommendations for computer scientists were obtained from a set of social scientists who study the effects of computers on the labour market. These initial contacts were used to generate additional suggestions. The process was repeated until a full set of computer scientists had been identified who had appropriate expertise and were willing to participate in the evaluation.

Based on the initial set of contacts, the project identified a number of relevant areas of computer science for the assessment, including natural language processing, reasoning, common sense knowledge, computer vision, machine learning and integrated systems. The project set out to find participants in each of these areas who were willing to participate in the exploratory work. A group of prominent experts matching these criteria was successfully assembled.2 Table 3.1 lists the 11 participating computer scientists along with their areas of expertise.3

Table 3.1. Computer scientists providing assessments of computer capabilities

Computer scientists

Expertise

Jill Burstein, Research Director, Natural Language Processing Group, ETS Research Division

Natural language processing, automated essay scoring, discourse analysis, educational technology

Ernest Davis, Professor of Computer Science, Courant Institute, New York University

Representation of common sense knowledge

Kenneth D. Forbus, Walter P. Murphy Professor of Computer Science and Professor of Education, Northwestern University

Qualitative reasoning, analogical reasoning and learning, spatial reasoning, sketch understanding, natural language understanding, cognitive architecture, reasoning system design, intelligent educational software

Arthur C. Graesser, Professor, Department of Psychology and Institute for Intelligent Systems, University of Memphis

Question asking and answering, text comprehension, inference generation, artificial intelligence, computational linguistics, discourse technologies, human-computer interaction, problem solving

Jerry R. Hobbs, Research Professor, Fellow and Chief Scientist for Natural Language Processing, Information Sciences Institute, University of Southern California

Computational linguistics, discourse analysis, artificial intelligence, parsing, syntax, semantic interpretation, information extraction, knowledge representation, encoding common sense knowledge

Rebecca J. Passonneau, Director, Center for Computational Learning Systems, and Senior Research Scientist, Columbia University

Computational linguistics, computational semantics and pragmatics, discourse analysis, data mining, methodology

Vasile Rus, Professor, Department of Computer Science and Institute for Intelligent Systems, University of Memphis

Artificial intelligence, machine learning, computational linguistics, automated and human question answering and asking

Vijay Saraswat, Research Staff Member and Manager, IBM TJ Watson Research Center

Cognitive computing, theoretical computer science, programming systems, artificial intelligence, natural language processing, machine learning, probabilistic logic

Jim Spohrer, Director, Global University Programs and Cognitive Systems Group, IBM

Artificial intelligence, cognitive systems for holistic service systems

Mark Steedman, Professor of Cognitive Science, School of Informatics, University of Edinburgh

Computational linguistics, artificial intelligence, cognitive science, speech generation, communicative use of gesture, parsing, semantics

Moshe Vardi, George Distinguished Service Professor in Computational Engineering and Director of the Ken Kennedy Institute for Information Technology, Rice University

Database systems, computational-complexity theory, multi-agent systems, design specification and verification

Structure of the assessment of computer capabilities

The assessment was carried out during a two-day meeting, with materials provided to the participants to review in advance. All participants were given copies of the test questions in all three skill areas. In total, there were 128 questions across the areas of literacy, numeracy and problem solving using computers.4

The advance instructions and initial discussion addressed four primary issues regarding how to structure the task of evaluating computer capability to answer the questions: 1) whether to assess individual questions or to use cut-points across the full set of questions; 2) whether to assess computer capabilities in the past and the future; 3) how much development work to allow for applying the computer techniques to the specific context of the test questions; and 4) how to address the extensive visual input used on the test. Each of these issues is discussed in turn below.

Rating individual test questions or using cut-points across the full set of questions

The copies of the test questions were grouped separately by the three skill areas and arranged in order of increasing difficulty for adults, using the difficulty score for each question that is calculated as part of the analysis and scaling of the test results. Instructions provided to the group before the meeting suggested that they should identify cut-points in the series of questions (arranged from easy to difficult) between the questions that could be answered by computers now, and those that could not. The use of cut-points was suggested to provide an easy way of aggregating and comparing the ratings given by the different experts at the meeting. The instructions recognised that the order of difficulty of the questions would not necessarily be the same for computers as for people. Therefore, the instructions suggested that the experts identify questions that were not ordered with respect to their likely difficulty for computers so they could be considered separately.

In the discussion at the meeting, however, it became clear that the cut-point approach did not work for the computer scientists. Only about half had been able to evaluate the questions using cut-points, and most had strong practical and theoretical objections to the approach.

The group recognised that there are some ways in which problems that are more difficult for people will also be more difficult for machines: for instance, because they involve longer texts, require more inferences and include more possible wrong answers that need to be avoided. However, the experts also noted a number of ways that the difficulty of the questions is substantially different for people and machines. On the one hand, questions are often difficult for people if they involve long, repetitive texts or complicated calculations, factors that often pose little difficulty for machines. On the other hand, many questions that are easy for people involve interpreting pictures or social contexts, or coordinating information from pictures and text. Such factors are often quite difficult for computers. Because of these arguments, the group decided it would be better to rate computer capabilities with respect to each question. While potentially more time-consuming, this approach avoided having to assume how the ordering of the questions by difficulty for computers relates to their ordering by difficulty for people.
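Per-question ratings of this kind can later be aggregated across experts, as in the "lowest average rating" comparisons discussed elsewhere in this chapter. The sketch below illustrates one simple way to do this; the numeric mapping of Yes/Maybe/No to scores is an assumption for illustration, not necessarily the scheme used in Chapter 4.

```python
# Illustrative aggregation of per-question expert ratings.
# The numeric mapping (Yes=1.0, Maybe=0.5, No=0.0) is an assumed
# convention for this sketch, not the official scoring scheme.
SCORE = {"Yes": 1.0, "Maybe": 0.5, "No": 0.0}

def average_rating(ratings):
    """Mean score for one question across the experts who rated it."""
    return sum(SCORE[r] for r in ratings) / len(ratings)

# Hypothetical ratings for a single question from five experts:
ratings = ["Yes", "Maybe", "Yes", "No", "Maybe"]
print(average_rating(ratings))  # → 0.6
```

Questions can then be compared by their average score without assuming any shared difficulty ordering between people and machines.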

Giving ratings for the past and the future

The advance instructions asked the computer scientists to make their initial assessments with respect to the current capabilities of computers in 2016. Although there is great interest in the likely future capabilities of computers, the goal of the assessment was to avoid speculation on the initial rating and to assess computer capabilities in a way that could be justified by results demonstrated in the published research literature. After the initial assessment, the experts were also asked to consider how their assessments would have been different in 2006 and how they might be different in 2026. The point of introducing these alternative dates was to provide a way of thinking about the change in capabilities over time. Because of the speculation involved with projecting improvements to 2026, the initial plan was to rely primarily on the ratings for 2006 to look at change over time, since these ratings could be linked to the published literature and thereby avoid speculation.

It turned out, however, that the ratings for 2006 were difficult for the group to provide. Several of the experts explained this as being related to the difficulty of trying to imagine not knowing something that you already know. The improvements that have taken place since 2006 have been fully integrated into expert thinking and it is hard to identify when particular changes took place without going back to reconstruct the developments from the published literature itself.5

The group found it easier to think about likely improvements by 2026, while acknowledging that these projections could be quite wrong. There is a long history in AI of wildly optimistic projections of success in resolving problems that turned out to be much more difficult than was originally believed.6 Three experts provided a complete set of projections for the test questions for 2026. Although the group was generally more comfortable in projecting forwards than backwards, one expert pointed out that a projection of five years to 2021 would be more natural, because many grant applications require investigators to project the results of their own research over a period of three to five years. This means that researchers have regular experience in estimating the degree of change that can occur over this shorter period.

Setting parameters for development of computer systems for the test questions

It was necessary for the group to consider how much development work would be allowed to adapt current computer techniques to the context of the test questions in the three skill areas. Although the questions are designed to be familiar to the general adult population, there is no reason to expect that existing computer systems would have already been developed for the types of questions included in the test.

Some computer techniques, such as text search, can be applied to many different contexts without special preparation. Other computer techniques need to be adapted to specific contexts. This adaptation can involve training the system on a set of relevant examples or coding information about specific vocabularies, relationships or types of knowledge representation such as charts and tables. In asking the experts to consider the possibility of developing a computer system using current techniques to answer the test questions, it was necessary to set some boundaries on the size of the hypothetical development effort that would be required.

Two rough criteria were used in selecting an appropriate set of boundaries for the development effort that the experts should have in mind when making their judgments. First, the assessment was intended to reflect the application of current computer techniques, not the creation of completely new computer techniques. If a development effort uses large quantities of people, time and funding, it looks more like a research effort to develop new techniques than a development effort to apply current techniques. Second, the rating was intended to reflect the level of investment that a large company might be willing to make to automate some frequently performed task in the organisation. In this sense, the test questions were being used as a proxy for company-specific or job-specific tasks using general cognitive skills that a company might consider automating. Both of these criteria suggest a relatively limited development effort.

The advance instructions suggested that the computer scientists should think about a development effort representing roughly the work that could be done by a few people during a single year. During the discussion at the meeting, this constraint was further specified to involve an expenditure of no more than USD 1 million for development.

How to approach the use of visual materials

PIAAC uses materials in its test questions that are similar to the types of written materials that adults encounter at work and in their daily lives. These materials include signs, labels, advertisements, charts, tables, webpages, maps, drawings and photographs (OECD, 2016a). This range of test material differs substantially from more academic tests that might assess literacy only with narrative texts, and numeracy only with mathematical problems.

The diverse range of material used in PIAAC raises challenges for computers. The group of computer scientists spent substantial time figuring out how to address those challenges. In many cases, the diversity of input is included precisely because of the desire to assess whether adults are able to use information from such different sources. Most of the different types of materials are in general use. It is thus reasonable to assume that most adults will have been exposed to similar materials at school, at work or in their daily lives.

In other cases, however, the diversity was likely included to produce material that looks realistic, such as advertising with colourful designs and writing in distinctive layouts. In these cases, the extra realistic features probably do not cause any extra difficulty for the adults who take the test; indeed, extra realism may well make the materials more familiar to many adults and easier to use. However, such materials can make the questions substantially more difficult for computers. One example that the group discussed extensively was the easiest numeracy question for people, which uses a photograph of two packages of bottled water and asks how many bottles are in the packages. The numeracy aspect of the problem involves a simple multiplication, which is why the problem is so easy for people. However, the visual interpretation needed to answer the question, which people also find easy, is quite difficult for machines because the packaging makes many of the bottles hard to see. This question received the lowest average rating for computers across the group of experts. Machines have difficulty in interpreting this sort of image, since it is necessary to combine interpretation of the image itself with the right knowledge about the physical world.

The group discussed two options for addressing the visual material in the test questions. The first option involved assuming that the visual input would be transformed into a textual or numerical form, such as extracting the written material from an advertisement or turning a graphical chart into a digital table. In this option, a computer would answer the question using transformed materials that eliminate the problem of interpreting the visual input. The second option involved taking the visual input as given, requiring the computer to solve the same visual interpretation problem that people need to solve. The group decided to adopt the second option to preserve the integrity of the full set of test questions. As a result, some of the questions that are identified as ones that computers could not answer, such as the easiest question in numeracy discussed above, were identified as too difficult for computers primarily because they use visual material that is hard for computers to interpret.

Carrying out the second option added an extra practical difficulty for the exploratory assessment. Since visual processing is often not considered relevant to work in computer language and reasoning, many of the participating computer scientists did not have extensive knowledge about current capabilities in vision. To make up for this, the participating experts who do have some knowledge of those capabilities discussed the visual features that were likely to be easy or difficult in a sample of the problems. As a result, the judgments about the difficulty of the visual aspects of the questions reflected a more limited range of expertise across the group than the judgments about the language and reasoning aspects of the questions.

Final specifications of the assessment exercise carried out at the meeting

The first day of the meeting involved each computer scientist discussing how he or she had prepared in advance for the task. After discussing the above issues, the group agreed upon a common approach reflecting the criteria outlined at the start of this chapter regarding the time limit for development, the budget and the use of the test’s visual materials, as well as the decision to provide ratings for individual questions rather than cut-points. Using these criteria, all 11 computer scientists in the group provided ratings for the literacy and numeracy questions for 2016. In addition, six of the experts provided ratings for the third skill area of problem solving using computers, and three of them provided ratings for computer capabilities in 2026. The assessment ratings are analysed in Chapter 4.

Suggestions for improving the approach to assessing computer capabilities

The above issues shaped the group’s decision about how to provide a comparable set of assessments of computer capabilities for answering the test questions. In addition, the participants made a number of other suggestions for assessing computer capabilities that they did not have time to pursue at the meeting. The common theme linking these different suggestions was the possibility of finding ways to resolve disagreements in the ratings across the group. The section that follows discusses two types of suggestion: one focusing on improving understanding of the test questions and the other focusing on improving understanding of the capabilities of current techniques.

Improving understanding of the test questions

As the group discussed different questions at the meeting, there were a number of cases where the computer scientists realised they had misunderstood the requirements of a particular question. Sometimes this realisation led them to decide that the question was actually easier or more difficult for computers than they had originally thought. For example, the instructions for a number of questions say that the test-taker should highlight the passages in the text that provide an answer to the question, rather than directly provide the answer itself. In some cases, this difference – between highlighting the relevant text and independently specifying the answer – significantly affects the difficulty of providing an answer. Some participants had missed this distinction in their evaluations, and the discussion allowed the group to come closer to consensus about the difficulty of the question for computers.

To help make the rating process more systematic, several of the participants suggested it should be carried out in two stages, first identifying the different types of capabilities needed for each problem and then identifying what computers can do in each area. For example, the group’s extensive discussion of the challenges raised by the visual materials used in some of the questions showed the importance of identifying the questions that require visual interpretation. The group discussed some key contrasts in visual processing requirements, such as the difference between black-and-white and colour images. This is related to limits in current computer capabilities and could be used to code specific aspects of the visual materials used in the questions. Although a two-stage method seemed like a promising way to approach the rating process systematically, the group did not have enough time to apply it. Clearly the second-stage assessment, requiring multiple judgments for each test question, would be more time-consuming than the single judgments the computer scientists made at the meeting. In addition, several of the group members thought it would be time-consuming in the first stage to agree on a set of categories to describe the different types of capabilities.
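The two-stage procedure the participants suggested can be sketched in a few lines: first code the capabilities each question requires, then combine per-capability feasibility judgments into a per-question rating. All question identifiers, capability tags and feasibility values below are hypothetical illustrations, not codings produced by the group.

```python
# Sketch of the suggested two-stage rating process. All tags and
# feasibility judgments here are hypothetical illustrations.

# Stage 1: code the capabilities each question requires.
question_capabilities = {
    "literacy_q1": {"text_search", "inference"},
    "numeracy_q7": {"arithmetic", "colour_image_interpretation"},
}

# Stage 2: judge computer feasibility for each capability area.
capability_feasible = {
    "text_search": True,
    "inference": True,
    "arithmetic": True,
    "colour_image_interpretation": False,
}

def rate_question(qid):
    """A question is rated feasible only if every required capability is."""
    return all(capability_feasible[c] for c in question_capabilities[qid])

print(rate_question("literacy_q1"))  # → True
print(rate_question("numeracy_q7"))  # → False
```

The design makes the source of a No rating explicit: a question fails because of a named capability, such as visual interpretation, rather than an undifferentiated overall judgment.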

One concern raised during the discussion was that tests generally focus on assessing capabilities that are hard for people, while often omitting capabilities that are generally easy for people but hard for machines, such as vision and social interaction. This raises problems for interpreting computer performance against human performance using the same test questions. If a test omits capabilities that most people share but machines do not, then the results would overestimate computer performance in situations where those capabilities are important. On the other hand, if a test includes such capabilities, then computers may perform poorly primarily because of those capabilities, rather than because they lack the primary capabilities being assessed. In this case, the results would underestimate computer performance in situations where these sorts of capabilities are not important. Without being aware of the potential confounding role of the capabilities that are generally easy for people, it can be misleading to use estimates of computer capabilities from human tests to draw conclusions about the types of work tasks that computers might be able to perform.

The challenge of including capabilities that are easy for people but hard for machines was addressed most directly in the discussion of visual materials above, with a notable example being the easiest numeracy question, which requires counting packaged bottles in an image. This question is clearly easy for most adults and the numerical reasoning aspect of the question is also easy for machines. However, the group gave this question the lowest rating with respect to computer capabilities because of the difficulty posed by the packaging of the bottles. This question provides a good measure of computer numeracy capabilities in combination with visual interpretation, but a misleading measure of computer numeracy capabilities on their own.

In general, the experts surmised that the diverse material used in PIAAC does a better job representing capabilities that are easy for people but difficult for computers than is the case for many narrow academic tests. However, it would be useful to analyse separately the questions that require these additional skills and the questions that do not. This more precise analysis of the test questions would make it easier to understand where low computer performance is related specifically to the primary skills that are being tested by PIAAC – literacy, numeracy and problem solving – and where that performance is related to the need for additional capabilities such as vision.

Some additional capabilities, such as social interaction, are not reflected at all in PIAAC. For such capabilities, there are no relevant questions in the test that could be identified by a detailed analysis of the test questions. It would be helpful to simply identify that these skills have been omitted from the test and take that limitation into account when using the assessment results to analyse the potential effects of computers in different work settings. For example, an assessment of computer capabilities in literacy using PIAAC will probably be more useful in analysing the automation potential of language-related tasks in administrative jobs than in customer service jobs, because social interaction is more important for the latter. Another option would be to use other tests to assess these additional capabilities.

Finally, another question raised by one of the participants concerned how far the skills being measured on the test generalise, and therefore how to evaluate the underlying computer capabilities. When the computer scientists considered whether a particular question could be answered, they were interested in proposing general computer techniques that could potentially be successful on a wide range of comparable questions, rather than techniques geared specifically to work on a single question. However, it was sometimes difficult to know what questions would be truly comparable, since small differences in wording can often make a question much harder or easier for people, and presumably for computers as well. One way to clarify how far the tested skills generalise would be to provide more examples of test questions. Although this is not possible with PIAAC, which has a limited set of questions, many other standardised tests have large sets of practice questions that illustrate the range of material that will be tested.

Improving understanding of computer capabilities

There was general agreement across the group that their expertise was weak in the areas of computer vision and machine learning. Although there were participants who were familiar with work in each of these areas, the group did not include researchers for whom these areas are a primary focus. The group recommended that any future work to assess computer capabilities using PIAAC should include researchers with these specialties.

The meeting included numerous exchanges about the level of performance achieved by particular computer techniques. In most cases, all of the computer scientists were generally aware of the techniques mentioned, but not all of them knew about particular recent results or the details of how a technique had been applied. Given the time constraints, the exchanges on the details of a specific technique were limited to mentioning a relevant research article. Unlike the exchanges about the nature of the questions, the discussion of the performance of particular techniques did not appear to cause any of the experts to re-evaluate their conclusions about the difficulty of some of the test questions, except in the area of computer vision. With respect to the computer techniques used for language and reasoning, it appeared that the group would have required substantially more time for discussion to move closer to a consensus in their assessments.

One question raised by the discussion was what conclusions to draw from the disagreements in the assessments, given the time available. For instance, one group member might be aware of a new technique they believe would allow computers to successfully answer one type of question. However, this member would not necessarily be able to convince the other participants without time to share further details. The benefit of working towards a group consensus is that it allows this one person to educate everyone else. Of course, this can also go the other way, with a single sceptic who understands the limitations of a particular technique convincing everyone else that it would not be successful on a particular type of question. However, there was a lack of time to work towards a full consensus understanding of the different computer capabilities. Instead, the analysis of the assessment ratings in Chapter 4 uses a variety of approaches to explore the range of views across the group.

Finally, several of the computer scientists argued that discussion and analysis alone would ultimately be insufficient for reaching a consensus about the ability of current techniques to answer the test questions, even after extensive exchange of views. Instead, these experts suggested that it would be necessary in some cases to actually apply computer techniques to the test questions to see whether they would be successful. Such tests have frequently been performed in the field of computer science by holding competitions, which can sometimes attract substantial interest (e.g., Quillen, 2012; Visser and Burkhard, 2007). However, for resolving questions about the potential performance of particular techniques, it could also be effective to commission specific research groups who work with those techniques to apply them to a set of questions to assess their performance.

Summary of possible extensions for future work

The discussions at the meeting produced a range of suggestions for deepening the assessment of computer capabilities on a set of tested skills.

With respect to the test questions themselves, the meeting discussion suggested three possible extensions for future work: 1) conducting a two-stage evaluation with separate analyses of question requirements and computer capabilities; 2) considering the full set of work skills and identifying skills that are omitted from the test but that may be important in some work contexts where the tested skills are used; and 3) working with tests with a larger number of example questions. With respect to the computer techniques, the meeting suggested another three extensions: 4) expanding the range of computer science expertise included in the discussion; 5) reviewing a set of key research papers in greater detail; and 6) obtaining empirical results about the ability of computers to answer the test questions, particularly with respect to techniques or question types where the group was not able to reach consensus. These extensions provide a set of approaches that could be pursued in future work to sharpen the assessment ratings discussed in Chapter 4.

References

National Research Council (2004), Keeping Score for All: The Effects of Inclusion and Accommodation Policies on Large-Scale Educational Assessments, Committee on Participation of English Language Learners and Students with Disabilities in NAEP and Other Large-Scale Assessments, J.A. Koenig and L.F. Bachman, eds., The National Academies Press, Washington, DC.

National Research Council (2002), Methodological Advances in Cross-National Surveys of Educational Achievement, Board on International Comparative Studies in Education, A.C. Porter and A. Gamoran, eds., The National Academies Press, Washington, DC.

OECD (2016a), Skills Matter: Further Results from the Survey of Adult Skills, OECD Skills Studies, OECD Publishing, Paris, http://dx.doi.org/10.1787/9789264258051-en.

OECD (2016b), The Survey of Adult Skills: Reader’s Companion, Second Edition, OECD Skills Studies, OECD Publishing, Paris, http://dx.doi.org/10.1787/9789264258075-en.

OECD (2012), Literacy, Numeracy and Problem Solving in Technology-Rich Environments: Framework for the OECD Survey of Adult Skills, OECD Publishing, Paris, http://dx.doi.org/10.1787/9789264128859-en.

Papert, S. (1966), “The Summer Vision Project”, Artificial Intelligence Group Vision Memo No. 100, Massachusetts Institute of Technology, available at http://hdl.handle.net/1721.1/6125 (accessed 24 January 2017).

Quillen, I. (2012), “Hewlett Automated-Essay-Grader Winners Announced”, Education Week, 9 May, http://blogs.edweek.org/edweek/DigitalEducation/2012/05/essay_grader_winners_announced.html (accessed 24 January 2017).

Silver, D., et al. (2016), “Mastering the Game of Go with Deep Neural Networks and Tree Search”, Nature, Vol. 529, Macmillan Publishers, pp. 484-489.

Visser, U. and H.-D. Burkhard (2007), “RoboCup: 10 Years of Achievements and Future Challenges”, AI Magazine, Vol. 28/2, Association for the Advancement of Artificial Intelligence, pp. 115-132.

Notes

1. The formal name used for the problem solving skill area in PIAAC is “problem solving in technology-rich environments.”

2. The recruiting process also specifically attempted to identify a geographically balanced set of experts to ensure that a broad mix of research traditions from different countries would be reflected in the discussions. Although the project failed to find experts from a broad range of countries who were willing and able to participate, the experts who did participate in the meeting were well aware of work being carried out in different countries since computer science research is conducted on an international basis.

3. In addition to the 11 computer scientists, the meeting included four social scientists familiar with applications of computers in the workplace: Charles Fadel, Center for Curriculum Redesign; Michael J. Handel, Northeastern University; Frank Levy, MIT and Harvard Medical School; and Alistair Nolan, OECD.

4. See Chapter 2 for a brief description of the survey administration and Chapter 4 for a brief description of the different skill areas. PIAAC is usually administered on a computer when data are collected from adults, but the assessment of computer capabilities was carried out using static screen shots. As a result, in some cases only part of the question was available for evaluation. For more information on the design, administration and results of the Survey of Adult Skills see OECD (2012, 2016a, 2016b).

5. In fact, only one expert provided a complete set of ratings for 2006, although two others categorised the difficulty of the problems and suggested a way that their categories might relate to capabilities in 2006.

6. One example cited in the discussion was the case of computer vision, which was proposed as a summer research project in the mid-1960s and now, a half century later, is still one of the hardest problems in AI (Papert, 1966). However, unexpected successes from new techniques can also lead to the opposite result. For example, after the recent victory of Google DeepMind’s AlphaGo program over one of the world champions of the game of Go, some experts commented that such success had not been anticipated for at least another decade (Silver et al., 2016).