Machine reading: Successes, challenges and implications for science

J. Dunietz
AAAS Science and Technology Policy Fellow (STPF)
United States

As the rate of scientific publication has skyrocketed, many researchers have proposed taming the literature with artificial intelligence (AI). By harnessing the tools of natural language processing (NLP), researchers hope to automate some of the paper reading. This essay lays out a variety of reading comprehension behaviours, or “tasks”, that NLP systems might perform on scientific literature. The essay places these tasks on a spectrum of sophistication based on models of human reading comprehension. It argues that today’s NLP techniques grow less capable as tasks require more sophisticated understanding. For example, today’s systems excel at flagging names of chemicals. However, they are only moderately reliable at extracting machine-friendly assertions about those chemicals, and they fall far short of, say, explaining why a given chemical was chosen over plausible alternatives. The essay also discusses implications for where NLP tools can fit into researchers’ workflows and offers several policy-relevant suggestions.

The core insight of this essay is that “reading” is not one monolithic capability. A shallow reader – whether human or automated – can do far less with a text than one who has combed through it and comprehended it deeply. Accordingly, the plausibility of proposals to have machines read papers depends on precisely what capabilities NLP is imagined to have. Without a well-calibrated notion of what NLP systems can do after having “read”, the scientific community risks either missing opportunities for discovery or pinning its hopes on technology that does not yet exist.1

Among the many theoretical models of human reading comprehension, the “Construction-Integration” (CI) model (Kintsch, 1988) is one of the most influential (McNamara and Magliano, 2009). It posits that concepts and propositions are first “activated” – i.e. made available for easy mental retrieval – then iteratively selected and merged into a globally coherent interpretation.

For this essay, the CI model is significant not for its proposed cognitive processing mechanisms, but rather for the form it assumes for the interpretation. The model asserts that a reader’s mental representation includes three inter-constrained levels of information:

The surface structure consists of raw linguistic information, such as what words and phrases are present and what syntactic structures connect them.

The textbase is the set of explicit propositions expressed by phrases and sentences. Given one or more passages, the textbase includes all elementary propositions a reader would take away. For a scientific paper, this might include assertions like “Sample A was kept at 20°C”, “MgCl2 represses expression of PmrA-activated genes” and “the response curve was modelled as a sigmoid.”2

A situation model3 is an integrated representation that stitches together all asserted propositions and their relationships, as well as their relationships with unstated knowledge. This level might include relationships between assertions (e.g. that an organism retains a trait even though a gene has been knocked out); background information (e.g. that samples from 2018 can be assumed negative for COVID-19); agents’ implicit goals (e.g. why experimenters wanted to ensure their samples were pure); and mentally simulated counterfactuals (e.g. what it would have meant had a solution turned a different colour). In many versions of the CI framework (e.g. Zwaan and Radvansky, 1998), the situation model focuses on spatial, temporal, causal and motivational relationships.

The CI model’s taxonomy was intended to describe how human readers represent information. However, it can also be viewed as defining a spectrum of comprehension-dependent tasks. Some tasks, such as determining a paper’s topic, can be performed using surface structure alone (e.g. by identifying keywords). Other tasks can be successfully performed only if the reader’s internal representation includes at least a textbase-level understanding of the document(s). Finally, the most sophisticated tasks, such as proposing alternative versions of an experiment, require a full situation model.

From this standpoint, the taxonomy applies just as well to NLP systems as to human readers (Sugawara et al., 2021). Accordingly, the next section considers what behaviours might be wanted from scientific NLP systems at each level of reading comprehension, and how current technology fares at each.

Examples of NLP tasks at various levels of representation are shown in Figure 1. Each task is discussed in more detail below, including how it applies to scientific texts and where the state-of-the-art stands.

A few points should be kept in mind throughout:

  • The levels of representation are best thought of not as three discrete levels but as a spectrum with three clusters. For instance, given the sentence “She didn’t leave out a single one”, it would be difficult to extract textbase-level propositions without background knowledge from the situation model about who “she” refers to, what she was doing and who or what she might have left out. Tasks with this property have been depicted at the textbase level but closer to the situation model boundary.4

  • These categorisations should be seen only as rough intuitions, particularly since tasks that appear to hinge on higher-level representations may prove to be solvable using shallower techniques. If researchers were trying to answer the question, “What biomarkers indicate adenomas?”, they might apply their situation model-level knowledge to home in on a paragraph about “biological markers” of “pituitary tumours”. They would then extract the list of biomarkers from the textbase-level assertions in that paragraph. NLP tools, however, might succeed using surface structure alone – e.g. by looking for sentences with words that often co-occur with “biomarkers” and “adenomas”.

  • The CI model assumes the reader is consuming a single passage or document. In contrast, many NLP applications entail consuming an entire corpus of scientific literature. For instance, clustering documents by topic only makes sense with multiple documents. For this essay, such tasks have been termed corpus-scale tasks. This term contrasts with instance-scale tasks that operate on single propositions, sentences or documents.5

  • This is far from a comprehensive list of relevant NLP applications. Still, it should give the reader an intuition for what can be expected from any NLP tool.

Surface-level NLP tasks mine associations between words, phrases and categories. Such tasks are far removed from anything normally considered “comprehension” (or perhaps even “reading”). They are most useful for helping a human reader locate and rapidly absorb information, particularly if researchers know exactly what terms or concepts to search for. The tools may also ease researchers’ exploration and discovery processes.6 Given that these systems leave most of the reading to humans, the occasional error matters little; the human can simply ignore irrelevant results or try a different search.

Two instance-scale tasks, named entity recognition and named entity identification, are described below.

Named entity recognition (NER) is a heavily studied NLP task, applicable to many domains, in which systems must flag “mentions” of predefined concept types. For example, a common version of the task is to scan passages for any phrase that refers to a person, a location or an organisation, and to classify each such phrase into one of these predefined categories. Intuitively, this task is about making associations between words or phrases and the categories.

Scientific text often requires specialised NER systems, both because the style differs from non-scientific text and because the categories are typically domain-specific. For instance, the CHEMDNER challenge (Krallinger et al., 2015) – the first community-wide effort to evaluate NLP methods for chemistry – included a task to automatically tag chemical names in scientific papers. Similarly, many biomedical NER systems look for phrases describing the population, intervention, comparison and outcome in papers about clinical trials (Kim et al., 2011).

With appropriate training data, scientific NER can perform quite well: recent systems achieve scores around 70-90%,7 depending on the dataset and evaluation metric (Beltagy et al., 2019).
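The intuition behind NER – associating phrases with categories – can be illustrated with a deliberately naive sketch. Everything below is invented for illustration: real systems, such as those evaluated in CHEMDNER, are trained statistical models, not lexicon lookups, which is precisely why they need training data.

```python
import re

# Hypothetical toy lexicon; a real NER system learns to recognise
# chemical names it has never seen, rather than matching a fixed list.
CHEMICAL_LEXICON = {"rabeprazole", "magnesium chloride", "mgcl2"}

def tag_chemicals(text):
    """Return (start, end, phrase) spans for lexicon matches."""
    spans = []
    for phrase in CHEMICAL_LEXICON:
        for match in re.finditer(re.escape(phrase), text, re.IGNORECASE):
            spans.append((match.start(), match.end(), match.group(0)))
    return sorted(spans)

sentence = "MgCl2 represses expression, unlike rabeprazole."
print(tag_chemicals(sentence))  # [(0, 5, 'MgCl2'), (35, 46, 'rabeprazole')]
```

A lookup-based tagger like this fails on unseen names and ambiguous abbreviations – the gap that trained models, with their 70-90% scores, are built to close.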

NER merely tags phrases such as “rabeprazole” with categories like “Intervention”. Named entity identification (NEI), also sometimes called entity linking or named entity normalisation, goes a step further: it associates each tagged phrase with an entry in a structured knowledge base. For example, “rabeprazole” might be recognised as a reference to the “Rabeprazole.01” entry in a drug database.

Identifying ambiguously named entities can be easier with information from the textbase level. Still, like NER, NEI is largely about associating words and phrases with concepts – a surface structure task. Scores tend to be lower and far more varied than for NER, roughly 50-85% (Arighi et al., 2017).
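The extra step NEI adds can be sketched as an alias table mapping surface forms to knowledge base identifiers. The aliases and identifiers below are invented; real entity linkers also draw on textbase-level context to disambiguate mentions that a lookup cannot resolve.

```python
# Hypothetical alias table; "Rabeprazole.01" echoes the database-entry
# style used in the example above but is not a real identifier.
ALIASES = {
    "rabeprazole": "Rabeprazole.01",
    "aciphex": "Rabeprazole.01",      # trade name, same entry
    "mgcl2": "MagnesiumChloride.02",
    "magnesium chloride": "MagnesiumChloride.02",
}

def link_entity(mention):
    """Map a tagged mention to a knowledge base identifier, or None."""
    return ALIASES.get(mention.strip().lower())

print(link_entity("Rabeprazole"))  # Rabeprazole.01
```

The hard cases – one name shared by several entries, or one entity with unlisted names – are what push NEI scores below NER’s.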

NER and NEI are most often used in support of some downstream task, such as populating a knowledge base or allowing a searcher to filter papers to those that discuss a specified compound. NER and NEI can also be used to augment the reading experience, e.g. by colour-coding or hyperlinking gene names.

At the corpus scale, core surface structure tasks include retrieving and ranking documents and clustering documents.

This classic task, often called information retrieval (IR), consists of returning documents that match a user query and ranking them by relevance. IR tools typically rely on some measure of alignment between the words or phrases in the query and those in a document. Examples include all competitors in the CORD-19 challenge (Roberts et al., 2021). In this challenge, systems received queries like “SARS-CoV-2 spike structure” and had to retrieve the most relevant research papers about COVID-19. Like more familiar search engines, scientific IR systems perform well, returning a satisfactory document in the top 5 results around 75-90% of the time.
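The word-alignment measure at the heart of classical IR can be sketched with TF-IDF weighting: documents score highly when they contain query terms that are rare across the corpus. The documents below are invented; production engines add length normalisation, phrase matching and learnt ranking on top of this idea.

```python
import math
from collections import Counter

docs = [
    "sars-cov-2 spike protein structure and binding",
    "influenza vaccine trial outcomes",
    "cryo-em map of the sars-cov-2 spike",
]

def tfidf_rank(query, docs):
    """Return document indices ranked by summed TF-IDF query-term weight."""
    tokenised = [d.split() for d in docs]
    n = len(docs)
    def idf(term):  # rarer terms across the corpus weigh more
        df = sum(term in doc for doc in tokenised)
        return math.log((n + 1) / (df + 1))
    scores = []
    for i, doc in enumerate(tokenised):
        tf = Counter(doc)
        scores.append((sum(tf[t] * idf(t) for t in query.split()), i))
    return [i for score, i in sorted(scores, reverse=True)]

print(tfidf_rank("sars-cov-2 spike structure", docs))  # [0, 2, 1]
```

Note that the ranking needs no comprehension at all: surface-level term statistics suffice, which is why IR is the most mature task discussed here.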

It can be helpful to automatically detect which documents are about similar topics, and perhaps even to organise topics into a hierarchy. A “topic” here is effectively a collection of closely related words or phrases. Recent systems that cluster scientific papers include the CORD-19 Topic Browser (MITRE, 2021) and COVID Explorer (Penn State Applied Research Laboratory, 2020). The clusters can be used either as a means of exploring the corpus or as an additional filter on search results. Though clustering is hard to evaluate objectively, it is a mature and well-studied task that generally produces respectable results.
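Since a “topic” here is effectively a collection of closely related words, clustering can be sketched as grouping documents by vocabulary overlap. The titles and threshold below are invented, and real topic models (e.g. probabilistic ones like LDA) learn soft word distributions rather than applying a hard Jaccard cut-off.

```python
def jaccard(a, b):
    """Word-set overlap between two documents, in [0, 1]."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.25):
    """Greedily assign each document to the first cluster it resembles."""
    clusters = []
    for doc in docs:
        for group in clusters:
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

docs = [
    "spike protein binding structure",
    "spike protein structure analysis",
    "school closures and transmission",
]
print(cluster(docs))  # two clusters: the spike papers, then the rest
```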

A surprisingly large fraction of scientific NLP has focused only on the surface structure level. Still, there is plenty of research on extracting and manipulating papers’ textbase-level propositions – tasks closer to conventional “reading”. In general, tools for these tasks are less reliable but still useful for investigating and generating hypotheses.

Instance-scale textbase tasks (operating on individual phrases, papers or texts) include knowledge base construction, question answering, evidence retrieval and multi-hop question answering.

Knowledge bases (KBs) – e.g. of chemicals or genes – are widely used in science and beyond. To automatically populate a KB from one or more documents, an NLP system must turn natural language assertions into formal, machine-friendly propositions.8 For instance, ChemDataExtractor (Swain and Cole, 2016) extracts many numerical properties of chemicals from the chemistry literature, generally with well over 95% accuracy. Kahun (2020) and COVID-KG (Wang et al., 2021) similarly extract relations such as “Condition-causes-symptom” and “Gene-chemical-interaction”, albeit somewhat less reliably.
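The conversion from sentences to propositions can be sketched with a single hand-written pattern that turns matching sentences into (entity, relation, value) triples. The pattern and relation names are invented for illustration; tools like ChemDataExtractor rely on far richer grammars and parsers.

```python
import re

# Illustrative template: "<name> melts/boils at <number> °C".
PATTERN = re.compile(r"(\w+) (?:melts|boils) at ([\d.]+)\s*°C")

def extract_triples(text):
    """Turn template matches into machine-friendly (entity, relation, value) triples."""
    triples = []
    for m in re.finditer(PATTERN, text):
        relation = "melting_point" if "melts" in m.group(0) else "boiling_point"
        triples.append((m.group(1), relation, float(m.group(2))))
    return triples

text = "Naphthalene melts at 80.26 °C, while toluene boils at 110.6 °C."
print(extract_triples(text))
```

The brittleness is visible immediately: any paraphrase (“has a melting point of”) escapes the template, which is why statistical extractors are needed for broad coverage.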

The resulting knowledge graphs can be used to look up characteristics of a gene, protein or chemical; to generate summary reports; to support question answering; or to enable further machine learning on the structured relationships (e.g. predicting new drugs’ possible side effects).

There is a large NLP literature on question answering (QA) (Zhang et al., 2019; Zhu et al., 2021). The task is usually defined as responding to a user’s question either with a yes/no answer or with a sentence or short phrase drawn from a body of text.9 All of these variants have also been attempted in a scientific context (Nentidis et al., 2020). Recent examples include AWS CORD-19 Search (Bhatia et al., 2020) and covidAsk (Lee et al., 2020).

On many general-purpose QA benchmarks, NLP systems match or surpass humans. It is tempting to infer that these systems must comprehend at least some propositional content. However, systems’ “comprehension” often proves brittle in the face of small changes to the question or passage (Jia and Liang, 2017) or shifts in the topics and content (Dunietz et al., 2020; Miller et al., 2020). The on-paper successes thus seem to stem largely from the benchmarks’ artificial easiness: models are rewarded for exploiting ungeneralisable quirks of the data (Kaushik and Lipton, 2018).

It should not be surprising, then, that systems for scientific QA score only ~25-65% for retrieving relevant snippets and ~30-50% for retrieving relevant answer phrases (Nentidis et al., 2020).

A similar task is to take a user-supplied assertion and look for snippets that support or refute it. This is often framed as fact-checking, particularly when systems are also asked to state whether the evidence supports or refutes the assertion. Recent systems score ~50-65% at extracting evidentiary sentences from abstracts (Wadden and Lo, 2021).

Many questions turn out to be answerable using little more than surface structure. In what is termed “multi-hop” QA, the questions are designed to rely on information from multiple pieces of text, theoretically requiring a higher level of comprehension and reasoning (Min et al., 2019). Several datasets have been constructed to test general-purpose multi-hop QA. Performance on these benchmarks is generally much lower than on regular QA. Among science-specific QA datasets, few (if any) target multi-hop reasoning, though at least a few of the datasets’ questions likely require such reasoning.

Perhaps the most exciting textbase-level task10 is one that only makes sense at corpus scale: predicting relationships between concepts based on large corpora. The study that launched this line of research identified materials likely to exhibit previously undiscovered thermoelectric properties (Tshitoyan et al., 2019). It used the materials science literature to train word vectors – representations that treat words as points (vectors) in a high-dimensional mathematical space of possible meanings. In such methods, word vectors are learnt from patterns of co-occurrence in the corpus. The distance11 between two vectors corresponds to the similarity between the words’ usage patterns, and hence presumably their meanings. The study authors identified chemical formulas whose vectors were close to that of “thermoelectric”, thereby proposing several new possible thermoelectric materials. The work has begun to inspire similar efforts in other fields such as molecular biology (Škrlj et al., 2021).
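The similarity step described above can be sketched with toy vectors: rank candidate materials by how close their vectors lie to the vector for “thermoelectric”. The 3-dimensional vectors below are invented; real word vectors have hundreds of dimensions and are learnt from corpus co-occurrence statistics, and cosine similarity is one common choice of closeness measure.

```python
import math

# Invented toy vectors; Bi2Te3 is a known thermoelectric material.
vectors = {
    "thermoelectric": [0.9, 0.1, 0.3],
    "Bi2Te3": [0.8, 0.2, 0.4],
    "NaCl": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

query = vectors["thermoelectric"]
ranked = sorted(
    (w for w in vectors if w != "thermoelectric"),
    key=lambda w: cosine(query, vectors[w]),
    reverse=True,
)
print(ranked)  # candidates most similar to "thermoelectric" first
```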

Word vector analogies do over-generate proposed relationships; researchers must comb through them to determine which hypotheses are worth testing. Still, the results can be valuable. The authors of the thermoelectrics study demonstrated this by evaluating their approach retrospectively: they truncated the corpus at, say, 2009 to see which materials would have been proposed. The materials identified this way were far more likely than randomly chosen ones to have been studied later as thermoelectrics.

What emerges at the textbase level, then, is a useful but not entirely reliable suite of NLP tools. Given a well-defined question or hypothesis, these tools can help home in on relevant paper snippets. NLP tools can also help generate hypotheses, provided the researcher poses the right question (e.g. “What materials might have undiscovered thermoelectric properties?”). Systems can also extract structured data, but the resulting answers, fragments of evidence, KB entries or relationships must generally be treated with caution. If errors would be harmful, users will want to double-check the results.

Far less work has been done at the situation model level. At the instance scale – that of individual papers or passages – an NLP/AI system might help with tasks such as summarising a single document; explaining why an observation reported in a paper might have occurred; and proposing variations on an experiment described in a paper (in approximate order of increasing sophistication). Corpus-scale tasks might include summarising multiple documents (i.e. digesting an entire body of literature and summarising key takeaways); identifying gaps in the literature; combining concepts to propose a novel hypothesis, explanation or method; and proposing completely novel experiments to address knowledge gaps. To a greater or lesser degree, all of these tasks rely on manipulating a detailed, integrated representation of extensive information extracted or inferred from the text or corpus.

Of these tasks, only single- and multi-document summarisation have received significant attention. General-purpose summarisation has been widely studied in NLP (e.g. Hou et al., 2021). Approaches include a variety of extractive methods (stitching together fragments from the source text) and abstractive methods (generating an original summary). Similar techniques have been applied to scientific paper summarisation (e.g. Altmami and Menai, 2020), with some modifications for the peculiarities of scientific papers.
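The extractive family of methods can be sketched in a few lines: score each sentence by the frequency of its words across the document and keep the top scorers. The sentences below are invented, and modern extractive systems add position features, redundancy control and neural sentence encoders on top of this frequency heuristic.

```python
from collections import Counter

def summarise(sentences, n=1):
    """Keep the n sentences whose words are most frequent document-wide."""
    words = [s.lower().split() for s in sentences]
    freq = Counter(w for sent in words for w in sent)
    def score(sent):  # average word frequency, so length does not dominate
        return sum(freq[w] for w in sent) / len(sent)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(words[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]  # keep original order

doc = [
    "The spike protein mediates cell entry.",
    "Mutations in the spike protein alter cell entry efficiency.",
    "Weather was mild during sample collection.",
]
print(summarise(doc))
```

Because an extractive summary only reuses source sentences, it cannot fabricate facts – the failure mode that, as noted below, plagues abstractive generation.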

Results have been mixed: evaluation scores vary wildly, and it is not even clear which metrics are meaningful (Kryściński et al., 2019). This is particularly true of abstractive methods, which, like many forms of natural language generation (Lin et al., 2021), struggle to ensure that output is factual (Maynez et al., 2020); they often fabricate information or misstate the facts.

Looking beyond summarisation, the level of comprehension and reasoning required for the other tasks above seems far out of reach. The fundamental problem is that current NLP techniques lack rich models of the world to which they can ground language (Bender and Koller, 2020; Bisk et al., 2020; Dunietz et al., 2020). They have no exposure to the entities, relationships, events, experiences and so forth that a text speaks about. As a result, even the most sophisticated models still often generate fabrications or outright nonsense.12 A few intriguing methodological proposals have been sketched (e.g. Tamari et al., 2020), but the research trajectory towards situation models will be long indeed.13

Despite the difficulties, research policies may be able to facilitate some progress towards machines that comprehend what they read – including scientific papers – at the situation-model level. Achieving that goal will likely require radical, interdisciplinary, blue-sky thinking. Yet NLP research is often driven by the pursuit of standardised metrics, by expectations of quick publications and by the allure of the low-hanging fruit from the past decade’s progress. This environment produces much high-quality work, but it offers limited incentives for the sort of high-risk, speculative ideation that breakthroughs may demand.

Policy makers could provide space and incentives for researchers to think more adventurously. To that end, three possible avenues are proposed below.

Research centres, funding streams and/or publication processes could be set up to reward methods that break with existing paradigms, even at the expense of publishing speed, performance metrics and immediate commercial applicability. Such programmes could even encourage ideas that remain half-baked or difficult to assess experimentally so long as they suggest novel, credible directions.

Policy makers could push NLP researchers to learn more from sociologists, philosophers and cognitive scientists, whose work on language likely holds untapped technical inspiration.14

Finally, policy makers can fund specific lines of under-studied research. This author is wary of some scholars’ prescription of reviving symbolic methods; formal symbols tend to carve up the world too rigidly. However, perhaps NLP architectures could be explicitly designed to fluidly form, revise and apply composable concepts. In any case, funding for techniques may prove less pivotal than funding for tasks. Situation models seem likeliest to emerge from collaborative tasks where systems must communicate with humans to perform tasks in a real or simulated physical environment (e.g. Abramson et al., 2022).

Today’s NLP provides many functionalities that can help scientists make good use of the literature. NLP can help winnow down the deluge of papers to ones relevant to a particular topic or question. It can also help researchers quickly find specific answers or pieces of evidence. It can even sometimes hypothesise previously undiscovered relationships, though humans must still do the hard work of asking the right questions and verifying systems’ answers. Tools for these use cases will continue to improve, including on dimensions beyond accuracy (e.g. “few-shot training” may reduce the need for training data).

Where NLP falls short is on tasks that require deeper forms of comprehension. Distilling conclusions out of the literature remains the province of humans for the foreseeable future. This is even more true for generating creative insights about studies, though funding policies could begin to move the needle.

Of course, the history of AI contains many instances of apparently demanding tasks falling to surprisingly shallow techniques. Much NLP research amounts to finding ways of “hacking” tasks higher on the spectrum of sophistication using minimal information from the surface structure or textbase. It remains to be seen how much of situation model-level comprehension will prove not so difficult after all.


Abramson, J. et al. (2022), “Creating multimodal interactive agents with imitation and self-supervised learning”, arXiv, arXiv:2112.03763 [cs],

Aguera y Arcas, B. (2021), “Do large language models understand us?”, 16 December, Medium,

Altmami, N.I. and M.E.B. Menai (2020), “Automatic summarization of scientific articles: A survey”, Journal of King Saud University – Computer and Information Sciences, Vol. 34/4, pp. 1011-1028,

Arighi, C. et al. (2017), “Bio-ID track overview”, in Proceedings of BioCreative VI Workshop, BioCreative, Bethesda,

Beltagy, I. et al. (2019), “SciBERT: A pretrained language model for scientific text”, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong,

Bender, E.M. and A. Koller (2020), “Climbing towards NLU: On meaning, form, and understanding in the age of data”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, on line,

Bhatia, P. et al. (2020), “AWS CORD-19 search: A neural search engine for COVID-19 literature”, arXiv, arXiv:2007.09186,

Bisk, Y. et al. (2020), “Experience grounds language”, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, on line,

Dunietz, J. et al. (2020), “To test machine comprehension, start by defining comprehension”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, on line,

Gärdenfors, P. (2000), Conceptual Spaces: The Geometry of Thought, The MIT Press, A Bradford Book, Cambridge, MA.

Hou, S.-L. et al. (2021), “A survey of text summarization approaches based on deep learning”, Journal of Computer Science and Technology, Vol. 36/3, pp. 633-663,

Jia, R. and P. Liang (2017), “Adversarial examples for evaluating reading comprehension systems”, in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Copenhagen,

Johnson-Laird, P.N. (1980), “Mental models in cognitive science”, Cognitive Science, Vol. 4/1, pp. 71-115,

Kahneman, D. (2011), Thinking, Fast and Slow, Farrar, Straus and Giroux, New York.

Kahun (2020), “Coronavirus Clinical Knowledge Search”, webpage, (accessed 28 October 2021).

Kaushik, D. and Z.C. Lipton (2018), “How much reading does reading comprehension require? A critical investigation of popular benchmarks”, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Brussels,

Kim, S.N. et al. (2011), “Automatic classification of sentences to support evidence based medicine”, BMC Bioinformatics, Vol. 12/2, pp. 1-10,

Kintsch, W. (1988), “The role of knowledge in discourse comprehension: A construction-integration model”, Psychological Review, Vol. 95/2, pp. 163-182,

Krallinger, M. et al. (2015), “CHEMDNER: The drugs and chemical names extraction challenge”, Journal of Cheminformatics, Vol. 7/Suppl 1, p. S1,

Kryściński, W. et al. (2019), “Neural text summarization: A critical evaluation”, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong,

Langacker, R.W. (1987), Foundations of Cognitive Grammar: Theoretical Prerequisites, Stanford University Press, Stanford.

Lee, J. et al. (2020), “Answering questions on COVID-19 in real-time”, in Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Association for Computational Linguistics, on line,

Lin, S. et al. (2021), “TruthfulQA: Measuring how models mimic human falsehoods”, arXiv, arXiv:2109.07958 [cs],

Maynez, J. et al. (2020), “On faithfulness and factuality in abstractive summarization”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, on line,

McNamara, D.S. and J. Magliano (2009), “Toward a comprehensive model of comprehension”, in Psychology of Learning and Motivation – Advances in Research and Theory, Academic Press, Cambridge, MA,

Michael, J. (23 July 2020), “To dissect an octopus: Making sense of the form/meaning debate”, Julian Michael’s blog,

Miller, J. et al. (2020), “The effect of natural distribution shift on question answering models”, in Proceedings of the 37th International Conference on Machine Learning, PMLR, on line,

Miller, T. (2021), “Contrastive explanation: A structural-model approach”, The Knowledge Engineering Review, Vol. 36,

Min, S. et al. (2019), “Compositional questions do not necessitate multi-hop reasoning”, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence,

MITRE (2021), “MITRE CORD-19 Topic Browser”, webpage, (accessed 28 October 2021).

Nentidis, A. et al. (2020), “Overview of BioASQ 2020: The Eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering”, in Arampatzis, A. et al. (eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing,

Nye, M. et al. (2021), “Improving coherence and consistency in neural sequence models with dual-system, neuro-symbolic reasoning”, arXiv, abs/2107.02794,

Penn State Applied Research Laboratory (2020), COVID Explorer (database), (accessed 28 October 2021).

Roberts, K. et al. (2021), “Searching for scientific evidence in a pandemic: An overview of TREC-COVID”, Journal of Biomedical Informatics, Vol. 121, p. 103865,

Schaffer, J. (2005), “Contrastive causation”, The Philosophical Review, Vol. 114/3, pp. 327-358,

Škrlj, B. et al. (2021), “PubMed-scale chemical concept embeddings reconstruct physical protein interaction networks”, Frontiers in Research Metrics and Analytics, Vol. 6, 13 April,

Sugawara, S. et al. (2021), “Benchmarking machine reading comprehension: A psychological perspective”, in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, on line,

Swain, M.C. and J.M. Cole (2016), “ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature”, Journal of Chemical Information and Modeling, Vol. 56/10, pp. 1894-1904,

Tamari, R. et al. (2020), “Language (Re)modelling: Towards embodied language understanding”, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, on line,

Tshitoyan, V. et al. (2019), “Unsupervised word embeddings capture latent knowledge from materials science literature”, Nature, Vol. 571/7763, pp. 95-98,

Wadden, D. and K. Lo (2021), “Overview and Insights from the SCIVER shared task on scientific claim verification”, in Proceedings of the Second Workshop on Scholarly Document Processing, Association for Computational Linguistics, on line,

Wang, Q. et al. (2021), “COVID-19 literature knowledge graph construction and drug repurposing report generation”, in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Association for Computational Linguistics, on line,

Zhang, X. et al. (2019), “Machine reading comprehension: A literature review”, arXiv, arXiv:1907.01686 [cs.CL],

Zhu, F. et al. (2021), “Retrieving and reading: A comprehensive survey on open-domain question answering”, arXiv, arXiv:2101.00774 [cs.AI],

Zwaan, R.A. and G.A. Radvansky (1998), “Situation models in language comprehension and memory”, Psychological Bulletin, Vol. 123/2, pp. 162-185,


1. This article represents the views of the author, and was written before his fellowship with AAAS began. It does not necessarily represent the views of AAAS or the US government.

2. Of course, the words expressing each of these assertions are included in the surface structure representation. What distinguishes the textbase representation is that these assertions have been abstracted into a more conceptual form in the reader’s mind, e.g. modelled-as (response-curve-1, Sigmoid). The textbase is usually assumed to consist of formal logical propositions, although other processing-friendly abstractions may sometimes be more suitable.

3. Situation models have also sometimes been referred to as “mental models” (Johnson-Laird, 1980).

4. The situation-model level is especially spectrum-like: even for a human, a situation model may be more or less complete depending on how many inferences and pieces of background knowledge the reader manages to incorporate.

5. Clearly, one could repeat an instance-scale application for every instance in a larger corpus. For example, rather than extracting a few facts from a document into a knowledge base, one could attempt to extract as many facts as possible from an entire corpus of documents. This corpus-scale version of the task might even be approached differently from its instance-scale cousin. For example, computational constraints might rule out processing each instance separately, or perhaps the resultant knowledge base could be improved by considering mutually contradictory or reinforcing pieces of evidence across the corpus.

Nonetheless, these applications were listed as instance-scale because one could perform them instance by instance, at least in principle. The “corpus-scale” designation has been reserved for tasks that do not even make sense without a larger corpus.

6. Literature-based discovery (LBD) tasks, though omitted here for space, would generally fall into the surface structure category as well: LBD typically tries to connect two sets of terms, A and C, by finding some set B of words or phrases that are connected to both A and C.
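As an illustrative sketch only (the essay includes no code), this A-B-C pattern can be mimicked with simple set intersection. The link table below is hypothetical co-occurrence data, loosely echoing Swanson’s well-known fish oil/Raynaud’s example:

```python
def bridge_terms(links, a_terms, c_terms):
    """Candidate B terms: linked to at least one A term and one C term."""
    linked_to_a = {b for a in a_terms for b in links.get(a, set())}
    linked_to_c = {b for c in c_terms for b in links.get(c, set())}
    return linked_to_a & linked_to_c

# Hypothetical link table, e.g. derived from term co-occurrence counts:
links = {
    "fish oil": {"blood viscosity", "platelet aggregation"},
    "Raynaud's disease": {"blood viscosity", "vasoconstriction"},
}
print(bridge_terms(links, ["fish oil"], ["Raynaud's disease"]))
# {'blood viscosity'}
```

Real LBD systems add ranking, filtering and domain knowledge, but the surface-level character of the task is already visible here: only term linkage, not meaning, is consulted.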

7. The metrics reported in this essay vary by task but typically balance precision (what fraction of instances output by the system are correct) against recall (what fraction of correct instances are output by the system). The usual combination, the harmonic mean of the two, is known as an “F1 score” or “F-measure”; it ensures that to score highly, systems must simultaneously notice relevant phrases or answers and ignore irrelevant ones.
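As a minimal sketch (not part of the essay), the F1 score can be computed from counts of true positives, false positives and false negatives:

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)  # correct fraction of system outputs
    recall = true_pos / (true_pos + false_neg)     # found fraction of correct instances
    return 2 * precision * recall / (precision + recall)

# A system that outputs 8 correct and 2 spurious phrases, while missing
# 2 relevant ones, has precision 0.8 and recall 0.8:
print(round(f1_score(8, 2, 2), 3))  # 0.8
```

Because the harmonic mean is dragged down by whichever of precision or recall is lower, a system cannot score well by, say, outputting everything (high recall, low precision) or almost nothing (high precision, low recall).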

8. Strictly speaking, some knowledge base construction can be done purely at the surface level: a system needs nothing more than entity recognition and/or identification to extract co-occurrence relationships as an indication of unspecified “relatedness”.
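A sketch of this surface-level shortcut, using made-up per-document entity lists of the kind a named-entity recognizer might produce:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs_entities):
    """Count how often each pair of recognized entities shares a document."""
    counts = Counter()
    for entities in docs_entities:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

# Hypothetical recognizer output, one entity list per document:
docs = [
    ["aspirin", "COX-1", "inflammation"],
    ["aspirin", "COX-1"],
    ["COX-1", "inflammation"],
]
counts = cooccurrence_counts(docs)
print(counts[("COX-1", "aspirin")])  # 2 -- evidence of unspecified "relatedness"
```

Note that nothing here says *how* the two entities are related; that is precisely why such extraction remains at the surface-structure level.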

9. There are many additional variants on question answering tasks, including answering multiple-choice questions about a passage; declining to answer when the answer is not present; generating full-sentence answers (as opposed to retrieving decontextualized phrases); and answering multiple questions in context as part of a multi-turn dialogue. In a scientific context, these more exotic variations seem unlikely to be substantially more useful than vanilla question answering. They have therefore been elided in the main text. With the exception of multiple-choice questions, which are somewhat artificially easy (Dunietz et al., 2020), these variants generally see worse system performance than more conventional QA tasks.

10. It is debatable whether word vectors are really operating at the textbase level, given that they are trained only on associations between words and sequences thereof. However, they do seem to capture information that is conventionally thought of as propositional – e.g. relationships between concepts – albeit only at the corpus scale, not at the scale of individual training sentences.

11. More precisely, the relevant measure is the cosine distance, which depends on the angle between the vectors. Directions in the vector space are generally taken to correspond to concepts or elements of meaning (e.g. gender). The vectors for two words such as “large” and “enormous” might have identical orientations but different magnitudes.
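A sketch with made-up two-dimensional vectors (real word vectors have hundreds of dimensions) shows why cosine distance ignores magnitude:

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v): depends only on orientation, not length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

large = [1.0, 2.0]
enormous = [2.0, 4.0]  # same direction, twice the magnitude
print(round(cosine_distance(large, enormous), 6))  # 0.0: identical orientations
print(round(cosine_distance([1.0, 0.0], [0.0, 1.0]), 6))  # 1.0: orthogonal
```

This is why “large” and “enormous” can count as near-synonyms under cosine distance even if one vector is much longer than the other.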

12. Even researchers who argue that modern large language models do “understand us” (e.g. Aguera y Arcas, 2021) typically acknowledge that these models confabulate at best and give “off-target, nonsensical or nonsequitur [responses]” at worst.

The shortfalls are most obvious when the task involves generating text. However, even non-generative tasks – e.g. multiple-choice question answering – are typically approached using the same underlying language models. Even without the opportunity to fabricate, the lack of deep comprehension becomes painfully clear with sufficiently rigorous evaluation procedures (Dunietz et al., 2020). For any given reading comprehension system, most researchers with experience in NLP would have little trouble finding questions that trip up the system even though the answers are obvious to humans.

See also Bender and Koller (2020), Bisk et al. (2020) and Michael (23 July 2020) for discussions of whether reading comprehension systems trained purely on text could learn to construct and manipulate situation models even in principle.

13. Reading at this level might even be “AGI-complete”. In other words, achieving it might be tantamount to solving every problem in artificial general intelligence (“strong” or fully human-like AI), from planning to commonsense reasoning, social interaction and perhaps even perception and object manipulation.

14. Examples of less technical work that could suggest NLP and AI approaches include prototypes and radial categories from cognitive linguistics (e.g. Langacker, 1987); contrastive accounts of causal language from philosophy (e.g. Schaffer, 2005; Miller, 2021); conceptual spaces from cognitive science (Gärdenfors, 2000); and the System 1/System 2 distinction from psychology (Kahneman, 2011; e.g. Nye et al., 2021).

Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

© OECD 2023

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at