Using machine learning to verify scientific claims

L.L. Wang
University of Washington
United States

The verification of scientific claims – also known as scientific fact-checking – is an important application area for machine learning (ML) and natural language processing (NLP). This essay explores the current state and limitations of ML systems for scientific claim verification. It begins with some background and motivation for the task, followed by an overview of technological progress and future directions.

There is a sense of renewed urgency around automated methods for claim verification. This has been driven by the abundance of misinformation spread on line during the COVID-19 pandemic, as well as in relation to sensitive topics such as climate change. Indeed, during the COVID-19 pandemic, there were reports of conflicting findings in the literature and early preprints with results that were subsequently disproved. In addition, high-profile retractions found avid uptake by news and social media organisations (Abritis, Marcus and Oransky, 2020).

To combat misinformation, platforms like Twitter, Facebook and others engage in both manual and automated fact-checking. These companies may employ teams of fact-checkers to search for and validate uncertain claims. At the same time, they deploy ML models to identify check-worthy claims, retrieve relevant evidence or predict factuality to different degrees of accuracy and success (Guo, Schlichtkrull and Vlachos, 2022).

Manual fact-checking is laborious and resource-intensive, and it is difficult to scale to the growing volume of content on social media. Science faces similarly rapid growth in output. Millions of papers are written annually, and hundreds are published each day in notable and sometimes contentious areas such as COVID-19 and climate change.

Additionally, scientific claims pose a unique set of challenges for fact-checking. This is due to the abundance of specialised terminology, the need for domain-specific knowledge and the inherent uncertainty of scientific findings. In other words, results can go through a long process of theory, experimentation, validation and replication before being accepted as scientific canon. Indeed, some claims may remain contentious due to small effect sizes and difficulties obtaining measurements from humans at a population level. Nonetheless, automated ML-based methods for claim verification are desirable and valuable. They are needed to assist human fact-checkers, reduce their workload and improve the coverage of fact-checking systems.

Automated scientific claim verification has made significant advances in recent years due to progress in NLP methods. This includes the introduction of pretrained language models, and task-specific gains through the release of new datasets, models and applications to support the study of scientific claim verification. Though results are promising, several key challenges remain:

  1. Scientific discourse does not lend itself easily to claim verification.

  2. Claim verification methods suitable for the news or political domains may not be appropriate in the scientific domain.

  3. Research systems for scientific claim verification do not yet tackle a realistic version of the problem.

  4. The social implications of automated claim verification for science are unclear (i.e. what are the desired results from applying fact-checking methods to scientific discourse in social media and elsewhere?).

Few works on automated scientific claim verification engage deeply with the social issues or consequences of such automation. Are these models meant to assist or to replace manual fact-checking? Or are they built to increase scientific literacy and the ability of lay people to engage in scientific discourse? One would expect the outputs of models serving these two goals to be quite different.

Similarly, many questions remain around how to integrate the outputs of claim verification models with the decisions of human fact-checkers. The focus on modelling progress is justifiable given the current technological state and limitations of automated scientific claim verification systems. However, as model performance improves and prototype systems are deployed, the possible social implications of these developments must be addressed.

The task of scientific claim verification begins with a claim. The claim should be a statement about some entity or process, and it must be verifiable – it should not be a statement of opinion. Additionally, some definitions require the claim to be atomic (about a single aspect of the entity or process), decontextualised (able to be understood on its own without additional context) and check-worthy (to confirm its veracity for a target audience) (e.g. Wadden et al., 2020).

Given a valid claim, the goal is to predict its veracity. Is the claim supported or refuted by the evidence? Is there insufficient information to make a prediction? The model’s prediction is known as the “veracity label”.

In many cases, the task also requires identifying evidence from trusted sources to support the veracity label. Documents providing evidence towards or countering the prediction are referred to as “evidence documents”. Specific spans of text from the evidence documents that support or refute the claim can be provided optionally as “rationales” towards the decision. Figure 1 shows how these components relate to one another with an example claim and its associated evidence that is identified from one of a set of scientific documents.
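The components described above can be written down as a small data structure. The following is a minimal sketch, not the schema of any particular dataset: the class names, field names and the example claim are all invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class VeracityLabel(Enum):
    """The three outcomes described in the text."""
    SUPPORTED = "supported"
    REFUTED = "refuted"
    NOT_ENOUGH_INFO = "not_enough_info"


@dataclass
class EvidenceDocument:
    """One evidence document, split into sentences (hypothetical layout)."""
    doc_id: str
    sentences: List[str]
    label: VeracityLabel
    # Indices of the sentences that serve as rationales for the label.
    rationale_indices: List[int] = field(default_factory=list)

    def rationales(self) -> List[str]:
        """Return the rationale sentences as text."""
        return [self.sentences[i] for i in self.rationale_indices]


@dataclass
class VerifiedClaim:
    """A claim together with the evidence documents found for it."""
    claim: str
    evidence: List[EvidenceDocument] = field(default_factory=list)


# Invented example; it does not come from any real dataset.
example = VerifiedClaim(
    claim="Drug X lowers blood pressure in adults.",
    evidence=[
        EvidenceDocument(
            doc_id="doc-001",
            sentences=[
                "We enrolled 200 adults in a randomised trial.",
                "Participants taking Drug X had lower blood pressure than controls.",
            ],
            label=VeracityLabel.SUPPORTED,
            rationale_indices=[1],
        )
    ],
)
```

Attaching the label to each evidence document, rather than to the claim as a whole, leaves room for different documents to disagree about the same claim.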

Pretrained contextual language models lie at the foundation of many state-of-the-art systems for natural language understanding; claim verification as a task is no different. These language models are pretrained on a large amount of unlabelled text in a self-supervised manner, allowing the model representations to capture the meaning and relationships between words. The models are then adapted to various downstream tasks, usually by fine-tuning on a small, labelled dataset specific to the task. Pretrained language models can and have been adapted to perform fact-verification in this manner. For example, they have been fine-tuned on datasets such as FEVER (Thorne et al., 2018) to produce a general domain fact-checker. They have also been fine-tuned on SciFact (Wadden et al., 2020) to produce a fact-checking model adapted for scientific claims. In addition to using textual evidence, some fact-checking models also investigate how source metadata and other information can be used to improve veracity predictions.

Claim verification in science faces several unique challenges. Scientific text contains an abundance of specialised terminology, which can be challenging for language models if the terms are rarely observed in pretraining data. Readers are also assumed to have the background knowledge needed to understand typical sentences in the scientific literature, such as familiarity with anatomy, physiology, the functional pathways of various tissues and common acronyms. Finally, scientific claims are not always clearly true or false. Science, as a process, is designed to help us arrive at increasing certainty through iteration on hypotheses and controlled experiments.

During this process, each result may only provide limited evidence towards a claim. Contradictory evidence is prevalent, as observed in DeYoung et al. (2021). Consequently, positing this task as claim verification rather than fact-checking casts the goal as identifying evidence to both support and refute the claim. In other words, it is not about making a summative judgement on the truth or falsehood of a particular claim. Given the uncertainties of many scientific outcomes, this is a pragmatic choice. It allows outputs of these models to be less brittle and more suitable for consumption by human fact-checkers and downstream users.
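One way to picture this framing is an output that keeps supporting and refuting evidence side by side instead of collapsing them into a single verdict. The sketch below assumes per-document stance predictions are already available; the function name and the stance strings are hypothetical.

```python
from typing import Dict, List, Tuple


def group_evidence_by_stance(
    predictions: List[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Group document ids by predicted stance towards a claim.

    `predictions` holds (doc_id, stance) pairs. Documents whose stance is
    neither "supports" nor "refutes" are left out of the result. No overall
    true/false verdict is computed.
    """
    grouped: Dict[str, List[str]] = {"supports": [], "refutes": []}
    for doc_id, stance in predictions:
        if stance in grouped:
            grouped[stance].append(doc_id)
    return grouped


# Invented predictions for a single claim.
stances = group_evidence_by_stance([
    ("doc-a", "supports"),
    ("doc-b", "refutes"),
    ("doc-c", "not_enough_info"),
    ("doc-d", "supports"),
])
```

A downstream user then sees both sides of the evidence and can weigh them, which is closer to how contradictory findings are handled in practice.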

Automated tools can help researchers and the public evaluate the veracity of scientific claims. In recent years, automated fact-verification has advanced significantly in the domains of news, politics and social media. A number of datasets and shared tasks (FEVER; CheckThat!) have been created to support research in these areas. Several shared tasks address fact-verification in science, such as the TREC Health Misinformation Track and the SciVer scientific claim verification shared task. These have helped move the state of the field forward.

Scientific claim verification has received more attention in the last couple of years due to misinformation and disinformation related to COVID-19. However, collecting labelled data at scale for training ML models remains a challenge. Datasets like FEVER use large bodies of crowdsourced factual knowledge – articles on Wikipedia – to produce training data at scale. FEVER consists of many hundreds of thousands of instances, describing claims about a similar order of magnitude of entities. The largest scientific fact-checking datasets released to date are on the order of thousands or tens of thousands of claims and paired evidence documents (see Table 1 for a comparison).

Datasets are more difficult to construct in the scientific domain, requiring domain expertise to identify or write claims, and classify evidence. By way of illustration, two claim verification datasets in the scientific and health domains and their construction procedure are described here. Others are referenced in Table 1, where readers can also find references to more detailed information.

In SciFact (Wadden et al., 2020), citation sentences from biomedical papers were rewritten into claims by a group of trained expert annotators. These claims were verified against the cited evidence articles by a different group of annotators. Refuted or negative claims were created by manually negating some of the written claims. The dataset consists of 1 409 claims verified against over 5 000 scientific paper abstracts, along with rationale sentences identified from evidence documents.

COVID-Fact (Saakyan, Chakrabarty and Muresan, 2021) derives claims and evidence from the r/COVID19 subreddit. Claims are verified against the text of linked scientific papers and against documents retrieved through Google Search. Claim negations are created automatically by detecting and replacing salient entity spans in the original claim. COVID-Fact contains naturally occurring claims written by their original authors, which are often complex, describing more than one facet of an entity or process. The dataset consists of 4 086 claims and their associated evidence documents on the subject of COVID-19.
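The idea of producing a negated claim by replacing a salient span can be illustrated with a toy example. This is not the COVID-Fact procedure: here a small hand-written table of opposite terms (an assumption made purely for illustration) stands in for automatic span detection and replacement, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical substitution table; a real system would detect salient
# spans and choose replacements automatically.
REPLACEMENTS: Dict[str, str] = {
    "increases": "decreases",
    "decreases": "increases",
    "effective": "ineffective",
    "higher": "lower",
    "lower": "higher",
}


def negate_claim(
    claim: str, replacements: Dict[str, str] = REPLACEMENTS
) -> Optional[str]:
    """Return a counter-claim by swapping the first matching cue word.

    Table entries are tried in insertion order; the first entry found in
    the claim is replaced once. Returns None when no entry matches, so
    callers can skip claims that cannot be negated this way.
    """
    for source, target in replacements.items():
        pattern = re.compile(rf"\b{re.escape(source)}\b", flags=re.IGNORECASE)
        if pattern.search(claim):
            return pattern.sub(target, claim, count=1)
    return None


# Invented claim, for illustration only.
negated = negate_claim("Vitamin Z increases recovery time.")
```

Automatically generated negations still need checking: a swapped span can yield a sentence that is ungrammatical or that is not actually contradicted by the original evidence.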

Unlike FEVER, where crowdsourced annotations were used to construct the claim and evidence dataset, SciFact, COVID-Fact and other scientific claim verification datasets required expert annotators. Expert annotation has been employed for components of dataset construction such as claim extraction, claim rewriting, claim negation, evidence classification, rationale extraction, explanation writing and/or veracity labelling.

In some cases (Saakyan, Chakrabarty and Muresan, 2021), little to no expert rewriting of claims takes place. However, the natural claims in these cases tend to be complex. It can be difficult to evaluate model performance (e.g. how to award credit if evidence provides support for only part of the claim).

Manually writing claims and claim negations is a laborious process that can introduce biases into the data. An emerging trend in dataset construction is exploring techniques for automatically deriving claims and evidence from documents for training without labelled data. One example is the automatic production of claim negations (Wright and Augenstein, 2020; Saakyan, Chakrabarty and Muresan, 2021), which are needed to train fact-verification models. Table 1 compares datasets for scientific claim verification using FEVER as a reference for general domain fact-checking.

System performance is improving rapidly. However, more real-world case studies are needed to understand the error tolerance of fact-checkers and downstream users. Wadden et al. (2020) conducted a COVID-19 case study with a baseline system trained on SciFact. They found their system produced plausible outputs for around two-thirds of input claims. In this case, plausibility is defined as more than 50% of retrieved evidence and classifications being judged correct by an expert with medical training. Since then, model performance on scientific claim verification has improved considerably. More work is needed to understand how improvements in ML model performance map to system and user gains in real-world settings, especially when considering potential performance degradation on emerging and unseen scientific topics.
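The plausibility criterion used in that case study is simple arithmetic and can be sketched as follows. The function names and the toy judgements are invented, and treating a claim with no judged outputs as not plausible is an assumption of this sketch rather than something stated in the study.

```python
from typing import Dict, List


def is_plausible(judgements: List[bool]) -> bool:
    """True when more than 50% of the outputs for one claim were judged correct.

    Each entry is an expert's verdict on one retrieved evidence item and
    its classification. An empty list counts as not plausible (assumption).
    """
    if not judgements:
        return False
    return sum(judgements) / len(judgements) > 0.5


def plausible_fraction(per_claim: Dict[str, List[bool]]) -> float:
    """Share of claims whose outputs pass the plausibility threshold."""
    if not per_claim:
        return 0.0
    passed = sum(is_plausible(j) for j in per_claim.values())
    return passed / len(per_claim)


# Invented expert judgements for three claims.
toy_judgements = {
    "claim-1": [True, True, False],  # 2 of 3 correct: plausible
    "claim-2": [True, False],        # exactly 50%: not plausible
    "claim-3": [True],               # 1 of 1 correct: plausible
}
fraction = plausible_fraction(toy_judgements)
```

Note that the threshold is strict, so a claim with exactly half of its outputs judged correct does not count as plausible.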

This section outlines some possible future directions for automated scientific claim verification. The first four directions – bootstrapping training data, integrating additional sources of information, generalisation and robustness, and open-domain fact-checking – propose improvements in the scope and performance of models. They can be thought of as extensions to current tasks and systems. The latter two directions – user-centric fact-checking and characterising social implications – aim to understand how ML technologies are applied or ought to be applied in this domain in practice.

Many automated science claim verification tools and prototypes have user interfaces that present veracity labels for each claim-evidence pair. This may not be the optimal interface for browsing and understanding evidence. As scientific knowledge is always evolving, the best ways to communicate the uncertainty and contradictions of scientific claims and evidence must be studied.

Due to the difficulty and expense of creating training data to verify scientific claims, methods that leverage distant supervision or that generalise well in the few- or zero-shot settings are desirable. Recent work introduces methods for learning general domain fact-checking without any labelled data (Pan et al., 2021). Variants of this method have been adapted to the scientific domain with good results (Wright and Augenstein, 2020). Optimistically, recent findings (Wadden et al., 2022) also demonstrate that training on weakly labelled data may be sufficient for domain transfer. This suggests that model generalisation could be achieved with fewer instances of expensive labelled data.

Much of the discussion thus far focuses on scientific claim verification as a pure language modelling task, though this is only part of the picture. Metadata about the source of information – such as the authors, institutions, funding sources and the source’s historical trustworthiness – could be useful indicators of veracity. Additionally, the text of evidence articles is not the only viable source of evidence for predicting veracity. Other structured and semi-structured resources such as curated knowledge bases (see the essay on knowledge bases in this volume by Ken Forbes), patient data or experimental data could also be used as sources of evidence. The integration of these external sources of information into veracity prediction is an important direction for future work.

Scientific claim verification datasets are limited to a few select domains, most notably biomedicine, public health and climate change. This is due in part to the prevalence and high negative costs of misinformation in these domains. However, the tide of public interest can shift unpredictably; scientific findings will be called into question whenever they interface with policy and the public.

Therefore, scientific claim verification tools need to perform well and generalise beyond select domains. There are at least two directions to explore: understanding the fact verification needs of users in underexplored scientific domains; and developing evaluation benchmarks to assess the performance and suitability of claim verification models in these other domains.

Model robustness is a related direction. For example, Kim et al. (2021) showed that performance of fact verification models degrades when given colloquial claims as inputs. Methods to improve scientific claim verification model robustness are important avenues of future study.

Another direction for scientific fact verification is the exploration of open- vs. closed-domain retrieval. Closed-domain retrieval predefines a set of documents that may provide evidence, e.g. a set of 10 000 trusted scientific articles. In open-domain retrieval, the space of potential evidence documents is significantly larger. It may be defined, for example, as all peer-reviewed scientific documents or all indexed websites on the Internet. The more realistic setting of open-domain retrieval is significantly more challenging. The scope of retrieval is orders of magnitude larger, requiring improvements in retrieval efficiency and a different model training regimen. However, this setting also better approximates real-world claim verification, where fact-checkers do not presuppose a limited set of sources for evidence.
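The closed-domain setting can be made concrete with a small retrieval sketch: every document in a fixed, predefined collection is scored against the claim. TF-IDF with cosine similarity is used here only as a familiar stand-in for whatever retriever a real system would use; the function names and the three-document corpus are invented.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus: Dict[str, str]):
    """Compute smoothed IDF weights and TF-IDF vectors for a fixed corpus."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    doc_freq: Counter = Counter()
    for doc_counts in counts.values():
        doc_freq.update(doc_counts.keys())
    n_docs = len(corpus)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in doc_counts.items()}
        for doc_id, doc_counts in counts.items()
    }
    return idf, vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(claim: str, idf, vectors, k: int = 2) -> List[Tuple[str, float]]:
    """Return the k highest-scoring documents from the closed collection."""
    query = {t: tf * idf[t] for t, tf in Counter(tokenize(claim)).items() if t in idf}
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented three-document "trusted" collection.
corpus = {
    "a": "masks reduce viral transmission in hospitals",
    "b": "sea level rise accelerates with warming",
    "c": "vaccines reduce hospital admissions",
}
idf, vectors = build_index(corpus)
top = retrieve("do masks reduce transmission", idf, vectors)
```

Scoring every document is affordable for a few thousand abstracts. In the open-domain setting the same exhaustive loop becomes impractical, which is one reason retrieval efficiency becomes a concern there.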

Real-world claim verification must account for the beliefs and needs of users. Individuals may hold varying beliefs about the same claim – from strongly supportive to uncertain and in search of evidence. Knowledge of such stances may be important for selecting how best to communicate model outputs to these users. Nguyen et al. (2018), in their work on human-AI collaborative fact-checking, found that humans tended to trust model predictions even when they were incorrect. They concluded that some communication around model internals is needed to produce better outcomes.

Another aspect of modelling users involves understanding their role and intent. Are they a fact-checker, a journalist, a health-care consumer or some combination of many roles? The model must adjust its outputs depending on the user’s role and intended actions. For example, if the goal of verifying claims is to convince rather than inform, it may be important to expose both evidence documents and the rationales within those documents to support or justify the veracity label.

Finally, as noted earlier, there has been limited engagement with the social implications of ML models for scientific fact verification. Work has focused on technological challenges such as improving the performance of models in increasingly real-world settings. At the same time, social science researchers have documented confirmation bias – the tendency of individuals to seek out or focus on information that confirms their existing beliefs (Bronstein et al., 2019; Park et al., 2021). They have also observed how fact-checking can induce questioning of scientific findings that undermines trust in the scientific process (Roozenbeek et al., 2020). These types of cognitive biases and responses can lead to counterproductive results in the application of automated claim verification. These social phenomena require consideration and should help guide the development and evaluation of machine-assisted claim verification systems in the wild.

Significant progress has been made in defining and building automated systems for scientific claim verification. Advancements in data, modelling, analysis and evaluation continue at a rapid pace. However, the community must address how claim verification models should present uncertainty and assess the soundness of a claim in the face of contradictory evidence. Work also remains in assessing whether state-of-the-art systems are ready for wide-scale deployment. To progress on these goals, more emphasis is needed on the potential social implications and ramifications of automated science claim verification systems. Through such improvements, these technologies could help improve understanding of the consistency and replicability of science, and transform people’s trust in and understanding of emerging scientific topics.


Abritis, A., A. Marcus and I. Oransky (2020), “An ‘alarming’ and ‘exceptionally high’ rate of COVID-19 retractions?”, Accountability in Research, Vol. 28/1, pp. 58-59.

Bronstein, M. et al. (2019), “Dual process theory, conflict processing, and delusional belief”, Clinical Psychology Review, Vol. 72, Article 101748.

DeYoung, J. et al. (2021), “MS^2: Multi-Document Summarization of Medical Studies”, in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, on line and Punta Cana, Dominican Republic, pp. 7494-7513.

Diggelmann, T. et al. (2020), “CLIMATE-FEVER: A dataset for verification of real-world climate claims”, arXiv, arXiv:2012.00614.

Guo, Z., M. Schlichtkrull and A. Vlachos (2022), “A survey on automated fact-checking”, Transactions of the Association for Computational Linguistics, Vol. 10, pp. 178-206.

Kim, B. et al. (2021), “How robust are fact checking systems on colloquial claims?”, in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, on line, pp. 1535-1538.

Kotonya, N. and F. Toni (2020), “Explainable automated fact-checking for public health claims”, arXiv, arXiv:2010.09926 [cs.CL].

Nguyen, A.T. et al. (2018), “Believe it or not: Designing a human-AI partnership for mixed-initiative fact-checking”, in Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology.

Pan, L. et al. (2021), “Zero-shot fact verification by claim generation”, arXiv, arXiv:2105.14682 [cs.CL].

Park, S. et al. (2021), “The presence of unexpected biases in online fact-checking”, 27 January, Misinformation Review.

Roozenbeek, J. et al. (2020), “Susceptibility to misinformation about COVID-19 around the world”, Royal Society Open Science, Vol. 7.

Saakyan, A., T. Chakrabarty and S. Muresan (2021), “COVID-Fact: Fact extraction and verification of real-world claims on COVID-19 pandemic”, arXiv, arXiv:2106.03794 [cs.CL].

Sarrouti, M., A. Ben Abacha and Y. Mrabet (2021), “Fact-checking of health-related claims”, in Findings of EMNLP, Association for Computational Linguistics.

Thorne, J. et al. (2018), “FEVER: A large-scale dataset for fact extraction and VERification”, arXiv, arXiv:1803.05355 [cs.CL].

Wadden, D. et al. (2020), “Fact or fiction: Verifying scientific claims”, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, on line, pp. 7534-7550.

Wadden, D. et al. (2022), “MultiVerS: Improving scientific claim verification with weak supervision and full-document context”, in Findings of the Association for Computational Linguistics: NAACL, Association for Computational Linguistics, Seattle, United States, pp. 61-76.

Wright, D. and I. Augenstein (2020), “Claim check-worthiness detection as positive unlabelled learning”, in Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, on line, pp. 476-488.

Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

© OECD 2023

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at