Improving reproducibility of artificial intelligence research to increase trust and productivity

O.E. Gundersen
Norwegian University of Science and Technology

Several recent studies have shown that many scientific results cannot be trusted. While the “reproducibility crisis” was first recognised in psychology, the problem affects most if not all branches of science. This essay analyses the underlying issues causing research to be irreproducible – with a focus on artificial intelligence (AI) – so that mitigating policies can be formulated.

Studies presented at leading conferences and published in high-impact journals have shown that AI research has not escaped the reproducibility problem. Ioannidis (2022) suggested that 70% of AI research was irreproducible. He pointed to the immaturity of the field compared to more mature sciences such as physics. This accords with two findings of Gundersen and Kjensmo (2018): only 6% of research published at top AI conferences explicitly stated which research questions were being answered, while only 5% stated which hypotheses were tested.

Problems of reproducibility have been documented in image recognition, natural language processing, time-series forecasting, reinforcement learning, recommender systems and generative adversarial networks (Henderson et al., 2018; Lucic et al., 2018; Melis, Dyer and Blunsom, 2018; Bouthillier, Laurent and Vincent, 2019; Dacrema, Cremonesi and Jannach, 2019; Belz et al., 2021). Application domains of AI have not been spared: problems have been documented in medicine and social sciences.

Many investigations have sought to identify what causes these irreproducible results. A proper understanding of the concept of reproducibility is required to capture the causes of irreproducibility. Yet, although reproducibility is a cornerstone of science, it has no commonly agreed definition. Plesser (2018) even holds that “reproducibility” is a confused term.

Without an agreed definition, the crisis will not be mitigated and many irreproducible findings will be published. This will reduce trust in science, which is already in decline. Conversely, increasing the rate of published reproducible findings will increase the productivity of science, and more importantly, increase trust in it.

Surprisingly, the most prevalent definitions of reproducibility are not helpful when designing, conducting and evaluating results of a reproducibility experiment. Here, the term “reproducibility experiment” refers to an independent experiment that seeks to validate the results of a previous study, here called the “original study”. The prevailing definitions neither specify what a reproducibility experiment entails nor what it means to reproduce results. Below, the essay presents several criteria for a more compelling definition of reproducibility.

A definition of reproducibility must help scientists specify the similarities and differences between the original and reproducibility experiments. It should provide insights into what independent researchers used from the original experiment to reproduce it and what they can and cannot change. More concretely, the definition should help answer several questions. Is an AI reproducibility experiment sufficiently different from the original if the same code is executed on a different computer but with the same data? Did the reproducibility experiment use different code or different data?

The definition of reproducibility should help inform when the results of an experiment have been reproduced and to what degree. In computer science, including AI, the output of the computational execution of a reproducibility experiment can be identical to the original experiment. This is due to the inherent determinism of some computational experiments. In contrast, producing identical results is highly unlikely in such domains as medicine, biology and psychology. In these domains, experiments involve humans and living material, and are far from deterministic.
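This determinism can be illustrated with a minimal Python sketch. Here `run_experiment` is a hypothetical stand-in for a computational experiment, not code from any cited study; the point is only that when all randomness is controlled by an explicit seed, independent executions yield identical output.

```python
import random

def run_experiment(seed: int) -> list[float]:
    # Hypothetical stand-in for a computational experiment in which
    # all randomness is controlled by an explicit seed.
    rng = random.Random(seed)
    return [round(rng.gauss(0.0, 1.0), 6) for _ in range(5)]

# Two independent executions with the same seed produce identical
# output -- something no experiment involving humans could promise.
first = run_experiment(seed=42)
second = run_experiment(seed=42)
assert first == second
```

With a different seed the outputs would differ, which is one reason the seeds themselves belong in an experiment's documentation.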

A definition should help show whether experimental results are reproducible in one of three ways: the same analysis yields the same conclusions from a different set of outputs; the same conclusion is drawn from a different analysis; or the reproducibility experiment produces an identical output. Today, the most prevalent definitions do not help researchers sort out such issues.

A good definition of reproducibility should also be generalisable to all scientific disciplines. This would be achieved if the definition is intimately related to a definition of science. While the scientific literature agrees that reproducibility is a cornerstone of science, few if any previous definitions make this relationship explicit.

Reproducibility has no meaning outside the context of empirical studies. Reproducibility is different from repeatability, which simply means doing the same thing again. Consequently, a reproducibility experiment should be similar to, but different from, the original experiment. In addition to being generalisable, a definition of reproducibility should also help scientists pinpoint what was different between an original experiment and an experiment that confirms reproducibility (and why the researchers are justified in concluding that previous results have been reproduced). Over several years and in several publications, the author arrived at the following definition of reproducibility:

Reproducibility is the ability of independent researchers to draw the same conclusions from an experiment by relying on the documentation shared by the original researchers when following the scientific method. The documentation relied on by the independent researchers specifies the type of reproducibility study, and the way the independent researchers reached their conclusion specifies to which degree the reproducibility study validated the conclusion (Gundersen, 2021).

The above definition of reproducibility differs from others in the literature in several ways. First, it emphasises that reproducibility requires independent researchers to redo a study. Second, it emphasises that an experiment from which conclusions are drawn must be described in some form of documentation shared with third parties. Third, it defines “documentation” and “drawing the same conclusions” concisely and in relation to the scientific method. Finally, it distinguishes between the type of reproducibility study and the degree to which such a study validated the original results.

For non-computational experiments, experimental documentation is written and shared in the form of reports. Analyses are typically done using statistical software such as SPSS or Excel, or written as code in languages like R, Matlab and Python, which can also be shared with third parties. As reports only describe the experiments, they often leave out details that could affect results.

Computational experiments, such as most of those reported in AI and machine-learning research, have a clear advantage over non-computational experiments. Computational experiments, and their complete workflow, can often be fully captured and documented in code. This removes any ambiguity about which steps were performed in which sequence and which parameters and thresholds were used.

Computational experiments are still not fully described by code, even if all steps of an experiment are implemented in code. This is because they also depend on ancillary software, such as libraries, frameworks and operating systems, as well as the hardware they run on. A computational experiment is not completely documented unless the ancillary software, hardware and data are specified in the documentation, in addition to the code describing the experiment. Technical solutions can support such documentation, for example by packaging the ancillary software so it can be shared with independent researchers and by capturing the hardware used in experiments.
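As a minimal sketch of what such documentation might capture, the hypothetical helper below records Python, operating-system and machine details using only the standard library. A real solution would go much further, also recording library versions, GPU models, drivers and dataset checksums.

```python
import json
import platform
import sys

def capture_environment() -> dict:
    # Record the ancillary software and hardware context of a
    # computational experiment alongside the experiment code itself.
    return {
        "python_version": sys.version,
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        # A fuller record would add library versions, e.g. via
        # importlib.metadata.version(...), plus GPU and driver details.
    }

print(json.dumps(capture_environment(), indent=2))
```

Storing such a record next to the code and data gives independent researchers a concrete starting point for matching, or deliberately varying, the original environment.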

Many computational experiments rely on observations in the form of digitised data or fully digitised simulations. Images used for training machine-learning algorithms to recognise objects, such as handwritten digits, are one example of digitised data. Meanwhile, self-play in games, such as used when training AlphaZero, is an example of a digitised simulation (Schrittwieser et al., 2020).

In both cases, the experiments are fully executed on a computer and can be reproduced with relative ease. If the analyses and their interpretation also exist as code, the complete experiment is computational, and conclusions can be drawn without human intervention. Computers are still unable to formulate interesting research questions, design proper experiments, and understand and describe their limitations. However, efforts to fully automate the scientific process are underway.

In non-computational sciences, the importance of specifying the equipment used when performing the experiment is widely understood. It is the first thing chemistry and physics students learn as part of their laboratory assignments. Doing the same is equally important for computational experiments because the choice of hardware and software can introduce biases (Gundersen, Shamsaliei and Isdahl, 2022; Zhuang et al., 2022).

In non-computational experiments, results can also be influenced by who did the experiment. By contrast, computational experiments do not depend on the person executing the code if the experiment is fully automated as code. In other words, whether one person or another presses the button that executes a computational experiment is irrelevant to the outcome. The datasets, on the other hand, can be biased (Torralba and Efros, 2011). If a different dataset without the same bias is used, the experiment could prove irreproducible.

The degree to which an experimental result has been reproduced depends on how generalisable the reproducibility study is. If the same code is executed on the same ancillary software using the same data, but on a different computer, and produces the exact same output, the results are not highly generalisable. In this case, the reproducibility experiment has only shown that the results of the original experiment generalise to a different computer.

This contrasts with reproducibility experiments that only rely on written documentation. In these cases, the independent researchers must write all code themselves, collect new data and execute the experiment on a different computer. If the results are the same, such a reproducibility experiment is much more generalisable, and the hypotheses can be trusted more deeply. In such a situation, the outcome of the re-implemented experiment should not be expected to be identical to that of the original experiment.

While reproducing an experiment by reimplementing all code and collecting new data makes the results more generalisable, it is more work for the independent researchers (Gundersen, 2019). Transparent research is easier to trust, as the researchers have nothing to hide.

A reproducibility study is outcome reproducible if the reproducibility and original experiments have the same outcome, while the analysis and interpretation are the same as for the original experiment. This could be exemplified by an image classification experiment where a machine-learning algorithm must classify a set of images of cats and dogs. The reproducibility experiment is outcome reproducible if it produces the exact same classes for each image in the test set as the original experiment. For all practical purposes, non-computational studies will not be outcome reproducible. For example, there is a low probability that survey respondents will give the same answer if they redo the survey.
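A minimal sketch of this strictest criterion, assuming the experiment's outputs are simply lists of predicted class labels (the labels below are invented for illustration):

```python
def outcome_reproducible(original: list[str], reproduction: list[str]) -> bool:
    # Outcome reproducibility demands that every test-set prediction
    # matches the original exactly.
    return original == reproduction

original_preds = ["cat", "dog", "dog", "cat"]
assert outcome_reproducible(original_preds, ["cat", "dog", "dog", "cat"])
# A single differing prediction breaks outcome reproducibility.
assert not outcome_reproducible(original_preds, ["cat", "dog", "cat", "cat"])
```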

An experiment is analysis reproducible if the outcome differs between the original and reproducibility experiments but uses the same analysis and leads to the same conclusion. Again, the reproducibility experiment could produce a different set of classes for the set of test images of cats and dogs. However, if the same analysis gives the same result, the reproducibility study is analysis reproducible. For example, the same result could be that a given algorithm performs significantly better than another one when relying on the same statistical test used in the original study.
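The idea can be sketched as follows, with a toy "analysis" (comparing mean accuracies over repeated runs, on invented scores) standing in for the proper statistical test an actual study would use:

```python
from statistics import mean

def analysis(scores_a: list[float], scores_b: list[float]) -> bool:
    # The shared analysis: algorithm A beats algorithm B if its mean
    # accuracy over repeated runs is higher. (A real study would apply
    # a proper statistical test here.)
    return mean(scores_a) > mean(scores_b)

# The original and reproduction experiments yield different raw outcomes...
original_conclusion = analysis([0.91, 0.93, 0.92], [0.88, 0.89, 0.90])
reproduced_conclusion = analysis([0.90, 0.94, 0.91], [0.87, 0.90, 0.89])
# ...yet the same analysis yields the same conclusion, so the study
# is analysis reproducible.
assert original_conclusion and reproduced_conclusion
```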

Finally, an experiment is inferentially reproducible if a different analysis of the same or different outcome leads to the same conclusion in both the original and reproducibility studies. Again, using the example of image classification, the reproducibility study is inferentially reproducible if one statistical test is changed for another one when analysing the classes that the machine-learning algorithms have produced for the images in the test set. Even when the analysis is different, the result is reproducible if the same conclusion is reached.

Inferentially reproducible studies are more generalisable than analysis-reproducible studies, which, in turn, are more generalisable than outcome-reproducible studies. Outcome reproducibility is a narrow interpretation of reproducibility: it cannot be achieved unless the reproducibility experiment uses the exact same data. A more robust, and thus more generalisable, conclusion is reached if the same analysis is done on an outcome produced by running the experiment on a different dataset. If the conclusion still holds despite using a different analysis on the outcome, the result is even more general.

Still, while a better understanding of reproducibility is a good start, it alone will not mitigate the reproducibility crisis. One must understand what causes irreproducibility.

Experiments have many sources of irreproducibility that could invalidate the conclusions drawn from them. While some sources of irreproducibility are shared across sciences, others apply only to specific disciplines. This essay is most interested in those that affect AI and machine learning.

Figure 1 shows graphically the entire scientific workflow, and the forms in which, and points at which, AI is brought to bear. Broadly speaking, for such experiments, the sources of irreproducibility can be divided into six types (Gundersen, Coakley and Kirkpatrick, 2022).

Study design factors capture the high-level plan for how to conduct and analyse an experiment to answer the stated hypothesis and research questions. For example, baselines could be poorly chosen (comparing a state-of-the-art deep-learning algorithm for a given task to one that is not state of the art). This would give a false impression of improving on the state of the art.

Algorithmic factors are design choices of machine-learning algorithms and training processes that exploit randomness in different ways. Examples include data randomised in different ways during training, learning algorithms that rely on random initialisation of features, features that are selected randomly, and algorithms that are optimised using techniques that rely on randomness. The randomness introduced by these design choices leads to differences in performance. Researchers can choose the results that best suit them over those that better reflect the true performance of an algorithm. This makes the study irreproducible for researchers who do not "cherry pick" results.
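The effect of seed-dependent randomness, and the temptation it creates, can be sketched as follows. `train_and_score` is a hypothetical stand-in whose scores are fabricated for illustration; no real training happens here.

```python
import random
from statistics import mean

def train_and_score(seed: int) -> float:
    # Hypothetical stand-in for training a model whose test accuracy
    # varies with the random seed (initialisation, data shuffling, ...).
    rng = random.Random(seed)
    return 0.85 + rng.uniform(-0.05, 0.05)

scores = [train_and_score(seed) for seed in range(10)]
# Reporting only the best run overstates typical performance; the mean
# over many seeds (with some measure of spread) reflects it more honestly.
print(f"best run: {max(scores):.3f}  mean over seeds: {mean(scores):.3f}")
```

Reporting results over many seeds, rather than the single best run, is one concrete way to keep algorithmic randomness from becoming a source of irreproducibility.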

Implementation factors are consequences of choices related to the software and hardware used in the experiment. Examples include which software (e.g. operating systems, software libraries and frameworks) are used in the computational experiment, and whether computations are executed in parallel. All of these can affect the outcome to such a degree that opposite conclusions can be drawn.

Observation factors are related to the data or the environment of an intelligent agent. This includes how data are generated, processed and augmented, but also properties of the environments, such as the physical laws present in the benchmarking environment of the agent. The agent learns from patterns in the data. If the data represent the physical world, which in most cases they do, the agent will bring this idea of the world with it when deployed.

These issues have been most discussed when considering sex and race. However, problems can arise from even innocuous features of a dataset, as when a system fails to recognise images of coffee mugs simply because some have handles pointing in different directions than others (Torralba and Efros, 2011). Such issues can be reduced, but also amplified, during data pre-processing and data augmentation (where a dataset is enlarged by slightly modifying samples and adding them to the dataset).

Other issues relate to different distributions of classes in the training and test sets. For example, the test set could contain proportionally more images of dogs than the training set, which inflates measured performance if the algorithm has a lower error rate on dogs. The annotation quality of target values in the training data is another consideration. Different human annotators, for example, can label the same instance differently.

Evaluation factors relate to how investigators reach their conclusions. Examples include selective reporting of results, where only datasets that show the desired results are used in the study; over-claiming of results, where conclusions go beyond the evidence; poor estimation of error; and misuse of statistics when analysing results. Evaluation factors can be uncovered by reading scientific reports carefully, thoroughly understanding the presented research and knowing the state of the art deeply.

Documentation factors capture how well the documentation reflects the actual experiment. For an experiment to be documented perfectly, every choice that is made and could be a source of irreproducibility should be documented and the motivation for the choice explained. These could include up to 42 different types of choices (Gundersen, Coakley and Kirkpatrick, 2022). The readability of the documentation is of course important, as is the detail provided on the experiment's design, implementation and workflow. For computational experiments, publishing code and data will in many cases resolve the ambiguities.

The different sources of irreproducibility affect the conclusion in different ways (Gundersen, Coakley and Kirkpatrick, 2022). Some identified sources affect the outcome of an experiment, such as whether an algorithm uses randomness in the training process. This means that every time a model is trained on a dataset, the outcome will be slightly different when run on the same test set. Some decisions, related to how the model is evaluated, could change the analysis, leading to a different conclusion. For example, some error metrics will emphasise some characteristics of a model over others. This can be illustrated by using the mean value or the median value for comparing two sets of numbers. A couple of extreme outliers will affect the mean value but not the median value. The choice of metric can affect the conclusion. Finally, some factors affect the inference, such as leaving out some results that do not support a researcher’s desired conclusion.
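The mean-versus-median point can be made concrete with invented error values for two models, where the second set contains two extreme outliers:

```python
from statistics import mean, median

# Invented per-run errors for two models; model B has two extreme outliers.
errors_a = [1.0, 1.1, 0.9, 1.0, 1.2]
errors_b = [0.8, 0.9, 0.9, 9.0, 8.5]

print(mean(errors_a), mean(errors_b))      # the mean says B is far worse
print(median(errors_a), median(errors_b))  # the median says B is better
```

Which metric is "right" depends on whether outliers matter for the intended conclusion; the point is only that the choice must be justified and documented, since it alone can flip the result.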

False results are expected in science. Indeed, AI has a high rate of false results (Ioannidis, 2022). An achievable goal is to reduce the number of irreproducible studies to be on par with physics, which has the lowest rate of false findings. This could be done by increasing methodological rigour, such as explicitly considering the sources of irreproducibility. Individual researchers cannot do this alone; all actors in the research system must share the responsibility. Broadly, this system includes researchers, their institutions, the publishers of the research and funding agencies. The following provides some insights into the responsibilities of each actor.

Individual researchers should ensure they understand and describe the limitations of their studies, taking the sources of irreproducibility into account. They must design studies that genuinely test their hypotheses and treat all algorithms that are investigated equally. They should also discuss the limitations of the experiments related to algorithmic, implementation and observation factors that can affect the conclusion. The choice of evaluation should be clearly reasoned, demonstrating clearly why it will provide trustworthy evidence for the conclusion. Finally, the researchers must document the research properly and share as much information as possible about the experiment, including code and data in addition to good descriptions.

Research institutions should ensure that best practices for AI research are followed. This includes training employees and providing quality assurance processes for the research under their responsibility. They should also ensure that research projects set aside enough time for quality assurance of the experiments. Finally, they should emphasise quality and transparent research practices as part of the process of hiring researchers.

Publishers must assure the quality of the research they publish, a job often outsourced to third-party researchers. Few publishers standardise the review process or provide instructions for reviewers to follow. The peer review that occurs at AI and machine-learning conferences is an exception: reviews take the form of checklists and forms that reviewers must complete. This contrasts with journals, where reviews are typically written in free form; guidelines are provided but not enforced. This could be improved with review forms that cover the different sources of irreproducibility. Furthermore, publishers should encourage publishing code and data as part of a scientific article. That said, publishers should not be expected to enforce the sharing of code and data.

Funding agencies obviously evaluate the quality of project proposals selected for funding. While they cannot actively avert or control many sources of irreproducibility, they can significantly influence some of them.

First, funding agencies can select evaluators with a good track record of open and transparent research. As such research is easier to audit and check for reproducibility, researchers who publish open and transparent research would seemingly set a high bar for themselves.

Second, for the same reason, funding agencies can require the research they fund to be published in open-access journals and conferences.

Finally, and most importantly, they can require both code and data to be shared freely with third parties. Governmental funding agencies in particular should require sharing of publicly funded research so it is available to the public. In January 2021, OECD countries adopted an updated Recommendation of the Council concerning Access to Research Data from Public Funding (OECD, 2021). This legal instrument, in force since 2006, now addresses new technologies and policy developments, and provides comprehensive policy guidance. The revision expands the scope of the earlier Recommendation to go beyond coverage of research data. It now covers related metadata, as well as bespoke algorithms, workflows, models and software (including code), which are essential for their interpretation. 

Making research infrastructure available to third parties could be an option. However, it may be less important, as computational experiments should be reproducible regardless of the hardware and ancillary software involved. Requiring that code and data be publicly available would allow third parties to run experiments on different hardware. However, this will not solve all issues with reproducibility. Producing the same outcomes does not ensure that datasets were not chosen based on how well certain methods perform on them; other datasets could be left out for the same reason. Still, making code and data available will enable third parties to check the validity of published research with less effort.

If science involves standing on the shoulders of previous generations of scientific giants, as Newton put it, then reducing the number of false results will help scientists to see even further. This means that the productivity of science will increase. AI research needs to continue focusing on reproducibility, openness and transparency. Most high-impact research conferences care about this and have started using a reproducibility checklist as part of the submission and review process. Community-driven publications, such as the Journal of AI Research, have adopted this focus. However, funding agencies can also require researchers to share code and data as a condition of their funding. In addition, they should require funded researchers to publish in open-access journals and conferences that have clear guidelines and forms for evaluating research.


Belz, A. et al. (2021), “A systematic review of reproducibility research in natural language processing”, in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 381-393.

Bouthillier, X., C. Laurent and P. Vincent (2019), “Unreproducible research is reproducible”, in Proceedings of Machine Learning Research 97, pp. 725-734.

Dacrema, M.F., P. Cremonesi and D. Jannach (2019), “Are we really making much progress? A worrying analysis of recent neural recommendation approaches”, in Proceedings of the 13th ACM Conference on Recommender Systems, pp. 101-109.

Gundersen, O.E. (2021), “The fundamental principles of reproducibility”, Philosophical Transactions of the Royal Society A, Vol. 379/2197.

Gundersen, O.E. (2019), “Standing on the feet of giants – Reproducibility in AI”, AI Magazine, Vol. 40/4, pp. 9-23.

Gundersen, O.E., K. Coakley and C. Kirkpatrick (2022), “Sources of irreproducibility in machine learning: A review”, arXiv preprint, arXiv:2204.07610.

Gundersen, O.E., S. Shamsaliei and R.J. Isdahl (2022), “Do machine learning platforms provide out-of-the-box reproducibility?”, Future Generation Computer Systems, Vol. 126, pp. 34-47.

Gundersen, O.E., Y. Gil and D.W. Aha (2018), “On reproducible AI: Towards reproducible research, open science and digital scholarship in AI publications”, AI Magazine, Vol. 39/3, pp. 56-68.

Gundersen, O.E. and S. Kjensmo (2018), “State of the art: Reproducibility in artificial intelligence”, in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32/1.

Henderson, P. et al. (2018), “Deep reinforcement learning that matters”, in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32/1.

Ioannidis, J.P. (2022), “Why most published research findings are false”, PLOS Medicine, Vol. 2/8, p. e124.

Lucic, M. et al. (2018), “Are GANs created equal? A large-scale study”, Advances in Neural Information Processing Systems, Vol. 31.

Melis, G., C. Dyer and P. Blunsom (2018), “On the state of the art of evaluation in neural language models”, in Proceedings of the International Conference on Learning Representations 2018.

OECD (2021), Recommendation of the Council concerning Access to Research Data from Public Funding, OECD, Paris.

Plesser, H.E. (2018), “Reproducibility vs. replicability: A brief history of a confused terminology”, Frontiers in Neuroinformatics, Vol. 11/76.

Schrittwieser, J. et al. (2020), “Mastering Atari, Go, chess and shogi by planning with a learned model”, Nature, Vol. 588/7839, pp. 604-609.

Torralba, A. and A.A. Efros (2011), “Unbiased look at dataset bias”, in CVPR 2011, pp. 1521-1528.

Zhuang, D. et al. (2022), “Randomness in neural network training: Characterizing the impact of tooling”, in Proceedings of Machine Learning and Systems, Vol. 4, pp. 316-336.


© OECD 2023
