A framework for evaluating the AI-driven automation of science

R. King
Cambridge University
United Kingdom
H. Zenil
Cambridge University
United Kingdom

The fundamental goal of science is to construct models that predict what will happen in the real world. This provides a natural objective function for artificial intelligence (AI) systems used in science to optimise how well they predict what happens in experiments. This essay looks at the future of AI-led science, presenting a roadmap of challenges for AI in scientific discovery and then proposing a framework for evaluating AI in science.

The traditional name for the application of AI to science is “discovery science”, which dates to the 1960s and the work of Joshua Lederberg (Herzenberg, Rindfleisch and Herzenberg, 2008). Lederberg won the Nobel Prize in Physiology or Medicine and also taught himself to program. He was very interested in AI and in how to formalise science using logic. Lederberg’s Meta-Dendral project was designed as part of the Viking probes to Mars (Klein et al., 1976); Mars was so distant that an automated system was needed to do the science there. Although the computers and computer science of the time were not up to the task, this initiative turned out to be influential on AI. Also closely involved in Meta-Dendral was Ed Feigenbaum, who won the Turing Award for his work on expert systems (Feigenbaum, 1992). Carl Djerassi, famous for his work on the birth control pill, was the main chemist. The machine learning pioneer Bruce Buchanan was also involved.

Another highlight in the development of discovery science is the Bacon system. The driving force here was Herbert Simon, the only person to have won both a Nobel Prize and a Turing Award. It was claimed that Bacon rediscovered scientific laws such as Kepler’s laws of planetary motion (Qin and Simon, 1990). This was controversial because the data were in effect cleaned up, and Bacon fitted equations to these clean data; this was very different from what Kepler had to do in reality. Nevertheless, Bacon was an important and influential project.

Science is a well-suited task for AI systems (Kitano, 2021). Science is abstract like the games of chess and Go, where AI systems can beat the best human players. Scientific problems are also restricted in scope. If an AI system is working on a scientific problem, it does not need to know about vegetables or politics or anything else. It just needs to know about the scientific domain in question. Moreover, nature is honest. Whether a human or a robot does a scientific experiment, for instance, the real world can be trusted; it is not trying to fool us about how nature works. This is quite different from AI systems in business or war, where many agents are dishonest, in nature or by design.

Much of the AI developed for science to date has limitations (Castelvecchi, 2016). There are many examples of “black-box” applications of AI in science which lead to little understanding. AlphaFold (and AlphaFold 2), for example, produced impressive results on the problem of protein folding. However, they did not generate much understanding of the underlying mechanisms of protein folding (Pinheiro et al., 2021; David et al., 2022). Most current AI applications to science have only contributed indirectly to understanding; domain experts, in fact, do the crucial prior work (Hedlund and Persson, 2022). This may have negative consequences. For example, if the problem of protein folding is considered solved, funding for basic science in molecular dynamics may be reduced. This problem is compounded by the fact that measuring scientific progress is both difficult and often controversial (Zenil and King, forthcoming).

How much knowledge is gained by applying AI to science? This is an important aspect of measuring scientific progress. The issue is not about the speed of discovery or how this may be increased but rather about how much is actually gained. This is difficult to know because there are no universally accepted criteria for gauging originality or relevance of a scientific discovery. Nobel Prizes, for example, reward contributions to science of the highest order, and yet bias and subjectivity are at play in the award process.

In this section, the essay explores issues that related papers do not seem to address, particularly the inclusion of lab automation in a closed-loop experimental cycle led by AI, and its full implications. AI is making important contributions to many aspects of science. However, there are few examples of AI being used to complete the experimental cycle (King et al., 2018), or to fully automate it from beginning to end in a generalised manner.

One of the main bottlenecks in the automation of science is the question of knowledge extraction and knowledge representation, both in a domain-specific and a general sense (Ataeva et al., 2020). If an artificial general intelligence (AGI) existed, it could extract and represent knowledge from all domains. AGI is yet to happen, though it likely will in time.

Currently, even the most automated systems are usually given a hypothesis to test. In other words, even the best AI available today does not yield systems capable of defining their own hypothesis space and experimental design. Some forms of AI-driven instrument automation for experimental acceleration do exist and have proven fruitful (King et al., 2018; Frueh, 2021). These closed systems require contextual rejection, validation and verification of hypotheses and models. Ideally, they should be capable, just as humans are, of interpreting accidents as a source of inspiration and innovation.

Several groups have done research on “computational serendipity” (Niu and Abbas, 2017; Abbas and Niu, 2019). Such research may prove to be essential to reproduce human scientific practice. It may even improve its efficiency (negative results are often transformed into positive discoveries under a different context). In a closed-loop approach to scientific discovery, and as part of its definition, it is important to consider how to really close the experimental loop. A system should incorporate a result into a knowledge database and consider it for the next iteration of the discovery cycle.

AI-led closed-loop automation is arguably the future of science. In other words, in the future, a robot scientist (AI scientist, self-driving lab or AI robotic system) will do simple forms of scientific research autonomously. Such a system has background knowledge about an area of science, which it represents in the best way possible – through logic and probability theory.

A robot scientist can autonomously discover and form a new hypothesis about the area of science in question. It can also autonomously identify efficient experiments to test these hypotheses. It may then control and program a laboratory robot to physically execute experiments.

The robotic system can examine the results of these experiments, analyse them and change the probabilities of hypotheses being correct based on the experimental observations. It can then repeat the cycle until some resources run out or only one theory is consistent with the background knowledge and experimental evidence. Such robotic systems are already accelerating science in genetics and drug discovery (King et al., 2019; 2018; Frueh, 2021).
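The cycle described above (form hypotheses, choose an informative experiment, run it, revise the probabilities, repeat) can be sketched in a few lines of Python. This is only an illustrative sketch under strong assumptions, not the design of any actual robot scientist: the hypotheses `H1` to `H3`, the experiments `E1` and `E2`, the likelihood table, the `simulated_lab` function and the 0.95 stopping threshold are all hypothetical. Beliefs are revised with Bayes' rule, and the next experiment is the one expected to leave the least uncertainty (Shannon entropy) over the hypotheses.

```python
import math


def normalise(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("observation is impossible under every hypothesis")
    return {h: w / total for h, w in weights.items()}


def update_beliefs(beliefs, likelihoods, experiment, outcome):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalised = {}
    for hypothesis, prior in beliefs.items():
        p_positive = likelihoods[hypothesis][experiment]
        unnormalised[hypothesis] = prior * (p_positive if outcome else 1.0 - p_positive)
    return normalise(unnormalised)


def entropy(beliefs):
    """Shannon entropy (in bits) of a distribution over hypotheses."""
    return -sum(p * math.log2(p) for p in beliefs.values() if p > 0)


def expected_entropy(beliefs, likelihoods, experiment):
    """Average remaining uncertainty after running `experiment`."""
    p_positive = sum(beliefs[h] * likelihoods[h][experiment] for h in beliefs)
    total = 0.0
    for outcome, p_outcome in ((True, p_positive), (False, 1.0 - p_positive)):
        if p_outcome > 0:
            posterior = update_beliefs(beliefs, likelihoods, experiment, outcome)
            total += p_outcome * entropy(posterior)
    return total


def run_cycle(prior, likelihoods, run_experiment, budget, threshold=0.95):
    """Repeat the cycle until the budget or the experiments run out,
    or until one hypothesis is believed with probability >= threshold."""
    beliefs = normalise(prior)
    untried = sorted(next(iter(likelihoods.values())))
    log = []
    while budget > 0 and untried and max(beliefs.values()) < threshold:
        # Choose the experiment expected to reduce uncertainty the most.
        experiment = min(untried, key=lambda e: expected_entropy(beliefs, likelihoods, e))
        outcome = run_experiment(experiment)  # stands in for the lab robot
        beliefs = update_beliefs(beliefs, likelihoods, experiment, outcome)
        log.append((experiment, outcome))     # record the result for later cycles
        untried.remove(experiment)
        budget -= 1
    return beliefs, log


# Hypothetical toy problem: P(positive result | hypothesis, experiment).
LIKELIHOODS = {
    "H1": {"E1": 0.9, "E2": 0.9},
    "H2": {"E1": 0.9, "E2": 0.1},
    "H3": {"E1": 0.1, "E2": 0.5},
}
PRIOR = {"H1": 1.0, "H2": 1.0, "H3": 1.0}  # uniform before normalisation


def simulated_lab(experiment):
    """Assumed noise-free world in which H2 happens to be true."""
    return LIKELIHOODS["H2"][experiment] > 0.5


final_beliefs, history = run_cycle(PRIOR, LIKELIHOODS, simulated_lab, budget=2)
print(history)
print(max(final_beliefs, key=final_beliefs.get))
```

In this toy run the budget of two experiments is exhausted before any hypothesis reaches the threshold, which mirrors the "resources run out" stopping condition; a real system would also need to handle noisy measurements and a hypothesis space that is not fixed in advance.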

The motivations for building robot scientists are both epistemological and technological (King et al., 2018).

The epistemological motivation is to better understand how science works. If one can create an engine that can do human-like science, then this is informative about how the practice of human science may work. As was written on Richard Feynman’s blackboard at the time of his death, “what I cannot create I do not understand”.

The technological motivation is to increase the efficiency and quality of science. Robot scientists can work faster, more cheaply, more accurately and for longer than human beings: 24 hours a day, 7 days a week (King et al., 2018). Robot scientists can also be more easily multiplied than human scientists. If one robot scientist can be built, then thousands, even millions, could also be quickly built.

The science produced by robot scientists is expected to be of higher quality and more reproducible than that of human scientists. With robot scientists, the whole scientific cycle is semantically explicit, and potentially also declarative (i.e. expressing the logic of computations). Robot scientists also record experiments in much greater detail than is possible for most human scientists. This makes the experiments more likely to be reproducible in other laboratories.

Finally, robot scientists are more robust to pandemics than human scientists. For example, a “robot chemist” at Liverpool University made headlines in the United Kingdom by working through the COVID pandemic (Burger et al., 2020).

Evidence for the feasibility of the hugely ambitious project of building such robot systems comes from the success of AIs at playing the most intellectual of games (chess, Go, poker, etc.), and the analogy between such games and science. In chess and Go, for example, there is a continuum of playing ability ranging from novices to grand masters. Over AI history, AI game-playing programs repeated this path: they began playing poorly and went on to easily beat the human world champions (Strogatz, 2018).

The analogy between games and science suggests that AI systems for science may follow the same trajectory as game-playing systems. They may move from simple forms of science that existing autonomous systems can do through to science that average human scientists can do, and end as “Grand Masters” of science (Newton, Darwin, Einstein). If one accepts a continuum of ability in science, then AI systems will likely get better and better at science as hardware and AI software improve and more data become available. Indeed, 10 years ago, physics Nobel laureate Frank Wilczek predicted that in 100 years the best physicist would be a machine (Wilczek, 2016).

Achieving success in the Nobel Turing Challenge (Kitano, 2021), that is, developing AI systems capable of Nobel Prize-level research, requires overcoming huge technical challenges. AI systems would need the capacity to:

  • make strategic choices about their research goals

  • form exciting and novel hypotheses that move beyond a restricted area

  • design novel protocols and experiments to test hypotheses beyond use of prototypical experiments

  • notice and characterise a significant discovery in terms that human scientists can comprehend.

Is it necessary to solve the problem of general-purpose AI (GPAI) to develop AI systems that can do Nobel Prize-level scientific research? It was once widely believed that building machines able to beat the world chess champion would require GPAI. Indeed, that was the main motivation for studying computer chess. GPAI turned out to be unnecessary, and it was possible to build machines that are world class at chess/Go but able to do nothing else intelligent. Any future AI system that could do Nobel Prize-quality research would certainly have to be much more general and intelligent than chess/Go playing machines. However, it is not clear whether such a system would need to be as general and intelligent as a GPAI.

Achieving the Nobel Turing Challenge would have profound effects on almost everything. Modern society is built on a foundation of science and technology. Most people in developed countries now live better than kings did in the past: they have better food, medical care, transport, etc. This miracle has been made possible through better technology based on science. Success in the Nobel Turing Challenge would result in almost unlimited amounts of new science and technology. This power could then be used for the benefit of all the world’s inhabitants, human and non-human.

The problem of communicating scientific results has to do with language-based modelling. Future milestones may range from generating summaries of scientific articles to producing a critique of a whole scientific field. In other words, AI will perhaps be able to pinpoint where humans have been biased or else highlight areas of a domain that humans have failed to explore. If AI can explore a full hypothesis space, and even enlarge the space itself, then it may show that humans have only been exploring small, delimited areas of the hypothesis space, perhaps as a result of their own scientific biases.

Exploration could be encouraged in regions of science that are neither exclusively those favoured by humans nor random. Instead, such exploration would be AI-led or a hybrid of human-guided and computer-proposed (e.g. as in assisted theorem proving). It could be that areas humans have chosen to explore are the only ones relevant to human challenges. Over time, however, it may become apparent that humans have neglected areas of discovery that could have positive social impacts.

Reasoning needs to be able to move from capabilities such as generating scientific questions to passing any open-ended scientific or professional exam. This kind of investigation should also be able to explore some of the algorithms in use. This need not be in great depth, but it must at least be aware of broad categories of algorithms. Statistical data-driven approaches (Kim et al., 2020) dominate the current AI and machine-learning scene. Consequently, model-driven approaches (Shlezinger et al., 2021) would be a small subset of the broader categories.

As with many such areas of science, much remains speculative. However, a combination of methods will probably help achieve the relevant goals. This will include approaches similar to what currently happens in hybrid human-AI interaction but doing so more explicitly (Maadi et al., 2021). Some researchers have also found the best results may come from combining the best of both worlds. Statistically data-driven approaches, such as most of the deep-learning space, could go together with methods based on cognitive or symbolic computing (Pisano et al., 2020). Among other things, this would permit a better handling of aspects of causality (Zenil et al., 2019).

Measuring acceleration or deceleration of progress in science is difficult because each is likely to be highly domain- and method-dependent. Imagine an attempt to quantify the acceleration of research in a single scientific domain. Obviously, there could be different methods for addressing the problem, and each could find a different rate of acceleration. Unfortunately, today, there is no universally agreed-upon way to measure progress or productivity in science.

Assessing the contribution of AI to science and the evolution of science towards full automation requires identifying and advancing measures for evaluating progress. The Society of Automotive Engineers, for example, developed a classification from Zero to Five to assess progressive degrees of autonomy in cars.

One way to think of these levels is how much human input the car requires to navigate. The higher the level, the less human input is required. Thus, Zero signifies no automation: these are regular cars where the human driver is in charge of every aspect of driving. Level One signifies driver assistance. Level Two involves automated steering and acceleration. Some may consider having a GPS and other such aids as amounting to partial automation, but this combines automation and human effort, comparable to the process of acceleration. The same could be said of cruise control, which requires human intervention. At any rate, autonomy between One and Two signifies assistance plus automation (Badue et al., 2021). In Level Three, some responsibility of driving is transferred to an AI system. Level Five is full automation (with no human intervention). This remains elusive, despite the hype of the last decade. Level Four is Level Five but restricted in scope, e.g. in time and space, or to certain situations.

This essay proposes a similar scheme for evaluating AI in science, since this too involves a transfer of responsibility from humans to machines. It too is about progressively consigning aspects of the scientific endeavour to machines, until humans are no longer involved. Any adopted classification needs to be useful, understandable, specifically measurable, achievable, relevant and robust. In other words, assigning a level must be easy and classification would not require constant updating.

The proposed classification has an associated staged process that the automation of science by machines and AI might follow. However, the framework is itself a work in progress.

Level Zero is simple because it designates the absence of automation in science. Most traditional human science, before the advent of computers, belongs here. It is led, driven and undertaken by human minds.

In Level One, human scientists still describe the problem in full, but machines do some data manipulation or calculation. Some commentators date Level One, machine assistance, to the beginning of the last century. Others trace it to the advent of data science in the 1980s, or even to the 1990s with the emergence of statistical machine learning. A case might also be made for dating the achievement of Level One to the 1950s and 1960s, when the first theorem provers appeared (Harrison, Urban and Wiedijk, 2014).

Level Two would signify that an important aspect of the discovery cycle is fully automated. This could include, for example, the simulation or extraction of knowledge, or the testing of propositions. This means that humans are still required for some of the most important aspects of the full experimental cycle but that at least one pathway has been fully automated. For example, some AI systems are able to read databases and provide this input to another human or to a machine system.

Level Three would signify a state where AI can perform model selection and generation (Hecht, 2018). This would be equivalent to having a knowledgeable system appear, with some agency, which could receive a set of hypotheses and then follow the consequences. For example, a scientist might be able to provide the system with a selection of problems and data, and the AI would then match them to provide a solution. In this case, human scientists are still giving the AI the hypothesis and solution spaces.

Theorem proving may belong to Level Three, being quite advanced in some ways. Still, today’s theorem provers are also limited because they do not learn over time; they are deterministic. In other words, they start from scratch every time, unless humans add new knowledge to the theorem database. In a few systems, this may have been automated, but some level of human curation is still needed.

Level Four would entail closing the loop – as it were – with AI being able to generate and explore the hypothesis space. However, at least one aspect of the discovery cycle would not be fully automated: humans would still need to feed the AI system with all the initial information and data it needs. An example of a Level Four system would be a theorem prover that, without new data inputs after the initial cycle of analysis, could continue to explore a hypothesis space and generate new theorems without human intervention.

Level Five corresponds to full automation, covering all levels of discovery and with no human intervention. An automated system operating at Level Five will be equivalent, if not superior, to a human scientist. This type of system would not require any human input. What follows are examples illustrating the various levels.

Level One is perhaps the state of the art today, with one or two processes entrusted to machines, data science, data analytics, etc. The Kepler space telescope, for example, generates a lot of data. Use of a computer to analyse the data is needed to extract all the information about exoplanets embedded in the data. Given the sheer quantity of data and the weakness of the signals, little to nothing might be accomplished without computers.

Machine learning arguably fits Level Two. Weather forecasting, which is very much based on dynamic systems, would be a good example of more physics-driven (Chowdhury and Subramani, 2020) and model-driven approaches. With weather forecasting, a physical representation requires little human intervention because weather sensors supply information almost in real time. Nearly the entire process has been automated, except for model creation. The model is already determined and of human devising. Humans generate a model and implement it and then let the system do the simulation and ingest the data, almost all of this occurring in real time.

The placement of AlphaFold 2 at Level One, Two or Three is an open question. Level Three seems inappropriate because it would designate something more like auto machine learning (He et al., 2021), which is about trying to pick the model that best fits the observations. Auto machine learning is in an early stage of development and is perhaps also domain-specific. The authors believe that capabilities in science are moving towards Level Three where AIs can choose the best model for the data.

Only the robot scientists may be said to have reached Level Four. This is the stage where science, especially experimental science, can be greatly accelerated. For such machines, this involves almost no human intervention except for providing consumables.
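The proposed levels and the example placements discussed above can be summarised as a small data structure. The sketch below is one possible encoding, not part of the framework itself: the capability flags in `SystemProfile`, the assumption that capabilities accumulate strictly in order, and the flag values given to each example are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutomationLevel(IntEnum):
    """Proposed levels of automation in science (Zero to Five)."""
    ZERO = 0   # no automation: science done entirely by human minds
    ONE = 1    # machines assist with data manipulation or calculation
    TWO = 2    # at least one pathway of the discovery cycle fully automated
    THREE = 3  # AI selects and generates models within human-given spaces
    FOUR = 4   # AI generates and explores the hypothesis space (closed loop)
    FIVE = 5   # full automation with no human input


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical capability flags used to place a system on the scale."""
    machine_computation: bool
    automated_pathway: bool
    selects_models: bool
    generates_hypotheses: bool
    independent_of_human_input: bool


def classify(profile: SystemProfile) -> AutomationLevel:
    """Return the highest level whose requirements are all met.

    Assumes capabilities are cumulative: a level counts only if every
    lower level has also been reached.
    """
    ladder = [
        (profile.machine_computation, AutomationLevel.ONE),
        (profile.automated_pathway, AutomationLevel.TWO),
        (profile.selects_models, AutomationLevel.THREE),
        (profile.generates_hypotheses, AutomationLevel.FOUR),
        (profile.independent_of_human_input, AutomationLevel.FIVE),
    ]
    level = AutomationLevel.ZERO
    for has_capability, reached in ladder:
        if not has_capability:
            break
        level = reached
    return level


# Illustrative profiles for the examples in the text (flag values are assumptions).
EXAMPLES = {
    "telescope data analysis": SystemProfile(True, False, False, False, False),
    "weather forecasting": SystemProfile(True, True, False, False, False),
    "auto machine learning": SystemProfile(True, True, True, False, False),
    "robot scientist": SystemProfile(True, True, True, True, False),
}

for name, profile in EXAMPLES.items():
    print(f"{name}: Level {int(classify(profile))}")
```

Using an ordered enumeration makes comparisons such as `classify(a) < classify(b)` meaningful, which fits a scale defined by how much human input a system still requires.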

Participants at the first workshop on the Nobel Turing Challenge, organised by the Alan Turing Institute in 2020, estimated that widespread uptake of Level Two and Level Three systems will happen within the next five years. They considered that Level Four systems could become widespread in the next 10-15 years, and Level Five in the next 20-30 years. Indeed, a fully automated experiment recently tested systematic research reproducibility from literature papers for the first time (Roper et al., 2022). It shows higher Levels (4-5) are becoming possible. If the estimates of the experts cited here are even broadly correct, then science will shortly be transformed.

This essay argues that the future of science lies in AI-led closed-loop automation systems. These run the full scientific cycle autonomously, iterating continuously from hypothesis generation to experimental validation and re-interpretation of results. These systems will emulate the human scientific process but work faster and more precisely. They will be less biased and able to open up ever-larger regions of scientific discovery. Achieving this requires well-defined key performance indicators grounded in a framework of automation levels based on the quantity and quality of input and execution required from human scientists. Human scientists will decide how to work with the AI scientists, and how much room AI will have to define its own problems and solutions.


Abbas, F. and X. Niu (2019), “Computational serendipitous recommender system frameworks: A literature survey”, in 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), pp. 1-8, https://ieeexplore.ieee.org/abstract/document/9035339.

Ataeva, O. et al. (2020), “Ontological approach: Knowledge representation and knowledge extraction”, Lobachevskii Journal of Mathematics, Vol. 41/10, pp. 1938-1948, www.azooov.ru/index.php/ljm/issue/view/81.

Badue, C. et al. (2021), “Self-driving cars: A survey”, Expert Systems with Applications, Vol. 165/113816, https://doi.org/10.1016/j.eswa.2020.113816.

Burger, B. et al. (2020), “A mobile robotic chemist”, Nature, Vol. 583, pp. 224-237, https://doi.org/10.1038/s41586-020-2442-2.

Castelvecchi, D. (2016), “Can we open the black box of AI?”, Nature News, 5 October, Vol. 538/7623, www.nature.com/news/can-we-open-the-black-box-of-ai-1.20731.

Chowdhury, R. and D.N. Subramani (2020), “Physics-driven machine learning for time-optimal path planning in stochastic dynamic flows”, in International Conference on Dynamic Data Driven Application Systems, pp. 293-301, https://dl.acm.org/doi/abs/10.1007/978-3-030-61725-7_34.

David, S. et al. (2022), “The alphafold database of protein structures: A biologist’s guide”, Journal of Molecular Biology, Vol. 434/2, p. 167336, https://doi.org/10.1016/j.jmb.2021.167336.

Feigenbaum, E.A. (1992), “A personal view of expert systems: Looking back and looking ahead”, Knowledge Systems Laboratory, Department of Computer Science, Stanford, https://stacks.stanford.edu/file/druid:dp864rk0005/dp864rk0005.pdf.

Frueh, A. (2021), “Inventorship in the age of artificial intelligence”, SSRN, https://dx.doi.org/10.2139/ssrn.3664637.

Harrison, J., J. Urban and F. Wiedijk (2014), “History of interactive theorem proving”, Computational Logic, Vol. 9, pp. 135-214, www.cl.cam.ac.uk/~jrh13/papers/joerg.pdf.

He, X. et al. (2021), “Automl: A survey of the state-of-the-art”, arXiv, arXiv:1908.00709 [cs.LG], https://doi.org/10.1016/j.knosys.2020.106622.

Hecht, J. (2018), “Lidar for self-driving cars”, Optics and Photonics News, Vol. 29/1, pp. 26-33, https://doi.org/10.1364/OPN.29.1.000026.

Hedlund, M. and E. Persson (2022), “Expert responsibility in AI development”, AI and Society, https://doi.org/10.1007/s00146-022-01498-9.

Herzenberg, L., T. Rindfleisch and L. Herzenberg (2008), “The Stanford Years (1958-1978)”, Annual Review of Genetics, Vol. 42, pp. 19-25, https://doi.org/10.1146/annurev.genet.072408.095841.

Kim, Y. and M. Chung (2019), “An approach to hyperparameter optimization for the objective function in machine learning”, Electronics, Vol. 8/11, pp. 1267-2019, https://doi.org/10.3390/electronics8111267.

Kim, H. et al. (2020), “Artificial intelligence in drug discovery: A comprehensive review of data-driven and machine learning approaches”, Biotechnology and Bioprocess Engineering, Vol. 25/6, pp. 895-930, https://doi.org/10.1007/s12257-020-0049-y.

King, R.D. et al. (2018), “Automating sciences: Philosophical and social dimensions”, IEEE Technology and Society Magazine, Vol. 37/1, pp. 40-46, https://doi.org/10.1109/MTS.2018.2795097.

Kitano, H. (2021), “Nobel Turing Challenge: Creating the engine for scientific discovery”, NPJ Systems Biology and Applications, Vol. 7/1, pp. 1-12, https://doi.org/10.1038/s41540-021-00189-3.

Klein, H.P. et al. (1976), “The Viking mission search for life on Mars”, Nature, Vol. 262/5563, pp. 24-27, https://doi.org/10.1038/262024a0.

Maadi, M. et al. (2021), “A review on human–AI interaction in machine learning and insights for medical applications”, International Journal of Environmental Research and Public Health, Vol. 18/4, https://doi.org/10.3390/ijerph18042121.

Niu, X. and F. Abbas (2017), “A framework for computational serendipity”, in Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization, Association for Computing Machinery, New York, pp. 360-363, https://doi.org/10.1145/3099023.3099097.

Pinheiro, F. et al. (2021), “Alphafold and the amyloid landscape”, Journal of Molecular Biology, Vol. 433/20, p. 167059, https://doi.org/10.1016/j.jmb.2021.167059.

Pisano, G. et al. (2020), “Neuro-symbolic computation for XAI: Towards a unified model”, in WOA, Vol. 1613, pp. 101-117, https://ceur-ws.org/Vol-2706/paper18.pdf.

Qin, Y. and H.A. Simon (1990), “Laboratory replication of scientific discovery processes”, Cognitive Science, Vol. 14/2, pp. 281-312, https://doi.org/10.1016/0364-0213(90)90005-H.

Roper, K. et al. (2022), “Testing the reproducibility and robustness of the cancer biology literature by robot”, Royal Society Interface, Vol. 19/189, https://doi.org/10.1098/rsif.2021.0821.

Shlezinger, N. et al. (2021), “Model-based deep learning: Key approaches and design guidelines”, in 2021 IEEE Data Science and Learning Workshop (DSLW), pp. 1-6, https://arxiv.org/pdf/2012.08405.pdf.

Strogatz, S. (2018), “One giant step for a chess-playing machine”, New York Times, 26 December, pp. 1-6, www.nytimes.com/2018/12/26/science/chess-artificial-intelligence.html.

Wilczek, F. (2016), “Physics in 100 years”, Physics Today, Vol. 69/4, pp. 32-39, https://doi.org/10.1063/PT.3.3137.

Zenil, H. and R. King (forthcoming), “The far future of AI in scientific discovery”, in AI For Science, Choudhary F. and T. Hey (eds.), World Scientific Publishing Company/Imperial College Press.

Zenil, H. et al. (2019), “Causal deconvolution by algorithmic generative models”, Nature Machine Intelligence, Vol. 1/1, pp. 58-66, https://doi.org/10.1038/s42256-018-0005-0.

© OECD 2023

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at https://www.oecd.org/termsandconditions.