Artificial intelligence in scientific discovery: Challenges and opportunities

R. King
Cambridge University
United Kingdom
H. Zenil
Cambridge University
United Kingdom

There have been cycles of hype surrounding the contributions of artificial intelligence (AI) to science (scientific discovery). However, progress has accelerated over the last decade, with machine learning (ML) now arguably one of the most exciting technologies. Indeed, the largest companies in the world, including Google, Facebook, Microsoft and Amazon, have ML at the core of their technology. This essay explores challenges and opportunities associated with various forms of ML.

There are two main forms of ML: statistical and model-driven. Statistical ML, the most commonly used and successful form, is based upon complex pattern learning. It finds regularities in data, whose meaning can then be interpreted or studied further.

Statistical ML, including deep learning, is still dominant (deep learning is a type of statistical ML based on neural networks with many layers). This dominance occurs even in cases where statistical ML is ill-equipped to deal with basic symbol manipulation such as algebra and causality.

Despite the continued dominance of statistical ML, there is a trend towards approaches that construct an abstract model or representation, as humans do, rather than the statistical fitting of high-dimensional data. Such approaches are referred to as “causal” and “model-driven”.

Model-driven approaches generate mechanistic models that are consistent with the observed data and can be tested against newly generated data. “Mechanistic” means they can be followed state by state, as in a dynamic system, through a chain of cause and effect.
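To make “followed state by state” concrete, the sketch below steps a toy dynamic system in which every state is derived from the previous one by an explicit rule. The choice of an elementary cellular automaton (rule 30), the cyclic boundary and the function names are illustrative assumptions, not a method taken from this essay.

```python
def step(cells, rule=30):
    """Apply an elementary cellular automaton rule once (cyclic boundary)."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        # The three neighbouring cells select one bit of the rule number.
        index = (left << 2) | (centre << 1) | right
        nxt.append((rule >> index) & 1)
    return nxt

def run(cells, steps, rule=30):
    """Return the full chain of states, each one caused by its predecessor."""
    history = [cells]
    for _ in range(steps):
        history.append(step(history[-1], rule))
    return history

history = run([0, 0, 0, 1, 0, 0, 0], steps=2)
```

Because the rule is explicit, any later state can be traced back through the chain to the initial condition, which is the sense of “mechanistic” used here.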

The distinction between model-driven and statistical ML is not always clear in the literature. Indeed, some statistical ML models are called “causal” or “model-driven”. This essay distinguishes between them based on the abstraction and generalisation capabilities of methods, and their ability to build mechanistic models from first principles, as scientists do.

One way to grasp the promise of AI in science is to understand its current limitations and challenges.

The limitation of scale is especially relevant to science. Current statistical ML approaches require large amounts of data, which are often unavailable in science. This is especially the case in areas remote from the social and economic sciences, in theoretical areas or in areas with a strong descriptive component (e.g. astrophysics or genetics).

As another challenge associated with scale, many data sources must be annotated and labelled to be useful. This is difficult for several reasons. First, it takes time and resources to label large databases by hand. Second, variation in the data in some areas of science may not allow generalisations and translation across fields.

The wide range of sizes of stars in the galaxy, for example, requires a large dataset before analyses can yield results with statistical significance. The same is true in health care, where data set-ups may involve only healthy individuals (e.g. from fitness or wellness applications) or only unhealthy individuals (e.g. in a hospital) but rarely both. This makes translating findings from one to another more difficult. By contrast, applications of ML in industry usually work with much less variable data; think, for instance, of data coming from sensors on an assembly line.

For specialised scientific databases, one of the main challenges is how to capture the data using a symbolic representation that can also help with calculations. Much of the mathematics of ML is based on operations on data arrays such as matrices. Consequently, symbols such as words, images and sounds can be recorded as computable matrices or vectors that computers can manipulate.
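As a minimal sketch of how symbols can be recorded as computable vectors and matrices, the snippet below encodes a few words as one-hot vectors. The toy vocabulary and the helper names `one_hot` and `encode` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy vocabulary; any fixed ordering of symbols would do.
VOCAB = ["star", "galaxy", "protein", "cell"]

def one_hot(word, vocab=VOCAB):
    """Return a vector with a 1 at the word's index and 0 elsewhere."""
    if word not in vocab:
        raise ValueError(f"unknown symbol: {word!r}")
    return [1 if w == word else 0 for w in vocab]

def encode(words, vocab=VOCAB):
    """Encode a sequence of words as a matrix (one row vector per word)."""
    return [one_hot(w, vocab) for w in words]

matrix = encode(["protein", "cell", "protein"])
```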

Representing data with symbols matters because they have “meaning” for computers. Symbols can be manipulated and dealt with in predictable ways. For example, they can be used to calculate distances between data features to define a similarity metric. Likewise, symbols can help modify an image to produce a larger training set showing the ways in which an object in an image could be coloured under different lighting.
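The two uses mentioned above can be sketched in a few lines. The example assumes Euclidean distance as the similarity metric and simple brightness scaling as the lighting change. Both are illustrative choices, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relight(image, factor):
    """Scale each pixel intensity (0-255) by `factor`, clipping at 255."""
    return [[min(255, round(p * factor)) for p in row] for row in image]

# A tiny 2x2 greyscale image and three lighting variants of it,
# which together form an enlarged training set.
image = [[10, 200], [120, 60]]
augmented = [relight(image, f) for f in (0.5, 1.0, 1.5)]
```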

While the Internet has provided businesses with millions of pictures of everyday objects and faces, scientific data are much rarer. Take the challenge of protein folding. This requires data, of course, but also the use of models to process vast matrices of data that could represent, for example, distances between molecules.

It is highly inefficient to learn about distances between molecules in proteins simply by learning basic patterns in data. This is because proteins are subject to external influences, such as thermodynamic effects, that can add “noise” and make patterns hard to find. What is required, in such cases, is symbolic representation and understanding of causation, which is a struggle for current approaches. This is where model-driven approaches could prove more useful and powerful, eventually replacing the work of scientists.

Model-driven methods can explain more observations with less training data just as human scientists do when deriving models from sparse data (Zenil et al., 2019). For instance, Newton and others derived the classical theory of gravitation from relatively few observations. ML approaches in scientific discovery have often combined statistical and symbolic approaches in a hybrid manner. The symbolic approach mostly comes from a human intervention, particularly to add symbols that represent generalisations and abstractions.

The abilities of human scientists to reason rationally, to do abstract modelling and to make logical inferences (deduction and abduction) are central to science. However, these abilities are handled poorly by the most popular approaches to AI (statistical ML and deep learning). Current AI involves mostly “black-box” techniques: methods that successfully accomplish a task but provide little to no insight or explanation of how they do so.

Most neural network approaches have this black-box character. Their inner workings reveal a mass of data correlations but no evident relationship to any abstract or physical states of a dynamic system in the real world (such as a weather pattern). The internal parameters of a neural network model do not directly correspond to any independent variables of the phenomenon the model is intended to capture (e.g. the features that describe a cat). The underlying mathematical and computational representation of the neural network after training does not directly correspond to any physical state-to-state behaviour of the objects learnt. This is also why simulation-based approaches (Kulkarni et al., 2020; Piprek, 2021; Lavin et al., 2021) are garnering greater attention: they force AI to be model- and state-to-state-relevant.

For science, the black-box nature of current AI is a major challenge. Scientific textbooks and literature typically pertain to first principles (foundational knowledge, axioms, scientific laws, etc.) and step-by-step models executable by, for example, a computer or a mechanical process. These are the quintessence of scientific explanation (King et al., 2018).

Current approaches to AI in science rely on domain experts to generate understanding, sometimes after statistical ML has helped shed light on the phenomena being investigated (Pinheiro, Santos and Ventura, 2021). For instance, after AI has been applied to drug discovery, domain experts may still need to interpret the results to understand the mechanisms of a drug’s effectiveness.

Science is more about accumulating knowledge and finding causal explanations than seeking to classify. Classification tasks are useful in industry (e.g. for movie or song recommendations), where current AI algorithms excel. While classification is an important first step in science, detecting meaningful regularities or irregularities has been foreign to statistical ML approaches, including deep learning.

The problem of bias (in the everyday non-technical sense) also affects human science. Indeed, bias in AI is a legacy of human science because AI is traditionally trained on a set of examples labelled by humans. For example, in using ML to categorise different types of astronomical images, humans might need to feed the system with a series of images they have already categorised and labelled. This would allow the system to learn the differences between the images. However, those doing the labelling might have different levels of competence, make mistakes and so on. AI could be used to detect and to some extent redress such biases.

One of the most trivial types of weakness of ML revolves around classification. ML or deep learning can correctly classify a large set of images. However, after just a single pixel is changed, a large number of those images is classified both wrongly and, at the same time, with a high degree of confidence (Wang et al., 2021).
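The effect can be reproduced in miniature. The sketch below is a deliberately contrived toy, not the attack studied by Wang et al. (2021): a linear classifier with assumed weights scores a four-pixel “image”, and changing one pixel flips a confident decision into a confident decision for the opposite class.

```python
import math

# Hypothetical weights of a toy linear classifier over a flattened 2x2 image.
# The large weight on the last pixel is contrived to make the effect visible.
WEIGHTS = [0.5, -0.5, 0.5, 8.0]
BIAS = -4.0

def confidence_bus(pixels):
    """Return the toy model's probability that the image shows a bus."""
    score = sum(w * p for w, p in zip(WEIGHTS, pixels)) + BIAS
    return 1.0 / (1.0 + math.exp(-score))

original = [1.0, 1.0, 1.0, 1.0]   # pixel intensities in [0, 1]
perturbed = list(original)
perturbed[3] = 0.0                # change a single pixel

p_before = confidence_bus(original)   # close to 1: confidently "bus"
p_after = confidence_bus(perturbed)   # close to 0: confidently "not bus"
```

Real networks are far larger, but the toy shows how a decision that rests on a few heavily weighted inputs can be overturned without the model's confidence dropping.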

Among other uses, generative adversarial networks (GANs) (Cai et al., 2021) have been used to mitigate this classification problem. GANs apply a type of neural network to reduce dependence on statistical weaknesses in ML. For example, GANs can generate new examples of an image that could plausibly have been drawn from an original dataset (of pixels).

However, GANs are limited. Producing too many modified examples of an image for training data can indeed make the image-identification system work. However, this is only because the system has been fed with so many possible examples of relevant images in the training data. In other words, GANs can be used to generate new training data – new images in this case – but do not yield a model immune to the same problematic pixel-flipping effect.

GANs can help produce quite realistic but fake image data (i.e. with a similar pattern distribution of pixels). This can enlarge the training set to make the network better at classification. However, they lead to a combinatorial explosion of images producible by changes in all possible combinations of pixels (not only one, but also two and their combinations, and then three and so on).
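The size of this combinatorial explosion can be estimated directly. The sketch below counts only which pixels are altered, not the values they take, so it is a lower bound. The 28 x 28 image size is an assumption made for illustration.

```python
from math import comb

def perturbation_count(n_pixels, max_changed):
    """Number of ways to choose between 1 and `max_changed` pixels to alter."""
    return sum(comb(n_pixels, k) for k in range(1, max_changed + 1))

n = 28 * 28  # assumed image size: 784 pixels
counts = {k: perturbation_count(n, k) for k in (1, 2, 3)}
```

Allowing up to three altered pixels already yields more than 80 million distinct pixel subsets for a single small image.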

Statistical ML operates differently from the human mind. GANs are not easily scalable because they follow a brute-force approach to a problem that humans can address more effectively. Humans do not need to be fed with all sorts of fake images with insignificant small changes to minimise margins of error. First, humans would be unable to think of all possible images produced by flipping all possible combinations of pixels as a GAN may do. Second, humans appear to operate in a fundamentally different way. Humans build abstract models of the world, which allow mental simulations on the fly of how an object can be modified. They can also generalise even if they have never encountered the same situation before. Humans do not need to drive millions of miles to pass their driving tests or to witness millions of counterexamples to know that hitting people on the road is a bad idea.

Think of a school bus. Humans know both its shape and its function. They are also more resilient than machines at identifying a school bus by its abstract properties. For example, they know it is a transportation system for children independent of its colour, shape or picture angle, the things that a statistical ML system will focus on.

Arithmetic operations provide another example of why statistical ML falls short of human reasoning. Learning to add two numbers does not work if the arithmetic operation and the concept of a number system are not “understood”. This is because one cannot feed a purely statistical network with enough examples of all possible sums between any two numbers.

It has recently been claimed that some systems, such as GPT-3, can synthesise the processes of addition, subtraction and multiplication by learning from examples (Brown et al., 2020). GPT-3 is a type of neural network that operates over vectors of words. It has been tested to see if learning from unstructured text could lead to some sort of deeper learning of basic arithmetic. Most tests were performed on only two- to three-digit numbers, with positive results.

However, it soon became apparent that GPT-3 was relying on previously seen examples of those exact operations, as any young child would do. There was no deeper “understanding” or generalisation of arithmetic. Thus, neural networks, of which GANs are one example, are an important step forward for ML. However, they also illustrate the fundamental limitations and challenges of ML in modelling the world, learning and generalising as humans do. Systems based on GPT-3, such as ChatGPT, are giant lookup tables crawled from and combined with what has already been written on the Internet. They match inputs and outputs in the form of ever-growing vectors of words (sentences). Their often remarkable capabilities can give a false impression of intelligence.

GPT-3 also has 175 billion parameters, which suggests possible overfitting (i.e. solving examples only because the AI had seen them all). This is different from how human intelligence works. There is no obvious reason why a natural language, unstructured model like GPT-3 should be good at a symbolic task such as learning arithmetic.

Statistical ML is limited in its capacity for symbolic systems. A convolutional network – a popular type of neural network used to classify images – comes up against two challenges. First, it must learn to recognise numbers (or to solve any simple arithmetic problem between any two numbers). Second, it must deal with a numerical positioning system such as the decimal.

Both challenges require training sets of large numbers of examples of, say, digits, numbers, or additions and subtractions. However, there is an infinite number of such examples (e.g. all possible arithmetical operations). No finite training set will ever cover this infinite universe of numbers and arithmetic operations.

No neural network can thus be trained over all possible arithmetic operations. Therefore, it cannot learn to add just from being shown a large set of examples of addition. It needs to be able to identify numbers and tell them apart from the relevant arithmetic symbols. In other words, there can be no successful attempt to train a neural network with a traditional statistical architecture to learn symbolic operations from numerical examples.

No matter how much data have been supplied, a neural network needs a “symbolic engine” – such as those in calculators that can deal with basic arithmetic. This illustrates the danger of the big data dogma – the belief that enlarging the training set will solve all learning challenges of a neural network.
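The contrast between memorising examples and applying a symbolic rule can be sketched as follows. The lookup table is a deliberately extreme caricature of pure memorisation, not a model of any actual neural network. The “symbolic engine” is schoolbook addition with carry over decimal digits, assuming non-negative integers. All function names are hypothetical.

```python
def train_lookup(max_operand):
    """Memorise every sum a + b with 0 <= a, b <= max_operand."""
    return {(a, b): a + b
            for a in range(max_operand + 1)
            for b in range(max_operand + 1)}

def lookup_add(table, a, b):
    """Return the memorised answer, or None for a pair never seen."""
    return table.get((a, b))

def symbolic_add(a, b):
    """Schoolbook addition on decimal digits with carry (non-negative ints)."""
    x, y = str(a)[::-1], str(b)[::-1]   # least significant digit first
    digits, carry = [], 0
    for i in range(max(len(x), len(y))):
        total = carry
        total += int(x[i]) if i < len(x) else 0
        total += int(y[i]) if i < len(y) else 0
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return int("".join(reversed(digits)))

table = train_lookup(99)                # 10 000 memorised examples
seen = lookup_add(table, 12, 30)        # inside the training range
unseen = lookup_add(table, 100, 1)      # outside it: no answer
general = symbolic_add(999999, 1)       # the rule needs no examples
```

However large the table is made, some pair of operands lies outside it, whereas the digit-and-carry rule covers every pair with a few lines of symbol manipulation.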

For computer scientists, “loss function” refers to the distance between the prediction of an AI and the factual truth. Zenil, Kiani and Tegnér (2017) showed that the limitations of loss functions based on statistical measures, such as ones widely used in deep learning, can always be exploited. This results from the lack of “invariance results”.
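A common statistical loss of this kind is the mean squared error, sketched below as one illustrative choice among many. It is not the specific measure analysed by Zenil, Kiani and Tegnér (2017).

```python
def mean_squared_error(predictions, targets):
    """Average squared gap between predictions and ground-truth values."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

loss_good = mean_squared_error([1.0, 2.0], [1.0, 2.0])  # perfect prediction
loss_bad = mean_squared_error([1.0, 2.0], [1.0, 4.0])   # one value off by 2
```

The loss compares numbers only. It carries no information about whether the model that produced the predictions captures the mechanism behind the data.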

Invariance here means that the representation of an object has no bearing on the ability of a system to recognise and identify it by its most salient properties. For example, in geometry, a shape remains the same under rigid transformations such as rotation, translation or reflection. Thus, an ML system should recognise objects regardless of how they are depicted. However, as noted earlier, statistical ML is highly sensitive to small changes, even in just a few pixels.
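The geometric case can be illustrated with a small sketch. It assumes a shape described by a list of 2D points and uses the sorted list of pairwise distances as an invariant descriptor. The descriptor and helper names are illustrative choices, not a method proposed in this essay.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Sorted distances between all pairs of points."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))

def rotate90(points):
    """Rotate every point by 90 degrees about the origin."""
    return [(-y, x) for x, y in points]

def translate(points, dx, dy):
    """Shift every point by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in points]

def reflect(points):
    """Mirror every point across the vertical axis."""
    return [(-x, y) for x, y in points]

triangle = [(0, 0), (3, 0), (0, 4)]
moved = reflect(translate(rotate90(triangle), 5, -2))

same_shape = pairwise_distances(triangle) == pairwise_distances(moved)
same_coordinates = triangle == moved
```

The raw coordinates change completely while the descriptor does not. A system that compares raw coordinates, or raw pixels, sees two different objects where the descriptor sees one.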

As one of its main achievements, ML has enlarged the set of transformations it can accommodate when facing new data. For example, it recognises a dog as a dog even if all previously seen images of dogs have been of other breeds. However, that set of transformations remains small and rigid.

Neural networks have often been credited with some degree of invariance. However, their invariance is often not robust or is the result of features alien to the object of invariance. This is what GANs and changes in single pixels show. Changing a single pixel (often referred to as an “attack”, even if not deliberate) can undermine a network’s ability to recognise objects. The reflection of light off any part of a school bus, for example, may make a neural network classify it as a fire truck.

AI for science should go beyond a focus on the size of data (or big data). It should also devote more resources to developing the methodological framework most relevant to the AI needed for the specific domain in the process of scientific discovery (Zenil, 2017). Two examples are explored below.

AlphaFold 2, the Google DeepMind approach to the scientific problem of protein folding, has greatly advanced prediction in the field. There is a debate, however, over how much AI is responsible for this accomplishment. Human designers, for example, decided how to represent the problem and the ML’s major processing steps. Meanwhile, the domain-expert team deployed their knowledge of protein structure and statistical ML to accomplish the task.

In the domain of self-driving cars, companies compete over how many millions of miles their cars have driven unaided. However, the right measure should be how few miles they need to drive in order to infer and understand the basic rules of driving (e.g. not to hit a pedestrian).

This problem of attribution and misaligned metrics (with autonomous agency) is not exclusive to AlphaFold 2 or self-driving cars. There is generally a close relationship between the choice of statistical model and ML engineers’ pre-existing knowledge of the underlying structure of the data, and their own biases regarding how to deal with such data. Consequently, the contribution of domain-expert teams and statistical ML is generally intertwined.

Some model-driven approaches involve a technique known as “symbolic regression”. Simply stated, this means the capability to manipulate symbols (unlike simple statistical regression). For instance, Udrescu and Tegmark (2020) use a library of equations so the AI system will find an equation that fits the observational data. While this approach has generated interesting results, it may suggest the underlying method is actually symbolic; in fact, it still has a strong classification component.
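The library-based idea can be sketched in a heavily simplified form. The code below is not the AI Feynman method of Udrescu and Tegmark (2020). It assumes a hypothetical four-entry library of one-parameter laws, fits each by least squares and returns the best match. The data are synthetic, generated from an inverse-square law.

```python
# Hypothetical library: each entry maps a name to a basis function g(x);
# the candidate law is y = c * g(x) with one free constant c.
LIBRARY = {
    "y = c * x": lambda x: x,
    "y = c * x**2": lambda x: x ** 2,
    "y = c / x": lambda x: 1.0 / x,
    "y = c / x**2": lambda x: 1.0 / x ** 2,
}

def fit_constant(g, xs, ys):
    """Closed-form least-squares value of c for y = c * g(x)."""
    num = sum(g(x) * y for x, y in zip(xs, ys))
    den = sum(g(x) ** 2 for x in xs)
    return num / den

def best_equation(xs, ys):
    """Return (name, c, squared_error) of the best-fitting library entry."""
    results = []
    for name, g in LIBRARY.items():
        c = fit_constant(g, xs, ys)
        err = sum((c * g(x) - y) ** 2 for x, y in zip(xs, ys))
        results.append((err, name, c))
    err, name, c = min(results)
    return name, c, err

# Three synthetic observations generated from y = 2 / x**2.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 0.5, 0.125]
name, c, err = best_equation(xs, ys)
```

Three observations suffice here only because the correct form is already in the library. The search reduces to choosing among pre-written candidates, which is the classification component noted above.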

The most interesting system of this kind would be one where the library does not exist; with an existing library of equations, the problem becomes largely one of matching and classifying. Some research groups are trying to combine the worlds of statistical ML and symbolic regression. Statistical ML is best at representing and classifying data numerically, while symbolic computation excels at inference and rule-based reasoning.

No matter how abundant the supply of data, the problem of understanding and transfer learning (generalisation) cannot be solved simply by applying ever-more powerful statistical computation. Too little attention and research effort, and too few conference venues, journals and funds, are available to AI approaches that differ from statistical ML and deep learning. This is a consequence of the dominant role of some academic actors and corporate AI research and development that are now almost one and the same.

A return to first principles is needed to work out how to generate some sort of equivalence to a mental model of the subjects that science is exploring. Indeed, a distinctive feature of human intelligence is the ability to take just a small fraction of a potentially infinite number of cases of something and comprehend that thing with a mental model. AI systems for science need similar abilities.


Brown, T.B. et al. (2020), “Language models are few-shot learners”, Advances in Neural Information Processing Systems, Vol. 33/159, pp. 1877-1901.

Cai, Z. et al. (2021), “Generative adversarial networks: A survey toward private and secure applications”, ACM Computing Surveys (CSUR), Vol. 54/6, pp. 1-38.

King, R.D. et al. (2018), “Automating sciences: Philosophical and social dimensions”, IEEE Technology and Society Magazine, Vol. 37/1, pp. 40-46.

Kulkarni, S. et al. (2020), “Accelerating simulation-based inference with emerging AI hardware”, in 2020 International Conference on Rebooting Computing (ICRC), pp. 126-132.

Lavin, A. et al. (2021), “Simulation intelligence: Towards a new generation of scientific methods”, arXiv, arXiv:2112.03235 [cs.AI].

Pinheiro, F., J. Santos and S. Ventura (2021), “AlphaFold and the amyloid landscape”, Journal of Molecular Biology, Vol. 433/20:167059.

Piprek, J. (2021), “Simulation-based machine learning for optoelectronic device design: Perspectives, problems, and prospects”, Optical and Quantum Electronics, Vol. 53/4, pp. 1-9.

Udrescu, S.M. and M. Tegmark (2020), “AI Feynman: A physics-inspired method for symbolic regression”, Science Advances, Vol. 6/16, p. 4.

Wang, P. et al. (2021), “Detection mechanisms of one-pixel attack”, Wireless Communications and Mobile Computing, Vol. 2021.

Zenil, H. (2020), “A review of methods for estimating algorithmic complexity: Options, challenges, and new directions”, Entropy, Vol. 22/6, p. 612.

Zenil, H. (2017), “Algorithmic data analytics, small data matters and correlation versus causation”, in Berechenbarkeit der Welt? Philosophie und Wissenschaft im Zeitalter von Big Data (Computability of the World? Philosophy and Science in the Age of Big Data), Ott, M., W. Pietsch and J. Wernecke (eds.), pp. 453-475, Springer Verlag.

Zenil, H., N.A. Kiani and J. Tegnér (2017), “Low-algorithmic-complexity entropy-deceiving graphs”, Physical Review E, Vol. 96/1:012308.

Zenil, H. et al. (2019), “Causal deconvolution by algorithmic generative models”, Nature Machine Intelligence, Vol. 1, pp. 58-66.


© OECD 2023
