Lessons from shortcomings in machine learning for medical imaging

G. Varoquaux
Institut national de recherche en sciences et technologies du numérique (INRIA)
V. Cheplygina
IT University of Copenhagen

The application of machine learning (ML) to medical imaging has attracted much attention in recent years. Yet, for various reasons, progress remains slow. This essay builds upon earlier work by the authors exploring how larger datasets and a growing number of deep-learning algorithms have not yet delivered practical improvements on clinical problems. It recommends how researchers and policy makers can improve the situation.

Many opportunities exist to improve patients’ health by applying ML to medical imaging. Through computer-aided diagnosis, for example, an algorithm is trained on existing images such as brain scans of people with and without dementia. It is later applied to unseen images to predict which group they likely belong to. There are now numerous reports of ML algorithms recognising medical images more accurately than human experts (for an overview see Liu et al., 2019).

Despite this potential, incentives in ML research are slowing progress in the field. For example, the impact on clinical practice has not been proportional to the claims made. Roberts et al. (2021) reviewed 62 published studies on ML for COVID-19 and found that none had potential for clinical use. Systematic reviews of other clinical applications of ML have likewise failed to find reliable published prediction models, for example for prognosis after aneurysmal subarachnoid haemorrhage (Jaja et al., 2013) and stroke (Thompson et al., 2014).

Table 1 describes key concepts, some of which are used differently across communities. The following sections summarise examples of this lack of progress; the essay then provides recommendations for researchers and policy makers on how to move forward.

The increased popularity of ML in recent years is often explained by two developments. First, larger datasets are available. Second, deep-learning techniques permit the development of algorithms without specialised domain knowledge, opening the field to more researchers. However, the state of ML in medical imaging is not as positive as many believe, for the three reasons noted below.

There is a tendency to expect that a clinical task can be “solved” if the dataset is large enough. After all, prior research shows that large and diverse datasets help an algorithm generalise better to previously unseen data. There are several problems here. First, not all clinical tasks translate neatly into ML tasks. Second, creating ever-larger datasets often relies on automatic methods that can introduce errors and bias into the data (Oakden-Rayner, 2020). For example, a machine might label x-rays as showing the presence or absence of pneumonia based on words appearing in the associated radiology reports. In such a case, a phrase like “no history of pneumonia” might result in an x-ray wrongly labelled as showing pneumonia.
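The failure mode above can be made concrete with a minimal sketch. The labelling function, example reports and labels below are all hypothetical, and real label-extraction pipelines are more sophisticated, but the same negation problem applies.

```python
# Minimal, hypothetical sketch of keyword-based label extraction from
# radiology reports, showing how a negated mention is mislabelled.

def naive_label(report: str) -> int:
    """Label an x-ray as pneumonia-positive (1) if 'pneumonia' appears."""
    return 1 if "pneumonia" in report.lower() else 0

reports = [
    "Findings consistent with pneumonia in the right lower lobe.",
    "Clear lungs. No history of pneumonia.",   # negated mention
    "Normal chest radiograph.",
]

labels = [naive_label(r) for r in reports]
print(labels)  # → [1, 1, 0]: the second report is wrongly labelled positive
```

At scale, such systematic errors do not average out; they bias both the training labels and the evaluation of any model trained on them.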

Finally, while large datasets improve algorithm training and generalisation, they also allow for better evaluation of algorithms, because more data are available to estimate performance on previously unseen future data precisely. An analysis of predictions of Alzheimer’s disease across six surveys and more than 500 publications (Figure 1) shows that studies with larger sample sizes tend to report worse prediction accuracy. This is worrying, since these larger studies are closer to real-life settings.

A lot of research within medical imaging focuses on algorithm development, but the practical benefits of the reported accuracy gains are not always clear. For this essay, the authors studied eight medical imaging competitions on Kaggle, a platform where algorithm developers compete to solve classification tasks, and where winning can bring significant rewards. Indeed, the most famous competition, on lung cancer prediction, had a prize of USD 1 million. The analysis compared two quantities: 1) the gap between the performance of the top algorithms; and 2) the expected variability in performance if a different subset of the data were used for evaluation. In other words, it asked how meaningful the final ranking is: would the ranking of the winners change if other images were used, from the same or a different subset of the data? In most cases, the performance of the top algorithms is within the expected variability, so the algorithms are not practically better or worse than one another (Varoquaux and Cheplygina, 2022).
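The comparison of these two quantities can be sketched with a bootstrap over the evaluation set. The data below are synthetic (not the Kaggle results): two hypothetical models with near-identical true accuracy are scored on the same test set, and resampling shows how much their accuracy gap fluctuates by chance.

```python
import random

# Synthetic sketch: is the gap between two models' accuracies larger than
# the variability induced by resampling the evaluation set (a bootstrap
# over test images)? Data and accuracies here are illustrative.

random.seed(0)
n = 1_000                                   # hypothetical test-set size
truth = [random.random() < 0.5 for _ in range(n)]
# two models, each correct on ~85% of items
model_a = [t if random.random() < 0.85 else not t for t in truth]
model_b = [t if random.random() < 0.85 else not t for t in truth]

def accuracy(pred, y):
    return sum(p == t for p, t in zip(pred, y)) / len(y)

gap = abs(accuracy(model_a, truth) - accuracy(model_b, truth))

# Bootstrap: resample test items with replacement, recompute the gap.
gaps = []
for _ in range(200):
    idx = [random.randrange(n) for _ in range(n)]
    acc_a = sum(model_a[i] == truth[i] for i in idx) / n
    acc_b = sum(model_b[i] == truth[i] for i in idx) / n
    gaps.append(acc_a - acc_b)

spread = (max(gaps) - min(gaps)) / 2
print(f"observed gap: {gap:.3f}, bootstrap spread: ~{spread:.3f}")
# If the observed gap lies within the bootstrap spread, the leaderboard
# ranking of the two models is not meaningful.
```

This is a simplification of the analysis in Varoquaux and Cheplygina (2022), which uses the competitions' own metrics rather than plain accuracy.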

Deep-learning studies are computationally intensive, and several ML studies have noted how this affects who gets to do research. A method may win just because more computational resources were available (Hooker, 2020). Meanwhile, the representation of prestigious labs and tech companies at conferences is increasing (Ahmed and Wahed, 2020). At a large medical imaging conference – MICCAI 2020 – only 2% of accepted papers were from underrepresented regions (Africa, Latin America, South/South-East Asia and the Middle East) (MICCAI Society, 2021). However, the need for medical AI might be even greater in these regions.

Researchers concerned with these questions can already do a number of things, especially those organising conferences, editing journals or reviewing papers.

It may not always be feasible to collect more data. However, it is important to understand the limitations of the data that are available, such as the sample size and the characteristics of different patient groups. Datasets should therefore be accompanied by a report of their characteristics, as well as the potential implications for models trained on them. Such a practice would be similar to providing “model cards”: short documents that accompany a trained ML model and detail benchmarked model performance under different conditions (Mitchell et al., 2019).
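What such a data-characteristics report might contain can be sketched as a simple structured record. The field names and values below are hypothetical illustrations, not a standard schema; in practice the report could be a document rather than code.

```python
# Hypothetical sketch of a data-characteristics report accompanying a
# dataset, in the spirit of the model cards of Mitchell et al. (2019).
# All field names and values are illustrative.

dataset_report = {
    "name": "example-chest-xray",        # hypothetical dataset
    "sample_size": 12_000,
    "label_source": "automatic extraction from radiology reports",
    "known_label_noise": "negated mentions may be mislabelled",
    "patient_groups": {"female": 0.48, "male": 0.52},
    "acquisition": "single site, single scanner model",
    "intended_use": "research benchmarking only",
    "caveats": ["not validated for clinical use", "age range 40-90"],
}

for key, value in dataset_report.items():
    print(f"{key}: {value}")
```

Making such fields explicit lets downstream users judge whether a model trained on the data is plausible for their population.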

Benchmarking the performance of algorithms alone is not sufficient to advance the field. Papers focusing on understanding, replication of earlier results and so forth are also valuable. If benchmarking the performance of algorithms is deemed essential in a publication, comparisons need to include both recent-and-competitive and traditional-yet-effective methods.

Furthermore, comparisons need to consider the range (rather than a single estimate) of each method’s performance. Ideally, they should use multiple, well-motivated metrics and statistical procedures (Bouthillier et al., 2021). More real-life effects of an algorithm might also be considered. This might include, for example, its carbon footprint, or how it affects the people it was designed to help (Thomas and Uminsky, 2020).

Many want to believe that publishing a novel algorithm with state-of-the-art results is the only way to create impact, but such results may be overly optimistic. Psychology has adopted registered reports, in which a planned study is reviewed and accepted for publication before any experiments are run. More widespread adoption of this practice could reduce publication bias, since “negative results” would also be published. From an institutional perspective, one could support different types of papers focusing on different forms of insight, such as replications or retrospective analyses of methods, and incentivise and reward such practices (e.g. through research funding and hiring decisions).

As research positions and funding are often tied to the output of publications, researchers have strong incentives to optimise for publication-related metrics. With the additional focus on achieving novelty and state-of-the-art results, the publication of papers using methods that are over-engineered but under-validated is perhaps not surprising. While some researchers might choose to opt out of this dynamic and/or try to change things, many in less secure positions may pursue publication-related metrics to benefit their career. It is therefore important that external incentives are created to speed up the change towards methods with greater validation.

Several of the current problems stem from the way researchers are evaluated when applying for academic positions or for research funding. The focus on metrics like the h-index should be reduced in favour of other practices, such as an evaluation of five selected publications. Such a shift could reduce the pressures that lead to publication of research with diminishing returns. The need for new approaches also holds when evaluating researchers based on previously acquired funding, which can propagate existing biases.

Funding should focus less on perceived novelty, and more on rigorous evaluation practices. Such practices could include evaluation of existing algorithms, replication of existing studies and prospective studies. This would provide more realistic evaluations of how algorithms might perform in practice. Ideally, such funding schemes should be accessible to early career researchers, for example, by not requiring a permanent position at application.

It should be more attractive to work on curated datasets and open-source software that everybody can use. It is difficult to acquire funding for such projects, and often to publish from them; many team members are therefore volunteers. This creates biases against groups that are already underrepresented but that might have innovative ideas vital to the field. Such groups could include, for example, women who take on a greater share of household responsibilities, or researchers in lower-income countries who cannot afford unpaid work. More regular funding, and consequently more secure positions, would improve on the status quo.

This essay has presented insights on a number of problems that may be slowing the progress of ML in medical imaging. These insights are based on both a review of the literature and the authors’ previous analysis. In summary, not everything can be solved by having larger datasets and by developing more algorithms. The focus on novelty and state-of-the-art results creates methods that often do not translate into real improvements. The essay proposes a number of strategies to address this situation, both within the research community and at the level of research policy. Given the huge efforts invested in AI research, failure to address these issues could mean significant waste.


Ahmed, N. and M. Wahed (2020), “The de-democratization of AI: Deep learning and the compute divide in artificial intelligence research”, arXiv, preprint arXiv:2010.15581, https://doi.org/10.48550/arXiv.2010.15581.

Bouthillier, X. et al. (2021), “Accounting for variance in machine learning benchmarks” in Proceedings of Machine Learning and Systems, Vol. 3, pp. 747-769, https://proceedings.mlsys.org/paper/2021/hash/cfecdb276f634854f3ef915e2e980c31-Abstract.html.

Hooker, S. (2020), “The hardware lottery”, arXiv, preprint arXiv:2009.06489, https://doi.org/10.48550/arXiv.2009.06489.

Jaja, B.N. et al. (2013), “Clinical prediction models for aneurysmal subarachnoid hemorrhage: A systematic review”, Neurocritical Care, Vol. 18/1, pp. 143-153, https://doi.org/10.1007/s12028-012-9792-z.

Liu, X. et al. (2019), “A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis”, The Lancet Digital Health, Vol. 1/6, pp. e271-e297, https://doi.org/10.1016/S2589-7500(19)30123-2.

MICCAI Society (2021), “MICCAI Society News”, 18 August, MICCAI Society, www.miccai.org/news/.

Mitchell, M. et al. (2019), “Model cards for model reporting”, in FAT* '19: Proceedings of the Conference on Fairness, Accountability and Transparency, pp. 220-229, https://doi.org/10.1145/3287560.3287596.

Oakden-Rayner, L. (2020), “Exploring large-scale public medical image datasets”, Academic Radiology, Vol. 27/1, pp. 106-112, https://doi.org/10.1016/j.acra.2019.10.006.

Roberts, M. et al. (2021), “Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans”, Nature Machine Intelligence, Vol. 3/3, pp. 199-217, https://doi.org/10.1038/s42256-021-00307-0.

Thomas, R. and D. Uminsky (2020), “The problem with metrics is a fundamental problem for AI”, arXiv, preprint arXiv:2002.08512, https://doi.org/10.48550/arXiv.2002.08512.

Thompson, D. et al. (2014), “Formal and informal prediction of recurrent stroke and myocardial infarction after stroke: A systematic review and evaluation of clinical prediction models in a new cohort”, BMC Medicine, Vol. 12/1, pp. 1-9, https://doi.org/10.1186/1741-7015-12-58.

Varoquaux, G. and V. Cheplygina (2022), “Machine learning for medical imaging: Methodological failures and recommendations for the future”, npj Digital Medicine, Vol. 5, Article 48, https://doi.org/10.1038/s41746-022-00592-y.

Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

© OECD 2023

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at https://www.oecd.org/termsandconditions.