1. Data for science, technology and innovation: Definitions, scope and objectives

Abstract

This chapter introduces the definitions, scope and objectives of the report.

It starts by setting out the overall significance of data for private-sector innovation, scientific research and society at large. A primer on enhanced access to data then provides a definition, describes the opportunities arising from enhanced access to publicly funded data for science, technology and innovation, and presents the rationale in favour of open access to data. The chapter concludes with the objectives and structure of the report.

    

Significance and scope of data for STI

As near-real-time analysis accelerates knowledge and value creation across society, data are seen as a key resource of the knowledge economy. Data-intensive scientific discovery is transforming the field of science, technology and innovation (STI), and data-driven innovation is transforming society through its far-reaching effects on resource efficiency, productivity and competitiveness. It also helps address many global challenges, such as climate and demographic changes, and scarce resources. In this context, text and data mining has become a prominent issue. This technique allows valuable knowledge and information to be extracted from large digital datasets, but in many countries its use is restricted by personal privacy rules and intellectual property rights (IPRs) (OECD, 2015a).

Overall significance of data to private-sector innovation, scientific research and society

Private-sector innovation

As companies obtain detailed, timely and comprehensive information about their customers, processes and employees, data themselves become a key driver of innovation (OECD, 2015a). The availability of data triggers the emergence of new products and services, such as location-based services on smartphones or emerging home automation applications based on Internet of Things (IoT) devices. Data about consumers have revolutionised marketing techniques by allowing personalised marketing, experienced daily on social networks. Data flows facilitate the establishment and operation of global value chains, driving organisational innovation. In a 2014 survey, corporate chief executive officers stated that big data would increase their operational efficiency (51%); inform strategic direction (36%); improve customer service (27%); help identify and develop new products and services (24%); and enhance customer experience (20%) (Philip Chen and Zhang, 2014).

Private actors are increasingly aware of the potential of data within the broader category of knowledge-based capital:¹ in 13 OECD member countries, companies invest more in knowledge-based capital than in physical capital (OECD, 2017a).

Scientific research

In the research sector, scientific disciplines have become increasingly data-driven thanks to the development of data-acquisition, -storage and -analysis capabilities. Scientific data are very diverse: they include observational data, which record natural phenomena (in fields such as astronomy, geoscience and demography); experimental data, which record the outcomes of man-made experiments, such as laboratory experiments in physics, chemistry and biology, or clinical trials; computational data, which are generated through large-scale simulations; and reference data, which are highly curated datasets, such as the human genome. Simulation is used to generate data based on theoretical predictions; the results are compared to actual experiments, to verify the validity of theoretical concepts and adjust them accordingly. The development of artificial intelligence (AI) will increasingly enable algorithms to detect patterns by themselves, but requires well-tended data to be trained (OECD, 2015a). In the science and technology sector, open access to data is commonly linked to open science (OECD, 2015b).

Data-intensive science is seen as the fourth paradigm (Box 1.1). Traditional science uses human intelligence to create theoretical models, which are then compared to experimental observations. Computational science uses data as the model and seeks patterns humans may not be able to detect (OECD, 2018a). Big data have already boosted the pace of discovery in disciplines such as astronomy, high-energy physics and genomics (Gordon Bell, 2009). They are spreading to environmental and health sciences (Hey, Tansley and Tolle, 2009) and are accelerating the discovery of new materials; they were even instrumental in a biochemistry discovery by Karplus, Levitt and Warshel that led to the 2013 Nobel Prize (Towns et al., 2014). Big data have the potential to make social sciences more predictive and deterministic, for example by transforming sociology into a “hard” science. The ability to exploit ubiquitous data on human behaviour – made available through tracking personal-smartphone use – could result in considerable hard data. These could enable sociologists to develop deterministic laws of human behaviour analogous to the laws of physics – thereby establishing “social physics”, which could guide policy making by predicting human reactions to specific reforms (Pentland, 2014).

Box 1.1. Science paradigms

Data-intensive scientific discovery as the fourth paradigm of science

  1. First paradigm (since Antiquity): empirical science, describing natural phenomena.

  2. Second paradigm (since the Renaissance): theoretical science, constructing models and generalisations in order to establish predictions.

  3. Third paradigm (20th century): simulations of complex phenomena.

  4. Fourth paradigm (today): data exploration/eScience

    • data captured by instruments or generated by simulator

    • processed by software

    • information/knowledge stored in computer

    • scientist analyses database/files using data management and statistics.

Source: Hey, Tansley and Tolle (2009), “The fourth paradigm: Data-intensive scientific discovery”, https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf.

Society

Data also create spillover effects and positive externalities, such as socialisation and behavioural change, cultural and scientific exchange, and greater levels of trust induced by transparency (OECD, 2015a).

The significance of data for society will undoubtedly further increase over the next decade. The volume of data produced globally amounted to 16 zettabytes (ZB) in 2016 and is projected by International Data Corporation to grow to 163 ZB by 2025. This exponential growth will be driven by the following trends:

  • The evolution of data from business background to life-critical: 20% of the data produced in 2025 will be critical to life.

  • Embedded systems and the IoT will multiply both volumes and flows of data: by 2025, a connected person will interact with connected devices 400 times per day.

  • Mobile and real-time data will grow: by 2025, 25% of data will be produced in real time.

  • Cognitive/AI systems will change the landscape, enabling real-time analysis and decision-making.

  • The need for digital security will increase (Reinsel, Gantz and Rydning, 2017).
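As a back-of-the-envelope check, the two volumes cited above (16 ZB in 2016 and a projected 163 ZB in 2025) imply a compound annual growth rate of roughly 29%. The short calculation below is illustrative only: the function name is an assumption of this sketch, and it assumes smooth year-on-year compounding, which the source does not state.

```python
def compound_annual_growth_rate(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must all be positive")
    return (end / start) ** (1 / years) - 1


# Figures cited in the text: 16 ZB (2016) and a projected 163 ZB (2025).
rate = compound_annual_growth_rate(start=16, end=163, years=2025 - 2016)
print(f"Implied growth rate: {rate:.1%} per year")
```

At that pace, the global data volume would roughly double every three years.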

The importance of AI is also expected to grow significantly. In pharmaceuticals, AI is set to become the primary drug-discovery tool by 2027. In materials, AI systems can use historical data to radically shorten the time needed to discover new industrial materials. In science, AI could enable novel types of discovery, based on strengths and weaknesses that complement the capabilities of human scientists (OECD, 2018a). Access to well-managed data is a key enabler of this development, as training algorithms requires large amounts of data. In image recognition, for example, AI by Microsoft and Google has achieved human-level performance after training on 1.2 million labelled images. Even though progress is expected to create less data-hungry algorithms, the progress of AI will continue to rely on large quantities of data (Simonite, 2016).

Scope of data covered: Public data for STI

The data used for science, technology and innovation (STI) fall under three broad categories:

  1. public-sector information (PSI) as a broad category of information produced, curated and managed by or for government entities (Box 1.2)

  2. data from publicly funded research (including data from citizen science)

  3. privately owned or commercial data.

It is important to note that the distinction between PSI and publicly funded research data is often unclear. PSI is broadly defined as “information, including information products and services, generated, created, collected, processed, preserved, maintained, disseminated, or funded by or for the Government or public institution” (Box 1.2), and may include information stemming from higher education institutions and public research organisations (Table 1.1).

The recent trend is to encompass data from publicly funded research within PSI. For example, even though the original PSI Directive of the European Commission did not address research data, it has recently been extended to cover data “resulting from” publicly funded research (European Commission, 2019). The argument for this approach is that publicly funded research produces high-value datasets using public monies and should therefore be treated according to the same principles as government data.

On its data portal,² the US Government aggregates access to all (not only science-related) government open-data resources in one location; it provides tools and resources to conduct research, develop web and mobile applications, design data visualisations and track metrics about data usage.

However, the nature of financing can raise some issues, since most, but not all, scientific research is publicly funded. In the case of public-private partnerships, clear rules should be defined to preserve the interests of private partners. The case study contributed by Korea, for example, reports that some research data is released as government data, while some is disclosed as public-research outputs (Shin, 2018).

Research data themselves are not always defined consistently. The term can mean either data “resulting from” research or data “used for” research; these are very different sets, since researchers clearly use data that go beyond those produced by research itself (Figure 1.1). The definition employed in the 2006 OECD Recommendation of the Council concerning Access to Research Data from Public Funding (OECD, 2006) distinguishes the broader category of “research data”, which are “used as primary sources for scientific research”, from “research data from public funding”, defined as research data “obtained from research conducted by government agencies or departments, or conducted using public funds provided by any level of government” (Box 1.2).

Figure 1.1. Variations of definitions for “research data”

Notes: PSI = public-sector information. The light blue box represents “data for publicly funded research”; the orange box represents “data from publicly funded research”. The light orange box includes data from public-private partnerships.

Box 1.2. Definition

Research data

In the context of the OECD Recommendation of the Council Concerning Access to Research Data from Public Funding (OECD, 2006), “research data” are defined as factual records (numerical scores, textual records, images and sounds) used as primary sources for scientific research and that are commonly accepted in the scientific community as necessary to validate research findings. A research data set constitutes a systematic, partial representation of the subject being investigated.

This term does not cover the following: laboratory notebooks, preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or personal communications with colleagues or physical objects (e.g. laboratory samples, bacteria strains and test animals, such as mice). Access to all of these products or outcomes of research is governed by different considerations than those dealt with here.

This Recommendation principally concerns research data in a digital, computer-readable format. It is indeed in this format that the greatest potential lies for improvements in the efficient distribution of data and their application to research because the marginal costs of transmitting data through the Internet are close to zero. The Principles within the Recommendation could also apply to analogue research data in situations where the marginal costs of giving access to such data can be kept reasonably low.

Research data from public funding

Research data from public funding is defined as the research data obtained from research conducted by government agencies or departments, or conducted using public funds provided by any level of government.

Given that the nature of “public funding” of research varies significantly from one country to another, the Recommendation recognises that such differences call for a flexible approach to improving access to research data (OECD, 2006).

PSI

PSI is broadly defined as “information, including information products and services, generated, created, collected, processed, preserved, maintained, disseminated, or funded by or for the Government or public institution” (OECD, 2008). Table 1.1 describes the different components of PSI.

In addition, there is a lack of consensus about the relevant depth to which research data should be made open. A first category concerns data directly underpinning the scientific results published in journals – access to this information is critical for the reproducibility of the scientific findings. Beyond that first level, there are further layers of data and intermediate results, all the way to the raw data which were initially collected, and the hypotheses which have guided the data collection.

Such a decision is also linked to the availability of the algorithms and workflows needed to analyse the data. Raw data are difficult to reuse if the analysis software is not disclosed at the same time. This in turn raises the question of whether other researchers have the capacity to master the overall workflow and reproduce the final result. A sophisticated example was provided by CERN, which offered the possibility to re-discover the Higgs Boson using a simplified version of the original datasets, together with adequate software to analyse them (Jomhari, Heiser and Bin Annuar, 2017). The European Commission Open Research Pilot requires publication of the data needed to validate the results presented in scientific publications; other data can also be included by the beneficiaries on a voluntary basis.

The use of PSI for innovation is well established. A recent meta-analysis shows that innovation is the most prominent destination of open-government data utilisation. This applies to both business-driven innovation aiming to create economic value and innovation in public services (Safarov, Meijer and Grimmelikhuijsen, 2017); 73% of respondents to a European public consultation agreed that PSI increasingly provided a basis for innovative services and products (European Commission, 2017).

Respondents to a 2017 survey by the OECD Committee for Scientific and Technological Policy (CSTP) were asked to evaluate the relevance of broader sources of PSI to scientists. Figure 1.2 summarises the survey results; it shows that 74% of respondents assessed the relevance of PSI to public research as “high” or “very high”.

Figure 1.2. Relevance of PSI to public research

Source: Survey results from OECD and partner delegations.

 StatLink https://doi.org/10.1787/888934112519

Table 1.1. PSI components

| Type of data | Examples |
| --- | --- |
| Geographic information | Maps, spatial and topographic data, cadastre, boundaries |
| Meteorological and environmental information | Meteorological, hydrographic, atmospheric, oceanographic, environmental-quality data |
| Economic and business information | Financial, company, industry and trade information |
| Social information | Demographic, health, education, labour data, attitude surveys |
| Traffic and transport information | Transport networks, transport and traffic, vehicle-registration data |
| Tourism and leisure information | Tourism statistics, hotel and entertainment data |
| Agriculture, farming, forestry and fisheries | Cropping/land use, farm incomes, fish farming and livestock data |
| Natural-resource information | Energy and natural-resource stock and consumption, biodiversity and geological data |
| Legal information | Crime/conviction data, legislation, jurisprudence, patent and trademark data |
| Scientific-research information | Information stemming from higher education institutions and public research organisations |
| Educational content | Academic papers and studies, lecture materials |
| Political content | Government press releases, proceedings, green papers |
| Cultural content | Museum material, archaeological sites, library resources, public archives |

Source: OECD (2006), Recommendation of the Council concerning Access to Research Data from Public Funding, https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0347.

Open-government data are increasingly used as inputs to scientific research. A pioneering empirical study found that researchers use open-government data from 96 different open-government portals – principally the UK and US government portals, but also a number of open-government portals in emerging countries, such as India and Kenya (Yan and Weber, 2018).

Governments are both producers and users of data; these can be used by the governments themselves, or by researchers, businesses and citizens. Governments can also use big-data techniques to understand and predict citizens’ behaviour. Data are used in policy implementation, for example to model the effects of policies on human behaviour, thereby optimising policy making to achieve predictable and favourable outcomes (Pentland, 2014).

This report clearly focuses on publicly funded data. It does not address private-sector data, even though such data evidently play a large role in science and innovation. The following section reviews PSI and data from publicly funded research.

Enhanced access to data: A primer

Definition of enhanced access to data

The OECD Principles and Guidelines for Access to Research Data from Public Funding define access arrangements as “the regulatory, policy and procedural framework established by research institutions, research-funding agencies and other partners involved, to determine the conditions of access to and use of research data” (OECD, 2006).

It is important to note that access to data is not a binary concept – rather, it can be staged to different degrees of openness, depending on the community of stakeholders involved (OECD, 2015a). “As open as possible, as closed as necessary” is gradually replacing the “open-by-default” mantra associated with the early days of the open-access movement. Although opening up data can help advance the STI agenda, this needs to be balanced against issues of costs, privacy, security, IPRs and preventing malevolent uses. The term “enhanced access to data” is increasingly used in relation to public-sector data and captures some of these important caveats around openness.

The more sensitive the data, the more difficult it is to open them to the general public, with the underlying risk of privacy breaches and malevolent use. Hence, different degrees of openness may include: i) open access with open licence; ii) public access with a specific licence that limits use; iii) group-based access through authentication; and iv) named access explicitly assigned by contract (OECD, 2019).
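The four degrees of openness listed above can be pictured as a small access-control model. The following is a minimal sketch, not an implementation of any existing platform: the class names, fields and the `can_access` rules are all assumptions introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class AccessLevel(IntEnum):
    """The four degrees of openness listed in the text, most open first."""
    OPEN_LICENCE = 1        # i) open access with open licence
    RESTRICTED_LICENCE = 2  # ii) public access, licence limits use
    GROUP = 3               # iii) group-based access through authentication
    NAMED = 4               # iv) named access assigned by contract


@dataclass
class Dataset:
    name: str
    level: AccessLevel
    allowed_groups: set[str] = field(default_factory=set)  # used by GROUP
    named_users: set[str] = field(default_factory=set)     # used by NAMED


@dataclass
class User:
    user_id: str
    authenticated: bool = False
    groups: set[str] = field(default_factory=set)


def can_access(user: User, dataset: Dataset) -> bool:
    """Hypothetical rule set mapping each tier to an access check."""
    if dataset.level in (AccessLevel.OPEN_LICENCE, AccessLevel.RESTRICTED_LICENCE):
        return True  # anyone may access; the licence governs reuse, not access
    if not user.authenticated:
        return False
    if dataset.level is AccessLevel.GROUP:
        return bool(user.groups & dataset.allowed_groups)
    return user.user_id in dataset.named_users


# Illustrative records only; the dataset and user names are made up.
survey = Dataset("health-survey", AccessLevel.GROUP, allowed_groups={"epidemiology"})
alice = User("alice", authenticated=True, groups={"epidemiology"})
print(can_access(alice, survey))          # True
print(can_access(User("guest"), survey))  # False
```

In this sketch the first two tiers differ only in what the licence permits after access, which is why both return `True`; a real system would also record and enforce the licence terms.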

A recent survey of Australian research-data repositories shows that 86% provide open access to at least part of the data, 12% offer exclusively restricted access, and 2% propose a combination of closed and restricted access. Of the 86% of repositories that are at least partly open, 50% are fully open, 32% have restricted parts of the datasets, 6% have embargoed datasets, and 6% have closed datasets (Kindling et al., 2017).³

Open data can be defined simply as “data that can be accessed and reused by anyone without technical or legal restrictions” (OECD, 2015b). This does not necessarily mean the data are free of cost, although in the context of open science, it is normally assumed the user bears no charges. Different models include institutional subscription to research databases; open access in the “author pays” variant – authors or their employers pay for the cost of publishing in order to provide free access to the community; open-access archives and repositories, where organisations support institutional repositories and/or subject archives, and authors make their work freely available to anyone with Internet access; and a number of hybrid solutions, such as delayed open access and open choice (Houghton and Sheehan, 2009; OECD, 2017b).

More restricted access to data can be organised within the framework of safe environments, such as the Five Safes framework (Table 1.2). These environments rely on specific safe-software platforms, where only approved researchers can access the data within a specific environment, analyse them without extracting the actual sensitive data and then submit the results of their research for approval. The results will then be investigated to test whether they risk disclosure. If they are considered “safe”, the researchers will be authorised to use them; if they are considered unsafe, the researchers will need to devise a way of further anonymising the result.

Table 1.2. Five Safes framework

| Dimension | Question |
| --- | --- |
| Safe projects | Is this use of the data appropriate, lawful, ethical and sensible? |
| Safe people | Can the researchers be trusted to use it in an appropriate manner? |
| Safe data | Does the data itself contain sufficient information to allow confidentiality to be breached? |
| Safe settings | Does the access facility limit unauthorised use or mistakes? |
| Safe outputs | Are the statistical results non-disclosive? |

Source: Office for National Statistics UK (n.d.), “Secure research service”, webpage, https://www.ons.gov.uk/aboutus/whatwedo/statistics/requestingstatistics/approvedresearcherscheme.
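The “safe outputs” check described above, in which results are reviewed for disclosure risk before release, could be partly automated. The sketch below assumes a simple minimum-cell-count rule for frequency tables; this rule, the threshold of 10 and the function names are assumptions made for illustration and are not taken from the Five Safes framework itself.

```python
from __future__ import annotations

MIN_CELL_COUNT = 10  # assumed threshold; real services set their own rules


def unsafe_cells(table: dict[str, int], threshold: int = MIN_CELL_COUNT) -> list[str]:
    """Return labels of non-empty cells whose count is below the threshold."""
    return [label for label, count in table.items() if 0 < count < threshold]


def review_output(table: dict[str, int]) -> str:
    """Classify a frequency table as 'safe' or 'unsafe' under the assumed rule."""
    flagged = unsafe_cells(table)
    if flagged:
        return "unsafe: small cells " + ", ".join(sorted(flagged))
    return "safe"


# Made-up counts of survey respondents per region.
counts = {"region A": 124, "region B": 57, "region C": 3}
print(review_output(counts))
```

A table flagged as unsafe would go back to the researcher, who might merge small categories or suppress the affected cells before resubmitting.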

Opportunities from enhanced access to publicly funded data for STI

Enhanced data access and use offer opportunities for individuals, businesses and governments alike. Individuals are able to access valuable services at a negligible cost. For example, using the satellite navigation system included in any smartphone at virtually no marginal cost provides a service that is superior to the previous – and costly – option of buying a paper map and painstakingly finding one’s way. Such services are based on geospatial data collected by the public sector.

Figure 1.3. Stakeholder expectations from open science: Results of a 2016 European stakeholder consultation on “Science 2.0”¹

1. “Science 2.0” was a generic term used in 2014 to designate the next generation of science. Since then, “open science” has become the standard term.

Note: Stakeholders identified most with “open science” when referring to the future of science.

Source: European Commission (2014), “Validation of the results of the public consultation on science 2.0: Science in transition”, http://ec.europa.eu/research/consultations/science-2.0/science_2_0_final_report.pdf.

Businesses use data to learn about consumer preferences, create new products and services, streamline their business processes and increase overall productivity. Moreover, data themselves are the commodity being sold. A study commissioned by the European Commission estimated the aggregate economic impact of PSI at about EUR 140 billion in 2008, or about 1.1% of EU27 gross domestic product (GDP) (Vickery, 2011).

In the field of science and technology, open access to data is commonly linked to open science and open innovation (OECD, 2015b). An extensive European stakeholder consultation conducted in 2014 showed that open access to publications and open access to data within the context of open science were the top issues requiring policy intervention, ahead of research infrastructure and research quality. Open science is expected to make science more accountable, more efficient and more responsive to societal needs (Figure 1.3) (European Commission, 2014).

A well-known issue in science is publication bias, whereby negative results or results that are not deemed sufficiently significant are not published, because such publication is not worth the time and effort of the researchers, who will receive little or no recognition for a “non-result”. However, failure to publish such data wastes time and effort elsewhere: experiments may be duplicated because it is not known that this avenue of research does not lead to a positive discovery (Rothstein, Sutton and Borenstein, 2005). The adoption of open access to scientific data would help resolve this bias.

Drivers and barriers for open access to data

Provided that legitimate concerns about privacy, intellectual property (IP), national security and other public interests are addressed,⁴ enhanced access to public data can provide great benefits to the economy, the research community and society at large. The economic benefit of enhanced access to data is significant: estimates put it at 1% or more of GDP (Box 1.3).

Box 1.3. Impact of open access to data

Estimates of the economic impact of enhanced access to data vary.

  • The McKinsey Global Institute estimated the potential value creation from open access to data in seven sectors at USD 3 trillion to USD 5 trillion, or 4% to 7% of global GDP (Manyika et al., 2016).

  • The OECD estimates the aggregate economic impact of PSI-related applications in OECD member countries at around USD 500 billion in 2008, equivalent to about 1.1% of cumulated GDP (OECD, 2015a).

  • An Australian study estimated that data from research alone amounted to 0.15% to 0.4% of GDP in 2012, with potential upsides to 0.3% to 1% of GDP (Houghton, 2014).

At least seven main rationales exist in favour of enhanced access to publicly funded data:

  1. Create opportunities for new scientific insights: data reuse increases the efficiency of science and optimises impact and return on investments. For instance, more papers are published using data retrieved from the archive of the Hubble Space Telescope than by the people who originally proposed and analysed the observations. Providing broader access to data allows more researchers (and citizens) to analyse and link those data to other data sources in order to respond to different scientific questions. For example, the health-research community working on emerging diseases is increasingly relying on biodiversity data. Enhancing access to and sharing of data also encourages meta-analysis, which combines the results of different related studies (e.g. clinical trials of a drug) to provide greater statistical power.

  2. Promote innovation and economic growth: allowing commercial companies to access and use public-research data accelerates innovation on products (e.g. new drugs) or new data services (e.g. weather forecasting). Data are the essential enabler for AI and related innovations.

  3. Enhance social welfare for individuals and society at large: publicly funded research is a public good, therefore data from publicly funded research should, in principle, be available to researchers, citizens and commercial actors who wish to use and derive value from them. Transparency and accountability are sometimes an issue.

  4. Increase reproducibility of scientific results: sharing access to the data underpinning scientific publications allows peers to test and reproduce scientific results. In practice, data alone are often insufficient to test reproducibility, and enhanced access to analysis software is also necessary.

  5. Enhance education and training: enhanced access to data provides opportunities for richer educational content.

  6. Avoid duplication: sharing datasets leading to positive or negative results can prevent duplication of research efforts (Rothstein, Sutton and Borenstein, 2005).

  7. Improve governance in public research: open access to data can promote transparency, democratic accountability, citizen empowerment, better delivery of public research, innovation and use of crowd wisdom, as well as prevent duplication of data-collection efforts, optimise administrative processes and enhance access to external problem-solving capacity (Janssen, Charalabidis and Zuiderwijk, 2012).

Taken together, these rationales provide for a better STI ecosystem and contribute to society as a whole. Access to data alone is not sufficient to achieve all these expectations, but lack of access is a major barrier to achieving them.
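The gain in statistical power from meta-analysis, mentioned under the first rationale above, can be illustrated with a small worked example. The sketch below uses fixed-effect inverse-variance pooling, one standard way of combining study results; the choice of method, the function name and the numbers are assumptions made for illustration and do not come from the report.

```python
from __future__ import annotations

import math


def pool_fixed_effect(estimates: list[float], std_errors: list[float]) -> tuple[float, float]:
    """Inverse-variance weighted mean of study estimates and its standard error."""
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    weights = [1 / se ** 2 for se in std_errors]  # precise studies weigh more
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, math.sqrt(1 / total)


# Made-up effect estimates from three hypothetical trials of the same drug.
effects = [0.30, 0.10, 0.25]
errors = [0.20, 0.15, 0.25]
pooled, pooled_se = pool_fixed_effect(effects, errors)
print(f"pooled effect = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

The pooled standard error is always smaller than that of the most precise individual study, which is the sense in which combining shared datasets provides greater statistical power.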

Simultaneously, enhanced access to data introduces legitimate concerns about privacy, IPRs, national security and other public interests, such as the protection of rare and endangered species. Chapter 4 addresses these risks. When, how and under what conditions public-research data should be made accessible are important policy questions that cut across the issues discussed in this report.

Progress towards achieving enhanced access to data has been uneven: the latest edition of the Open Data Barometer shows that only 7% of the data are fully open, with many fragmented, incomplete and outdated datasets, which also lack the necessary metadata (World Wide Web Foundation, 2017). Open access-to-data catalogues or portals are informally maintained; the most complete datasets are often found in sources other than the official open-data portal. Some categories of datasets are particularly important for innovation, such as map data, public transport timetables, international trade data and crime data, which entrepreneurs can use to provide specific services to end users. The degree of openness is also low in these categories (only 8% to 11% of all datasets are open) and is reported to be declining (World Wide Web Foundation, 2017).

The Global Open Data Index presents similar conclusions: i) data are hard (or even impossible) to find owing to insufficient indexation; ii) data are not readily exploitable, owing to non-standard formats, lack of machine readability (e.g. stemming from use of the HTML format) and failure to publish the raw data favoured by topical experts; and iii) open licensing is rare and jeopardised by a lack of standards, risk aversion and fear of unlawful data use, leading to ambivalent or unclear clauses that create incompatibilities between licences and hamper data use (Lämmerhirt, Rubinstein and Montiel, 2017).

The OECD Open-Useful-Reusable Government Index (OURdata Index) identifies: i) implementation gaps in late adopters of open-government data policies; ii) a need to strengthen support for reuse, both outside the public sector (through data-awareness initiatives, hackathons and co-creation events) and inside the public sector (through information sessions and regular training for civil servants); iii) an opportunity to develop platforms that allow users to actively monitor data quality and add to available data; and iv) the need to better monitor the impact of open-government data (Lafortune and Ubaldi, 2018).

copy the linklink copied!
Figure 1.4. Institutional policies and barriers to promoting research-data management and/or open access to research data in European universities

Source: Morais and Borrell-Damian (2018), “Open access – 2016-2017 EUA Survey results”, www.eua.be/Libraries/publications-homepage-list/open-access-2016-2017-eua-survey-results.

In the area of research, access to data currently lags behind access to publications: although more than 92% of universities in Europe have open-access policies for publications in place or plan to have them in the near future, fewer than 28% have guidelines on open access to data in place (Morais and Borrell-Damian, 2018). This is clearly not an infrastructure issue: over 83% of institutions either have their own repository or participate in a shared repository; 65% have their repository aggregated by the OpenAIRE portal/infrastructure, which aims to link the aggregated research publications to the accompanying research and project information, to enhance the reproducibility of scientific results (OpenAire, 2016). Institutional barriers to promoting research-data management include internal factors (e.g. different “scientific cultures”), limited awareness of the benefits of research data and structural elements (such as the absence of policy guidelines at the national level, the lack of incentives to promote research data and increased costs), as well as the lack of adequate infrastructure (Figure 1.4).

A 2016 OECD survey of scientific authors revealed that only 20% to 25% of corresponding authors had been asked to share data after publication. If asked, a significant number of authors (30% to 50%) said they would grant access to the data, or at least take steps to grant access, and about 30% said they would seek to clarify the request. Depending on the discipline, between 10% and 20% of authors would refuse to share data on legal grounds (Boselli and Galindo-Rueda, 2016). Authors of scientific papers are more reluctant to share their data openly than to obtain access to data from other research groups (Elsevier and CSTS, 2017).

The most recent edition of the OECD International Survey of Scientific Authors (ISSA2) shows that on average, 67% of scientific production results in new data or code (in about 24% of cases, both). Of these, an average of about 40% are stored in repositories; this share varies from a high of over 50% for multidisciplinary research, agricultural and biological sciences and material sciences to a low of 20% to 25% in sociology, psychology, business and management. Authors seem to be more likely to share their data than their code. Code was archived in a repository or delivered to a journal as supporting material in about 20% of cases, whereas around 45% of authors shared their data using these means. Reuse is yet another barrier to be overcome: even when shared, data are not always FAIR, in that they are not always accompanied by relevant metadata or compliant with relevant standards, and cases where an object identifier is assigned are fewer still. Payment of a fee is required in about 12% of cases. The main drivers identified for sharing data were career objectives and peer expectations, rather than formal sharing requirements from funders. The most significant barriers identified were high dissemination costs and intellectual property issues (Bello and Galindo-Rueda, forthcoming).

The Wiley Open Science Researcher Insights show that 69% of researchers shared data in 2016. Their top motivations for doing so were: increasing the impact and visibility of their research (39%); furthering the public benefit (35%); promoting transparency and reuse (31%); and meeting journal requirements (29%). Conversely, the top reasons researchers hesitated to share their data were concerns about IP or confidentiality (50%), ethics (31%), misuse and misinterpretation (23%) and research being “scooped” (22%) (Wiley, 2016).

Although intermediate forms of enhanced access to data within semi-open or closed communities with registration and certification requirements have been insufficiently assessed, anecdotal evidence suggests that such sharing remains relatively limited, and that sharing across borders faces particular barriers. One example of controlled access is the Secure Research Service provided by the UK Office for National Statistics, which grants certified researchers access to sensitive datasets (Office of National Statistics UK [n.d.]). Chapter 4 further discusses access to sensitive data.

Objectives and structure of the report

This report focuses on enhanced access to publicly funded research data for STI. The objective is to take stock of current policy practices, achievements and challenges, and identify outstanding policy issues to be addressed by policymakers in the future.

Chapter 2 reviews international initiatives promoting enhanced access to data for STI. These include initiatives at the intergovernmental level – such as the OECD, the Group of Eight (G8), the European Union and UNESCO initiatives – and community-driven initiatives – such as the Committee on Data of the International Council for Science, the FAIR initiative and the Research Data Alliance.

Chapter 3 reviews current policies promoting enhanced access to data in OECD member countries and partner economies. It is based on responses to the 2017 CSTP survey provided by the committee’s delegates, responses to the 2017 EC/OECD STI Policy survey (EC/OECD, 2018) and the policy case studies contributed by 18 countries in 2018. The full text of the case studies is accessible on the dedicated website (OECD, 2018b).

Chapter 4 develops the main policy issues hindering enhanced access to and sharing of data: i) balancing the benefits and risks of data sharing; ii) technical standards and practices – keeping up with the pace of technological progress; iii) defining responsibility and ownership; iv) providing incentives and rewards for data authors and stewards; v) developing business models and funding for enhanced access; vi) building human capital and institutional capabilities to manage, create, curate and reuse data; and vii) exchanging sensitive data across borders.

Finally, Chapter 5 draws conclusions from the preceding work and proposes scenarios for the future of access to data for STI.

References

Bello, M. and F. Galindo-Rueda (forthcoming), “Charting the digital transformation of science: Findings from the 2018 OECD International Survey of Scientific Authors (ISSA2)”, OECD Science, Technology and Industry Working Papers, OECD Publishing, Paris.

Boselli, B. and F. Galindo-Rueda (2016), “Drivers and implications of scientific open access publishing: Findings from a pilot OECD International Survey of Scientific Authors”, OECD Science, Technology and Industry Policy Papers, No. 33, OECD Publishing, Paris, https://doi.org/10.1787/5jlr2z70k0bx-en.

Corrado, C., C. Hulten and D. Sichel (2004), “Measuring capital and technology: An expanded framework”, https://www.federalreserve.gov/pubs/feds/2004/200465/200465pap.pdf (accessed on 26 July 2019).

EC/OECD (2018), “STIP Compass”, https://stip.oecd.org/stip.html (accessed on 9 March 2020).

Elsevier and CSTS (2017), “Open data: The researcher perspective”, https://www.elsevier.com/__data/assets/pdf_file/0004/281920/Open-data-report.pdf (accessed on 9 March 2020).

European Commission (2019), “Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information”, https://eur-lex.europa.eu/legal-content/EN/TXT/?qid=1561563110433&uri=CELEX:32019L1024 (accessed on 26 July 2019).

European Commission (2017), “Consultation on PSI Directive review”, http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=51544 (accessed on 28 February 2020).

European Commission (2014), “Validation of the results of the public consultation on science 2.0: Science in transition”, http://ec.europa.eu/research/consultations/science-2.0/science_2_0_final_report.pdf (accessed on 26 February 2020).

Gordon Bell, T. (2009), “Beyond the data deluge”, Science, Vol. 323/5919, pp. 1297-8, https://doi.org/10.1126/science.1170411.

Hey, T., S. Tansley and K. Tolle (2009), “The fourth paradigm: Data-intensive scientific discovery”, Microsoft Research, https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf (accessed on 28 November 2019).

Houghton, J. (2014), “Open research data report to the Australian National Data Service (ANDS)”, https://www.ands.org.au/__data/assets/pdf_file/0019/393022/open-research-data-report.pdf (accessed on 20 December 2019).

Houghton, J. and P. Sheehan (2009), “Estimating the potential impacts of open access to research findings”, Economic Analysis and Policy, Vol. 39, No. 1, March, https://doi.org/10.1016/S0313-5926(09)50048-3.

Janssen, M., Y. Charalabidis and A. Zuiderwijk (2012), “Benefits, adoption barriers and myths of open data and open government”, Information Systems Management, Vol. 29/4, pp. 258-268, https://doi.org/10.1080/10580530.2012.716740.

Jomhari, N., A. Heiser and A.A. Bin Annuar (2017), “Higgs-to-four-lepton analysis example using 2011-2012 data”, CERN Open Data Portal, https://doi.org/10.7483/OPENDATA.CMS.JKB8.RR42.

Kindling, M. et al. (2017), “The landscape of research data repositories in 2015: A re3data analysis”, D-Lib Magazine, Vol. 23/3/4, https://doi.org/10.1045/march2017-kindling.

Lafortune, G. and B. Ubaldi (2018), “OECD 2017 OURdata Index: Methodology and results”, OECD Working Papers on Public Governance, No. 30, OECD Publishing, Paris, https://dx.doi.org/10.1787/2807d3c8-en.

Lämmerhirt, D., M. Rubinstein and O. Montiel (2017), “The state of open government data in 2017 – Creating meaningful open data through multi-stakeholder dialogue”, https://blog.okfn.org/files/2017/06/FinalreportTheStateofOpenGovernmentDatain2017.pdf (accessed on 14 December 2019).

Manyika, J. et al. (2016), “Digital globalization: The new era of global flows”, report, McKinsey Global Institute, https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/digital-globalization-the-new-era-of-global-flows (accessed on 24 February 2020).

Morais, R. and L. Borrell-Damian (2018), “Open access – 2016-2017 EUA Survey results”, report, European University Association, www.eua.be/Libraries/publications-homepage-list/open-access-2016-2017-eua-survey-results (accessed on 19 June 2019).

OECD (2019), Enhanced Access to and Sharing of Data: Reconciling Risks and Benefits for Data Re-use across Societies, OECD Publishing, Paris, https://doi.org/10.1787/276aaca8-en.

OECD (2018a), AI: Intelligent Machines, Smart Policies, Conference Summary, OECD Digital Economy Papers, No. 270, OECD Publishing, Paris, https://doi.org/10.1787/f1a650d9-en.

OECD (2018b), “Enhanced Access to Publicly Funded Data for Science, Technology and Innovation”, webpage, OECD, Paris, https://community.oecd.org/community/cstp/enhanced-data-access (accessed on 9 January 2020).

OECD (2017a), OECD Science, Technology and Industry Scoreboard, OECD Publishing, Paris, https://doi.org/10.1787/9789264268821-en.

OECD (2017b), “Business models for sustainable research data repositories”, OECD Science, Technology and Industry Policy Papers, No. 47, OECD Publishing, Paris, https://doi.org/10.1787/302b12bb-en.

OECD (2015a), Data-Driven Innovation – Big Data for Growth and Well-Being, OECD Publishing, Paris, https://dx.doi.org/10.1787/9789264229358-en.

OECD (2015b), “Making open science a reality”, OECD Science, Technology and Industry Policy Papers, No. 25, OECD Publishing, Paris, https://doi.org/10.1787/5jrs2f963zs1-en.

OECD (2008), Recommendation of the Council for Enhanced Access and More Effective Use of Public Sector Information, OECD, Paris, https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0362.

OECD (2006), Recommendation of the Council concerning Access to Research Data from Public Funding, OECD, Paris, https://legalinstruments.oecd.org/en/instruments/OECD-LEGAL-0347.

Office of National Statistics UK (n.d.), “Secure research service”, webpage, https://www.ons.gov.uk/aboutus/whatwedo/statistics/requestingstatistics/approvedresearcherscheme.

OpenAire (2016), “OpenAIRE’s mission and vision”, webpage, https://www.openaire.eu/mission-and-vision (accessed on 19 December 2019).

Pentland, A. (2014), Social Physics: How Good Ideas Spread-the Lessons from a New Science, Scribe Publications Pty Limited, Melbourne, London.

Philip Chen, C. and C. Zhang (2014), “Data-intensive applications, challenges, techniques and technologies: A survey on Big Data”, Information Sciences, Vol. 275, pp. 314-347, Elsevier, https://doi.org/10.1016/J.INS.2014.01.015.

Reinsel, D., J. Gantz and J. Rydning (2017), “Data age 2025: The digitization of the world – From edge to core”, https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf (accessed on 13 July 2019).

Rothstein, H., A. Sutton and M. Borenstein (2005), “Publication bias in meta-analysis”, in Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments, Ch. 1, John Wiley & Sons, Ltd, Hoboken, NJ, https://doi.org/10.1002/0470870168.ch1.

Safarov, I., A. Meijer and S. Grimmelikhuijsen (2017), “Utilization of open government data: A systematic literature review of types, conditions, effects and users”, Information Polity, Vol. 22/1, pp. 1-24, https://doi.org/10.3233/IP-160012.

Shin, E. (2018), “Korean case report on enhanced access to research data”, case study for the OECD project on enhanced access to data, https://community.oecd.org/servlet/JiveServlet/downloadBody/141310-102-4-263210/korean%20case%20report.pdf.

Simonite, T. (2016), “Algorithms that learn with less data could expand AI’s power”, MIT Technology Review, 24 May, Boston, https://www.technologyreview.com/s/601551/algorithms-that-learn-with-less-data-could-expand-ais-power/ (accessed on 28 February 2020).

Towns, J. et al. (2014), “XSEDE: Accelerating scientific discovery”, Computing in Science & Engineering, Vol. 16, No. 5, IEEE, Sept-Oct., https://doi.org/10.1109/MCSE.2014.80.

Vickery, G. (2011), “Review of recent studies on PSI Re-use and related market developments”, https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=1093 (accessed on 24 February 2020).

Wiley (2016), “Wiley global data sharing infographic”, June 2017, Wiley Open Science Researcher Insights Survey, https://authorservices.wiley.com/asset/photos/licensing-and-open-access-photos/Wiley%20Global%20Data%20Sharing%20Infographic%20June%202017.pdf (accessed on 17 December 2019).

World Wide Web Foundation (2017), “Open data barometer global report”, fourth edition, https://opendatabarometer.org/doc/4thEdition/ODB-4thEdition-GlobalReport.pdf (accessed on 24 February 2020).

Yan, A. and N. Weber (2018), “Mining open government data used in scientific research”, in International Conference on Information, pp. 303-313, https://arxiv.org/pdf/1802.03074.pdf/ (accessed on 4 March 2020).

Notes

1. Knowledge-based capital comprises computerised information, innovative property and economic competencies (Corrado, Hulten and Sichel, 2004).

2. www.data.gov.

3. The total exceeds 86% because some categories overlap (e.g. the same repository can have partly open, partly restricted and partly closed datasets).

4. Chapter 4 will address those concerns.

Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

https://doi.org/10.1787/947717bc-en

© OECD 2020

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at http://www.oecd.org/termsandconditions.