3. Standards of Evidence: Mapping the experience in OECD countries

The approaches to evidence standards found across OECD countries are focused on a variety of forms of evidence and cover seven standards of evidence. This chapter is structured as follows: i) the four main functions of the standards of evidence; ii) the distribution of standards of evidence across OECD countries; iii) the forms of evidence that are addressed by the standards; and iv) an introduction to the seven standards of evidence reviewed in this report.

The standards of evidence reviewed in this report vary in terms of the ‘unit of analysis’ they focus on. Some standards focus on the entirety of the existing evidence base (evidence synthesis). This includes standards focused on assessing the quality of existing evidence syntheses and standards focused on the generation of new evidence syntheses (see Figure 3.1).

Other standards focus on the generation of new evaluation evidence. This includes standards for assessing the evidence for an individual intervention as well as standards for supporting the development of an intervention. The approaches also vary in whether they address one key function or multiple functions. More than half of the approaches have one key function; the rest address multiple functions.

Assessing the strength of evidence of an intervention refers to standards that examine the quality of the design and the robustness of the findings of single studies in order to determine the strength of evidence for individual interventions. This assessment also involves an analysis of the findings and impacts in the study or studies. For instance, What Works for Kids (Australia) rates the evidence according to the evaluation(s) that have been conducted on each programme (Nest What works for kids, 2012[1]). Another approach that focuses on individual interventions is Social Programmes That Work (USA), which seeks to identify those social programmes shown in rigorous studies to produce sizable, sustained benefits to participants (Coalition for Evidence-Based Policy, 2010[2]).

Assessing bodies of evidence refers to standards for appraising the totality of evidence included in a review of an evidence base. This includes: (a) the nature of the totality of evidence; (b) the extent and distribution of that evidence; and (c) the methods for undertaking a review (Gough and White, 2018[3]). Health Evidence provides a Quality Assessment Tool to evaluate systematic reviews or bodies of evidence (Health Evidence, 2018[4]). Another example is the Practice Scoring Instrument from Crime Solutions (USA). It presents guidelines to identify a body of evidence (e.g. what qualifies as an eligible meta-analysis?) and to evaluate it in terms of eligibility criteria, comprehensiveness of the literature search, methodological quality and publication bias (Crime Solutions, 2013[5]).

Reviewing the evidence base for an intervention refers to standards for creating evidence synthesis. This can include methods to identify, select, appraise, and synthesise high quality research. For instance, the Clearinghouse for Labor Evaluation and Research (CLEAR) reviews the evidence base for interventions, providing information related to the selection process, key features of all the relevant research identified for a given topic area, and reference documents of the review process (Clearinghouse for Labor Evaluation and Research, 2017[6]). An additional example is The Community Guide approach, which has a guide for the execution of systematic reviews (Zaza et al., 2000[7]).

Supporting the development of an intervention refers to standards that focus on creating guidelines that help implementing entities to better understand how an intervention fits into an implementing site’s existing work and context (National Implementation Research Network, 2018[8]). The European Monitoring Centre for Drugs and Drug Addiction (EMCDDA) has developed the European drug prevention quality standards, which outline the steps to be taken when planning, conducting, or evaluating programmes. These standards inform the development of interventions and serve as a reference framework for professional development (2011[9]).

The mapped approaches cover a variety of different types of evidence, including quantitative research, impact evaluation, systematic review, and qualitative research (see Figure 3.2). Approaches can focus specifically on one type of evidence or on more than one type. A small number cover only one kind of evidence, while the majority cover several of these types. Education Counts describes the four types (Alton-Lee, 2004[10]).

  • Most of the approaches assess evidence from impact evaluations. This can include Randomised Controlled Trials (RCTs) and Quasi-Experimental Designs (QEDs).

  • Thirty-four approaches concern quantitative methods. This includes approaches that assess correlational analyses, single case studies and pre-post studies without control groups.

  • Twelve approaches evaluate qualitative methods. This includes approaches that include interviews, focus groups, panels of experts, or ethnographies. Qualitative research is often concerned with the implementation process.

  • Ten approaches assess systematic review or meta-analysis.

The seven standards of evidence are: Evidence synthesis; Theory of Change and Logic underpinning the Programme; Design and Development of Policies and Programmes; Efficacy; Effectiveness; Cost (effectiveness); and, Implementation and scale up of intervention.

Evidence synthesis informs policy makers of what is known from research, making it fundamental for informing policy decisions and for promoting the uptake and use of evidence from evaluations and other evidence (Oliver et al., 2018[11]; Shemilt et al., 2010[12]). Evidence syntheses come in a variety of forms and of varying quality (as with primary studies), so standards to enable readers to appraise the quality of evidence synthesis are critical.

Evidence synthesis is an important tool for good knowledge management. Given the breadth of literature, including impact evaluations and RCTs, being published each year, knowledge management is essential as it becomes more difficult for policy makers and practitioners to keep abreast of the literature. Furthermore, policies should ideally be based on assessing the full body of evidence, not single studies, which may not provide a full picture of the effectiveness of a policy or programme.

Evidence syntheses provide a vital tool for policy makers and practitioners to find what works, how it works – and what might do harm. Evidence syntheses are also critical in informing what is not known from previous research. As with primary studies, readers can (and should) appraise the quality and relevance of evidence synthesis (Gough, Thomas and Oliver, 2019[13]).

Since the early 2000s, across many sectors and countries, there has been an increase in the number of impact evaluations, including Randomised Controlled Trials (RCTs), being published each year (White, 2019[14]). For example, in education around ten RCTs were published each year in the early 2000s, growing to over 100 a year by 2012. As the number of studies increases, it becomes more difficult for policy makers and practitioners to keep abreast of the literature. Evidence synthesis allows for the amalgamation of findings and easier navigation of bodies of literature (Gough, Thomas and Oliver, 2019[13]), rather than reliance on single studies, which may not provide a full picture of the effectiveness of a policy or programme.

Evidence synthesis can come in a variety of forms, depending on the research questions and resources available, such as:

  • Map of maps: provide reports from other evidence and gap maps in that policy space and by doing so act as a navigation tool (Gough, Thomas and Oliver, 2019[13]);

  • Mega-maps: show other maps and reviews, but not primary studies;

  • Evidence and gap maps: are even broader in scope but report a far more limited range of information about the reviews and primary studies they include (Saran and White, 2018[15]);

  • Review of reviews: may be broader in scope but may be more restricted in the depth of analysis. This method only includes existing reviews, preferably systematic, rather than primary studies (Saran and White, 2018[15]);

  • Systematic reviews: are narrow in scope but provide in-depth analysis (Saran and White, 2018[15]). This is the most robust method for reviewing, synthesising and mapping existing evidence on a particular policy topic. It is more resource-intensive, as it takes a minimum of 8 to 12 months and requires a research team (The UK Civil Service, 2014[16]). Systematic reviews have a number of stages, including: defining the review question; conceptual framework; inclusion criteria; search strategy; screening; coding of information from each study; quality and relevance appraisal; and synthesis of study findings to answer the review question (Gough, Thomas and Oliver, 2019[13]).

  • Meta-analysis: It refers to the use of statistical methods to summarise the results from individual programme evaluations on a given topic. A meta-analysis produces a weight-of-the-evidence summary of a programme’s ability to achieve a specific outcome, or of the relationship between one outcome and another, and therefore allows an overall conclusion to be drawn about the average effectiveness of a programme (Washington State Institute for Public Policy Benefit, 2017[17]).

  • Rapid Evidence Assessment (REA): It is a quick overview of existing research on a (constrained) topic and a synthesis of the evidence provided by these studies to answer a specific policy issue or research question. REAs tend to be rigorous and explicit in method and thus systematic but make concessions on the depth of the process by limiting particular aspects of the systematic review process such as the screening stage (e.g. only electronically available texts) or considering using less developed search strings (The UK Civil Service, 2014[16]).

  • Quick Scoping Review: It consists of a quick overview of the available research (from accessible, electronic and key resources, going up to two bibliographical references on a specific topic) to determine the range of existing studies on the topic. This non-systematic method can take from 1 week to 2 months (The UK Civil Service, 2014[16]).
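To make the meta-analysis entry in the list above more concrete, the following is a minimal sketch of one common way of pooling study results: inverse-variance, fixed-effect weighting. It is an illustration only and is not presented as the method used by any of the approaches cited in this report. The function name `pool_fixed_effect` and the effect sizes and standard errors are hypothetical.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Return the inverse-variance pooled effect and its standard error.

    effects: effect size estimated by each study (hypothetical inputs).
    std_errors: standard error of each effect size, same order as effects.
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("effects and std_errors must be non-empty and equal in length")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # More precise studies (smaller standard error) receive larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)

    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Illustrative (made-up) results from three evaluations of one programme.
effects = [0.20, 0.35, 0.10]
std_errors = [0.10, 0.20, 0.05]
pooled, pooled_se = pool_fixed_effect(effects, std_errors)
print(f"pooled effect = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

A fixed-effect model assumes that all studies estimate the same underlying effect. Where programmes, populations or settings differ substantially, a random-effects model may be more appropriate.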

Evidence synthesis also has a critical role to play in evidence-informed recommendations and guidance. In a number of policy areas, notably in health, formal processes have been developed for interpreting research evidence in order to develop and make recommendations (Ferri and Griffiths, 2015[18]) (see Box 3.1). At the European level, the EMCDDA (2020[19]) with its experience in monitoring and disseminating best practice promotes and supports guideline adaptation. An inventory of national guidelines and standards in treatment, prevention and harm reduction functions as a tool for ensuring that there are processes for translating the evidence base into appropriate recommendations and guidelines. At a global level the WHO produces guidelines that are underpinned by evidence synthesis (Oxman, Lavis and Fretheim, 2007[20]).

Many OECD countries have a strong focus on producing systematic reviews and evidence-informed recommendations and guidance, following the long established practice of the Cochrane centres in the health area. The Danish government funds ‘Cochrane Denmark’, which supports the synthesis and dissemination of the best available evidence for health professionals, researchers and decision-makers (The Cochrane Collaboration, 2021[21]). The Norwegian Institute of Public Health, a government agency under the Ministry of Health and Care Services, has a strong focus on producing evidence synthesis to support decision making. Recent reviews include a live map of COVID-19 evidence (Norwegian Institute of Public Health, 2020[22]) and a review of weight reduction strategies among adults with obesity (Norwegian Institute of Public Health, 2021[23]). The Swedish Agency for Health Technology Assessment and Assessment of Social Services (SBU) is an independent national agency, tasked by the government with assessing health care and social service interventions, covering medical, economic, ethical and social aspects. SBU conducts health technology assessments and systematic reviews of published research to support key decisions in health, medical care and social services. These approaches are now being extended to other areas beyond health, such as social policy. The SBU is currently pioneering a new international initiative in this area.

Of the approaches included in the mapping, around half concern evidence synthesis. These can be divided into two broad categories: standards for assessing the quantity and quality of existing evidence syntheses, and standards for executing evidence synthesis.

Around half of the approaches concerned with evidence synthesis were primarily concerned with providing standards for the quantity and quality of existing syntheses.

Some approaches were primarily concerned with completing a review of existing reviews and translating this into conclusions about the strength of the evidence base for a policy or programme. These approaches include the Education Endowment Foundation’s Teaching and Learning Toolkit (UK), What Works for Health (USA) and the European Monitoring Centre for Drugs and Drug Addiction. For example, What Works for Health (2010[24]) has a rating of ‘Scientifically Supported’ which is awarded to interventions that have one or more systematic review(s). The Education Endowment Foundation has developed a ‘padlock’ rating system to rank the practices within the Teaching and Learning Toolkit (see Box 3.2).

A small number of approaches go further in providing tools that can be used to rate the quality of existing evidence syntheses in order to reach conclusions about the strength of evidence of the body of evidence underpinning a policy or programme. These approaches include ROBIS, Crime Solutions and the EMMIE framework used by the What Works Centre for Crime Reduction and the What Works Centre for Children’s Social Care. Some of these tools originate in the academic literature but have not yet been used by international clearinghouses. ROBIS is one of the most comprehensive and is described in detail in Box 3.3.

Several clearinghouses and What Works centres have also developed their own frameworks to assess existing evidence syntheses. The EMMIE framework score focuses on five dimensions which should be covered in any systematic review intended to inform crime prevention (Johnson, Tilley and Bowers, 2015[27]). These are the Effect of intervention, the identification of the causal Mechanism(s) through which interventions are intended to work, the factors that Moderate their impact, the articulation of practical Implementation issues, and the Economic costs of intervention. In the US, the Crime Solutions clearinghouse has also developed a detailed scoring system that is applied to existing systematic reviews (see Box 3.4).

Seventeen of the approaches focus on standards for executing evidence synthesis. Figure 3.3 provides an example of the general stages for conducting reviews. These include standards for setting up and scoping a review (Stage 1), standards for searching research (Stage 2), and standards for rating the quality of evidence and strength of recommendations (Stage 4). Details of the standards in Stages 1, 2 and 4 are presented below. For example, The EQUATOR Network contains a comprehensive searchable database of reporting guidelines and also links to other resources relevant to research reporting. This includes guidelines for systematic reviews, from deciding the scope and the title of the review to drawing conclusions (The EQUATOR Network, 2020[28]).

Several approaches recommend the development of a protocol to define the conceptual framework for the review, the main review question, the inclusion and exclusion criteria, the review methods, and its documentation. These approaches include the Campbell Collaboration (2019[30]), What Works for Wellbeing (UK) (2017[31]), and EIF (UK) (2018[32]).

Other approaches also stipulate the use of the PICOS (Participants, interventions, comparisons, outcomes, and study design) framework to determine the inclusion and exclusion criteria of the review. For example, Campbell Reviews stipulate that the inclusion criteria should be stated specifically enough, with key terms clearly defined, to be applied with consistent results by anyone screening studies.

Some approaches stipulate that a search process should be transparent, comprehensive, and replicable. For instance, Education counts (NZ) (2004[10]) emphasises the importance of being transparent in approach, including the use of language when making claims as a fundamental tool to support both rigour and effective communication through each synthesis report.

Several approaches go further and provide guidelines to develop the search protocol, the search strategy, and how to document the search process. For instance, What Works for Wellbeing (UK) (2017[31]) specifies what to include in the search protocol (e.g. electronic sources to be searched, and restrictions), and refers to the need to balance sensitivity (ability to identify relevant information) and precision (ability to exclude irrelevant documents).
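The trade-off between sensitivity and precision described above can be illustrated with a short sketch. It assumes that a benchmark set of records already known to be relevant is available against which a search strategy can be checked; the function name and the record identifiers are hypothetical and are not drawn from any of the cited approaches.

```python
def search_sensitivity_precision(retrieved, relevant):
    """Return (sensitivity, precision) for one search strategy.

    retrieved: identifiers of the records returned by the search.
    relevant: identifiers of the records known to be relevant (benchmark set).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant records that the search found

    # Sensitivity: share of all relevant records that the search identified.
    sensitivity = len(hits) / len(relevant) if relevant else 0.0
    # Precision: share of retrieved records that are actually relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return sensitivity, precision


# Hypothetical example: a broad search and a narrow search on the same topic.
relevant = {"r1", "r2", "r3", "r4"}
broad = {"r1", "r2", "r3", "r4", "x1", "x2", "x3", "x4"}
narrow = {"r1", "r2"}

for name, results in (("broad", broad), ("narrow", narrow)):
    sens, prec = search_sensitivity_precision(results, relevant)
    print(f"{name}: sensitivity = {sens:.2f}, precision = {prec:.2f}")
```

In this made-up example the broad search finds every relevant record but half of what it returns is irrelevant, while the narrow search returns only relevant records but misses half of them.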

Other approaches also stipulate considering grey literature in the review to avoid publication bias. For instance, Evidence Based Teen Pregnancy Programmes (USA) (2016[33]) identifies new studies through public calls for new and unpublished research. The Early Intervention Foundation (2018[32]), on the other hand, provides practical methodologies for measuring publication bias in its systematic reviews.

Most of the approaches agree that quality assessment is a critical stage of the evidence review process. Some approaches provide checklists to evaluate the evidence, covering the issues of efficacy and effectiveness addressed in the following sections (Efficacy of an Intervention and Effectiveness of Interventions). These include Evidence for ESSA (USA) (2019[34]), the Strengthening Families Evidence Review (USA) (2018[35]), and the Community Guide (USA) (2018[36]). Another example is the European Food Safety Authority (2010[37]), which has produced guidance on the application of systematic review methodology to food safety assessments to support decision-making. The guidance includes a number of key conclusions concerning the importance of methodological quality assessment:

  • In a systematic review, each study should undergo a standardised assessment, checking whether or not it meets a predefined list of methodological characteristics, to assess the degree to which it is susceptible to bias.

  • There are many stages of the review at which the validity of the individual studies is considered.

  • Common types of bias that can occur in many different study designs are often classified as selection, performance, detection, attrition and reporting biases.

  • Assessment of methodological quality involves using tools (e.g. checklists) to identify those aspects of study design, execution, or analysis which induce a possible risk of bias.

  • It is important to distinguish between the quality of a study and the quality of reporting the study, although both may be correlated.

Other approaches not only stipulate a rating for the quality of evidence, but also provide information about the overall impact across the multiple outcomes of an intervention or policy. For instance, GRADE has developed a method for creating a clear separation between quality of evidence and strength of recommendations and presents a rating for each of these categories (see Box 3.5).

Whereas the previous section of this chapter focused on the synthesis of existing evidence, the remaining sections focus on standards for various aspects of primary evidence generation using a monitoring and evaluation framework. This section focuses on the theory of change and logic model underpinning a programme.

A theory of change can be defined as a set of interrelated assumptions explaining how and why an intervention is likely to produce outcomes in the target population (European Monitoring Centre for Drugs and Drug Addiction, 2011[9]). A logic model sets out the conceptual connections between concepts in the theory of change to show what intervention, at what intensity, delivered to whom and at what intervals would likely produce specified short term, intermediate and long term outcomes (Axford et al., 2005[39]; Epstein and Klerman, 2012[40]). In some cases, a single theory of change might be difficult to identify due to multiple and complex interactions, a unique course of action might be difficult to identify, and the underlying policy goals could be multiple and conflicting. This should not impede the activation of evidence processes.

Engaging in the process of developing a theory of change leads to better policy planning and implementation, because the policy or programme activities are linked to a detailed and plausible understanding of how change happens. A logic model, in turn, is a critical tool to allow detailed, coherent and realistic policy planning.

Although a theory of change and a logic model are often expected to be developed during the planning stage, putting them into practice can depend on timing, political context and other factors. They can also be useful in the monitoring and evaluation stage, for instance to identify key indicators for monitoring or gaps in available data (Better Evaluation, 2012[41]). A full list of the benefits of developing both a theory of change and logic model is reproduced in Box 3.6.

Although this report focuses on standards of evidence as they apply to discrete interventions, many of the concepts relevant to theory of change are also relevant to discussions about policy evaluation and results-oriented policies, such as the importance of clearly distinguishing between concerns of input, output, outcome/result and impact (European Commission, 2011[43]; Gaffey, 2013[44]). For example, the EU Cohesion Policy (European Commission, 2018[45]) sets out several important changes in the understanding and organisation of monitoring and evaluation, notably the emphasis on a clearer articulation of policy objectives (see Box 3.7).

Of the approaches included in the mapping, around half include some coverage of either intervention theory of change or logic model.

All of these approaches stipulate that the intervention should be underpinned by a theory of change, but they vary in terms of how rigorous the theory of change must be to meet the required standard. For example, the Level 1 standard from Project Oracle requires a theory of change and an evaluation plan (2018[46]). Similarly, the Level 1 standard from Nesta (2013[47]) stipulates that the intervention should specify what it does and why it matters in a logical, coherent and convincing way. Nesta notes that this standard represents a low threshold, appropriate for early stage innovations, which may still be at the idea stage.

Around half of the approaches go further in stipulating that the theory of change needs to be explicitly based on scientific theory and/or evidence. These approaches include the Canadian Best Practices Portal, the EU-Compass for Action on Mental Health and Well-being, Blueprints and SUPERU. Blueprints, for example, stipulates that for an intervention to meet the ‘Promising Programs’ category it must clearly identify the outcome the programme is designed to change and the specific risk and/or protective factors targeted to produce this change in outcome (Blueprints for Healthy Youth Development, 2015[48]). The EU-Compass for Action on Mental Health and Well-being (2017[49]) stipulates that for an intervention to be considered ‘evidence and theory based’ it must be built on a well-founded programme theory which is evidence based, with the effective elements in the intervention stated and justified.

A number of standards go further in providing detailed criteria against which an intervention’s theory of change could be rated. These criteria also facilitate comparisons between different interventions according to the quality of their theory of change. These approaches include the Early Intervention Foundation (2018[32]), the Green List Prevention (2011[50]), and the EMCDDA’s European drug prevention quality standards (2011[9]). For example, the Green List Prevention has a number of criteria for Conceptual Quality, described in Box 3.8.

A small number of approaches turn these criteria into a numerical scale against which a theory of change is assessed. The approaches differ according to whether they look at the evidence underpinning a discrete intervention, such as the Office of Juvenile Justice and Delinquency Prevention Model Programs Guide, or whether they assess systematic review evidence underpinning a practice, such as the What Works Centre for Crime Reduction (2017[51]). Further details of these two contrasting approaches are in Box 3.9.

Only some of the approaches stipulate that the theory of change should be accompanied by a logic model. The majority of these stipulate that a logic model is necessary but do not provide detailed guidance on what it should contain. For example, SUPERU (2017[52]) has a category for pilot initiatives which have ‘a plausible and evidence-based logic model or theory of change that describes what the intervention is, what change it hopes to achieve and for whom, and how the intervention is supposed to work (how its activities will cause change)’.

A small number of standards go further in providing detailed criteria against which a logic model could be assessed. These include the Society for Prevention Research Standards (2015[53]), which stipulate that the intervention must be described at a level that would allow others to implement or replicate it, including the content of the intervention, the characteristics and training of the providers, the characteristics and methods for engagement of participants, and the organisational system that delivered the intervention. The EMCDDA’s European drug prevention quality standards also provide very detailed criteria concerning logic models and the description of the intervention, described in Box 3.10.

The Office of Juvenile Justice and Delinquency Prevention Model Programs Guide was unique amongst the approaches in turning the detailed criteria into a numerical scale against which the programme logic model could be assessed as described in Box 3.11.

Standards concerning the design and development of policies and programmes focus on evidence that tests the feasibility of delivering a policy in practice. At the design and development stage, analysts are often doing important work in testing theories of change and logic models, carrying out process evaluations and pre/post studies.

Most of the approaches at this stage do not attempt to assess the causal impact of an intervention. Instead, standards concerning design and development aim to identify promising interventions that may be suitable for, or merit further investigation through, efficacy testing at a later stage. Efficacy studies (discussed in the next chapter) are complex, time-consuming and expensive to carry out, especially where the collection of new data is required. Therefore, feasibility and pilot studies are an important way of providing information with which to make programme refinements and to inform the design of efficacy studies.

Thirty approaches recognise a phase of design and development of policies and programmes. Most of the approaches categorise these interventions using descriptions such as “emerging”, “delivery and monitoring”, “exploration and development”, or “probable effectiveness”. The approaches can be divided into two broad categories: those establishing the feasibility of an intervention and those focused on piloting the outcomes of the intervention.

A feasibility study typically evaluates whether a range of activities in an intervention, or key components of an intervention’s logic model (including its resources, activities and population reach), are practical and achievable (see Box 3.12). This allows researchers to investigate whether an intervention can work by systematically testing the intervention’s progress towards its intended outputs as it is being implemented (Early Intervention Foundation, 2019[55]).

Feasibility studies can use a variety of quantitative methods (to determine whether the intervention is reaching its delivery and recruitment targets), and qualitative research (to understand the views of the intervention’s recipients and whether these views are consistent with the intervention).

Some of the approaches that recognise qualitative methods at this stage are SUPERU (NZ), What Works for Health (USA), and the Agency for Healthcare Research and Quality (USA). For instance, SUPERU (2017[52]) includes personal experiences from individuals participating in the intervention, gathered through interviews, case studies, and ethnographic research. What Works for Health (USA) recognises studies that describe the intervention, and studies that ask respondents or experts about the intervention (e.g. descriptive, anecdotal, expert opinion). Finally, the Agency for Healthcare Research and Quality (USA) (2012[56]) includes non-comparative case studies or anecdotal reports in its “suggestive” category.

A pilot study is a preliminary and often small-scale investigation conducted to assess the feasibility of the methods to be used in a larger and more rigorous evaluation study. These studies may also focus on which measures are most appropriate for testing the target outcomes (Early Intervention Foundation, 2019[55]).

A variety of different approaches are used to provide preliminary support for programme outcomes. These include administrative data, pre/post-test designs and correlational analysis. Most of the approaches, such as Project Oracle (UK) (2018[46]), the European Platform for Investing in Children (EU), and What Works for Health (USA), agree on the use of pre/post-test designs at this stage. For instance, the European Platform for Investing in Children (2017[57]) recognises evaluations using at a minimum a pre/post design with appropriate statistical adjustments, and What Works for Health (2010[58]) considers studies that compare outcomes before and after an intervention and include a statistical analysis. The Green List Prevention (Germany) includes benchmark or non-references-studies (Groeger-Roth and Hasenpusch, 2011[50]), as described in detail in Box 3.13.

Design and development standards consider several recommendations regarding the study sample, including its representativeness, the sample design, the sample size, and processes for dealing with study drop out.

Representativeness of the sample. Most of the approaches demand study samples that accurately represent the target population and will be relevant to the research question.

Sampling approach. Most approaches specify that the sampling approach should be well-defined and mention its restrictions. For instance, the Housing Associations' Charitable Trust (UK) (2016[59]) mentions that the sampling design should include the setting and location where the data are planned to be collected; and a comprehensive description of the eligibility criteria used to select the study participants and the recruitment methods.

Sample size. Some of the approaches specify a minimum sample size threshold required for the research, but the threshold can be set at different sizes. For instance, the European Platform for Investing in Children (EU) (2017[57]) requires a sample size of at least 20 in each study group. Another example is Project Oracle (2018[46]), which considers that a reasonable sample is at least 30 individuals.

Study drop out. Most of the approaches recognise the relevance of study drop out, but most of them do not specify any rate beyond which the strength of evidence is compromised. For example, the Clearinghouse for Labor Evaluation and Research (USA) asks if the researchers took steps to reduce study drop out to resolve these issues (2014[60]). Only a few approaches specify acceptable rates of drop out. The EIF (UK) (2018[32]) recommends that overall study attrition should not be higher than 40% (i.e. with at least 60% of the sample retained).
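The sample size and drop out thresholds quoted above can be expressed as a simple screening check. The following is a minimal sketch only: the function name and the example figures are hypothetical, and combining the European Platform for Investing in Children minimum (20 per study group) with the EIF attrition ceiling (40%) in a single check is an illustrative assumption, not a rule stated by either standard.

```python
def meets_minimums(group_sizes, n_enrolled, n_retained,
                   min_per_group=20, max_attrition=0.40):
    """Screen a study against a minimum group size and a maximum attrition rate.

    Defaults reuse thresholds quoted in the text (20 per group; 40% attrition);
    pairing them in one check is an assumption made for illustration.
    """
    smallest_group_ok = min(group_sizes) >= min_per_group
    attrition = 1 - n_retained / n_enrolled
    return smallest_group_ok and attrition <= max_attrition

# Hypothetical study: two groups of 25 and 28, with 35 of 53 participants retained.
print(meets_minimums([25, 28], n_enrolled=53, n_retained=35))
# Hypothetical study with one group below the minimum size.
print(meets_minimums([18, 30], n_enrolled=48, n_retained=40))
```

Setting `min_per_group=30` would instead reflect the Project Oracle figure quoted above.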

Most design and development standards stipulate that an evaluation must use valid and reliable measurements. There are some differences between the specifications across standards about how to specify the validity, reliability, and the independence of measurement. See further information in Box 3.14.

Reliability and validity. Most of the approaches stipulate that measurements should be valid and reliable measures of an outcome. For instance, SUPERU (2017[52]) specifies that the evaluation should use valid and reliable methods and measurement tools that are appropriate for participants and relevant to what the intervention is trying to achieve (See Box 3.15). Project Oracle (2018[46]) also stipulates that valid and reliable measurement tools have been used that are appropriate for the participants in the research.

Other standards provide further specification on the technical requirements that the measurements should meet, such as the European Drug Prevention Standards (2011[9]), which requires that measures should demonstrate internal consistency, test-retest, inter-rater reliability; and construct validity.

Independence from the intervention. Some approaches request independence of a measurement from participants and data collectors. For instance, Clearinghouse for Labor Evaluation and Research (USA) (2014[63]) indicates that data collection must reflect methods that produce unbiased results, such as independence and objectivity of the outcomes from the research team. The European Drug Prevention Standards (EU) (2011[9]) also agrees that measures must produce results independently of who uses the instrument.

Design and development standards highlight the importance of well executed and described analysis, which covers data collection, hypothesis testing, and methods to address missing data or other sources of bias.

Data collected. Most approaches stipulate that a complete report should be able to explain and justify why and how the analysis was conducted. For instance, Housing Associations' Charitable Trust (2016[59]) requires the study protocol, recording of any deviations, and a structured report of findings.

Hypothesis testing. Most of the approaches stipulate a clear description of the analysis methods selected to test the research question. For example, Clearinghouse for Labor Evaluation and Research (USA) (2014[63]) demands analysis methods that are very well-described, relevant to the research question, sufficiently rigorous, and correctly executed.

Missing data. Many of the approaches consider issues of missing data, with most requesting that the analysis specifies how these issues were managed, and how this could affect the interpretation of the findings. For instance, Project Oracle (2018[46]) asks if the research provides all the details concerning the data analysis, or any weaknesses of the design, and their impact on the results. Another example is European Drug Prevention Standards (EU) (2011[9]), which requests reporting and appropriate handling of missing data.

Design and development standards request coherence between the programme’s theory of change, the data analysis, and findings. Some of the approaches go further and specify that findings should be statistically significant on at least one of the outcomes and should show no harmful effects. Other approaches define the findings in this stage as unclear/undetermined effects.

Statistical significance. Among the approaches dealing with quantitative research, there are variations in the information required about statistical significance. Some approaches only recommend that the results be tested for statistical significance, such as Project Oracle (UK) (2018[46]). Other approaches stipulate that the findings must be significant. For instance, What Works for Health (USA) (2010[58]) scores pre/post studies with statistically significant favourable findings higher, and Evidence for ESSA (USA) (Every Student Succeeds Act - ESSA, 2019[34]) requires findings of a statistically significant effect for correlational studies. Some other approaches require a specific level of significance, such as European Platform for Investing in Children (EU) (2017[57]), which asks for positive results at the 10% significance level.

No harmful effects. Many of the approaches expect that the intervention does not constitute a risk of harm. For example, the Centers for Disease Control and Prevention suggest that studies should indicate any negative effect (Puddy and Wilkins, 2011[64]).

Unclear/undetermined effects. Most of the approaches accept unclear or undetermined effects given the type of evidence and the rigour of the study design. For example, SUPERU (NZ) (2017[52]) mentions that at this stage an evaluation (pre/post study) indicates some effect, but it may not yet be possible to directly attribute outcomes to it. Another example is the Housing Associations' Charitable Trust (UK) (2016[59]), where the lack of a good design limits any conclusion of causality.

Subgroup analysis. Only a few approaches discuss subgroup analysis to verify for whom the effects are claimed. For example, EIF (UK) (2018[32]) stipulates that subgroup analysis is used to verify for whom the intervention is effective and under what conditions. The Clearinghouse for Labor Evaluation and Research (USA) (2014[60]) discusses whether the sample analysis allows the results to be generalised to a wider population, or whether the limitations of this inference are presented.

Once an intervention has been identified as ‘promising’ in preliminary research, many standards of evidence emphasise the need for rigorous efficacy testing. Efficacy studies typically privilege internal validity, which pertains to inferences about whether the observed correlation between the intervention and outcomes reflects an underlying causal relationship (Society for Prevention Research Standards of Evidence, 2015[53]). In order to maintain high internal validity, efficacy trials often test an intervention under ‘ideal’ circumstances. This can include a high degree of support from the intervention developer and strict eligibility criteria, thus limiting the study to a single population of interest.

A critical goal of standards of evidence is to facilitate the communication of which policies and programmes are efficacious. A statement of efficacy should be of the form that Intervention X is efficacious for producing Y outcomes for Z population at time T in setting S (Society for Prevention Research Standards of Evidence, 2015[53]). In order to maintain high internal validity, efficacy trials often test an intervention under ‘ideal’ circumstances, and tell us little about the impact of an intervention in ‘real world’ conditions, because the evaluation is often overseen by the developer of the policy or programme, with a carefully selected sample. This can include a high degree of support from the intervention developer and strict eligibility criteria, thus limiting the study to a single population of interest.

Therefore, standards of evidence often stipulate that a policy or programme demonstrates effectiveness, in studies where no more support is given than would be typical in ‘real world’ situations. This requires flexibility in evaluation design to address cultural, ethical and practice challenges. Systematic reviews, observational studies and participatory evaluations which gather attitudinal and experiential considerations from the main beneficiaries can still be considered useful evidence and guide improvements in the design or implementation of the intervention.

Determining the efficacy of an intervention is a complex process, involving considerations on the evaluation design, sample, measurements, methods of analysis, and findings. There is a wide variety of standards that an evaluation must meet for an intervention to be deemed efficacious.

All the standards consider Randomized Control Trials (RCTs) as an appropriate study design to generate a counterfactual as the basis for making efficacy claims. However, there is wide variation across standards regarding Quasi-Experimental Designs (QEDs). Some approaches consider that QEDs can be used to generate samples comparable to those of RCTs, whereas other standards only recognise that QEDs are better than pre/post studies.

Sixteen of the approaches privilege the use of RCTs over QEDs. For example, Nest What Works for Kids (Australia) (2012[1]) ranks programmes or policies with well-implemented RCTs in the highest levels. Evidence for Every Student Succeeds Act (USA) (Every Student Succeeds Act - ESSA, 2019[34]) defines a programme or policy as Strong evidence when it has at least one well-designed and well-implemented RCT.

Thirty of the approaches consider both RCTs and QEDs as robust evaluation designs to support causal inference. For instance, the European Platform for Investing in Children (EU) (2017[57]) defines both evaluation designs as methodologies that can be used to construct convincing comparison groups to identify policy impacts. Other approaches that treat suitably designed RCTs and QEDs as equivalent are SUPERU (NZ) (2017[52]) and the Green List Prevention (Germany) (Groeger-Roth and Hasenpusch, 2011[50]).

Among the approaches that accept QEDs, some distinguish between how rigorous different types of design are, such as difference in difference (DD), propensity score matching (PSM), and Regression Discontinuity Designs (RDD). A few approaches also provide a score according to the rigour and limitations of QEDs. For example, What Works Centre for Local Economic Growth (UK) (2016[65]) presents a guide for scoring evidence using the Maryland Scientific Methods Scale to evaluate the different types of design, from PSM, Panel Methods, DD and RDD to Instrumental variables (IV) (see Box 3.16).
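To make the difference in difference logic concrete, the following minimal sketch computes a DD estimate from four group means. The function name and all figures are hypothetical and are not drawn from any of the standards reviewed here.

```python
def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Return the DD estimate: change in the treated group minus change in the comparison group."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change

# Hypothetical mean outcomes (e.g. an employment rate in %) before and after a programme.
estimate = difference_in_differences(
    treated_pre=50.0, treated_post=58.0,
    control_pre=49.0, control_post=52.0,
)
print(estimate)
```

The estimate can only be read as a causal effect under the assumption that, without the intervention, both groups would have followed parallel trends.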

Many approaches recognise that, whilst RCTs might in theory present the ‘gold standard’ in reducing threats to internal validity, in practice randomisation might not be practicable for a range of policy challenges, including ethical concerns. In the health policy area, the famous “Rand experiment”, which allowed for computing the price elasticity of the demand for health, could probably not be replicated today (Newhouse J.P., 1993[66]). For example, in OECD’s work on Regional Development Policy (OECD, 2017[67]), it is recognised that randomisation is not always possible, and quasi-experimental designs can be used as an alternative method for identifying causal effects. In addition, the development of econometric methods, including the use of Difference in Differences with instrumental variables, has helped to diffuse the use of alternative quasi-experimental approaches to producing reliable estimates.

Intention to treat (ITT). Although the importance of ITT in the academic field is well-established (Hollis and Campbell, 1999[68]), there is variation within the approaches concerning their treatment of ITT. Some of them clearly request in their criteria that analysis must be based on ITT. For instance, What Works Centre for Children’s Social Care (UK) (2018[69]) establishes that an acceptable quality study must have an intent-to-treat design. Other examples are Social Programmes that Work (USA) (2019[70]), which stipulates an ITT approach for the intervention group, and Child Trends (USA) (2018[71]), which affirms that only results based on an intent-to-treat analysis can be reported.
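The distinction that ITT draws can be illustrated with a small sketch comparing an ITT estimate (participants analysed by original assignment) with a per-protocol estimate (only those who actually took part). The records, function names and figures are hypothetical.

```python
from statistics import mean

# Hypothetical records: (assigned_group, received_intervention, outcome).
participants = [
    ("treatment", True, 12.0),
    ("treatment", True, 14.0),
    ("treatment", False, 8.0),   # assigned to treatment but did not take part
    ("treatment", False, 6.0),
    ("control", False, 9.0),
    ("control", False, 7.0),
    ("control", False, 8.0),
    ("control", False, 8.0),
]

def itt_effect(records):
    """Difference in mean outcomes by original assignment, ignoring actual take-up."""
    treated = [y for group, _, y in records if group == "treatment"]
    control = [y for group, _, y in records if group == "control"]
    return mean(treated) - mean(control)

def per_protocol_effect(records):
    """Difference in means using only treatment-group members who took part (not ITT)."""
    treated = [y for group, received, y in records if group == "treatment" and received]
    control = [y for group, _, y in records if group == "control"]
    return mean(treated) - mean(control)

print(itt_effect(participants), per_protocol_effect(participants))
```

In this made-up example the per-protocol estimate is larger than the ITT estimate, because dropping non-participants removes the comparability that random assignment created.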

Most standards present clear conditions regarding the nature of the sample required in providing an appropriate basis for the analysis. The standards specify issues concerning a baseline equivalence, attrition, and risks of contamination.

Baseline equivalence. Some of the standards focus on baseline characteristics of the treatment and comparison groups before running a programme or policy. For instance, Nest What Works for Kids (AU) (2012[1]) requests clear analysis of baseline characteristics, and Social Programmes that Work (USA) stipulates that the intervention and control groups must be highly similar in key characteristics prior to the intervention (Coalition for Evidence-Based Policy, 2010[2]).

Other standards treat baseline equivalence differently according to whether the study design is an RCT or QED. Evidence Based Teen Pregnancy Programs stipulates that an RCT must control for statistically significant baseline differences and QEDs must establish baseline equivalence of research groups and control for baseline outcome measures (Mathematica Policy Research, 2016[33]).

Attrition. Some of the standards recognise an attrition threshold. For instance, the European Platform for Investing in Children (EU) (2017[57]) states that attrition must be less than 25% or that it has been accounted for using an acceptable procedure. Another example is the Clearinghouse for Military Family Readiness (USA) (2012[72]), which stipulates attrition of less than 10% at immediate post-test.

Other standards also stipulate conditions for overall and differential attrition. For example, Dartington Service Design Lab requests no evidence of significant differential attrition (Graham Allen, 2011[73]). Other approaches go further in stipulating specific attrition thresholds. For instance, What Works Clearinghouse (USA) (2020[74]) specifies that for studies with a relatively low overall attrition rate of 10%, a rate of differential attrition up to approximately 6% is acceptable. For studies with a higher overall attrition rate of 30%, a lower rate of differential attrition, at approximately 4%, is acceptable.
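Overall and differential attrition can be computed directly from the numbers assigned to and analysed in each arm. The sketch below is illustrative only: the function name and trial figures are hypothetical, and it does not reproduce the full attrition boundary of any standard, only the arithmetic behind the two rates.

```python
def attrition_rates(assigned_t, analysed_t, assigned_c, analysed_c):
    """Return (overall, differential) attrition as proportions between 0 and 1."""
    attrition_t = 1 - analysed_t / assigned_t
    attrition_c = 1 - analysed_c / assigned_c
    overall = 1 - (analysed_t + analysed_c) / (assigned_t + assigned_c)
    differential = abs(attrition_t - attrition_c)
    return overall, differential

# Hypothetical trial: 200 participants assigned to each arm.
overall, differential = attrition_rates(200, 184, 200, 176)
print(f"overall attrition: {overall:.0%}, differential attrition: {differential:.0%}")
```

In this hypothetical trial, overall attrition is about 10% and differential attrition about 4 percentage points, which sits below the approximately 6% quoted above for that level of overall attrition.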

Risk of contamination. Only a few standards highlight the issues around risk of contamination. For example, What Works Centre for Local Economic Growth (UK) (2016[65]) stipulates no occurrence of contamination of the control group by the treatment. Crime Solutions (2013[54]) assesses the degree to which internal validity is threatened, among other aspects, by contamination.

Some efficacy standards stipulate that evaluations must use valid and reliable measurements. In general, these standards tend to be broadly equivalent to those already discussed at the design and development phase. For example, Blueprints for Healthy Youth Development (2015[48]) demands use of valid and reliable measures, and California Evidence-Based Clearinghouse for Child Welfare (2019[75]) provides a measurement tools rating scale based on the level of psychometrics (e.g., sensitivity and specificity, reliability and validity) in peer review studies using QEDs or RCTs.

A few approaches go further and recommend that measurement be independent of the participants of an intervention. For instance, EIF (UK) (2018[32]) requests that measurements are blind to group assignment if possible. The European Monitoring Centre for Drugs and Drug Addiction (EU) (2011[9]) specifies that an instrument is objective if it produces results independently of who uses the instrument to take measurements. The Dartington Service Design Lab (UK) stipulates that outcome measures must not depend on the unique content of the intervention, and that they are not rated solely by the person or people delivering the intervention (Graham Allen, 2011[73]).

A few of the approaches provide details on the appropriate analysis required in order to establish the efficacy of a policy. These standards focus on establishing baseline conditions, and the analysis of the effects at the correct level of assignment.

Baseline conditions. Most of the approaches require that evaluations use statistical models to control for baseline differences between treatment and control groups. For instance, Strengthening Families Evidence Review (USA) (2019[76]) requests statistical adjustment when treatment and comparison groups are not equivalent. Another example is HomVEE (USA) (2018[77]), which requests that the analysis should control for differences in baseline characteristics and baseline measures.

Level of analysis. Only a few of the standards demand that the analysis be appropriate according to whether the assignment is at the individual or group (or cluster) level. For instance, Evidence for Every Student Succeeds Act (USA) (Every Student Succeeds Act - ESSA, 2019[34]) stipulates that clustered designs must use Hierarchical Linear Modelling (HLM), or other methods accounting for clustering. A second example is the Society for Prevention Research (Society for Prevention Research Standards of Evidence, 2015[53]), which specifies that the analysis must assess the treatment effect at the level at which randomization took place.
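One simple way of analysing effects at the level of assignment is to reduce each cluster to a single summary value before comparing arms. The sketch below illustrates this cluster-summary approach; it is not Hierarchical Linear Modelling, and the school names, scores and function name are hypothetical.

```python
from statistics import mean

# Hypothetical pupil scores grouped by school; schools (not pupils) were randomised.
schools = {
    "school_a": ("treatment", [62.0, 66.0, 70.0]),
    "school_b": ("treatment", [58.0, 60.0, 62.0]),
    "school_c": ("control", [55.0, 57.0, 59.0]),
    "school_d": ("control", [59.0, 61.0, 63.0]),
}

def cluster_level_effect(clusters):
    """Compare arms using one mean per cluster, matching the unit of assignment."""
    arm_means = {"treatment": [], "control": []}
    for arm, scores in clusters.values():
        arm_means[arm].append(mean(scores))
    return mean(arm_means["treatment"]) - mean(arm_means["control"])

print(cluster_level_effect(schools))
```

Because each school contributes one observation, any significance test built on these summaries would be based on four units rather than twelve pupils, which is the point of matching the analysis to the level of randomisation.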

Most of the standards focus on impact effects and their statistical significance, whereas other standards also request effect size measures.

Statistical significance. Across the standards, the majority demand information regarding whether effects are statistically significant. For instance, Clearinghouse for Military Family Readiness (2012[72]), in the Promising Programme category, sets specific conditions for significant effects: two-tailed tests of significance are preferable to one-tailed tests.
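The practical difference between one-tailed and two-tailed tests can be seen in a short sketch. It assumes a test statistic that is approximately standard normal; the function name and the example value are hypothetical.

```python
import math

def z_test_p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z statistic under a standard normal."""
    one_tailed = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z), upper-tail test
    two_tailed = math.erfc(abs(z) / math.sqrt(2))    # P(|Z| >= |z|)
    return one_tailed, two_tailed

# Hypothetical test statistic from an evaluation.
one_tailed, two_tailed = z_test_p_values(1.8)
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

For this hypothetical statistic, the result would pass a 5% threshold under a one-tailed test but not under a two-tailed test, which is why a preference for two-tailed tests amounts to a stricter requirement.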

Impact effects. Most of the standards claim that an intervention is efficacious when the findings of an intervention are positive and significant, and there is no evidence of harmful effects. Other standards request reporting of mixed effects or null effects. The standards may present these criteria as part of a single ranking, or independently with a ranking solely focused on impact.

  • Forty-one of the standards consider positive impact effects to claim efficacy. For instance, Be you (AU) (2020[78]), the new integrated national initiative of the Australian government to promote mental health from early years through evidence-based, flexible online professional learning, complemented by a range of tools and resources to turn learning into action (Early Childhood Australia, 2020[79]), requests that a programme have at least one research or evaluation study which demonstrates a positive impact on mental health outcomes for children or young people.

  • Nineteen of the standards also request reporting on whether there are harmful effects. For example, EU-Compass for Action on Mental Health and Well-being (EU) (European Commission - Directorate-General for Health and Food Safety, 2017[49]) requires that the evaluation outcomes demonstrate beneficial impact, and that possible negative effects be identified and stated.

  • Eighteen of the standards consider mixed effects or null effects. For example, EIF (UK) (2018[32]) has a No effect (NE) level. This level is reserved for programmes where there is evidence from a high-quality evaluation that the programme did not provide significant benefits for children.

Some of the standards present an independent ranking or score to assess impact. For instance, Dartington Service Design Lab (UK) evaluates impact according to interventions with positive effect size, and no harmful effects or negative side-effects of intervention (Graham Allen, 2011[73]). Another example is Evidence Based Teen Pregnancy Programs (USA), which classifies the programme evidence as positive, mixed, indeterminate or negative (Mathematica Policy Research, 2016[33]).

Magnitude of the findings. Some standards recognise the importance of reporting effect size. For example, Blueprints (2015[48]) stipulates that effect sizes should be reported, along with the significance levels of those differences, or that it should be possible to calculate the effect size from the data reported (means and standard deviations).

Other standards establish an effect size threshold. For instance, the European Platform for Investing in Children (EU) (2017[57]) stipulates an effect size of at least 0.1 of a standard deviation. A second example is Best Evidence Encyclopaedia (USA), which assesses an intervention by the sample size and effect size (Johns Hopkins University School of Education’s Center for Data-Driven Reform in Education - CDDRE[80]), in the following order:

  • Moderate evidence level: requires studies with a weighted mean effect size of at least +0.20;

  • Limited evidence level: studies meet the criteria except that the weighted mean effect size is +0.10 to +0.19, or the weighted mean effect size is at least +0.20 but the studies are insufficient in number or sample size.
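As an illustration of how an effect size can be calculated from reported means and standard deviations and then compared with thresholds of this kind, the sketch below computes a standardised mean difference (Cohen's d with a pooled standard deviation). The function names and figures are hypothetical, and the banding function is a simplification: it only checks the effect size cut-offs quoted above and ignores the accompanying conditions on the number and size of studies.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def effect_size_band(effect_size):
    """Place an effect size in the bands quoted in the text (simplified, effect size only)."""
    if effect_size >= 0.20:
        return "at least +0.20"
    if effect_size >= 0.10:
        return "+0.10 to +0.19"
    return "below +0.10"

# Hypothetical study: test scores in the intervention and comparison groups.
d = cohens_d(mean_t=105.0, mean_c=100.0, sd_t=20.0, sd_c=20.0, n_t=50, n_c=50)
print(round(d, 2), effect_size_band(d))
```

With these made-up figures the effect size is 0.25 standard deviations, which would fall in the highest of the three bands.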

Efficacy trials often tell us little about the impact of an intervention in ‘real world’ conditions, because the evaluation is often overseen by the developer of the policy or programme, with a carefully selected sample. What are the benefits or damages, independently from the policy goals? Therefore, standards of evidence often stipulate that a policy or programme demonstrates effectiveness, in studies where no more support is given than would be typical in ‘real world’ situations.

Demonstrating effectiveness of a policy or programme in ‘real world’ situations requires flexibility in evaluation design to address cultural, ethical and practice challenges. During policy implementation, evidence is useful to understand for whom it works and for whom it does not work. Therefore it is important to learn how to maximize benefits and minimize damages, also within a no policy change scenario.

For the majority of standards, in order for an intervention to claim effectiveness, the evaluations should meet all of the conditions of efficacy studies discussed in the previous section as well as the following criteria: generalizability of the findings, long term impacts, positive average effect across studies and no reliable iatrogenic effect observed on important outcomes.

Generalizability. In order to translate the findings of efficacy evaluations to a wider range of populations and settings, the standards concerning effectiveness stipulate that the generalizability of intervention effects should be tested across the following dimensions: replication and population subgroup analysis.

Replication. Most of the standards typically request two or more RCTs or QEDs conducted in different locations. Some of the standards consider that, before an intervention is judged as effective and ready for scaling up, data collection and analysis should be carried out by an independent evaluator who does not have any involvement with the developer of the intervention. For instance, the Clearinghouse for Military Family Readiness (USA) (2012[72]) requests at least one replication involving an external implementation team at a different site. The National Dropout Prevention Center (USA) (2019[81]) gives the highest score to programmes that were evaluated using an experimental or strong quasi-experimental design conducted by an external evaluation team. Further details on the approach adopted by Blueprints are described in Box 3.17.

Other standards do not necessarily request or establish any “independence” condition for a replication. Most of them specify the number of evaluations of the intervention and pay attention to the transferability of the findings to different contexts. For instance, CEBC (USA) (The California Evidence-Based Clearinghouse for Child Welfare, 2019[75]) requests at least two RCTs in different settings. Another example is Education counts (NZ), which considers the degree of applicability to New Zealand contexts, and the specificity or generalisability of findings (Alton-Lee, 2004[10]).

Population subgroup analysis. Other standards explore generalizability through an analysis of the population subgroups (e.g. race, gender, social class). For instance, Society for Prevention Research Standards of Evidence (2015[53]) requests a statistical analysis of subgroup effects for each important subgroup to which intervention effects are generalized. SUPERU (NZ) (2017[52]) asks for evidence of the impact of the intervention on different subgroups in the target population.

The standards concerning effectiveness assume that the studies take the same robust approach to sampling as already specified in the efficacy section.

Most standards stipulate that effectiveness evaluations must meet the same requirements for measurement as in efficacy standards. Some of them additionally request the independence of the measurements from the participants and from the person delivering the intervention. For instance, EIF (UK) (2018[32]) requests that at least one evaluation use a form of measurement that is independent of the study participants and independent of those who deliver the programme.

Most of the standards stipulate that effectiveness evaluations must meet all the methodological requirements previously discussed in the efficacy standards, including appropriate statistical analysis (e.g. intent-to-treat) and baseline equivalence adjustments.

Most of the standards concerning effectiveness agree on requiring positive average effects and no evidence of negative effects or risk of harm. Other standards also consider the sustainability of the effect in the long term.

  • Positive average effect and no reliable iatrogenic effect observed. Most of the standards agree on requesting positive average effect across studies and reporting no reliable iatrogenic effects observed on important outcomes. For instance, Society for Prevention Research Standards of Evidence (2015[53]) specifies that effectiveness can be claimed only for intervention populations, times, settings, and outcome constructs for which the average effect across all effectiveness studies is positive and for which no reliable iatrogenic effect on an important outcome has been observed. Another example is Dartington Service Design Lab (UK), which requests evidence of a positive effect and an absence of iatrogenic effects from the majority of the studies (Graham Allen, 2011[73]).

  • Other standards adjust the programme’s rating according to the average effects of the studies. These standards recognize a category for each of the possible results found in multiple studies. For example, CEBC (USA) (The California Evidence-Based Clearinghouse for Child Welfare, 2019[75]) presents three categories for the overall weight of evidence from several studies:

    • Well supported category: At least two RCTs have found the benefit of the practice;

    • Evidence Fails to Demonstrate Effect: Two or more RCTs have found the practice has not resulted in improved outcomes;

    • Concerning Practice: The overall weight of evidence suggests the intervention has negative effect.

  • Long term effects. Most of these approaches agree on requesting sustained effects for at least 12 months. A few of them also accept effects sustained for at least six months. For example, the Nest What Works for Kids (AU) (2012[1]) considers different periods of time: for the Supported level, the effect should be maintained at a 6-month follow-up; and for the Well supported level, an effect must be maintained for at least one study at one-year follow-up. Another example is HomVEE (USA) (2018[83]), which evaluates the evidence across diverse outcome domains, such as Duration of Impacts (information on the length of follow-up) and Sustained Impacts (impacts were measured at least one year after programme enrolment).

One important caveat to the standards reviewed in the efficacy and effectiveness chapters is that they primarily originate from traditional approaches to impact evaluation. These can be contrasted with systems-based approaches to evaluation (see Table 3.1). System based approaches to evaluation start from challenges faced when dealing with the open‐ended nature of problems and issues including innovation and the goal complexity of the connected processes (Askim, Hjelmar and Pedersen, 2018[84]; Tõnurist, 2019[85]).

The OECD has also been moving towards a systems approach to public sector innovation and has developed a model to look at innovation activities from an individual, organisational and systemic lens, which can then also feed into approaches to evaluation (Tõnurist, 2019[85]). The tensions between the traditional approaches to impact evaluation and the system based approaches have been further discussed by Tõnurist (2019[85]), and it is acknowledged that integrating these insights would be a useful next step for approaches to standards of evidence.

Measuring effective interventions requires not only evidence of their impacts, but evidence of their cost and value for money. Cost data provide information relevant to the financial planning and sustainable scale-up (Levin and Chisholm, 2016[86]); while a variety of methodological tools look to assess the benefits and costs associated with an intervention.

Positive impacts at a very high price may not be in the interests of governments and citizens. Using economic evidence is important to demonstrate value for money for public programmes in a context of continued fiscal constraints. Increased understanding of interventions that achieve impact at too high a price would enable decision-makers to make more efficient decisions.

A variety of different methodologies are adopted by the existing standards. Some of them focus on reporting the existence of cost or economic evaluations of an intervention. Other standards request in their criteria the presence of cost information and related analysis for a policy or programme, whilst a final set of standards provide detailed guidance on carrying out and interpreting economic evaluations.

Box 3.18 provides a first general description of the different types of economic evaluations, to help understand their complexity and usefulness when a particular organization or government is taking an investment decision.

Along the same lines, the National Academies of Sciences, Engineering, and Medicine provides a decision tree to determine if an intervention is ready for an economic evaluation (see Figure 3.5). According to this tool, the first step corresponds to reviewing the available information about an intervention to determine if it is enough to answer the research question (e.g. are the counterfactuals well defined? are the resources required to implement the intervention known?). At this stage, a researcher or policy maker should be able to conduct a CA. If there is evidence of intervention impacts, the researcher should consider conducting a CEA or CBA, depending on whether the intervention’s impacts can be monetised. If they can be, the researcher should conduct a CBA; otherwise, a CEA would be the best option. Other economic evaluations can be considered, such as QALY or DALY. This will depend on the perspective of the evaluation.
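The decision sequence described above can be sketched as a small function. This is a simplified reading of the paragraph, not a reproduction of Figure 3.5: the function name, parameter names and the "not ready" label are assumptions made for illustration.

```python
def recommended_evaluation(information_sufficient, impacts_evidenced, impacts_monetisable):
    """Suggest a type of economic evaluation following the steps described in the text.

    information_sufficient: counterfactual well defined and required resources known
    impacts_evidenced:      there is evidence of intervention impacts
    impacts_monetisable:    those impacts can be expressed in monetary terms
    """
    if not information_sufficient:
        return "not ready"   # hypothetical label: more information is needed first
    if not impacts_evidenced:
        return "CA"          # cost analysis only
    if impacts_monetisable:
        return "CBA"         # cost benefit analysis
    return "CEA"             # cost effectiveness analysis

# Hypothetical intervention with known costs and evidenced, non-monetisable impacts.
print(recommended_evaluation(True, True, False))
```

The sketch leaves out the other measures mentioned in the text, such as QALY or DALY, since their use depends on the perspective of the evaluation.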

A first step towards being able to carry out economic evaluation is to collect information on the costs of an intervention; most of the approaches (31 of 50) report information in terms of the economic resources in materials and training in an intervention. For instance, Clearinghouse for Military Family Readiness (USA) (2012[72]) presents in the evidence summary information related to the cost of training per participant, and further available information related to the programme implementation.

Other approaches report in the evidence reviews if an intervention or programme has developed an economic evaluation. For example, What Works for Kids (Australia) (2012[1]) asks specifically if a cost benefit study has been undertaken and published. This information is also identified in the evidence portal of What Works Centre for Children’s Social Care (UK) (2018[69]). A further example is Social Programmes that Work (USA), which provides summaries of the available programme’s benefits and costs in their evidence reviews.

Some of the approaches establish in their criteria a cost rating or assessment condition, separately from the quality design, to assess if an intervention provides cost information, or the settings to present this type of information. For instance, EMCDDA (EU) (2011[9]) suggests planning financial requirements in terms of cost estimations for the programme, and a detailed and comprehensive breakdown of costs. Other approaches provide tools to determine the resources required in future interventions or additional activities, such as EEF (See Box 3.19) and the Toolkit produced by the What Works Centre for Crime Reduction (UK) (2017[90]), which distinguish direct and indirect costs, and allow users to make a comparison of costs prior to the intervention being implemented and after the intervention, or in a different context.

By contrast, few of the approaches request a Cost Benefit Analysis or Cost Effectiveness Analysis in their criteria. For instance, EPIC (EU) (2017[57]) assesses whether the programme has been found to be cost-effective/cost-beneficial (i.e. the practice can deliver positive impact at a reasonable cost).

To determine how best to invest public or private resources in social policies, decision makers require the use of economic evaluations to answer relevant questions, such as What does it cost to implement this intervention in a particular context and what are its expected returns? To what extent can these returns be measured in monetary or nonmonetary terms? Who will receive the returns and when? Is this investment a justifiable use of scarce resources relative to other investments? (National Academies of Sciences, Engineering, and Medicine, 2016[89]).

In this section additional standards that are mainly focused on establishing guidelines to evaluate economic evaluations will be introduced, particularly, with regard to criteria relating to cost analysis (CA), cost effectiveness analysis (CEA), and cost benefit analysis (CBA).

The following sub-sections outline the standards for framing the evaluation, identifying the impacts, determining the costs, valuing benefits and costs, and presenting the results.

Most of the approaches stipulate the importance of clearly stating the objectives of an economic evaluation, in light of the information and resources available for a given intervention, in order to establish which evaluation method to use. For instance, the National Academies of Sciences, Engineering, and Medicine (2016[89]) suggest that whether an intervention is ready for economic evaluation depends on the question(s) of interest, the specificity of the intervention, and a well-specified counterfactual condition.

Other approaches specify that the eligibility criteria, delivery setting time, and location must be well described, which relates to some of the issues of theory of change and logic model addressed. For instance, New South Wales Government (2017[91]) refers to the need for a programme logic to identify the issues that a programme is seeking to address; its intended activities and processes; their outputs; and the intended programme outcomes. Another example is the work on Cost benefit analysis and the environment by the OECD (2018[87]) which specifies that it is necessary to mention all the direct and indirect participants involved in the policy, geographical boundary, and extension to wider limits.

Most of the approaches stipulate that the outcomes used in CEA and CBA should come from robust designs (e.g. RCTs or QEDs), so that the impacts attributed to an intervention are unbiased. This builds on the issues of efficacy and effectiveness addressed in Section 0 and 0 (Research design, measurements, sample, potential impacts, and external validity). For instance, the National Academies of Sciences, Engineering, and Medicine (2016[89]) stipulate that CEA or CBA requires not only information on the resources used to implement the intervention, but also credible evidence of impact. Another example is the Vera Institute (USA) (2014[92]), which refers to quantifying the investment’s impacts using evaluations that establish the causal link between an investment and its impacts.

Other approaches also highlight the use of meta-analysis or systematic reviews when multiple impact studies exist for a programme or similar intervention. For example, the Washington State Institute for Public Policy (WSIPP) (USA) uses a meta-analytic approach to identify, screen, and code research studies in its cost-benefit analysis. The WSIPP also adjusts effect sizes according to the methodological quality of the study and the longitudinal linkage (2017[17]).

Developing accurate estimates of the cost of an intervention is one of the main concerns in economic evaluations, and represents an opportunity to improve subsequent programme planning and implementation (Crowley et al., 2018[93]). Accordingly, some of the approaches agree on planning cost data collection, ideally in the early stages of the intervention, through standardized methodologies such as a macro top-down approach or a bottom-up approach (See Box 3.20). Some of them also provide information on tools to facilitate data collection: for instance, CostOut, produced by the Center for Benefit-Cost Studies of Education (Vera Institute, 2014[92]; Crowley et al., 2018[93]), which was designed to simplify the estimation of costs and cost-effectiveness of educational or other social programmes. Other approaches provide information on current practices, such as the analytical report presented by EMCDDA (2017[94]), which compiled initiatives for estimating drug treatment costs across eleven countries, including the US, Australia, Portugal, Italy, and the Czech Republic, and the European Union.

Whatever the methodology chosen to measure the cost of an intervention, the majority of approaches agree on covering as many cost categories as possible (See Box 3.21) to ensure unbiased analysis (Vera Institute, 2014[92]; NSW Government, 2017[91]). For instance, the National Academy of Sciences (2016[89]) not only considers personnel, space, materials, and supplies (in the micro costing method), but also finds it useful to record direct costs, indirect costs (e.g. volunteer time), fixed costs (which do not vary with the number of participants served), and variable costs, particularly when the evaluator is interested in an intervention’s marginal and steady-state (average) costs. In addition to these costs, Crowley et al. (2018[93]) also suggest that the resources needed to support programme adoption, implementation, sustainability, and monitoring should be included in cost estimates.
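
The distinction between fixed and variable costs matters because it separates the average cost per participant from the marginal cost of serving one more participant. The sketch below illustrates this with a simple linear cost structure; the function names and all figures are hypothetical and are not drawn from any of the approaches reviewed.

```python
def total_cost(fixed: float, variable_per_participant: float, n: int) -> float:
    """Total cost of serving n participants under a linear cost structure."""
    return fixed + variable_per_participant * n


def average_cost(fixed: float, variable_per_participant: float, n: int) -> float:
    """Average cost per participant; fixed costs are spread over all n."""
    return total_cost(fixed, variable_per_participant, n) / n


def marginal_cost(fixed: float, variable_per_participant: float, n: int) -> float:
    """Extra cost of serving one more participant beyond n."""
    return (total_cost(fixed, variable_per_participant, n + 1)
            - total_cost(fixed, variable_per_participant, n))


# Hypothetical programme: 10 000 in fixed costs, 50 per participant, 200 served.
avg = average_cost(10_000, 50, 200)    # fixed costs raise the average
marg = marginal_cost(10_000, 50, 200)  # only variable costs matter at the margin
```

In this illustration the average cost is twice the marginal cost, which is why the choice between the two depends on whether the question is about running the programme as a whole or about expanding it.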

Other approaches, such as the WSIPP (USA) (2017[17]), use several strategies in meta-analysis to construct programme cost estimates. Some of their principles are the following:

  • If the programme evaluations they have meta-analysed contain information on the number of “physical resource units” used by the programme, then they summarize those units, and produce an estimate of the average cost.

  • The per-participant programme costs represent the cost of the average person who enters the programme, rather than the cost of a participant who completes the programme.

  • In addition to a per participant cost estimate, they also note the year in which the dollars are denominated.

  • Programmes that involve multiple years of per-participant spending can be expressed in present value using the NPV equation, where the discount factor depends on the year in which the spending occurs.
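
The last principle can be written compactly. The expression below is the standard present-value formula rather than WSIPP's own notation; the symbols are assumptions introduced here, with C_t the per-participant spending in year t, r the annual discount rate, T the number of years of spending, and spending in the first year (t = 0) left undiscounted.

```latex
PV = \sum_{t=0}^{T-1} \frac{C_t}{(1+r)^{t}}
```

For instance, two years of spending of 1 000 each, discounted at a rate of 0.25 chosen only to keep the arithmetic simple, give 1 000 + 1 000 / 1.25 = 1 800.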

After having identified the resources used in an intervention and their outcomes, the approaches typically refer to how costs and benefits should be valued. This will depend on the type of economic evaluation, its purpose, and time horizon.

For CA, some of the approaches consider the market price of a resource as a good approximation for its opportunity cost. For example, CADTH (Canada) recommends that the fees and prices listed in schedules and formularies of Canadian ministries of health be considered as unit-cost measures when calculating costs from the perspective of the public payer (2017[97]). Other approaches suggest that shadow prices can be used as another method for valuing a resource when a market price does not exist (National Academies of Sciences, Engineering, and Medicine, 2016[89]; Crowley et al., 2018[93]). Shadow prices are used to capture the appropriate economic value in terms of willingness to pay: what consumers are willing to forgo to obtain a given benefit or avoid a given cost (Karoly, 2012[98]).

For CBA, most of the approaches agree on the three summary statistics or decision rules in the CBA model (See Box 3.22): Net Present Value (NPV), the Benefit-Cost Ratio (BCR), and the Internal Rate of Return (IRR). In particular, NPV requires that factors such as the discount rate, inflation and time horizon be taken into account (National Academies of Sciences, Engineering, and Medicine, 2016[89]; OECD, 2018[87]; OMB, 1992[99]; CADTH, 2017[97]; Crowley et al., 2018[93]; NSW Government, 2017[91]).
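
The three summary statistics can be illustrated with a short sketch. It uses the textbook definitions and assumes annual benefit and cost streams of equal length with year 0 undiscounted; the function names and the bisection search used for the IRR are choices made for this illustration and are not prescribed by any of the approaches cited.

```python
def present_value(flows, rate):
    """Discount a list of annual flows; flows[0] occurs in year 0."""
    return sum(f / (1 + rate) ** t for t, f in enumerate(flows))


def npv(benefits, costs, rate):
    """Net Present Value: discounted benefits minus discounted costs."""
    return present_value(benefits, rate) - present_value(costs, rate)


def bcr(benefits, costs, rate):
    """Benefit-Cost Ratio: discounted benefits over discounted costs."""
    return present_value(benefits, rate) / present_value(costs, rate)


def irr(benefits, costs, lo=0.0, hi=1.0, tol=1e-9):
    """Internal Rate of Return: the rate at which NPV equals zero.

    Found by bisection; assumes NPV is positive at `lo` and negative at `hi`.
    """
    if npv(benefits, costs, lo) < 0 or npv(benefits, costs, hi) > 0:
        raise ValueError("NPV does not change sign between lo and hi")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(benefits, costs, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical programme: 100 spent in year 0, 110 in benefits in year 1.
costs, benefits = [100, 0], [0, 110]
```

With these hypothetical streams the NPV is positive for discount rates below 10%, the rate at which benefits exactly repay costs, which shows how sensitive the decision rules are to the discount rate chosen.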

Other approaches, such as the Office of Management and Budget (USA) (2018[102]), provide further Guidelines and Discount Rates for Benefit-Cost Analysis of Federal Programmes. These stipulate and update information on the treatment of inflation, Real Discount Rates (a forecast of real interest rates from which the inflation premium has been removed), and Nominal Discount Rates (a forecast of nominal or market interest rates for the calendar year). Additionally, the OECD (2018[100]) provides information about health valuations, covering the value of a statistical life (VSL), used for valuing risks to life, and the value of a (statistical) life-year (VSLY) (See Box 3.23).
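
The relationship between nominal and real discount rates can be made concrete with the standard Fisher relation. The sketch below is a general illustration and does not reproduce OMB's published rates or procedures; the function names and the example figures are assumptions.

```python
def real_rate(nominal: float, inflation: float) -> float:
    """Exact Fisher relation: (1 + nominal) = (1 + real) * (1 + inflation)."""
    return (1 + nominal) / (1 + inflation) - 1


def real_rate_approx(nominal: float, inflation: float) -> float:
    """Common approximation: subtract expected inflation from the nominal rate."""
    return nominal - inflation


# Hypothetical figures: 5% nominal rate, 2% expected inflation.
exact = real_rate(0.05, 0.02)           # slightly below 3%
approx = real_rate_approx(0.05, 0.02)   # about 3%
```

Whichever rate is used, benefits and costs should be expressed consistently: real flows are discounted with a real rate and nominal flows with a nominal rate.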

For CEA, some of the approaches stipulate the need for a comprehensive measure of the intervention’s economic costs from a societal perspective, and the examination of one or more non-monetized outcome(s). However, the use of different units to measure outcomes limits the extent to which they can be aggregated. This problem has been mitigated by the development of measures such as quality-adjusted life years (QALYs), whose use in CEA is also known as cost-utility analysis, or disability-adjusted life years (DALYs) (National Academies of Sciences, Engineering, and Medicine, 2016[89]; Government of Netherlands, 2016[104]). Other approaches, such as NICE (UK) (2013[105]), provide parameters for considering an intervention cost-effective (less than £20,000 per QALY gained, or between £20,000 and £30,000 per QALY gained if certain conditions are satisfied).
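
The thresholds quoted above can be illustrated by computing a cost per QALY gained and placing it in the corresponding band. This is a simplified sketch rather than NICE's appraisal procedure: the incremental framing (extra cost divided by extra QALYs relative to a comparator), the function names and the band labels are assumptions, and only the two threshold values come from the text.

```python
def cost_per_qaly(extra_cost: float, extra_qalys: float) -> float:
    """Incremental cost per QALY gained relative to a comparator."""
    if extra_qalys <= 0:
        raise ValueError("the intervention must gain QALYs for this ratio to apply")
    return extra_cost / extra_qalys


def threshold_band(cost_per_qaly_gained: float) -> str:
    """Place a cost per QALY (in GBP) in the bands quoted in the text."""
    if cost_per_qaly_gained < 20_000:
        return "below 20,000"
    if cost_per_qaly_gained <= 30_000:
        return "20,000 to 30,000 (conditions apply)"
    return "above 30,000"


# Hypothetical intervention: 45 000 in extra cost for 3 extra QALYs.
ratio = cost_per_qaly(45_000, 3)
band = threshold_band(ratio)
```

The hypothetical intervention above costs 15 000 per QALY gained and so falls in the lowest band.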

One of the problems that arise from valuing costs and benefits is double counting. This issue refers to outcomes (benefits or costs) that are inputs for other outcomes or that can be linked to each other. A number of approaches mention the need for precautions to avoid double counting (OECD, 2018[100]; Government of Netherlands, 2016[104]; National Academies of Sciences, Engineering, and Medicine, 2016[89]). For instance, Crowley et al. (2018[93]) suggest that one way to manage double counting is to employ a series of “trumping rules” that isolate developmental pathways to ensure that no double counting occurs. Another example is the Treasury of New Zealand, which provides practical examples concerning double counting (See Box 3.24).

Another problem that can arise when measuring costs and benefits is externalities. This issue refers to goods that, once produced, can be consumed simultaneously by any number of people and from which people cannot be excluded. These can have negative or positive effects (2015[106]). OMB (USA) presents several examples where externalities are addressed, such as in the principle of Willingness-To-Pay: market prices provide an invaluable starting point for measuring costs, but prices sometimes do not adequately reflect the true value of a good to society; hence, the use of shadow prices can adjust for market distortions, such as externalities or taxes (1992[99]). Externalities also lead to the measurement of indirect costs and intangible costs as another way to avoid measurement bias in the evaluation. The Treasury of New Zealand offers another example regarding externalities (See Box 3.24).

Because cost and benefit estimates are made prior to the implementation of the programme, the outcomes of an economic evaluation could take many different possible values, which creates a level of uncertainty. Standards are needed for estimating, resolving, and reporting that uncertainty. Risk, on the other hand, refers to situations where the available information allows the full range of possible outcomes of an event to be estimated in terms of their probabilities (OECD, 2018[100]).

The majority of approaches focus on uncertainty, and request testing the economic projections using a variety of approaches to sensitivity analysis (See Box 3.25). For instance, The Pew Charitable Trusts (USA) (2013[107]) stipulates the need to conduct and report sensitivity analysis, and provide a range of possible outcomes to ensure methodological rigor and transparency from an economic evaluation. Another example is the Regulatory Impact Analysis (RIA) Guidelines by the Government of Ireland (2009[108]), which suggest that any assumptions made in RIAs (and the MCAs and CBAs performed in this context) should be calculated for a variety or range of future values through a sensitivity analysis.

Most of the approaches suggest using Monte Carlo analysis to handle uncertainty. Other approaches propose different practices, such as partial sensitivity analysis and break-even analysis (See Box 3.25). In addition, some approaches stipulate the need for these methods not only to test the robustness of the findings, but also to present estimates with their confidence intervals and standard errors (Crowley et al., 2018[93]; National Academies of Sciences, Engineering, and Medicine, 2016[89]).
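
The three techniques named above can be sketched on a stylised programme with an upfront cost and a constant annual benefit. Everything in this example is hypothetical, including the distributions, parameter ranges and function names; it illustrates the general idea of each technique rather than the method of any particular approach.

```python
import random
import statistics


def npv(annual_benefit, upfront_cost, rate, years):
    """NPV of a constant annual benefit received in years 1..years."""
    pv = sum(annual_benefit / (1 + rate) ** t for t in range(1, years + 1))
    return pv - upfront_cost


def monte_carlo_npv(n_draws=10_000, seed=0):
    """Monte Carlo analysis: draw uncertain inputs and summarise the NPVs."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        benefit = rng.gauss(300.0, 50.0)   # assumed uncertainty in benefits
        rate = rng.uniform(0.02, 0.07)     # assumed range of discount rates
        draws.append(npv(benefit, 1000.0, rate, years=5))
    draws.sort()
    return {
        "mean": statistics.fmean(draws),
        "std_dev": statistics.stdev(draws),
        "p2.5": draws[int(0.025 * n_draws)],    # lower end of a 95% interval
        "p97.5": draws[int(0.975 * n_draws)],   # upper end of a 95% interval
        "share_positive": sum(d > 0 for d in draws) / n_draws,
    }


def partial_sensitivity(rates=(0.02, 0.05, 0.07)):
    """Partial sensitivity analysis: vary one input, hold the others fixed."""
    return {r: npv(300.0, 1000.0, r, years=5) for r in rates}


def break_even_benefit(upfront_cost, rate, years):
    """Break-even analysis: annual benefit at which NPV is exactly zero."""
    annuity_factor = sum(1 / (1 + rate) ** t for t in range(1, years + 1))
    return upfront_cost / annuity_factor
```

Reporting the interval and the share of draws with a positive NPV, rather than a single point estimate, is one way of presenting the range of possible outcomes that the standards call for.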

The procedure for reporting economic evaluation findings depends not only on the type of evaluation conducted, but also on the need to promote transparency and comparability across studies. Standards are expected to provide best practices for reporting a clear record of how the evaluation was conducted and to support the verification of the findings by an independent researcher.

The majority of approaches provide guidelines on how and what to report in an economic evaluation. For instance, the National Academies of Sciences (2016[89]) offer a Checklist of Best Practices for Reporting Economic Evidence according to the methodology implemented (CA, CEA or CBA). Another example is the Government of Netherlands (2016[104]), which provides guidance related to reporting input values, costs, and uncertainty analysis, among others.

Some of the approaches suggest more specific reporting requirements, such as maintaining a common table of inputs and assumptions. For instance, Crowley et al. (2018[93]) recommend, in their Standards for Reporting Findings from Economic Evaluations, implementing a two-tiered reporting system that includes a consumer-focused summary accompanied by a technical description (e.g. included as an appendix) that details the modelling and assumptions made to estimate costs and benefits. Another example is the Vera Institute (USA) (2014[92]), which provides guidance for CBA on how to tabulate results, document the analysis, and interpret the findings (See Figure 3.6).

Only a small number of approaches are concerned with the timing of delivery and the accessibility of the evaluation findings. For instance, the Government of Ireland (2009[108]) discusses questions such as where Regulatory Impact Analyses (RIAs) should be published. Another example is the Evidence-Based Policymaking Collaborative (USA) (2016[109]), which highlights the importance of CBA results being delivered in accessible, concise, and compelling ways, and completed in time to inform decision-makers’ choices. It considers that adopting rigorous, replicable CBA methodologies and making data readily available to conduct analyses can help improve timeliness.

Knowledge of ‘what works’ – of which policies and programmes are effective – is necessary but not sufficient for obtaining outcomes for citizens. Increasingly, there is recognition that ‘implementation matters’: that the quality and level of implementation of an intervention or policy is associated with outcomes for citizens (Durlak, 1998[110]; Durlak and DuPre, 2008[111]).

It is important to understand the features of policies and programmes and of the organisation or entity implementing them, along with the myriad of other factors that are related to the adoption, implementation and sustainability of a policy or programme. This understanding allows practical guidance to be developed to support successful implementation and scale-up efforts. Increased attention to implementation has also been drawn by the work of economists such as Prof. Duflo and the J-PAL network working on development issues. It has been estimated that interventions implemented correctly can achieve effects two or three times greater than interventions where problems with implementation have been experienced (Durlak and DuPre, 2008[111]).

Of the approaches included in the mapping, the majority include some coverage of issues concerning the implementation and scale-up of interventions.

Most of the approaches that cover implementation and scale up are focused on simply providing factual details about the delivery and implementation requirements of an intervention. These approaches include the Australian What Works for Kids, the Canadian Best Practices Portal, Spain’s ‘Prevención basada en la evidencia’ and the Evidence Based Teen Pregnancy Programmes in the USA. Spain’s approach provides information related to the delivery of an intervention, its materials and setting. The Canadian Best Practices Portal is another approach that provides key information about what is required to implement an intervention (Box 3.26).

Other approaches provide more granularity about the implementation requirements of an intervention. The Evidence Based Teen Pregnancy Programs in the USA has a standalone section on implementation, which comprises eight fields including implementation requirements and guidance and allowable adaptations. What Works for Kids also has a standalone section on implementation, which includes the following fields:

  • Training

  • Can training be accessed in Australia?

  • Who delivers the programme?

  • Minimum practitioner qualifications

  • Are there any licensing or accreditation requirements?

  • Is there a manual that describes how to implement the programme?

  • What are the required materials for the trainer?

  • Are specific assessments required prior to implementation?

  • Are particular tools required for implementation?

  • Overall implementation / resourcing issues

  • Is the programme scalable?

  • Comments on the scalability of the intervention

  • Setup costs

  • Ongoing costs

Some of the approaches that cover issues around implementation and scale up are focused on providing and categorising experiences of implementing an intervention. These experiences are typically the findings from process evaluations and qualitative studies. These approaches include The Community Guide, the EMCDDA Best Practice Portal, the EU-Compass for Action on Mental Health and Well-being and HomVEE.

The Community Guide is a resource that helps practitioners and policy makers to improve health and safety in their communities. As part of a ten-step process, it includes details about the applicability of, and barriers to implementation for, the recommended interventions. The EU-Compass for Action on Mental Health and Well-being also includes information on experiences of implementation and is described in Box 3.27. The EMCDDA Best Practice Portal has recently published a new database of programmes for implementation. This includes details of programmes that have been implemented in more than one European country, along with details of experiences of implementation (EMCDDA, 2020[113]).

HomVEE is another approach that provides a summary of ‘Implementation Experiences’ based on the studies included in a review, focusing on:

  • Characteristics of Model Participants,

  • Location and Setting,

  • Staffing and Supervision,

  • Model Components,

  • Model Adaptations or Enhancements,

  • Dosage (Home visits), and

  • Lessons Learned.

A small number of approaches go further in providing detailed criteria against which dissemination readiness and/or system readiness could be assessed. These are features of the intervention or of the organisation or community adopting the intervention that have been shown to be related to adoption, implementation, or sustainability of the intervention (Society for Prevention Research Standards of Evidence, 2015[53]). The purpose of such approaches is to support the implementation and scale-up efforts of evidence-based interventions.

These approaches include the EU-Compass for Action on Mental Health and Well-being, the Green List Prevention, NESTA, Housing Associations' Charitable Trust, Blueprints, and SUPERU. The Green List Prevention includes six criteria to rate the ‘implementation quality’ of an intervention including whether ‘support / technical assistance during implementation is available’ and whether ‘instruments for quality control during the implementation are available’. Blueprints includes five criteria on ‘dissemination readiness’ including that ‘there are explicit processes for ensuring the intervention gets to the right persons’. SUPERU also developed comparable criteria, described in Box 3.28.

A limited number of approaches go further by explicitly scoring the implementation readiness of an intervention. The Evidence Based Teen Pregnancy Programs conducts a detailed assessment of an intervention’s ‘Implementation Readiness’ based on materials and documents about the intervention and its implementation. Based on this assessment, an implementation readiness score is awarded, made up of three component scores: (1) curriculum and materials, (2) training and staff support, and (3) fidelity monitoring tools and resources. The component scores are added together to give a total score, which ranges from 0 to 8, with higher scores indicating the interventions most ready to implement.

References

[56] Agency for Healthcare Research and Quality (2012), What Is the Evidence Rating?, https://innovations.ahrq.gov/help/evidence-rating (accessed on 19 February 2019).

[10] Alton-Lee, A. (2004), Guidelines for Generating a Best Evidence Synthesis Iteration, Ministry of Education New Zealand, http://www.minedu.govt.nz (accessed on 14 February 2019).

[84] Askim, J., U. Hjelmar and L. Pedersen (2018), “Turning Innovation into Evidence-based Policies: Lessons Learned from Free Commune Experiments”, Scandinavian Political Studies, Vol. 41/4, pp. 288-308, http://dx.doi.org/10.1111/1467-9477.12130.

[39] Axford, N. et al. (2005), “Evaluating Children’s Services: Recent Conceptual and Methodological Developments”, British Journal of Social Work, Vol. 35/1, pp. 73-88, http://dx.doi.org/10.1093/bjsw/bch163.

[78] Be You (2020), The Be You Programs Directory, https://beyou.edu.au/resources/tools-and-guides/about-programs-directory (accessed on 25 March 2020).

[41] Better Evaluation (2012), Describe the theory of change, https://www.betterevaluation.org/en/node/5280 (accessed on 20 October 2019).

[48] Blueprints for Health Youth Development (2015), Evidence-Based Programs - Standards of Evidence, https://www.blueprintsprograms.org/resources/Blueprints_Standards_full.pdf (accessed on 15 February 2019).

[82] Blueprints for Health Youth Development (2018), Blueprints Database Standards, https://www.blueprintsprograms.org/resources/Blueprints_Standards_full.pdf (accessed on 15 February 2019).

[103] Bosworth, R., A. Professor and A. Kibria (2017), The Value of a Statistical Life: Economics and Politics, https://strata.org/pdf/2017/vsl-full-report.pdf (accessed on 7 June 2019).

[97] CADTH (2017), Guidelines for the Economic Evaluation of Health Technologies: Canada 4th Edition, https://www.cadth.ca/sites/default/files/pdf/guidelines_for_the_economic_evaluation_of_health_technologies_canada_4th_ed.pdf (accessed on 17 April 2019).

[71] Child Trends (2018), , https://www.childtrends.org/what-works/eligibility-criteria.

[6] Clearinghouse for Labor Evaluation and Research (2017), About CLEAR, https://clear.dol.gov/about (accessed on 8 March 2019).

[63] Clearinghouse for Labor Evaluation and Research (2014), Guidelines for reviewing implementation studies, https://clear.dol.gov/sites/default/files/CLEAR_Operational%20Implementation%20Study%20Guidelines.pdf (accessed on 19 February 2019).

[60] Clearinghouse for Labor Evaluation and Research (2014), Guidelines for reviewing quantitative descriptive studies, https://clear.dol.gov/sites/default/files/CLEAROperationalDescriptiveStudyGuidelines.pdf (accessed on 19 February 2019).

[72] Clearinghouse for Military Family Readiness (2012), Continuum of Evidence, https://militaryfamilies.psu.edu/wp-content/uploads/2017/08/continuum.pdf (accessed on 30 January 2019).

[2] Coalition for Evidence-Based Policy (2010), Checklist For Reviewing a Randomized Controlled Trial of a Social Program or Project, To Assess Whether It Produced Valid Evidence, http://coalition4evidence.org/wp-content/uploads/2010/02/Checklist-For-Reviewing-a-RCT-Jan10.pdf (accessed on 19 February 2019).

[51] College of Policing: What Works Network (2017), Crime Reduction Toolkit, https://whatworks.college.police.uk/toolkit/Pages/Toolkit.aspx (accessed on 30 April 2019).

[5] Crime Solutions (2013), Practices Scoring Instrument, https://www.crimesolutions.gov/pdfs/PracticeScoringInstrument.pdf (accessed on 18 February 2019).

[54] Crime Solutions (2013), Program Scoring Instrument Version 2.0, https://www.crimesolutions.gov/pdfs/program-rating-instrument-v2.0.pdf (accessed on 18 February 2019).

[93] Crowley, D. et al. (2018), “Standards of Evidence for Conducting and Reporting Economic Evaluations in Prevention Science”, Prevention Science, Vol. 19/3, pp. 366-390, http://dx.doi.org/10.1007/s11121-017-0858-1.

[62] Drost, E. (2011), Validity and Reliability in Social Science Research, https://www3.nd.edu/~ggoertz/sgameth/Drost2011.pdf (accessed on 13 February 2019).

[110] Durlak, J. (1998), “Why program implementation is important”, Journal of Prevention & Intervention in the community, Vol. 17/2, pp. 5-18.

[111] Durlak, J. and E. DuPre (2008), “Implementation matters: A review of research on the influence of implementation on program outcomes and the factors affecting implementation”, American journal of community psychology, Vol. 41/3-4, pp. 327-350.

[79] Early Childhood Australia (2020), KidsMatter has become Be You, http://www.earlychildhoodaustralia.org.au/our-work/beyou/ (accessed on 22 April 2020).

[55] Early Intervention Foundation (2019), 10 steps for evaluation success, Early Intervention Foundation, http://dx.doi.org/12345.

[32] Early Intervention Foundation (2018), EIF Guidebook, https://guidebook.eif.org.uk/eif-evidence-standards (accessed on 14 February 2019).

[25] Education Endowment Foundation (2018), Technical appendix and process manual, https://educationendowmentfoundation.org.uk/public/files/Toolkit/Toolkit_Manual_2018.pdf (accessed on 1 February 2019).

[37] EFSA Guidance for those carrying out systematic reviews European Food Safety Authority (2010), “Application of systematic review methodology to food and feed safety assessments to support decision making”, EFSA Journal, Vol. 8/6, p. 1637, http://dx.doi.org/10.2903/j.efsa.2010.1637.

[113] EMCDDA (2020), Xchange prevention registry, http://www.emcdda.europa.eu/best-practice/xchange (accessed on 21 April 2020).

[94] EMCDDA (2017), “Drug treatment expenditure: a methodological overview”, http://www.emcdda.europa.eu/publications/insights/drug-treatment-expenditure-measurement_en.

[40] Epstein, D. and J. Klerman (2012), “When is a Program Ready for Rigorous Impact Evaluation? The Role of a Falsifiable Logic Model”, Evaluation Review, Vol. 36/5, pp. 375-401, http://dx.doi.org/10.1177/0193841X12474275.

[45] European Commission (2018), Guidance Document on Monitoring and Evaluation.

[43] European Commission (2011), “Towards a new system of monitoring and evaluation in EU cohesion policy”, https://ec.europa.eu/regional_policy/sources/docgener/evaluation/doc/performance/outcome_indicators_en.pdf.

[49] European Commission - Directorate-General for Health and Food Safety (2017), Criteria to select best practices in health promotion and chronic disease prevention and management in Europe, https://ec.europa.eu/health/sites/health/files/mental_health/docs/compass_bestpracticescriteria_en.pdf (accessed on 25 January 2019).

[19] European Monitoring Centre for Drugs and Drug Addiction (2020), Best practice portal, http://www.emcdda.europa.eu/best-practice_en (accessed on 27 April 2020).

[9] European Monitoring Centre for Drugs and Drug Addiction (2011), “European drug prevention quality standards”, http://dx.doi.org/10.2810/48879.

[57] European Platform for Investing in Children (2017), Review Criteria and Process, https://ec.europa.eu/social/main.jsp?catId=1246&intPageId=4286&langId=en (accessed on 14 February 2019).

[34] Every Student Succeeds Act - ESSA (2019), Evidence for ESSA: Standards and Procedures, https://content.evidenceforessa.org/sites/default/files/On%20clean%20Word%20doc.pdf (accessed on 18 February 2019).

[109] Evidence-Based Policymaking Collaborative (2016), Evidence Toolkit: Cost Benefit Analysis.

[18] Ferri, M. and P. Griffiths (2015), “Good Practice and Quality Standards”, in Textbook of Addiction Treatment: International Perspectives, Springer Milan, http://dx.doi.org/10.1007/978-88-470-5322-9_64.

[44] Gaffey, V. (2013), “A fresh look at the intervention logic of Structural Funds”, European Commission, http://dx.doi.org/10.1177/1356389013485196.

[42] Ghate, D. (2018), “Developing theories of change for social programmes: co-producing evidence-supported quality improvement”, Palgrave Communications, Vol. 4/1, p. 90, http://dx.doi.org/10.1057/s41599-018-0139-z.

[13] Gough, D., J. Thomas and S. Oliver (2019), Clarifying differences between reviews within evidence ecosystems, BioMed Central Ltd., http://dx.doi.org/10.1186/s13643-019-1089-2.

[3] Gough, D. and H. White (2018), Evidence standards and evidence claims in web based research portals, Centre for Homelessness Impact, https://uploads-ssl.webflow.com/59f07e67422cdf0001904c14/5bfffe39daf9c956d0815519_CFHI_EVIDENCE_STANDARDS_REPORT_V14_WEB.pdf (accessed on 8 March 2019).

[104] Government of Netherlands (2016), Guideline for economic evaluations in healthcare.

[108] Government of Ireland (2009), How to conduct a Regulatory Impact Analysis.

[73] Graham Allen (2011), Early Intervention: The Next Steps An Independent Report to Her Majesty’s Government, http://www.childtrauma.org (accessed on 14 February 2019).

[50] Groeger-Roth, F. and B. Hasenpusch (2011), Green List Prevention: Inclusion-and Rating-Criteria for the CTC Programme-Databank Crime Prevention Council of Lower Saxony, Crime Prevention Council of Lower Saxony, https://www.gruene-liste-praevention.de/communities-that-care/Media/GreenListPrevention_Rating-Criteria.pdf (accessed on 14 February 2019).

[38] Guyatt, G. et al. (2008), “GRADE: an emerging consensus on rating quality of evidence and strength of recommendations”, BMJ, Vol. 336/7650, pp. 924-926, http://dx.doi.org/10.1136/bmj.39489.470347.ad.

[4] Health Evidence (2018), Quality Assessment Tool, https://www.healthevidence.org/documents/our-appraisal-tools/quality-assessment-tool-dictionary-en.pdf (accessed on 8 March 2019).

[68] Hollis, S. and F. Campbell (1999), “What is meant by intention to treat analysis? Survey of published randomised controlled trials”, British Medical Journal, Vol. 319/7211, p. 670, http://dx.doi.org/10.1136/bmj.319.7211.670.

[83] Home Visiting Evidence of Effectiveness (2018), Assessing Evidence of Effectiveness, https://homvee.acf.hhs.gov/Review-Process/4/Assessing-Evidence-of-Effectiveness/19/7 (accessed on 19 February 2019).

[77] Home Visiting Evidence of Effectiveness (2018), Producing Study Ratings, https://homvee.acf.hhs.gov/Review-Process/4/Producing-Study-Ratings/19/5 (accessed on 19 February 2019).

[59] Housing Associations’ Charitable Trust (2016), Standard for Producing Evidence - Effectiveness of Interventions – Part 1: Specification, https://www.hact.org.uk/sites/default/files/StEv2-1-2016%20Effectiveness-Specification.pdf (accessed on 25 January 2019).

[80] Johns Hopkins University School of Education’s Center for Data-Driven Reform in Education - CDDRE (n.d.), Best Evidence Encyclopedia, http://www.bestevidence.org/aboutbee.htm (accessed on 18 February 2019).

[27] Johnson, S., N. Tilley and K. Bowers (2015), “Introducing EMMIE: an evidence rating scale to encourage mixed-method crime prevention synthesis reviews”, Journal of Experimental Criminology, Vol. 11/3, pp. 459-473, http://dx.doi.org/10.1007/s11292-015-9238-7.

[98] Karoly, L. (2012), “Toward Standardization of Benefit-Cost Analysis of Early Childhood Interventions”, Journal of Benefit-Cost Analysis, Vol. 3/1, http://dx.doi.org/10.1515/2152-2812.1085.

[115] Karoly, L. (2010), “Toward Standardization of Benefit-Cost Analyses of Early Childhood Interventions”, SSRN Electronic Journal, http://dx.doi.org/10.2139/ssrn.1753326.

[86] Levin, C. and D. Chisholm (2016), “Cost-Effectiveness and Affordability of Interventions, Policies, and Platforms for the Prevention and Treatment of Mental, Neurological, and Substance Use Disorders”, Mental, Neurological, and Substance Use Disorders: Disease Control Priorities, Vol. 4, http://dx.doi.org/10.1596/978-1-4648-0426-7_ch12.

[95] Lomas, J. et al. (2018), “Which Costs Matter? Costs Included in Economic Evaluation and their Impact on Decision Uncertainty for Stable Coronary Artery Disease”, PharmacoEconomics - Open, Vol. 2/4, pp. 403-413, http://dx.doi.org/10.1007/s41669-018-0068-1.

[33] Mathematica Policy Research (2016), Identifying Programs That Impact Teen Pregnancy, Sexually Transmitted Infections, and Associated Sexual Risk Behaviors, https://tppevidencereview.aspe.hhs.gov/pdfs/TPPER_Review%20Protocol_v5.pdf (accessed on 26 January 2019).

[89] National Academies of Sciences, Engineering, and Medicine (2016), Advancing the Power of Economic Evidence to Inform Investments in Children, Youth, and Families, The National Academies Press, http://dx.doi.org/10.17226/23481.

[112] National Collaborating Centre for Methods and Tools (2010), Effective interventions: The Canadian Best Practices Portal, McMaster University, Hamilton, http://www.nccmt.ca/resources/search/69 (accessed on 15 February 2019).

[81] National Dropout Prevention Center - NDPC (2019), Rating system, http://dropoutprevention.org/mpdb/web/rating-system (accessed on 19 February 2019).

[8] National Implementation Research Network (2018), The Hexagon: An Exploration Tool. Hexagon Discussion & Analysis Tool Instructions, https://implementation.fpg.unc.edu/sites/implementation.fpg.unc.edu/files/resources/NIRN_HexagonTool_11.2.18.pdf (accessed on 19 February 2019).

[105] National Institute for Health and Care Excellence (2013), How NICE measures value for money in relation to public health interventions, https://www.nice.org.uk/Media/Default/guidance/LGB10-Briefing-20150126.pdf (accessed on 1 May 2019).

[1] Nest What works for kids (2012), Rapid Evidence Assessment, http://whatworksforkids.org.au/rapid-evidence-assessment (accessed on 19 February 2019).

[47] NESTA (2013), Standards of evidence: an approach that balances the need for evidence with innovation, https://media.nesta.org.uk/documents/standards_of_evidence.pdf (accessed on 26 February 2019).

[106] New Zealand Treasury (2015), Guide to Social Cost Benefit Analysis - July 2015, New Zealand Treasury, Wellington, http://www.treasury.govt.nz/publications/guidance/planning/costbenefitanalysis/guide/ (accessed on 2 May 2018).

[66] Newhouse, J. (1993), Free for All? Lessons from the RAND Health Insurance Experiment, https://doi.org/10.7249/CB199.

[23] Norwegian Institute of Public Health (2021), Elementer i livsstilstiltak for vektreduksjon blant voksne personer med overvekt eller fedme, https://www.fhi.no/globalassets/dokumenterfiler/rapporter/2021/elementer-i-livsstilstiltak-for-vektreduksjon-blant-voksne-personer-med-overvekt-eller-fedme-rapport-2021-v2.pdf.

[22] Norwegian Institute of Public Health (2020), A systematic and living evidence map on COVID-19, https://www.fhi.no/contentassets/e64790be5d3b4c4abe1f1be25fc862ce/covid-19-evidence-map-protocol-20200403.pdf.

[96] NPC Research & Portland State University’s Center for the Improvement of Child and Family Services (2019), Conduct a Cost Analysis of Your Home Visiting Program, http://www.homevisitcosts.com/organizing-your-data.php (accessed on 22 May 2019).

[91] NSW Government (2017), Guide to Cost-Benefit Analysis, https://www.treasury.nsw.gov.au/sites/default/files/2017-03/TPP17-03%20NSW%20Government%20Guide%20to%20Cost-Benefit%20Analysis%20-%20pdf_0.pdf.

[100] OECD (2018), Cost-Benefit Analysis and the Environment: Further Developments and Policy Use, OECD Publishing, Paris, https://dx.doi.org/10.1787/9789264085169-en.

[87] OECD (2018), “Preface”, in Cost-Benefit Analysis and the Environment: Further Developments and Policy Use, OECD Publishing, Paris, https://dx.doi.org/10.1787/9789264085169-1-en.

[67] OECD (2017), “Making policy evaluation work: The case of regional development policy”, OECD Science, Technology and Industry Policy Papers, https://dx.doi.org/10.1787/c9bb055f-en.

[101] OECD (2006), Cost-Benefit Analysis and the Environment: Recent Developments.

[102] Office of Management and Budget (2018), 2018 Discount Rates for OMB Circular No. A-94.

[11] Oliver, S. et al. (2018), “Approaches to evidence synthesis in international development: a research agenda”, Journal of Development Effectiveness, Vol. 10/3, pp. 305-326, http://dx.doi.org/10.1080/19439342.2018.1478875.

[99] OMB (1992), Guidelines and Discount Rates for Benefit-Cost Analysis of Federal Programs, https://www.whitehouse.gov/sites/whitehouse.gov/files/omb/circulars/A94/a094.pdf (accessed on 17 September 2018).

[20] Oxman, A., J. Lavis and A. Fretheim (2007), “Use of evidence in WHO recommendations”, Lancet, Vol. 369/9576, pp. 1883-1889, http://dx.doi.org/10.1016/S0140-6736(07)60675-8.

[46] Project Oracle Children and Youth Evidence Hub (2018), Validation Guidebook: An overview of Project Oracle’s validation process, https://project-oracle.com/uploads/files/Validation_Guidebook.pdf (accessed on 14 February 2019).

[64] Puddy, R. and N. Wilkins (2011), Understanding Evidence Part 1: Best Available Research Evidence. A Guide to the Continuum of Evidence of Effectiveness, Centers for Disease Control and Prevention, Atlanta, https://www.cdc.gov/violenceprevention/pdf/understanding_evidence-a.pdf (accessed on 18 February 2019).

[90] What Works Centre for Crime Reduction (2017), Crime Reduction Toolkit, College of Policing, https://whatworks.college.police.uk/toolkit/Pages/Toolkit.aspx (accessed on 30 April 2019).

[76] Strengthening Families Evidence Review (2019), Review Process, https://familyreview.acf.hhs.gov/ReviewProcess.aspx?id=3 (accessed on 19 February 2019).

[15] Saran, A. and H. White (2018), “Evidence and gap maps: a comparison of different approaches”, http://dx.doi.org/10.4073/cmdp.2018.2.

[61] Scholtes, V., C. Terwee and R. Poolman (2010), “What makes a measurement instrument valid and reliable?”, Injury, Vol. 42, pp. 236-240, http://dx.doi.org/10.1016/j.injury.2010.11.042.

[12] Shemilt, I. et al. (2010), “Evidence synthesis, economics and public policy”, Research Synthesis Methods, Vol. 1/2, pp. 126-135, http://dx.doi.org/10.1002/jrsm.14.

[70] Social Programs That Work (2019), Evidence Based Programs, https://evidencebasedprograms.org/ (accessed on 19 February 2019).

[53] Society for Prevention Research Standards of Evidence (2015), “Standards of evidence for efficacy, effectiveness, and scale-up research in prevention science: Next generation”, Prevention Science, Vol. 16/7, pp. 893-926.

[35] Strengthening Families Evidence Review - SFER (2018), Review Process, https://familyreview.acf.hhs.gov/ReviewProcess.aspx?id=3 (accessed on 19 February 2019).

[52] SUPERU (2017), An evidence rating scale for New Zealand. Understanding the effectiveness of interventions in the social sector.

[114] SUPERU (2016), Standards of evidence for understanding what works: International experiences and prospects for Aotearoa New Zealand, SUPERU, Wellington, http://www.superu.govt.nz/sites/default/files/Standards%20of%20evidence.pdf (accessed on 18 April 2018).

[75] The California Evidence-Based Clearinghouse for Child Welfare (2019), Scientific Rating Scale, http://www.cebc4cw.org/ratings/scientific-rating-scale/ (accessed on 25 January 2019).

[30] The Campbell Collaboration (2019), Campbell systematic reviews: Policies and Guidelines, http://dx.doi.org/10.4073/cpg.2016.1.

[21] The Cochrane Collaboration (2021), Cochrane Denmark, https://www.cochrane.dk/nordic-cochrane-centre-copenhagen.

[88] The Cochrane Collaboration (2011), Cochrane Handbook for Systematic Reviews of Interventions, https://handbook-5-1.cochrane.org/ (accessed on 18 April 2019).

[36] The Community Guide (2018), The Community Guide Methodology, https://www.thecommunityguide.org/about/our-methodology (accessed on 19 February 2019).

[28] The EQUATOR Network (2020), Enhancing the quality and transparency of health research, https://www.equator-network.org/ (accessed on 21 April 2020).

[107] The Pew Charitable Trusts (2013), States’ Use of Cost-Benefit Analysis.

[16] The UK Civil Service (2014), What is a Rapid Evidence Assessment?, https://webarchive.nationalarchives.gov.uk/20140402163359/http://www.civilservice.gov.uk/networks/gsr/resources-and-guidance/rapid-evidence-assessment/what-is (accessed on 22 April 2020).

[85] Tõnurist, P. (2019), Evaluating Public Sector Innovation: Support or hindrance to innovation?, OECD, Paris.

[92] Vera Institute (2014), Cost-Benefit Analysis and Justice Policy Toolkit.

[17] Washington State Institute for Public Policy (2017), Benefit-Cost Technical Documentation.

[69] What Works Centre for Children’s Social Care (2018), Evidence standards, https://wwc-evidence.herokuapp.com/pages/our-ratings-explained (accessed on 19 February 2019).

[65] What Works Centre for Local Economic Growth (2016), Guide to scoring evidence using the Maryland Scientific Methods Scale, https://whatworksgrowth.org/public/files/Methodology/16-06-28_Scoring_Guide.pdf (accessed on 24 January 2019).

[29] What Works Centre for Local Economic Growth (2015), Evidence Review Apprenticeships, https://whatworksgrowth.org/public/files/Policy_Reviews/15-09-04_Apprenticeships_Review.pdf (accessed on 13 May 2019).

[31] What Works Centre for Wellbeing (2017), A guide to our evidence review methods, https://whatworkswellbeing.org/product/a-guide-to-our-evidence-review-methods/ (accessed on 8 March 2019).

[74] What Works Clearinghouse (2020), Standards Handbook (Version 4.1), https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC-Standards-Handbook-v4-1-508.pdf (accessed on 5 February 2019).

[24] What Works For Health (2010), Evidence Rating: Guidelines, http://whatworksforhealth.wisc.edu/evidence.php (accessed on 6 June 2019).

[58] What Works For Health (2010), Methods, http://whatworksforhealth.wisc.edu/evidence.php (accessed on 19 February 2019).

[14] White, H. (2019), “The twenty-first century experimenting society: the four waves of the evidence revolution”, Palgrave Communications, Vol. 5/1, p. 47, http://dx.doi.org/10.1057/s41599-019-0253-6.

[26] Whiting, P. et al. (2016), “ROBIS: A new tool to assess risk of bias in systematic reviews was developed”, Journal of Clinical Epidemiology, Vol. 69, pp. 225-234, http://dx.doi.org/10.1016/j.jclinepi.2015.06.005.

[7] Zaza, S. et al. (2000), Data Collection Instrument and Procedure for Systematic Reviews in the Guide to Community Preventive Services, http://www.thecommunityguide.org (accessed on 25 January 2019).

Notes

1. This is discussed more broadly in the effectiveness section.

2. A shadow price is an estimate of an economic value when market-based values are unavailable (e.g. there is no market for buying and selling emotional regulation) (Karoly, 2010[115]). The quality of shadow prices, and the degree of consensus on them, can vary by substantive area. Sometimes an estimate is only appropriate for projections in certain circumstances and should not be generalized (Crowley et al., 2018[93]).

Metadata, Legal and Rights

This document, as well as any data and map included herein, are without prejudice to the status of or sovereignty over any territory, to the delimitation of international frontiers and boundaries and to the name of any territory, city or area. Extracts from publications may be subject to additional disclaimers, which are set out in the complete version of the publication, available at the link provided.

© OECD 2020

The use of this work, whether digital or print, is governed by the Terms and Conditions to be found at http://www.oecd.org/termsandconditions.