Impact evaluation

Impact evaluation assesses the extent to which an intervention (such as a project, program or policy) is causally responsible for observed outcomes, intended and unintended.^[1] In contrast to outcome monitoring, which examines whether targets have been achieved, impact evaluation is structured to answer the question: how would outcomes such as participants' well-being have changed if the intervention had not been undertaken? This involves counterfactual analysis, that is, "a comparison between what actually happened and what would have happened in the absence of the intervention."^[2] Impact evaluations seek to answer cause-and-effect questions. In other words, they look for the changes in outcome that are directly attributable to a program.^[3]

Impact evaluation helps people answer key questions for evidence-based policy making: what works, what doesn't, where, why and for how much? It has received increasing attention in policy making in recent years in the context of both developed and developing countries.^[4] It is an important component of the armory of evaluation tools and approaches and integral to global efforts to improve the effectiveness of aid delivery and public spending more generally in improving living standards. Originally more oriented towards evaluation of social sector programs in developing countries, notably conditional cash transfers, impact evaluation is now being increasingly applied in other areas such as agriculture, energy and transport.

Counterfactual evaluation designs

Counterfactual analysis enables evaluators to attribute cause and effect between interventions and outcomes. The 'counterfactual' measures what would have happened to beneficiaries in the absence of the intervention, and impact is estimated by comparing counterfactual outcomes to those observed under the intervention. The key challenge in impact evaluation is that the counterfactual cannot be directly observed and must be approximated with reference to a comparison group. There are a range of accepted approaches to determining an appropriate comparison group for counterfactual analysis, using either prospective (ex ante) or retrospective (ex post) evaluation design. Prospective evaluations begin during the design phase of the intervention, involving collection of baseline and end-line data from intervention beneficiaries (the 'treatment group') and non-beneficiaries (the 'comparison group'); they may involve selection of individuals or communities into treatment and comparison groups. Retrospective evaluations are usually conducted after the implementation phase and may exploit existing survey data, although the best evaluations will collect data as close to baseline as possible, to ensure comparability of intervention and comparison groups.

There are five key principles relating to internal validity (study design) and external validity (generalizability) which rigorous impact evaluations should address: confounding factors, selection bias, spillover effects, contamination, and impact heterogeneity.^[5]

Confounding occurs where certain factors, typically relating to socioeconomic status, are correlated with exposure to the intervention and, independent of exposure, are causally related to the outcome of interest. Confounding factors are therefore alternate explanations for an observed (possibly spurious) relationship between intervention and outcome.
Selection bias, a special case of confounding, occurs where intervention participants are non-randomly drawn from the beneficiary population, and the criteria determining selection are correlated with outcomes. Unobserved factors, which are associated with access to or participation in the intervention, and are causally related to the outcome of interest, may lead to a spurious relationship between intervention and outcome if unaccounted for. Self-selection occurs where, for example, more able or organized individuals or communities, who are more likely to have better outcomes of interest, are also more likely to participate in the intervention. Endogenous program selection occurs where individuals or communities are chosen to participate because they are seen to be more likely to benefit from the intervention. Ignoring confounding factors can lead to a problem of omitted variable bias. In the special case of selection bias, the endogeneity of the selection variables can cause simultaneity bias.
Spillover (referred to as contagion in the case of experimental evaluations) occurs when members of the comparison (control) group are affected by the intervention.
Contamination occurs when members of treatment and/or comparison groups have access to another intervention which also affects the outcome of interest.
Impact heterogeneity refers to differences in impact by beneficiary type and context. High quality impact evaluations will assess the extent to which different groups (e.g., the disadvantaged) benefit from an intervention as well as the potential effect of context on impact. The degree to which results are generalizable will determine the applicability of lessons learned for interventions in other contexts.

Impact evaluation designs are identified by the type of methods used to generate the counterfactual and can be broadly classified into three categories – experimental, quasi-experimental and non-experimental designs – that vary in feasibility, cost, involvement during design or after implementation phase of the intervention, and degree of selection bias. White (2006)^[6] and Ravallion (2008)^[7] discuss alternative impact evaluation approaches.

Experimental approaches

Under experimental evaluations, units are randomly assigned to the treatment and comparison groups. The comparison group is isolated from the intervention, and both groups are isolated from any other interventions which may affect the outcome of interest. These evaluation designs are referred to as randomized control trials (RCTs). In experimental evaluations the comparison group is called a control group. When randomization is implemented over a sufficiently large sample with no contagion by the intervention, the only difference between treatment and control groups on average is that the latter does not receive the intervention. Random sample surveys, in which the sample for the evaluation is chosen randomly, should not be confused with experimental evaluation designs, which require the random assignment of the treatment.
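The distinction between random sampling and random assignment can be illustrated with a minimal Python sketch. The village names, sample sizes, seeds and the helper `assign_randomly` are hypothetical and chosen only for illustration; a real evaluation would typically also stratify or cluster the assignment.

```python
import random

def assign_randomly(units, seed=0):
    """Randomly split units into equal-sized treatment and control groups."""
    rng = random.Random(seed)   # fixed seed so the assignment can be reproduced
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical sampling frame of 1,000 villages.
population = [f"village_{i}" for i in range(1000)]

# Random sampling decides WHICH units are studied.
sample = random.Random(1).sample(population, 200)

# Random assignment decides WHICH studied units receive the intervention.
treatment, control = assign_randomly(sample, seed=2)
```

A survey that stops after the sampling step is a random sample survey; only the second step makes the design experimental.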

The experimental approach is often held up as the 'gold standard' of evaluation. It is the only evaluation design which can conclusively account for selection bias in demonstrating a causal relationship between intervention and outcomes. Randomization and isolation from interventions might not be practicable in the realm of social policy and may be ethically difficult to defend,^[8]^[9] although there may be opportunities to use natural experiments. Bamberger and White (2007)^[10] highlight some of the limitations to applying RCTs to development interventions. Methodological critiques have been made by Scriven (2008)^[11] on account of the biases introduced since social interventions cannot be fully blinded, and Deaton (2009)^[12] has pointed out that in practice analysis of RCTs falls back on the regression-based approaches they seek to avoid and so are subject to the same potential biases. Other problems include the often heterogeneous and changing contexts of interventions, logistical and practical challenges, difficulties with monitoring service delivery, access to the intervention by the comparison group and changes in selection criteria and/or intervention over time. Thus, it is estimated that RCTs are only applicable to 5 percent of development finance.^[10]

Randomised control trials (RCTs)

RCTs are studies used to measure the effectiveness of a new intervention. They are unlikely to prove causality on their own; however, randomisation reduces bias while providing a tool for examining cause-effect relationships.^[13] RCTs rely on random assignment, meaning that the evaluation almost always has to be designed ex ante, as it is rare that the natural assignment of a project would be on a random basis.^[14] When designing an RCT, there are five key questions that need to be asked: what treatment is being tested, how many treatment arms there will be, what the unit of assignment will be, how large a sample is needed, and how the test will be randomised.^[14] A well conducted RCT will yield a credible estimate of the average treatment effect within one specific population or unit of assignment.^[15] A drawback of RCTs is 'the transportation problem': what works within one population does not necessarily work within another, meaning that the average treatment effect is not applicable across differing units of assignment.^[15]

Natural experiments

Natural experiments are used because these methods relax the inherent tension between uncontrolled field and controlled laboratory data collection approaches.^[16] Natural experiments leverage events outside the researchers' and subjects' control to address several threats to internal validity, minimising the chance of confounding elements, while sacrificing a few of the features of field data, such as more natural ranges of treatment effects and the presence of organically formed context.^[16] A main problem with natural experiments is the issue of replicability. Laboratory work, when properly described and repeated, should be able to produce similar results. Due to the uniqueness of natural experiments, replication is often limited to analysis of alternate data from a similar event.^[16]

Non-experimental approaches

Quasi-experimental design

Quasi-experimental approaches can remove bias arising from selection on observables and, where panel data are available, time invariant unobservables. Quasi-experimental methods include matching, differencing, instrumental variables and the pipeline approach; they are usually carried out by multivariate regression analysis.

If selection characteristics are known and observed, they can be controlled for to remove the bias. Matching involves comparing program participants with non-participants based on observed selection characteristics. Propensity score matching (PSM) uses a statistical model to calculate the probability of participating on the basis of a set of observable characteristics and matches participants and non-participants with similar probability scores. Regression discontinuity design exploits a decision rule as to who does and does not get the intervention to compare outcomes for those just either side of this cut-off.
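The matching step can be sketched in Python as follows. This is a minimal illustration that assumes the participation probabilities (propensity scores) have already been estimated by some statistical model; the data, the function name `match_nearest` and the choice of one-to-one nearest-neighbour matching with replacement are assumptions made for the example, not a prescribed procedure.

```python
from statistics import mean

def match_nearest(treated, comparison):
    """Average outcome gap between each participant and the non-participant
    with the closest propensity score (matching with replacement).

    Each unit is a (propensity_score, outcome) pair.
    """
    gaps = []
    for score_t, outcome_t in treated:
        _, outcome_c = min(comparison, key=lambda unit: abs(unit[0] - score_t))
        gaps.append(outcome_t - outcome_c)
    return mean(gaps)

# Hypothetical data: participants tend to have high scores and high outcomes.
treated = [(0.80, 12.0), (0.60, 10.0), (0.70, 11.0)]
comparison = [(0.20, 4.0), (0.59, 8.0), (0.72, 9.0), (0.81, 10.0), (0.40, 6.0)]

# Unmatched comparison mixes the program effect with selection on observables.
naive_estimate = mean(y for _, y in treated) - mean(y for _, y in comparison)

# Matched comparison uses only non-participants with similar scores.
matched_estimate = match_nearest(treated, comparison)
```

In this constructed data the unmatched difference is larger than the matched one, because low-score non-participants with low outcomes pull the comparison mean down. Matching removes that part of the gap only to the extent that the observed characteristics capture the selection process.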

Difference in differences or double differences, which use data collected at baseline and end-line for intervention and comparison groups, can be used to account for selection bias under the assumption that unobservable factors determining selection are fixed over time (time invariant). Difference in differences can also be applied to multiple time points and when an intervention is incrementally introduced in phases.^[17]

Instrumental variables estimation accounts for selection bias by modelling participation using factors ('instruments') that are correlated with selection but not the outcome, thus isolating the aspects of program participation which can be treated as exogenous.
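For the simplest case of a single binary instrument, the instrumental variables estimate reduces to the Wald estimator: the effect of the instrument on the outcome divided by the effect of the instrument on participation. The Python sketch below uses constructed data in which a hypothetical randomly assigned encouragement serves as the instrument; the records, the true effect of 2 and the function names are assumptions for illustration.

```python
from statistics import mean

def wald_estimate(records):
    """Wald (binary-instrument) estimate from (instrument, participated, outcome) records."""
    y1 = mean(y for z, d, y in records if z == 1)
    y0 = mean(y for z, d, y in records if z == 0)
    d1 = mean(d for z, d, y in records if z == 1)
    d0 = mean(d for z, d, y in records if z == 0)
    return (y1 - y0) / (d1 - d0)

def naive_estimate(records):
    """Participants versus non-participants, ignoring how they were selected."""
    return (mean(y for z, d, y in records if d == 1)
            - mean(y for z, d, y in records if d == 0))

# Constructed data: participation raises the outcome by 2, but units that
# always participate start from a higher baseline (self-selection).
records = [
    # (instrument z, participated d, outcome y)
    (1, 1.0, 16.0), (1, 1.0, 12.0), (1, 1.0, 12.0), (1, 0.0, 8.0),
    (0, 1.0, 16.0), (0, 0.0, 10.0), (0, 0.0, 10.0), (0, 0.0, 8.0),
]

iv = wald_estimate(records)      # recovers the effect built into the data
naive = naive_estimate(records)  # inflated by self-selection
```

The estimate is only as credible as the assumption that the instrument affects the outcome solely through participation, which cannot be tested from the data alone.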

The pipeline approach (stepped-wedge design) uses beneficiaries already chosen to participate in a project at a later stage as the comparison group. The assumption is that as they have been selected to receive the intervention in the future they are similar to the treatment group, and therefore comparable in terms of outcome variables of interest. However, in practice, it cannot be guaranteed that treatment and comparison groups are comparable and some method of matching will need to be applied to verify comparability.

Non-experimental design

Non-experimental impact evaluations are so-called because they do not involve a comparison group that does not have access to the intervention. The method used in non-experimental evaluation is to compare intervention groups before and after implementation of the intervention. Intervention interrupted time-series (ITS) evaluations require multiple data points on treated individuals before and after the intervention, while before versus after (or pre-test post-test) designs simply require a single data point before and after. Post-test analyses include data after the intervention from the intervention group only. Non-experimental designs are the weakest evaluation design, because to show a causal relationship between intervention and outcomes convincingly, the evaluation must demonstrate that any likely alternate explanations for the outcomes are irrelevant. However, there remain applications to which this design is relevant, for example, in calculating time-savings from an intervention which improves access to amenities. In addition, there may be cases where non-experimental designs are the only feasible impact evaluation design, such as universally implemented programmes or national policy reforms in which no isolated comparison groups are likely to exist.
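The difference between a simple before versus after comparison and an interrupted time-series analysis can be sketched in Python. The sketch assumes a linear pre-intervention trend that would have continued unchanged without the intervention, which is a strong assumption; the series and the helper `fit_line` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Hypothetical outcome series: already falling by 2 per period before the
# intervention, which starts at period 5.
pre_t, pre_y = [0, 1, 2, 3, 4], [100.0, 98.0, 96.0, 94.0, 92.0]
post_t, post_y = [5, 6, 7], [87.0, 85.0, 83.0]

# Before versus after: attributes the whole change to the intervention.
before_after = mean(post_y) - mean(pre_y)

# Interrupted time series: project the pre-intervention trend forward and
# compare observed outcomes with that projected counterfactual.
intercept, slope = fit_line(pre_t, pre_y)
projected = [intercept + slope * t for t in post_t]
its_effect = mean(obs - proj for obs, proj in zip(post_y, projected))
```

Here the before versus after comparison reports a much larger fall than the trend-adjusted estimate, because most of the change was already under way before the intervention began.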

Biases in estimating programme effects

Randomized field experiments are the strongest research designs for assessing program impact. This particular research design is said to generally be the design of choice when it is feasible as it allows for a fair and accurate estimate of the program's actual effects (Rossi, Lipsey & Freeman, 2004).

With that said, randomized field experiments are not always feasible to carry out, and in these situations there are alternative research designs at the disposal of an evaluator. The main problem is that, however well thought through or well implemented, every design is prone to yielding biased estimates of the program effects. These biases can either exaggerate or diminish program effects, and the direction the bias may take cannot usually be known in advance (Rossi et al., 2004). These biases affect the interests of stakeholders. Program participants may be disadvantaged if the bias makes an ineffective or harmful program seem effective. A bias can also make an effective program seem ineffective or even harmful. This could make the accomplishments of the program seem small or even insignificant, forcing the personnel, and even the program's sponsors, to reduce or eliminate the funding for the program (Rossi et al., 2004).

If an inadequate design yields bias, the stakeholders who are largely responsible for funding the program will be the ones most concerned: the results of the evaluation help them decide whether or not to continue funding the program, because the final decision lies with the funders and the sponsors. Those taking part in the program, or those the program is intended to benefit, will also be affected by the design chosen and the outcome rendered by that design. Therefore, the evaluator's concern is to minimize the amount of bias in the estimation of program effects (Rossi et al., 2004).

Biases are normally visible in two situations: when the measurement of the outcome with program exposure or the estimate of what the outcome would have been without the program exposure is higher or lower than the corresponding "true" value (p267). Unfortunately, not all forms of bias that may compromise impact assessment are obvious (Rossi et al., 2004).

The most common form of impact evaluation design is comparing two groups of individuals or other units, an intervention group that receives the program and a control group that does not. The estimate of program effect is then based on the difference between the groups on a suitable outcome measure (Rossi et al., 2004). The random assignment of individuals to program and control groups allows for making the assumption of continuing equivalence. Group comparisons that have not been formed through randomization are known as non-equivalent comparison designs (Rossi et al., 2004).

Selection bias

When there is an absence of the assumption of equivalence, the difference in outcome between the groups that would have occurred regardless creates a form of bias in the estimate of program effects. This is known as selection bias (Rossi et al., 2004). It creates a threat to the validity of the program effect estimate in any impact assessment using a non-equivalent group comparison design and appears in situations where some process responsible for influences that are not fully known selects which individuals will be in which group instead of the assignment to groups being determined by pure chance (Rossi et al., 2004). This may be because of participant self-selection, or it may be because of program placement (placement bias).^[18]

Selection bias can occur through natural or deliberate processes that cause a loss of outcome data for members of the intervention and control groups that have already been formed. This is known as attrition and it can come about in two ways (Rossi et al., 2004): targets drop out of the intervention or control group and cannot be reached, or targets refuse to co-operate in outcome measurement. Differential attrition is assumed when attrition occurs as a result of something other than an explicit chance process (Rossi et al., 2004). This means that "those individuals that were from the intervention group whose outcome data are missing cannot be assumed to have the same outcome-relevant characteristics as those from the control group whose outcome data are missing" (Rossi et al., 2004, p271). Even random assignment designs are therefore not safe from selection bias induced by attrition (Rossi et al., 2004).

Other forms of bias

There are other factors that can be responsible for bias in the results of an impact assessment. These generally have to do with events or experiences other than receiving the program that occur during the intervention. These biases include secular trends, interfering events and maturation (Rossi et al., 2004).

Secular trends or secular drift

Secular trends can be defined as relatively long-term trends in the community, region or country. These are also termed secular drift and may produce changes that enhance or mask the apparent effects of an intervention (Rossi et al., 2004). For example, when a community's birth rate is declining, a program to reduce fertility may appear effective because of bias stemming from that downward trend (Rossi et al., 2004, p273).

Interfering events

Interfering events are similar to secular trends, but in this case it is short-term events that produce changes that may introduce bias into estimates of program effect. For example, a power outage that disrupts communications or hampers the delivery of food supplements may interfere with a nutrition program (Rossi et al., 2004, p273).

Maturation

Impact evaluation needs to accommodate the fact that natural maturational and developmental processes can produce considerable change independently of the program. Including these changes in the estimates of program effects would result in biased estimates. For example, a program to improve preventative health practices among adults may seem ineffective because health generally declines with age (Rossi et al., 2004, p273).

"Careful maintenance of comparable circumstances for program and control groups between random assignment and outcome measurement should prevent bias from the influence of other differential experiences or events on the groups. If either of these conditions is absent from the design, there is potential for bias in the estimates of program effect" (Rossi et al., 2004, p274).

Estimation methods

Estimation methods broadly follow evaluation designs. Different designs require different estimation methods to measure changes in well-being from the counterfactual. In experimental and quasi-experimental evaluation, the estimated impact of the intervention is calculated as the difference in mean outcomes between the treatment group (those receiving the intervention) and the control or comparison group (those who do not). Where the groups are formed by random assignment, the design is a randomized control trial (RCT). According to an interview with Jim Rough, former representative of the American Evaluation Association, in the magazine D+C Development and Cooperation, this method does not work for complex, multilayer matters. The single difference estimator compares mean outcomes at end-line and is valid where treatment and control groups have the same outcome values at baseline. The difference-in-difference (or double difference) estimator calculates the difference in the change in the outcome over time for treatment and comparison groups, thus utilizing data collected at baseline for both groups and a second round of data collected at end-line, after implementation of the intervention, which may be years later.^[19]
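The two estimators can be written out in a few lines of Python. The outcome values below are hypothetical and constructed so that the treatment group starts from a lower baseline than the comparison group; the function names are illustrative.

```python
from statistics import mean

def single_difference(treat_end, comp_end):
    """Difference in mean end-line outcomes between the two groups."""
    return mean(treat_end) - mean(comp_end)

def double_difference(treat_base, treat_end, comp_base, comp_end):
    """Change over time in the treatment group minus change in the comparison group."""
    return (mean(treat_end) - mean(treat_base)) - (mean(comp_end) - mean(comp_base))

# Hypothetical outcome data at baseline and end-line.
treat_base, treat_end = [38.0, 40.0, 42.0], [53.0, 55.0, 57.0]
comp_base, comp_end = [48.0, 50.0, 52.0], [58.0, 60.0, 62.0]

sd = single_difference(treat_end, comp_end)
dd = double_difference(treat_base, treat_end, comp_base, comp_end)
```

In this example the single difference is negative because the groups differed at baseline, whereas the double difference nets out that fixed gap and is positive. The double difference is itself only valid if the two groups would have followed the same trend in the absence of the intervention.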

Impact evaluations which compare average outcomes in the treatment group, irrespective of beneficiary participation (also referred to as 'compliance' or 'adherence'), to outcomes in the comparison group are referred to as intention-to-treat (ITT) analyses. Impact evaluations which compare outcomes among beneficiaries who comply or adhere to the intervention in the treatment group to outcomes in the control group are referred to as treatment-on-the-treated (TOT) analyses. ITT therefore provides a lower-bound estimate of impact, but is arguably of greater policy relevance than TOT in the analysis of voluntary programs.^[20]
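The two comparisons, as defined above, can be illustrated with a short Python sketch. The data are hypothetical: half of the units assigned to treatment take up the intervention, and the intervention is assumed to have no effect on those who do not take it up.

```python
from statistics import mean

# Hypothetical end-line data for units assigned to treatment:
# (complied with the intervention?, outcome)
assigned_to_treatment = [(True, 14.0), (True, 14.0), (False, 10.0), (False, 10.0)]
control_outcomes = [10.0, 10.0, 10.0, 10.0]

# Intention-to-treat: everyone assigned to treatment, whether or not they complied.
itt = mean(y for _, y in assigned_to_treatment) - mean(control_outcomes)

# Treatment-on-the-treated, as defined above: compliers only versus the control group.
tot = mean(y for complied, y in assigned_to_treatment if complied) - mean(control_outcomes)
```

With 50 percent compliance the ITT estimate here is half the TOT estimate. The TOT comparison as written is unbiased only if compliers resemble the control group in other respects, which is an assumption and often does not hold when participation is voluntary.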

Debates

While there is agreement on the importance of impact evaluation, and a consensus is emerging around the use of counterfactual evaluation methods, there has also been widespread debate in recent years on both the definition of impact evaluation and the use of appropriate methods (see White 2009^[21] for an overview).

Definitions

The International Initiative for Impact Evaluation (3ie) defines rigorous impact evaluations as: "analyses that measure the net change in outcomes for a particular group of people that can be attributed to a specific program using the best methodology available, feasible and appropriate to the evaluation question that is being investigated and to the specific context".^[22]

According to the World Bank's DIME Initiative, "Impact evaluations compare the outcomes of a program against a counterfactual that shows what would have happened to beneficiaries without the program. Unlike other forms of evaluation, they permit the attribution of observed changes in outcomes to the program being evaluated by following experimental and quasi-experimental designs".^[23]

Similarly, according to the US Environmental Protection Agency impact evaluation is a form of evaluation that assesses the net effect of a program by comparing program outcomes with an estimate of what would have happened in the absence of a program.^[24]

According to the World Bank's Independent Evaluation Group (IEG), impact evaluation is the systematic identification of the effects positive or negative, intended or not on individual households, institutions, and the environment caused by a given development activity such as a program or project.^[25]

Impact evaluation has been defined differently over the past few decades.^[6] Other interpretations of impact evaluation include:

An evaluation which looks at the impact of an intervention on final welfare outcomes, rather than only at project outputs, or a process evaluation which focuses on implementation;
An evaluation carried out some time (five to ten years) after the intervention has been completed so as to allow time for impact to appear; and
An evaluation considering all interventions within a given sector or geographical area.

Other authors make a distinction between "impact evaluation" and "impact assessment." "Impact evaluation" uses empirical techniques to estimate the effects of interventions and their statistical significance, whereas "impact assessment" includes a broader set of methods, including structural simulations and other approaches that cannot test for statistical significance.^[18]

Common definitions of 'impact' used in evaluation generally refer to the totality of longer-term consequences associated with an intervention on quality-of-life outcomes. For example, the Organization for Economic Cooperation and Development's Development Assistance Committee (OECD-DAC) defines impact as the "positive and negative, primary and secondary long-term effects produced by a development intervention, directly or indirectly, intended or unintended".^[26] A number of international agencies have also adopted this definition of impact. For example, UNICEF defines impact as "The longer term results of a program – technical, economic, socio-cultural, institutional, environmental or other – whether intended or unintended. The intended impact should correspond to the program goal."^[27] Similarly, Evaluationwiki.org defines impact evaluation as an evaluation that looks beyond the immediate results of policies, instruction, or services to identify longer-term as well as unintended program effects.^[28]

Technically, an evaluation could be conducted to assess 'impact' as defined here without reference to a counterfactual. However, much of the existing literature (e.g. NONIE Guidelines on Impact Evaluation^[29]) adopts the OECD-DAC definition of impact while referring to the techniques used to attribute impact to an intervention as necessarily based on counterfactual analysis.

What is missing from the term 'impact' evaluation is the way 'impact' shows up long-term. For instance, most Monitoring and Evaluation 'logical framework' plans have inputs, outputs, outcomes and impacts. While the first three appear during the project duration itself, impact takes far longer to take place. For instance, in a 5-year agricultural project, seeds are inputs, farmers trained in using them are outputs, changes in crop yields as a result of the seeds being planted properly are an outcome, and families being more sustainably food secure over time is an impact. Such post-project impact evaluations are very rare. They are also called ex-post evaluations or sustained impact evaluations. While hundreds of thousands of documents call for them, donors rarely have the funding flexibility, or interest, to return to see how sustained and durable interventions remained after project close-out, once resources were withdrawn. There are many lessons to be learned for design, implementation, M&E and how to foster country-ownership.

Methodological debates

There is intensive debate in academic circles around the appropriate methodologies for impact evaluation, between proponents of experimental methods on the one hand and proponents of more general methodologies on the other. William Easterly has referred to this as 'The Civil War in Development economics'. Proponents of experimental designs, sometimes referred to as 'randomistas',^[8] argue that randomization is the only means to ensure unobservable selection bias is accounted for, and that building up the flimsy experimental evidence base should be a matter of priority.^[30] In contrast, others argue that randomized assignment is seldom appropriate to development interventions and even when it is, experiments provide us with information on the results of a specific intervention applied to a specific context, and little of external relevance.^[31] There has been criticism from evaluation bodies and others that some donors and academics overemphasize favoured methods for impact evaluation,^[32] and that this may in fact hinder learning and accountability.^[33] In addition, there has been a debate around the appropriate role for qualitative methods within impact evaluations.^[34]^[35]

Theory-based impact evaluation

While knowledge of effectiveness is vital, it is also important to understand the reasons for effectiveness and the circumstances under which results are likely to be replicated. In contrast with 'black box' impact evaluation approaches, which only report mean differences in outcomes between treatment and comparison groups, theory-based impact evaluation involves mapping out the causal chain from inputs to outcomes and impact and testing the underlying assumptions.^[36]^[29] Most interventions within the realm of public policy are of a voluntary, rather than coercive (legally required) nature. In addition, interventions are often active rather than passive, requiring a greater rather than lesser degree of participation among beneficiaries and therefore behavior change as a pre-requisite for effectiveness. Public policy will therefore be successful to the extent that people are incentivized to change their behaviour favourably. A theory-based approach enables policy-makers to understand the reasons for differing levels of program participation (referred to as 'compliance' or 'adherence') and the processes determining behavior change. Theory-Based approaches use both quantitative and qualitative data collection, and the latter can be particularly useful in understanding the reasons for compliance and therefore whether and how the intervention may be replicated in other settings. Methods of qualitative data collection include focus groups, in-depth interviews, participatory rural appraisal (PRA) and field visits, as well as reading of anthropological and political literature.

White (2009b)^[36] advocates more widespread application of a theory-based approach to impact evaluation as a means to improve policy relevance of impact evaluations, outlining six key principles of the theory-based approach:

Map out the causal chain (program theory) which explains how the intervention is expected to lead to the intended outcomes, and collect data to test the underlying assumptions of the causal links.
Understand context, including the social, political and economic setting of the intervention.
Anticipate heterogeneity to help in identifying sub-groups and adjusting the sample size to account for the levels of disaggregation to be used in the analysis.
Rigorous evaluation of impact using a credible counterfactual (as discussed above).
Rigorous factual analysis of links in the causal chain.
Use mixed methods (a combination of quantitative and qualitative methods).

Examples

While experimental impact evaluation methodologies have been used to assess nutrition and water and sanitation interventions in developing countries since the 1980s, the first, and best known, application of experimental methods to a large-scale development program is the evaluation of the Conditional Cash Transfer (CCT) program Progresa (now called Oportunidades) in Mexico, which examined a range of development outcomes, including schooling, immunization rates and child work.^[37]^[38] CCT programs have since been implemented by a number of governments in Latin America and elsewhere, and a report released by the World Bank in February 2009 examines the impact of CCTs across twenty countries.^[39]

More recently, impact evaluation has been applied to a range of interventions across social and productive sectors. 3ie has launched an online database of impact evaluations covering studies conducted in low- and middle-income countries. Other organisations publishing impact evaluations include Innovations for Poverty Action, the World Bank's DIME Initiative and NONIE. The IEG of the World Bank has systematically assessed and summarized the experience of ten impact evaluations of development programs in various sectors carried out over the past 20 years.^[40]

Organizations promoting impact evaluation of development interventions

In 2006, the Evaluation Gap Working Group^[41] argued that there was a major gap in the evidence on development interventions, and called for an independent body to be set up to plug that gap by funding and advocating for rigorous impact evaluation in low- and middle-income countries. The International Initiative for Impact Evaluation (3ie) was set up in response to this report. 3ie seeks to improve the lives of poor people in low- and middle-income countries by providing, and summarizing, evidence of what works, when, why and for how much. 3ie operates a grant program that finances impact studies in low- and middle-income countries and synthetic reviews of existing evidence, updated as new evidence appears, and it supports quality impact evaluation through its quality assurance services.

Another initiative devoted to the evaluation of impacts is the Committee on Sustainability Assessment (COSA). COSA is a non-profit global consortium of institutions, sustained in partnership with the International Institute for Sustainable Development (IISD) Sustainable Commodity Initiative, the United Nations Conference on Trade and Development (UNCTAD), and the United Nations International Trade Centre (ITC). COSA is developing and applying an independent measurement tool to analyze the distinct social, environmental and economic impacts of agricultural practices, and in particular those associated with the implementation of specific sustainability programs (Organic, Fairtrade etc.). The focus of the initiative is to establish global indicators and measurement tools which farmers, policy-makers, and industry can use to understand and improve their sustainability with different crops or agricultural sectors. COSA aims to facilitate this by enabling them to accurately calculate the relative costs and benefits of becoming involved in any given sustainability initiative.

A number of additional organizations have been established to promote impact evaluation globally, including Innovations for Poverty Action, the World Bank's Strategic Impact Evaluation Fund (SIEF), the World Bank's Development Impact Evaluation (DIME) Initiative, the Institutional Learning and Change (ILAC) Initiative of the CGIAR, and the Network of Networks on Impact Evaluation (NONIE).

Systematic reviews of impact evidence

A range of organizations are working to coordinate the production of systematic reviews. Systematic reviews aim to bridge the research-policy divide by assessing the range of existing evidence on a particular topic and presenting the information in an accessible format. Like rigorous impact evaluations, they are developed from a study protocol which sets out a priori the criteria for study inclusion, the search strategy and the methods of synthesis. Systematic reviews involve five key steps: (1) determination of the interventions, populations, outcomes and study designs to be included; (2) searches to identify published and unpublished literature, and application of the study inclusion criteria (relating to interventions, populations, outcomes and study design) as set out in the study protocol; (3) coding of information from studies; (4) presentation of quantitative estimates of intervention effectiveness using forest plots and, where interventions are determined to be appropriately homogeneous, calculation of a pooled summary estimate using meta-analysis; and (5) periodic updating of the review as new evidence emerges. Systematic reviews may also involve the synthesis of qualitative information, for example relating to the barriers to, or facilitators of, intervention effectiveness.
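The pooled summary estimate mentioned in the fourth step can be illustrated with a minimal sketch. It assumes a fixed-effect, inverse-variance model, which is only one of several ways to pool estimates and is not prescribed by the text above; the function names and the effect sizes are hypothetical and chosen purely for illustration.

```python
import math


def pooled_fixed_effect(estimates, std_errors):
    """Return the inverse-variance weighted pooled estimate and its standard error.

    Assumes a fixed-effect model: every study is taken to estimate the same
    underlying effect, so each study is weighted by 1 / SE^2.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("estimates and std_errors must be non-empty and of equal length")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


def confidence_interval(pooled, pooled_se, z=1.96):
    """Approximate 95% confidence interval under a normal approximation."""
    return pooled - z * pooled_se, pooled + z * pooled_se


if __name__ == "__main__":
    # Hypothetical effect sizes and standard errors from three studies.
    effects = [0.10, 0.30, 0.20]
    ses = [0.10, 0.20, 0.15]
    estimate, se = pooled_fixed_effect(effects, ses)
    low, high = confidence_interval(estimate, se)
    print(f"pooled estimate = {estimate:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

More precise studies (those with smaller standard errors) receive more weight. Where the included interventions are not appropriately homogeneous, a single pooled estimate of this kind may not be meaningful, and a random-effects model or a purely narrative synthesis may be more suitable.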

References

[1] World Bank Poverty Group on Impact Evaluation, accessed on January 6, 2008
[2] "White, H. (2006) Impact Evaluation: The Experience of the Independent Evaluation Group of the World Bank, World Bank, Washington, D.C., p. 3" (PDF). Archived from the original (PDF) on 2018-02-19. Retrieved 2010-01-07.
[3] "Gertler, Martinez, Premand, Rawlings and Vermeersch (2011) Impact Evaluation in Practice, Washington, DC: The World Bank". Archived from the original on 2011-07-17. Retrieved 2010-12-15.
[4] "Log in" (PDF). Retrieved 16 January 2017.
[5] "Log in" (PDF). Retrieved 16 January 2017.
[6] "White, H. (2006) Impact Evaluation: The Experience of the Independent Evaluation Group of the World Bank, World Bank, Washington, D.C." (PDF). Archived from the original (PDF) on 2018-02-19. Retrieved 2010-01-07.
[7] Ravallion, M. (2008) Evaluating Anti-Poverty Programs
[8] Ravallion, Martin (1 January 2009). "Should the Randomistas Rule?". 6 (2): 1–5. Retrieved 16 January 2017 – via RePEc - IDEAS.
[9] Note that it has been argued that “Randomistas is a slang term used by critics to describe proponents of the RCT methodology. It is almost certainly a gendered, derogatory term intended to flippantly dismiss experimental economists and their success, particularly Esther Duflo, one of the most successful experts on randomization.” See Webber, S., & Prouse, C. (2018). The New Gold Standard: The Rise of Randomized Control Trials and Experimental Development. Economic Geography, 94(2), 166–187.
[10] Bamberger, M. and White, H. (2007) Using Strong Evaluation Designs in Developing Countries: Experience and Challenges, Journal of MultiDisciplinary Evaluation, Volume 4, Number 8, 58-73
[11] Scriven (2008) A Summative Evaluation of RCT Methodology: & An Alternative Approach to Causal Research, Journal of MultiDisciplinary Evaluation, Volume 5, Number 9, 11-24
[12] Deaton, Angus (1 January 2009). "Instruments of Development: Randomization in the Tropics, and the Search for the Elusive Keys to Economic Development". SSRN 1335715.
[13] Hariton, Eduardo; Locascio, Joseph J. (December 2018). "Randomised controlled trials—the gold standard for effectiveness research". BJOG: An International Journal of Obstetrics and Gynaecology. 125 (13): 1716. doi:10.1111/1471-0528.15199. ISSN 1470-0328. PMC 6235704. PMID 29916205.
[14] White, Howard (8 March 2013). "An introduction to the use of randomised control trials to evaluate development interventions". Journal of Development Effectiveness. 5: 30–49. doi:10.1080/19439342.2013.764652. S2CID 51812043.
[15] Deaton, Angus; Cartwright, Nancy (2016-11-09). "The limitations of randomised controlled trials". VoxEU.org. Retrieved 2020-10-26.
[16] Roe, Brian E.; Just, David R. (December 2009). "Internal and External Validity in Economics Research: Tradeoffs between Experiments, Field Experiments, Natural Experiments, and Field Data". American Journal of Agricultural Economics. 91 (5): 1266–1271. doi:10.1111/j.1467-8276.2009.01295.x. hdl:10.1111/j.1467-8276.2009.01295.x. ISSN 0002-9092.
[17] Callaway, Brantly; Sant’Anna, Pedro H. C. (2021-12-01). "Difference-in-Differences with multiple time periods". Journal of Econometrics. Themed Issue: Treatment Effect 1. 225 (2): 200–230. doi:10.1016/j.jeconom.2020.12.001. ISSN 0304-4076.
[18] White, Howard; Raitzer, David (2017). Impact Evaluation of Development Interventions: A Practical Guide (PDF). Manila: Asian Development Bank. ISBN 978-92-9261-059-3.
[19] Rugh, Jim (June 22, 2012). "Hammer in search of nails". D+C Development and Cooperation. 2012 (7): 300.
[20] Bloom, H. (2006) The core analytics of randomized experiments for social research. MDRC Working Papers on Research Methodology. MDRC, New York
[21] "White, H. (2009) Some reflections on current debates in impact evaluation, Working paper 1, International Initiative for Impact Evaluation, New Delhi". Archived from the original on 2013-01-08. Retrieved 2012-10-29.
[22] "Log in" (PDF). Retrieved 16 January 2017.
[23] World Bank (n.d.) The Development IMpact Evaluation (DIME) Initiative, Project Document, World Bank, Washington, D.C.
[24] US Environmental Protection Agency Program Evaluation Glossary, accessed on January 6, 2008
[25] World Bank Independent Evaluation Group, accessed on January 6, 2008
[26] OECD-DAC (2002) Glossary of Key Terms in Evaluation and Results-Based Management Proposed Harmonized Terminology, OECD, Paris
[27] UNICEF (2004) UNICEF Evaluation Report Standards, Evaluation Office, UNICEF NYHQ, New York
[28] "Evaluation Definition: What is Evaluation? - EvaluationWiki". Retrieved 16 January 2017.
[29] "Page Not Found". Retrieved 16 January 2017.
[30] "Banerjee, A. V. (2007) 'Making Aid Work' Cambridge, Boston Review Book, MIT Press, MA" (PDF). Retrieved 16 January 2017.
[31] Bamberger, M. and White, H. (2007) Using Strong Evaluation Designs in Developing Countries: Experience and Challenges, Journal of MultiDisciplinary Evaluation, Volume 4, Number 8, 58-73
[32] EES Statement on the importance of a methodologically diverse approach to impact evaluation. http://www.europeanevaluation.org/download/?noGzip=1&id=1969403
[33] The 'gold standard' is not a silver bullet for evaluation. http://www.odi.org.uk/resources/odi-publications/opinions/127-impact-evaluation.pdf
[34] "Aid effectiveness: The role of qualitative research in impact evaluation". 27 June 2014.
[35] Prowse, Martin; Camfield, Laura (2013). "Improving the quality of development assistance". Progress in Development Studies. 13: 51–61. doi:10.1177/146499341201300104. S2CID 44482662.
[36] "White, H. (2009b) Theory-based impact evaluation: Principles and practice, Working Paper 3, International Initiative for Impact Evaluation, New Delhi". Archived from the original on 2012-11-06. Retrieved 2012-10-29.
[37] Gertler, P. (2000) Final Report: The Impact of PROGRESA on Health. International Food Policy Research Institute, Washington, D.C.
[38] "Untitled Document" (PDF). Retrieved 16 January 2017.
[39] Fiszbein, A. and Schady, N. (2009) Conditional Cash Transfers: Reducing present and future poverty: A World Bank Policy Research Report, World Bank, Washington, D.C.
[40] "Impact Evaluation: The Experience of the Independent Evaluation Group of the World Bank, 2006" (PDF). Archived from the original (PDF) on 2008-05-10. Retrieved 2008-01-07.
[41] "When Will We Ever Learn? Improving Lives Through Impact Evaluation". 31 May 2006. Retrieved 16 January 2017.

Approach	Key Strength	Key Limitation	Example Application
RCTs	High internal validity via randomization	Poor scalability, ethical barriers, limited generalizability	Microfinance impacts in India (2000s trials showing modest effects)^[68]
Quasi-Experimental (e.g., DiD, RDD)	Leverages real-world data for broader applicability	Depends on assumptions like parallel trends, testable but not always verifiable	Minimum wage effects (DiD in 1994 U.S. study)^[72]
