ImpactMojo · Impact Evaluation 101 · www.impactmojo.in

ImpactMojo 101 Series · Free Forever

Impact Evaluation 101

Designing & Commissioning Credible Evaluations — a Foundational Course for Development Programme & MEL Practitioners in South Asia

Practice-Focused · South Asia Focus · 100 Slides · Free Access

Agenda

What We Cover

01  What Impact Evaluation Is
02  The Counterfactual & Causal Inference
03  Theory of Change & Evaluation Questions
04  Randomised Experiments (RCTs)
05  Quasi-Experimental Designs
06  Choosing a Method
07  Sampling & Statistical Power
08  Measurement & Data Collection
09  Analysis & Interpreting Results
10  External Validity, Cost & Ethics
11  Using IE for Decisions & Further Reading

01

Section One

What Impact Evaluation Is

Definition

Measuring the change a programme caused

An impact evaluation answers one demanding question: how much of the change we observe was actually caused by our programme — and not by everything else happening at the same time?

Impact evaluation

A study that estimates the causal effect of a programme, policy or intervention on an outcome — the difference between what happened with the programme and what would have happened without it.

Every other M&E question can be useful. This one tells you whether the programme is worth running at all.

The Distinction

Impact is not the same as monitoring

	Monitoring	Impact evaluation
Asks	Are we doing what we planned?	Did it cause the change?
Tracks	Inputs, activities, outputs	Outcomes vs a counterfactual
Timing	Continuous, routine	Periodic, designed in advance
Example	1,200 toilets built	Did diarrhoea actually fall because of them?
Answers	Implementation	Attribution

Monitoring tells you the programme ran. Impact evaluation tells you whether it worked. You need both — they answer different questions.

The Results Chain

Where impact sits in the results chain

01  INPUTS: budget, staff, training  →
02  ACTIVITIES: deliver the programme  →
03  OUTPUTS: toilets built, classes held  →
04  OUTCOMES: behaviour, take-up  →
05  IMPACT: the caused change in wellbeing

Monitoring lives on the left of this chain; impact evaluation targets the far right — and insists on attribution, not just association.

The Attribution Question

Compared to what?

A village had the programme; outcomes improved. Tempting to claim success — but outcomes might have improved anyway, through a good monsoon, rising incomes, or other schemes. The honest question is always: compared to what?

The fundamental question of impact evaluation is not 'did things get better?' but 'did things get better because of us?'

— a working principle of evaluation practice

Why It Matters

What credible evidence buys you

When you have it

Scale what genuinely works
Stop what does not, and free up budget
Defend funding with real evidence
Learn why, not just whether

When you don't

Scale failures by mistake
Credit yourself for trends you didn't cause
Cut effective programmes blindly
Repeat the same expensive errors

A Caution

Impact evaluation is not always the answer

Impact evaluation is powerful but costly and slow. It is one tool in the MEL toolkit — not a stamp of legitimacy to apply to everything. Much of the time, good monitoring and process evaluation serve you better.

This course is as much about knowing when to commission an impact evaluation as about how to design one.

Your Role

You don't have to run it to commission it well

Most practitioners will commission an impact evaluation rather than estimate it themselves. That still demands real literacy: framing the question, choosing a design, judging a proposal, and reading the findings critically.

This deck is written for the commissioner and the manager — the person who must ask hard questions of an evaluator, not necessarily write the regression.

The Family of M&E

Impact evaluation among its cousins

Activity	Question it answers
Needs assessment	What is the problem, and for whom?
Process evaluation	Was the programme delivered as designed?
Monitoring	Are we on track against the plan?
Impact evaluation	Did the programme cause the change?
Cost-effectiveness analysis	Was the impact worth the cost?

Impact evaluation is one row in this table, not the whole table. It shines only when paired with the others, especially process evaluation, which tells you whether a null result means the idea failed or the delivery did.

02

Section Two

The Counterfactual & Causal Inference

The Big Idea

The counterfactual: the road not taken

Counterfactual

What would have happened to the very same people, in the same period, had the programme not existed. It is the benchmark against which impact is measured.

Impact = outcome with the programme − outcome without it, for the same group. The first we observe; the second we never can. Everything in this course is a strategy to estimate that missing half.

The Core Problem

The fundamental problem of causal inference

A person is either treated or not. We can never observe the same person in both states at once, so the true individual counterfactual is permanently missing. This is the fundamental problem of causal inference.

The solution is not to find each person's missing twin, but to build a comparison group that, on average, stands in for what the treated group would have looked like untreated.

The Wrong Counterfactuals

Two tempting comparisons that mislead

Before vs after

Compare the same group before and after. But the world moved on too — you confuse the programme with everything else that changed (the monsoon, prices, other schemes).

Participants vs non-participants

Compare those who joined to those who didn't. But joiners often differ — more motivated, better placed. That difference is selection bias, not impact.

Both are easy, cheap and wrong. A credible counterfactual is the whole game.

The Central Threat

Selection bias, defined

Selection bias

When the people who receive a programme differ systematically from those who don't — in ways that also affect the outcome — so a naive comparison mixes the programme's effect with those pre-existing differences.

If a skills programme enrols the most motivated youth, their later success partly reflects motivation, not training. Selection bias is the single biggest reason naive evaluations overstate impact.
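A small simulation can make the point concrete. The sketch below assumes made-up numbers throughout: a true training effect of 5, earnings that rise with an unobserved `motivation` score, and an 80% versus 20% chance of enrolling for more and less motivated youth. None of these values come from a real programme.

```python
import random
import statistics

def simulate_selection_bias(n=20000, true_effect=5.0, seed=0):
    """Compare joiners with non-joiners when motivated youth self-select.

    All parameters are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    joiners, non_joiners = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # unobserved by the evaluator
        # More motivated youth are more likely to enrol (self-selection).
        joins = rng.random() < (0.8 if motivation > 0 else 0.2)
        # Earnings depend on motivation whether or not the person is trained.
        earnings = 50 + 10 * motivation + rng.gauss(0, 5)
        if joins:
            joiners.append(earnings + true_effect)
        else:
            non_joiners.append(earnings)
    naive_gap = statistics.mean(joiners) - statistics.mean(non_joiners)
    return naive_gap, true_effect

naive_estimate, true_effect = simulate_selection_bias()
print(f"true effect: {true_effect:.1f}")
print(f"naive joiner vs non-joiner gap: {naive_estimate:.1f}")
```

Under these assumptions the naive gap comes out well above the true effect, because it also absorbs the motivation difference between joiners and non-joiners.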

See It

Why volunteers make a misleading comparison

[Chart: who self-selects into a training programme. Illustrative example, not real data.]

Joiners already differ before the programme. Comparing their later outcomes to non-joiners' measures who they were, not what the programme did.

The Goal

What a good comparison group must satisfy

It looks like the treatment group before the programme — same characteristics, same trajectory
It is exposed to the same outside forces — weather, prices, other schemes
The only systematic difference is the programme itself

Get this right and the comparison group's outcomes are a credible estimate of the treatment group's counterfactual. Every method that follows is a different way to build such a group.

Two Estimands

ATE and ATT: effect on whom?

ATE

Average Treatment Effect: the average effect across the whole eligible population, comparing everyone treated with no one treated.

ATT

Average Treatment effect on the Treated — the effect among those who actually received the programme. Often the policy-relevant one.

They differ when the programme works differently for participants than for everyone. Always ask which one a study reports — and which one your decision needs.
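To see how the two estimands can diverge, here is a hedged sketch with invented data. It assumes a hypothetical `need` index between 0 and 1, a programme effect that grows with need, and needier people being more likely to participate. Because the simulation knows each person's true effect, it can average them directly, which no real study can do.

```python
import random
import statistics

rng = random.Random(1)
people = []
for _ in range(10000):
    need = rng.random()                  # hypothetical 0-1 index of need
    effect = 10 * need                   # assumed: the programme helps needier people more
    participates = rng.random() < need   # assumed: needier people are more likely to join
    people.append((effect, participates))

# ATE: average effect over everyone in the population.
ate = statistics.mean(effect for effect, _ in people)
# ATT: average effect over those who actually took part.
att = statistics.mean(effect for effect, joined in people if joined)

print(f"ATE: {ate:.2f}  ATT: {att:.2f}")
```

In this setup the ATT exceeds the ATE because participants are drawn disproportionately from those who benefit most. A study reporting the ATT would overstate what a universal rollout could deliver.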

The Logic in One Line

Every method, one promise

Strip away the jargon and every design in this course makes the same promise: my comparison group is a credible stand-in for the treatment group's counterfactual. Methods differ only in how they make that promise believable.

01  RCT: a lottery makes the groups alike  →
02  RD (regression discontinuity): the cutoff makes near-neighbours alike  →
03  DiD (difference-in-differences): a shared trend makes changes comparable  →
04  Matching: observed traits make pairs alike

Judge any study by asking: how believable is its version of that promise — and what would break it?

03

Section Three

Theory of Change & Evaluation Questions

Start Here

No theory of change, no evaluation

Theory of change

An explicit map of how and why a programme's activities are expected to lead to its intended outcomes — the chain of cause and effect, plus the assumptions at each link.

Before measuring impact, you must state what impact you expect and through what pathway. A vague programme cannot be rigorously evaluated.

The Causal Chain

Map the pathway, link by link

01  Cash transfer to mothers  →
02  more household income  →
03  more spent on food & care  →
04  better child nutrition  →
05  lower stunting

Each arrow is an assumption that can fail. If transfers are spent elsewhere, or food prices rise, the chain breaks — and that is exactly what evaluation should test.

Assumptions

The arrows are where programmes fail

A theory of change makes assumptions visible so you can check them. Most programmes that 'fail' did not fail because the idea was wrong — they failed at a specific, checkable link.

Implementation failure

The chain was never delivered — training didn't happen, transfers didn't arrive. The theory is untested.

Theory failure

Everything was delivered, but an assumed link did not hold — the idea itself was wrong.

Impact evaluation paired with monitoring lets you tell these two apart — a null result means little if the programme never actually reached people.

Framing

Turn the theory into evaluation questions

Impact question: did the programme change the outcome, by how much, for whom?
Mechanism question: through which link in the chain?
Process question: was it delivered as designed?
Cost question: was the impact worth what it cost?

Only the first is strictly an impact question — but a good evaluation usually answers several, because 'did it work?' is rarely enough to act on.

A Good Question

What a sharp evaluation question looks like

A usable impact question names the intervention, the population, the outcome, the comparison and the timeframe.

Weak: 'Does our programme help?' Strong: 'Does 18 months of monthly cash transfers to mothers of under-2s in Koraput reduce child stunting, relative to comparable villages without transfers?'

Scope

Choose the right question, not every question

You cannot evaluate everything. A focused evaluation that answers one decision well beats a sprawling one that answers many vaguely. Let the decision you face dictate the question.

Ask up front: what would we do differently if the answer were yes, versus no? If nothing changes either way, the evaluation may not be worth running.

When NOT to do it

When an impact evaluation is the wrong call

The programme is still changing weekly — there is nothing stable to evaluate
The intervention is already proven and the question is purely operational
No credible comparison group is possible (e.g. a universal, simultaneous rollout)
The budget and timeline cannot support a rigorous design: a weak impact evaluation (IE) is worse than none
The decision is already made and evidence will not change it

A badly underpowered or poorly identified impact evaluation can mislead more than no evaluation at all. Knowing when to walk away is a skill.

Evaluability

Run an evaluability check first

An evaluability assessment asks, before you commit: is this programme well-defined, stable and measurable enough to evaluate credibly — and will anyone act on the result?

01  Clear theory of change?  →
02  Measurable outcomes?  →
03  A feasible counterfactual?  →
04  A decision waiting on the answer?

04

Section Four

Randomised Experiments (RCTs)

The Gold Standard

Randomisation: let the lottery build the counterfactual

Randomised controlled trial (RCT)

A study in which eligible units are assigned to treatment or control by chance, so the two groups are comparable in expectation and the difference in their outcomes estimates the causal effect.

If a coin flip decides who gets the programme, the treatment and control groups have no systematic reason to differ — so any later difference in outcomes can be attributed to the programme.

Why It Works

Chance balances what you can and cannot see

This is the deep power of randomisation: with enough units, it balances the two groups on observed characteristics (age, caste, income) and on unobserved ones (motivation, ability, family support) — in expectation.

No other method can claim to balance the unobservable factors. That is why a well-run RCT has the strongest claim to internal validity.
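The claim about unobservables can be checked in a simulation, where the "unobserved" variable is visible to us but not to the imagined evaluator. This sketch assumes invented numbers: a true effect of 5, a hidden `motivation` score that drives outcomes, and a fair coin flip deciding treatment.

```python
import random
import statistics

rng = random.Random(2)
TRUE_EFFECT = 5.0  # hypothetical programme effect

units = []
for _ in range(20000):
    motivation = rng.gauss(0, 1)   # never measured by the evaluator
    treated = rng.random() < 0.5   # the coin flip
    outcome = 50 + 10 * motivation + rng.gauss(0, 5)
    if treated:
        outcome += TRUE_EFFECT
    units.append((motivation, treated, outcome))

def group_mean(index, treated_flag):
    """Mean of one field (0 = motivation, 2 = outcome) within one arm."""
    return statistics.mean(u[index] for u in units if u[1] == treated_flag)

motivation_gap = group_mean(0, True) - group_mean(0, False)
estimated_effect = group_mean(2, True) - group_mean(2, False)

print(f"gap in hidden motivation between arms: {motivation_gap:.3f}")
print(f"estimated effect: {estimated_effect:.2f} (true: {TRUE_EFFECT})")
```

With a large sample the hidden motivation gap sits close to zero, so the simple difference in mean outcomes lands near the true effect. With few units, chance imbalance can still be sizeable, which is why the balance holds "in expectation".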

Check Balance

Show the groups started alike

[Chart: baseline balance, treatment vs control. Illustrative example, not real data.]

A balance table like this is the first thing to check in any RCT: if randomisation worked, the groups look alike at baseline on every measured characteristic.
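One common way to summarise each row of a balance table is the standardised mean difference: the gap in means divided by the pooled standard deviation. The sketch below uses hypothetical baseline ages, and the 0.1 threshold is a widely used rule of thumb, not a formal test.

```python
import statistics

def standardised_difference(treatment, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_variance = (statistics.variance(treatment) + statistics.variance(control)) / 2
    if pooled_variance == 0:
        return 0.0  # no variation in either group, so no measurable imbalance
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_variance ** 0.5

# Hypothetical baseline ages for eight people in each arm.
treatment_age = [34, 29, 41, 38, 30, 36, 33, 39]
control_age = [35, 28, 40, 37, 31, 36, 34, 38]

smd_age = standardised_difference(treatment_age, control_age)
balanced = abs(smd_age) < 0.1  # rule-of-thumb threshold (an assumption, not a standard)
print(f"standardised difference in age: {smd_age:.3f}, balanced: {balanced}")
```

A commissioner would expect one such figure per baseline characteristic, with most of them small.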

J-PAL

Who champions the RCT in development

The Abdul Latif Jameel Poverty Action Lab (J-PAL), based at MIT, has run hundreds of randomised evaluations of anti-poverty programmes worldwide, including across South Asia. Two of its co-founders, Abhijit Banerjee and Esther Duflo, shared the 2019 Nobel Prize in Economics with Michael Kremer for this experimental approach.

A separate organisation, 3ie (the International Initiative for Impact Evaluation), funds and synthesises impact evaluations across the Global South.

Growth

How randomised evaluations took off

[Chart: the rise of randomised evaluations in development. Illustrative trend, schematic only.]

Schematic, not a precise count — but the direction is real: randomised impact evaluation went from rare to mainstream in development over two decades.

Unit of Randomisation

Randomise individuals, or whole clusters?

Individual

Each person assigned separately. Most statistically efficient, but risky when treated and control people mix and influence each other.

Cluster

Whole villages, schools or clinics assigned together. Needed when the programme operates at group level or spillovers are likely — but you need many clusters.

Cluster designs cost statistical power: 40 villages of 50 people is far weaker than 2,000 individually randomised people. Power is driven by the number of clusters, not just total people.
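The size of that penalty can be sketched with the standard design-effect formula, 1 + (m - 1) x ICC, where m is the cluster size and ICC is the intra-cluster correlation. The ICC of 0.05 used here is an assumption for illustration. Real values vary by outcome and setting and should come from prior data.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent individuals the clustered sample is 'worth'."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)

ASSUMED_ICC = 0.05  # hypothetical intra-cluster correlation

forty_large = effective_sample_size(40, 50, ASSUMED_ICC)     # 2,000 people in 40 villages
hundred_small = effective_sample_size(100, 20, ASSUMED_ICC)  # 2,000 people in 100 villages

print(f"40 villages x 50 people  ~ {forty_large:.0f} independent individuals")
print(f"100 villages x 20 people ~ {hundred_small:.0f} independent individuals")
```

Under this assumed ICC, the same 2,000 people carry far more information when spread over more clusters.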

Spillovers

When treatment leaks to the control group

If treated and control units interact, the programme can spill over — a dewormed child stops infecting an untreated neighbour. The control group is no longer a clean counterfactual, and the effect is understated.

Cluster randomisation — treating whole villages — is the usual defence, keeping treated and control units far enough apart that they don't contaminate each other.

Ethics

Equipoise: the ethics of the lottery

Equipoise

Genuine uncertainty about whether the programme works. When we honestly do not know, randomly choosing who gets it first is defensible — and the evaluation resolves the uncertainty for everyone.

Random phase-in: everyone gets it eventually; the lottery just sets the order
Oversubscription lottery: when demand exceeds supply, a lottery is already a fair way to allocate places
Never withhold a known, effective, life-saving intervention to run a trial

05

Section Five

Quasi-Experimental Designs

When You Can't Randomise

Finding a counterfactual in the real world

Often you cannot randomise — the programme is already running, or rolled out to everyone, or randomisation is infeasible. Quasi-experimental designs exploit natural variation to approximate a control group.

They can be highly credible, but each rests on an assumption you must argue for, whereas an RCT rests on a coin flip no one can dispute. Know the assumption behind every design.

Diff-in-Diff

Difference-in-differences: subtract the trend

Difference-in-differences (DiD) compares the change in a treated group with the change in a comparison group over the same period. Subtracting the comparison group's change removes whatever trend would have happened anyway.

01  Treated: before → after
02  Comparison: before → after
03  Impact = (treated change) − (comparison change)
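The arithmetic is simple enough to write out. The four group means below are hypothetical, chosen only to show how the subtraction works.

```python
def difference_in_differences(treated_before, treated_after,
                              comparison_before, comparison_after):
    """Treated group's change minus the comparison group's change."""
    treated_change = treated_after - treated_before
    comparison_change = comparison_after - comparison_before
    return treated_change - comparison_change

# Hypothetical mean outcomes (for example, a test score out of 100).
naive_before_after = 62.0 - 40.0                               # treated change alone
did_impact = difference_in_differences(40.0, 62.0, 38.0, 46.0)

print(f"before-after change in treated group: {naive_before_after:.0f}")
print(f"difference-in-differences estimate:   {did_impact:.0f}")
```

Here the treated group's raw gain includes the trend that the comparison group also experienced. DiD removes that trend, leaving a smaller and more defensible estimate.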

See It

DiD: the gap that opens up

[Chart: outcome before and after, treated vs comparison. Illustrative example, not real data.]

The grey dashed line is the assumed counterfactual: where the treated group would have gone, tracking the comparison group's trend. The estimated impact is the gap above it (~14 points).

The Key Assumption

DiD identifies only under parallel trends

Parallel trends

The assumption that, absent the programme, the treated and comparison groups would have moved in parallel — the same trend over time. DiD is only valid if this holds.

If the groups were already diverging before the programme, DiD attributes that pre-existing divergence to the programme and misstates impact, overstating it when the treated group was already pulling ahead. Always check trends in the pre-period first.
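A first, informal pre-period check is to see whether the gap between the groups was stable before the programme began. The sketch below is a rough screen on hypothetical group means, not a formal test. A fuller analysis would plot the pre-period series and test the pre-trend statistically.

```python
def pre_period_gap_drift(treated_pre, comparison_pre):
    """Change in the treated-minus-comparison gap across pre-programme periods.

    A drift near zero is consistent with parallel pre-trends. It does not
    prove the trends would have stayed parallel afterwards.
    """
    gaps = [t - c for t, c in zip(treated_pre, comparison_pre)]
    return gaps[-1] - gaps[0]

# Hypothetical group means over three pre-programme survey rounds.
parallel_drift = pre_period_gap_drift([30, 33, 36], [28, 31, 34])   # gap stays at 2
diverging_drift = pre_period_gap_drift([30, 35, 40], [28, 30, 32])  # gap grows from 2 to 8

print(f"drift when trends are parallel: {parallel_drift}")
print(f"drift when trends diverge:      {diverging_drift}")
```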

Regression Discontinuity

RD: compare just either side of a cutoff

Many programmes use a sharp eligibility cutoff — a poverty score below 0.33, a test mark above 60. People just above and just below the line are nearly identical, except one gets the programme. That tiny band is a natural experiment.

Regression discontinuity (RD)

A design that estimates impact from the jump in outcomes exactly at an eligibility threshold, comparing units narrowly above and below the cutoff.

See It

RD: read the jump at the cutoff

[Chart: outcome jumps at the eligibility threshold. Illustrative example, not real data.]

The vertical jump at the cutoff (~14 points) is the estimated effect. Crucially, RD gives a local effect — valid for units near the threshold, not for everyone.
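One simple way to read the jump is to fit a straight line on each side of the cutoff, using only units within a chosen bandwidth, and compare the two fitted values at the threshold. The sketch below does this on simulated data. The cutoff of 60, the bandwidth of 10, the built-in jump of 14 and the rule that units at or above the cutoff are treated are all assumptions for illustration. Real RD analyses choose the bandwidth carefully and test sensitivity to it.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def rd_jump(scores, outcomes, cutoff, bandwidth):
    """Local linear estimate of the jump in outcomes at the cutoff."""
    # Centre scores on the cutoff so each intercept is the fitted value there.
    below = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff - bandwidth <= s < cutoff]
    above = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff <= s <= cutoff + bandwidth]
    intercept_below, _ = fit_line(*zip(*below))
    intercept_above, _ = fit_line(*zip(*above))
    return intercept_above - intercept_below

# Simulated data: outcome rises smoothly with the score, plus a jump of 14 at 60.
rng = random.Random(3)
scores = [rng.uniform(0, 100) for _ in range(5000)]
outcomes = [20 + 0.3 * s + (14 if s >= 60 else 0) + rng.gauss(0, 3) for s in scores]

jump = rd_jump(scores, outcomes, cutoff=60, bandwidth=10)
print(f"estimated jump at the cutoff: {jump:.1f}")
```

Only units within ten points of the cutoff enter the estimate, which is why the result speaks for people near the threshold and not for everyone.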

Matching / PSM

Matching: build a statistical twin

Matching pairs each treated unit with one or more untreated units that look alike on observed characteristics. Propensity-score matching (PSM) matches on a single number: the estimated probability of being treated, given those characteristics.

The fatal limit: matching can only balance what you measured. If treated and untreated differ on something unobserved, such as motivation or hidden need, matching cannot fix it. That is why matching is weaker than an RCT.
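To show the mechanics, here is a deliberately simplified sketch: nearest-neighbour matching on a single observed covariate, with replacement. It is not propensity-score matching, which would first estimate each unit's probability of treatment from several covariates. The data are hypothetical pairs of (years of schooling, outcome).

```python
def nearest_neighbour_att(treated, untreated):
    """Average treated-minus-match gap, matching each treated unit to the
    untreated unit with the closest covariate value (with replacement)."""
    gaps = []
    for covariate, outcome in treated:
        _, matched_outcome = min(untreated, key=lambda unit: abs(unit[0] - covariate))
        gaps.append(outcome - matched_outcome)
    return sum(gaps) / len(gaps)

# Hypothetical (years of schooling, outcome) pairs.
treated_units = [(5, 70), (6, 75)]
untreated_units = [(5, 64), (6, 70), (2, 40)]

naive_gap = (sum(y for _, y in treated_units) / len(treated_units)
             - sum(y for _, y in untreated_units) / len(untreated_units))
matched_att = nearest_neighbour_att(treated_units, untreated_units)

print(f"naive gap: {naive_gap:.1f}, matched estimate: {matched_att:.1f}")
```

Matching drops the influence of the poorly comparable untreated unit and shrinks the estimate. If the groups also differed on something not recorded in the data, the matched estimate would still be biased.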

Instrumental Variables

IV: borrow some random-like variation

An instrumental variable is an outside factor that nudges people into the programme but has no other path to the outcome. It isolates the slice of programme take-up that behaves as if random — e.g. distance to a facility shifting who enrols.

A valid instrument needs two things: relevance (it genuinely shifts take-up) and the exclusion restriction (it affects the outcome only through take-up). The second can never be fully proven — only argued.
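With a binary instrument, the simplest IV estimate is the Wald ratio: the instrument's effect on the outcome divided by its effect on take-up. The sketch below uses hypothetical group means for households living near to and far from a facility. It assumes both relevance and the exclusion restriction hold, which the code cannot check.

```python
def wald_estimate(outcome_near, outcome_far, takeup_near, takeup_far):
    """IV (Wald) estimate: reduced form divided by first stage."""
    reduced_form = outcome_near - outcome_far  # instrument's effect on the outcome
    first_stage = takeup_near - takeup_far     # instrument's effect on take-up
    if first_stage == 0:
        raise ValueError("instrument does not shift take-up, so it is not relevant")
    return reduced_form / first_stage

# Hypothetical means: outcomes of 54 vs 50, take-up of 60% vs 20%.
iv_effect = wald_estimate(54.0, 50.0, 0.6, 0.2)
print(f"estimated effect of take-up on the outcome: {iv_effect:.1f}")
```

The estimate applies to people whose take-up is actually shifted by distance, not to everyone. A small first stage makes the ratio unstable, which is one practical reason weak instruments are a concern.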

Compare

The quasi-experimental toolkit at a glance

Design	Exploits	Key assumption
Diff-in-differences	Before/after × treated/comparison	Parallel trends
Regression discontinuity	A sharp eligibility cutoff	Units similar across the cutoff
Matching / PSM	Observed similarity	No unobserved confounders
Instrumental variables	An external nudge into take-up	Relevance + exclusion restriction

Each is only as credible as its assumption. A commissioner's job is to ask: what must be true for this to identify the effect — and is it?

06

Section Six

Choosing a Method

The Trade-Off

There is no single best method

The 'best' design is the most credible one that is feasible and ethical for your situation. Method choice is a negotiation between three forces, not a ranking to memorise.

Validity

How credible is the causal claim?

Feasibility

Can you actually do it, in time and budget?

Ethics

Is it fair to those involved?

Internal Validity

Internal validity: is the answer right here?

Internal validity

The degree to which a study correctly identifies the causal effect for the people and place it studied — free of selection bias and confounding.

A well-run RCT typically has the strongest internal validity; a before-after comparison the weakest. Internal validity is the first hurdle: an answer that is wrong here is useless everywhere.

The Credibility Ladder

Designs ranked by credibility

Design	Counterfactual quality	Credibility
Randomised controlled trial	Strongest — balances unobservables	Highest
Regression discontinuity	Strong, but local to the cutoff	High
Difference-in-differences	Good if parallel trends hold	Medium-high
Instrumental variables	Depends on a defensible instrument	Conditional
Matching / PSM	Only as good as observed variables	Medium
Before-after / naive comparison	No real counterfactual	Lowest

Climb as high as feasibility and ethics allow — but a credible quasi-experiment beats a badly run RCT. Execution matters as much as the rung.

Feasibility

What makes a design practical

Is there a stage of rollout where you can still assign or compare?
Are baseline data and a comparison group available?
Do you have the sample size, budget and time the design needs?
Will the result arrive in time to inform the decision?

The single biggest feasibility lever is timing: the best designs are built in before a programme starts, not retrofitted afterwards.

Design In Early

Plan the evaluation before the programme rolls out

A staggered or phased rollout — which most large programmes need anyway — is a gift to evaluators. It creates a natural comparison group (those not yet reached) at no extra cost.

Lesson for commissioners: involve the evaluator at the design stage. Retrofitting rigour onto a finished programme is the hardest and least credible path.

Mixed Methods

Quantitative tells you whether; qualitative tells you why

Quantitative IE

Estimates the size of the effect and how sure we are. Answers did it work, and by how much?

Qualitative work

Explains the mechanism, the surprises, the implementation. Answers why, how, and for whom?

The strongest evaluations combine both: a number you can trust, and a story that explains it. Neither alone is enough to act on.

Decision Guide

A rough rule for choosing

01  Can you assign at random, ethically? → RCT
02  If not, is there a sharp eligibility cutoff? → RD
03  If not, is there a phased rollout with baseline data? → DiD
04  If not, is there only an external nudge into take-up? → IV
05  None of the above? → reconsider whether to run an IE

Execution Beats Elegance

A well-run simple design beats a botched fancy one

The credibility ladder ranks designs, but in the field, execution decides everything. An RCT wrecked by attrition, spillovers and broken randomisation can be less trustworthy than a careful difference-in-differences.

So weigh the design and the team, the timeline and the field conditions together. Ambition you cannot execute is worse than modesty you can.

07

Section Seven

Sampling & Statistical Power

The Stakes

Why an underpowered evaluation misleads

If your sample is too small, even a real, useful effect can fail to reach statistical significance. You then conclude 'no impact' — when the truth is you simply could not detect it. This is the curse of the underpowered study.

An underpowered evaluation wastes money and can kill a programme that actually works. Power must be planned before data collection — it cannot be fixed afterwards.

Definition

Statistical power, defined

Statistical power

The probability that a study will detect a real effect of a given size, if one truly exists. By convention, evaluations aim for 80% power or more.

Power of 80% means that if the programme really has the effect you assumed, you have an 80% chance of finding a statistically significant result — and a 20% chance of missing it.
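For a simple two-arm comparison of means, power can be approximated with the normal distribution. The sketch below assumes equal arm sizes, individual randomisation, a known outcome standard deviation and a two-sided 5% test. It ignores the negligible chance of a significant result in the wrong direction. The effect of 0.125 standard deviations is a hypothetical input.

```python
from statistics import NormalDist

def power_two_arm(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test comparing two equal-sized arms."""
    standard_error = sd * (2 / n_per_arm) ** 0.5
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the estimate clears the critical value when the effect is real.
    return 1 - NormalDist().cdf(z_critical - effect / standard_error)

# Hypothetical inputs: an effect of 0.125 standard deviations.
for n in (250, 500, 1000, 2000):
    print(f"n per arm = {n:>4}: power = {power_two_arm(0.125, 1.0, n):.2f}")
```

Under these assumptions power reaches roughly 80% at about 1,000 per arm, and doubling the sample beyond that adds progressively less.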

MDE

Minimum detectable effect: what can you even see?

Minimum detectable effect (MDE)

The smallest true effect a study is reliably able to detect, given its sample size, the outcome's variability and the desired power.

Flip the question round: with this sample, what is the smallest impact I could detect? If your MDE is a 10-point gain but the programme can realistically deliver 3, the study is doomed before it starts.
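The same normal approximation can be turned around to give the MDE for a fixed sample. This is a sketch for a simple two-arm, individually randomised design with equal arms. The standard deviation of 25 points and the 100 people per arm are hypothetical values chosen to mirror the example above.

```python
from statistics import NormalDist

def minimum_detectable_effect(sd, n_per_arm, alpha=0.05, power=0.80):
    """Smallest true effect detectable with the given power (two-sided test)."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)  # about 2.8 by default
    return multiplier * sd * (2 / n_per_arm) ** 0.5

mde = minimum_detectable_effect(sd=25.0, n_per_arm=100)
plausible_effect = 3.0  # hypothetical realistic gain, in the same units

print(f"MDE with 100 per arm: {mde:.1f} points")
print(f"study can detect the plausible effect: {mde <= plausible_effect}")
```

If the MDE comes out far above what the programme can plausibly achieve, the options are a larger sample, a less noisy outcome measure, or not running the study.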

Drivers

What determines how big a sample you need

Factor	Effect on required sample
Smaller expected effect	Much larger sample needed
Higher outcome variability	Larger sample needed
Higher power target (e.g. 90%)	Larger sample needed
Clustered design	Larger sample — driven by number of clusters
A good baseline to control for	Smaller sample needed

The headline driver is effect size: detecting small effects is expensive. Be honest about how big an impact is plausible.

See It

How power rises with sample size

[Chart: power vs sample size. Illustrative, schematic shape only.]

Power climbs steeply at first, then flattens as it approaches 1. Most evaluations aim for the conventional 80% (the dashed line, reached at ~1,000 per arm here); beyond it, extra sample buys little.

Clustering

Why clusters cost you power

When you randomise whole villages, people within a village resemble each other — they share the same school, water, weather. These correlated responses carry less independent information than the same number of unrelated individuals.

So power is driven by the number of clusters, not the total headcount. Forty large villages can be weaker than a hundred small ones. Add clusters, not just people.

In Practice

Run a power calculation before you commit

State the smallest effect worth detecting (your MDE)
Estimate outcome variability from prior data or a pilot
Account for clustering, attrition and partial take-up
Solve for the sample size that gives at least 80% power

Tools like J-PAL's resources and free software make this routine. Insist on seeing the power calculation in any evaluation proposal — a design without one is a red flag.
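The four steps can be sketched as a single function. This is a simplified normal-approximation calculation for two equal arms, not a replacement for dedicated power software. The adjustments used are common textbook ones: a design effect for clustering, inflation for expected attrition, and scaling the detectable effect by the take-up rate. All input values below are hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_per_arm(mde, sd, icc=0.0, cluster_size=1,
                            attrition=0.0, take_up=1.0,
                            alpha=0.05, power=0.80):
    """People needed per arm to detect `mde` with the requested power."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    diluted_mde = mde * take_up                 # partial take-up shrinks the average effect
    n = 2 * (multiplier * sd / diluted_mde) ** 2
    n *= 1 + (cluster_size - 1) * icc           # design effect for clustering
    n /= 1 - attrition                          # recruit extra to cover expected drop-out
    return math.ceil(n)

# Step 1: MDE of 0.2 standard deviations; step 2: sd set to 1 (standardised outcome).
simple_n = required_sample_per_arm(mde=0.2, sd=1.0)
# Step 3: clusters of 20 with ICC 0.05, 10% attrition, 80% take-up (all assumed).
adjusted_n = required_sample_per_arm(mde=0.2, sd=1.0, icc=0.05, cluster_size=20,
                                     attrition=0.10, take_up=0.80)

print(f"per arm, no adjustments:   {simple_n}")
print(f"per arm, with adjustments: {adjusted_n}")
```

The realistic adjustments more than triple the requirement in this example, which is why a proposal that reports only the unadjusted figure deserves a second look.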

Two Errors

False positives and false negatives

Type I (false positive)

Concluding the programme worked when it didn't. Controlled by the significance level — conventionally a 5% risk.

Type II (false negative)

Missing a real effect. Its risk is 1 − power — the very thing a good sample size guards against.

Underpowered studies are dominated by Type II error: they scream 'no effect' when they simply could not see one. Power is how you buy down that risk.

08

Section Eight

Measurement & Data Collection

Outcomes First

Decide what to measure before how

The outcome indicators flow straight from the theory of change. Choose them before the evaluation starts, define them precisely, and commit to them — so you cannot fish for whichever outcome happens to look good afterwards.

Distinguish primary outcomes (the one or two you will judge success on) from secondary ones. Pre-specifying the primary outcome is a core discipline of credible IE.

Good Indicators

What makes an outcome indicator usable

Valid

Captures the concept you actually care about

Reliable

Gives the same reading on repeat measurement

Sensitive

Moves when the real outcome moves

Feasible

Can be measured affordably in the field

Baseline / Endline

Measure before, measure after

01  BASELINE: measure outcomes before the programme  →
02  RANDOMISE / ASSIGN: form treatment & comparison  →
03  DELIVER: run the programme  →
04  ENDLINE: re-measure the same outcomes

A baseline does double duty: it confirms the groups started balanced, and it sharpens power by letting you control for starting levels. Skip it only if you must.

Data Sources

Surveys and administrative data

Survey data

Collected for the evaluation — exactly the outcomes you need, but costly, and prone to recall and reporting error.

Administrative data

Already generated by the system: school registers, HMIS (health management information system) data, MGNREGA (rural employment guarantee) records. Cheap and continuous, but may not measure quite what you want.

Linking your evaluation to existing administrative systems can slash cost and enable long-term follow-up — where the data quality is good enough to trust.

Measurement Bias

How measurement quietly corrupts a finding

Social desirability: people report what they think you want to hear
Recall error: hazy memory of past income, illness, spending
Surveyor effects: who asks, and how, shifts the answer
Differential measurement: measuring treatment and control groups differently

The last is the most dangerous: if treated respondents are surveyed more enthusiastically than controls, you manufacture an effect out of thin air. Measure both groups identically.

Good Instruments

Build the survey to avoid bias

Pilot every instrument before the real round — without exception
Use neutral wording; avoid leading and double-barrelled questions
Blind enumerators to treatment status where you possibly can
Use the same instrument, timing and team for both groups

Where feasible, prefer objective measures (test scores, anthropometry, biomarkers, admin records) over self-report — they are harder to bias.

Attrition

When people drop out of the study

Attrition

The loss of study participants between baseline and endline, through migration, refusal or death, leaving them unmeasured at follow-up.

Attrition is dangerous when it differs between groups, or relates to the outcome — if the worst-off in the treatment group leave, the survivors look artificially good. Track it, report it, and test whether it is balanced.
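
One simple way to test whether attrition is balanced is a two-proportion z-test on the drop-out rates of the two groups. The sketch below assumes individual-level randomisation (no clustering) and uses a normal approximation; the function name and the headcounts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def attrition_balance(n_base_t, n_end_t, n_base_c, n_end_c):
    """Return (treatment attrition rate, comparison attrition rate, two-sided p-value).

    Assumption: a pooled two-proportion z-test, ignoring clustering.
    """
    lost_t = n_base_t - n_end_t
    lost_c = n_base_c - n_end_c
    rate_t = lost_t / n_base_t
    rate_c = lost_c / n_base_c
    pooled = (lost_t + lost_c) / (n_base_t + n_base_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_base_t + 1 / n_base_c))
    if se == 0:
        # Nobody (or everybody) dropped out in both groups: no difference to test.
        return rate_t, rate_c, 1.0
    z = (rate_t - rate_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_t, rate_c, p_value


# Hypothetical counts: 1,000 surveyed at baseline in each group.
rate_t, rate_c, p = attrition_balance(1000, 900, 1000, 850)
print(f"treatment {rate_t:.0%}, comparison {rate_c:.0%}, p = {p:.4f}")
```

Equal rates are necessary but not sufficient: the two groups can lose the same share of people yet lose different kinds of people, so it is also worth comparing the baseline characteristics of those who left.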

Timing

When you measure shapes what you find

Measure too early and the effect has not yet appeared — a nutrition programme needs months to move stunting. Measure too late and a real effect may have faded, or comparison villages may have caught up.

Let the theory of change set the clock: when should this outcome plausibly respond? Endline timing is a design choice, not an afterthought — and a second follow-up tells you whether effects last.

09

Section Nine

Analysis & Interpreting Results

Effect Size

How big, not just whether

The first thing to read is the effect size: how much did the outcome change? Report it in units a decision-maker understands — percentage points, rupees, days of schooling — not just a coefficient.

Then ask the practical question: is an effect of this size worth the programme's cost? A real but tiny effect may not justify the spend.

Significance

Significant is not the same as important

Statistical significance

A result unlikely to have arisen by chance alone if the programme truly had no effect. It speaks to confidence, not to the size or importance of the effect.

With a huge sample, a trivially small effect can be 'statistically significant'. With a small sample, a large, real effect can miss significance. Always read the effect size and the uncertainty — never the p-value alone.
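
The point can be checked with a few lines of arithmetic. This is a minimal sketch assuming a simple difference in means between two equal-sized groups, a known standard deviation and a normal approximation; the effect sizes and sample sizes are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect, sd, n_per_arm):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sd * sqrt(2 / n_per_arm)
    z = effect / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Trivial effect (0.02 standard deviations), enormous sample.
p_tiny_effect = two_sided_p(effect=0.02, sd=1.0, n_per_arm=100_000)

# Large effect (0.40 standard deviations), very small sample.
p_large_effect = two_sided_p(effect=0.40, sd=1.0, n_per_arm=20)

print(f"tiny effect, huge sample:   p = {p_tiny_effect:.6f}")
print(f"large effect, small sample: p = {p_large_effect:.3f}")
```

Under these assumptions the trivial effect clears the 5% threshold easily while the large one does not, which is why the p-value cannot be read as a measure of size.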

Confidence Intervals

Report the range, not just the point

An estimated impact of '6 percentage points' is shorthand for a range — say 2 to 10 — the confidence interval. The width of that range tells you how precisely the effect was estimated.

If the interval comfortably excludes zero, the effect is reasonably firm. If it straddles zero, the evaluation cannot rule out 'no effect'. Read intervals, not just stars.
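
A sketch of where such a range comes from, assuming a binary outcome, individual-level randomisation and a normal-approximation 95% interval for a difference in proportions. The proportions and sample sizes are hypothetical, chosen so the first case lands near the "2 to 10" example above.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions_ci(p_treat, p_comp, n_treat, n_comp, level=0.95):
    """Return (estimate, lower, upper) for p_treat - p_comp (normal approximation)."""
    estimate = p_treat - p_comp
    se = sqrt(p_treat * (1 - p_treat) / n_treat + p_comp * (1 - p_comp) / n_comp)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, estimate - z * se, estimate + z * se


# Same 6-point estimate, two different sample sizes per group.
for n in (1200, 300):
    est, lo, hi = diff_in_proportions_ci(0.56, 0.50, n, n)
    verdict = "excludes zero" if lo > 0 or hi < 0 else "straddles zero"
    print(f"n = {n} per group: {est * 100:.0f} points, "
          f"95% CI {lo * 100:.1f} to {hi * 100:.1f} ({verdict})")
```

The point estimate is identical in both runs; only the width of the interval, and therefore what you can conclude, changes with the sample.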

ITT vs ToT

Intention-to-treat vs treatment-on-the-treated

Intention-to-treat (ITT)

The effect of being offered the programme — everyone assigned to treatment, whether or not they took it up. Reflects real-world rollout, where take-up is never 100%.

Treatment-on-the-treated (ToT)

The effect on those who actually took up the programme. Usually larger than ITT, because the ITT average is diluted by people who were offered the programme but never took it up.

Keep groups by original assignment to preserve the randomisation. For a policy that can only offer (not force) a programme, ITT is often the more honest, decision-relevant number.
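
A common way to get from ITT to ToT without breaking the randomisation is to scale the ITT by the difference in take-up between the two assigned groups. The sketch below assumes that being offered the programme affects outcomes only through taking it up, and that nobody does the opposite of their assignment. If some of the comparison group also take up the programme, the result is better read as the effect on those induced to participate by the offer. All numbers are hypothetical.

```python
def itt_and_tot(mean_assigned_treatment, mean_assigned_comparison,
                takeup_treatment, takeup_comparison=0.0):
    """Return (ITT, ToT), comparing groups strictly by original assignment."""
    itt = mean_assigned_treatment - mean_assigned_comparison
    takeup_gap = takeup_treatment - takeup_comparison
    if takeup_gap <= 0:
        raise ValueError("assignment must raise take-up for ToT to be defined")
    return itt, itt / takeup_gap


# Hypothetical test scores: 56 vs 50 by assignment; 60% of those offered took part.
itt, tot = itt_and_tot(56.0, 50.0, takeup_treatment=0.60)
print(f"ITT = {itt:.1f} points, ToT = {tot:.1f} points")
```

Note that the calculation never compares participants with non-participants directly; doing so would reintroduce the selection bias that randomisation removed.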

Heterogeneity

Did it work differently for different people?

An average effect can hide wide variation. A programme might help women but not men, or the poorest but not the better-off. Heterogeneity analysis looks for these differences — and is often where the most useful learning lives.

But beware: test enough subgroups and one will look 'significant' by chance. Pre-specify the subgroups you care about, and treat surprising ones as hypotheses to test next time, not conclusions.

Null Results

A null result is not a failure

Finding no significant impact is genuine, valuable knowledge — it can stop a wasteful programme or redirect resources. But distinguish a true zero from an inconclusive one.

A real null

Well-powered study, tight interval around zero: the programme genuinely did little. Act on it.

An empty null

Underpowered, wide interval: you simply couldn't detect an effect. This tells you almost nothing.
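
One possible rule of thumb for telling the two apart is to compare the confidence interval with the smallest effect that would still be worth acting on. That threshold is an assumption the commissioner has to supply; it is not defined in this deck, and the labels and numbers below are illustrative.

```python
def read_null(ci_low, ci_high, smallest_meaningful_effect):
    """Classify a result from its confidence interval.

    Assumption: 'smallest_meaningful_effect' is a positive threshold agreed
    in advance, in the same units as the interval.
    """
    if ci_low > 0 or ci_high < 0:
        return "effect detected"
    if -smallest_meaningful_effect < ci_low and ci_high < smallest_meaningful_effect:
        return "real null"    # every meaningful effect is ruled out
    return "empty null"       # meaningful effects remain plausible


# Hypothetical intervals in percentage points; threshold of 2 points.
print(read_null(-0.5, 0.8, 2.0))   # tight interval around zero
print(read_null(-6.0, 9.0, 2.0))   # wide interval
```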

Honesty

Read against the threats you learned

Was the design well identified — did the key assumption hold?
Were the groups balanced at baseline?
Was attrition low and balanced across groups?
Is the primary outcome the pre-specified one, or a convenient substitute?
Is the effect size practically meaningful, not just significant?

Mind the Multiple

Test twenty outcomes and one will 'work'

Measure enough outcomes and subgroups and, by chance alone, roughly one in twenty will look 'significant' at the usual 5% threshold, even if the programme did nothing. Reporting only the winners (a form of p-hacking) manufactures findings that will not replicate.

Defences: pre-specify the primary outcome, limit the number of tests, and adjust for multiple comparisons. Treat a lone surprising result as a hypothesis for the next study, not a conclusion.
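
A short sketch of the arithmetic and of one standard adjustment. It assumes the tests are independent when computing the chance of a false positive, and it uses the Holm step-down correction as one example of adjusting for multiple comparisons (other corrections exist). The p-values are made up.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests looks
    'significant' when the programme truly has no effect on anything."""
    return 1 - (1 - alpha) ** n_tests


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


print(f"20 tests: {chance_of_false_positive(20):.0%} chance of at least one false positive")

raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical p-values for four outcomes
print([round(p, 2) for p in holm_adjust(raw)])
```

In this invented example three outcomes pass the 5% threshold before adjustment and only one does afterwards.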

10

Section Ten

External Validity, Cost & Ethics

Will It Travel?

External validity: will it work elsewhere?

External validity

The degree to which a result found in one place, time and population holds in another. A programme that worked in Bihar may not work the same way in Tamil Nadu — or at national scale.

Internal validity asks 'is the answer right here?'; external validity asks 'does it transfer there?' A perfectly identified RCT can still mislead if you scale it into a different context.

What Breaks Transfer

Why results don't always generalise

Context: different markets, culture, institutions
Implementation: a research-grade pilot is usually run far better than a government scale-up
Scale effects: general-equilibrium changes once everyone is treated
Population: the studied group differs from the target group

Don't ask only 'did it work?' Ask 'why did it work, and are those conditions present where I want to use it?' Mechanism travels better than a headline number.

Cost-Effectiveness

Impact per rupee, not impact alone

Impact is only half the decision. Cost-effectiveness asks how much outcome you buy per rupee — letting you compare very different programmes chasing the same goal.

A smaller-impact programme that costs a tenth as much may be the better buy. Always pair the effect size with a credible cost figure — the question is impact per rupee, not impact alone.
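
The comparison can be sketched as outcome gained per 1,000 rupees spent. The two programmes, their effects and their costs are entirely hypothetical, and the sketch assumes both effects are measured in the same outcome unit and both costs are full per-participant costs.

```python
def outcome_per_1000_rupees(effect_per_participant, cost_per_participant):
    """Units of outcome bought per 1,000 rupees (same outcome unit for all programmes)."""
    if cost_per_participant <= 0:
        raise ValueError("cost must be positive")
    return effect_per_participant / cost_per_participant * 1000


# Hypothetical: extra years of schooling per child, and cost per child in rupees.
programmes = {
    "A (larger impact)": (0.30, 3000.0),
    "B (smaller impact)": (0.10, 300.0),
}

for name, (effect, cost) in programmes.items():
    ratio = outcome_per_1000_rupees(effect, cost)
    print(f"Programme {name}: {ratio:.2f} extra years per 1,000 rupees")
```

Here programme B has a third of the impact at a tenth of the cost, so it buys more than three times as much outcome per rupee.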

Do No Harm

Ethics: the programme and the study both

Informed consent: participants understand and may refuse, without penalty
Do no harm: the study itself must not worsen anyone's situation
Privacy: protect respondents' data, especially the marginalised
Ethics review: an independent institutional review board (IRB) approves the design

Recall equipoise: random assignment is ethical only when we genuinely do not know what works — never withhold a proven, essential intervention to run a trial.

Transparency

Pre-registration and open evidence

Pre-registration

Publicly recording your hypotheses, primary outcome and analysis plan before collecting or seeing the data — so you cannot quietly change the goalposts to fit the result.

Registries like the AEA RCT Registry and 3ie's Registry for International Development Impact Evaluations (RIDIE) make evaluation honest and cumulative. Pre-registration is the main defence against cherry-picking and p-hacking.

Publication Bias

The file-drawer problem

Studies that find big, positive effects get published and shared; null results often languish in a file drawer. So the published record can overstate what works — you see the hits, not the misses.

Trust systematic reviews and replications over any single splashy study. Bodies like 3ie and the Campbell Collaboration synthesise across studies precisely to correct this bias.
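
A small simulation can make the file-drawer problem concrete. It is a toy model under loud assumptions: 2,000 identical small studies of a programme whose true effect is 0.10 standard deviations, each with 50 people per group, and a crude publication rule under which only positive, statistically significant estimates are published. None of these numbers come from real studies.

```python
import random
from math import sqrt
from statistics import mean

random.seed(7)  # fixed seed so the run is repeatable

TRUE_EFFECT = 0.10          # in standard deviations (assumed)
N_PER_ARM = 50              # small studies (assumed)
SE = sqrt(2 / N_PER_ARM)    # standard error of each study's estimate

# Each study's estimate is the true effect plus sampling noise.
estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(2000)]

# Crude publication rule: only positive, significant results see daylight.
published = [e for e in estimates if e / SE > 1.96]

print(f"true effect:                {TRUE_EFFECT:.2f}")
print(f"average of all studies:     {mean(estimates):.2f}")
print(f"average of published only:  {mean(published):.2f}")
print(f"share of studies published: {len(published) / len(estimates):.0%}")
```

The average across all studies sits close to the truth, while the published subset overstates it several times over; this is the gap that syntheses drawing on unpublished as well as published work try to close.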

11

Section Eleven

Using IE for Decisions & Further Reading

Commissioning

How to commission an impact evaluation well

01 Define the decision the evidence will inform
02 Write a sharp evaluation question
03 Bring the evaluator in at design stage
04 Demand a power calculation & identification strategy
05 Pre-register; plan to act on whatever you find

Your leverage as commissioner is greatest before the contract is signed. Ask the hard questions then, not when the report lands.

Reading a Proposal

Questions to ask any evaluator

What is the counterfactual, and how credible is it?
What must be true for this design to identify the effect?
What is the MDE, and is it smaller than a plausible impact?
How will you handle attrition, spillovers and partial take-up?
Is the primary outcome pre-specified, and will you pre-register?
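
To sanity-check the MDE (minimum detectable effect) an evaluator quotes, the standard textbook formula for a simple individually randomised design can be sketched as below. It assumes no clustering, no covariates, a two-sided test and full take-up; clustered designs need a larger sample for the same MDE. The sample size and the "plausible impact" are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_total, sd, share_treated=0.5,
                              alpha=0.05, power=0.80):
    """MDE for a two-group comparison of means, individual randomisation.

    MDE = (z_{1-alpha/2} + z_{power}) * sd * sqrt(1 / (P * (1 - P) * N))
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sd * sqrt(1 / (share_treated * (1 - share_treated) * n_total))


# Hypothetical: 1,000 respondents split evenly, outcome in standard-deviation units.
mde = minimum_detectable_effect(n_total=1000, sd=1.0)
plausible_impact = 0.25  # assumed, e.g. taken from earlier studies

print(f"MDE = {mde:.3f} SD; plausible impact = {plausible_impact:.2f} SD")
print("adequately powered" if mde < plausible_impact else "underpowered")
```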

The Limits

The limits of 'what works'

Impact evaluation tells you whether a specific programme worked in a specific place. It cannot, by itself, tell you what to value, how to navigate trade-offs, or whether the result will hold at scale.

Evidence informs judgement; it does not replace it. The number is an input to a decision, never the decision itself.

— a working principle for evidence use

Evidence Into Action

Why good evidence still goes unused

Findings arrive after the decision was already made
Results are framed for academics, not for managers
No one owns turning the finding into a change in practice
Inconvenient nulls are quietly shelved

Plan use from the start: agree who decides what, by when, and commit to act on the answer — including if it is disappointing.

The Cornerstone

The one book to read: Gertler et al.

Impact Evaluation in Practice by Paul Gertler, Sebastian Martinez, Patrick Premand, Laura Rawlings and Christel Vermeersch (World Bank) is the standard, free, practitioner-friendly handbook — the natural next step after this deck.

It walks through counterfactuals, every design in this course, and a worked case study, in plain language. Download it free from the World Bank's Open Knowledge Repository.

Where to Learn More

Trusted sources and resources

Source	What it offers	Note
J-PAL	Randomised evaluations, training, evidence reviews	MIT-based; strong South Asia office
3ie	Impact evaluations & systematic reviews from the Global South	Searchable evidence portal
World Bank DIME	Methods, guidance, the Gertler et al. handbook	Free handbook & toolkits
Campbell Collaboration	Systematic reviews of social interventions	Synthesis across studies
AEA RCT Registry	Pre-registered trial protocols	Check what was promised

The Takeaways

If you remember five things

Always ask 'compared to what?' — the counterfactual is everything
Randomisation balances the unobservable, in expectation — no other method can claim that
Every quasi-experiment rests on an assumption — name it and test it
Plan power before, read effect size after — significance is not importance
Know when NOT to run an IE — and pre-register when you do

Pair this deck with ImpactMojo's Econometrics, Data Literacy and Research Ethics 101 courses.

Impact Evaluation 101 · Complete

Now go ask:
compared to what?

You don't have to run the regression to commission an evaluation well — you need to frame the question, choose a credible counterfactual, and read the findings honestly. Explore the rest of the ImpactMojo 101 Series, free forever.

CC BY-NC-ND 4.0·Free Forever·ImpactMojo 101 Series