ImpactMojo · Impact Evaluation 101 · www.impactmojo.in
ImpactMojo 101 Series · Free Forever
Impact Evaluation 101
Designing & Commissioning Credible Evaluations — a Foundational Course for Development Programme & MEL Practitioners in South Asia
Practice-Focused · South Asia Focus · 100 Slides · Free Access
What We Cover
01 What Impact Evaluation Is
02 The Counterfactual & Causal Inference
03 Theory of Change & Evaluation Questions
04 Randomised Experiments (RCTs)
05 Quasi-Experimental Designs
06 Choosing a Method
07 Sampling & Statistical Power
08 Measurement & Data Collection
09 Analysis & Interpreting Results
10 External Validity, Cost & Ethics
11 Using IE for Decisions & Further Reading
01
Section One
What Impact Evaluation Is
Measuring the change a programme caused
An impact evaluation answers one demanding question: how much of the change we observe was actually caused by our programme — and not by everything else happening at the same time?
Impact evaluation
A study that estimates the causal effect of a programme, policy or intervention on an outcome — the difference between what happened with the programme and what would have happened without it.
Every other M&E question can be useful. This one tells you whether the programme is worth running at all.
Impact is not the same as monitoring
Dimension | Monitoring | Impact evaluation
Asks | Are we doing what we planned? | Did it cause the change?
Tracks | Inputs, activities, outputs | Outcomes vs a counterfactual
Timing | Continuous, routine | Periodic, designed in advance
Example | 1,200 toilets built | Did diarrhoea actually fall because of them?
Answers | Implementation | Attribution
Monitoring tells you the programme ran. Impact evaluation tells you whether it worked. You need both — they answer different questions.
Where impact sits in the results chain
01 INPUTS: budget, staff, training
02 ACTIVITIES: deliver the programme
03 OUTPUTS: toilets built, classes held
04 OUTCOMES: behaviour, take-up
05 IMPACT: the caused change in wellbeing
Monitoring lives on the left of this chain; impact evaluation targets the far right — and insists on attribution, not just association.
Compared to what?
A village had the programme; outcomes improved. Tempting to claim success — but outcomes might have improved anyway, through a good monsoon, rising incomes, or other schemes. The honest question is always: compared to what?
The fundamental question of impact evaluation is not 'did things get better?' but 'did things get better because of us?'
— a working principle of evaluation practice
What credible evidence buys you
When you have it
  • Scale what genuinely works
  • Stop what does not, and free up budget
  • Defend funding with real evidence
  • Learn why, not just whether
When you don't
  • Scale failures by mistake
  • Credit yourself for trends you didn't cause
  • Cut effective programmes blindly
  • Repeat the same expensive errors
Impact evaluation is not always the answer
Impact evaluation is powerful but costly and slow. It is one tool in the MEL toolkit — not a stamp of legitimacy to apply to everything. Much of the time, good monitoring and process evaluation serve you better.
This course is as much about knowing when to commission an impact evaluation as about how to design one.
You don't have to run it to commission it well
Most practitioners will commission an impact evaluation rather than estimate it themselves. That still demands real literacy: framing the question, choosing a design, judging a proposal, and reading the findings critically.
This deck is written for the commissioner and the manager — the person who must ask hard questions of an evaluator, not necessarily write the regression.
Impact evaluation among its cousins
Activity | Question it answers
Needs assessment | What is the problem, and for whom?
Process evaluation | Was the programme delivered as designed?
Monitoring | Are we on track against the plan?
Impact evaluation | Did the programme cause the change?
Cost-effectiveness analysis | Was the impact worth the cost?
Impact evaluation is one row in this table, not the whole table. It shines only when paired with the others — especially process evaluation, which tells you whether a null means the idea or the delivery failed.
02
Section Two
The Counterfactual & Causal Inference
The counterfactual: the road not taken
Counterfactual
What would have happened to the very same people, in the same period, had the programme not existed. It is the benchmark against which impact is measured.
Impact = outcome with the programme − outcome without it, for the same group. The first we observe; the second we never can. Everything in this course is a strategy to estimate that missing half.
The fundamental problem of causal inference
A person is either treated or not. We can never observe the same person in both states at once, so the true individual counterfactual is permanently missing. This is the fundamental problem of causal inference.
The solution is not to find each person's missing twin, but to build a comparison group that, on average, stands in for what the treated group would have looked like untreated.
Two tempting comparisons that mislead
Before vs after
Compare the same group before and after. But the world moved on too — you confuse the programme with everything else that changed (the monsoon, prices, other schemes).
Participants vs non-participants
Compare those who joined to those who didn't. But joiners often differ — more motivated, better placed. That difference is selection bias, not impact.
Both are easy, cheap and wrong. A credible counterfactual is the whole game.
Selection bias, defined
Selection bias
When the people who receive a programme differ systematically from those who don't — in ways that also affect the outcome — so a naive comparison mixes the programme's effect with those pre-existing differences.
If a skills programme enrols the most motivated youth, their later success partly reflects motivation, not training. Selection bias is a leading reason naive evaluations misstate impact, most often by overstating it.
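A minimal sketch of the skills-programme case, using a hand-built population rather than real data. The numbers (a base outcome of 40, a 10-point motivation bonus, a true training effect of 5) and all names are assumptions chosen only to make the arithmetic visible.

```python
# Hypothetical population: outcome = 40, plus 10 if motivated, plus 5 if trained.
TRUE_EFFECT = 5


def outcome(motivated, trained):
    return 40 + 10 * motivated + TRUE_EFFECT * trained


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Self-selection: only the motivated youth enrol.
population = [{"motivated": m} for m in [1] * 30 + [0] * 70]
for person in population:
    person["trained"] = person["motivated"]
    person["outcome"] = outcome(person["motivated"], person["trained"])

joiners = mean(p["outcome"] for p in population if p["trained"])
non_joiners = mean(p["outcome"] for p in population if not p["trained"])

naive_estimate = joiners - non_joiners         # mixes training with motivation
selection_bias = naive_estimate - TRUE_EFFECT  # the part due to who joined

print(f"naive: {naive_estimate}, true: {TRUE_EFFECT}, bias: {selection_bias}")
```

Here the naive comparison reports 15 points when the training itself added 5; the other 10 is who the joiners already were.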
Why volunteers make a misleading comparison
[Figure: who self-selects into a training programme. Illustrative example, not real data.]
Joiners already differ before the programme. Comparing their later outcomes to non-joiners' measures who they were, not what the programme did.
What a good comparison group must satisfy
  • It looks like the treatment group before the programme — same characteristics, same trajectory
  • It is exposed to the same outside forces — weather, prices, other schemes
  • The only systematic difference is the programme itself
Get this right and the comparison group's outcomes are a credible estimate of the treatment group's counterfactual. Every method that follows is a different way to build such a group.
ATE and ATT: effect on whom?
ATE
Average Treatment Effect — the effect if everyone in the population were treated.
ATT
Average Treatment effect on the Treated — the effect among those who actually received the programme. Often the policy-relevant one.
They differ when the programme works differently for participants than for everyone. Always ask which one a study reports — and which one your decision needs.
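To make the distinction concrete, here is a small sketch with made-up individual effects. In practice nobody observes these individual effects; the list below is an assumption used only to show which people each average runs over.

```python
# Hypothetical individual treatment effects (unobservable in a real study).
people = [
    {"effect": 8, "treated": True},
    {"effect": 6, "treated": True},
    {"effect": 2, "treated": False},
    {"effect": 1, "treated": False},
    {"effect": 3, "treated": False},
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


ate = mean(p["effect"] for p in people)                   # everyone in the population
att = mean(p["effect"] for p in people if p["treated"])   # participants only

print(f"ATE = {ate}, ATT = {att}")
```

With these numbers the ATT (7) is well above the ATE (4), because the people who took part are the ones the programme helps most. Scaling up to everyone would deliver something closer to the ATE.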
Every method, one promise
Strip away the jargon and every design in this course makes the same promise: my comparison group is a credible stand-in for the treatment group's counterfactual. Methods differ only in how they make that promise believable.
01 RCT: a lottery makes the groups alike
02 RD: the cutoff makes near-neighbours alike
03 DiD: a shared trend makes changes comparable
04 Matching: observed traits make pairs alike
Judge any study by asking: how believable is its version of that promise — and what would break it?
03
Section Three
Theory of Change & Evaluation Questions
No theory of change, no evaluation
Theory of change
An explicit map of how and why a programme's activities are expected to lead to its intended outcomes — the chain of cause and effect, plus the assumptions at each link.
Before measuring impact, you must state what impact you expect and through what pathway. A vague programme cannot be rigorously evaluated.
Map the pathway, link by link
01 Cash transfer to mothers
02 more household income
03 more spent on food & care
04 better child nutrition
05 lower stunting
Each arrow is an assumption that can fail. If transfers are spent elsewhere, or food prices rise, the chain breaks — and that is exactly what evaluation should test.
The arrows are where programmes fail
A theory of change makes assumptions visible so you can check them. Most programmes that 'fail' did not fail because the idea was wrong — they failed at a specific, checkable link.
Implementation failure
The chain was never delivered — training didn't happen, transfers didn't arrive. The theory is untested.
Theory failure
Everything was delivered, but an assumed link did not hold — the idea itself was wrong.
Impact evaluation paired with monitoring lets you tell these two apart — a null result means little if the programme never actually reached people.
Turn the theory into evaluation questions
  • Impact question: did the programme change the outcome, by how much, for whom?
  • Mechanism question: through which link in the chain?
  • Process question: was it delivered as designed?
  • Cost question: was the impact worth what it cost?
Only the first is strictly an impact question — but a good evaluation usually answers several, because 'did it work?' is rarely enough to act on.
What a sharp evaluation question looks like
A usable impact question names the intervention, the population, the outcome, the comparison and the timeframe.
Weak: 'Does our programme help?' Strong: 'Does 18 months of monthly cash transfers to mothers of under-2s in Koraput reduce child stunting, relative to comparable villages without transfers?'
Choose the right question, not every question
You cannot evaluate everything. A focused evaluation that answers one decision well beats a sprawling one that answers many vaguely. Let the decision you face dictate the question.
Ask up front: what would we do differently if the answer were yes, versus no? If nothing changes either way, the evaluation may not be worth running.
When an impact evaluation is the wrong call
  • The programme is still changing weekly — there is nothing stable to evaluate
  • The intervention is already proven and the question is purely operational
  • No credible comparison group is possible (e.g. a universal, simultaneous rollout)
  • The budget and timeline cannot support a rigorous design — a weak IE is worse than none
  • The decision is already made and evidence will not change it
A badly underpowered or poorly identified impact evaluation can mislead more than no evaluation at all. Knowing when to walk away is a skill.
Run an evaluability check first
An evaluability assessment asks, before you commit: is this programme well-defined, stable and measurable enough to evaluate credibly — and will anyone act on the result?
01 Clear theory of change?
02 Measurable outcomes?
03 A feasible counterfactual?
04 A decision waiting on the answer?
04
Section Four
Randomised Experiments (RCTs)
Randomisation: let the lottery build the counterfactual
Randomised controlled trial (RCT)
A study in which eligible units are assigned to treatment or control by chance, so the two groups are comparable in expectation and the difference in their outcomes estimates the causal effect.
If a coin flip decides who gets the programme, the treatment and control groups have no systematic reason to differ — so any later difference in outcomes can be attributed to the programme.
Chance balances what you can and cannot see
This is the deep power of randomisation: with enough units, it balances the two groups on observed characteristics (age, caste, income) and on unobserved ones (motivation, ability, family support) — in expectation.
No other method can claim to balance the unobservable factors. That is why a well-run RCT has the strongest claim to internal validity.
Show the groups started alike
[Figure: baseline balance, treatment vs control. Illustrative example, not real data.]
A balance table like this is the first thing to check in any RCT: if randomisation worked, the groups look alike at baseline on every measured characteristic.
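One common way to summarise a balance table is the standardised difference: the gap in means divided by a pooled standard deviation. The sketch below uses invented baseline values and a 0.1 threshold, which is a widely used rule of thumb and an assumption here, not a figure from this deck.

```python
from statistics import mean, variance


def standardised_difference(treatment, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((variance(treatment) + variance(control)) / 2) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd


# Hypothetical baseline data: (treatment values, control values)
baseline = {
    "age": ([34, 29, 41, 38, 30], [33, 31, 40, 35, 32]),
    "household_size": ([5, 4, 6, 5, 5], [5, 5, 6, 4, 5]),
}

balance = {
    name: standardised_difference(treatment, control)
    for name, (treatment, control) in baseline.items()
}

for name, value in balance.items():
    flag = "ok" if abs(value) < 0.1 else "check"  # 0.1 is a rule of thumb
    print(f"{name}: {value:+.3f} ({flag})")
```

A real balance table would cover every measured characteristic and a far larger sample; a few imbalances by chance are expected even when randomisation worked.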
Who champions the RCT in development
The Abdul Latif Jameel Poverty Action Lab (J-PAL), based at MIT, has run hundreds of randomised evaluations of anti-poverty programmes worldwide, including across South Asia. Two of its co-founders, Abhijit Banerjee and Esther Duflo, shared the 2019 Nobel Prize in Economics with Michael Kremer for this experimental approach.
3ie (the International Initiative for Impact Evaluation), a separate organisation, funds and synthesises impact evaluations across the Global South.
How randomised evaluations took off
[Figure: the rise of randomised evaluations in development. Illustrative trend, schematic only.]
Schematic, not a precise count — but the direction is real: randomised impact evaluation went from rare to mainstream in development over two decades.
Randomise individuals, or whole clusters?
Individual
Each person assigned separately. Most statistically efficient, but risky when treated and control people mix and influence each other.
Cluster
Whole villages, schools or clinics assigned together. Needed when the programme operates at group level or spillovers are likely — but you need many clusters.
Cluster designs cost statistical power: 40 villages of 50 people is far weaker than 2,000 individually randomised people. Power is driven by the number of clusters, not just total people.
When treatment leaks to the control group
If treated and control units interact, the programme can spill over — a dewormed child stops infecting an untreated neighbour. The control group is no longer a clean counterfactual, and the effect is understated.
Cluster randomisation — treating whole villages — is the usual defence, keeping treated and control units far enough apart that they don't contaminate each other.
Equipoise: the ethics of the lottery
Equipoise
Genuine uncertainty about whether the programme works. When we honestly do not know, randomly choosing who gets it first is defensible — and the evaluation resolves the uncertainty for everyone.
  • Random phase-in: everyone gets it eventually; the lottery just sets the order
  • Randomise only when demand exceeds supply — a lottery is already fair
  • Never withhold a known, effective, life-saving intervention to run a trial
05
Section Five
Quasi-Experimental Designs
Finding a counterfactual in the real world
Often you cannot randomise — the programme is already running, or rolled out to everyone, or randomisation is infeasible. Quasi-experimental designs exploit natural variation to approximate a control group.
They can be highly credible, but each rests on an assumption you must argue for, where an RCT can simply point to the coin flip. Know the assumption behind every design.
Difference-in-differences: subtract the trend
Difference-in-differences (DiD) compares the change in a treated group with the change in a comparison group over the same period. Subtracting the comparison group's change removes whatever trend would have happened anyway.
01 Treated: before → after
02 Comparison: before → after
03 Impact = (treated change) − (comparison change)
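The three steps above reduce to one subtraction. A minimal sketch with invented group averages (the four numbers are assumptions, not data from any study):

```python
def diff_in_diff(treated_before, treated_after, comparison_before, comparison_after):
    """Impact = (treated change) - (comparison change)."""
    treated_change = treated_after - treated_before
    comparison_change = comparison_after - comparison_before
    return treated_change - comparison_change


# Hypothetical average outcomes
impact = diff_in_diff(
    treated_before=40, treated_after=62,
    comparison_before=38, comparison_after=46,
)
naive_before_after = 62 - 40  # what a simple before-vs-after would claim

print(f"before-after: {naive_before_after}, DiD: {impact}")
```

The before-after comparison credits the programme with 22 points; DiD removes the 8 points the comparison group gained anyway and leaves 14.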
DiD: the gap that opens up
[Figure: DiD, outcome before and after, treated vs comparison. Illustrative example, not real data.]
The grey dashed line is the assumed counterfactual: where the treated group would have gone, tracking the comparison group's trend. The estimated impact is the gap above it (~14 points).
DiD identifies only under parallel trends
Parallel trends
The assumption that, absent the programme, the treated and comparison groups would have moved in parallel — the same trend over time. DiD is only valid if this holds.
If the groups were already diverging before the programme, DiD attributes that pre-existing divergence to the programme and misstates impact, overstating it when the treated group was already pulling ahead. Always check trends in the pre-period first.
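A simple pre-period check, sketched under the assumption that you have average outcomes for several rounds before the programme started. The function name and the numbers are illustrative only.

```python
def pre_period_gap_changes(treated, comparison):
    """Change in the treated-minus-comparison gap between consecutive pre-periods.

    Values near zero are consistent with parallel trends before the programme.
    """
    gaps = [t - c for t, c in zip(treated, comparison)]
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]


# Hypothetical averages for three rounds before the programme began
parallel = pre_period_gap_changes([30, 33, 36], [28, 31, 34])
diverging = pre_period_gap_changes([30, 35, 40], [28, 31, 34])

print("parallel case: ", parallel)   # gap stays constant
print("diverging case:", diverging)  # gap widens each round
```

Flat pre-period gaps make the assumption more plausible but cannot prove it, since parallel trends is a claim about what would have happened after the programme began.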
RD: compare just either side of a cutoff
Many programmes use a sharp eligibility cutoff — a poverty score below 0.33, a test mark above 60. People just above and just below the line are nearly identical, except one gets the programme. That tiny band is a natural experiment.
Regression discontinuity (RD)
A design that estimates impact from the jump in outcomes exactly at an eligibility threshold, comparing units narrowly above and below the cutoff.
RD: read the jump at the cutoff
[Figure: RD, outcome jumps at the eligibility threshold. Illustrative example, not real data.]
The vertical jump at the cutoff (~14 points) is the estimated effect. Crucially, RD gives a local effect — valid for units near the threshold, not for everyone.
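One way to read the jump is to fit a straight line on each side of the cutoff, within a narrow band, and compare the two lines at the threshold. The sketch below does this on noise-free, invented data with a test-mark cutoff of 60; the bandwidth, the data and the function names are all assumptions, and real analyses choose bandwidths and functional forms with far more care.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def rd_jump(scores, outcomes, cutoff, bandwidth):
    """Gap between the two fitted lines, evaluated exactly at the cutoff."""
    pairs = list(zip(scores, outcomes))
    below = [(s, y) for s, y in pairs if cutoff - bandwidth <= s < cutoff]
    above = [(s, y) for s, y in pairs if cutoff <= s <= cutoff + bandwidth]
    intercept_below, slope_below = fit_line(*zip(*below))
    intercept_above, slope_above = fit_line(*zip(*above))
    return (intercept_above + slope_above * cutoff) - (
        intercept_below + slope_below * cutoff
    )


# Hypothetical data: outcome rises with the score and jumps by 14 at the cutoff.
CUTOFF = 60
scores = list(range(40, 81))
outcomes = [30 + 0.4 * s + (14 if s >= CUTOFF else 0) for s in scores]

jump = rd_jump(scores, outcomes, CUTOFF, bandwidth=10)
print(f"estimated jump at the cutoff: {jump:.2f}")
```

Only units within 10 marks of the cutoff enter the estimate, which is why the result speaks for people near the threshold and not for everyone.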
Matching: build a statistical twin
Matching pairs each treated unit with one or more untreated units that look alike on observed characteristics. Propensity-score matching (PSM) matches on a single number: the estimated probability of being treated, given those characteristics.
The fatal limit: matching can only balance what you measured. If treated and untreated differ on something unobserved — motivation, hidden need — matching cannot fix it. Weaker than an RCT.
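A bare-bones sketch of nearest-neighbour matching on a single observed characteristic. The data are invented, matching is done with replacement, and the function name is hypothetical; propensity-score matching follows the same logic but matches on an estimated probability instead of one raw covariate.

```python
def nearest_neighbour_att(treated, untreated):
    """Match each treated unit to the untreated unit with the closest covariate
    value (with replacement) and average the outcome differences."""
    differences = []
    for covariate, outcome in treated:
        match = min(untreated, key=lambda unit: abs(unit[0] - covariate))
        differences.append(outcome - match[1])
    return sum(differences) / len(differences)


# Hypothetical units: (baseline income, later outcome)
treated = [(100, 58), (150, 66), (200, 76)]
untreated = [(90, 50), (105, 52), (148, 60), (210, 70), (300, 85)]

att = nearest_neighbour_att(treated, untreated)
print(f"matched estimate of the effect on the treated: {att}")
```

The estimate is only as good as the assumption that baseline income captures everything that matters. If the treated also differ on something unmeasured, the matched pairs are not true twins.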
IV: borrow some random-like variation
An instrumental variable is an outside factor that nudges people into the programme but has no other path to the outcome. It isolates the slice of programme take-up that behaves as if random — e.g. distance to a facility shifting who enrols.
A valid instrument needs two things: relevance (it genuinely shifts take-up) and the exclusion restriction (it affects the outcome only through take-up). The second can never be fully proven — only argued.
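A sketch of the idea on a constructed population, using distance to a facility as the instrument. Everything here is an assumption built so that both conditions hold by design: living near the facility shifts enrolment (relevance) and enters the outcome only through enrolment (exclusion). Motivation is treated as unobserved.

```python
# Hypothetical units. `motivated` is invisible to the evaluator: it raises the
# outcome and also makes people enrol even when the facility is far away.
TRUE_EFFECT = 5
units = []
for near in (0, 1):
    for motivated in (0, 1):
        enrolled = 1 if (near or motivated) else 0
        outcome = 50 + 10 * motivated + TRUE_EFFECT * enrolled
        units += [{"near": near, "enrolled": enrolled, "outcome": outcome}] * 25


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def covariance(xs, ys):
    mean_x, mean_y = mean(xs), mean(ys)
    return mean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


z = [u["near"] for u in units]       # instrument
d = [u["enrolled"] for u in units]   # take-up
y = [u["outcome"] for u in units]    # outcome

naive = mean(u["outcome"] for u in units if u["enrolled"]) - mean(
    u["outcome"] for u in units if not u["enrolled"]
)
iv_estimate = covariance(z, y) / covariance(z, d)

print(f"naive: {naive:.2f}, IV: {iv_estimate:.2f}, true: {TRUE_EFFECT}")
```

The naive comparison is inflated by motivation; the instrument recovers 5 because it uses only the variation in enrolment that distance created. With real data the exclusion restriction is not built in, so it has to be argued.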
The quasi-experimental toolkit at a glance
Design | Exploits | Key assumption
Diff-in-differences | Before/after × treated/comparison | Parallel trends
Regression discontinuity | A sharp eligibility cutoff | Units similar across the cutoff
Matching / PSM | Observed similarity | No unobserved confounders
Instrumental variables | An external nudge into take-up | Relevance + exclusion restriction
Each is only as credible as its assumption. A commissioner's job is to ask: what must be true for this to identify the effect — and is it?
06
Section Six
Choosing a Method
There is no single best method
The 'best' design is the most credible one that is feasible and ethical for your situation. Method choice is a negotiation between three forces, not a ranking to memorise.
Validity
How credible is the causal claim?
Feasibility
Can you actually do it, in time and budget?
Ethics
Is it fair to those involved?
Internal validity: is the answer right here?
Internal validity
The degree to which a study correctly identifies the causal effect for the people and place it studied — free of selection bias and confounding.
A well-run RCT typically has the strongest internal validity; a before-after comparison the weakest. Internal validity is the first hurdle: an answer that is wrong here is useless everywhere.
Designs ranked by credibility
Design | Counterfactual quality | Credibility
Randomised controlled trial | Strongest; balances unobservables | Highest
Regression discontinuity | Strong, but local to the cutoff | High
Difference-in-differences | Good if parallel trends hold | Medium-high
Instrumental variables | Depends on a defensible instrument | Conditional
Matching / PSM | Only as good as observed variables | Medium
Before-after / naive comparison | No real counterfactual | Lowest
Climb as high as feasibility and ethics allow — but a credible quasi-experiment beats a badly run RCT. Execution matters as much as the rung.
What makes a design practical
  • Is there a stage of rollout where you can still assign or compare?
  • Are baseline data and a comparison group available?
  • Do you have the sample size, budget and time the design needs?
  • Will the result arrive in time to inform the decision?
The single biggest feasibility lever is timing: the best designs are built in before a programme starts, not retrofitted afterwards.
Plan the evaluation before the programme rolls out
A staggered or phased rollout, which most large programmes need anyway, is a gift to evaluators. It creates a natural comparison group (those not yet reached) at little extra programme cost, provided the order of rollout is random or at least unrelated to expected outcomes.
Lesson for commissioners: involve the evaluator at the design stage. Retrofitting rigour onto a finished programme is the hardest and least credible path.
Quantitative tells you whether; qualitative tells you why
Quantitative IE
Estimates the size of the effect and how sure we are. Answers did it work, and by how much?
Qualitative work
Explains the mechanism, the surprises, the implementation. Answers why, how, and for whom?
The strongest evaluations combine both: a number you can trust, and a story that explains it. Neither alone is enough to act.
A rough rule for choosing
01 Can you assign at random, ethically? → RCT
02 Is there a sharp eligibility cutoff? → RD
03 Phased rollout with baseline data? → DiD
04 Only an external nudge into take-up? → IV
05 None of the above? → reconsider whether to run an IE
A well-run simple design beats a botched fancy one
The credibility ladder ranks designs, but in the field, execution decides everything. An RCT wrecked by attrition, spillovers and broken randomisation can be less trustworthy than a careful difference-in-differences.
So weigh the design and the team, the timeline and the field conditions together. Ambition you cannot execute is worse than modesty you can.
07
Section Seven
Sampling & Statistical Power
Why an underpowered evaluation misleads
If your sample is too small, even a real, useful effect can fail to reach statistical significance. You then conclude 'no impact' — when the truth is you simply could not detect it. This is the curse of the underpowered study.
An underpowered evaluation wastes money and can kill a programme that actually works. Power must be planned before data collection — it cannot be fixed afterwards.
Statistical power, defined
Statistical power
The probability that a study will detect a real effect of a given size, if one truly exists. By convention, evaluations aim for 80% power or more.
Power of 80% means that if the programme really has the effect you assumed, you have an 80% chance of finding a statistically significant result — and a 20% chance of missing it.
Minimum detectable effect: what can you even see?
Minimum detectable effect (MDE)
The smallest true effect a study is reliably able to detect, given its sample size, the outcome's variability and the desired power.
Flip the question round: with this sample, what is the smallest impact I could detect? If your MDE is a 10-point gain but the programme can realistically deliver 3, the study is doomed before it starts.
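A rough sketch of that calculation for the simplest case: individual randomisation, two equal arms, a two-sided test and a normal approximation. The sample size of 200 per arm and the outcome standard deviation of 20 are assumptions for illustration; a real design would also adjust for clustering, covariates and attrition.

```python
from statistics import NormalDist


def minimum_detectable_effect(n_per_arm, outcome_sd, power=0.80, alpha=0.05):
    """Smallest true effect detectable with the given power (simple two-arm design)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 5% two-sided test
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    standard_error = outcome_sd * (2 / n_per_arm) ** 0.5
    return (z_alpha + z_power) * standard_error


mde = minimum_detectable_effect(n_per_arm=200, outcome_sd=20)
print(f"MDE with 200 per arm: {mde:.1f} points")
```

With these inputs the MDE is about 5.6 points, so a programme that can plausibly deliver 3 would need a much larger sample before the study is worth running.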
What determines how big a sample you need
Factor | Effect on required sample
Smaller expected effect | Much larger sample needed
Higher outcome variability | Larger sample needed
Higher power target (e.g. 90%) | Larger sample needed
Clustered design | Larger sample, driven by number of clusters
A good baseline to control for | Smaller sample needed
The headline driver is effect size: detecting small effects is expensive. Be honest about how big an impact is plausible.
How power rises with sample size
[Figure: power curve, power vs sample size. Illustrative, schematic shape only.]
Power climbs steeply at first, then flattens as it approaches 1. The dashed line marks the 80% convention (about 1,000 per arm in this illustration), where most evaluations aim; beyond it, extra sample buys little.
Why clusters cost you power
When you randomise whole villages, people within a village resemble each other — they share the same school, water, weather. These correlated responses carry less independent information than the same number of unrelated individuals.
So power is driven by the number of clusters, not the total headcount. Forty large villages can be weaker than a hundred small ones. Add clusters, not just people.
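The standard way to quantify this is the design effect, 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intra-cluster correlation. The sketch below assumes an ICC of 0.05 and a hundred villages of 20 people each, both chosen purely for illustration.

```python
def effective_sample_size(n_clusters, cluster_size, icc):
    """Total headcount deflated by the design effect 1 + (m - 1) * ICC."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_clusters * cluster_size / design_effect


ICC = 0.05  # assumed intra-cluster correlation

large_villages = effective_sample_size(40, 50, ICC)    # 40 villages of 50
small_villages = effective_sample_size(100, 20, ICC)   # 100 villages of 20
individuals = effective_sample_size(2000, 1, ICC)      # 2,000 randomised one by one

print(f"40 x 50:   about {large_villages:.0f} effective observations")
print(f"100 x 20:  about {small_villages:.0f} effective observations")
print(f"2,000 x 1: {individuals:.0f} effective observations")
```

All three designs survey 2,000 people, yet under this assumption the forty-village design carries the information of roughly 580 independent individuals and the hundred-village design roughly 1,026.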
Run a power calculation before you commit
  • State the smallest effect worth detecting (your MDE)
  • Estimate outcome variability from prior data or a pilot
  • Account for clustering, attrition and partial take-up
  • Solve for the sample size that gives at least 80% power
Tools like J-PAL's resources and free software make this routine. Insist on seeing the power calculation in any evaluation proposal — a design without one is a red flag.
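The four steps can be sketched as one function. This is a back-of-envelope version under stated assumptions: two equal arms, a two-sided test, a normal approximation, the usual design-effect inflation for clustering, and the common approximation that partial take-up scales the required sample by 1 / take-up². The function name and all inputs are hypothetical, and a dedicated power tool should be used for a real proposal.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(mde, outcome_sd, power=0.80, alpha=0.05,
                        cluster_size=1, icc=0.0, attrition=0.0, take_up=1.0):
    """Approximate number of people to enrol in each arm."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n = 2 * (z * outcome_sd / mde) ** 2   # simple individual-level design
    n *= 1 + (cluster_size - 1) * icc     # inflate for clustering
    n /= take_up ** 2                     # partial take-up dilutes the effect
    n /= 1 - attrition                    # enrol extra to cover dropout
    return ceil(n)


simple = sample_size_per_arm(mde=5, outcome_sd=20)
realistic = sample_size_per_arm(
    mde=5, outcome_sd=20, cluster_size=20, icc=0.05, attrition=0.10, take_up=0.8
)

print(f"textbook case: {simple} per arm")
print(f"with clustering, attrition and partial take-up: {realistic} per arm")
```

Under these assumptions the textbook answer of about 250 per arm more than triples once field realities are included, which is why a proposal should show each adjustment explicitly.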
False positives and false negatives
Type I (false positive)
Concluding the programme worked when it didn't. Controlled by the significance level — conventionally a 5% risk.
Type II (false negative)
Missing a real effect. Its risk is 1 − power — the very thing a good sample size guards against.
Underpowered studies are dominated by Type II error: they scream 'no effect' when they simply could not see one. Power is how you buy down that risk.
08
Section Eight
Measurement & Data Collection
Decide what to measure before how
The outcome indicators flow straight from the theory of change. Choose them before the evaluation starts, define them precisely, and commit to them — so you cannot fish for whichever outcome happens to look good afterwards.
Distinguish primary outcomes (the one or two you will judge success on) from secondary ones. Pre-specifying the primary outcome is a core discipline of credible IE.
What makes an outcome indicator usable
Valid
Captures the concept you actually care about
Reliable
Gives the same reading on repeat measurement
Sensitive
Moves when the real outcome moves
Feasible
Can be measured affordably in the field
Measure before, measure after
01 BASELINE: measure outcomes before the programme
02 RANDOMISE / ASSIGN: form treatment & comparison
03 DELIVER: run the programme
04 ENDLINE: re-measure the same outcomes
A baseline does double duty: it confirms the groups started balanced, and it sharpens power by letting you control for starting levels. Skip it only if you must.
Surveys and administrative data
Survey data
Collected for the evaluation — exactly the outcomes you need, but costly, and prone to recall and reporting error.
Administrative data
Already generated by the system — school registers, HMIS, MGNREGA records. Cheap and continuous, but may not measure quite what you want.
Linking your evaluation to existing administrative systems can slash cost and enable long-term follow-up — where the data quality is good enough to trust.
How measurement quietly corrupts a finding
  • Social desirability: people report what they think you want to hear
  • Recall error: hazy memory of past income, illness, spending
  • Surveyor effects: who asks, and how, shifts the answer
  • Differential measurement: measuring treatment and control groups differently
The last is the most dangerous: if treated respondents are surveyed more enthusiastically than controls, you manufacture an effect out of thin air. Measure both groups identically.
Build the survey to avoid bias
  • Pilot every instrument before the real round — without exception
  • Use neutral wording; avoid leading and double-barrelled questions
  • Blind enumerators to treatment status where you possibly can
  • Use the same instrument, timing and team for both groups
Where feasible, prefer objective measures (test scores, anthropometry, biomarkers, admin records) over self-report — they are harder to bias.
When people drop out of the study
Attrition
The loss of study participants between baseline and endline — through migration, refusal or death — who cannot be measured at follow-up.
Attrition is dangerous when it differs between groups, or relates to the outcome — if the worst-off in the treatment group leave, the survivors look artificially good. Track it, report it, and test whether it is balanced.
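A first-pass check on whether attrition is balanced, sketched with invented counts. It compares dropout rates across arms using a two-proportion z-statistic; the counts and function names are assumptions for illustration.

```python
def attrition_rate(baseline_n, endline_n):
    return (baseline_n - endline_n) / baseline_n


def two_proportion_z(rate_a, n_a, rate_b, n_b):
    """z-statistic for the difference between two proportions (pooled variance)."""
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    standard_error = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    return (rate_a - rate_b) / standard_error


# Hypothetical counts: 1,000 per arm at baseline
treatment_rate = attrition_rate(1000, 880)
control_rate = attrition_rate(1000, 940)
differential = treatment_rate - control_rate
z = two_proportion_z(treatment_rate, 1000, control_rate, 1000)

print(f"treatment: {treatment_rate:.0%}, control: {control_rate:.0%}, z = {z:.2f}")
```

Here the treatment arm loses twice as many people as the control arm, and the gap is far too large to be chance. Equal rates would be reassuring but not sufficient, because the people who leave each arm can still differ in kind.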
When you measure shapes what you find
Measure too early and the effect has not yet appeared — a nutrition programme needs months to move stunting. Measure too late and a real effect may have faded, or comparison villages may have caught up.
Let the theory of change set the clock: when should this outcome plausibly respond? Endline timing is a design choice, not an afterthought — and a second follow-up tells you whether effects last.
09
Section Nine
Analysis & Interpreting Results
How big, not just whether
The first thing to read is the effect size: how much did the outcome change? Report it in units a decision-maker understands — percentage points, rupees, days of schooling — not just a coefficient.
Then ask the practical question: is an effect of this size worth the programme's cost? A real but tiny effect may not justify the spend.
Significant is not the same as important
Statistical significance
A result unlikely to have arisen by chance alone if the programme truly had no effect. It speaks to confidence, not to the size or importance of the effect.
With a huge sample, a trivially small effect can be 'statistically significant'. With a small sample, a large, real effect can miss significance. Always read the effect size and the uncertainty — never the p-value alone.
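A small numerical sketch makes the point concrete. It uses a simple z-test for a difference in means with equal arms and a known standard deviation; the effect sizes and sample sizes are invented for illustration, with outcomes measured in standard-deviation units.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect, sd, n_per_arm):
    """Two-sided p-value for a difference in means (z-test, equal arms, common SD)."""
    se = sd * sqrt(2 / n_per_arm)  # standard error of the difference
    z = effect / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Trivial effect (0.02 SD) with 50,000 people per arm
tiny_effect_big_sample = two_sided_p(effect=0.02, sd=1.0, n_per_arm=50_000)

# Large effect (0.40 SD) with only 30 people per arm
large_effect_small_sample = two_sided_p(effect=0.40, sd=1.0, n_per_arm=30)

print(f"tiny effect, huge sample:   p = {tiny_effect_big_sample:.4f}")
print(f"large effect, small sample: p = {large_effect_small_sample:.4f}")
```

Under these assumptions the trivial effect clears the 0.05 threshold and the large one does not, which is why the p-value alone says little about importance.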
Report the range, not just the point
An estimated impact of '6 percentage points' is shorthand for a range, say 2 to 10 percentage points: the confidence interval, conventionally reported at the 95% level. The width of that range tells you how precisely the effect was estimated.
If the interval comfortably excludes zero, the effect is reasonably firm. If it straddles zero, the evaluation cannot rule out 'no effect'. Read intervals, not just stars.
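As a rough sketch, a confidence interval can be built from a point estimate and its standard error using the normal approximation. The function names and the standard errors below are assumptions chosen so that the first case roughly reproduces a '6 points, range 2 to 10' style result.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation confidence interval around a point estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return estimate - z * std_error, estimate + z * std_error


def excludes_zero(interval):
    """True when the whole interval lies on one side of zero."""
    low, high = interval
    return low > 0 or high < 0


# Same 6-point estimate, estimated precisely and imprecisely
precise = confidence_interval(6.0, std_error=2.0)
imprecise = confidence_interval(6.0, std_error=4.0)

print(f"precise:   {precise[0]:.1f} to {precise[1]:.1f}, excludes zero: {excludes_zero(precise)}")
print(f"imprecise: {imprecise[0]:.1f} to {imprecise[1]:.1f}, excludes zero: {excludes_zero(imprecise)}")
```

The point estimate is identical in both cases; only the second interval straddles zero, so only the first supports a firm claim of impact.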
Intention-to-treat vs treatment-on-the-treated
Intention-to-treat (ITT)
The effect of being offered the programme — everyone assigned to treatment, whether or not they took it up. Reflects real-world rollout, where take-up is never 100%.
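Treatment-on-the-treated (ToT)
The effect on those who actually took up the programme. Usually larger than ITT, because the ITT average is diluted by assigned people who never participated. Estimate it by scaling the ITT by the take-up rate, not by dropping non-participants from the comparison.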
Keep groups by original assignment to preserve the randomisation. For a policy that can only offer (not force) a programme, ITT is often the more honest, decision-relevant number.
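A minimal sketch of how the two numbers relate, using the standard scaling (Wald) estimator: ToT is the ITT divided by the difference in take-up between the assigned groups. This is a swapped-in textbook technique, not a method the deck spells out, and it assumes that being offered the programme affects outcomes only through take-up. The function name and all figures are hypothetical.

```python
def itt_and_tot(mean_outcome_assigned_t, mean_outcome_assigned_c, takeup_t, takeup_c=0.0):
    """Return (ITT, ToT), with groups defined by original assignment.

    ITT: difference in mean outcomes between everyone assigned to treatment
         and everyone assigned to control.
    ToT: ITT scaled up by the difference in take-up rates (Wald estimator).
    """
    itt = mean_outcome_assigned_t - mean_outcome_assigned_c
    takeup_gap = takeup_t - takeup_c
    if takeup_gap <= 0:
        raise ValueError("assignment must raise take-up for ToT to be defined")
    return itt, itt / takeup_gap


# Hypothetical: 56% vs 50% of households save regularly; 60% of those offered enrolled
itt, tot = itt_and_tot(0.56, 0.50, takeup_t=0.60)
print(f"ITT = {itt:.3f}, ToT = {tot:.3f}")
```

Comparing only the participants with the control group would break the randomisation, because people who choose to take up a programme usually differ from those who do not.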
Did it work differently for different people?
An average effect can hide wide variation. A programme might help women but not men, or the poorest but not the better-off. Heterogeneity analysis looks for these differences — and is often where the most useful learning lives.
But beware: test enough subgroups and one will look 'significant' by chance. Pre-specify the subgroups you care about, and treat surprising ones as hypotheses to test next time, not conclusions.
A null result is not a failure
Finding no significant impact is genuine, valuable knowledge — it can stop a wasteful programme or redirect resources. But distinguish a true zero from an inconclusive one.
A real null
Well-powered study, tight interval around zero: the programme genuinely did little. Act on it.
An empty null
Underpowered, wide interval: you simply couldn't detect an effect. This tells you almost nothing.
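One hedged way to tell the two apart is to compare the confidence interval with the smallest effect that would matter for the decision. The rule, the function name and the threshold below are illustrative assumptions, not a standard the deck prescribes.

```python
def classify_null(ci_low, ci_high, smallest_effect_that_matters):
    """Label a result by comparing its confidence interval with a decision threshold.

    smallest_effect_that_matters is a judgement call made before the study,
    in the same units as the interval.
    """
    if ci_low > 0 or ci_high < 0:
        return "not a null: interval excludes zero"
    if -smallest_effect_that_matters < ci_low and ci_high < smallest_effect_that_matters:
        return "real null: effects large enough to matter are ruled out"
    return "empty null: effects large enough to matter cannot be ruled out"


# Hypothetical threshold: an impact of 2 percentage points would justify the programme
print(classify_null(-0.5, 0.8, smallest_effect_that_matters=2.0))
print(classify_null(-6.0, 9.0, smallest_effect_that_matters=2.0))
```

The first interval is tight around zero and rules out a 2-point effect; the second is so wide that the programme could be harmful, useless or excellent.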
Read against the threats you learned
  • Was the design well identified — did the key assumption hold?
  • Were the groups balanced at baseline?
  • Was attrition low and balanced across groups?
  • Is the primary outcome the pre-specified one, or a convenient substitute?
  • Is the effect size practically meaningful, not just significant?
Test twenty outcomes and one will 'work'
Measure enough outcomes and subgroups and, by chance alone, roughly one in twenty will look 'significant' at the usual threshold (p < 0.05), even if the programme did nothing. Reporting only the winners (p-hacking) manufactures findings that will not replicate.
Defences: pre-specify the primary outcome, limit the number of tests, and adjust for multiple comparisons. Treat a lone surprising result as a hypothesis for the next study, not a conclusion.
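The arithmetic behind the warning, plus one simple adjustment, can be sketched as follows. The first function assumes the tests are independent; the second applies the Bonferroni correction, which is one common adjustment among several. The p-values are invented.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent tests is 'significant'
    when the programme truly has no effect on anything."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni correction (threshold alpha / number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# With 20 outcomes and no true effect, a spurious 'win' is more likely than not
print(f"chance of at least one false positive in 20 tests: {chance_of_false_positive(20):.2f}")

# Hypothetical p-values for four outcomes: only the first survives the correction
print(bonferroni_significant([0.001, 0.03, 0.04, 0.20]))
```

Bonferroni is deliberately conservative; it trades some power for protection against lone chance findings.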
10
Section Ten
External Validity, Cost & Ethics
External validity: will it work elsewhere?
External validity
The degree to which a result found in one place, time and population holds in another. A programme that worked in Bihar may not work the same way in Tamil Nadu — or at national scale.
Internal validity asks 'is the answer right here?'; external validity asks 'does it transfer there?' A perfectly identified RCT can still mislead if you scale it into a different context.
Why results don't always generalise
  • Context: different markets, culture, institutions
  • Implementation: a research-grade pilot is usually run far better than a government scale-up
  • Scale effects: general-equilibrium changes once everyone is treated
  • Population: the studied group differs from the target group
Don't ask only 'did it work?' Ask 'why did it work, and are those conditions present where I want to use it?' Mechanism travels better than a headline number.
Impact per rupee, not impact alone
Impact is only half the decision. Cost-effectiveness asks how much outcome you buy per rupee — letting you compare very different programmes chasing the same goal.
A smaller-impact programme that costs a tenth as much may be the better buy. Always pair the effect size with a credible cost figure — the question is impact per rupee, not impact alone.
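A back-of-envelope comparison shows how a cheaper, smaller-impact programme can win. All names and figures below are hypothetical, and the sketch ignores complications such as discounting and uncertainty in both cost and effect.

```python
def cost_per_unit_outcome(total_cost, n_participants, effect_per_participant):
    """Rupees spent per unit of outcome gained (for example, per extra year of schooling)."""
    total_outcome_gained = n_participants * effect_per_participant
    return total_cost / total_outcome_gained


# Programme A: larger impact, ten times the cost
cost_a = cost_per_unit_outcome(total_cost=5_000_000, n_participants=10_000,
                               effect_per_participant=0.20)

# Programme B: a quarter of the impact, a tenth of the cost
cost_b = cost_per_unit_outcome(total_cost=500_000, n_participants=10_000,
                               effect_per_participant=0.05)

print(f"A: Rs {cost_a:,.0f} per extra year of schooling")
print(f"B: Rs {cost_b:,.0f} per extra year of schooling")
```

In this invented case Programme B buys each extra year of schooling for well under half of what Programme A pays, despite its smaller headline effect.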
Ethics: the programme and the study both
  • Informed consent: participants understand and may refuse, without penalty
  • Do no harm: the study itself must not worsen anyone's situation
  • Privacy: protect respondents' data, especially the marginalised
  • Ethics review: an independent board (IRB) approves the design
Recall equipoise: random assignment is ethical only when we genuinely do not know what works — never withhold a proven, essential intervention to run a trial.
Pre-registration and open evidence
Pre-registration
Publicly recording your hypotheses, primary outcome and analysis plan before collecting or seeing the data — so you cannot quietly change the goalposts to fit the result.
Registries like the AEA RCT Registry and 3ie's Registry for International Development Impact Evaluations (RIDIE) make evaluation honest and cumulative. Pre-registration is the main defence against cherry-picking and p-hacking.
The file-drawer problem
Studies that find big, positive effects get published and shared; null results often languish in a file drawer. So the published record can overstate what works — you see the hits, not the misses.
Trust systematic reviews and replications over any single splashy study. Bodies like 3ie and the Campbell Collaboration synthesise across studies precisely to correct this bias.
11
Section Eleven
Using IE for Decisions & Further Reading
How to commission an impact evaluation well
01 Define the decision the evidence will inform
02 Write a sharp evaluation question
03 Bring the evaluator in at design stage
04 Demand a power calculation & identification strategy
05 Pre-register; plan to act on whatever you find
Your leverage as commissioner is greatest before the contract is signed. Ask the hard questions then, not when the report lands.
Questions to ask any evaluator
  • What is the counterfactual, and how credible is it?
  • What must be true for this design to identify the effect?
  • What is the MDE, and is it smaller than a plausible impact?
  • How will you handle attrition, spillovers and partial take-up?
  • Is the primary outcome pre-specified, and will you pre-register?
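To sanity-check an evaluator's answer on the MDE (minimum detectable effect), a rough calculation helps. The sketch below uses the textbook formula for individual-level randomisation with no clustering and no covariates; a clustered design would give a larger MDE. The function and parameter names are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sd, n_total, share_treated=0.5, alpha=0.05, power=0.80):
    """Smallest true effect the study would detect with the given power.

    Assumes individual-level randomisation, a two-sided test and a common
    outcome standard deviation in both arms.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    se = sd * sqrt(1 / (share_treated * (1 - share_treated) * n_total))
    return (z_alpha + z_power) * se


# Outcome in standard-deviation units, 1,000 respondents split evenly
mde = minimum_detectable_effect(sd=1.0, n_total=1000)
print(f"MDE = {mde:.3f} standard deviations")
```

If the plausible impact of the programme is smaller than this figure, the study is likely to end in an inconclusive null, and the design needs a larger sample or a more precise outcome measure.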
The limits of 'what works'
Impact evaluation tells you whether a specific programme worked in a specific place. It cannot, by itself, tell you what to value, how to navigate trade-offs, or whether the result will hold at scale.
Evidence informs judgement; it does not replace it. The number is an input to a decision, never the decision itself.
— a working principle for evidence use
Why good evidence still goes unused
  • Findings arrive after the decision was already made
  • Results are framed for academics, not for managers
  • No one owns turning the finding into a change in practice
  • Inconvenient nulls are quietly shelved
Plan use from the start: agree who decides what, by when, and commit to act on the answer — including if it is disappointing.
The one book to read: Gertler et al.
Impact Evaluation in Practice by Paul Gertler, Sebastian Martinez, Patrick Premand, Laura Rawlings and Christel Vermeersch (World Bank) is the standard, free, practitioner-friendly handbook — the natural next step after this deck.
It walks through counterfactuals, every design in this course, and a worked case study, in plain language. Download it free from the World Bank's Open Knowledge Repository.
Trusted sources and resources
Source | What it offers | Note
J-PAL | Randomised evaluations, training, evidence reviews | MIT-based; strong South Asia office
3ie | Impact evaluations & systematic reviews of the Global South | Searchable evidence portal
World Bank DIME | Methods, guidance, the Gertler et al. handbook | Free handbook & toolkits
Campbell Collaboration | Systematic reviews of social interventions | Synthesis across studies
AEA RCT Registry | Pre-registered trial protocols | Check what was promised
If you remember five things
  • Always ask 'compared to what?' — the counterfactual is everything
  • Randomisation balances the unobservable, in expectation — no other method can claim that
  • Every quasi-experiment rests on an assumption — name it and test it
  • Plan power before, read effect size after — significance is not importance
  • Know when NOT to run an IE — and pre-register when you do
Pair this deck with ImpactMojo's Econometrics, Data Literacy and Research Ethics 101 courses.
Impact Evaluation 101 · Complete
Now go ask:
compared to what?
CC BY-NC-ND 4.0 · Free Forever · ImpactMojo 101 Series