Method Pack -- M6 -- Interactive

Reading & Critiquing an Evidence Paper

Most development professionals read research papers for conclusions. This pack teaches you to read for design -- so you can assess whether the conclusions are actually supported by the evidence presented.

4 modules -- ~100 min -- Interactive

Your Capstone

1-Page Critique of a Paper of Your Choice

A structured critique covering design, validity, statistical claims, and practical implications -- ready to share with colleagues or use in a journal club.

Module 1 -- ~25 min

Reading for design (RCT, quasi-experimental, qualitative)

Before evaluating findings, identify the research design. The design determines what claims the paper can and cannot make. Most over-claims come from using a weak design to support a strong claim.

The design hierarchy (for causal claims)

Randomised Controlled Trial (RCT) -- random assignment to treatment/control. Strongest causal claim. Check: was randomisation actually random? Was there attrition? Were there spillovers?
Quasi-experimental -- comparison group but no random assignment. Difference-in-differences, regression discontinuity, propensity score matching. Check: how was the comparison group selected? Could selection bias remain?
Pre-post (no comparison) -- measures change over time in treatment group only. Cannot separate programme effect from time, maturation, or other changes. Check: does the paper acknowledge this limitation?
Cross-sectional -- snapshot at one point. Can describe associations but not causation or change. Check: does the paper claim change without time-series data?
Qualitative -- explores mechanisms, experiences, meaning. Not designed for causal claims. Check: does the paper stay within its design or over-reach?

Worked example -- Identifying design

Paper: "Impact of MGNREGA on rural livelihoods in Telangana" (hypothetical).

Method section says: "We surveyed 500 MGNREGA cardholders and compared their outcomes with 300 non-cardholders in the same mandals."

Design: Cross-sectional with comparison group. Not quasi-experimental because there is no pre-period and no method to address selection bias (people who get MGNREGA cards may differ systematically from those who do not). The paper can describe associations but cannot claim MGNREGA "caused" the differences.

Your Paper -- Design Identification

Choose a paper relevant to your work and identify its design.

Paper title and citation

Research design

Randomised Controlled Trial
Quasi-experimental (DiD, RDD, PSM)
Pre-post (no comparison group)
Cross-sectional
Qualitative
Mixed-methods

Research question the paper is trying to answer

Self-check

A paper titled "Impact of mid-day meals on school attendance" uses a pre-post design with no comparison group. Can it claim "impact"?

Yes -- pre-post measures change

No -- without a comparison group, the change could be due to other factors (time trends, policy changes, seasonal effects). It can document change but not attribute it.

Only if the sample is large enough

Only if the effect is very large

Correct. "Impact" implies causation, which requires a credible counterfactual (what would have happened without the programme). Pre-post designs cannot provide this. The attendance increase might be due to a new state policy, teacher recruitment, or seasonal patterns.
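
The counterfactual logic can be made concrete with a little arithmetic. The sketch below uses invented attendance figures (they come from no study) to show how a pre-post change mixes the programme effect with the background trend, and how a comparison group separates the two. It assumes both groups would have followed the same trend without the programme, which is an assumption a real paper would have to defend.

```python
# Hypothetical attendance rates in percent; illustrative numbers only.
treated_before, treated_after = 70.0, 78.0        # schools with mid-day meals
comparison_before, comparison_after = 69.0, 74.0  # similar schools without


def pre_post_change(before: float, after: float) -> float:
    """Raw change over time within a single group."""
    return after - before


def difference_in_differences(t_before: float, t_after: float,
                              c_before: float, c_after: float) -> float:
    """Treated change minus comparison change.

    Credible only if both groups would have moved in parallel
    without the programme (an assumption, not a fact).
    """
    return pre_post_change(t_before, t_after) - pre_post_change(c_before, c_after)


naive = pre_post_change(treated_before, treated_after)
background = pre_post_change(comparison_before, comparison_after)
did = difference_in_differences(treated_before, treated_after,
                                comparison_before, comparison_after)

print(f"Pre-post change in treated schools: {naive:.1f} points")
print(f"Change in comparison schools:       {background:.1f} points")
print(f"Difference-in-differences estimate: {did:.1f} points")
```

With these made-up numbers, the pre-post design would report an 8-point rise, but 5 of those points also appear in schools without the programme. Only the remaining 3 points are a candidate programme effect.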

Module 2 -- ~25 min

Assessing internal and external validity

Internal validity: can the difference the study reports actually be attributed to the intervention? Or are there alternative explanations (selection, attrition, measurement problems) that the design did not rule out?

External validity: do the findings generalise beyond the study population? Would the same intervention produce the same results in a different context?

Internal validity threats

Selection bias -- treatment and comparison groups differ systematically before the intervention
Attrition -- if 30% of the sample drops out, the remaining sample may be systematically different
Spillover/contamination -- control group receives indirect benefits from the treatment
Hawthorne effect -- participants behave differently because they know they are being studied
Measurement error -- instruments are unreliable or biased
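
Attrition is one threat in this list that you can usually check yourself from a paper's sample-flow table. The sketch below is a minimal illustration with hypothetical sample sizes. The function name is invented for this example, and there is no universal cut-off for "too much" attrition. The point is to compare overall attrition with the gap between arms.

```python
def attrition_rate(enrolled: int, followed_up: int) -> float:
    """Share of the baseline sample lost by follow-up."""
    if enrolled <= 0 or not 0 <= followed_up <= enrolled:
        raise ValueError("need enrolled > 0 and 0 <= followed_up <= enrolled")
    return (enrolled - followed_up) / enrolled


# Hypothetical numbers, not taken from any study.
treatment = attrition_rate(enrolled=1000, followed_up=850)
control = attrition_rate(enrolled=1000, followed_up=700)
differential = abs(treatment - control)

print(f"Treatment attrition:    {treatment:.0%}")
print(f"Control attrition:      {control:.0%}")
print(f"Differential attrition: {differential:.0%}")
```

A large gap between arms is usually more worrying than high but equal attrition, because it suggests the people who stayed in each arm are no longer comparable.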

External validity questions

Was the study population representative of the broader target population?
Was the context (governance, culture, market access) similar to where you want to apply findings?
Was the implementer typical or exceptional? (J-PAL studies often have stronger implementation than government scale-up.)

The India generalisability problem

Many influential development studies are conducted in one or two Indian states. A study from Tamil Nadu may not generalise to Bihar -- governance capacity, social structures, and economic conditions differ enormously. When reading India-based papers, always check: which states? Urban or rural? Which population? The answer to "does it work?" is almost always "it depends on where and for whom."

Your Validity Assessment

Internal validity threats you identify (Does the paper acknowledge them? How serious are they?)

External validity -- would this work in your context?

What the paper does not tell you

Self-check

An RCT of a livelihoods programme in Bangladesh shows strong effects. Your organisation wants to replicate it in Odisha. What is the primary concern?

Internal validity -- the RCT may have been poorly designed

External validity -- Bangladesh's context (NGO density, microfinance infrastructure, social norms) may differ significantly from Odisha

Statistical significance -- the sample may have been too small

The language barrier

Correct. A well-executed RCT has strong internal validity by design. For replication, the main question is external validity: will the same intervention work in a different context with different implementing capacity, social norms, and market structures?

Module 3 -- ~25 min

Effect sizes and statistical claims

You do not need to be a statistician to critique statistical claims. You need to know three things: what the effect size means, what statistical significance actually tells you, and when to be suspicious.

Key concepts

Effect size (Cohen's d or standardised mean difference) -- how large the difference is, in standard deviations. For education interventions measured on test scores, Kraft (2020) proposes: below 0.05 is small, 0.05 to under 0.20 is medium, 0.20 and above is large.
Statistical significance (p-value) -- the probability of seeing a result at least this extreme if the true effect were zero. It is not the probability that the finding is wrong. p < 0.05 is a convention, not truth. With large samples, tiny effects become "significant." With small samples, real effects can remain "insignificant."
Confidence interval -- the range of plausible true effects. If the 95% CI includes zero, the effect is not statistically significant at the 5% level. A CI from -0.05 to +0.40 means the data are compatible with anything from a slightly negative effect to a large positive one.
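
To see why sample size drives "significance," the sketch below computes an approximate two-sided p-value for a given standardised effect and sample size. It is a simplification under stated assumptions: two equal-sized arms, no clustering, and the large-sample approximation SE(d) = sqrt(2 / n per arm). The function names and the example numbers are hypothetical.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def approx_p_for_effect(d: float, n_per_arm: int) -> float:
    """Approximate p-value for a standardised mean difference d.

    Assumes equal arms, independent observations, and
    SE(d) ~ sqrt(2 / n_per_arm); real studies often need
    clustering adjustments that widen the standard error.
    """
    se = math.sqrt(2 / n_per_arm)
    return two_sided_p(d / se)


# A trivially small effect in a very large study...
tiny_effect_big_sample = approx_p_for_effect(d=0.02, n_per_arm=50_000)
# ...versus a large effect in a very small study.
big_effect_small_sample = approx_p_for_effect(d=0.30, n_per_arm=20)

print(f"d=0.02, n=50,000 per arm: p = {tiny_effect_big_sample:.4f}")
print(f"d=0.30, n=20 per arm:     p = {big_effect_small_sample:.4f}")
```

Under these assumptions the 0.02 SD effect falls below 0.05 while the 0.30 SD effect does not, which is why the p-value alone says little about whether an effect matters.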

Worked example -- Reading a results table

Paper reports: "Programme increased test scores by 0.12 SD (95% CI: 0.03-0.21, p=0.008, n=2,400)."

Translation: The effect is modest (0.12 SD) and statistically significant. The confidence interval excludes zero but is not narrow: the true effect could plausibly be as low as 0.03 SD or as high as 0.21 SD. With 2,400 students, the study had good power. The effect is unlikely to be zero, but it is modest -- roughly equivalent to 1-2 months of additional learning. Whether this justifies the programme cost depends on the per-student investment.
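
One quick check on a reported result is whether the estimate, confidence interval, and p-value agree with each other. The sketch below backs out the standard error from the 95% CI quoted in the worked example and recomputes the p-value. It assumes a symmetric, normal-approximation interval. The helper names are hypothetical.

```python
import math


def se_from_ci(lower: float, upper: float, z_crit: float = 1.96) -> float:
    """Back out the standard error from a symmetric 95% CI."""
    return (upper - lower) / (2 * z_crit)


def two_sided_p(estimate: float, se: float) -> float:
    """Two-sided p-value for estimate / se under a standard normal."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2))


# Figures from the worked example: 0.12 SD, 95% CI 0.03-0.21.
estimate, lower, upper = 0.12, 0.03, 0.21

midpoint = (lower + upper) / 2   # should sit on the point estimate
se = se_from_ci(lower, upper)
p = two_sided_p(estimate, se)

print(f"CI midpoint:    {midpoint:.2f}")
print(f"Implied SE:     {se:.3f}")
print(f"Implied p:      {p:.3f}")
```

The implied p-value comes out close to 0.009, which is in line with the reported p=0.008 once rounding of the interval is allowed for. A large mismatch in a real paper would be worth a second look.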

Your Statistical Assessment

Main effect size and what it means practically

Statistical significance and sample size

Anything suspicious in the statistics? (Too many outcomes tested? Subgroup findings not pre-registered? Missing data not addressed?)

Self-check

A paper tests 20 outcomes and finds 2 significant at p<0.05. Should you trust these 2 findings?

Yes -- p<0.05 is the standard

Be suspicious -- with 20 tests, you expect about 1 false positive by chance even if no effect is real. Two significant findings out of 20 could be entirely due to multiple testing.

Only if the effect sizes are large

Only if the sample is large

Correct. This is the multiple comparisons problem. At p<0.05 with 20 tests and no true effects, you expect 1 false positive on average (0.05 x 20 = 1). The paper should apply a correction (Bonferroni, FDR) or pre-register the primary outcome. If it does neither, treat the "significant" findings with caution.
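
The arithmetic behind this answer, and the two corrections it mentions, can be sketched in a few lines. The p-values below are invented to mimic the quiz scenario (2 of 20 below 0.05). The false-positive probability assumes independent tests with no true effects, and the Benjamini-Hochberg function is a plain textbook implementation written for illustration, not a replacement for a statistics package.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of false positives if every null is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Assumes independent tests and all null hypotheses true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q: float = 0.05):
    """Return booleans: True where the BH (FDR) procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0  # largest rank k with p_(k) <= (k / m) * q
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values: two "significant" results among 20 outcomes.
p_values = [0.021, 0.048] + [0.20 + 0.04 * i for i in range(18)]

expected = expected_false_positives(20)
at_least_one = prob_at_least_one_false_positive(20)
threshold = bonferroni_threshold(20)
survive_bonferroni = [p for p in p_values if p <= threshold]
survive_bh = [p for p, keep in zip(p_values, benjamini_hochberg(p_values)) if keep]

print(f"Expected false positives:     {expected:.1f}")
print(f"P(at least one false positive): {at_least_one:.2f}")
print(f"Bonferroni threshold:         {threshold:.4f}")
print(f"Survive Bonferroni: {survive_bonferroni}")
print(f"Survive BH (FDR):   {survive_bh}")
```

In this made-up case neither "significant" result survives either correction, which is the pattern that should make a reader cautious.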

Module 4 -- ~25 min

Writing a 1-page critique

A good critique is not negative -- it is honest. It names what the paper does well, where the evidence is strong, and where the claims exceed the evidence. The goal: help your team decide whether and how to use this evidence.

1-page critique structure

Citation and research question (2 lines)
Design summary (3-4 lines) -- what method, what sample, what comparison
Strengths (3-4 bullet points) -- what does this paper do well?
Weaknesses (3-4 bullet points) -- what are the threats to validity?
Claims vs. evidence (2-3 lines) -- do the conclusions match the design?
Relevance to our context (2-3 lines) -- can we use this? What transfers? What does not?
Bottom line (1 sentence) -- "This paper provides [strong/moderate/weak] evidence that [X] because [Y]."
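
If your team keeps critiques in a shared folder or spreadsheet, the structure above can be captured as a small record so every critique has the same fields. This is only a sketch: the class and field names are hypothetical, and the example entry reuses the hypothetical MGNREGA paper from Module 1.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_STRENGTHS = ("strong", "moderate", "weak")


@dataclass
class Critique:
    """Hypothetical container mirroring the seven-part structure above."""
    citation: str
    research_question: str
    design_summary: str
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)
    claims_vs_evidence: str = ""
    relevance: str = ""
    evidence_strength: str = "moderate"  # strong / moderate / weak
    claim: str = ""
    reason: str = ""

    def bottom_line(self) -> str:
        """Fill the one-sentence bottom-line template."""
        if self.evidence_strength not in ALLOWED_STRENGTHS:
            raise ValueError(f"evidence_strength must be one of {ALLOWED_STRENGTHS}")
        return (f"This paper provides {self.evidence_strength} evidence "
                f"that {self.claim} because {self.reason}.")

    def render(self) -> str:
        """Plain-text one-pager in the order listed above."""
        lines = [
            f"Citation: {self.citation}",
            f"Research question: {self.research_question}",
            f"Design summary: {self.design_summary}",
            "Strengths:",
            *[f"  - {s}" for s in self.strengths],
            "Weaknesses:",
            *[f"  - {w}" for w in self.weaknesses],
            f"Claims vs. evidence: {self.claims_vs_evidence}",
            f"Relevance to our context: {self.relevance}",
            f"Bottom line: {self.bottom_line()}",
        ]
        return "\n".join(lines)


example = Critique(
    citation='"Impact of MGNREGA on rural livelihoods in Telangana" (hypothetical)',
    research_question="Does MGNREGA improve rural livelihoods?",
    design_summary="Cross-sectional survey: 500 cardholders vs 300 non-cardholders.",
    strengths=["Comparison group drawn from the same mandals"],
    weaknesses=["No pre-period", "Selection into cardholding not addressed"],
    claims_vs_evidence="Title claims impact; design supports association only.",
    relevance="Useful for describing gaps, not for estimating effects.",
    evidence_strength="weak",
    claim="MGNREGA improves rural livelihoods",
    reason="the cross-sectional comparison does not address selection into cardholding",
)
print(example.render())
```

Forcing the bottom line through a fixed template makes it harder to skip the "because" clause, which is where the design critique has to show up.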

Your 1-Page Critique

Strengths of the paper (3-4 points)

Weaknesses (3-4 points)

Claims vs. evidence -- do conclusions match the design?

Bottom line (one sentence)

Self-check

Your bottom line reads: "The paper is bad because it uses a quasi-experimental design instead of an RCT." Is this a valid critique?

Yes -- RCTs are always better

No -- quasi-experimental designs are valid for many questions. Critique the execution, not the design choice. Ask whether the comparison group is credible given the design.

Depends on the budget

Only if the topic is important enough for an RCT

Correct. Design choice should match the question and constraints. Many important questions cannot be studied with RCTs (ethics, feasibility, cost). A well-executed DiD or RDD can provide strong evidence. Critique the execution within the chosen design, not the choice itself.

Capstone

Your 1-Page Critique

