ImpactMojo · Data Literacy 101 · www.impactmojo.in
ImpactMojo 101 Series · Free Forever
Data Literacy 101
Reading, Questioning & Using Data Responsibly — a Foundational Course for Development Practitioners in South Asia
Research-Backed · South Asia Focus · 100 Slides · Free Access
What We Cover
01 · What Is Data Literacy? (Slides 3–9)
02 · Types & Sources of Data (Slides 10–19)
03 · From Concept to Indicator (Slides 20–28)
04 · Describing Data (Slides 29–38)
05 · Visualising Data (Slides 39–48)
06 · Relationships & Correlation (Slides 49–58)
07 · Sampling & Surveys (Slides 59–68)
08 · Data Quality & Cleaning (Slides 69–77)
09 · Reading Data Critically (Slides 78–86)
10 · Data Ethics, Privacy & Equity (Slides 87–94)
11 · Tools & Further Reading (Slides 95–99)
01
Section One
What Is Data Literacy?
Data literacy is a survival skill
Development work runs on numbers — targets, indicators, budgets, surveys, dashboards. Data literacy is the ability to read, question, interpret and communicate data, and to use it to make better decisions. It is not statistics for its own sake; it is judgement.
Data literacy
The capacity to find, read, interpret, critically evaluate and communicate with data — and to recognise when data is being misused. It sits between raw numbers and good decisions.
You do not need to be a statistician. You need to ask the right questions of any number that lands on your desk.
Decisions you already make with data
Programme decisions
  • Which blocks or wards to prioritise
  • Whether an intervention is working
  • How to set a realistic target
  • Where the budget actually goes
Daily judgement calls
  • Is this survey finding trustworthy?
  • Does this chart mislead?
  • Is the sample like our population?
  • Who is missing from this count?
Every one of these is a data-literacy question before it is a technical one.
Data → Information → Knowledge → Decision
01 · DATA: raw records (1,240 children weighed)
02 · INFORMATION: 18% are underweight
03 · KNOWLEDGE: underweight is concentrated in 3 hamlets
04 · DECISION: target supplementary feeding there
Data literacy is what moves you up the ladder without slipping — each step adds interpretation, and each step can introduce error.
Five habits of a data-literate practitioner
  • Ask where it came from. Who collected it, when, how, and why?
  • Ask what it measures. Is the indicator really capturing the concept?
  • Ask who is missing. Whom does this number leave out?
  • Ask how sure we are. What is the uncertainty, the sample, the error?
  • Ask what decision it serves. Data without a question is noise.
Numbers feel objective. They are not neutral.
Not everything that counts can be counted, and not everything that can be counted counts.
— commonly attributed to William Bruce Cameron
Every dataset embeds choices: what to measure, what category to use, whom to ask, what to ignore. Those choices carry power. A data-literate practitioner reads the choices, not just the digits.
How this course is built
Foundations
  • Types and sources of data
  • Turning concepts into indicators
  • Describing and visualising data
Judgement
  • Correlation, sampling and surveys
  • Data quality and critical reading
  • Ethics, privacy and equity
Throughout, examples come from India and the wider region — the data you will actually meet at work.
02
Section Two
Types & Sources of Data
Quantitative and qualitative
Quantitative
Numbers and counts — how many, how much, how often. Strong for measuring scale, comparing groups, tracking change.
Qualitative
Words, meanings, experiences — why, how, in what context. Strong for understanding process, mechanism and lived reality.
They are partners, not rivals. Numbers tell you that something changed; stories tell you why. The best evidence usually uses both.
Primary vs secondary data
Aspect | Primary | Secondary
Source | You collect it | Someone else collected it
Example | Your baseline survey, FGDs | Census, NFHS, district HMIS
Control | High (you design it) | Low (you take it as given)
Cost / time | High | Usually low
Risk | Fieldwork error, bias | May not fit your question
Rule of thumb: exhaust good secondary data before collecting primary data. Much of what you need already exists — and re-collecting it wastes respondents' time.
Structured, semi-structured, unstructured
Structured
Neat rows & columns — survey tables, registers, spreadsheets
Semi
Some structure — forms with open text, tagged records, JSON
Unstructured
Free text, audio, images, video, field notes
Most development M&E lives in structured data, but a growing share — call-centre logs, photos, social media, satellite imagery — is unstructured and needs different tools.
Cross-section, time series, panel
  • Cross-section: many units at one time (one NFHS round)
  • Time series: one unit over time (national TFR, 1990–2024)
  • Panel / longitudinal: same units tracked over time (a cohort re-surveyed every year)
Panel data is powerful — it can follow the same household as it changes, separating real change from differences between households.
India's official data ecosystem
Source | What it covers | Frequency
Census of India | Every person: population, literacy, housing, migration | Decennial (2011 latest)
NFHS | Health, nutrition, fertility, anaemia, women's status | ~5 years (NFHS-5: 2019–21)
NSS / PLFS | Consumption, employment, unemployment | PLFS annual since 2017–18
SRS | Birth & death rates, infant mortality, life expectancy | Annual
HMIS | Facility-level health service delivery | Monthly
SECC 2011 | Socio-economic & caste deprivation indicators | One-off (2011)
Know these by name. Most of your secondary-data needs are met by one of them — free and downloadable.
NFHS, NSS and the Census do different jobs
Census = everyone
Counts every person. Best for small-area detail (a village, a ward). Expensive, so it is rare.
Surveys = a sample
NFHS & NSS interview a carefully chosen sample and infer the whole. Cheaper, frequent — but only reliable down to the level they were designed for (usually state or district).
Common error: using a state-level survey estimate to make claims about a single block. The sample was never designed to say anything that local.
How big are these datasets?
1.21 bn people enumerated in Census 2011 (source: Census of India 2011)
~636,000 households interviewed in NFHS-5 (source: NFHS-5, 2019–21)
707 districts covered by NFHS-5 (source: IIPS / MoHFW)
These are among the largest demographic and health surveys in the world. Their size is what lets them speak reliably about districts — but not about your single panchayat.
The data your programme already generates
Every scheme produces administrative data as a by-product of delivery: MGNREGA muster rolls, school enrolment (UDISE+), health records (HMIS), ration transactions, immunisation registers.
Strengths
  • Continuous, cheap, already collected
  • Universal coverage of beneficiaries
  • Near-real-time monitoring
Watch-outs
  • Records who is served, not who is missed
  • Incentives to over- or under-report
  • Gaps, duplicates, stale entries
Big data and digital traces
Mobile-phone records, satellite night-lights, transaction logs and remote sensing increasingly supplement official statistics — useful where surveys are slow or coverage is thin.
But digital traces over-represent the connected and under-represent the poor, women, the elderly and remote areas. Big data can deepen exclusion if read uncritically.
03
Section Three
From Concept to Indicator
You cannot measure 'wellbeing' directly
Most things we care about — poverty, empowerment, health, learning — are concepts, not numbers. Measurement is the bridge from an abstract concept to an observable indicator.
01 · CONCEPT: women's empowerment
02 · DIMENSIONS: mobility, decision-making, assets
03 · INDICATORS: % who can visit a health centre alone
04 · DATA: survey responses
Indicators, defined
Indicator
An observable, measurable marker that stands in for something we cannot observe directly. A good indicator is a faithful proxy for the concept — no more, no less.
Operationalisation
The precise rule that turns a concept into a measurement: exactly what to count, for whom, over what period, in what units.
Four kinds of variable
Level | Meaning | Example | Valid maths
Nominal | Labels, no order | District, caste, religion | Counts, mode
Ordinal | Ordered, unequal gaps | Wealth quintile, Likert scale | Median, rank
Interval | Equal gaps, no true zero | Temperature (°C), calendar year | Mean, difference
Ratio | Equal gaps, true zero | Income, age, children ever born | All, ratios
Why it matters: you cannot take a meaningful average of caste categories, and a wealth quintile is a rank, not a rupee amount. The level decides which statistics are legal.
What makes an indicator trustworthy?
Valid
Measures what it claims to measure
Reliable
Gives the same answer on repeat measurement
Sensitive
Moves when the real thing moves
Feasible
Can actually be collected, affordably
Accurate is not the same as consistent
Reliable, not valid
A miscalibrated weighing scale: it reads 2 kg high every time. Perfectly consistent — consistently wrong.
Valid and reliable
A calibrated scale: same answer each time, and the right answer. This is the target.
You can have reliability without validity, but never validity without reliability. Check both.
When you measure A to learn about B
A proxy stands in for something hard to measure. Household assets proxy for wealth; night-light intensity proxies for economic activity; mid-upper-arm circumference proxies for acute malnutrition.
Every proxy leaks. Asset indices miss debt; night-lights miss the informal economy. Name the gap between your proxy and the concept — and report it.
Bundling many indicators into one number
Indices like the Human Development Index or the Multidimensional Poverty Index (MPI) combine several indicators into a single score for easy comparison.
Upside
One memorable number; ranks and headlines; captures several dimensions at once.
Downside
Weights are value judgements; aggregation hides trade-offs; a good score can mask a terrible component.
India's National MPI
NITI Aayog's National MPI bundles 12 indicators across health, education and standard of living — nutrition, child mortality, schooling, cooking fuel, sanitation, housing, assets and more.
12 indicators in 3 dimensions (source: NITI Aayog National MPI)
Headcount × Intensity: MPI = share who are poor × how deeply poor they are
Notice the design choice: a household is 'MPI poor' if deprived in a weighted third or more of indicators. Change that threshold and the poverty rate changes.
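The headcount-times-intensity arithmetic can be sketched in a few lines. The deprivation scores below are made-up values for ten hypothetical households, not National MPI data; only the one-third cutoff comes from the slide.

```python
# Hypothetical weighted deprivation scores (0 = no deprivation, 1 = deprived in everything).
scores = [0.0, 0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.6, 0.75]

def mpi(scores, cutoff=1 / 3):
    """Return (headcount H, intensity A, MPI = H * A) for a given poverty cutoff."""
    poor = [s for s in scores if s >= cutoff]
    if not poor:
        return 0.0, 0.0, 0.0
    headcount = len(poor) / len(scores)   # share who are poor
    intensity = sum(poor) / len(poor)     # average deprivation among the poor
    return headcount, intensity, headcount * intensity

h, a, index = mpi(scores)                 # cutoff of one third, as on the slide
h_strict, _, _ = mpi(scores, cutoff=0.5)  # a stricter cutoff, purely for illustration
```

With these made-up scores, moving the cutoff from one third to one half lowers the headcount from 50% to 30% without any household changing.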
04
Section Four
Describing Data
Where is the centre? How spread out?
Before any fancy analysis, describe the data. Two questions answer most of it: what is typical (central tendency) and how much do values vary (dispersion).
01 · CENTRE: mean, median, mode
02 · SPREAD: range, IQR, standard deviation
03 · SHAPE: skew, peaks, outliers
Mean, median and mode
Measure | What it is | Best when
Mean | Arithmetic average | Roughly symmetric data, no wild outliers
Median | Middle value when sorted | Skewed data: income, land, wealth
Mode | Most frequent value | Categories: commonest crop, caste, response
For money — income, consumption, landholding — prefer the median. A few crorepatis drag the mean far above what a typical household actually has.
Mean vs median: the same village, two stories
[Chart: monthly income of 11 households (₹000s). Illustrative example.]
Median = ₹12k (typical household). Mean = ₹31k, pulled up by one rich household. Report the mean here and you describe a village that does not exist.
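A quick way to see the gap is to compute both figures. The eleven incomes below are invented so that they reproduce the slide's median and mean; the chart's actual values are not available here.

```python
from statistics import mean, median

# Hypothetical monthly incomes (₹000s) for 11 households, chosen to match the slide's figures.
incomes = [6, 8, 9, 10, 11, 12, 13, 14, 15, 18, 225]

typical = median(incomes)   # middle value once sorted
average = mean(incomes)     # pulled up by the single rich household
```

Ten of the eleven households earn less than the mean, which is why the median is the better summary here.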
Spread: range, IQR and standard deviation
  • Range: max − min. Simple, but one outlier wrecks it.
  • IQR (interquartile range): the middle 50% — robust to outliers.
  • Standard deviation: typical distance from the mean. The everyday measure of variability.
Two districts can share the same average income yet feel completely different — one equal, one polarised. The mean hides that; the spread reveals it.
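As a rough illustration, the sketch below compares two invented districts with the same mean income. The village figures are assumptions made up for this example.

```python
from statistics import mean, pstdev

# Hypothetical average incomes (₹000s) for five villages in each of two districts.
equal_district = [18, 19, 20, 21, 22]
polarised_district = [2, 4, 20, 36, 38]

for name, values in [("equal", equal_district), ("polarised", polarised_district)]:
    value_range = max(values) - min(values)      # max minus min
    sd = pstdev(values)                          # population standard deviation
    print(name, "mean:", mean(values), "range:", value_range, "SD:", round(sd, 1))
```

Both districts report a mean of 20, yet the range and standard deviation of the second are many times larger.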
Percentiles, quartiles and quintiles
A percentile is the value below which a given share of cases fall. The 25th percentile (Q1) has a quarter of households below it. Wealth quintiles — five 20% bands — are how NFHS and NSS routinely report inequality.
Q1–Q5: quintiles, five equal-size groups
p50: the 50th percentile is the median
p90/p10: a common inequality ratio
The shape of the data matters
[Chart: symmetric vs right-skewed distributions. Illustrative.]
Income, landholding and firm size are almost always right-skewed — a long tail of large values. That is exactly when the mean misleads.
The bell curve and the 68–95–99.7 rule
Many natural measurements (height, birth weight, measurement error) follow a roughly normal distribution — symmetric, bell-shaped, defined by its mean and standard deviation (SD).
68% of values fall within 1 SD of the mean
95% fall within 2 SD
99.7% fall within 3 SD
But do not assume normality. Development data — income, expenditure, programme size — is usually skewed, so the rule does not apply. Always look at the shape first.
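For data that really is normal, the three percentages can be checked with Python's standard library. This is only a sketch of the rule itself; it says nothing about whether a given dataset is normal.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
```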
Outliers: error, or the most important case?
Could be an error
A 9-foot-tall respondent, a household of 80, an income of ₹0 — check for data-entry slips before analysing.
Could be the story
The one block with triple the dropout rate may be precisely where the programme is needed. Do not delete it — investigate it.
Never silently drop outliers. Flag them, explain them, and decide transparently.
Always ask: out of how many?
A raw count without its denominator is almost meaningless. '500 dropouts' could be a crisis or a triumph depending on whether the base is 600 or 60,000.
Count
How many? (numerator alone)
Rate
How many out of how many? (numerator ÷ denominator)
The denominator is where most honest comparison lives — per 1,000 people, per eligible child, per year. Demand it.
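The 500-dropouts example can be worked through directly. The function name is a hypothetical helper; the counts are the ones quoted on the slide.

```python
def rate_per_1000(count, denominator):
    """Express a count as a rate per 1,000 of the base population."""
    return 1000 * count / denominator

small_base = rate_per_1000(500, 600)      # 500 dropouts out of 600 enrolled
large_base = rate_per_1000(500, 60_000)   # 500 dropouts out of 60,000 enrolled
```

The same count is roughly 833 per 1,000 in the first case and about 8 per 1,000 in the second.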
05
Section Five
Visualising Data
A good chart is an argument you can see
Visualisation is not decoration. A well-made chart reveals patterns the table hides — trends, gaps, outliers, relationships — and lets a busy decision-maker grasp them in seconds.
The greatest value of a picture is when it forces us to notice what we never expected to see.
— John Tukey, pioneer of exploratory data analysis
Pick the chart for the question
You want to show… | Use | Avoid
Change over time | Line chart | Pie chart
Comparison across categories | Bar chart | 3-D anything
Composition / shares of a whole | Stacked bar (or 1 pie, few slices) | Many pies
Relationship between two variables | Scatter plot | Dual-axis tricks
Distribution of one variable | Histogram / box plot | Single average
Geographic pattern | Choropleth map | Map coloured by raw counts
What every honest chart needs
  • A clear title that states the takeaway, not just the topic
  • Labelled axes with units
  • An honest baseline — usually zero for bar charts
  • A source and a date
  • A note on the denominator and any exclusions
The truncated axis
[Charts, side by side. Left: Y-axis starts at 90 (misleading). Right: Y-axis starts at 0 (honest). Illustrative.]
Same data. The left chart makes a 4-point rise look like a tripling. Truncating the axis is the most common way charts lie.
More ink is not more information
Edward Tufte's principle: maximise the data-ink ratio. Every gradient, shadow, 3-D effect and clip-art icon competes with the data for attention — and usually wins.
  • No 3-D bars — they distort the very lengths you are comparing
  • No pie charts with 8 slices — the eye cannot rank angles
  • No rainbow palettes — colour should carry meaning, not noise
Use colour with intent — and for everyone
Good colour
  • Sequential for ordered data (light→dark)
  • One accent colour to highlight the point
  • Consistent meaning across charts
Accessibility
  • ~8% of men have colour-vision deficiency
  • Never rely on red-vs-green alone
  • Add labels, patterns or direct text
Repeat a small chart to compare many
Instead of cramming ten states into one tangled line chart, draw ten tiny identical charts side by side — small multiples. The eye compares shapes effortlessly when scale and layout are shared.
Rule: when a single chart gets crowded, split it into a grid of small, identical ones rather than adding more colours.
Sometimes a table beats a chart
  • Use a table for exact values people will look up or quote
  • Use a chart for pattern, trend and comparison
  • Right-align numbers, fix decimal places, and add row/column totals
  • A heat-shaded table can do both — precise and patterned
Before you publish a chart, ask…
  • Does the title state the finding?
  • Is the baseline honest (zero where it should be)?
  • Are axes, units and the denominator labelled?
  • Could a colour-blind reader read it?
  • Is the source and date on the chart?
  • Have I removed everything that is not data?
06
Section Six
Relationships & Correlation
Do two things move together?
Correlation
A measure of how strongly two variables move together. Positive: both rise together. Negative: one rises as the other falls. Zero: no linear relationship.
The correlation coefficient r runs from −1 (perfect negative) through 0 (none) to +1 (perfect positive).
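A minimal sketch of how r is computed, written out by hand so each step is visible. The literacy and fertility figures are invented for six hypothetical states and are not Census or NFHS values.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-movement of x and y, scaled to lie between -1 and +1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

# Hypothetical female literacy (%) and total fertility rate for six made-up states.
literacy = [55, 60, 65, 72, 80, 92]
fertility = [3.4, 3.0, 2.7, 2.3, 1.9, 1.6]

r = pearson_r(literacy, fertility)   # strongly negative for these made-up values
```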
Female literacy and fertility across Indian states
[Scatter plot: female literacy (%) vs total fertility rate, major states. Illustrative, patterned on Census 2011 & NFHS-5.]
A clear negative correlation: states with higher female literacy tend to have lower fertility. But does literacy cause lower fertility? Hold that thought.
Correlation is not causation
Two variables can move together for several reasons, only one of which is 'A causes B'.
  • Reverse causation: B might cause A
  • Confounding: a third factor C drives both
  • Selection: the sample was chosen in a way that creates the link
  • Chance: with enough variables, some correlate by luck
The lurking third variable
Ice-cream sales correlate with drowning deaths. Ice cream does not cause drowning — summer heat drives both. The confounder is the real story.
01 · Hot weather (confounder C)
02 · drives ice-cream sales (A)
03 · AND drives swimming & drowning (B)
04 · so A and B correlate, with no causal link
Patterns appear in pure noise
Test enough pairs of unrelated variables and some will correlate strongly by sheer chance. A tight correlation is a clue, never a proof.
Before believing a correlation, ask: is there a plausible mechanism? Could a confounder explain it? Does it survive in other data?
Never trust the number without the picture
Anscombe's quartet is four datasets with nearly identical means, variances and correlation (r ≈ 0.82), yet utterly different shapes: one roughly linear, one curved, one linear with a single outlier, and one where a single extreme point drives the whole correlation.
The lesson, proven in 1973 and true today: always plot your data. Summary statistics alone can hide the truth.
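The quartet is small enough to check by hand. The values below are the commonly reproduced figures from Anscombe's 1973 paper, typed in here rather than loaded from a library, so treat them as an assumption worth verifying against the original.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Same summary numbers for all four sets; only a plot reveals how different they are.
correlations = {name: pearson_r(x, y) for name, (x, y) in quartet.items()}
y_means = {name: sum(y) / len(y) for name, (_, y) in quartet.items()}
```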
Group patterns ≠ individual truths
Ecological fallacy
Wrongly inferring something about individuals from a pattern seen only at the group level.
A district with higher average income may have higher literacy on average — that does not mean the richer individuals within it are the literate ones. What holds for districts need not hold for people.
A trend can reverse when you split the data
Simpson's paradox: a relationship visible in the whole dataset can flip when you break it into subgroups. A scheme can look worse overall yet be better in every region — if the regions differ in size and baseline.
Always disaggregate before concluding. The aggregate can point the opposite way to the truth.
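A small invented example shows how the reversal arises. The counts are hypothetical: the scheme works mostly in a hard region with a low baseline, while the comparison group sits mostly in an easy one.

```python
# Hypothetical (successes, participants) for the scheme and a comparison group, by region.
regions = {
    "easy region": {"scheme": (80, 100), "comparison": (700, 900)},
    "hard region": {"scheme": (270, 900), "comparison": (25, 100)},
}

def rate(successes, participants):
    return successes / participants

# Success rates within each region: (scheme, comparison).
by_region = {
    name: (rate(*groups["scheme"]), rate(*groups["comparison"]))
    for name, groups in regions.items()
}

def pooled_rate(group):
    """Success rate after lumping both regions together."""
    successes = sum(regions[name][group][0] for name in regions)
    participants = sum(regions[name][group][1] for name in regions)
    return successes / participants

overall_scheme = pooled_rate("scheme")          # 350 of 1,000
overall_comparison = pooled_rate("comparison")  # 725 of 1,000
```

The scheme beats the comparison group inside each region, yet looks far worse once the regions are pooled, because most of its participants are in the hard region.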
Extreme results drift back to average
When you pick the worst-performing districts and re-measure them later, they usually look better — even with no intervention. Extreme values contain extra luck that does not repeat. This is regression to the mean.
It is a notorious trap in evaluation: target the bottom 10% of schools, see them improve, and credit your programme — when much of the gain would have happened anyway. A comparison group is the cure.
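A short simulation makes the effect visible. Everything here is assumed for illustration: 1,000 hypothetical districts, each with a stable underlying score plus independent luck in each measurement round, and no intervention of any kind.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Stable "true" score per district, plus fresh luck in each round.
true_scores = [random.gauss(50, 5) for _ in range(1000)]
round_1 = [t + random.gauss(0, 10) for t in true_scores]
round_2 = [t + random.gauss(0, 10) for t in true_scores]

# Select the worst 10% on round 1 and look at the same districts in round 2.
worst = sorted(range(1000), key=lambda i: round_1[i])[:100]
before = sum(round_1[i] for i in worst) / len(worst)
after = sum(round_2[i] for i in worst) / len(worst)
```

The selected districts score noticeably higher in round 2 even though nothing was done to them, which is the gain a comparison group would net out.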
07
Section Seven
Sampling & Surveys
You rarely need to ask everyone
A well-chosen sample of a few thousand can describe a population of millions — the principle behind NFHS, NSS and every opinion poll. The magic is not size; it is representativeness.
Representative sample
A sample whose composition mirrors the population on the characteristics that matter, so findings can be generalised back to the whole.
Define the population first
01 · TARGET POPULATION: whom you want to learn about
02 · SAMPLING FRAME: the list you can actually draw from
03 · SAMPLE: who you end up measuring
04 · RESPONDENTS: who actually answers
Each gap — frame missing people, non-response, refusals — is a place bias creeps in. The frame is often the weakest link: a list of phone owners is not a list of citizens.
Let chance choose — it removes bias
Method | How | Use when
Simple random | Every unit equal chance | You have a full list
Systematic | Every k-th unit from a list | Ordered list, no hidden cycle
Stratified | Split into groups, sample each | You must represent subgroups
Cluster | Sample whole groups (villages) | People are geographically spread
Multistage | Clusters, then units within | Large national surveys (NFHS)
Only probability sampling lets you calculate a margin of error and generalise honestly.
Convenient, but you cannot generalise
  • Convenience: whoever is easy to reach — the people at the camp
  • Purposive: hand-picked for a reason — key informants
  • Snowball: respondents refer others — hidden populations
  • Quota: fill fixed counts per group, but non-randomly
These are legitimate for qualitative depth and hard-to-reach groups — but you cannot attach a margin of error or claim population-level numbers from them.
How many do I actually need?
[Chart: margin of error vs sample size (95% confidence, p = 0.5). Source: standard sampling theory.]
Note that the curve flattens: ~1,067 interviews give ±3%, but halving the error to ±1.5% needs ~4,270, four times as many. Precision gets expensive fast.
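The figures behind the curve come from the standard formula for a simple random sample, n = z² · p(1 − p) / e². The sketch below assumes 95% confidence (z = 1.96) and p = 0.5, as in the chart caption; real surveys with clustering need more.

```python
from math import sqrt

Z_95 = 1.96  # z-value for 95% confidence

def sample_size(margin, p=0.5, z=Z_95):
    """Interviews needed for a given margin of error (simple random sample, large population)."""
    return z ** 2 * p * (1 - p) / margin ** 2

def margin_of_error(n, p=0.5, z=Z_95):
    """Margin of error achieved by a simple random sample of size n."""
    return z * sqrt(p * (1 - p) / n)

n_3_percent = sample_size(0.03)     # about 1,067
n_1_5_percent = sample_size(0.015)  # about 4,268
```

Because the margin sits squared in the denominator, halving the error always quadruples the sample.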
It's the sample size, not the fraction
A counter-intuitive truth: for a large population, accuracy depends on the absolute sample size, not the share of the population sampled. 1,500 people describe a state and a country about equally well.
This is why a national survey of ~600,000 households can speak about all of India — and why your block of 2,000 households still needs a few hundred interviews, not twenty.
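One way to see this is the finite population correction, which shrinks the required sample only when the population is small. The ±5% target and the 50 million state population below are assumptions chosen for illustration; the 2,000-household block is from the slide.

```python
def base_sample_size(margin, p=0.5, z=1.96):
    """Sample size for a very large population (simple random sample)."""
    return z ** 2 * p * (1 - p) / margin ** 2

def adjusted_sample_size(n_large, population):
    """Apply the finite population correction for a population of known size."""
    return n_large / (1 + (n_large - 1) / population)

n_large = base_sample_size(0.05)                       # about 384 for ±5%
block = adjusted_sample_size(n_large, 2_000)           # block of 2,000 households
state = adjusted_sample_size(n_large, 50_000_000)      # hypothetical large state
```

The block still needs a little over 320 interviews, while the state needs about 384: a 25,000-fold difference in population changes the sample by under a fifth.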
The errors that size cannot fix
  • Selection bias: the frame or method systematically misses people
  • Non-response bias: those who refuse differ from those who answer
  • Survivorship bias: you only see who remained (drop-outs vanish)
  • Social-desirability bias: people answer how they think they should
A bigger biased sample is just a more confident wrong answer. Size fixes noise, never bias.
Why survey results come 'weighted'
When some groups are deliberately over-sampled (to study them reliably) or respond less, surveys apply weights so each respondent represents the right number of real people.
Practical warning: using NFHS or PLFS unit data without the survey weights gives wrong totals. Always weight when the documentation says to.
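A toy example of what goes wrong without weights. The numbers are invented: a group that is 10% of the population is deliberately given half of the interviews, and the weight is the number of real people each respondent stands for.

```python
# Hypothetical sample: (group, respondents, number answering "yes", weight per respondent).
groups = [
    ("over-sampled group", 100, 60, 100),   # represents 10,000 people
    ("rest of population", 100, 30, 900),   # represents 90,000 people
]

total_respondents = sum(n for _, n, _, _ in groups)
unweighted = sum(yes for _, _, yes, _ in groups) / total_respondents

weighted_yes = sum(yes * weight for _, _, yes, weight in groups)
weighted_total = sum(n * weight for _, n, _, weight in groups)
weighted = weighted_yes / weighted_total
```

The unweighted share is 45%; the weighted share, which reflects the real population mix, is 33%.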
A survey is only as good as its questions
  • Avoid leading questions ('Don't you agree that…?')
  • Avoid double-barrelled ones ('clean and safe?' — which one?)
  • Use language and units respondents actually use locally
  • Pilot every instrument before the real round — always
08
Section Eight
Data Quality & Cleaning
Most data work is cleaning
Analysts often spend the majority of a project just preparing data — finding errors, reconciling formats, handling gaps. Glamorous analysis sits on a large, unglamorous foundation of cleaning.
Garbage in, garbage out. No model rescues bad data.
— computing proverb, truer than ever
What 'dirty' data looks like
Problem | Example | Risk
Missing values | Blank income field | Biased averages if not random
Duplicates | Same beneficiary twice | Inflated counts
Inconsistent codes | 'F' / 'Female' / '2' | Broken grouping
Outliers / impossible | Age = 200, −5 children | Distorted statistics
Format drift | DD/MM vs MM/DD dates | Silent miscalculation
Typos in keys | Misspelt village name | Failed merges
Why values are missing matters more than how many
  • Missing at random: gaps unrelated to the value — least harmful
  • Missing not at random: the richest refuse to state income — this biases results
  • Dropping rows with gaps can quietly delete the very people you care about
Before deleting or filling missing values, ask why they are missing. The pattern of absence is itself data.
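A tiny invented example of the second case: the two richest households decline to state their income, and the rows with gaps are then dropped.

```python
from statistics import mean

# Hypothetical true monthly incomes (₹000s) for seven households.
true_incomes = [8, 10, 12, 14, 16, 40, 60]

# Assumption for this sketch: households earning 40 or more refuse to answer.
reported = [income if income < 40 else None for income in true_incomes]

complete_cases = [income for income in reported if income is not None]
mean_after_dropping = mean(complete_cases)   # 12
true_mean = mean(true_incomes)               # about 22.9
share_missing = reported.count(None) / len(reported)
```

Under a third of the values are missing, yet the average falls by almost half, because the gaps are concentrated at the top.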
A repeatable cleaning workflow
01 · INSPECT: look at every column's range & uniques
02 · VALIDATE: rules (age 0–120, % in 0–100)
03 · FIX: standardise codes, dates, units
04 · DOCUMENT: log every change
05 · FREEZE: keep raw data untouched
Golden rule: never edit the raw file. Clean in a script or a copy so every change is reversible and visible.
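The FIX, DOCUMENT and FREEZE steps might look like the sketch below. The field name `sex` and the codebook mapping (for instance that '2' means female) are assumptions for illustration; a real survey's codebook decides these.

```python
import copy

# Hypothetical codebook: every variant seen in the field mapped to one standard code.
SEX_CODES = {"f": "F", "female": "F", "2": "F", "m": "M", "male": "M", "1": "M"}

def clean(raw_rows):
    """Standardise the 'sex' field on a copy, logging every change made."""
    rows = copy.deepcopy(raw_rows)   # FREEZE: the raw rows are never edited
    change_log = []
    for index, row in enumerate(rows):
        original = row["sex"]
        standard = SEX_CODES.get(str(original).strip().lower())
        if standard is None:
            change_log.append((index, "sex", original, "unrecognised, left as is"))
        elif standard != original:
            row["sex"] = standard                                   # FIX
            change_log.append((index, "sex", original, standard))   # DOCUMENT
    return rows, change_log

raw = [{"id": 1, "sex": "F"}, {"id": 2, "sex": "Female"}, {"id": 3, "sex": "2"}, {"id": 4, "sex": "M"}]
cleaned, log = clean(raw)
```

Keeping the log as data, rather than in someone's memory, is what makes the cleaning auditable later.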
If you can't redo it, you can't trust it
Fragile
Manual edits in a spreadsheet, no record of what changed. Next month, nobody can reproduce the number — including you.
Robust
A documented script from raw to result. Re-run it any time, audit every step, hand it to a colleague.
Build checks in, don't hope
  • Range checks: can this value exist at all?
  • Logic checks: a 6-year-old cannot be married with children
  • Cross-checks: do parts sum to the reported total?
  • Sense checks: does the headline number pass the smell test?
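The first three kinds of check can be written as explicit rules. The field names and the age-10 threshold in the logic check are hypothetical choices for this sketch; sense checks still need a human.

```python
def check_record(record):
    """Return a list of problems found in one household-member record."""
    problems = []
    # Range check: can this value exist at all?
    if not 0 <= record["age"] <= 120:
        problems.append("range: age outside 0-120")
    # Logic check: combinations that cannot both be true.
    if record["age"] < 10 and record["married"]:
        problems.append("logic: young child recorded as married")
    # Cross-check: do the parts sum to the reported total?
    if record["boys"] + record["girls"] != record["children_total"]:
        problems.append("cross: boys + girls does not equal children_total")
    return problems

good = {"age": 34, "married": True, "boys": 1, "girls": 2, "children_total": 3}
bad = {"age": 6, "married": True, "boys": 1, "girls": 1, "children_total": 3}
impossible = {"age": 200, "married": False, "boys": 0, "girls": 0, "children_total": 0}

results = [check_record(r) for r in (good, bad, impossible)]
```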
Document so future-you can understand
Metadata is data about your data: what each variable means, its units, allowed values, how and when it was collected, and what you changed.
A dataset without a data dictionary is a puzzle with no key. You will forget your own coding far sooner than you expect.
Keep the trail
  • Keep the raw extract read-only and dated
  • Name files with versions and dates, not 'final_FINAL_v3'
  • Record the source, download date and any filters applied
  • Save the cleaning script alongside the data
09
Section Nine
Reading Data Critically
Every estimate has a range
A survey figure of 42% is shorthand for 'about 42%, give or take'. The confidence interval — say 39–45% — is the honest version. A point estimate without its range overstates certainty.
If two groups' confidence intervals overlap heavily, a difference between them may be noise, not signal. Look for the range, not just the dot.
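A rough sketch of where a range like 39–45% comes from. It assumes a simple random sample of 1,000 respondents, a figure chosen for illustration; clustered surveys such as NFHS have wider intervals than this formula gives.

```python
from math import sqrt

def proportion_ci(p, n, z=1.96):
    """Approximate 95% confidence interval for a proportion from a simple random sample."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = proportion_ci(0.42, 1000)   # hypothetical n = 1,000
```

With these assumptions the interval runs from about 39% to about 45%, matching the range quoted above.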
'Significant' has a narrow technical meaning
Statistical significance
A result unlikely to have arisen by chance alone if there were truly no effect. It says nothing about whether the effect is large or important.
Statistically significant ≠ practically important. With a huge sample, a trivially small difference can be 'significant'. Always ask: how big is the effect, and does it matter?
The base-rate trap
A 99%-accurate test for a rare condition can produce mostly false positives, because the healthy vastly outnumber the sick: at a prevalence of 1 in 1,000, roughly nine in ten positive results are false. Ignoring the underlying rate is one of the commonest reasoning errors.
Whenever you read 'X% accurate', ask how common the thing is to begin with. The base rate changes everything.
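The arithmetic is short enough to check. This sketch reads "99% accurate" as 99% sensitivity and 99% specificity, which is an assumption; the two prevalence values are illustrative.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of positive test results that are true positives."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

rare = positive_predictive_value(0.001, 0.99, 0.99)    # condition affects 1 in 1,000
common = positive_predictive_value(0.20, 0.99, 0.99)   # condition affects 1 in 5
```

For the rare condition only about 9% of positives are real; for the common one about 96% are, with exactly the same test.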
Percent vs percentage points
If coverage rises from 40% to 50%, that is a 10 percentage-point increase — but a 25 percent increase. Mixing the two is a classic way to exaggerate or hide change.
+10 pp: percentage-point change (50 − 40)
+25%: relative change (10 ÷ 40)
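Both figures can be reported from the same two numbers. The function name is a hypothetical helper.

```python
def describe_change(before_pct, after_pct):
    """Return (percentage-point change, relative change in percent)."""
    point_change = after_pct - before_pct
    relative_change = 100 * point_change / before_pct
    return point_change, relative_change

points, relative = describe_change(40, 50)   # coverage rising from 40% to 50%
```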
'Doubled' can hide tiny numbers
'Cases doubled!' sounds alarming — but a rise from 2 to 4 is a doubling of almost nothing. Relative change without the absolute base is designed to impress, not inform.
Always pair the relative figure with the raw counts. '100% increase, from 2 to 4 cases' tells the honest story.
Beware the chosen baseline
  • Start the time axis at a low year to exaggerate growth
  • Quote the one indicator that improved, ignore the rest
  • Compare to an unusual reference period (a drought, a peak)
  • Report only the subgroup that helps the argument
Ask: why this start date, this indicator, this comparison? What is left out?
Test enough things and something 'works'
If you slice the data twenty ways, roughly one slice will show a 'significant' result by chance at the usual 5% threshold. Reporting only that slice (p-hacking) manufactures false findings.
Trust analyses that were specified before seeing the data, and findings that replicate. Be wary of a single surprising subgroup result.
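The "one in twenty" arithmetic, under two stated assumptions: a 5% significance threshold, and twenty independent tests where no real effect exists.

```python
ALPHA = 0.05      # conventional significance threshold
N_TESTS = 20      # number of slices tested

expected_false_positives = ALPHA * N_TESTS              # on average, one spurious "finding"
chance_of_at_least_one = 1 - (1 - ALPHA) ** N_TESTS     # probability of one or more
```

Under these assumptions there is roughly a 64% chance that at least one slice looks significant purely by luck.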
Eight questions for any statistic
  • Who produced it, and what is their interest?
  • How was it measured — and what is the denominator?
  • Is it a sample? How big, how chosen?
  • What is the uncertainty / margin of error?
  • Percent or percentage points? Relative or absolute?
  • Who is missing from the count?
  • Correlation, or genuine causation?
  • Does it pass the common-sense smell test?
10
Section Ten
Data Ethics, Privacy & Equity
Every data point is a person
In development data, rows are people — often poor, often without power over how their information is used. Data ethics is not paperwork; it is respect made operational.
Data are not just numbers; they are people reduced to numbers. The reduction is never neutral.
— a principle of feminist data practice
Informed consent is the floor
  • People should know what is collected and why
  • How it will be used, stored and shared — and for how long
  • That they can refuse or stop, with no penalty
  • Consent must be in a language and form they genuinely understand
A thumbprint on a form nobody explained is not consent. For children and other vulnerable groups, extra safeguards apply.
Anonymisation is harder than deleting names
Removing names is not enough. A combination of village + age + caste + occupation can re-identify one person — especially in small areas where few people share those traits.
Direct IDs (name, Aadhaar, phone): remove
Quasi-IDs (age + place + caste can re-identify): aggregate or coarsen
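A rough way to check re-identification risk is to count how many records are the only one with their combination of quasi-identifiers. The sketch below uses hypothetical records and hypothetical coarsening rules (10-year age bands, village dropped). Requiring every combination to be shared by at least k records is the idea usually called k-anonymity.

```python
from collections import Counter

# Hypothetical records with names already removed
records = [
    {"village": "A", "age": 34, "occupation": "farmer"},
    {"village": "A", "age": 36, "occupation": "farmer"},
    {"village": "A", "age": 61, "occupation": "teacher"},
    {"village": "B", "age": 33, "occupation": "farmer"},
    {"village": "B", "age": 38, "occupation": "farmer"},
]

def exact(r):
    """Quasi-identifiers as collected."""
    return (r["village"], r["age"], r["occupation"])

def coarsened(r):
    """Coarsen: 10-year age bands, and drop the village."""
    band = r["age"] // 10 * 10
    return (f"{band}-{band + 9}", r["occupation"])

def unique_records(records, key_func):
    """Count records whose quasi-identifier combination appears only once."""
    counts = Counter(key_func(r) for r in records)
    return sum(1 for r in records if counts[key_func(r)] == 1)

print("Unique as collected:", unique_records(records, exact))
print("Unique after coarsening:", unique_records(records, coarsened))
```

In this toy data coarsening cuts the unique records from five to one. The one that remains (the only teacher) shows why rare combinations may need to be suppressed or aggregated further.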
India's Digital Personal Data Protection Act, 2023
The DPDP Act, 2023 is India's first comprehensive data-protection law. It sets duties for anyone handling personal digital data — including NGOs and researchers.
  • Collect only what you need, for a stated purpose (purpose limitation)
  • Obtain free, informed, specific consent
  • Protect data with reasonable security safeguards
  • Stronger protections for children's data
Know your obligations before you collect. 'We're a small NGO' is not an exemption.
What you don't disaggregate, you can't see
An average hides the people behind it. A programme can report good overall numbers while failing Dalit, Adivasi, disabled, or women beneficiaries. Disaggregation is how inequity becomes visible.
[Chart: an overall average can mask large gaps between groups (illustrative)]
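The same point in numbers, as a small sketch. The groups, counts and outcomes are hypothetical and chosen only to show how a healthy-looking overall rate can sit on top of a large gap.

```python
# Hypothetical programme records: (group, completed the programme?)
records = (
    [("Group A", True)] * 80 + [("Group A", False)] * 20
    + [("Group B", True)] * 8 + [("Group B", False)] * 12
)

def completion_rate(rows):
    """Percent of rows where the outcome is True."""
    return sum(done for _, done in rows) / len(rows) * 100

overall = completion_rate(records)

# Disaggregate: compute the same rate separately for each group
by_group = {}
for group in sorted({g for g, _ in records}):
    rows = [r for r in records if r[0] == group]
    by_group[group] = completion_rate(rows)

print(f"Overall: {overall:.0f}%")
for group, rate in by_group.items():
    print(f"{group}: {rate:.0f}%")
```

Here the overall rate is about 73%, while the smaller group sits at 40%. Because the larger group dominates the average, the gap disappears unless you split the data.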
Count the people the data forgets
  • Homeless and pavement-dwellers absent from household frames
  • Migrants counted nowhere — neither origin nor destination
  • Informal workers invisible to formal employment statistics
  • Trans and non-binary people erased by binary-only forms
'Missing data' is rarely random. The uncounted are usually the most marginalised — and policy built on the count leaves them out twice.
Data colonialism and who benefits
Communities are often extracted from — surveyed repeatedly, with the knowledge and value flowing to outside institutions while respondents see nothing back.
  • Give findings back to the community in usable form
  • Involve people in defining what gets measured
  • Ask: who owns this data, and who profits from it?
11
Section Eleven
Tools & Further Reading
Tools to grow into
Tool | Good for | Note
Spreadsheets (Excel, Google Sheets) | Most everyday analysis | Start here; learn pivot tables
R | Statistics, reproducible analysis, graphics | Free, powerful, steeper curve
Python (pandas) | Cleaning, large data, automation | Free, general-purpose
KoboToolbox / ODK | Mobile survey data collection | Free, offline-capable
QGIS | Maps and spatial data | Free, open-source GIS
Power BI / Looker Studio | Dashboards | Quick visual reporting
Tools matter less than habits. A clear spreadsheet beats a confused script. Master the thinking first.
Open data you can use today
  • data.gov.in — India's open government data portal
  • censusindia.gov.in — Census tables & maps
  • NFHS / DHS Program — health & demographic data
  • MoSPI — NSS, PLFS, national accounts
  • World Bank Open Data & Our World in Data — global comparisons
A short, honest reading list
  • How to Lie with Statistics — Darrell Huff (still the classic primer)
  • The Visual Display of Quantitative Information — Edward Tufte
  • Data Feminism — D'Ignazio & Klein (power and data)
  • Factfulness — Hans Rosling (reading global data well)
  • Poor Economics — Banerjee & Duflo (evidence in development)
Pair this deck with ImpactMojo's Exploratory Data Analysis, Qualitative Methods and Research Ethics 101 courses.
If you remember five things
  • Always ask where the data came from — and who is missing
  • Plot it before you trust any summary number
  • Median over mean for skewed things like money
  • Correlation is not causation — look for the confounder
  • Behind every row is a person — handle with care
Data Literacy 101 · Complete
Now go question the next number.
CC BY-NC-ND 4.0 · Free Forever · ImpactMojo 101 Series