Data Literacy 101

ImpactMojo · Data Literacy 101 · www.impactmojo.in

ImpactMojo 101 Series · Free Forever

Reading, Questioning & Using Data Responsibly — a Foundational Course for Development Practitioners in South Asia

Research-Backed · South Asia Focus · 100 Slides · Free Access

Agenda

What We Cover

01

What Is Data Literacy?

Slides 3–9

02

Types & Sources of Data

Slides 10–19

03

From Concept to Indicator

Slides 20–28

04

Describing Data

Slides 29–38

05

Visualising Data

Slides 39–48

06

Relationships & Correlation

Slides 49–58

07

Sampling & Surveys

Slides 59–68

08

Data Quality & Cleaning

Slides 69–77

09

Reading Data Critically

Slides 78–86

10

Data Ethics, Privacy & Equity

Slides 87–94

11

Tools & Further Reading

Slides 95–99

01

Section One

What Is Data Literacy?

Definition

Data literacy is a survival skill

Development work runs on numbers — targets, indicators, budgets, surveys, dashboards. Data literacy is the ability to read, question, interpret and communicate data, and to use it to make better decisions. It is not statistics for its own sake; it is judgement.

Data literacy

The capacity to find, read, interpret, critically evaluate and communicate with data — and to recognise when data is being misused. It sits between raw numbers and good decisions.

You do not need to be a statistician. You need to ask the right questions of any number that lands on your desk.

Why It Matters

Decisions you already make with data

Programme decisions

Which blocks or wards to prioritise
Whether an intervention is working
How to set a realistic target
Where the budget actually goes

Daily judgement calls

Is this survey finding trustworthy?
Does this chart mislead?
Is the sample like our population?
Who is missing from this count?

Every one of these is a data-literacy question before it is a technical one.

The Ladder

Data → Information → Knowledge → Decision

01

DATA: raw records — 1,240 children weighed

→

02

INFORMATION: 18% are underweight

→

03

KNOWLEDGE: underweight is concentrated in 3 hamlets

→

04

DECISION: target supplementary feeding there

Data literacy is what moves you up the ladder without slipping — each step adds interpretation, and each step can introduce error.

Mindset

Five habits of a data-literate practitioner

Ask where it came from. Who collected it, when, how, and why?
Ask what it measures. Is the indicator really capturing the concept?
Ask who is missing. Whom does this number leave out?
Ask how sure we are. What is the uncertainty, the sample, the error?
Ask what decision it serves. Data without a question is noise.

Caution

Numbers feel objective. They are not neutral.

Not everything that counts can be counted, and not everything that can be counted counts.

— commonly attributed to William Bruce Cameron

Every dataset embeds choices: what to measure, what category to use, whom to ask, what to ignore. Those choices carry power. A data-literate practitioner reads the choices, not just the digits.

Roadmap

How this course is built

Foundations

Types and sources of data
Turning concepts into indicators
Describing and visualising data

Judgement

Correlation, sampling and surveys
Data quality and critical reading
Ethics, privacy and equity

Throughout, examples come from India and the wider region — the data you will actually meet at work.

02

Section Two

Types & Sources of Data

Two Families

Quantitative and qualitative

Quantitative

Numbers and counts — how many, how much, how often. Strong for measuring scale, comparing groups, tracking change.

Qualitative

Words, meanings, experiences — why, how, in what context. Strong for understanding process, mechanism and lived reality.

They are partners, not rivals. Numbers tell you that something changed; stories tell you why. The best evidence usually uses both.

Provenance

Primary vs secondary data

	Primary	Secondary
Source	You collect it	Someone else collected it
Example	Your baseline survey, FGDs	Census, NFHS, district HMIS
Control	High — you design it	Low — you take it as given
Cost / time	High	Usually low
Risk	Fieldwork error, bias	May not fit your question

Rule of thumb: exhaust good secondary data before collecting primary data. Much of what you need already exists — and re-collecting it wastes respondents' time.

Structure

Structured, semi-structured, unstructured

Structured

Neat rows & columns — survey tables, registers, spreadsheets

Semi

Some structure — forms with open text, tagged records, JSON

Unstructured

Free text, audio, images, video, field notes

Most development M&E lives in structured data, but a growing share — call-centre logs, photos, social media, satellite imagery — is unstructured and needs different tools.

Forms of Data

Cross-section, time series, panel

Cross-section: many units at one time (one NFHS round)
Time series: one unit over time (national TFR, 1990–2024)
Panel / longitudinal: same units tracked over time (a cohort re-surveyed every year)

Panel data is powerful — it can follow the same household as it changes, separating real change from differences between households.

India's Backbone

India's official data ecosystem

Source	What it covers	Frequency
Census of India	Every person — population, literacy, housing, migration	Decennial (2011 latest)
NFHS	Health, nutrition, fertility, anaemia, women's status	~5 years (NFHS-5: 2019–21)
NSS / PLFS	Consumption, employment, unemployment	PLFS annual since 2017–18
SRS	Birth & death rates, infant mortality, life expectancy	Annual
HMIS	Facility-level health service delivery	Monthly
SECC 2011	Socio-economic & caste deprivation indicators	One-off (2011)

Know these by name. Most of your secondary-data needs are met by one of them — free and downloadable.

Sample vs Census

NFHS, NSS and the Census do different jobs

Census = everyone

Counts every person. Best for small-area detail (a village, a ward). Expensive, so it is rare.

Surveys = a sample

NFHS & NSS interview a carefully chosen sample and infer the whole. Cheaper, frequent — but only reliable down to the level they were designed for (usually state or district).

Common error: using a state-level survey estimate to make claims about a single block. The sample was never designed to say anything that local.

Scale

How big are these datasets?

1.21 bn

people enumerated in Census 2011

Census of India 2011

~636,000

households interviewed in NFHS-5

NFHS-5, 2019–21

707

districts covered by NFHS-5

IIPS / MoHFW

These are among the largest demographic and health surveys in the world. Their size is what lets them speak reliably about districts — but not about your single panchayat.

Administrative Data

The data your programme already generates

Every scheme produces administrative data as a by-product of delivery: MGNREGA muster rolls, school enrolment (UDISE+), health records (HMIS), ration transactions, immunisation registers.

Strengths

Continuous, cheap, already collected
Universal coverage of beneficiaries
Near-real-time monitoring

Watch-outs

Records who is served, not who is missed
Incentives to over- or under-report
Gaps, duplicates, stale entries

New Frontiers

Big data and digital traces

Mobile-phone records, satellite night-lights, transaction logs and remote sensing increasingly supplement official statistics — useful where surveys are slow or coverage is thin.

But digital traces over-represent the connected and under-represent the poor, women, the elderly and remote areas. Big data can deepen exclusion if read uncritically.

03

Section Three

From Concept to Indicator

The Core Problem

You cannot measure 'wellbeing' directly

Most things we care about — poverty, empowerment, health, learning — are concepts, not numbers. Measurement is the bridge from an abstract concept to an observable indicator.

01

CONCEPT: women's empowerment

→

02

DIMENSIONS: mobility, decision-making, assets

→

03

INDICATORS: % who can visit a health centre alone

→

04

DATA: survey responses

Definition

Indicators, defined

Indicator

An observable, measurable marker that stands in for something we cannot observe directly. A good indicator is a faithful proxy for the concept — no more, no less.

Operationalisation

The precise rule that turns a concept into a measurement: exactly what to count, for whom, over what period, in what units.

Levels of Measurement

Four kinds of variable

Level	Meaning	Example	Valid maths
Nominal	Labels, no order	District, caste, religion	Counts, mode
Ordinal	Ordered, unequal gaps	Wealth quintile, Likert scale	Median, rank
Interval	Equal gaps, no true zero	Temperature (°C), calendar year	Mean, difference
Ratio	Equal gaps, true zero	Income, age, children ever born	All, ratios

Why it matters: you cannot take a meaningful average of caste categories, and a wealth quintile is a rank, not a rupee amount. The level decides which statistics are legal.

Good Indicators

What makes an indicator trustworthy?

Valid

Measures what it claims to measure

Reliable

Gives the same answer on repeat measurement

Sensitive

Moves when the real thing moves

Feasible

Can actually be collected, affordably

Validity vs Reliability

Accurate is not the same as consistent

Reliable, not valid

A miscalibrated weighing scale: it reads 2 kg high every time. Perfectly consistent — consistently wrong.

Valid and reliable

A calibrated scale: same answer each time, and the right answer. This is the target.

You can have reliability without validity, but never validity without reliability. Check both.

Proxies

When you measure A to learn about B

A proxy stands in for something hard to measure. Household assets proxy for wealth; night-light intensity proxies for economic activity; mid-upper-arm circumference proxies for acute malnutrition.

Every proxy leaks. Asset indices miss debt; night-lights miss the informal economy. Name the gap between your proxy and the concept — and report it.

Composite Indices

Bundling many indicators into one number

Indices like the Human Development Index or the Multidimensional Poverty Index (MPI) combine several indicators into a single score for easy comparison.

Upside

One memorable number; ranks and headlines; captures several dimensions at once.

Downside

Weights are value judgements; aggregation hides trade-offs; a good score can mask a terrible component.

Worked Example

India's National MPI

NITI Aayog's National MPI bundles 12 indicators across health, education and standard of living — nutrition, child mortality, schooling, cooking fuel, sanitation, housing, assets and more.

12

indicators in 3 dimensions

NITI Aayog National MPI

Headcount × Intensity

MPI = share who are poor × how deeply poor they are

Notice the design choice: a household is 'MPI poor' if deprived in a weighted third or more of indicators. Change that threshold and the poverty rate changes.
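
A minimal sketch of how the poverty cutoff moves the result, assuming the headcount × intensity construction described above. The deprivation scores are invented for illustration, and the function name `mpi` is hypothetical, not NITI Aayog's code.

```python
# Hypothetical weighted deprivation scores for 10 households
# (0 = deprived in nothing, 1 = deprived in every indicator).
scores = [0.0, 0.1, 0.15, 0.2, 0.25, 0.35, 0.4, 0.5, 0.6, 0.7]


def mpi(scores, cutoff):
    """Return (headcount, intensity, MPI) for a given poverty cutoff."""
    poor = [s for s in scores if s >= cutoff]
    if not poor:
        return 0.0, 0.0, 0.0
    headcount = len(poor) / len(scores)   # share of households that are poor
    intensity = sum(poor) / len(poor)     # average deprivation among the poor
    return headcount, intensity, headcount * intensity


h_third, a_third, mpi_third = mpi(scores, 1 / 3)   # cutoff: a weighted third
h_half, a_half, mpi_half = mpi(scores, 0.5)        # a stricter cutoff

print(f"cutoff 1/3: headcount {h_third:.0%}, MPI {mpi_third:.3f}")
print(f"cutoff 1/2: headcount {h_half:.0%}, MPI {mpi_half:.3f}")
```

With the same households, moving the cutoff from a third to a half lowers the headcount from 50% to 30%. Nothing about the households changed; only the definition did.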

04

Section Four

Describing Data

First Question

Where is the centre? How spread out?

Before any fancy analysis, describe the data. Two questions answer most of it: what is typical (central tendency) and how much do values vary (dispersion).

01

CENTRE: mean, median, mode

→

02

SPREAD: range, IQR, standard deviation

→

03

SHAPE: skew, peaks, outliers

Central Tendency

Mean, median and mode

Measure	What it is	Best when
Mean	Arithmetic average	Roughly symmetric data, no wild outliers
Median	Middle value when sorted	Skewed data — income, land, wealth
Mode	Most frequent value	Categories — commonest crop, caste, response

For money — income, consumption, landholding — prefer the median. A few crorepatis drag the mean far above what a typical household actually has.

Why It Matters

Mean vs median: the same village, two stories

Chart (illustrative example): Monthly income, 11 households (₹000s)

Median = ₹12k (typical household). Mean = ₹31k, pulled up by one rich household. Report the mean here and you describe a village that does not exist.
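
The gap between the two averages can be reproduced in a few lines. The incomes below are hypothetical, chosen only so that they give the same median and mean as the slide; they are not the chart's actual data.

```python
from statistics import mean, median

# Hypothetical monthly incomes (₹000s) for 11 households.
# Ten ordinary households and one very rich one.
incomes = [6, 8, 9, 10, 11, 12, 13, 14, 15, 18, 225]

typical = median(incomes)   # middle value once sorted
average = mean(incomes)     # dragged upward by the single rich household

print(f"median = ₹{typical}k, mean = ₹{average}k")
```

Ten of the eleven households earn less than the mean, which is why the median is the fairer summary here.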

Dispersion

Spread: range, IQR and standard deviation

Range: max − min. Simple, but one outlier wrecks it.
IQR (interquartile range): the middle 50% — robust to outliers.
Standard deviation: typical distance from the mean. The everyday measure of variability.

Two districts can share the same average income yet feel completely different — one equal, one polarised. The mean hides that; the spread reveals it.
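
As a rough illustration, the sketch below compares two invented districts with the same mean income. All figures are hypothetical; `statistics.quantiles` uses its default method, and other quartile conventions give slightly different IQRs.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical household incomes (₹000s) in two districts.
equal_district = [18, 19, 20, 20, 20, 21, 22]
polarised_district = [5, 6, 7, 20, 33, 34, 35]


def describe(values):
    """Return mean, range, IQR and standard deviation for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)   # quartile cut points
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": pstdev(values),            # population standard deviation
    }


equal_summary = describe(equal_district)
polarised_summary = describe(polarised_district)

for name, summary in [("equal", equal_summary), ("polarised", polarised_summary)]:
    print(name, summary)
```

Both districts report a mean of 20, but every measure of spread is several times larger in the polarised one.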

Percentiles

Percentiles, quartiles and quintiles

A percentile is the value below which a given share of cases fall. The 25th percentile (Q1) has a quarter of households below it. Wealth quintiles — five 20% bands — are how NFHS and NSS routinely report inequality.

Q1–Q5

Quintiles: five equal-size groups

p50

The 50th percentile is the median

p90/p10

A common inequality ratio

Distributions

The shape of the data matters

Chart (illustrative): Symmetric vs right-skewed distributions

Income, landholding and firm size are almost always right-skewed — a long tail of large values. That is exactly when the mean misleads.

The Normal Curve

The bell curve and the 68–95–99.7 rule

Many natural measurements (height, birth weight, measurement error) follow a roughly normal distribution — symmetric, bell-shaped, defined by its mean and standard deviation (SD).

68%

of values fall within 1 SD of the mean

95%

fall within 2 SD

99.7%

fall within 3 SD

But do not assume normality. Development data — income, expenditure, programme size — is usually skewed, so the rule does not apply. Always look at the shape first.
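
The 68–95–99.7 figures can be checked directly from the normal distribution. A short sketch using Python's standard library, which holds only for data that really is normal:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```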

Outliers

Outliers: error, or the most important case?

Could be an error

A 9-foot-tall respondent, a household of 80, an income of ₹0 — check for data-entry slips before analysing.

Could be the story

The one block with triple the dropout rate may be precisely where the programme is needed. Do not delete it — investigate it.

Never silently drop outliers. Flag them, explain them, and decide transparently.

Rates vs Counts

Always ask: out of how many?

A raw count without its denominator is almost meaningless. '500 dropouts' could be a crisis or a triumph depending on whether the base is 600 or 60,000.

Count

How many? (numerator alone)

Rate

How many out of how many? (numerator ÷ denominator)

The denominator is where most honest comparison lives — per 1,000 people, per eligible child, per year. Demand it.
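
The '500 dropouts' example works out as follows. This is a small sketch; the function name `rate_per_1000` is made up for illustration.

```python
def rate_per_1000(count, base):
    """Express a count as a rate per 1,000 of its denominator."""
    return 1000 * count / base


dropouts = 500
small_base_rate = rate_per_1000(dropouts, 600)      # base of 600 children
large_base_rate = rate_per_1000(dropouts, 60_000)   # base of 60,000 children

print(f"{small_base_rate:.0f} per 1,000 vs {large_base_rate:.1f} per 1,000")
```

The count is identical; the rate differs by a factor of one hundred.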

05

Section Five

Visualising Data

Why Visualise

A good chart is an argument you can see

Visualisation is not decoration. A well-made chart reveals patterns the table hides — trends, gaps, outliers, relationships — and lets a busy decision-maker grasp them in seconds.

The greatest value of a picture is when it forces us to notice what we never expected to see.

— John Tukey, pioneer of exploratory data analysis

Match Chart to Job

Pick the chart for the question

You want to show…	Use	Avoid
Change over time	Line chart	Pie chart
Comparison across categories	Bar chart	3-D anything
Composition / shares of a whole	Stacked bar (or 1 pie, few slices)	Many pies
Relationship between two variables	Scatter plot	Dual-axis tricks
Distribution of one variable	Histogram / box plot	Single average
Geographic pattern	Choropleth map	Map coloured by raw counts

Anatomy

What every honest chart needs

A clear title that states the takeaway, not just the topic
Labelled axes with units
An honest baseline — usually zero for bar charts
A source and a date
A note on the denominator and any exclusions

The Big Lie

The truncated axis

Chart 1 (illustrative): Y-axis starts at 90 (misleading)

Chart 2 (illustrative): Y-axis starts at 0 (honest)

Same data. The chart whose axis starts at 90 makes a 4-point rise look like a tripling. Truncating the axis is the most common way charts lie.
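
The arithmetic behind the illusion is simple. In this sketch the two values (92 and 96) are assumed for illustration, since the slide's underlying numbers are not given; what matters is how tall each bar is drawn above the axis baseline.

```python
# Hypothetical coverage figures: a 4-point rise.
before, after = 92, 96


def drawn_height_ratio(before, after, baseline):
    """How many times taller the second bar is drawn, given where the axis starts."""
    return (after - baseline) / (before - baseline)


ratio_truncated = drawn_height_ratio(before, after, baseline=90)
ratio_honest = drawn_height_ratio(before, after, baseline=0)

print(f"axis from 90: second bar is {ratio_truncated:.1f}x as tall")
print(f"axis from 0:  second bar is {ratio_honest:.2f}x as tall")
```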

Chartjunk

More ink is not more information

Edward Tufte's principle: maximise the data-ink ratio. Every gradient, shadow, 3-D effect and clip-art icon competes with the data for attention — and usually wins.

No 3-D bars — they distort the very lengths you are comparing
No pie charts with 8 slices — the eye cannot rank angles
No rainbow palettes — colour should carry meaning, not noise

Colour

Use colour with intent — and for everyone

Good colour

Sequential for ordered data (light→dark)
One accent colour to highlight the point
Consistent meaning across charts

Accessibility

~8% of men have colour-vision deficiency
Never rely on red-vs-green alone
Add labels, patterns or direct text

Small Multiples

Repeat a small chart to compare many

Instead of cramming ten states into one tangled line chart, draw ten tiny identical charts side by side — small multiples. The eye compares shapes effortlessly when scale and layout are shared.

Rule: when a single chart gets crowded, split it into a grid of small, identical ones rather than adding more colours.

Tables Too

Sometimes a table beats a chart

Use a table for exact values people will look up or quote
Use a chart for pattern, trend and comparison
Right-align numbers, fix decimal places, and add row/column totals
A heat-shaded table can do both — precise and patterned

Checklist

Before you publish a chart, ask…

Does the title state the finding?
Is the baseline honest (zero where it should be)?
Are axes, units and the denominator labelled?
Could a colour-blind reader read it?
Are the source and date on the chart?
Have I removed everything that is not data?

06

Section Six

Relationships & Correlation

The Idea

Do two things move together?

Correlation

A measure of how strongly two variables move together. Positive: both rise together. Negative: one rises as the other falls. Zero: no linear relationship.

The correlation coefficient r runs from −1 (perfect negative) through 0 (none) to +1 (perfect positive).
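
For readers who want to see what r is made of, here is a minimal hand-rolled Pearson correlation. The three toy datasets are invented; the third shows that r = 0 means "no linear relationship", not "no relationship".

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
r_rising = pearson_r(xs, [2, 4, 6, 8, 10])     # both rise together
r_falling = pearson_r(xs, [10, 8, 6, 4, 2])    # one rises as the other falls
r_u_shape = pearson_r(xs, [2, 1, 0, 1, 2])     # clear U-shaped pattern, but not linear

print(r_rising, r_falling, r_u_shape)
```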

See It

Female literacy and fertility across Indian states

Chart: Female literacy (%) vs total fertility rate, major states. Illustrative, patterned on Census 2011 & NFHS-5

A clear negative correlation: states with higher female literacy tend to have lower fertility. But does literacy cause lower fertility? Hold that thought.

The Golden Rule

Correlation is not causation

Two variables can move together for several reasons, only one of which is 'A causes B'.

Reverse causation: B might cause A
Confounding: a third factor C drives both
Selection: the sample was chosen in a way that creates the link
Chance: with enough variables, some correlate by luck

Confounding

The lurking third variable

Ice-cream sales correlate with drowning deaths. Ice cream does not cause drowning — summer heat drives both. The confounder is the real story.

01

Hot weather (confounder C)

→

02

drives ice-cream sales (A)

→

03

AND drives swimming & drowning (B)

→

04

so A and B correlate — with no causal link

Spurious

Patterns appear in pure noise

Test enough pairs of unrelated variables and some will correlate strongly by sheer chance. A tight correlation is a clue, never a proof.

Before believing a correlation, ask: is there a plausible mechanism? Could a confounder explain it? Does it survive in other data?

Anscombe

Never trust the number without the picture

Anscombe's quartet is four datasets with nearly identical means, variances and correlation (r ≈ 0.82), yet utterly different shapes: one roughly linear, one curved, one a tight line thrown off by a single outlier, and one where a single extreme point creates the entire correlation.

The lesson, demonstrated in 1973 and true today: always plot your data. Summary statistics alone can hide the truth.
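
The quartet is small enough to check by hand. The sketch below uses Anscombe's published values as commonly reproduced (worth verifying against a trusted copy before reuse) and computes the mean of y and the correlation for each set.

```python
from math import sqrt
from statistics import mean

x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return covariation / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


summary = {name: (mean(ys), pearson_r(xs, ys)) for name, (xs, ys) in quartet.items()}

for name, (mean_y, r) in summary.items():
    print(f"set {name}: mean of y = {mean_y:.2f}, r = {r:.3f}")
```

The four summaries are almost indistinguishable. Only a scatter plot of each set shows how different they are.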

Ecological Fallacy

Group patterns ≠ individual truths

Ecological fallacy

Wrongly inferring something about individuals from a pattern seen only at the group level.

A district with higher average income may have higher literacy on average — that does not mean the richer individuals within it are the literate ones. What holds for districts need not hold for people.

Simpson's Paradox

A trend can reverse when you split the data

Simpson's paradox: a relationship visible in the whole dataset can flip when you break it into subgroups. A scheme can look worse overall yet be better in every region — if the regions differ in size and baseline.

Always disaggregate before concluding. The aggregate can point the opposite way to the truth.
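
A worked illustration of the reversal, with invented numbers. Scheme A does better in both regions but works mostly in the hard region, so its overall rate looks worse.

```python
# Hypothetical (successes, participants) by scheme and region.
results = {
    "Scheme A": {"easy region": (81, 87), "hard region": (192, 263)},
    "Scheme B": {"easy region": (234, 270), "hard region": (55, 80)},
}


def region_rate(scheme, region):
    successes, participants = results[scheme][region]
    return successes / participants


def overall_rate(scheme):
    successes = sum(s for s, _ in results[scheme].values())
    participants = sum(n for _, n in results[scheme].values())
    return successes / participants


for region in ("easy region", "hard region"):
    print(region, f"A {region_rate('Scheme A', region):.0%}", f"B {region_rate('Scheme B', region):.0%}")
print("overall", f"A {overall_rate('Scheme A'):.0%}", f"B {overall_rate('Scheme B'):.0%}")
```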

Regression to the Mean

Extreme results drift back to average

When you pick the worst-performing districts and re-measure them later, they usually look better — even with no intervention. Extreme values contain extra luck that does not repeat. This is regression to the mean.

It is a notorious trap in evaluation: target the bottom 10% of schools, see them improve, and credit your programme — when much of the gain would have happened anyway. A comparison group is the cure.

07

Section Seven

Sampling & Surveys

Why Sample

You rarely need to ask everyone

A well-chosen sample of a few thousand can describe a population of millions — the principle behind NFHS, NSS and every opinion poll. The magic is not size; it is representativeness.

Representative sample

A sample whose composition mirrors the population on the characteristics that matter, so findings can be generalised back to the whole.

Population vs Sample

Define the population first

01

TARGET POPULATION: whom you want to learn about

→

02

SAMPLING FRAME: the list you can actually draw from

→

03

SAMPLE: whom you select to measure

→

04

RESPONDENTS: who actually answers

Each gap — frame missing people, non-response, refusals — is a place bias creeps in. The frame is often the weakest link: a list of phone owners is not a list of citizens.

Probability Sampling

Let chance choose — it removes bias

Method	How	Use when
Simple random	Every unit equal chance	You have a full list
Systematic	Every k-th unit from a list	Ordered list, no hidden cycle
Stratified	Split into groups, sample each	You must represent subgroups
Cluster	Sample whole groups (villages)	People are geographically spread
Multistage	Clusters, then units within	Large national surveys (NFHS)

Only probability sampling lets you calculate a margin of error and generalise honestly.

Non-Probability

Convenient, but you cannot generalise

Convenience: whoever is easy to reach — the people at the camp
Purposive: hand-picked for a reason — key informants
Snowball: respondents refer others — hidden populations
Quota: fill fixed counts per group, but non-randomly

These are legitimate for qualitative depth and hard-to-reach groups — but you cannot attach a margin of error or claim population-level numbers from them.

Sample Size

How many do I actually need?

Chart: Margin of error vs sample size (95% confidence, p = 0.5). Source: standard sampling theory

Note the curve flattens: ~1,067 gives ±3%, but halving the error to ±1.5% needs ~4,270, four times as many. Precision gets expensive fast.
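
The curve comes from one textbook formula for a proportion under simple random sampling: margin = z × √(p(1 − p)/n). A sketch under those assumptions (95% confidence so z = 1.96, and p = 0.5 as the worst case); clustered surveys need larger samples than this suggests.

```python
from math import sqrt

Z_95 = 1.96  # z-value for 95% confidence


def margin_of_error(n, p=0.5, z=Z_95):
    """Margin of error for a proportion from a simple random sample of size n."""
    return z * sqrt(p * (1 - p) / n)


def sample_size_for(margin, p=0.5, z=Z_95):
    """Sample size needed to reach a given margin of error (before rounding up)."""
    return z ** 2 * p * (1 - p) / margin ** 2


n_for_3_points = sample_size_for(0.03)
n_for_1_5_points = sample_size_for(0.015)

print(f"±3.0% needs about {n_for_3_points:.0f} respondents")
print(f"±1.5% needs about {n_for_1_5_points:.0f} respondents")
```

Because the margin shrinks with the square root of n, halving the error always costs four times the sample.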

Key Insight

It's the sample size, not the fraction

A counter-intuitive truth: for a large population, accuracy depends on the absolute sample size, not the share of the population sampled. 1,500 people describe a state and a country about equally well.

This is why a national survey of ~636,000 households can speak about all of India, and why your block of 2,000 households still needs a few hundred interviews, not twenty.
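
One way to see this is the standard finite population correction, which the slides do not name. It is brought in here as an assumption, with a ±5% margin chosen purely for illustration.

```python
Z_95 = 1.96  # z-value for 95% confidence


def required_sample(population, margin=0.05, p=0.5, z=Z_95):
    """Simple-random-sample size for a proportion, with finite population correction."""
    n_infinite = z ** 2 * p * (1 - p) / margin ** 2      # size for a very large population
    return n_infinite / (1 + (n_infinite - 1) / population)


block = required_sample(2_000)             # a block of 2,000 households
state = required_sample(1_000_000)         # a million households
country = required_sample(1_000_000_000)   # a billion people

print(f"block: {block:.0f}, state: {state:.0f}, country: {country:.0f}")
```

The requirement barely moves between a state and a country, and even a small block still needs a few hundred interviews.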

Bias

The errors that size cannot fix

Selection bias: the frame or method systematically misses people
Non-response bias: those who refuse differ from those who answer
Survivorship bias: you only see who remained (drop-outs vanish)
Social-desirability bias: people answer how they think they should

A bigger biased sample is just a more confident wrong answer. Size fixes noise, never bias.

Weights

Why survey results come 'weighted'

When some groups are deliberately over-sampled (to study them reliably) or respond less, surveys apply weights so each respondent represents the right number of real people.

Practical warning: using NFHS or PLFS unit data without the survey weights gives biased estimates, with wrong totals and often wrong percentages. Always weight when the documentation says to.
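
A toy example of what weighting corrects. The two strata and all their numbers are invented; real surveys ship a weight variable with the unit data, and its name and scaling are set out in the survey documentation.

```python
# Hypothetical survey: a small group is deliberately over-sampled.
strata = [
    {"name": "general", "population": 90_000, "sampled": 500, "anaemic": 200},
    {"name": "over-sampled group", "population": 10_000, "sampled": 500, "anaemic": 350},
]

# Unweighted: every respondent counts once, so the small group is over-represented.
unweighted = sum(s["anaemic"] for s in strata) / sum(s["sampled"] for s in strata)

# Weighted: each respondent stands for population / sampled real people.
weighted_cases = sum(s["anaemic"] * s["population"] / s["sampled"] for s in strata)
weighted_total = sum(s["population"] for s in strata)
weighted = weighted_cases / weighted_total

print(f"unweighted: {unweighted:.0%}, weighted: {weighted:.0%}")
```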

Good Questions

A survey is only as good as its questions

Avoid leading questions ('Don't you agree that…?')
Avoid double-barrelled ones ('clean and safe?' — which one?)
Use language and units respondents actually use locally
Pilot every instrument before the real round — always

08

Section Eight

Data Quality & Cleaning

Reality Check

Most data work is cleaning

Analysts often spend the majority of a project just preparing data — finding errors, reconciling formats, handling gaps. Glamorous analysis sits on a large, unglamorous foundation of cleaning.

Garbage in, garbage out. No model rescues bad data.

— computing proverb, truer than ever

Dirty Data

What 'dirty' data looks like

Problem	Example	Risk
Missing values	Blank income field	Biased averages if not random
Duplicates	Same beneficiary twice	Inflated counts
Inconsistent codes	'F' / 'Female' / '2'	Broken grouping
Outliers / impossible	Age = 200, −5 children	Distorted statistics
Format drift	DD/MM vs MM/DD dates	Silent miscalculation
Typos in keys	Misspelt village name	Failed merges

Missing Data

Why values are missing matters more than how many

Missing at random: gaps unrelated to the value — least harmful
Missing not at random: the richest refuse to state income — this biases results
Dropping rows with gaps can quietly delete the very people you care about

Before deleting or filling missing values, ask why they are missing. The pattern of absence is itself data.
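
A small sketch of why the reason for missingness matters, using invented incomes where the two richest households decline to answer.

```python
from statistics import mean

# Hypothetical true monthly incomes (₹000s) and what the survey recorded.
true_incomes = [8, 10, 12, 15, 20, 60, 90]
reported = [8, 10, 12, 15, 20, None, None]   # the two richest refused

answered = [value for value in reported if value is not None]

true_mean = mean(true_incomes)
observed_mean = mean(answered)   # what you get by silently dropping the gaps

print(f"true mean ₹{true_mean:.1f}k, mean of those who answered ₹{observed_mean:.1f}k")
```

Dropping the blanks does not just shrink the sample; it removes one end of the distribution and more than halves the estimated mean.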

The Pipeline

A repeatable cleaning workflow

01

INSPECT: look at every column's range & uniques

→

02

VALIDATE: rules (age 0–120, % in 0–100)

→

03

FIX: standardise codes, dates, units

→

04

DOCUMENT: log every change

→

05

FREEZE: keep raw data untouched

Golden rule: never edit the raw file. Clean in a script or a copy so every change is reversible and visible.
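
The workflow above can be sketched in a few lines. Everything here is hypothetical: the records, the field names, and the code mapping (including the assumption that '2' means female and '1' means male, which a real codebook must confirm).

```python
import copy

# FREEZE: raw records are kept in a tuple and never modified.
RAW_RECORDS = (
    {"id": 1, "sex": "F", "age": 34},
    {"id": 2, "sex": "Female", "age": 200},
    {"id": 3, "sex": "2", "age": 41},
    {"id": 4, "sex": "M", "age": 29},
)

# Assumed mapping from messy codes to one standard label.
SEX_CODES = {"F": "female", "Female": "female", "2": "female",
             "M": "male", "Male": "male", "1": "male"}


def clean(records):
    """Return a cleaned copy of the records plus a log of every change made."""
    cleaned = copy.deepcopy(list(records))   # work on a copy, not the raw data
    change_log = []
    for row in cleaned:
        # FIX: standardise codes (unknown codes become None and are logged).
        standard = SEX_CODES.get(row["sex"])
        if standard != row["sex"]:
            change_log.append((row["id"], "sex", row["sex"], standard))
            row["sex"] = standard
        # VALIDATE: age must lie in 0-120; impossible values are blanked, not guessed.
        if not 0 <= row["age"] <= 120:
            change_log.append((row["id"], "age", row["age"], None))
            row["age"] = None
    return cleaned, change_log


cleaned, change_log = clean(RAW_RECORDS)

# DOCUMENT: every change is recorded as (id, field, old value, new value).
for entry in change_log:
    print(entry)
```

Because the cleaning lives in a function rather than in manual edits, it can be re-run on a fresh extract and audited line by line.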

Reproducibility

If you can't redo it, you can't trust it

Fragile

Manual edits in a spreadsheet, no record of what changed. Next month, nobody can reproduce the number — including you.

Robust

A documented script from raw to result. Re-run it any time, audit every step, hand it to a colleague.

Validation

Build checks in, don't hope

Range checks: can this value exist at all?
Logic checks: a 6-year-old cannot be married with children
Cross-checks: do parts sum to the reported total?
Sense checks: does the headline number pass the smell test?
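
The first three kinds of check can be automated; the sense check still needs a human. A minimal sketch, with hypothetical field names and thresholds that a real project would take from its own codebook:

```python
def validate(record):
    """Return a list of problems found in one household-member record."""
    problems = []
    # Range check: can this value exist at all?
    if not 0 <= record["age"] <= 120:
        problems.append("range: age outside 0-120")
    # Logic check: combinations that cannot be true together.
    if record["age"] < 10 and record["married"]:
        problems.append("logic: young child recorded as married")
    # Cross-check: parts must sum to the reported total.
    if record["boys"] + record["girls"] != record["children_total"]:
        problems.append("cross: boys + girls != children_total")
    return problems


good = {"age": 32, "married": True, "boys": 1, "girls": 2, "children_total": 3}
bad = {"age": 6, "married": True, "boys": 1, "girls": 1, "children_total": 3}

good_problems = validate(good)
bad_problems = validate(bad)

print(good_problems)
print(bad_problems)
```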

Metadata

Document so future-you can understand

Metadata is data about your data: what each variable means, its units, allowed values, how and when it was collected, and what you changed.

A dataset without a data dictionary is a puzzle with no key. You will forget your own coding within six months, which is sooner than you think.

Versioning

Keep the trail

Keep the raw extract read-only and dated
Name files with versions and dates, not 'final_FINAL_v3'
Record the source, download date and any filters applied
Save the cleaning script alongside the data

09

Section Nine

Reading Data Critically

Uncertainty

Every estimate has a range

A survey figure of 42% is shorthand for 'about 42%, give or take'. The confidence interval — say 39–45% — is the honest version. A point estimate without its range overstates certainty.

If two groups' confidence intervals overlap heavily, a difference between them may be noise, not signal. Look for the range, not just the dot.
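
A range like 39–45% is what the textbook formula gives for a 42% estimate from roughly 1,000 respondents. The sample size is an assumption here, as is simple random sampling; clustered surveys usually have wider intervals than this.

```python
from math import sqrt


def proportion_ci(p, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (simple random sample)."""
    standard_error = sqrt(p * (1 - p) / n)
    return p - z * standard_error, p + z * standard_error


low, high = proportion_ci(0.42, 1000)   # n = 1000 is assumed for illustration

print(f"42% with n = 1,000: {low:.1%} to {high:.1%}")
```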

Significance

'Significant' has a narrow technical meaning

Statistical significance

A result unlikely to have arisen by chance alone if there were truly no effect. It says nothing about whether the effect is large or important.

Statistically significant ≠ practically important. With a huge sample, a trivially small difference can be 'significant'. Always ask: how big is the effect, and does it matter?

Base Rates

The base-rate trap

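A 99%-accurate test for a rare condition can still produce mostly false positives, because the healthy vastly outnumber the sick. If 1 person in 1,000 has the condition, roughly ten healthy people test positive for every true case. Ignoring the underlying rate is one of the commonest reasoning errors.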

Whenever you read 'X% accurate', ask how common the thing is to begin with. The base rate changes everything.
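
The effect of the base rate can be computed directly. This sketch reads '99% accurate' as 99% sensitivity and 99% specificity, which is an interpretive assumption, and tries two illustrative prevalences.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of positive test results that are true positives."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


rare = positive_predictive_value(prevalence=0.001, sensitivity=0.99, specificity=0.99)
less_rare = positive_predictive_value(prevalence=0.01, sensitivity=0.99, specificity=0.99)

print(f"1 in 1,000 have it: {rare:.0%} of positives are real")
print(f"1 in 100 have it:   {less_rare:.0%} of positives are real")
```

Same test, same accuracy: at a prevalence of 1 in 1,000, about nine in ten positives are false alarms.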

Percentages

Percent vs percentage points

If coverage rises from 40% to 50%, that is a 10 percentage-point increase — but a 25 percent increase. Mixing the two is a classic way to exaggerate or hide change.

+10 pp

percentage-point change (50 − 40)

+25%

relative change (10 ÷ 40)
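
The two calculations side by side, as a minimal sketch using the 40% to 50% example above:

```python
old_coverage = 40.0   # percent
new_coverage = 50.0   # percent

percentage_point_change = new_coverage - old_coverage            # absolute gap
relative_change = 100 * (new_coverage - old_coverage) / old_coverage   # gap relative to the start

print(f"+{percentage_point_change:.0f} percentage points, +{relative_change:.0f} percent")
```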

Relative vs Absolute

'Doubled' can hide tiny numbers

'Cases doubled!' sounds alarming — but a rise from 2 to 4 is a doubling of almost nothing. Relative change without the absolute base is designed to impress, not inform.

Always pair the relative figure with the raw counts. '100% increase, from 2 to 4 cases' tells the honest story.

Cherry-Picking

Beware the chosen baseline

Start the time axis at a low year to exaggerate growth
Quote the one indicator that improved, ignore the rest
Compare to an unusual reference period (a drought, a peak)
Report only the subgroup that helps the argument

Ask: why this start date, this indicator, this comparison? What is left out?
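
One way to test a suspicious baseline is to recompute the headline growth from every possible start year. The coverage series below is entirely hypothetical, with an unusually bad year in 2017 built in to show the effect.

```python
# Hypothetical coverage series (%); 2017 is a deliberately unusual low year.
coverage = {2015: 48, 2016: 45, 2017: 30, 2018: 38, 2019: 44, 2020: 50}
end_year = 2020

# Relative growth to the end year, measured from each candidate baseline.
growth_by_baseline = {
    year: (coverage[end_year] - value) / value * 100
    for year, value in coverage.items()
    if year < end_year
}

for year, growth in growth_by_baseline.items():
    print(f"Baseline {year}: {growth:+.1f}% growth by {end_year}")
```

In this made-up series the same end point reads as about 4% growth from 2015 and about 67% growth from 2017. If a report quotes only the larger figure, the choice of start year is doing the work.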

Garden of Forking Paths

Test enough things and something 'works'

If you slice the data twenty ways and test each slice at the usual 5% level, you should expect about one slice to look 'significant' by chance alone, even when nothing real is going on. Reporting only that slice (p-hacking) manufactures false findings.

Trust analyses that were specified before seeing the data, and findings that replicate. Be wary of a single surprising subgroup result.
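
The size of the problem can be worked out directly. This sketch assumes the tests are independent and use the conventional 5% threshold; it also shows the Bonferroni correction, one common and fairly conservative remedy that the deck itself does not name.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of several independent tests
    looks 'significant' when no real effect exists anywhere."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_threshold(num_tests, alpha=0.05):
    """Stricter per-test threshold that keeps the overall
    false-positive rate at or below alpha."""
    return alpha / num_tests

risk_twenty = chance_of_false_positive(20)
strict_alpha = bonferroni_threshold(20)

print(f"Chance of at least one false 'finding' in 20 tests: {risk_twenty:.0%}")
print(f"Bonferroni threshold for each of 20 tests: p < {strict_alpha}")
```

Under these assumptions, twenty slices give roughly a two-in-three chance of at least one spurious 'significant' result.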

A Reader's Checklist

Eight questions for any statistic

Who produced it, and what is their interest?
How was it measured — and what is the denominator?
Is it a sample? How big, how chosen?
What is the uncertainty / margin of error?
Percent or percentage points? Relative or absolute?
Who is missing from the count?
Correlation, or genuine causation?
Does it pass the common-sense smell test?

10

Section Ten

Data Ethics, Privacy & Equity

Behind Every Row

Every data point is a person

In development data, rows are people — often poor, often without power over how their information is used. Data ethics is not paperwork; it is respect made operational.

Data are not just numbers; they are people reduced to numbers. The reduction is never neutral.

— a principle of feminist data practice

Consent

Informed consent is the floor

People should know what is collected and why
How it will be used, stored and shared — and for how long
That they can refuse or stop, with no penalty
Consent must be in a language and form they genuinely understand

A thumbprint on a form nobody explained is not consent. For children and other vulnerable groups, extra safeguards apply.

Privacy

Anonymisation is harder than deleting names

Removing names is not enough. A combination of village + age + caste + occupation can re-identify one person — especially in small areas where few people share those traits.

Direct IDs

Name, Aadhaar, phone — remove

Quasi-IDs

Age + place + caste can re-identify — aggregate or coarsen
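
A simple pre-release check is to count how many records share each combination of quasi-identifiers and flag any combination held by fewer than k people (the idea behind k-anonymity). The records, field names and the choice of k = 3 below are all hypothetical; a real release would need a threshold agreed with your ethics or data-protection lead.

```python
from collections import Counter

def risky_combinations(records, quasi_ids, k=5):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(tuple(record[field] for field in quasi_ids)
                     for record in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical survey rows with names already removed.
records = [
    {"village": "A", "age_band": "30-39", "occupation": "farmer"},
    {"village": "A", "age_band": "30-39", "occupation": "farmer"},
    {"village": "A", "age_band": "30-39", "occupation": "farmer"},
    {"village": "B", "age_band": "60+", "occupation": "teacher"},
]

risky = risky_combinations(records, ["village", "age_band", "occupation"], k=3)
print(risky)
```

Here the single teacher aged 60+ in village B is flagged: anyone who knows the village could identify that row. Coarsening a field (for example, reporting block instead of village) and re-running the check shows whether the risk has gone.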

The Law

India's Digital Personal Data Protection Act, 2023

The DPDP Act, 2023 is India's first comprehensive data-protection law. It sets duties for anyone handling personal digital data — including NGOs and researchers.

Collect only what you need, for a stated purpose (purpose limitation)
Obtain free, informed, specific consent
Protect data with reasonable security safeguards
Stronger protections for children's data

Know your obligations before you collect. 'We're a small NGO' is not an exemption.

Equity in Counting

What you don't disaggregate, you can't see

An average hides the people behind it. A programme can report good overall numbers while failing Dalit, Adivasi, disabled, or women beneficiaries. Disaggregation is how inequity becomes visible.

Chart (illustrative): an overall average can mask large gaps between groups.
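
The same point in a few lines of code: compute the headline rate, then the rate for each group. The group labels and counts are made up for illustration and do not describe any real programme.

```python
from collections import defaultdict

# Hypothetical beneficiary records: (group, 1 if service received else 0).
rows = (
    [("majority group", 1)] * 81 + [("majority group", 0)] * 9
    + [("marginalised group", 1)] * 2 + [("marginalised group", 0)] * 8
)

# Headline figure across everyone.
overall_rate = sum(received for _, received in rows) / len(rows)

# Disaggregated figures, one rate per group.
outcomes_by_group = defaultdict(list)
for group, received in rows:
    outcomes_by_group[group].append(received)
group_rates = {group: sum(values) / len(values)
               for group, values in outcomes_by_group.items()}

print(f"Overall: {overall_rate:.0%}")
for group, rate in group_rates.items():
    print(f"{group}: {rate:.0%}")
```

In this made-up data the overall rate is 83%, which looks healthy, while the smaller group sits at 20%. Because that group is only a tenth of the records, its experience barely moves the average.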

The Missing

Count the people the data forgets

Homeless and pavement-dwellers absent from household frames
Migrants counted nowhere — neither origin nor destination
Informal workers invisible to formal employment statistics
Trans and non-binary people erased by binary-only forms

'Missing data' is rarely random. The uncounted are usually the most marginalised — and policy built on the count leaves them out twice.

Data Power

Data colonialism and who benefits

Communities are often treated as something to extract from: surveyed repeatedly, while the knowledge and value flow to outside institutions and respondents see nothing back.

Give findings back to the community in usable form
Involve people in defining what gets measured
Ask: who owns this data, and who profits from it?

11

Section Eleven

Tools & Further Reading

Getting Hands-On

Tools to grow into

Tool	Good for	Note
Spreadsheets (Excel, Google Sheets)	Most everyday analysis	Start here; learn pivot tables
R	Statistics, reproducible analysis, graphics	Free, powerful, steeper curve
Python (pandas)	Cleaning, large data, automation	Free, general-purpose
KoboToolbox / ODK	Mobile survey data collection	Free, offline-capable
QGIS	Maps and spatial data	Free, open-source GIS
Power BI / Looker Studio	Dashboards	Quick visual reporting

Tools matter less than habits. A clear spreadsheet beats a confused script. Master the thinking first.

Where to Get Data

Open data you can use today

data.gov.in — India's open government data portal
censusindia.gov.in — Census tables & maps
NFHS / DHS Program — health & demographic data
MoSPI — NSS, PLFS, national accounts
World Bank Open Data & Our World in Data — global comparisons

Keep Learning

A short, honest reading list

How to Lie with Statistics — Darrell Huff (still the classic primer)
The Visual Display of Quantitative Information — Edward Tufte
Data Feminism — D'Ignazio & Klein (power and data)
Factfulness — Hans Rosling (reading global data well)
Poor Economics — Banerjee & Duflo (evidence in development)

Pair this deck with ImpactMojo's Exploratory Data Analysis, Qualitative Methods and Research Ethics 101 courses.

The Takeaways

If you remember five things

Always ask where the data came from — and who is missing
Plot it before you trust any summary number
Median over mean for skewed things like money
Correlation is not causation — look for the confounder
Behind every row is a person — handle with care

Data Literacy 101 · Complete

Now go question the next number.

You don't need to be a statistician — you need to ask the right questions of every chart, survey and dashboard that crosses your desk. Explore the rest of the ImpactMojo 101 Series, free forever.

CC BY-NC-ND 4.0·Free Forever·ImpactMojo 101 Series