Exploratory Data Analysis 101

ImpactMojo · Exploratory Data Analysis 101 · www.impactmojo.in

ImpactMojo 101 Series · Free Forever

Exploratory Data Analysis 101

Inspecting, Cleaning & Questioning Household Survey Data — a Foundational Course for Development Practitioners in South Asia

Household Surveys · South Asia Focus · ~90 Slides · Free Access

Agenda

What We Cover

01 · What EDA Is & Why
02 · Household Survey Data in India
03 · The EDA Workflow
04 · Knowing Your Variables
05 · Univariate Exploration
06 · Spotting Trouble
07 · Bivariate & Multivariate
08 · Survey Weights & Design
09 · Visual EDA Done Well
10 · From EDA to Questions
11 · Tools & Reproducibility

01

Section One

What EDA Is & Why

Definition

Look before you leap

Before you fit a model, build an index or write a finding, you must look at the data. Exploratory Data Analysis (EDA) is the open-minded first pass — describing, plotting and questioning a dataset to understand its shape, gaps and surprises.

Exploratory Data Analysis (EDA)

An attitude and a toolkit for examining data with few prior assumptions — using summaries and pictures to reveal structure, spot problems and generate questions, before any formal testing or modelling.

EDA is detective work, not decoration. You are interrogating the data to find out what it can — and cannot — honestly say.

The Origin

Tukey and the case for exploring

Exploratory data analysis is an attitude, a flexibility, and a reliance on display, not a bundle of techniques.

— John W. Tukey, who named EDA in 1977

Tukey argued that statistics had become obsessed with confirming hypotheses and neglected the prior, humbler task of finding them. EDA restored looking, sketching and questioning to the centre of data work.

Confirm vs Explore

Two modes of data work

Exploratory

Open-ended. What is going on here? Generates questions and hypotheses. Forgiving of surprises. Where every project should start.

Confirmatory

Pre-specified. Is this specific claim true? Tests a hypothesis fixed in advance. Where EDA leads, but is not the same step.

Danger: do not let exploration masquerade as confirmation. A pattern you found by digging is a hypothesis to test on fresh data, not a proven result.

Let It Speak

Let the data speak first

Practitioners often arrive with a story they want the data to confirm. EDA asks you to hold that story loosely and let the numbers talk back — to notice the variable that is half missing, the district that behaves oddly, the impossible age.

The greatest value of a picture is when it forces us to notice what we never expected to see.

— John W. Tukey

Understand First

Understand before you model

01 · EXPLORE: what is in this dataset? what is wrong with it?
02 · DESCRIBE: distributions, gaps, relationships
03 · QUESTION: what hypotheses does this suggest?
04 · ONLY THEN: model, test, conclude

Skipping EDA is how analysts end up modelling a coding error, averaging a top-coded variable, or reporting on a subgroup that is mostly missing data.

Why It Pays

What good EDA catches early

Errors — impossible ages, negative incomes, duplicated households
Gaps — a variable missing for 30% of women, not at random
Shape — consumption is skewed, so the mean misleads
Structure — the survey is clustered and weighted, not a simple random sample
Surprises — the outlier district that is the real story

Every hour of EDA saves a day of rework — and prevents a wrong number reaching a decision.

Roadmap

How this course is built

Foundations

Indian household survey data
The EDA workflow
Variables, codebooks & measurement

Practice

Univariate, bivariate & multivariate views
Missingness, outliers, weights
Honest visuals & reproducible tools

Throughout, the data you meet is patterned on India's NSS, PLFS and NFHS — the surveys you will actually open at work.

02

Section Two

Household Survey Data in India

The Unit

What is a household survey?

Household survey

A survey that samples dwellings (households), then collects information about the household as a whole and about each member — the workhorse design behind NSS, PLFS, NFHS and most development data.

Because so much of welfare — consumption, sanitation, who eats, who decides — happens at the household level, the household is the natural sampling unit. But many questions are about people within it.

Structure

Two rosters: households and members

Household roster

One row per household: assets, dwelling type, water source, ration card, total members. Keyed by a household ID.

Member roster

One row per person: age, sex, relation to head, education, work, health. Keyed by household ID + person line number.

EDA must respect this hierarchy. Household-level and person-level variables live in different files and must be merged on the household ID before you can analyse them together.
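A minimal sketch of that merge, using only the Python standard library. The two rosters, the variable names (`hh_id`, `water_source`, `ration_card`) and the values are all hypothetical; real files will use the names in the survey's codebook.

```python
# Hypothetical rosters; variable names are assumptions for illustration.
households = {
    "10231": {"water_source": "Piped", "ration_card": "Yes"},
    "10232": {"water_source": "Handpump", "ration_card": "No"},
}
members = [
    {"hh_id": "10231", "line": 1, "age": 44},
    {"hh_id": "10231", "line": 2, "age": 40},
    {"hh_id": "10232", "line": 1, "age": 67},
    {"hh_id": "10299", "line": 1, "age": 30},  # no matching household record
]

merged, unmatched = [], []
for person in members:
    hh = households.get(person["hh_id"])
    if hh is None:
        unmatched.append(person)          # keep failed matches visible
    else:
        merged.append({**person, **hh})   # household values repeat per member

print(len(merged), "merged;", len(unmatched), "unmatched")
```

Counting the unmatched rows after every merge is a cheap check: a non-zero count often points to an ID that was loaded with the wrong type or a file from a different round.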

Long & Wide

Hierarchical data is nested

HH ID	Person line	Age	Sex	Education
10231	1	44	Male	Secondary
10231	2	40	Female	Primary
10231	3	16	Female	Secondary
10231	4	11	Male	Primary
10232	1	67	Female	None

The first four rows are one household (10231); the fifth belongs to a second household (10232). To count households you need unique HH IDs; to count people you count rows. Confusing the two is a classic early error.
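The two counts can be sketched in a few lines, using the five rows from the table above. The tuple layout is an illustrative choice, not a survey file format.

```python
# Member roster from the table above: one tuple per person.
# Field order (hh_id, line, age, sex, education) is an illustrative layout.
members = [
    ("10231", 1, 44, "Male", "Secondary"),
    ("10231", 2, 40, "Female", "Primary"),
    ("10231", 3, 16, "Female", "Secondary"),
    ("10231", 4, 11, "Male", "Primary"),
    ("10232", 1, 67, "Female", "None"),
]

n_people = len(members)                           # people = rows
n_households = len({row[0] for row in members})   # households = unique IDs

print(n_people, n_households)
```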

The Big Three

India's major household surveys

Survey	What it covers	Who runs it	Frequency
NSS (consumption, etc.)	Consumption expenditure, employment, social consumption	NSSO / MoSPI	Rounds (subject rotates)
PLFS	Labour force: work, unemployment, wages	NSSO / MoSPI	Annual since 2017–18
NFHS	Health, nutrition, fertility, anaemia, women's status	IIPS / MoHFW	~5 yrs (NFHS-5: 2019–21)
CMIE-CPHS	Household income, consumption, sentiment	CMIE (private)	Continuous, 3 waves/yr

Know these by name and by job. The right survey for a question depends on what it measures and how recently.

NSS & PLFS

NSS and PLFS: the official workhorses

The National Sample Survey (NSS) has measured consumption and employment for decades. The Periodic Labour Force Survey (PLFS) took over labour statistics in 2017–18, giving annual estimates for rural and urban areas and quarterly estimates for urban areas.

NSS: consumption & expenditure rounds underpin official poverty estimates (NSSO / MoSPI)
Annual: PLFS gives the labour force participation and unemployment rates since 2017–18 (PLFS, MoSPI)

NFHS

NFHS: health, nutrition and gender

The National Family Health Survey is India's Demographic and Health Survey. NFHS-5 (2019–21) covered the health, nutrition and demographic situation of women, children and households across every district.

~636,000 households interviewed in NFHS-5 (NFHS-5, 2019–21)
707 districts covered (IIPS / MoHFW)
5 rounds since 1992–93 (IIPS)

CMIE-CPHS

CMIE-CPHS: high-frequency private data

The Consumer Pyramids Household Survey (CPHS), run by the private firm CMIE, tracks a large panel of households continuously, giving fast readings on income, spending and unemployment between official rounds.

Useful for timeliness, but debated on sampling and representativeness. Treat it as complementary to — not a replacement for — the official surveys, and read the methodology critics.

Unit of Analysis

Decide your unit before you compute

The unit of analysis is the entity each row of your working table represents — a household, a person, a child under five, a woman aged 15–49. Every statistic is implicitly 'per' some unit.

A rate like the anaemia prevalence is per eligible person, not per household. Computing it on the household file, or without restricting to the eligible group, gives the wrong answer.
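A small sketch of why the eligible group matters. The records, the field names and the 15-49 universe for women are illustrative assumptions, and the rates are unweighted for simplicity.

```python
# Hypothetical person-level records; names and values are illustrative only.
persons = [
    {"sex": "F", "age": 23, "anaemic": True},
    {"sex": "F", "age": 35, "anaemic": False},
    {"sex": "F", "age": 47, "anaemic": True},
    {"sex": "F", "age": 62, "anaemic": False},  # outside 15-49: not eligible
    {"sex": "M", "age": 30, "anaemic": False},  # not in this universe
    {"sex": "F", "age": 19, "anaemic": False},
]

def is_eligible(p):
    """Universe assumed for this indicator: women aged 15-49."""
    return p["sex"] == "F" and 15 <= p["age"] <= 49

eligible = [p for p in persons if is_eligible(p)]
prevalence = sum(p["anaemic"] for p in eligible) / len(eligible)
naive = sum(p["anaemic"] for p in persons) / len(persons)

print(f"eligible women: {prevalence:.0%}, all rows: {naive:.0%}")
```

The numerator is the same in both lines; only the denominator changes, and with it the answer.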

Design

These are not simple random samples

01 · STRATIFY: split by state × rural/urban
02 · STAGE 1: sample villages / urban blocks (clusters)
03 · STAGE 2: sample households within each cluster
04 · RESULT: a stratified, multistage, clustered sample

Because selection happens in stages and some groups are over-sampled, every household carries a survey weight. We return to weights in Section Eight — they change your EDA.

First Look

What to check the moment a file opens

How many rows, and what is one row — a household or a person?
Is there a household ID, a person line number, a weight variable?
Which file is the household roster, which the member roster?
What does the documentation say the unit and reference period are?

Read the survey's report and documentation before the data. Five minutes there saves hours of confusion later.

03

Section Three

The EDA Workflow

The Loop

Six steps, repeated

01 · LOAD: read the data correctly
02 · INSPECT: structure, types, size
03 · CLEAN: codes, gaps, impossible values
04 · DESCRIBE: summaries per variable
05 · VISUALISE: plot distributions & relations
06 · QUESTION: form hypotheses, then loop back

EDA is iterative, not linear. Each plot raises a question that sends you back to inspect or clean. Expect to go round this loop many times.

Load

Step 1 — load it right

Read the right file format — fixed-width, CSV, .dta, .sav, .csv.gz
Make sure ID and code columns load as text, not numbers (leading zeros vanish otherwise)
Preserve special missing codes; do not let them become real values
Check the row count against the documentation

A household ID like 0457 silently becoming 457 will break every merge. Type matters from the very first line.
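A sketch of the leading-zero problem, assuming a small CSV extract with hypothetical column names. Python's `csv` module reads every field as text, so the only conversion is the one you ask for.

```python
import csv
import io

# A tiny stand-in for a survey extract; column names are illustrative.
raw = io.StringIO("hh_id,state,hh_size\n0457,09,5\n1203,27,3\n")

rows = []
for rec in csv.DictReader(raw):          # csv reads every field as text
    rows.append({
        "hh_id": rec["hh_id"],           # identifier: keep as text
        "state": rec["state"],           # code: keep as text
        "hh_size": int(rec["hh_size"]),  # true quantity: convert
    })

print(rows[0]["hh_id"])                  # leading zero preserved
print(str(int(rows[0]["hh_id"])))        # what a numeric load would leave
```

Tools that guess column types need the same instruction given explicitly; in pandas, for example, `read_csv` accepts a `dtype` mapping such as `{"hh_id": str}`.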

Inspect

Step 2 — inspect the skeleton

Question	What you are checking
How many rows & columns?	Size and whether the file is complete
What type is each column?	Numeric, text, date, categorical codes
What is the range of each?	Min, max, plausibility
How many distinct values?	Categorical levels, accidental duplicates
How many missing per column?	Where the gaps are

This is the data-equivalent of a doctor's first examination — vital signs before any diagnosis.
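The checks in the table can be wrapped into one small profiling function. This is a sketch on a hypothetical four-row extract where `None` stands for a blank cell; the column names are assumptions.

```python
# Hypothetical extract; None marks a blank cell. Names are illustrative.
rows = [
    {"hh_id": "0457", "age": 44, "sex": "1"},
    {"hh_id": "0457", "age": 230, "sex": "2"},
    {"hh_id": "0458", "age": None, "sex": "2"},
    {"hh_id": "0459", "age": 11, "sex": "1"},
]

def profile(rows, column):
    """Return size, missing count, distinct values and range for one column."""
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if numeric and present else None,
        "max": max(present) if numeric and present else None,
    }

for col in rows[0]:
    print(col, profile(rows, col))
```

Here the range check on `age` surfaces the impossible value 230 before any summary is computed.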

Clean

Step 3 — clean transparently

Cleaning means standardising codes, recoding missing values, fixing types and flagging impossible entries — in a script, never by hand-editing the raw file.

Golden rule: keep the raw extract read-only. Every change lives in code, so it is documented, reversible and reproducible. We expand on this in Section Eleven.

Describe

Step 4 — describe every variable

For continuous variables: min, max, mean, median, quartiles, SD, % missing
For categorical variables: a frequency table of every level
Flag anything that looks impossible or surprising for a second look
Always describe before you visualise — numbers anchor the eye

Visualise

Step 5 — plot to see the shape

Summaries can hide as much as they reveal. A histogram shows skew and bimodality a mean cannot; a scatter shows a curved relationship a correlation cannot. Always plot.

Numerical calculations are exact, but graphs are rough.

— F. J. Anscombe (1973), listing it as a notion to unlearn; in practice we need both

Question

Step 6 — turn findings into questions

Good EDA ends not with answers but with sharper questions: why is consumption bimodal here? why is this variable missing more for women? is the district outlier real or an error?

Write the questions down. They become your analysis plan — and the line between what EDA suggested and what a later test confirmed.

04

Section Four

Knowing Your Variables

The Key

The data dictionary is your map

Data dictionary / codebook

The document that defines every variable: its name, meaning, units, allowed values, the codes for each category, the special missing codes, and who it applies to.

A survey dataset without its codebook is a locked box. The number '2' could mean female, urban, 'no', or a missing code — only the dictionary tells you which.

Read It First

What the codebook tells you

Codebook field	Why EDA needs it
Variable label	What it actually measures
Value codes	1 = yes, 2 = no, 9 = missing
Units	Rupees? months? per week?
Universe / who answers	Only women 15–49? only workers?
Reference period	Last 7 days? last 30? last year?
Skip patterns	Why a block is blank for some rows

The 'universe' field explains most 'missing' data: a question about pregnancy is blank for men by design, not by error.

Levels

Four levels of measurement

Level	Meaning	Survey example	Valid maths
Nominal	Labels, no order	Religion, state, ration-card type	Counts, mode
Ordinal	Ordered, unequal gaps	Education level, wealth quintile	Median, rank
Interval	Equal gaps, no true zero	Year of birth	Mean, difference
Ratio	Equal gaps, true zero	Age, income, consumption	All, ratios

The level decides which statistics are legal. You cannot average religion codes, and a wealth quintile is a rank (1–5), not a rupee amount.

Categorical vs Continuous

The first sort: categories or measures

Categorical

A fixed set of labels — sex, caste category, district, employment status. You count and tabulate these.

Continuous

A measured quantity — age, income, MPCE, height. You take means, medians and histograms of these.

Watch the trap of coded categoricals: education stored as 1–8 looks numeric, but its mean is meaningless. Check the codebook, not the column type.

Coded Numbers

Numbers that are really labels

Survey files store almost everything as numbers to save space. State = 9 means Uttar Pradesh, not the quantity nine. Treating such codes as measurements is one of the commonest EDA errors.

Before computing any mean, ask: is this a quantity or a code? The dictionary, not the data type, decides.

Special Codes

Missing codes hide in plain sight

Code	Often means	Risk if treated as a value
99 / 999	Not known / not stated	Inflates the mean enormously
97 / 98	Refused / not applicable	Phantom category in tables
0	Sometimes a real zero, sometimes 'none'	Ambiguous — check codebook
Blank	Skip pattern or true missing	Silent loss of cases

Recode special codes to explicit missing before any summary. An implausibly large mean income, running into lakhs of rupees, usually means a 9999999 'not stated' code slipped through.
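A sketch of the recode on a hypothetical 'years of schooling' column. The code 99 for 'not stated' is an assumption for illustration; the real codes come from the codebook.

```python
from statistics import mean

# Hypothetical 'years of schooling' column where 99 means 'not stated'.
# The code 99 is an assumption; the real one comes from the codebook.
MISSING_CODES = {99}
schooling = [5, 8, 10, 99, 0, 12, 99, 7]

recoded = [None if v in MISSING_CODES else v for v in schooling]
valid = [v for v in recoded if v is not None]

naive_mean = mean(schooling)   # treats 99 as a real value
clean_mean = mean(valid)       # missing excluded explicitly
n_missing = recoded.count(None)

print(naive_mean, clean_mean, n_missing)
```

Two stray codes in eight records are enough to move the mean from 7 to 30 years. The 0 is kept because here it is a real zero, which is exactly the kind of decision the codebook has to settle.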

Derived Variables

Variables you build yourself

Much of EDA involves deriving variables: monthly per-capita consumption from total expenditure and household size; age groups from age; a binary 'has toilet' from a coded sanitation variable.

Derive in a script, label the new variable clearly, and sanity-check its distribution. A derived variable inherits every error in its inputs.
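The three derivations mentioned above might look like this in a script. The records, the sanitation codes and the age bands are hypothetical choices made for the sketch.

```python
# Hypothetical household records; names and codes are illustrative only.
households = [
    {"hh_id": "10231", "total_exp": 18000, "hh_size": 4, "sanitation": 1},
    {"hh_id": "10232", "total_exp": 4200, "hh_size": 1, "sanitation": 4},
    {"hh_id": "10233", "total_exp": 9000, "hh_size": 0, "sanitation": 2},
]
TOILET_CODES = {1, 2}  # assumed: 1 = improved, 2 = improved but shared

for hh in households:
    # Guard against an impossible household size instead of dividing by it.
    hh["mpce"] = hh["total_exp"] / hh["hh_size"] if hh["hh_size"] > 0 else None
    hh["has_toilet"] = hh["sanitation"] in TOILET_CODES

def age_group(age):
    """Collapse age in years into broad, labelled bands."""
    if age < 15:
        return "0-14"
    if age < 60:
        return "15-59"
    return "60+"

print([hh["mpce"] for hh in households], age_group(44))
```

The third household shows the inheritance problem: a household size of 0 is an input error, so the derived MPCE is set to missing rather than computed.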

Profiling

Build a variable profile

Name and meaning (from the codebook)
Type: categorical or continuous; level of measurement
Range or set of levels; special missing codes
Universe: who is supposed to have a value
% missing, and whether the missingness looks patterned

A one-line profile per variable, written as you go, is the backbone of trustworthy analysis.

05

Section Five

Univariate Exploration

One at a Time

Start with one variable

Univariate exploration looks at each variable on its own — its frequencies, its distribution, its centre and spread. It is where most data problems first surface.

01 · CATEGORICAL: frequency table, bar chart
02 · CONTINUOUS: histogram, density, summary stats
03 · BOTH: count and inspect the missing

Frequency Tables

Count every level

Sanitation facility	Households	%
Improved, not shared	6,420	64.2
Improved, shared	1,180	11.8
Unimproved	910	9.1
Open defecation	1,390	13.9
Not stated	100	1.0

Illustrative, patterned on NFHS-style categories. A frequency table is the first thing to run on any categorical variable — it reveals tiny categories, typos and stray codes at a glance.
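A frequency table takes only a few lines with the standard library. The responses below are hypothetical and deliberately include a lower-case typo and a stray code to show what the table exposes.

```python
from collections import Counter

# Hypothetical responses; real labels would come from the codebook.
responses = ["Improved", "Improved", "improved", "Unimproved",
             "Open defecation", "Improved", "9", "Unimproved"]

counts = Counter(responses)
total = sum(counts.values())

# Sort by count so tiny categories and stray codes sit at the bottom.
for level, n in counts.most_common():
    print(f"{level:<16} {n:>3} {100 * n / total:5.1f}%")
```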

Histograms

The histogram of a skewed variable

[Figure: histogram of monthly per-capita consumption expenditure (MPCE). Illustrative, patterned on NSS consumption distributions.]

Note the long right tail. Consumption, like income and landholding, is right-skewed — a few households spend many times the typical amount.

Reading Shape

What a histogram reveals

Centre: where the bulk of households sit
Spread: how wide the distribution is
Skew: a long tail on one side
Modes: one peak, or two (a hidden subgroup?)
Gaps & spikes: heaping at round numbers, or a wall at a top-code

The bin width matters: too wide hides structure, too narrow shows noise. Try a few widths before you conclude.

Density

Density: the smoothed histogram

A density plot smooths the histogram into a curve, making it easier to compare shapes — for example, the consumption distribution for rural versus urban households on one set of axes.

Smoothing is a choice: too much smoothing erases real peaks; too little invents them. Treat the curve as one view, not the truth — and keep the histogram alongside.

Summary Stats

The five-number summary

Min & Max: the extremes; sanity-check both
Q1, Median, Q3: the middle and the quartiles
IQR: Q3 − Q1, the robust spread of the middle 50%

Tukey's five-number summary (min, Q1, median, Q3, max) describes a distribution without assuming any shape — the heart of EDA.
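A sketch of the five-number summary and the IQR on nine hypothetical MPCE values. Quartiles have several competing definitions; this version uses the "inclusive" method of Python's `statistics.quantiles`, and other tools may give slightly different Q1 and Q3.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" treats the data as the whole group being described;
    # other quantile conventions give slightly different quartiles.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical MPCE values in rupees, with one very high spender.
mpce = [900, 1200, 1500, 1800, 2100, 2400, 3000, 4200, 15000]
low, q1, med, q3, high = five_number_summary(mpce)
iqr = q3 - q1

print(low, q1, med, q3, high, iqr)
```

Note how the single value of 15,000 shows up only in the maximum; the quartiles and the IQR are untouched by it.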

Mean vs Median

For skewed money, trust the median

[Figure: mean pulled above the median by the right tail. Illustrative.]

The mean sits well above the median because the long tail of high spenders drags it up. For consumption, income and land, report the median.
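The gap is easy to reproduce on a handful of hypothetical MPCE values, one of which is a high spender.

```python
from statistics import mean, median

# Hypothetical MPCE values in rupees; the last household is a high spender.
mpce = [900, 1200, 1500, 1800, 2100, 2400, 3000, 4200, 15000]

avg = mean(mpce)
mid = median(mpce)
share_below_mean = sum(v < avg for v in mpce) / len(mpce)

print(round(avg), mid, f"{share_below_mean:.0%} of households sit below the mean")
```

When most households sit below the 'average', the mean is describing the tail rather than the typical household.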

Skew

Naming the skew

Right-skewed

Long tail to the right. Mean > median. Income, MPCE, landholding, firm size. The common case in development data.

Left-skewed

Long tail to the left. Mean < median. Rarer — e.g. age at death in a high-survival population.

Skew is a signal, not a defect. It tells you which centre to report and warns you off methods that assume symmetry.

Transforms

The log scale tames the tail

For strongly right-skewed money variables, plotting on a logarithmic scale spreads the squashed low values and pulls in the tail, often revealing a near-symmetric shape that is easier to read.

A transform is an exploratory lens, not a fact about the world. Always label the axis as logged, and remember to back-transform before reporting rupee figures.
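One detail of back-transforming is worth seeing in numbers: the exponential of the mean of the logs is the geometric mean, not the arithmetic mean. A sketch on hypothetical values, using natural logs:

```python
import math
from statistics import mean

# Hypothetical right-skewed MPCE values in rupees.
mpce = [900, 1200, 1500, 1800, 2100, 2400, 3000, 4200, 15000]

log_mpce = [math.log(v) for v in mpce]   # natural log; requires v > 0
mean_log = mean(log_mpce)

# Back-transforming the mean of the logs gives the geometric mean,
# which is not the same number as the arithmetic mean in rupees.
geometric_mean = math.exp(mean_log)
arithmetic_mean = mean(mpce)

print(round(geometric_mean), round(arithmetic_mean))
```

The log is undefined at zero, so zero-consumption records need a decision (and a note) before the transform is applied.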

06

Section Six

Spotting Trouble

The Watchlist

What can be wrong with a column

Problem	Survey example	How EDA spots it
Missing values	Income blank for some	Count of NA per variable
Impossible values	Age = 230, −3 children	Min/max range check
Special codes as data	99 = 'not stated'	Spike at 99 in histogram
Top-coding	Income capped at a max	Wall of cases at the ceiling
Heaping	Ages bunched at values ending in 0 or 5	Comb pattern in histogram
Duplicates	Same HH twice	Duplicate household IDs

Missingness

Map where the gaps are

[Figure: share of records missing, by variable. Illustrative.]

A missingness bar chart across variables is one of the most useful EDA plots. Income and 'not applicable' fields are usually the worst — and that is rarely random.

Why Missing

Why a value is missing matters most

By design (skip): the question did not apply — not a problem
Missing at random: gaps unrelated to the value — least harmful
Missing not at random: the richest refuse to state income — this biases results

Dropping rows with gaps can quietly delete the very households you care about. Before deleting or filling, ask why the value is absent — the pattern of absence is itself data.

Missingness Patterns

Look for structure in the gaps

Cross-tabulate missingness against other variables: is income missing more for richer households? is a health variable missing more in one state? Patterned missingness is a finding, not just a nuisance.

Create a 'missing flag' variable and explore it like any other — who is missing, and how do they differ from who is present?
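A sketch of the missing flag and its cross-tabulation against one grouping variable. The records and field names are hypothetical, and `None` stands for a missing income.

```python
from collections import defaultdict

# Hypothetical person records; None marks a missing income. Illustrative only.
people = [
    {"sex": "F", "income": None}, {"sex": "F", "income": 6000},
    {"sex": "F", "income": None}, {"sex": "F", "income": 7500},
    {"sex": "M", "income": 9000}, {"sex": "M", "income": None},
    {"sex": "M", "income": 8000}, {"sex": "M", "income": 12000},
]

tally = defaultdict(lambda: [0, 0])             # group -> [missing, total]
for p in people:
    p["income_missing"] = p["income"] is None   # the missing flag
    tally[p["sex"]][0] += p["income_missing"]
    tally[p["sex"]][1] += 1

missing_share = {g: m / n for g, (m, n) in tally.items()}
print(missing_share)
```

The same loop works for any grouping variable: swap `sex` for state, sector or social group and compare the shares.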

Outliers

Outliers: error, or the story?

Could be an error

Age 230, a household of 60, income of ₹0 with five earners — check for a data-entry slip or a stray code before analysing.

Could be the story

The one district with triple the stunting rate may be exactly where the programme is needed. Do not delete it — investigate it.

Never silently drop outliers. Flag them, explain them, and decide transparently — and record it in the EDA log.

Finding Outliers

Rules of thumb for flagging extremes

Range checks: can this value exist at all? (% in 0–100)
IQR rule: flag points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
Logic checks: a 6-year-old cannot be married with children
Visual: the point detached from the cloud in a scatter

These rules flag candidates for human review. They do not decide — judgement, not a threshold, removes a value.
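The IQR rule from the list above, sketched on hypothetical household sizes. The quartile method ("inclusive") is one of several conventions, so the exact fences can differ slightly between tools.

```python
from statistics import quantiles

def iqr_flags(values, k=1.5):
    """Return values beyond the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]

# Hypothetical household sizes; 60 may be an error or an institution.
hh_size = [3, 4, 4, 5, 5, 5, 6, 6, 7, 60]
candidates = iqr_flags(hh_size)

print(candidates)   # candidates for review, not automatic deletions
```

The function returns a list to look at; nothing is dropped from `hh_size`.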

Impossible Values

Logic the data must obey

Rule	Violation to flag
Age between 0 and ~110	Age = 230, age = −1
Percentages in 0–100	Vaccination = 140%
Members ≥ earners	8 earners in a 4-person household
Mother older than child by a plausible margin	Mother 14, child 10
Consumption > 0	MPCE = 0 with members present

Codify these checks once and re-run them on every extract. Built-in validation beats hoping you will notice.
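One way to codify such checks is a list of named rules that is re-run on every extract. This is a sketch with hypothetical field names; the rules mirror three rows of the table above.

```python
# Hypothetical household records; field names are illustrative assumptions.
households = [
    {"hh_id": "A1", "members": 4, "earners": 2, "mpce": 2100, "head_age": 44},
    {"hh_id": "A2", "members": 4, "earners": 8, "mpce": 1800, "head_age": 39},
    {"hh_id": "A3", "members": 3, "earners": 1, "mpce": 0, "head_age": 230},
]

# Each rule is (label, function returning True when the record is valid).
RULES = [
    ("age in 0-110", lambda r: 0 <= r["head_age"] <= 110),
    ("members >= earners", lambda r: r["members"] >= r["earners"]),
    ("consumption > 0", lambda r: r["mpce"] > 0),
]

def validate(records, rules=RULES):
    """Return (hh_id, rule label) for every rule a record breaks."""
    return [(r["hh_id"], label)
            for r in records
            for label, ok in rules
            if not ok(r)]

for hh_id, label in validate(households):
    print(hh_id, "fails:", label)
```

Keeping the rules in one list means a new check is a one-line addition, and the output doubles as an entry for the EDA log.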

Top-Coding

Top-coding: the artificial ceiling

Top-coding

Capping a variable at a maximum value to protect privacy or limit outliers — e.g. all incomes above ₹10 lakh recorded as exactly ₹10 lakh. It creates a spike of identical values at the ceiling.

Top-coding makes the mean and the upper tail unreliable. Spot it as a wall of cases at one value, and note that you cannot study the top accurately from such data.
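A quick numeric check for that wall of cases: what share of records sits exactly at the maximum? The incomes below are hypothetical and capped at ₹10 lakh for illustration.

```python
from collections import Counter

def share_at_maximum(values):
    """Return (max value, share of records sitting exactly at it)."""
    top = max(values)
    return top, Counter(values)[top] / len(values)

# Hypothetical annual incomes in rupees, capped at 10 lakh (1,000,000).
income = [120000, 250000, 310000, 480000, 1000000,
          1000000, 90000, 1000000, 640000, 1000000]

ceiling, share = share_at_maximum(income)
print(ceiling, share)   # a large share at one exact maximum suggests a cap
```

In an uncapped continuous variable the maximum is usually a single record, so a large share at one round value is a strong hint to look for a top-code in the documentation.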

Before / After

Cleaning changes the picture

[Figure: age distribution before and after removing impossible values. Illustrative.]

The impossible '>110' sliver vanishes after recoding it to missing. Small in count, but it would have distorted any mean age — and revealed a data-entry problem worth reporting.

07

Section Seven

Bivariate & Multivariate Exploration

Two at a Time

Now look at variables together

Bivariate exploration asks how two variables move together; multivariate brings in a third. This is where relationships, gaps and confounders start to appear.

01 · CAT × CAT: cross-tabulation
02 · CAT × CONTINUOUS: grouped summaries
03 · CONTINUOUS × CONTINUOUS: scatter plot
04 · MANY: correlation matrix, faceting

Cross-Tabs

Cross-tabulation: two categoricals

Sector	Has toilet	No toilet	% with toilet
Rural	5,180	2,020	71.9
Urban	2,610	190	93.2
All	7,790	2,210	77.9

Illustrative. A cross-tab is the bivariate workhorse. The key choice is the direction of percentages: row percents answer 'of rural households, what share have a toilet?' — usually what you want.

Row vs Column

Percentage in the right direction

Row %

Within each row group: 'of rural households, X% have a toilet.' Compares the outcome across groups.

Column %

Within each column: 'of households with a toilet, X% are rural.' A different question entirely.

Choosing the wrong direction silently answers a different question. State the comparison in words before you tabulate.
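Both directions can be computed from the same counts. The sketch below reuses the illustrative rural/urban toilet counts from the cross-tabulation earlier in this section; the dictionary layout is just one convenient way to hold them.

```python
# Counts from the illustrative rural/urban toilet cross-tab.
table = {
    "Rural": {"toilet": 5180, "no_toilet": 2020},
    "Urban": {"toilet": 2610, "no_toilet": 190},
}

# Row %: within each sector, the share of households with a toilet.
row_pct = {sector: 100 * c["toilet"] / (c["toilet"] + c["no_toilet"])
           for sector, c in table.items()}

# Column %: among households with a toilet, the share in each sector.
toilet_total = sum(c["toilet"] for c in table.values())
col_pct = {sector: 100 * c["toilet"] / toilet_total
           for sector, c in table.items()}

print({k: round(v, 1) for k, v in row_pct.items()})
print({k: round(v, 1) for k, v in col_pct.items()})
```

The rural row percent is about 72, while the rural column percent is about 67: the same cell, two different questions.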

Grouped Summaries

A continuous variable, split by groups

[Figure: median MPCE by social group. Illustrative, patterned on NSS consumption by social group.]

Grouping a continuous variable by a category — here median consumption by social group — is the single most common development EDA move. Use the median for skewed money.

Quartiles by Group

Compare spread, not just centre

[Figure: MPCE quartiles by sector; the spread differs. Illustrative.]

A true box plot is awkward in this toolkit, so we plot the quartiles directly. Urban consumption is both higher and more spread out — a fact the medians alone would hide.

Scatter

Two continuous variables: the scatter

[Figure: female literacy (%) vs total fertility rate, major states. Illustrative, patterned on Census 2011 & NFHS-5.]

A clear negative pattern, but remember it is a state-level picture. The ecological fallacy warns against reading it as an individual truth.

Correlation

Correlation, and its limits

Correlation (r)

A number from −1 to +1 summarising how strongly two continuous variables move together linearly. It captures direction and strength — but only of a straight-line relationship.

Correlation is not causation — a confounder may drive both
r misses curves — a strong U-shape can give r near 0
Outliers move r a lot — one point can fake a relationship
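Two of these limits can be demonstrated on toy numbers. The sketch below computes Pearson's r by hand (to avoid depending on a particular library) for a perfect U-shape and for a flat cloud with and without one extreme point; all values are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
u_shape = [v ** 2 for v in x]            # perfect curve, no linear trend
r_curve = pearson_r(x, u_shape)

flat = [5, 5.1, 4.9, 5, 5.1, 4.9, 5]     # no real relationship with x...
r_flat = pearson_r(x, flat)
r_outlier = pearson_r(x + [30], flat + [40])   # ...until one extreme point

print(round(r_curve, 3), round(r_flat, 3), round(r_outlier, 3))
```

The U-shape is a perfect relationship with r of zero, and a single added point turns a weak r into one close to 1. Neither would be visible without plotting the pair.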

Correlation Heatmap

The correlation matrix idea

With many continuous variables, compute every pairwise correlation and lay them out as a colour-shaded heatmap — dark for strong, pale for weak. It is a fast scan for which variables travel together.

Treat the heatmap as a map of questions, not answers. A strong cell says 'look here', then you plot the pair to see whether the relationship is real, curved or driven by an outlier.

Disaggregate

Always split before you conclude

A relationship in the whole sample can hide, or even reverse, within subgroups — Simpson's paradox. A scheme can look worse overall yet be better in every state if the states differ in size and baseline.

Disaggregate by sex, sector, caste, state as part of routine EDA. The aggregate can point the opposite way to the truth.
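The reversal is pure arithmetic, as a small invented example shows. The two states, the counts and the outcome are all hypothetical; the point is only that unequal group sizes and baselines can flip the aggregate.

```python
# Hypothetical counts: (households with the good outcome, households observed).
# All numbers are invented to illustrate the arithmetic, not drawn from a survey.
data = {
    "State A": {"scheme": (90, 100), "no_scheme": (800, 1000)},
    "State B": {"scheme": (300, 1000), "no_scheme": (20, 100)},
}

def rate(success, total):
    return success / total

by_state = {state: {arm: rate(*counts) for arm, counts in arms.items()}
            for state, arms in data.items()}

overall = {}
for arm in ("scheme", "no_scheme"):
    success = sum(data[s][arm][0] for s in data)
    total = sum(data[s][arm][1] for s in data)
    overall[arm] = rate(success, total)

print(by_state)
print(overall)
```

Scheme households do better in both states (90% vs 80%, 30% vs 20%) yet worse overall, because most scheme households are in the low-baseline state.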

08

Section Eight

Survey Weights & Design Effects in EDA

The Catch

Unweighted exploration misleads

Household surveys are not simple random samples. Some groups are deliberately over-sampled and response rates vary, so the raw sample does not mirror the population. Exploring it unweighted gives the wrong picture.

A raw mean from NFHS or PLFS unit data is an estimate for the sample, not the population. To speak about India, you must apply the survey weights.

What Weights Are

Weights: how many people a row represents

Survey weight

A number attached to each record saying how many people or households in the population that one respondent stands for. It corrects for unequal selection probabilities and non-response.

If poor districts were over-sampled, their households carry smaller weights so they do not dominate the national figure; under-sampled groups carry larger weights.
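A minimal sketch of a weighted versus an unweighted rate, with two hypothetical states and invented weights. It only illustrates the point estimate; standard errors need survey-aware tools.

```python
# Hypothetical sample: each record is (state, practises open defecation, weight).
# State X is over-sampled, so each of its households carries a smaller weight.
sample = (
    [("X", True, 50)] * 30 + [("X", False, 50)] * 70 +    # 100 interviews
    [("Y", True, 400)] * 5 + [("Y", False, 400)] * 45     # 50 interviews
)

unweighted = sum(od for _, od, _ in sample) / len(sample)
weighted = (sum(w for _, od, w in sample if od)
            / sum(w for _, _, w in sample))

print(f"unweighted {unweighted:.1%}, weighted {weighted:.1%}")
```

State X supplies two-thirds of the interviews but only a fifth of the population the weights represent, so its high rate dominates the unweighted figure and shrinks back once the weights are applied.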

Why Over-Sample

Why surveys over-sample on purpose

To report reliably on a small group — a small state, a minority, an urban slum — the survey must interview enough of them. So it deliberately over-samples, then uses weights to restore the correct national balance.

This is a feature, not a flaw. But it means the unweighted sample over-represents those groups — which is exactly why weights exist.

See the Gap

Weighted vs unweighted: a worked gap

[Figure: estimated open-defecation rate, weighted vs unweighted. Illustrative.]

State estimates match, but the national figure differs: an over-sampled high-rate state inflates the unweighted national average. Weighting fixes it — a four-point swing in the headline.

Design Effect

The design effect (DEFF)

Design effect (DEFF)

How much the clustered, weighted design inflates the variance of an estimate compared with a simple random sample of the same size. DEFF = 2 means your effective sample is half the nominal one.

Because households in the same village are similar, clustering means each extra interview adds less new information than a fresh random draw would. The DEFF quantifies that loss.

Effective Sample

Your sample is smaller than it looks

n = 10,000: nominal sample size
DEFF = 2: typical for a clustered survey
≈ 5,000: effective sample size (n ÷ DEFF)

Ignore the design effect and your confidence intervals will be too narrow — you will claim more precision than the data supports.
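The effect on precision can be sketched with the usual approximation that the design effect multiplies the variance, so the standard error grows by √DEFF. The proportion of 0.30 is a hypothetical estimate, and the formula is the simple textbook one for a proportion, not what a full survey package computes.

```python
import math

def effective_n(n, deff):
    """Effective sample size under a design effect."""
    return n / deff

def proportion_se(p, n, deff=1.0):
    """Approximate standard error of a proportion, inflated by sqrt(DEFF)."""
    return math.sqrt(deff * p * (1 - p) / n)

n, deff, p = 10_000, 2.0, 0.30     # p is a hypothetical estimated proportion

se_srs = proportion_se(p, n)             # pretends it is a simple random sample
se_design = proportion_se(p, n, deff)    # accounts for the design effect

print(effective_n(n, deff))
print(f"95% CI half-width: {1.96 * se_srs:.4f} vs {1.96 * se_design:.4f}")
```

With DEFF = 2 the interval is about 41% wider than the naive one, which is the same as analysing 5,000 independent households instead of 10,000.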

In Practice

Handling weights and design in EDA

Find the weight, cluster (PSU) and stratum variables in the codebook
Apply weights to every estimate meant to describe the population
Use survey-aware tools so standard errors account for the design
Report whether each figure is weighted — and at what level it is valid

Rule of thumb for EDA: explore shapes unweighted to find problems, but report any population number weighted.

Level of Validity

Respect the design's resolution

A survey is only reliable down to the level it was designed for — usually state or district. Using a state-level estimate to claim something about one block asks the data a question it cannot answer.

Common error: drilling NFHS or PLFS to a tiny subgroup until the cell has 11 households, then reporting a precise percentage. Check the unweighted count behind every estimate.
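
One way to build that check into routine work is to never compute a weighted share without carrying its unweighted count along. The sketch below uses hypothetical records and an assumed minimum cell size of 30; the right threshold depends on your survey's own guidance.

```python
MIN_CELL = 30  # assumed threshold for this sketch; follow your survey's own guidance

def weighted_share(rows, group, outcome="outcome", weight="weight"):
    """Weighted share with the outcome, per group, alongside the raw count behind it."""
    totals = {}
    for r in rows:
        t = totals.setdefault(r[group], {"n": 0, "w_sum": 0.0, "w_yes": 0.0})
        t["n"] += 1
        t["w_sum"] += r[weight]
        t["w_yes"] += r[weight] * r[outcome]
    return {
        g: {
            "share": t["w_yes"] / t["w_sum"],
            "unweighted_n": t["n"],
            "too_small": t["n"] < MIN_CELL,
        }
        for g, t in totals.items()
    }

# Hypothetical records: 11 households in one block, 40 in another.
rows = (
    [{"block": "X", "outcome": i % 2, "weight": 1200.0} for i in range(11)]
    + [{"block": "Y", "outcome": int(i < 10), "weight": 900.0} for i in range(40)]
)
for block, res in weighted_share(rows, "block").items():
    flag = "CHECK: tiny cell" if res["too_small"] else "ok"
    print(block, f"{res['share']:.1%}", res["unweighted_n"], flag)
```

Block X prints a precise-looking 45.5% that rests on only 11 households, which is exactly the kind of figure to flag rather than report.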

09

Section Nine

Visual EDA Done Well

Why Plot

Plotting is the heart of EDA

EDA leans on display because the eye catches what tables hide — skew, clusters, gaps, outliers, curved relationships. The goal in exploration is speed and honesty, not polish.

There is no excuse for failing to plot and look.

— J. W. Tukey & F. Mosteller

Small Multiples

Small multiples: compare many at once

Rather than cramming ten states into one tangled chart, draw ten tiny identical charts in a grid — small multiples. Shared scales let the eye compare shapes effortlessly.

Ideal for EDA across states, sectors or social groups: one consumption histogram per state, same axes, side by side. Patterns and exceptions jump out.

Faceting

Faceting: one plot, split by a variable

Faceting

Splitting a single plot into a grid of panels, one per category of a variable — the same scatter or histogram drawn separately for rural and urban, or for each social group.

Faceting is small multiples generated automatically from your data. It is the fastest way to disaggregate visually and catch a Simpson's-paradox reversal.
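
The reversal that faceting exposes can also be seen in plain numbers. The sketch below uses illustrative counts (not survey data) for a hypothetical programme, split by rural and urban. The pooled comparison points one way and both panels point the other.

```python
# Illustrative counts, assumed for this sketch:
# (households with the outcome, households observed)
counts = {
    ("rural", "programme"):    (81, 87),
    ("rural", "no programme"): (234, 270),
    ("urban", "programme"):    (192, 263),
    ("urban", "no programme"): (55, 80),
}

def rate(pairs):
    """Share with the outcome across one or more (yes, n) pairs."""
    yes = sum(p[0] for p in pairs)
    n = sum(p[1] for p in pairs)
    return yes / n

def pooled(arm):
    """Rate for one arm with rural and urban lumped together."""
    return rate([v for (sector, a), v in counts.items() if a == arm])

def within(sector, arm):
    """Rate for one arm inside a single facet."""
    return rate([counts[(sector, arm)]])

print(f"pooled: programme {pooled('programme'):.0%} vs no programme {pooled('no programme'):.0%}")
for sector in ("rural", "urban"):
    print(f"{sector}: programme {within(sector, 'programme'):.0%} "
          f"vs no programme {within(sector, 'no programme'):.0%}")
```

Pooled, the programme group looks worse (78% vs 83%). Within rural (93% vs 87%) and within urban (73% vs 69%) it looks better. The pooled figure is driven by where the programme households are concentrated, not by the programme.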

Box Plots

Box plots: distributions side by side

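A box plot builds on Tukey's five-number summary: a box from Q1 to Q3 with a line at the median. By the usual convention, whiskers reach to the most extreme values within 1.5 × IQR of the box, and anything beyond them is marked as an outlier point. Lined up by group, it compares whole distributions at a glance.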

Box plots show centre, spread and outliers together — perfect for comparing MPCE across states. Where a tool lacks them, plotting the quartiles (as we did earlier) is a fair substitute.
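
The numbers behind a box plot are easy to compute by hand. This sketch uses Python's standard library and the common 1.5 × IQR convention for the fences. The values are hypothetical and unweighted, and quartile conventions differ slightly between tools, so expect small differences from other software.

```python
import statistics

def box_stats(values, k=1.5):
    """Quartiles, whisker ends and outliers for one group, using Tukey-style fences."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],     # smallest value that is not an outlier
        "whisker_high": inside[-1],   # largest value that is not an outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical monthly per-capita expenditure values, in thousands of rupees
mpce = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(box_stats(mpce))
```

Here the box runs from 3 to 7 with a median of 5, the whiskers stop at 1 and 8, and the value 100 is flagged as an outlier to investigate, not to delete automatically.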

Honest Axes

Honest axes — usually start at zero

[Charts: the same data drawn twice, once with the y-axis starting at 90 (misleading) and once starting at 0 (honest); illustrative]

Same data. Truncating the y-axis makes a 4-point rise look like a leap. Bar charts should start at zero.

Overplotting

When the scatter becomes a blob

With tens of thousands of households, a scatter turns into a solid blob and hides its own density. EDA fixes this with transparency, smaller points, sampling, or binning into a 2-D histogram.

Large household surveys almost always overplot. If your scatter is a black cloud, you are seeing the count, not the pattern — thin it out.
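
Binning is the least familiar of those fixes, so here is a minimal sketch of the idea: replace individual points with counts per grid cell, then shade or label each cell by its count. The data are randomly generated stand-ins and the bin widths are arbitrary assumptions.

```python
from collections import Counter
import random

def bin_2d(xs, ys, x_width, y_width):
    """Count points per grid cell; each cell is keyed by its lower-left corner."""
    cells = Counter()
    for x, y in zip(xs, ys):
        cell = ((x // x_width) * x_width, (y // y_width) * y_width)
        cells[cell] += 1
    return cells

# Hypothetical data: household size against monthly consumption for 50,000 households.
rng = random.Random(0)
size = [rng.randint(1, 10) for _ in range(50_000)]
consumption = [rng.lognormvariate(9, 0.6) for _ in range(50_000)]

cells = bin_2d(size, consumption, x_width=1, y_width=2_000)
# The densest cells show where households actually concentrate.
for cell, count in cells.most_common(3):
    print(cell, count)
```

Plotting libraries offer the same idea ready-made as 2-D histograms or hexagonal bins; the point is that density becomes visible once overlapping points are counted instead of stacked.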

Checklist

An exploratory-plot checklist

Is the variable type matched to the chart (histogram for continuous, bar for categorical)?
Is the baseline honest, and are axes and units labelled?
Have you tried a log scale for skewed money?
Is the plot weighted if it is meant to describe the population?
Did you disaggregate to check for hidden subgroups?

10

Section Ten

From EDA to Questions & Hypotheses

The Hand-off

EDA ends where testing begins

Exploration is generative: it surfaces patterns and hunches. But a pattern found by exploring is a hypothesis, not a result. The next, separate step is to test it — ideally on fresh data.

01

EDA: notice a pattern (median consumption lower in one social group)

→

02

QUESTION: is the gap real, or sampling noise?

→

03

HYPOTHESIS: state it precisely, in advance

→

04

TEST: confirm with appropriate, weighted methods
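
When no fresh data are available, one possible stand-in is to set part of the sample aside before exploring and test only on that part. The sketch below is one way to do this, not a prescribed method: it splits by whole cluster (PSU) so that neighbouring households do not end up on both sides. The field names are hypothetical, and any weights would need rethinking for each half.

```python
import random

def split_by_cluster(records, cluster_key="psu", explore_share=0.5, seed=2024):
    """Assign whole clusters to an exploration set or a confirmation set."""
    clusters = sorted({r[cluster_key] for r in records})
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    rng.shuffle(clusters)
    cut = int(len(clusters) * explore_share)
    explore_ids = set(clusters[:cut])
    explore = [r for r in records if r[cluster_key] in explore_ids]
    confirm = [r for r in records if r[cluster_key] not in explore_ids]
    return explore, confirm

# Hypothetical records: 10 PSUs with 5 households each
records = [{"psu": p, "hh": h} for p in range(10) for h in range(5)]
explore, confirm = split_by_cluster(records)
print(len(explore), len(confirm))  # 25 25
```

Explore freely on the first set, write the hypothesis down, and only then open the second.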

Can Do

What EDA can legitimately conclude

Describe the sample: distributions, gaps, structure
Reveal data-quality problems and their extent
Compare groups descriptively (with weights, with caution)
Suggest relationships worth testing
Generate hypotheses and sharpen questions

Cannot Do

What EDA cannot conclude

Prove causation — a pattern is not a cause
Confirm a hypothesis it was used to find
Establish significance by eyeballing — that needs a test
Generalise below the survey's design level

The cardinal sin: exploring until something 'looks significant', then reporting it as a confirmed finding. That is p-hacking by another name.

Forking Paths

Explore enough, and noise looks real

Slice a dataset twenty ways and, at the conventional 5% threshold, you should expect about one slice to show a 'significant' pattern by chance alone. Exploration is meant to roam, which is exactly why its findings must be confirmed elsewhere.

Be honest about how many things you looked at. A surprising subgroup result found on the twentieth cut deserves scepticism, not a headline.
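
The arithmetic behind this is short. The sketch assumes each look is an independent check at a 5% threshold and that no real pattern exists; real slices of one dataset overlap, so treat the figures as a rough guide.

```python
def chance_of_false_alarm(n_looks: int, alpha: float = 0.05) -> float:
    """P(at least one 'significant' result) across n independent looks when nothing is real."""
    return 1 - (1 - alpha) ** n_looks

for looks in (1, 5, 20, 50):
    expected_false_alarms = looks * 0.05
    print(looks, round(expected_false_alarms, 2), round(chance_of_false_alarm(looks), 2))
```

At twenty looks you expect one false alarm on average, and the chance of seeing at least one is about 64%.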

Documenting

Document findings so they travel

State each finding with its denominator and unit
Say whether it is weighted, and at what level it holds
Note the data-quality caveats behind it
Separate what EDA suggested from what was tested

A finding without its caveats is a liability. The caveats are what make it usable by someone else — and by future-you.

Telling the Story

From exploration to a clear claim

When EDA is done, translate it into plain language a decision-maker can act on: what is typical, where the gaps are, who is missing, and which questions still need a formal test.

Far better an approximate answer to the right question than an exact answer to the wrong one.

— John W. Tukey

Takeaways

The judgement EDA builds

Look before you model — always plot and summarise first
Know your unit, codes and universe before any statistic
Why a value is missing matters more than how many
Weight any number meant to describe the population
A found pattern is a question, not yet an answer

11

Section Eleven

Tools & Reproducibility

The Toolkit

Tools for household-survey EDA

Tool	Good for	Note
R + tidyverse	Cleaning, plotting, reproducible analysis	Free; survey & srvyr packages handle weights
Python + pandas	Cleaning, large data, automation	Free; samplics for survey design; statsmodels for weighted statistics
Stata	Standard for official microdata	svyset built-in for weights & design; widely used
Spreadsheets	Quick first look, small tables	Fine to start; not for weighted survey estimates

Whichever you pick, choose a tool that understands survey weights and clustering — spreadsheets do not.

Reproducibility

Scripts, not manual edits

Fragile

Hand-edits in a spreadsheet, no record of what changed. Next month nobody can reproduce the number — including you.

Robust

A documented script from raw extract to result. Re-run it any time, audit every step, hand it to a colleague.

Golden rule, restated: never edit the raw file. Every clean, recode and exclusion lives in code.

Project Hygiene

A tidy, reproducible project

Keep the raw extract read-only and dated
One script does the cleaning, another the analysis
Name files with versions and dates, not 'final_FINAL_v3'
Record the source, download date and any filters applied
Keep the codebook beside the data
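
Several of these habits can be automated in a few lines. The sketch below is one possible approach, with a hypothetical file name and note format: it makes the raw extract read-only, records a checksum so later changes are detectable, and writes the source, download date and filters to a small note beside the file.

```python
import hashlib
import json
import os
import stat
import tempfile
from pathlib import Path

def register_raw_extract(raw_path, source, download_date, filters=""):
    """Make the raw file read-only and write a small provenance note beside it."""
    raw_path = Path(raw_path)
    digest = hashlib.sha256(raw_path.read_bytes()).hexdigest()
    # Remove write permission so accidental edits fail loudly.
    os.chmod(raw_path, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    note = {
        "file": raw_path.name,
        "sha256": digest,
        "source": source,
        "download_date": download_date,
        "filters": filters,
    }
    note_path = raw_path.parent / (raw_path.name + ".provenance.json")
    note_path.write_text(json.dumps(note, indent=2))
    return note_path

# Demonstration on a throwaway file; the name and contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_extract_2024-01-15.csv"
    raw.write_text("hhid,weight\n1,1200\n")
    note = register_raw_extract(raw, source="example portal", download_date="2024-01-15")
    print(note.name)
    print(sorted(json.loads(note.read_text())))
```

If the checksum of the raw file ever differs from the one in the note, something has touched the data outside your scripts.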

Where to Get Data

Open microdata you can explore today

microdata.gov.in — MoSPI's NSS & PLFS unit-level data
dhsprogram.com — NFHS / DHS datasets (on request)
data.gov.in — India's open government data portal
censusindia.gov.in — Census tables and maps
World Bank Microdata Library — regional comparisons

Always download the codebook and survey report alongside the data — the file is unreadable without them.

Keep Learning

A short reading list

Exploratory Data Analysis — John W. Tukey (the founding text)
R for Data Science — Wickham & Grolemund (free online)
The Visual Display of Quantitative Information — Edward Tufte
Analysis of Health Surveys — Korn & Graubard (survey design)
Your survey's own report and methodology note — read it first

Pair this deck with ImpactMojo's Data Literacy, Quantitative Methods and Research Ethics 101 courses.

Exploratory Data Analysis 101 · Complete

Now go open
the data — and look.

Before the model, before the finding, before the headline: inspect, clean, describe, plot and question. That habit, applied to every household survey you touch, is what makes your numbers trustworthy. Explore the rest of the ImpactMojo 101 Series, free forever.

CC BY-NC-ND 4.0·Free Forever·ImpactMojo 101 Series