ImpactMojo · Exploratory Data Analysis 101 · www.impactmojo.in
ImpactMojo 101 Series · Free Forever
Exploratory Data Analysis 101
Inspecting, Cleaning & Questioning Household Survey Data — a Foundational Course for Development Practitioners in South Asia
Household Surveys · South Asia Focus · ~90 Slides · Free Access
What We Cover
01 What EDA Is & Why
02 Household Survey Data in India
03 The EDA Workflow
04 Knowing Your Variables
05 Univariate Exploration
06 Spotting Trouble
07 Bivariate & Multivariate
08 Survey Weights & Design
09 Visual EDA Done Well
10 From EDA to Questions
11 Tools & Reproducibility
01
Section One
What EDA Is & Why
Look before you leap
Before you fit a model, build an index or write a finding, you must look at the data. Exploratory Data Analysis (EDA) is the open-minded first pass — describing, plotting and questioning a dataset to understand its shape, gaps and surprises.
Exploratory Data Analysis (EDA)
An attitude and a toolkit for examining data with few prior assumptions — using summaries and pictures to reveal structure, spot problems and generate questions, before any formal testing or modelling.
EDA is detective work, not decoration. You are interrogating the data to find out what it can — and cannot — honestly say.
Tukey and the case for exploring
Exploratory data analysis is an attitude, a flexibility, and a reliance on display, not a bundle of techniques.
— John W. Tukey, who named EDA in 1977
Tukey argued that statistics had become obsessed with confirming hypotheses and neglected the prior, humbler task of finding them. EDA restored looking, sketching and questioning to the centre of data work.
Two modes of data work
Exploratory
Open-ended. What is going on here? Generates questions and hypotheses. Forgiving of surprises. Where every project should start.
Confirmatory
Pre-specified. Is this specific claim true? Tests a hypothesis fixed in advance. Where EDA leads, but is not the same step.
Danger: do not let exploration masquerade as confirmation. A pattern you found by digging is a hypothesis to test on fresh data, not a proven result.
Let the data speak first
Practitioners often arrive with a story they want the data to confirm. EDA asks you to hold that story loosely and let the numbers talk back — to notice the variable that is half missing, the district that behaves oddly, the impossible age.
The greatest value of a picture is when it forces us to notice what we never expected to see.
— John W. Tukey
Understand before you model
01 EXPLORE: what is in this dataset? what is wrong with it?
02 DESCRIBE: distributions, gaps, relationships
03 QUESTION: what hypotheses does this suggest?
04 ONLY THEN: model, test, conclude
Skipping EDA is how analysts end up modelling a coding error, averaging a top-coded variable, or reporting on a subgroup that is mostly missing data.
What good EDA catches early
  • Errors — impossible ages, negative incomes, duplicated households
  • Gaps — a variable missing for 30% of women, not at random
  • Shape — consumption is skewed, so the mean misleads
  • Structure — the survey is clustered and weighted, not a simple random sample
  • Surprises — the outlier district that is the real story
Every hour of EDA saves a day of rework — and prevents a wrong number reaching a decision.
How this course is built
Foundations
  • Indian household survey data
  • The EDA workflow
  • Variables, codebooks & measurement
Practice
  • Univariate, bivariate & multivariate views
  • Missingness, outliers, weights
  • Honest visuals & reproducible tools
Throughout, the data you meet is patterned on India's NSS, PLFS and NFHS — the surveys you will actually open at work.
02
Section Two
Household Survey Data in India
What is a household survey?
Household survey
A survey that samples dwellings (households), then collects information about the household as a whole and about each member — the workhorse design behind NSS, PLFS, NFHS and most development data.
Because so much of welfare — consumption, sanitation, who eats, who decides — happens at the household level, the household is the natural sampling unit. But many questions are about people within it.
Two rosters: households and members
Household roster
One row per household: assets, dwelling type, water source, ration card, total members. Keyed by a household ID.
Member roster
One row per person: age, sex, relation to head, education, work, health. Keyed by household ID + person line number.
EDA must respect this hierarchy. Household-level and person-level variables live in different files and must be merged on the household ID before you can analyse them together.
Hierarchical data is nested
HH ID | Person line | Age | Sex | Education
10231 | 1 | 44 | Male | Secondary
10231 | 2 | 40 | Female | Primary
10231 | 3 | 16 | Female | Secondary
10231 | 4 | 11 | Male | Primary
10232 | 1 | 67 | Female | None
The first four rows are one household (10231); the fifth row belongs to a second household (10232). To count households you need unique HH IDs; to count people you count rows. Confusing the two is a classic early error.
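As a minimal sketch in Python with pandas, the roster above can be typed in by hand to show the two counts side by side. The column names (hh_id, person_line and so on) are assumptions made for illustration, not the names used in any particular survey file.

```python
import pandas as pd

# Member roster copied from the table above: one row per person.
# Column names are hypothetical.
members = pd.DataFrame(
    {
        "hh_id": ["10231", "10231", "10231", "10231", "10232"],
        "person_line": [1, 2, 3, 4, 1],
        "age": [44, 40, 16, 11, 67],
        "sex": ["Male", "Female", "Female", "Male", "Female"],
        "education": ["Secondary", "Primary", "Secondary", "Primary", "None"],
    }
)

n_people = len(members)                     # rows count persons
n_households = members["hh_id"].nunique()   # unique IDs count households
hh_size = members.groupby("hh_id").size()   # members per household

print("people:", n_people, "households:", n_households)
print(hh_size.to_dict())
```

Here the row count gives people and the unique-ID count gives households; dividing by the wrong one changes every per-household figure.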
India's major household surveys
Survey | What it covers | Who runs it | Frequency
NSS (consumption, etc.) | Consumption expenditure, employment, social consumption | NSSO / MoSPI | Rounds (subject rotates)
PLFS | Labour force: work, unemployment, wages | NSSO / MoSPI | Annual since 2017–18
NFHS | Health, nutrition, fertility, anaemia, women's status | IIPS / MoHFW | ~5 yrs (NFHS-5: 2019–21)
CMIE-CPHS | Household income, consumption, sentiment | CMIE (private) | Continuous, 3 waves/yr
Know these by name and by job. The right survey for a question depends on what it measures and how recently.
NSS and PLFS: the official workhorses
The National Sample Survey (NSS) has measured consumption and employment for decades. The Periodic Labour Force Survey (PLFS) took over labour statistics in 2017–18, giving annual estimates for rural and urban areas and quarterly estimates for urban areas.
NSS: Consumption & expenditure rounds underpin official poverty estimates (Source: NSSO / MoSPI)
Annual: PLFS gives the labour force participation and unemployment rates since 2017–18 (Source: PLFS, MoSPI)
NFHS: health, nutrition and gender
The National Family Health Survey is India's Demographic and Health Survey. NFHS-5 (2019–21) covered the health, nutrition and demographic situation of women, children and households across every district.
~636,000 households interviewed in NFHS-5 (Source: NFHS-5, 2019–21)
707 districts covered (Source: IIPS / MoHFW)
5 rounds since 1992–93 (Source: IIPS)
CMIE-CPHS: high-frequency private data
The Consumer Pyramids Household Survey (CPHS), run by the private firm CMIE, tracks a large panel of households continuously, giving fast readings on income, spending and unemployment between official rounds.
Useful for timeliness, but debated on sampling and representativeness. Treat it as complementary to — not a replacement for — the official surveys, and read the methodology critics.
Decide your unit before you compute
The unit of analysis is the entity each row of your working table represents — a household, a person, a child under five, a woman aged 15–49. Every statistic is implicitly 'per' some unit.
A rate such as anaemia prevalence is per eligible person, not per household. Computing it on the household file, or without restricting to the eligible group, gives the wrong answer.
These are not simple random samples
01 STRATIFY: split by state × rural/urban
02 STAGE 1: sample villages / urban blocks (clusters)
03 STAGE 2: sample households within each cluster
04 RESULT: a stratified, multistage, clustered sample
Because selection happens in stages and some groups are over-sampled, every household carries a survey weight. We return to weights in Section Eight — they change your EDA.
What to check the moment a file opens
  • How many rows, and what is one row — a household or a person?
  • Is there a household ID, a person line number, a weight variable?
  • Which file is the household roster, which the member roster?
  • What does the documentation say the unit and reference period are?
Read the survey's report and documentation before the data. Five minutes there saves hours of confusion later.
03
Section Three
The EDA Workflow
Six steps, repeated
01 LOAD: read the data correctly
02 INSPECT: structure, types, size
03 CLEAN: codes, gaps, impossible values
04 DESCRIBE: summaries per variable
05 VISUALISE: plot distributions & relations
06 QUESTION: form hypotheses, then loop back
EDA is iterative, not linear. Each plot raises a question that sends you back to inspect or clean. Expect to go round this loop many times.
Step 1 — load it right
  • Read the right file format — fixed-width, CSV, .dta, .sav, .csv.gz
  • Make sure ID and code columns load as text, not numbers (leading zeros vanish otherwise)
  • Preserve special missing codes; do not let them become real values
  • Check the row count against the documentation
A household ID like 0457 silently becoming 457 will break every merge. Type matters from the very first line.
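The leading-zero problem is easy to reproduce. The sketch below assumes pandas and uses a tiny in-memory CSV with made-up column names (hh_id, state_code, hh_size) in place of a real extract; it reads the same text twice, once with guessed types and once with the ID and code columns forced to text.

```python
import io
import pandas as pd

# A tiny CSV standing in for a survey extract (hypothetical columns).
raw = "hh_id,state_code,hh_size\n0457,09,5\n0458,09,3\n"

# Default parsing guesses types, so "0457" is read as the integer 457.
guessed = pd.read_csv(io.StringIO(raw))

# Forcing ID and code columns to text keeps the leading zeros intact.
typed = pd.read_csv(io.StringIO(raw), dtype={"hh_id": str, "state_code": str})

print(guessed["hh_id"].tolist())
print(typed["hh_id"].tolist())
```

A merge between a file read the first way and a file read the second way would match nothing, because 457 and "0457" are different keys.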
Step 2 — inspect the skeleton
Question | What you are checking
How many rows & columns? | Size and whether the file is complete
What type is each column? | Numeric, text, date, categorical codes
What is the range of each? | Min, max, plausibility
How many distinct values? | Categorical levels, accidental duplicates
How many missing per column? | Where the gaps are
This is the data-equivalent of a doctor's first examination — vital signs before any diagnosis.
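One way to take these vital signs in a single pass is sketched below, assuming pandas. The small frame and its column names are invented for illustration; on a real extract the same calls would run on the loaded file.

```python
import numpy as np
import pandas as pd

# Toy extract with a duplicated ID and some gaps (hypothetical data).
df = pd.DataFrame(
    {
        "hh_id": ["0457", "0458", "0459", "0459"],
        "sector": ["Rural", "Urban", "Rural", "Rural"],
        "hh_size": [5, 3, np.nan, np.nan],
    }
)

n_rows, n_cols = df.shape  # size of the file

# One row per column: type, number of distinct values, number missing.
profile = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "n_distinct": df.nunique(),
        "n_missing": df.isna().sum(),
    }
)

size_range = (df["hh_size"].min(), df["hh_size"].max())  # plausibility check
n_duplicate_ids = int(df["hh_id"].duplicated().sum())    # repeated household IDs

print(n_rows, n_cols)
print(profile)
print(size_range, n_duplicate_ids)
```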
Step 3 — clean transparently
Cleaning means standardising codes, recoding missing values, fixing types and flagging impossible entries — in a script, never by hand-editing the raw file.
Golden rule: keep the raw extract read-only. Every change lives in code, so it is documented, reversible and reproducible. We expand on this in Section Eleven.
Step 4 — describe every variable
  • For continuous variables: min, max, mean, median, quartiles, SD, % missing
  • For categorical variables: a frequency table of every level
  • Flag anything that looks impossible or surprising for a second look
  • Always describe before you visualise — numbers anchor the eye
Step 5 — plot to see the shape
Summaries can hide as much as they reveal. A histogram shows skew and bimodality a mean cannot; a scatter shows a curved relationship a correlation cannot. Always plot.
Numerical calculations are exact, but graphs are rough.
— John W. Tukey — and we need both
Step 6 — turn findings into questions
Good EDA ends not with answers but with sharper questions: why is consumption bimodal here? why is this variable missing more for women? is the district outlier real or an error?
Write the questions down. They become your analysis plan — and the line between what EDA suggested and what a later test confirmed.
04
Section Four
Knowing Your Variables
The data dictionary is your map
Data dictionary / codebook
The document that defines every variable: its name, meaning, units, allowed values, the codes for each category, the special missing codes, and who it applies to.
A survey dataset without its codebook is a locked box. The number '2' could mean female, urban, 'no', or a missing code — only the dictionary tells you which.
What the codebook tells you
Codebook field | Why EDA needs it
Variable label | What it actually measures
Value codes | 1 = yes, 2 = no, 9 = missing
Units | Rupees? months? per week?
Universe / who answers | Only women 15–49? only workers?
Reference period | Last 7 days? last 30? last year?
Skip patterns | Why a block is blank for some rows
The 'universe' field explains most 'missing' data: a question about pregnancy is blank for men by design, not by error.
Four levels of measurement
Level | Meaning | Survey example | Valid maths
Nominal | Labels, no order | Religion, state, ration-card type | Counts, mode
Ordinal | Ordered, unequal gaps | Education level, wealth quintile | Median, rank
Interval | Equal gaps, no true zero | Year of birth | Mean, difference
Ratio | Equal gaps, true zero | Age, income, consumption | All, ratios
The level decides which statistics are legal. You cannot average religion codes, and a wealth quintile is a rank (1–5), not a rupee amount.
The first sort: categories or measures
Categorical
A fixed set of labels — sex, caste category, district, employment status. You count and tabulate these.
Continuous
A measured quantity — age, income, MPCE, height. You take means, medians and histograms of these.
Watch the trap of coded categoricals: education stored as 1–8 looks numeric, but its mean is meaningless. Check the codebook, not the column type.
Numbers that are really labels
Survey files store almost everything as numbers to save space. State = 9 means Uttar Pradesh, not the quantity nine. Treating such codes as measurements is one of the commonest EDA errors.
Before computing any mean, ask: is this a quantity or a code? The dictionary, not the data type, decides.
Missing codes hide in plain sight
Code | Often means | Risk if treated as a value
99 / 999 | Not known / not stated | Inflates the mean enormously
97 / 98 | Refused / not applicable | Phantom category in tables
0 | Sometimes a real zero, sometimes 'none' | Ambiguous: check codebook
Blank | Skip pattern or true missing | Silent loss of cases
Recode special codes to explicit missing before any summary. A mean income running into lakhs of rupees for ordinary households usually means a 9999999 'not stated' code slipped through.
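A small sketch of the recode, assuming pandas. The income values and the 9999999 'not stated' code are hypothetical; in practice the code to recode comes from the survey's codebook.

```python
import pandas as pd

# Five reported monthly incomes; one is really a 'not stated' code.
income = pd.Series([12000, 8000, 15000, 9999999, 10000], name="monthly_income")

NOT_STATED = 9999999  # hypothetical special code; check the codebook

naive_mean = income.mean()                 # treats the code as rupees
clean = income.mask(income == NOT_STATED)  # code becomes an explicit NaN
clean_mean = clean.mean()                  # NaN is skipped by default

print(round(naive_mean), round(clean_mean))
print("recoded to missing:", int(clean.isna().sum()))
```

One stray code in five records moves the mean from about eleven thousand rupees to about twenty lakh, which is why the recode comes before any summary.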
Variables you build yourself
Much of EDA involves deriving variables: monthly per-capita consumption from total expenditure and household size; age groups from age; a binary 'has toilet' from a coded sanitation variable.
Derive in a script, label the new variable clearly, and sanity-check its distribution. A derived variable inherits every error in its inputs.
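As an illustration of deriving in a script, the sketch below builds monthly per-capita consumption from total expenditure and household size, assuming pandas. The column names and values are invented, and the rule of treating a household size of zero as an input error is an assumption for this example.

```python
import pandas as pd

# Toy household file (hypothetical columns and values).
hh = pd.DataFrame(
    {
        "hh_id": ["0457", "0458", "0459"],
        "total_monthly_exp": [12000.0, 9000.0, 20000.0],
        "hh_size": [4, 3, 0],  # 0 is an input error, not a real size
    }
)

# Keep only plausible sizes; anything else becomes missing, so the
# derived variable is missing too instead of infinite.
valid_size = hh["hh_size"].where(hh["hh_size"] > 0)
hh["mpce"] = hh["total_monthly_exp"] / valid_size

# Sanity-check the new variable before using it.
print(hh[["hh_id", "mpce"]])
print(hh["mpce"].describe())
```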
Build a variable profile
  • Name and meaning (from the codebook)
  • Type: categorical or continuous; level of measurement
  • Range or set of levels; special missing codes
  • Universe: who is supposed to have a value
  • % missing, and whether the missingness looks patterned
A one-line profile per variable, written as you go, is the backbone of trustworthy analysis.
05
Section Five
Univariate Exploration
Start with one variable
Univariate exploration looks at each variable on its own — its frequencies, its distribution, its centre and spread. It is where most data problems first surface.
01 CATEGORICAL: frequency table, bar chart
02 CONTINUOUS: histogram, density, summary stats
03 BOTH: count and inspect the missing
Count every level
Sanitation facility | Households | %
Improved, not shared | 6,420 | 64.2
Improved, shared | 1,180 | 11.8
Unimproved | 910 | 9.1
Open defecation | 1,390 | 13.9
Not stated | 100 | 1.0
Illustrative, patterned on NFHS-style categories. A frequency table is the first thing to run on any categorical variable — it reveals tiny categories, typos and stray codes at a glance.
The histogram of a skewed variable
[Figure: histogram of monthly per-capita consumption expenditure (MPCE). Illustrative, patterned on NSS consumption distributions.]
Note the long right tail. Consumption, like income and landholding, is right-skewed — a few households spend many times the typical amount.
What a histogram reveals
  • Centre: where the bulk of households sit
  • Spread: how wide the distribution is
  • Skew: a long tail on one side
  • Modes: one peak, or two (a hidden subgroup?)
  • Gaps & spikes: heaping at round numbers, or a wall at a top-code
The bin width matters: too wide hides structure, too narrow shows noise. Try a few widths before you conclude.
Density: the smoothed histogram
A density plot smooths the histogram into a curve, making it easier to compare shapes — for example, the consumption distribution for rural versus urban households on one set of axes.
Smoothing is a choice: too much smoothing erases real peaks; too little invents them. Treat the curve as one view, not the truth — and keep the histogram alongside.
The five-number summary
Min & Max: the extremes (sanity-check both)
Q1, Median, Q3: the middle and the quartiles
IQR: Q3 − Q1, the robust spread of the middle 50%
Tukey's five-number summary (min, Q1, median, Q3, max) describes a distribution without assuming any shape — the heart of EDA.
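A minimal sketch of the summary, assuming NumPy and nine made-up MPCE values. Quartile conventions differ slightly between tools; this uses NumPy's default linear interpolation, so other software may report marginally different Q1 and Q3 on the same data.

```python
import numpy as np

# Nine illustrative MPCE values with a long right tail (hypothetical).
mpce = np.array([800, 1000, 1200, 1500, 1800, 2200, 3000, 4500, 12000], dtype=float)

# Five-number summary: min, Q1, median, Q3, max.
minimum, q1, median, q3, maximum = np.percentile(mpce, [0, 25, 50, 75, 100])
iqr = q3 - q1       # spread of the middle 50%
mean = mpce.mean()  # for comparison with the median

print(minimum, q1, median, q3, maximum)
print("IQR:", iqr, "mean:", round(mean, 1))
```

The mean lands above the median here because the single 12,000 value pulls it up, which is the right-skew pattern the summary is meant to expose.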
For skewed money, trust the median
[Figure: the mean is pulled above the median by the right tail. Illustrative.]
The mean sits well above the median because the long tail of high spenders drags it up. For consumption, income and land, report the median.
Naming the skew
Right-skewed
Long tail to the right. Mean > median. Income, MPCE, landholding, firm size. The common case in development data.
Left-skewed
Long tail to the left. Mean < median. Rarer — e.g. age at death in a high-survival population.
Skew is a signal, not a defect. It tells you which centre to report and warns you off methods that assume symmetry.
The log scale tames the tail
For strongly right-skewed money variables, plotting on a logarithmic scale spreads the squashed low values and pulls in the tail, often revealing a near-symmetric shape that is easier to read.
A transform is an exploratory lens, not a fact about the world. Always label the axis as logged, and remember to back-transform before reporting rupee figures.
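To make the back-transform step concrete, here is a small sketch assuming NumPy and three made-up incomes. It also shows one thing to keep in mind: the back-transformed mean of the logs is the geometric mean, which is not the same number as the ordinary mean.

```python
import numpy as np

# Three illustrative incomes spanning two orders of magnitude (hypothetical).
income = np.array([1_000.0, 10_000.0, 100_000.0])

log_income = np.log10(income)          # 3, 4, 5 on the log scale
mean_of_logs = log_income.mean()       # centre on the log scale
back_transformed = 10 ** mean_of_logs  # back to rupees: the geometric mean
arithmetic_mean = income.mean()        # ordinary mean, pulled up by the tail

print(round(back_transformed), round(arithmetic_mean))
```

When a figure comes from the logged scale, say so, because a reader will otherwise assume it is the ordinary mean.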
06
Section Six
Spotting Trouble
What can be wrong with a column
Problem | Survey example | How EDA spots it
Missing values | Income blank for some | Count of NA per variable
Impossible values | Age = 230, −3 children | Min/max range check
Special codes as data | 99 = 'not stated' | Spike at 99 in histogram
Top-coding | Income capped at a max | Wall of cases at the ceiling
Heaping | Ages bunched at 0, 5, 10 | Comb pattern in histogram
Duplicates | Same HH twice | Duplicate household IDs
Map where the gaps are
[Figure: bar chart of the share of records missing, by variable. Illustrative.]
A missingness bar chart across variables is one of the most useful EDA plots. Income and 'not applicable' fields are usually the worst — and that is rarely random.
Why a value is missing matters most
  • By design (skip): the question did not apply — not a problem
  • Missing at random: gaps unrelated to the value — least harmful
  • Missing not at random: the richest refuse to state income — this biases results
Dropping rows with gaps can quietly delete the very households you care about. Before deleting or filling, ask why the value is absent — the pattern of absence is itself data.
Look for structure in the gaps
Cross-tabulate missingness against other variables: is income missing more for richer households? is a health variable missing more in one state? Patterned missingness is a finding, not just a nuisance.
Create a 'missing flag' variable and explore it like any other — who is missing, and how do they differ from who is present?
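A sketch of the missing-flag idea, assuming pandas. The eight records, the wealth_quintile column and the pattern of gaps are all invented to show the mechanics.

```python
import numpy as np
import pandas as pd

# Toy records: income is blank more often in the top quintiles (hypothetical).
df = pd.DataFrame(
    {
        "wealth_quintile": [1, 1, 2, 3, 4, 5, 5, 5],
        "income": [3000, 4000, 5000, 8000, np.nan, np.nan, 40000, np.nan],
    }
)

# The flag is an ordinary variable: True where income is absent.
df["income_missing"] = df["income"].isna()

# Share missing within each quintile; the mean of a True/False flag is a rate.
miss_rate = df.groupby("wealth_quintile")["income_missing"].mean()
print(miss_rate)
```

If the rate climbs with wealth, as it does in this toy data, dropping incomplete rows would remove richer households disproportionately.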
Outliers: error, or the story?
Could be an error
Age 230, a household of 60, income of ₹0 with five earners — check for a data-entry slip or a stray code before analysing.
Could be the story
The one district with triple the stunting rate may be exactly where the programme is needed. Do not delete it — investigate it.
Never silently drop outliers. Flag them, explain them, and decide transparently — and record it in the EDA log.
Rules of thumb for flagging extremes
  • Range checks: can this value exist at all? (% in 0–100)
  • IQR rule: flag points beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR
  • Logic checks: a 6-year-old cannot be married with children
  • Visual: the point detached from the cloud in a scatter
These rules flag candidates for human review. They do not decide — judgement, not a threshold, removes a value.
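The IQR rule from the list above can be sketched in a few lines, assuming NumPy. The helper name iqr_flags and the household sizes are hypothetical; the function only marks candidates and deletes nothing.

```python
import numpy as np

def iqr_flags(values, k=1.5):
    """Return a boolean mask marking values outside Q1 - k*IQR or Q3 + k*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Illustrative household sizes, including one household of 60.
hh_size = np.array([2, 3, 4, 4, 5, 5, 6, 7, 60])

flagged = hh_size[iqr_flags(hh_size)]
print(flagged)  # candidates for human review, not for automatic removal
```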
Logic the data must obey
Rule | Violation to flag
Age between 0 and ~110 | Age = 230, age = −1
Percentages in 0–100 | Vaccination = 140%
Members ≥ earners | 8 earners in a 4-person household
Mother older than child by a plausible margin | Mother 14, child 10
Consumption > 0 | MPCE = 0 with members present
Codify these checks once and re-run them on every extract. Built-in validation beats hoping you will notice.
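One way to codify such checks is to keep them as named boolean rules in a script, sketched here with pandas. The frame, its column names and the three rules chosen are assumptions for illustration; a real project would add its own rules from the codebook.

```python
import pandas as pd

# Toy household file with one violation per rule (hypothetical).
hh = pd.DataFrame(
    {
        "hh_id": ["A", "B", "C", "D"],
        "members": [4, 4, 5, 3],
        "earners": [2, 8, 1, 1],
        "mpce": [2500.0, 1800.0, 0.0, 3100.0],
        "pct_vaccinated": [80, 100, 60, 140],
    }
)

# Each entry is True where the rule is violated.
checks = {
    "earners_exceed_members": hh["earners"] > hh["members"],
    "mpce_not_positive": hh["mpce"] <= 0,
    "pct_out_of_range": ~hh["pct_vaccinated"].between(0, 100),
}

flags = pd.DataFrame(checks)
flags.index = hh["hh_id"]

n_violations = flags.sum()                            # count per rule
flagged_ids = flags.index[flags.any(axis=1)].tolist() # households to review

print(n_violations.to_dict())
print(flagged_ids)
```

Because the rules live in one dictionary, re-running them on a new extract is a single call and adding a rule is a single line.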
Top-coding: the artificial ceiling
Top-coding
Capping a variable at a maximum value to protect privacy or limit outliers — e.g. all incomes above ₹10 lakh recorded as exactly ₹10 lakh. It creates a spike of identical values at the ceiling.
Top-coding makes the mean and the upper tail unreliable. Spot it as a wall of cases at one value, and note that you cannot study the top accurately from such data.
Cleaning changes the picture
[Figure: age distribution before and after removing impossible values. Illustrative.]
The impossible '>110' sliver vanishes after recoding it to missing. Small in count, but it would have distorted any mean age — and revealed a data-entry problem worth reporting.
07
Section Seven
Bivariate & Multivariate Exploration
Now look at variables together
Bivariate exploration asks how two variables move together; multivariate brings in a third. This is where relationships, gaps and confounders start to appear.
01 CAT × CAT: cross-tabulation
02 CAT × CONTINUOUS: grouped summaries
03 CONTINUOUS × CONTINUOUS: scatter plot
04 MANY: correlation matrix, faceting
Cross-tabulation: two categoricals
Sector | Has toilet | No toilet | % with toilet
Rural | 5,180 | 2,020 | 71.9
Urban | 2,610 | 190 | 93.2
All | 7,790 | 2,210 | 77.9
Illustrative. A cross-tab is the bivariate workhorse. The key choice is the direction of percentages: row percents answer 'of rural households, what share have a toilet?' — usually what you want.
Percentage in the right direction
Row %
Within each row group: 'of rural households, X% have a toilet.' Compares the outcome across groups.
Column %
Within each column: 'of households with a toilet, X% are rural.' A different question entirely.
Choosing the wrong direction silently answers a different question. State the comparison in words before you tabulate.
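The two directions can be computed from the illustrative sector-by-toilet counts shown earlier. This sketch assumes pandas and starts from the counts; with unit records, pd.crosstab with normalize="index" or normalize="columns" gives the same two tables.

```python
import pandas as pd

# Counts from the illustrative cross-tab (rural/urban by toilet access).
counts = pd.DataFrame(
    {"Has toilet": [5180, 2610], "No toilet": [2020, 190]},
    index=["Rural", "Urban"],
)

# Row %: within each sector, what share has a toilet?
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Column %: among households with a toilet, what share is rural?
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100

print(row_pct.round(1))
print(col_pct.round(1))
```

The same cell answers two different questions: about 71.9% of rural households have a toilet, while about 66.5% of households with a toilet are rural.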
A continuous variable, split by groups
[Figure: median MPCE by social group. Illustrative, patterned on NSS consumption by social group.]
Grouping a continuous variable by a category — here median consumption by social group — is the single most common development EDA move. Use the median for skewed money.
Compare spread, not just centre
[Figure: MPCE quartiles by sector, showing that the spread differs. Illustrative.]
A true box plot is awkward in this toolkit, so we plot the quartiles directly. Urban consumption is both higher and more spread out — a fact the medians alone would hide.
Two continuous variables: the scatter
[Figure: scatter of female literacy (%) against total fertility rate, major states. Illustrative, patterned on Census 2011 & NFHS-5.]
A clear negative pattern, but remember it is a state-level picture. The ecological fallacy warns against reading a pattern across states as a truth about individuals.
Correlation, and its limits
Correlation (r)
A number from −1 to +1 summarising how strongly two continuous variables move together linearly. It captures direction and strength — but only of a straight-line relationship.
  • Correlation is not causation — a confounder may drive both
  • r misses curves — a strong U-shape can give r near 0
  • Outliers move r a lot — one point can fake a relationship
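The second point in the list above is easy to verify with a constructed example. The sketch assumes NumPy and uses a perfect U-shape, y = x², on a grid symmetric around zero; the data are synthetic, not survey values.

```python
import numpy as np

# A perfect U-shape: y is completely determined by x.
x = np.linspace(-3, 3, 61)
y = x ** 2

# Pearson correlation only measures the straight-line part.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))
```

The correlation comes out at essentially zero even though the relationship is exact, which is why a scatter plot has to accompany any r.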
The correlation matrix idea
With many continuous variables, compute every pairwise correlation and lay them out as a colour-shaded heatmap — dark for strong, pale for weak. It is a fast scan for which variables travel together.
Treat the heatmap as a map of questions, not answers. A strong cell says 'look here', then you plot the pair to see whether the relationship is real, curved or driven by an outlier.
Always split before you conclude
A relationship in the whole sample can hide, or even reverse, within subgroups — Simpson's paradox. A scheme can look worse overall yet be better in every state if the states differ in size and baseline.
Disaggregate by sex, sector, caste, state as part of routine EDA. The aggregate can point the opposite way to the truth.
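A constructed two-state example shows how the reversal arises. The sketch assumes pandas; the states, household counts and outcomes are invented so that the scheme does better within each state but worse in the pooled figure, because scheme households are concentrated in the low-baseline state.

```python
import pandas as pd

# Hypothetical counts: state A has a low baseline, state B a high one.
df = pd.DataFrame(
    {
        "state": ["A", "A", "B", "B"],
        "arm": ["scheme", "no_scheme", "scheme", "no_scheme"],
        "households": [800, 200, 200, 800],
        "improved": [240, 40, 160, 560],
    }
)

# Success rate within each state, one column per arm.
by_state = df.assign(rate=df["improved"] / df["households"]).pivot(
    index="state", columns="arm", values="rate"
)

# Pooled rate per arm, ignoring state.
totals = df.groupby("arm")[["improved", "households"]].sum()
overall_rate = totals["improved"] / totals["households"]

print(by_state)
print(overall_rate)
```

Within state A the scheme rate is 30% against 20%, within state B it is 80% against 70%, yet pooled it is 40% against 60%.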
08
Section Eight
Survey Weights & Design Effects in EDA
Unweighted exploration misleads
Household surveys are not simple random samples. Some groups are deliberately over-sampled and response rates vary, so the raw sample does not mirror the population. Exploring it unweighted gives the wrong picture.
A raw mean from NFHS or PLFS unit data is an estimate for the sample, not the population. To speak about India, you must apply the survey weights.
Weights: how many people a row represents
Survey weight
A number attached to each record saying how many people or households in the population that one respondent stands for. It corrects for unequal selection probabilities and non-response.
If poor districts were over-sampled, their households carry smaller weights so they do not dominate the national figure; under-sampled groups carry larger weights.
Why surveys over-sample on purpose
To report reliably on a small group — a small state, a minority, an urban slum — the survey must interview enough of them. So it deliberately over-samples, then uses weights to restore the correct national balance.
This is a feature, not a flaw. But it means the unweighted sample over-represents those groups — which is exactly why weights exist.
Weighted vs unweighted: a worked gap
[Figure: estimated open-defecation rate, weighted vs unweighted. Illustrative.]
State estimates match, but the national figure differs: an over-sampled high-rate state inflates the unweighted national average. Weighting fixes it — a four-point swing in the headline.
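The mechanism can be reproduced with made-up numbers, assuming NumPy. These are not the data behind the chart: two hypothetical states are sampled equally (500 households each), but the high-rate state holds only 30% of the population, so its households carry smaller weights. The weights are chosen so that the gap is four points.

```python
import numpy as np

# 1 = household practises open defecation, 0 = does not (hypothetical data).
state_x = np.repeat([1, 0], [150, 350])  # 500 sampled, rate 30%, over-sampled
state_y = np.repeat([1, 0], [50, 450])   # 500 sampled, rate 10%
od = np.concatenate([state_x, state_y])

# Each sampled household stands for this many population households.
weights = np.concatenate([np.full(500, 300.0), np.full(500, 700.0)])

unweighted = od.mean()                      # describes the sample
weighted = np.average(od, weights=weights)  # describes the population

print(round(unweighted, 3), round(weighted, 3))
```

Within each state the weights are constant, so the state rates are the same either way; only the national figure moves, from 20% unweighted to 16% weighted.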
The design effect (DEFF)
Design effect (DEFF)
How much the clustered, weighted design inflates the variance of an estimate compared with a simple random sample of the same size. DEFF = 2 means your effective sample is half the nominal one.
Because households in the same village are similar, clustering means each extra interview adds less new information than a fresh random draw would. The DEFF quantifies that loss.
Your sample is smaller than it looks
n = 10,000: nominal sample size
DEFF = 2: typical for a clustered survey
≈ 5,000: effective sample size (n ÷ DEFF)
Ignore the design effect and your confidence intervals will be too narrow — you will claim more precision than the data supports.
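The arithmetic behind the narrower-than-it-should-be interval can be sketched with the standard library alone. The n and DEFF are the figures above; the proportion p = 0.30 is a hypothetical estimate added for illustration, and the standard error uses the simple binomial formula as an approximation.

```python
import math

n = 10_000   # nominal sample size
deff = 2.0   # design effect
p = 0.30     # hypothetical estimated proportion

n_eff = n / deff  # effective sample size

# Standard error if the data were a simple random sample of size n.
se_srs = math.sqrt(p * (1 - p) / n)

# Design-adjusted standard error: inflate the variance by DEFF.
se_design = se_srs * math.sqrt(deff)

# Approximate 95% margins of error, in percentage points.
moe_srs = 1.96 * se_srs * 100
moe_design = 1.96 * se_design * 100

print(n_eff, round(moe_srs, 2), round(moe_design, 2))
```

Inflating the variance by DEFF is the same as computing the simple formula on the effective sample size, so the margin of error widens by a factor of √DEFF.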
Handling weights and design in EDA
  • Find the weight, cluster (PSU) and stratum variables in the codebook
  • Apply weights to every estimate meant to describe the population
  • Use survey-aware tools so standard errors account for the design
  • Report whether each figure is weighted — and at what level it is valid
Rule of thumb for EDA: explore shapes unweighted to find problems, but report any population number weighted.
Respect the design's resolution
A survey is only reliable down to the level it was designed for — usually state or district. Using a state-level estimate to claim something about one block asks the data a question it cannot answer.
Common error: drilling NFHS or PLFS to a tiny subgroup until the cell has 11 households, then reporting a precise percentage. Check the unweighted count behind every estimate.
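One habit that guards against this is to carry the unweighted count alongside every subgroup estimate. The sketch assumes pandas; the block labels, the data and the minimum cell size of 30 are all hypothetical, and the right threshold depends on the survey's own guidance.

```python
import pandas as pd

MIN_CELL = 30  # hypothetical threshold; follow the survey's documentation

# Toy records: block P has only 11 households, block Q has 40.
df = pd.DataFrame(
    {
        "block": ["P"] * 11 + ["Q"] * 40,
        "has_toilet": [1] * 9 + [0] * 2 + [1] * 30 + [0] * 10,
    }
)

# Estimate and unweighted count side by side for every cell.
summary = df.groupby("block")["has_toilet"].agg(n="count", share="mean")
summary["reportable"] = summary["n"] >= MIN_CELL

print(summary)
```

Block P's share of about 82% rests on 11 households, so the table marks it as not reportable instead of letting the percentage stand alone.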
09
Section Nine
Visual EDA Done Well
Plotting is the heart of EDA
EDA leans on display because the eye catches what tables hide — skew, clusters, gaps, outliers, curved relationships. The goal in exploration is speed and honesty, not polish.
There is no excuse for failing to plot and look.
— J. W. Tukey & F. Mosteller
Small multiples: compare many at once
Rather than cramming ten states into one tangled chart, draw ten tiny identical charts in a grid — small multiples. Shared scales let the eye compare shapes effortlessly.
Ideal for EDA across states, sectors or social groups: one consumption histogram per state, same axes, side by side. Patterns and exceptions jump out.
Faceting: one plot, split by a variable
Faceting
Splitting a single plot into a grid of panels, one per category of a variable — the same scatter or histogram drawn separately for rural and urban, or for each social group.
Faceting is small multiples generated automatically from your data. It is the fastest way to disaggregate visually and catch a Simpson's-paradox reversal.
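The reversal that faceting can expose is easy to reproduce with numbers. The sketch below uses invented counts (school attendance by sector and by a hypothetical programme); the pooled comparison points one way, each panel points the other.

```python
# Toy counts, invented for illustration:
# (sector, in_programme) -> (children attending, children total)
counts = {
    ("rural", True): (192, 263),
    ("rural", False): (55, 80),
    ("urban", True): (81, 87),
    ("urban", False): (234, 270),
}

def rate(pairs):
    """Attendance rate across one or more (attending, total) pairs."""
    attending = sum(a for a, _ in pairs)
    total = sum(t for _, t in pairs)
    return attending / total

def pooled_rate(in_programme):
    """Rate with both sectors lumped together (the un-faceted view)."""
    return rate([v for (_, prog), v in counts.items() if prog == in_programme])

def panel_rate(sector, in_programme):
    """Rate within one sector (one facet panel)."""
    return rate([counts[(sector, in_programme)]])

print(f"pooled: programme {pooled_rate(True):.1%} vs none {pooled_rate(False):.1%}")
for sector in ("rural", "urban"):
    print(f"{sector}: programme {panel_rate(sector, True):.1%} "
          f"vs none {panel_rate(sector, False):.1%}")
```

The pooled view is dragged down because most programme children in this toy example are rural, where attendance is lower for everyone.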
Box plots: distributions side by side
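A box plot builds on Tukey's five-number summary: a box from Q1 to Q3 with a line at the median, and whiskers that, in the usual convention, reach the most extreme values within 1.5 × IQR of the box. Anything beyond the whiskers is marked as an outlier point. Lined up by group, it compares whole distributions at a glance.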
Box plots show centre, spread and outliers together — perfect for comparing MPCE across states. Where a tool lacks them, plotting the quartiles (as we did earlier) is a fair substitute.
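A sketch of the numbers behind one box, assuming the common 1.5 × IQR whisker convention and Python's `statistics.quantiles` (3.8+) with the "inclusive" method; other tools use slightly different quartile rules. The MPCE values are invented.

```python
import statistics

def box_plot_stats(values):
    """Quartiles, whisker ends and outliers under the 1.5 x IQR convention."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # most extreme value still inside the fences
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Toy monthly per-capita expenditure values in rupees, invented for illustration
mpce = [1800, 2100, 2300, 2500, 2600, 2900, 3100, 3400, 3900, 12000]
stats = box_plot_stats(mpce)
print(stats)
```

Running this per state and printing the rows side by side gives the quartile table the slide mentions as a substitute for a drawn box plot.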
Honest axes — usually start at zero
[Figure: two illustrative panels of the same data. One panel's y-axis starts at 90 (misleading); the other's starts at 0 (honest).]
Same data. Truncating the y-axis makes a 4-point rise look like a leap. Bar charts should start at zero.
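The distortion can be put in numbers. This sketch compares how tall two bars are drawn relative to each other under each baseline; the values 92 and 96 are assumed for illustration (the slide only states a 4-point rise and a y-axis starting at 90).

```python
def drawn_height_ratio(before, after, baseline):
    """Ratio of the two bar heights as drawn when the y-axis starts at `baseline`."""
    return (after - baseline) / (before - baseline)

before, after = 92.0, 96.0  # assumed illustrative values: a 4-point rise

truncated = drawn_height_ratio(before, after, baseline=90.0)
honest = drawn_height_ratio(before, after, baseline=0.0)

print(f"axis from 90: second bar drawn {truncated:.2f}x as tall")
print(f"axis from 0:  second bar drawn {honest:.2f}x as tall")
```

With these assumed values the truncated axis draws the second bar three times as tall for a change of about 4%.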
When the scatter becomes a blob
With tens of thousands of households, a scatter turns into a solid blob and hides its own density. EDA fixes this with transparency, smaller points, sampling, or binning into a 2-D histogram.
Large household surveys almost always overplot. If your scatter is a black cloud, you are seeing the count, not the pattern — thin it out.
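Binning into a 2-D histogram can be sketched without any plotting library: count points per grid cell, then shade or print the counts. The points below are simulated, not survey data, and the bin width of 0.5 is an arbitrary choice.

```python
import math
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable
# Simulated (x, y) pairs standing in for a large survey scatter; not real data
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50_000)]

def bin_2d(pairs, bin_width):
    """Count points per square cell; cell (i, j) covers [i*w, (i+1)*w) on each axis."""
    counts = Counter()
    for x, y in pairs:
        counts[(math.floor(x / bin_width), math.floor(y / bin_width))] += 1
    return counts

counts = bin_2d(points, bin_width=0.5)
densest_cell, densest_n = counts.most_common(1)[0]
print(f"{len(counts)} occupied cells; densest cell {densest_cell} holds {densest_n} points")
```

A few hundred cell counts can be read or shaded where 50,000 overlapping points cannot.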
An exploratory-plot checklist
  • Is the variable type matched to the chart (histogram for continuous, bar for categorical)?
  • Is the baseline honest, and are axes and units labelled?
  • Have you tried a log scale for skewed money?
  • Is the plot weighted if it is meant to describe the population?
  • Did you disaggregate to check for hidden subgroups?
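On the log-scale item in the checklist above: a quick numeric check of what the transform does to skewed money values. The expenditure figures are invented, and the sketch assumes all values are positive (zeros need separate handling before taking logs).

```python
import math
import statistics

# Toy expenditure values in rupees, invented for illustration; one very rich household
mpce = [1800, 2100, 2300, 2500, 2600, 2900, 3100, 3400, 3900, 45000]
log_mpce = [math.log10(v) for v in mpce]  # requires strictly positive values

# Mean / median is a crude skew signal: close to 1 means roughly symmetric
raw_gap = statistics.mean(mpce) / statistics.median(mpce)
log_gap = statistics.mean(log_mpce) / statistics.median(log_mpce)

print(f"raw scale: mean/median = {raw_gap:.2f}")
print(f"log scale: mean/median = {log_gap:.2f}")
```

On the raw scale one household drags the mean far above the median; on the log scale the two nearly coincide, which is why a log axis makes the bulk of the distribution visible.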
10
Section Ten
From EDA to Questions & Hypotheses
EDA ends where testing begins
Exploration is generative: it surfaces patterns and hunches. But a pattern found by exploring is a hypothesis, not a result. The next, separate step is to test it — ideally on fresh data.
01  EDA: notice a pattern (median consumption lower in one social group)
02  QUESTION: is the gap real, or sampling noise?
03  HYPOTHESIS: state it precisely, in advance
04  TEST: confirm with appropriate, weighted methods
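When no fresh data exists, one common stand-in is to set aside part of the sample before exploring and test only on the untouched part. A minimal sketch, assuming a hypothetical list of unit IDs; in a clustered survey it is usually safer to pass cluster (PSU) IDs rather than household IDs so whole clusters stay together.

```python
import random

def split_explore_confirm(unit_ids, explore_share=0.5, seed=2024):
    """Randomly split unit IDs into an exploration set and a held-back confirmation set."""
    ids = sorted(unit_ids)      # sort first so the split depends only on the seed
    rng = random.Random(seed)   # local generator; the seed value itself is arbitrary
    rng.shuffle(ids)
    cut = int(len(ids) * explore_share)
    return set(ids[:cut]), set(ids[cut:])

psu_ids = range(1, 201)  # hypothetical PSU identifiers
explore, confirm = split_explore_confirm(psu_ids)
print(f"explore on {len(explore)} PSUs, confirm on {len(confirm)} PSUs")
```

Record the seed in the script so the same split can be reproduced later.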
What EDA can legitimately conclude
  • Describe the sample: distributions, gaps, structure
  • Reveal data-quality problems and their extent
  • Compare groups descriptively (with weights, with caution)
  • Suggest relationships worth testing
  • Generate hypotheses and sharpen questions
What EDA cannot conclude
  • Prove causation — a pattern is not a cause
  • Confirm a hypothesis it was used to find
  • Establish significance by eyeballing — that needs a test
  • Generalise below the survey's design level
The cardinal sin: exploring until something 'looks significant', then reporting it as a confirmed finding. That is p-hacking by another name.
Explore enough, and noise looks real
Slice a dataset twenty ways and, at the conventional 5% level, about one slice can be expected to look 'significant' by chance alone, even when nothing real is going on. Exploration is meant to roam, which is exactly why its findings must be confirmed elsewhere.
Be honest about how many things you looked at. A surprising subgroup result found on the twentieth cut deserves scepticism, not a headline.
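The arithmetic behind the twenty-cuts warning, as a sketch. It assumes a 5% threshold, no real effect in any slice, and independent slices; real survey subgroups overlap, so treat the numbers as rough.

```python
ALPHA = 0.05    # assumed conventional significance level
N_SLICES = 20   # number of ways the data was cut

# Expected number of slices that look 'significant' by chance alone
expected_false_alarms = N_SLICES * ALPHA

# Chance that at least one slice looks 'significant', assuming independent slices
p_at_least_one = 1 - (1 - ALPHA) ** N_SLICES

# Bonferroni correction: a stricter per-slice threshold that accounts for all 20 looks
bonferroni_alpha = ALPHA / N_SLICES

print(f"expected false alarms: {expected_false_alarms:.1f}")
print(f"P(at least one false alarm): {p_at_least_one:.2f}")
print(f"Bonferroni per-slice threshold: {bonferroni_alpha:.4f}")
```

Under these assumptions there is roughly a two-in-three chance of at least one spurious pattern.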
Document findings so they travel
  • State each finding with its denominator and unit
  • Say whether it is weighted, and at what level it holds
  • Note the data-quality caveats behind it
  • Separate what EDA suggested from what was tested
A finding without its caveats is a liability. The caveats are what make it usable by someone else — and by future-you.
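One way to make the caveats travel with the number is to store each finding as a small structured record. The field names and the example finding below are hypothetical, chosen only to mirror the four bullets above.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    statement: str
    denominator: str      # who or what the figure is a share of
    unit: str             # household, person, ...
    weighted: bool
    valid_level: str      # lowest level at which the estimate holds
    caveats: list = field(default_factory=list)
    status: str = "suggested by EDA"  # change to "formally tested" only after a test

    def as_text(self):
        """One-paragraph summary that keeps the caveats attached to the claim."""
        weight_note = "weighted" if self.weighted else "unweighted"
        caveat_note = "; ".join(self.caveats) if self.caveats else "none recorded"
        return (
            f"{self.statement} [{self.status}] "
            f"Denominator: {self.denominator}. Unit: {self.unit}. "
            f"Estimate is {weight_note}, valid at {self.valid_level} level. "
            f"Caveats: {caveat_note}."
        )

# Hypothetical example, not a real result
finding = Finding(
    statement="Median MPCE appears lower in group X than in group Y.",
    denominator="all sampled rural households",
    unit="household",
    weighted=True,
    valid_level="state",
    caveats=["12% of households missing consumption data"],
)
text = finding.as_text()
print(text)
```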
From exploration to a clear claim
When EDA is done, translate it into plain language a decision-maker can act on: what is typical, where the gaps are, who is missing, and which questions still need a formal test.
Far better an approximate answer to the right question than an exact answer to the wrong one.
— John W. Tukey
The judgement EDA builds
  • Look before you model — always plot and summarise first
  • Know your unit, codes and universe before any statistic
  • Why a value is missing matters more than how many
  • Weight any number meant to describe the population
  • A found pattern is a question, not yet an answer
11
Section Eleven
Tools & Reproducibility
Tools for household-survey EDA
Tool | Good for | Note
R + tidyverse | Cleaning, plotting, reproducible analysis | Free; survey & srvyr packages handle weights
Python + pandas | Cleaning, large data, automation | Free; samplics for survey design, statsmodels for weighted models and cluster-robust errors
Stata | Standard for official microdata | svyset built-in for weights & design; widely used
Spreadsheets | Quick first look, small tables | Fine to start; not for weighted survey estimates
Whichever you pick, choose a tool that understands survey weights and clustering — spreadsheets do not.
Scripts, not manual edits
Fragile: Hand-edits in a spreadsheet, no record of what changed. Next month nobody can reproduce the number, including you.
Robust: A documented script from raw extract to result. Re-run it any time, audit every step, hand it to a colleague.
Golden rule, restated: never edit the raw file. Every clean, recode and exclusion lives in code.
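A sketch of what "every recode lives in code" can look like. The inline text stands in for a read-only raw file, and the column names and missing-value codes (99, -1) are assumptions for illustration; take the real codes from the codebook.

```python
import csv
import io

# Stands in for the raw extract, which the script only ever reads
RAW_EXTRACT = """hh_id,hh_size,mpce
101,4,2500
102,99,3100
103,6,-1
"""

# Assumed missing-value codes per column; the real ones come from the codebook
MISSING_CODES = {"hh_size": {"99"}, "mpce": {"-1"}}

def clean_rows(raw_text):
    """Return cleaned rows plus a log of every recode; the raw text is never modified."""
    cleaned, log = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        for column, codes in MISSING_CODES.items():
            if row[column] in codes:
                log.append(f"hh_id {row['hh_id']}: {column}={row[column]} recoded to missing")
                row[column] = ""  # empty string marks missing in the cleaned output
        cleaned.append(row)
    return cleaned, log

cleaned, log = clean_rows(RAW_EXTRACT)
for entry in log:
    print(entry)
```

Because the recodes sit in `MISSING_CODES` and the log, anyone can see what changed and re-run it on a new extract.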
A tidy, reproducible project
  • Keep the raw extract read-only and dated
  • One script does the cleaning, another the analysis
  • Name files with versions and dates, not 'final_FINAL_v3'
  • Record the source, download date and any filters applied
  • Keep the codebook beside the data
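The "record the source, download date and filters" step can be automated with a small sidecar file written next to the raw extract. This is one possible layout, not a standard: the file naming, the JSON fields and the demo values are all assumptions. The checksum lets you confirm later that the raw file has not changed.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def write_provenance(raw_path, source, download_date, filters):
    """Write <raw file name>.provenance.json beside the raw file and return the record."""
    raw_path = Path(raw_path)
    record = {
        "file": raw_path.name,
        "sha256": hashlib.sha256(raw_path.read_bytes()).hexdigest(),
        "source": source,
        "download_date": download_date,
        "filters": filters,
    }
    sidecar = raw_path.parent / (raw_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record

# Demo in a temporary folder with a made-up extract
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_extract_2024-01-15.csv"
    raw.write_text("hh_id,mpce\n101,2500\n", encoding="utf-8")
    record = write_provenance(
        raw,
        source="hypothetical portal download",
        download_date="2024-01-15",
        filters=["rural households only"],
    )
    sidecar_exists = (Path(tmp) / "raw_extract_2024-01-15.csv.provenance.json").exists()

print(record["file"], record["sha256"][:12])
```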
Open microdata you can explore today
  • microdata.gov.in — MoSPI's NSS & PLFS unit-level data
  • dhsprogram.com — NFHS / DHS datasets (on request)
  • data.gov.in — India's open government data portal
  • censusindia.gov.in — Census tables and maps
  • World Bank Microdata Library — regional comparisons
Always download the codebook and survey report alongside the data — the file is unreadable without them.
A short reading list
  • Exploratory Data Analysis — John W. Tukey (the founding text)
  • R for Data Science — Wickham & Grolemund (free online)
  • The Visual Display of Quantitative Information — Edward Tufte
  • Analysis of Health Surveys — Korn & Graubard (survey design)
  • Your survey's own report and methodology note — read it first
Pair this deck with ImpactMojo's Data Literacy, Quantitative Methods and Research Ethics 101 courses.
Exploratory Data Analysis 101 · Complete
Now go open
the data — and look.
CC BY-NC-ND 4.0·Free Forever·ImpactMojo 101 Series