ImpactMojo · Multivariate Analysis 101 · www.impactmojo.in
ImpactMojo 101 Series · Free Forever
Multivariate
Analysis
101
Controlling for Several Things at Once — Multiple Regression and Its Cousins for Development Practitioners in South Asia
Research-Backed · South Asia Focus · 100 Slides · Free Access
What We Cover
01 From Bivariate to Multivariate
02 Multiple Linear Regression
03 Interpreting Coefficients
04 Model Fit & Inference
05 Assumptions & Residual Diagnostics
06 Multicollinearity
07 Interactions & Non-Linearity
08 Logistic Regression
09 Data Reduction: PCA & Factor Analysis
10 Pitfalls
11 Reporting & Tools
01
Section One
From Bivariate to Multivariate
One outcome, one predictor
Bivariate analysis looks at two variables at a time: child stunting and household income; learning scores and class size. A simple regression or correlation summarises how they move together.
Bivariate
An analysis of the relationship between exactly two variables — one outcome and one predictor — with nothing else held constant.
The trouble: in the real world, almost nothing varies one at a time. Income, education, caste and location all move together.
Outcomes have many causes at once
Whether a child is stunted depends on income, the mother's education, sanitation, diet, birth order and more — simultaneously. Look at any one in isolation and you mix up its effect with all the others.
01 Mother's education
02 Household income
03 Sanitation & water
04 → Child stunting
Multivariate analysis is simply analysis that handles several predictors at the same time.
Confounding: the lurking variable
Confounder
A variable that influences both the predictor and the outcome, creating a misleading association between them when it is left out of the analysis.
Districts with more private clinics may show worse average health — not because clinics harm anyone, but because clinics open where the population is older and sicker. Age confounds the clinic–health link.
Why a raw comparison misleads
Suppose richer households both eat more diverse diets and have less stunting. A bivariate look at 'diet diversity vs stunting' credits diet with the whole gap — part of which is really just income.
01 Income (confounder)
02 raises diet diversity
03 AND lowers stunting
04 → diet looks more powerful than it is
Without controlling for income, the diet coefficient is biased. This is the problem multivariate analysis exists to solve.
'Holding other variables constant'
Multiple regression estimates the effect of each predictor while statistically holding the others fixed — comparing households that differ in diet but have the same income, education and sanitation.
This is the single most important idea in the course. A multivariate coefficient is a partial effect: the contribution of one variable, net of the others in the model.
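A small simulation can make the diet-and-income story concrete. This is a sketch on synthetic data only: the "true" effects (0.10 for diet, 0.30 for income) are assumptions chosen for the demo, and `ols` is a hypothetical helper built on NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic households: income is the confounder
income = rng.normal(size=n)
diet = 0.8 * income + rng.normal(size=n)                  # income raises diet diversity
haz = 0.10 * diet + 0.30 * income + rng.normal(size=n)    # assumed true effects

def ols(y, *xs):
    """Least-squares coefficients; element 0 is the intercept."""
    X = np.column_stack([np.ones(len(y)), *xs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_bivariate = ols(haz, diet)[1]          # diet alone: absorbs part of income's effect
b_adjusted = ols(haz, diet, income)[1]   # diet, holding income constant

print(f"bivariate slope: {b_bivariate:.3f}")
print(f"adjusted slope:  {b_adjusted:.3f}")
```

Under these assumptions the bivariate slope sits near 0.25 in large samples, while the adjusted slope returns to roughly the assumed 0.10. The gap is income's effect being credited to diet.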
Control is not the same as causation
Holding observed variables constant removes those confounders — but only those. Anything you did not measure (motivation, local prices, unobserved health) can still bias the estimate.
Controlling for what you can measure is necessary but not sufficient for causal claims. Keep your conclusions honest about what remains unobserved.
The words analysts use
Term | Also called | Meaning
Outcome | Dependent variable, Y | What you are trying to explain
Predictor | Independent variable, X, covariate | What you use to explain it
Coefficient | Slope, β | Effect of a predictor on the outcome
Control | Adjust for, condition on | Hold a variable constant
Residual | Error, e | What the model fails to predict
We will use 'predictor' and 'covariate' interchangeably; both just mean a right-hand-side variable.
Where this course goes
The workhorse
  • Multiple linear regression
  • Reading coefficients correctly
  • Model fit, inference, diagnostics
Beyond the basics
  • Interactions and non-linearity
  • Logistic regression for yes/no outcomes
  • PCA & factor analysis; pitfalls; tools
Examples lean on NFHS, PLFS and the kind of data you actually meet at work.
02
Section Two
Multiple Linear Regression
Predicting Y from several Xs
Multiple linear regression models the outcome as a straight-line combination of predictors plus an error term:
Y = β0 + β1X1 + β2X2 + … + βkXk + e
It is the same machinery as simple regression — just with several Xs at once. Everything that follows builds on this one line.
What each symbol means
Symbol | Name | What it is
Y | Outcome | What you predict (e.g. test score)
β0 | Intercept | Predicted Y when every X = 0
β1…βk | Slope coefficients | Partial effect of each predictor
X1…Xk | Predictors | Your covariates
e | Error / residual | Everything the model misses
The intercept is rarely interesting on its own — 'all Xs = 0' is often impossible (a household with zero adults). The slopes carry the story.
An example we will reuse
Suppose we model a child's height-for-age z-score (an anthropometric outcome) on three predictors:
HAZ = β0 + β1(income) + β2(mother's years of schooling) + β3(improved toilet) + e
All coefficients in this deck are illustrative — chosen to teach interpretation, not reported as real NFHS findings.
Fitting a plane, not a line
With one predictor, regression fits a line through a cloud of points. With two predictors, it fits a plane; with many, a hyper-plane you cannot draw. The idea is unchanged: find the surface closest to the data.
'Closest' has a precise meaning — the surface that makes the prediction errors as small as possible, in a specific sense.
Ordinary least squares (OLS)
Ordinary least squares
The method that chooses the coefficients minimising the sum of the squared residuals — the squared vertical gaps between each observed Y and the value the model predicts.
Squaring the residuals penalises big misses heavily and treats over- and under-prediction symmetrically. The software solves it instantly; your job is to set up and read the model well.
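A minimal sketch of what "least squares" means in practice, using NumPy's `lstsq` on six made-up children. All values are hypothetical and chosen only to show the mechanics; they are not NFHS data.

```python
import numpy as np

# Six hypothetical children: mother's schooling (years), improved toilet (0/1), HAZ
schooling = np.array([0, 2, 5, 8, 10, 12], dtype=float)
toilet = np.array([0, 0, 1, 0, 1, 1], dtype=float)
haz = np.array([-2.1, -1.9, -1.4, -1.5, -0.9, -0.8])

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones_like(haz), schooling, toilet])
beta, *_ = np.linalg.lstsq(X, haz, rcond=None)

fitted = X @ beta
residuals = haz - fitted               # observed minus fitted
ssr = float(residuals @ residuals)     # the quantity OLS minimises

# Nudging a coefficient away from the OLS solution can only raise the SSR
nudged = beta + np.array([0.0, 0.01, 0.0])
ssr_nudged = float(((haz - X @ nudged) ** 2).sum())

print("coefficients:", beta.round(3))
print("SSR at OLS solution:", round(ssr, 4), "| after nudge:", round(ssr_nudged, 4))
```

Two properties worth noticing: with an intercept in the model the residuals sum to zero, and any other choice of coefficients gives a larger sum of squared residuals.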
Each slope is a partial effect
In our HAZ model, β2 is the change in height-for-age associated with one more year of the mother's schooling, holding income and toilet access constant.
Compared with a bivariate estimate, the multivariate slope strips out the part of schooling's apparent effect that was really income or sanitation in disguise.
Prediction = signal + leftover
For each child the model gives a fitted value (its best guess of HAZ) and a residual (observed − fitted). The residual is what the predictors could not explain.
01 Observed Y
02 − Fitted Y (the model's guess)
03 = Residual (the leftover)
04 → diagnostics live in the residuals
Start simple, then complicate
Linear regression is the default workhorse because it is transparent, fast, and surprisingly flexible — you can add interactions, squared terms and log transforms without leaving the framework.
Master the linear model and you have the scaffolding for almost every method that follows, including logistic regression.
03
Section Three
Interpreting Coefficients
How to read any coefficient
Read every slope with one template: 'A one-unit increase in X is associated with a β-unit change in Y, holding the other variables constant.'
Get this sentence right and most interpretation errors vanish. The three load-bearing phrases are one-unit, associated with, and holding the others constant.
A coefficient is glued to its units
If income is measured in rupees, βincome is the effect of one extra rupee — a tiny number. Measure income in thousands of rupees and the coefficient is 1,000 times larger. The relationship is identical; only the scale changed.
Always state units. A coefficient of '0.002' is meaningless until you know 0.002 of what, per what.
Worked reading: a continuous X
Illustrative result: βschooling = 0.06 in our HAZ model.
Read it as: 'Each additional year of the mother's schooling is associated with a 0.06 higher height-for-age z-score, holding income and toilet access constant.' Five extra years ≈ 0.30 z-score.
Illustrative figure — not a reported NFHS coefficient.
Encoding categories as 0/1
Dummy variable
A 0/1 indicator standing for a category — e.g. improved toilet = 1, else 0. Its coefficient is the gap between that category and the baseline.
For a dummy, 'a one-unit increase' simply means switching from 0 to 1 — from the baseline group to the indicated group.
Every dummy needs something to compare to
A categorical variable with k categories becomes k−1 dummies; the omitted one is the baseline. For religion with Hindu/Muslim/Christian/Other, if 'Hindu' is omitted, each coefficient is that group's gap relative to Hindu.
Never interpret a dummy without naming the baseline. 'β = −0.4 for urban' means nothing until you know it is relative to rural.
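As a sketch of the k−1 rule, the hypothetical helper below (`to_dummies` is not from any library) turns the four-category religion example into three 0/1 indicators, with 'Hindu' as the omitted baseline.

```python
categories = ["Hindu", "Muslim", "Christian", "Other"]
baseline = "Hindu"   # the omitted category every coefficient is compared to

def to_dummies(value, categories, baseline):
    """Return one 0/1 indicator per non-baseline category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"is_{c.lower()}": int(value == c) for c in categories if c != baseline}

rows = ["Hindu", "Muslim", "Other"]
encoded = [to_dummies(r, categories, baseline) for r in rows]

for raw, dummies in zip(rows, encoded):
    print(raw, dummies)
```

A baseline household is the row where every indicator is 0, so each dummy's coefficient reads as that group's gap relative to the baseline.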
Reading a 0/1 coefficient
Illustrative result: βimproved toilet = 0.18 (baseline = no improved toilet).
Read it as: 'Children in households with an improved toilet have, on average, a 0.18 higher height-for-age z-score than otherwise-similar children without one.' The phrase 'otherwise-similar' is the controls doing their work.
Illustrative figure.
Unstandardised vs standardised coefficients
Aspect | Unstandardised (b) | Standardised (β)
Units | Original (rupees, years) | Standard deviations
Reads as | Effect of 1 real unit | Effect of a 1-SD change
Comparable across Xs? | No, different scales | Yes, common scale
Best for | Real-world meaning | Ranking relative importance
Standardised coefficients let you ask 'which predictor matters most?' — but lose the plain-language 'per rupee' meaning. Report unstandardised for interpretation, standardised for comparison.
Direction first, then magnitude
  • Sign: + means Y rises with X; − means Y falls as X rises
  • Size: how much Y moves per unit of X — in context
  • Always ask 'is this big?' — a 0.02 z-score gain may be real but trivial; a 0.4 gain may be programme-changing
A coefficient is not 'important' just because it is non-zero. Judge magnitude against what would matter for your decision.
04
Section Four
Model Fit & Inference
R²: variance explained
R² (R-squared)
The share of the variation in the outcome that the model's predictors explain — ranging from 0 (explains nothing) to 1 (explains everything).
An R² of 0.34 means the predictors account for 34% of the variation in the outcome; the other 66% is residual — unmeasured causes and noise.
R² as a slice of total variation
[Figure: Variation in the outcome, explained vs unexplained (illustrative)]
In social data, R² values of 0.1–0.4 are common and not shameful — human behaviour is genuinely noisy. A high R² is not the goal; an honest, well-specified model is.
R² only ever goes up
Add any predictor — even random noise — and R² will rise or stay flat; it can never fall. So a bigger R² does not prove the new variable belongs in the model.
Chasing R² by piling in predictors is a recipe for overfitting. You need a measure that penalises needless complexity.
Adjusted R²
Adjusted R²
A version of R² that penalises each extra predictor. It rises only when a new variable improves the model by more than chance would predict — and can fall when you add a useless one.
Compare models on adjusted R², not raw R². If adjusted R² drops when you add a variable, that variable adds less explanatory power than its complexity penalty costs.
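The penalty can be written out directly. The sketch below uses the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors excluding the intercept. The R² values and sample size are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (intercept not counted)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: a 4th predictor nudges raw R² from 0.340 to 0.341 with n = 200
before = adjusted_r2(0.340, n=200, k=3)
after = adjusted_r2(0.341, n=200, k=4)

print(f"adjusted R² with 3 predictors: {before:.4f}")
print(f"adjusted R² with 4 predictors: {after:.4f}")
```

Here raw R² rises slightly, yet adjusted R² falls from about 0.330 to about 0.327: the extra predictor did not earn its keep.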
The F-test
The F-test asks a whole-model question: do the predictors together explain more than nothing? Its null hypothesis is that all slope coefficients are zero at once.
A small F-test p-value says 'this model beats a model with no predictors'. It does not tell you which predictor matters — that is the t-tests' job.
t-tests on each coefficient
Each coefficient gets its own t-test: is this particular slope distinguishable from zero, given its uncertainty? The output is a t-statistic and a p-value per predictor.
Standard error
The estimated uncertainty in a coefficient. t = coefficient ÷ standard error; a large |t| (small p) means an estimate this far from zero would be unlikely if the true slope were zero.
Confidence intervals beat stars
A coefficient of 0.06 with a 95% confidence interval of 0.02 to 0.10 says the plausible effect lies in that band. The interval shows both direction and precision — far more than a lone p-value or a row of asterisks.
If a 95% interval excludes zero, the effect is 'significant' at the 5% level, but always read the width. A wide interval means you really do not know much.
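The link between a coefficient, its standard error, the t-statistic and the interval can be sketched in a few lines. This uses a large-sample normal approximation (an assumption; regression software typically uses the t distribution) and a hypothetical standard error of 0.0204, chosen so the interval matches the 0.02 to 0.10 example.

```python
from statistics import NormalDist

def t_and_ci(coef, se, level=0.95):
    """t-statistic, two-sided p-value and confidence interval (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)      # about 1.96 for a 95% interval
    t = coef / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p, (coef - z * se, coef + z * se)

# Hypothetical: schooling coefficient 0.06 with standard error 0.0204
t, p, (lo, hi) = t_and_ci(0.06, 0.0204)

print(f"t = {t:.2f}, p = {p:.4f}")
print(f"95% CI: {lo:.3f} to {hi:.3f}")
```

Halving the standard error would halve the interval's width around the same point estimate, which is the sense in which the interval reports precision as well as direction.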
Coefficients with confidence intervals
[Figure: Illustrative coefficients on height-for-age, with 95% CIs. Illustrative; not reported NFHS estimates.]
Bars are the point estimates; the listed intervals (e.g. toilet: 0.07–0.29) are the 95% CIs. None crosses zero here, so each is 'significant' — but the toilet effect is the least precise.
Statistically real, practically tiny
With a large sample — and NFHS has hundreds of thousands of records — even a microscopic effect can be 'statistically significant'. Significance is about certainty, not size.
Always report the effect size and its units alongside the p-value. 'Significant' answers 'is it real?', never 'does it matter?'
05
Section Five
Assumptions & Residual Diagnostics
OLS rests on assumptions
Least-squares estimates are trustworthy only if a few assumptions roughly hold. Violations do not always bias the coefficients, but they can wreck the standard errors — and so the p-values and intervals.
Most assumptions are checked by looking at the residuals — the model's leftovers. Plotting them is non-negotiable.
What linear regression assumes
Assumption | Plain meaning | Check with
Linearity | The true relationship is a straight line | Residual vs fitted plot
Independence | Observations don't lean on each other | Study design; clustering
Homoscedasticity | Error spread is constant across X | Residual vs fitted plot
Normal residuals | Errors are roughly bell-shaped | Q–Q plot, histogram
Note what is not required: the predictors themselves need not be normal, and Y need not be normal — only the residuals.
Is a straight line the right shape?
If the real relationship curves — income's effect on nutrition flattening at high incomes — a straight-line model will systematically over- and under-predict in patterns.
Diagnosis: plot residuals against fitted values. A curve or a smile in that plot says the linearity assumption is failing — consider a transform or a squared term (Section 7).
When observations cluster
Children in the same village share water, clinics and shocks, so their outcomes are correlated. Treating them as fully independent makes standard errors look smaller than they are — falsely confident results.
Survey data like NFHS is clustered by design. Use clustered or survey-adjusted standard errors, or you will overstate significance.
Constant error spread
Homoscedasticity
The residuals have roughly the same spread across all fitted values. Its opposite, heteroscedasticity, is a fan or funnel shape in the residual plot.
Income data is a classic offender: the rich vary far more in spending than the poor, so residuals widen as predicted spending rises. Coefficients stay unbiased, but the standard errors are wrong.
Good residuals vs a heteroscedastic fan
[Figure: Residual vs fitted, even band (good) vs widening fan (bad). Illustrative.]
The green band stays flat; the red points fan out as the fitted value grows. The fan is the warning sign — reach for robust standard errors or a log transform.
The Q–Q plot
Inference (t-tests, CIs) assumes the residuals are roughly normal. Check with a Q–Q plot: if the residuals are normal, the points hug the diagonal line; fat tails or skew show as departures at the ends.
Good news: with a large sample, mild non-normality barely matters thanks to the central limit theorem. Worry most about it in small samples.
Plot first, conclude later
  • Residual vs fitted: checks linearity and equal spread at once
  • Q–Q plot: checks normality of residuals
  • Leverage / influence plot: finds the points bending the fit
  • Scale–location plot: a sharper look at spread
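Anscombe's quartet is the classic lesson: four datasets with near-identical summary statistics and regression lines look completely different once plotted. Numbers alone hide trouble, so run the diagnostic plots on every model before you trust a single coefficient.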
06
Section Six
Multicollinearity
When predictors overlap
Multicollinearity
When two or more predictors are highly correlated with each other, so they carry overlapping information about the outcome.
Household income, monthly expenditure and asset ownership all measure roughly the same thing — living standards. Put all three in a model and the regression struggles to separate their individual effects.
Unstable, imprecise coefficients
When predictors overlap, the model cannot tell which one deserves the credit. The result: coefficients with huge standard errors, wild swings if you add or drop a variable, and sometimes nonsensical signs.
Crucially, multicollinearity does not bias the coefficients — it makes them imprecise. The overall prediction can still be fine; the individual effects become untrustworthy.
How to spot it
  • A high overall R² and significant F-test, but no individual predictor is significant
  • Coefficients flip sign or size when you add/remove a variable
  • Implausibly large standard errors on variables you expected to matter
  • Predictors you know are related (income & expenditure) are both in the model
These symptoms point you toward a formal diagnostic — the variance inflation factor.
The Variance Inflation Factor (VIF)
VIF
For each predictor, how much its coefficient's variance is inflated by correlation with the other predictors. VIF = 1 means no overlap; higher means more.
It is computed by regressing each predictor on all the others. The more predictable a variable is from the rest, the higher its VIF — and the shakier its coefficient.
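The recipe above translates almost line for line into code: regress each predictor on the rest, take that regression's R², and compute VIF = 1/(1 − R²). This is a sketch on synthetic data; `vif` is a hypothetical helper, and the near-duplicate 'expenditure' variable is an assumption built in to trigger the problem.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only; no intercept column)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors
        others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))   # undefined if a predictor is perfectly predictable
    return np.array(out)

rng = np.random.default_rng(1)
n = 1000
income = rng.normal(size=n)
expenditure = income + 0.2 * rng.normal(size=n)   # near-duplicate of income (assumed)
schooling = rng.normal(size=n)                    # unrelated to the other two (assumed)

vifs = vif(np.column_stack([income, expenditure, schooling]))
print(dict(zip(["income", "expenditure", "schooling"], vifs.round(1))))
```

In this setup income and expenditure land far above the usual 5–10 threshold while schooling stays close to 1, matching the pattern the rule of thumb is meant to catch.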
How high is too high?
[Figure: Illustrative VIFs; expenditure and income overlap badly]
Common rule of thumb: VIF above 5–10 signals a problem. Here income and expenditure both blow past it — they are measuring the same underlying wealth.
What to do about it
Remedy | How | When
Drop one | Remove a redundant predictor | Two variables measure the same thing
Combine | Build one index (e.g. PCA) | Several proxies for one concept
Centre / rescale | Subtract the mean before squaring/interacting | Collinearity from interaction terms
Get more data | Larger / more varied sample | Overlap is mild, not structural
Accept it | Keep the model and accept wider CIs | You only care about prediction
If two predictors are basically the same concept, do not agonise — keep one, or fold them into a single index. The wealth index in Section 9 is exactly this move.
Don't 'fix' collinearity you need
If two predictors are genuinely distinct concepts that happen to correlate — education and income — dropping one can reintroduce confounding. The cure may be worse than the disease.
Decide based on your question. If you need the partial effect of education net of income, keep both and accept wider intervals rather than deleting a real confounder.
Multicollinearity in one breath
  • It is about predictors overlapping with each other, not with Y
  • It inflates standard errors; it does not bias coefficients
  • Diagnose with VIF (watch for > 5–10) and unstable signs
  • Remedy by dropping, combining, or simply accepting wider uncertainty
07
Section Seven
Interactions & Non-Linearity
When one slope isn't enough
The plain model assumes each predictor's effect is the same for everyone and constant at every level. Reality is richer: effects can depend on other variables, or change as a variable grows.
Interaction
X's effect depends on another variable (e.g. schooling helps more where toilets exist).
Non-linearity
X's effect changes with its own level (e.g. income matters more at the bottom).
An effect that depends on context
Interaction term
A predictor formed by multiplying two variables (X1 × X2). Its coefficient captures how the effect of one variable changes across levels of the other.
Model: HAZ = … + β1(schooling) + β2(toilet) + β3(schooling × toilet). β3 tells you whether schooling's payoff differs for households with and without a toilet.
Two slopes from one interaction
[Figure: Schooling's effect on HAZ, by toilet access (illustrative)]
The steeper green line means schooling helps more where sanitation is in place — an interaction. Two different slopes, one model. Without the interaction term you would force them to be parallel.
The main effect changes meaning
Once you include X1×X2, the coefficient on X1 alone is no longer 'the effect of X1'. It is the effect of X1 when X2 = 0. The full effect of X1 is β1 + β3·X2.
This trips people up constantly. In a model with interactions, never read a main-effect coefficient in isolation — always say 'at what level of the other variable?'
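The rule "effect of X1 = β1 + β3·X2" is short enough to compute by hand, but writing it out guards against reading β1 alone. The coefficients below are hypothetical, not estimates from any survey.

```python
def marginal_effect(b1, b3, x2):
    """Effect of a one-unit rise in X1 at a given level of X2, in a model with X1*X2."""
    return b1 + b3 * x2

# Hypothetical coefficients: schooling main effect and schooling x toilet interaction
b_schooling = 0.04
b_interaction = 0.03

effect_no_toilet = marginal_effect(b_schooling, b_interaction, x2=0)
effect_with_toilet = marginal_effect(b_schooling, b_interaction, x2=1)

print(f"per year of schooling, no toilet:   {effect_no_toilet:.2f}")
print(f"per year of schooling, with toilet: {effect_with_toilet:.2f}")
```

With these numbers the main-effect coefficient (0.04) describes only households without a toilet; for households with one, the slope is 0.07.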
Quadratic and polynomial terms
Add a squared term, β1X + β2X², to bend the line into a curve. This captures diminishing returns (income's effect flattening) or inverted-U shapes (earnings rising with age, then falling).
Centre X before squaring to tame the collinearity between X and X², and resist going past a quadratic — high-order polynomials wiggle wildly and overfit.
Taming skew, reading elasticities
Logging a right-skewed variable like income compresses its long tail and often straightens a curved relationship. It also changes how you read the coefficient — in percentage terms.
Elasticity
In a log–log model, the coefficient is an elasticity: the percentage change in Y for a 1% change in X. A coefficient of 0.4 means a 1% rise in X is linked to a 0.4% rise in Y.
Four ways logs change the reading
Model | Coefficient β reads as…
Y on X (level–level) | β-unit change in Y per 1-unit X
Y on log X (level–log) | β/100 change in Y per 1% rise in X
log Y on X (log–level) | ~100·β % change in Y per 1-unit X
log Y on log X (log–log) | β % change in Y per 1% change in X (elasticity)
Logs are the practitioner's friend for money variables: they fix skew, ease heteroscedasticity, and give intuitive percentage interpretations.
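The '~' in the log–level row matters once coefficients get large. The exact percentage change is 100·(exp(β) − 1); the sketch below compares it with the 100·β shortcut for two hypothetical coefficients.

```python
import math

def pct_change_log_level(beta):
    """Exact % change in Y for a one-unit rise in X when the outcome is log(Y)."""
    return 100 * (math.exp(beta) - 1)

for beta in (0.05, 0.50):   # hypothetical coefficients
    exact = pct_change_log_level(beta)
    shortcut = 100 * beta
    print(f"beta = {beta:.2f}: shortcut {shortcut:.1f}%, exact {exact:.1f}%")

small = pct_change_log_level(0.05)
large = pct_change_log_level(0.50)
```

For β = 0.05 the shortcut and the exact figure are nearly the same (5% vs about 5.1%); for β = 0.50 they part ways (50% vs about 64.9%), so use the exact formula for large coefficients.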
08
Section Eight
Logistic Regression
When Y is yes/no
Many development outcomes are binary: a child is fully immunised or not; a woman delivered in a facility or not; a household is below the poverty line or not. Linear regression is the wrong tool for these.
Fit a straight line to a 0/1 outcome and it will happily predict probabilities below 0 or above 1 — nonsense. We need a model that stays inside 0–1.
Model the probability with an S-curve
Logistic regression models the probability of 'yes' using an S-shaped (logistic) curve that flattens near 0 and 1, so predictions always stay in the valid range.
The trick: instead of modelling the probability directly, it models the log-odds of the outcome as a linear function of the predictors. That keeps the familiar 'linear in the Xs' structure.
From probability to log-odds
Quantity | Definition | Range
Probability p | Chance of 'yes' | 0 to 1
Odds | p ÷ (1 − p) | 0 to ∞
Log-odds (logit) | ln(odds) | −∞ to +∞
A probability of 0.8 is odds of 4 (4-to-1 on) and log-odds of about 1.39. Logistic regression's coefficients live on this log-odds scale — which is why they are not directly readable.
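The three scales convert into one another with two short functions. A minimal sketch using only the standard library; the function names are illustrative.

```python
import math

def odds(p):
    """Odds of 'yes' for a probability strictly between 0 and 1."""
    return p / (1 - p)

def logit(p):
    """Log-odds: the scale logistic regression coefficients live on."""
    return math.log(odds(p))

def inv_logit(z):
    """Back from log-odds to a probability; always lands strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

p = 0.8
print(f"p = {p}: odds = {odds(p):.2f}, log-odds = {logit(p):.2f}")
print(f"round trip: {inv_logit(logit(p)):.2f}")
print(f"log-odds of +5 and -5 map to {inv_logit(5):.3f} and {inv_logit(-5):.3f}")
```

The inverse function is the S-curve itself: however large or small the log-odds, the implied probability never leaves the 0 to 1 range.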
Coefficients are in log-odds — exponentiate them
A logistic coefficient β is the change in log-odds per one-unit increase in X. That is hard to feel. So we take exp(β) to get an odds ratio — a multiplicative effect on the odds.
This is the most-mangled idea in applied statistics: the raw coefficient is log-odds; exp(coefficient) is the odds ratio. Never read the raw logit coefficient as a probability.
Above 1, below 1, equal to 1
  • OR > 1: the predictor raises the odds of 'yes' (OR 1.5 → odds 50% higher)
  • OR < 1: the predictor lowers the odds (OR 0.7 → odds 30% lower)
  • OR = 1: no association — this is the null value
Because odds ratios are multiplicative, the null is 1, not 0. A 95% CI for an OR that includes 1 is 'not significant'.
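Going from a logit coefficient to an odds ratio with its interval is one call to `exp` at each end. The sketch below assumes a hypothetical coefficient of 0.405 with standard error 0.15 and uses 1.96 for a 95% interval; `odds_ratio` is an illustrative helper, not a library function.

```python
import math

def odds_ratio(beta, se, z=1.96):
    """Odds ratio and 95% CI from a logit coefficient and its standard error."""
    return math.exp(beta), (math.exp(beta - z * se), math.exp(beta + z * se))

# Hypothetical logistic regression output
or_, (lo, hi) = odds_ratio(beta=0.405, se=0.15)
pct_change_in_odds = 100 * (or_ - 1)

print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
print(f"odds are about {pct_change_in_odds:.0f}% higher per unit of X")
```

The interval is built on the log-odds scale and then exponentiated, so it is not symmetric around the odds ratio. Here it excludes 1, the null value.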
Odds ratios for facility delivery
[Figure: Illustrative odds ratios for facility delivery (vs home), patterned on NFHS-style predictors]
Read across the line at OR = 1: urban residence and wealth raise the odds of a facility delivery; high birth order (OR 0.62) lowers them by about 38%. All illustrative.
Odds ratios are not risk ratios
An odds ratio of 2 does not mean the probability doubles. Odds and probability diverge sharply when outcomes are common. For a rare outcome the two are close; for a common one they are not.
Say 'the odds are 85% higher', not '85% more likely'. If your audience needs probabilities, report predicted probabilities at chosen covariate values instead.
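A quick calculation shows how far odds ratios and risk ratios drift apart. The sketch applies the same odds ratio of 2 to a rare outcome (baseline probability 1%) and a common one (50%); the baselines are hypothetical.

```python
def prob_after_or(p0, odds_ratio):
    """New probability after multiplying the baseline odds by an odds ratio."""
    odds0 = p0 / (1 - p0)
    odds1 = odds0 * odds_ratio
    return odds1 / (1 + odds1)

def risk_ratio(p0, odds_ratio):
    """Ratio of new to baseline probability implied by an odds ratio."""
    return prob_after_or(p0, odds_ratio) / p0

rare = risk_ratio(0.01, 2.0)
common = risk_ratio(0.50, 2.0)

print(f"baseline 1%:  probability multiplies by {rare:.2f}")
print(f"baseline 50%: probability multiplies by {common:.2f}")
```

With a 1% baseline, doubling the odds nearly doubles the probability (about 1.98 times). With a 50% baseline, it moves the probability from 0.50 to about 0.67, only 1.33 times.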
Everything else carries over
  • Coefficients are still partial effects — net of the other predictors
  • Each odds ratio still comes with a confidence interval — report it
  • Multicollinearity, interactions and dummies all behave as before
  • Fit is judged differently (pseudo-R², classification, AUC), not by OLS R²
Logistic regression is the same way of thinking on a new scale — learn the log-odds/odds-ratio translation and you are most of the way there.
09
Section Nine
Data Reduction — PCA & Factor Analysis
Too many indicators, one concept
You may have twenty asset and housing variables — fridge, TV, motorcycle, pucca walls, electricity, toilet type — all proxying one idea: household living standards. Putting all twenty in a regression is messy and collinear.
Data reduction compresses many correlated indicators into a few summary scores that capture most of their shared information.
Principal Component Analysis
Principal Component Analysis
A technique that re-expresses many correlated variables as a smaller set of uncorrelated 'components', ordered so the first captures the most variance, the second the next-most, and so on.
The first principal component is the single weighted combination of the variables that explains the largest share of their joint variation — usually the 'size' or 'level' they share.
The scree plot
[Figure: Variance explained by each component (illustrative scree plot)]
The first component dwarfs the rest — the classic signal that one dimension (wealth) underlies the asset variables. Keep components up to the 'elbow' where the bars level off.
The DHS / NFHS wealth index
The standard DHS/NFHS wealth index is built with PCA. Asset, housing and utility variables go in; the first principal component becomes each household's wealth score — the textbook example of PCA in development.
Households are then ranked and split into five wealth quintiles. This is the 'wealth quintile' you see throughout NFHS tables — a PCA score in disguise.
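The general idea can be sketched with NumPy: standardise the asset indicators, take the first principal component as the score, rank households, and cut into five groups. This is a simplified illustration on synthetic data, not the official DHS/NFHS procedure; the six assets, their thresholds and the latent 'living standard' are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic households: an unobserved living standard drives ownership of six assets
latent = rng.normal(size=n)
thresholds = np.array([-1.0, -0.5, 0.0, 0.3, 0.8, 1.2])   # rarer assets need higher wealth
assets = (latent[:, None] + rng.normal(size=(n, 6)) > thresholds).astype(float)

# Standardise, then take the SVD: rows of Vt are the component loadings
Z = (assets - assets.mean(axis=0)) / assets.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
share = S**2 / (S**2).sum()            # variance share per component (the scree values)

loadings = Vt[0]
if loadings.sum() < 0:                 # a component's sign is arbitrary; orient it upward
    loadings = -loadings
score = Z @ loadings                   # each household's wealth score

# Rank households and cut into five equal-sized groups (ties broken by sort order)
ranks = score.argsort().argsort()
quintile = ranks * 5 // n + 1          # 1 = poorest fifth, 5 = richest fifth

print("variance share, first two components:", share[:2].round(2))
print("households per quintile:", np.bincount(quintile)[1:])
```

Because the quintiles come from ranks, each holds exactly a fifth of the sample here, however lopsided the underlying score is.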
The wealth-index distribution
[Figure: Distribution of a PCA wealth score, split into quintiles (illustrative)]
By construction the quintiles each hold ~20% of households: the index is a continuous score cut into five equal-sized groups, not five equal-width bands. The underlying score itself is continuous and, in this illustration, right-skewed.
Cousins, not twins
Aspect | PCA | Factor analysis
Goal | Summarise / compress variance | Find latent underlying factors
Direction | Variables → components | Factors → cause the variables
Model of error | None; pure re-expression | Separates shared vs unique variance
Typical use | Indices (wealth index) | Psychometrics, attitude scales
In practice they often give similar results for index-building. PCA is the default for wealth indices; factor analysis suits questionnaire scales where a true latent trait is assumed.
Reduction has costs
  • Components can be hard to name — what does PC2 mean?
  • PCA is scale-sensitive — standardise variables (or use a suitable method) first
  • An asset index reflects what is in the basket — urban-skewed assets bias it
  • Compressing always discards some information by design
Used well, data reduction tames multicollinearity and yields one clean predictor. Used blindly, it buries the very structure you wanted to study.
Data reduction in one breath
  • Many correlated indicators → a few uncorrelated summary scores
  • PCA's first component = the dominant shared dimension (often 'level' or 'wealth')
  • The NFHS/DHS wealth index is the first principal component of assets
  • Quintiles are that continuous score cut into five equal groups
10
Section Ten
Pitfalls
Overfitting
Throw enough predictors at a small sample and the model fits the noise, not the signal. It looks brilliant in your data and fails on the next dataset — it memorised rather than learned.
Symptoms: many predictors relative to observations, dazzling in-sample R², coefficients that change wildly across samples. Cure: simpler models, adjusted R², out-of-sample testing.
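To see why adjusted R² is part of the cure, here is a small sketch using the standard adjustment formula, 1 − (1 − R²)(n − 1)/(n − k − 1). The function name `adjusted_r2` and the two scenarios are invented for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for a model with k predictors (plus an intercept)
    fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same in-sample R² of 0.50 in two hypothetical studies.
print(round(adjusted_r2(0.50, n=1000, k=5), 3))   # 0.497: barely penalised
print(round(adjusted_r2(0.50, n=30, k=20), 3))    # -0.611: the fit is mostly noise
```

With 20 predictors and only 30 observations the penalty wipes out the apparent fit entirely. Adjusted R² is still an in-sample measure, so it complements a hold-out test rather than replacing one.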
Omitted-variable bias
Leave out a real confounder and its effect contaminates the variables you did include. This is the mirror image of controlling: omit income and the diet coefficient absorbs income's effect.
The bias's direction depends on how the omitted variable relates to both X and Y. Unlike multicollinearity, this one genuinely biases coefficients — it is the more dangerous problem.
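A small simulation makes the contamination visible. Everything here is made up for illustration: the data-generating numbers, the variable names and the helpers `cov`, `slope_simple` and `slope_controlled`. The true effect of diet on the outcome is set to 0.5, and income (the confounder) raises both diet and the outcome.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def slope_simple(x, y):
    """OLS slope of y on x alone."""
    return cov(x, y) / cov(x, x)

def slope_controlled(x, z, y):
    """OLS coefficient on x when z is also in the model
    (closed-form solution for two predictors)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (szz * cov(x, y) - sxz * cov(z, y)) / (sxx * szz - sxz ** 2)

random.seed(42)
n = 20000
income = [random.gauss(0, 1) for _ in range(n)]
diet = [0.8 * inc + random.gauss(0, 1) for inc in income]
height = [0.5 * d + 1.0 * inc + random.gauss(0, 1)
          for d, inc in zip(diet, income)]

without_control = slope_simple(diet, height)          # income omitted
with_control = slope_controlled(diet, income, height) # income included

print(f"diet coefficient, income omitted:  {without_control:.2f}")
print(f"diet coefficient, income included: {with_control:.2f}")
```

In this setup the omitted-income estimate should land near 0.99 rather than 0.5: the extra 0.49 is income's effect (1.0) times the slope of income on diet (0.8 / 1.64). Because income is positively related to both diet and the outcome, the bias is upward; flip one of those signs and it turns downward.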
Multicollinearity vs causality
These two pitfalls pull in opposite directions. Dropping a collinear predictor cures the imprecision — but if that predictor was a real confounder, dropping it creates omitted-variable bias.
The resolution is your question. For an unbiased causal estimate, keep the confounder and tolerate wide intervals. For pure prediction, collinearity may not matter at all.
Controlling for a mediator
Mediator
A variable on the causal path between X and Y — X causes the mediator, which in turn causes Y. It transmits the effect rather than confounding it.
Education raises income, and income improves child nutrition. Income is a mediator of education's effect — not a confounder to be controlled away.
'Controlling for' can bias the answer
If you 'control for' income while estimating education's effect on nutrition, you block the very pathway through which education works, and so understate its total effect. More controls are not always better.
Rule: control for confounders (common causes), never for mediators (intermediate effects) or things caused by the outcome. Draw the causal story before choosing controls.
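The education, income and nutrition story can be simulated to show how much a mediator control can hide. The numbers and helper names below are assumptions for the example: education has a direct effect of 0.2 on nutrition plus an indirect effect through income of 0.6 × 0.5 = 0.3, so the total effect is 0.5.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def slope_simple(x, y):
    """OLS slope of y on x alone."""
    return cov(x, y) / cov(x, x)

def slope_controlled(x, z, y):
    """OLS coefficient on x when z is also in the model."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (szz * cov(x, y) - sxz * cov(z, y)) / (sxx * szz - sxz ** 2)

random.seed(3)
n = 20000
education = [random.gauss(0, 1) for _ in range(n)]
# Income is a mediator: education causes it, and it causes nutrition.
income = [0.6 * e + random.gauss(0, 1) for e in education]
nutrition = [0.2 * e + 0.5 * m + random.gauss(0, 1)
             for e, m in zip(education, income)]

total_effect = slope_simple(education, nutrition)
after_controlling = slope_controlled(education, income, nutrition)

print(f"education coefficient, income left alone: {total_effect:.2f}")
print(f"education coefficient, income controlled: {after_controlling:.2f}")
```

Controlling for income leaves only the direct effect (about 0.2), which is the right answer to a different question. If the question is "what does education do to nutrition overall?", the uncontrolled estimate (about 0.5) is the one to report.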
Extrapolation
A model fitted on households earning ₹5,000–₹40,000 a month says nothing reliable about a household earning ₹5 lakh. Predicting outside the range of your data assumes the line keeps going — it rarely does.
Stay within the support of your data. The neat straight line is an artefact of the range you observed, not a law of nature beyond it.
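One simple guard is to check a new case against the observed range before predicting for it. The sketch below assumes a hypothetical helper `within_support` and made-up training incomes spanning the ₹5,000 to ₹40,000 range from the example.

```python
def within_support(value, observed):
    """True if value lies inside the range the model was fitted on."""
    return min(observed) <= value <= max(observed)

# Hypothetical monthly incomes (in rupees) in the fitting sample.
training_incomes = [5000, 12000, 18500, 26000, 40000]

for income in (15000, 500000):   # 5 lakh = 500,000
    if within_support(income, training_incomes):
        print(income, "inside the observed range: prediction is interpolation")
    else:
        print(income, "outside the observed range: do not trust the prediction")
```

With several predictors a per-variable range check is necessary but not sufficient, since a combination of values can be unseen even when each value on its own is within range.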
p-hacking & the forking paths
Try enough specifications — add this control, drop that one, split by region, redefine the outcome — and something will cross p < 0.05 by chance. Reporting only the winning run manufactures false findings.
Defend against it: pre-specify your model, report what you tried, show robustness across specifications, and trust replication over a single surprising result.
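How often does pure noise cross the line? The sketch below tests 200 predictors that are unrelated to the outcome by design and counts how many still reach p < 0.05. The sample size, the number of specifications and the helper `correlation` are assumptions for illustration; the cut-off of about 1.98 is the two-sided 5% critical value of t with 98 degrees of freedom.

```python
import math
import random

def correlation(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

random.seed(7)
n, n_specs = 100, 200
T_CRIT = 1.984   # approximate critical value, df = n - 2 = 98

outcome = [random.gauss(0, 1) for _ in range(n)]   # pure noise
false_hits = 0
for _ in range(n_specs):
    # Each "specification" tries a fresh predictor with no real effect.
    predictor = [random.gauss(0, 1) for _ in range(n)]
    r = correlation(predictor, outcome)
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    if abs(t) > T_CRIT:
        false_hits += 1

print(f"{false_hits} of {n_specs} noise predictors reached p < 0.05")
```

On average about 5% of such tests (roughly 10 of 200) come out 'significant' even though nothing is there. Reporting only those would look like a set of findings.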
Six traps, one habit
Pitfall | What goes wrong | Guard
Overfitting | Fits noise, fails out of sample | Simplicity, adjusted R², hold-out
Omitted-variable bias | Confounder contaminates coefficients | Include real confounders
Collinearity confusion | Drop a confounder to 'fix' VIF | Let the question decide
Controlling a mediator | Blocks the causal pathway | Don't control intermediates
Extrapolation | Predicts beyond the data | Stay within range
p-hacking | Chance result sold as finding | Pre-specify, replicate
11
Section Eleven
Reporting & Tools
How to build a model, in order
01 ASK: a precise question & outcome
02 DRAW: the causal story, confounders vs mediators
03 FIT: choose linear or logistic; add controls
04 CHECK: residuals, VIF, fit
05 REPORT: effects, CIs, caveats
Notice the thinking happens before the software runs. Drawing the causal story first is what tells you which variables to control — and which to leave alone.
What a results table must show
Predictor | Coefficient | 95% CI | p
Income (per ₹1,000) | 0.03 | 0.01 – 0.05 | <0.01
Mother's schooling (yr) | 0.06 | 0.02 – 0.10 | <0.01
Improved toilet (vs none) | 0.18 | 0.07 – 0.29 | <0.01
Intercept | −1.42 | |
Always report coefficients with confidence intervals, the sample size, R²/adjusted R², and the units. Illustrative figures shown.
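For readers who want to see where the interval column comes from, here is a minimal sketch of the usual large-sample rule: coefficient ± 1.96 standard errors. The coefficients are the illustrative ones from the table; the standard errors are hypothetical, chosen so that the intervals match those shown, and `ci95` is a name invented for the example.

```python
def ci95(coef, se):
    """Approximate 95% confidence interval: coefficient +/- 1.96 standard errors."""
    return coef - 1.96 * se, coef + 1.96 * se

# (predictor, coefficient, hypothetical standard error)
rows = [
    ("Income (per ₹1,000)", 0.03, 0.0102),
    ("Mother's schooling (yr)", 0.06, 0.0204),
    ("Improved toilet (vs none)", 0.18, 0.0561),
]

for name, coef, se in rows:
    low, high = ci95(coef, se)
    print(f"{name:<28}{coef:>6.2f}   {low:.2f} to {high:.2f}")
```

The 1.96 multiplier is the normal approximation. In small samples the matching t critical value is somewhat larger, and with survey data the standard errors themselves should come from a design-aware or cluster-robust estimator.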
Habits that earn trust
  • State your question and model before the numbers
  • Give effect sizes with units, not just stars
  • Show confidence intervals; note the sample size and any weights
  • Report the diagnostics you ran and what you found
  • Be explicit about which results are associations and which are causal claims
R, Python and Stata
Tool | Strengths | Note
R | Stats-first, superb diagnostics & graphics | Free; lm(), glm(), broom
Python (statsmodels) | Cleaning + modelling in one place | Free; pandas, scikit-learn
Stata | Survey data, clustered SEs, ubiquitous in econ | Paid; svy: prefix
SPSS | Menu-driven, common in academia | Paid; gentle on-ramp
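For NFHS/PLFS work, whatever you choose must handle survey weights and clustered standard errors: R's survey package, Stata's svy commands, or statsmodels in Python, which offers weighted fits and cluster-robust standard errors rather than a full survey-design interface.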
A short, honest reading list
  • Mostly Harmless Econometrics — Angrist & Pischke
  • Introductory Econometrics — Jeffrey Wooldridge
  • An Introduction to Statistical Learning — James et al. (free PDF)
  • Regression and Other Stories — Gelman, Hill & Vehtari
  • DHS Wealth Index methodology — Rutstein & Johnson (the PCA reference)
Pair this deck with ImpactMojo's Data Literacy, Exploratory Data Analysis and Causal Inference 101 courses.
If you remember five things
  • A coefficient is a partial effect — the rest held constant
  • Adjusted R² over R², and read confidence intervals, not just stars
  • Plot your residuals — diagnostics live there
  • Logistic coefficients are log-odds; exp() gives odds ratios (OR > 1 raises odds)
  • Control confounders, never mediators — and never claim cause from control alone
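As a quick reminder of the log-odds point in the list above, exponentiating a logistic coefficient gives the odds ratio. The two coefficients below are hypothetical and `odds_ratio` is a name invented for the example.

```python
import math

def odds_ratio(log_odds_coef):
    """Convert a logistic-regression coefficient (log-odds) to an odds ratio."""
    return math.exp(log_odds_coef)

print(round(odds_ratio(0.405), 2))    # 1.5: odds rise by about half
print(round(odds_ratio(-0.693), 2))   # 0.5: odds are roughly halved
```

An odds ratio is a ratio of odds, not of probabilities, so an OR of 1.5 does not mean the outcome is 50% more likely in probability terms.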
Multivariate Analysis 101 · Complete
Now hold the other things constant.
CC BY-NC-ND 4.0 · Free Forever · ImpactMojo 101 Series