Multivariate Analysis 101

ImpactMojo · Multivariate Analysis 101 · www.impactmojo.in

ImpactMojo 101 Series · Free Forever

Multivariate Analysis 101

Controlling for Several Things at Once — Multiple Regression and Its Cousins for Development Practitioners in South Asia

Research-Backed · South Asia Focus · 100 Slides · Free Access

Agenda

What We Cover

01 From Bivariate to Multivariate
02 Multiple Linear Regression
03 Interpreting Coefficients
04 Model Fit & Inference
05 Assumptions & Residual Diagnostics
06 Multicollinearity
07 Interactions & Non-Linearity
08 Logistic Regression
09 Data Reduction: PCA & Factor Analysis
10 Pitfalls
11 Reporting & Tools

01

Section One

From Bivariate to Multivariate

Starting Point

One outcome, one predictor

Bivariate analysis looks at two variables at a time: child stunting and household income; learning scores and class size. A simple regression or correlation summarises how they move together.

Bivariate

An analysis of the relationship between exactly two variables — one outcome and one predictor — with nothing else held constant.

The trouble: in the real world, almost nothing varies one at a time. Income, education, caste and location all move together.

The Real World

Outcomes have many causes at once

Whether a child is stunted depends on income, the mother's education, sanitation, diet, birth order and more — simultaneously. Look at any one in isolation and you mix up its effect with all the others.

Mother's education, household income, sanitation & water → child stunting

Multivariate analysis is simply analysis that handles several predictors at the same time.

The Core Problem

Confounding: the lurking variable

Confounder

A variable that influences both the predictor and the outcome, creating a misleading association between them when it is left out of the analysis.

Districts with more private clinics may show worse average health — not because clinics harm anyone, but because clinics open where the population is older and sicker. Age confounds the clinic–health link.

See It

Why a raw comparison misleads

Suppose richer households both eat more diverse diets and have less stunting. A bivariate look at 'diet diversity vs stunting' credits diet with the whole gap — part of which is really just income.

Income (confounder) → raises diet diversity AND lowers stunting → diet looks more powerful than it is

Without controlling for income, the diet coefficient is biased. This is the problem multivariate analysis exists to solve.

The Fix

'Holding other variables constant'

Multiple regression estimates the effect of each predictor while statistically holding the others fixed — comparing households that differ in diet but have the same income, education and sanitation.

This is the single most important idea in the course. A multivariate coefficient is a partial effect: the contribution of one variable, net of the others in the model.
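
A small simulation can make the partial-effect idea concrete. The sketch below invents a world in which income drives both diet diversity and HAZ, then compares the diet slope with and without income in the model. Every number, the variable names and the `ols` helper are assumptions made for illustration, not survey estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Invented data-generating process (illustrative only):
income = rng.normal(size=n)                              # the confounder
diet = 0.8 * income + rng.normal(size=n)                 # income raises diet diversity
haz = 0.10 * diet + 0.50 * income + rng.normal(size=n)   # true diet effect is 0.10


def ols(y, *predictors):
    """Return OLS coefficients as [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


bivariate_slope = ols(haz, diet)[1]         # diet alone: absorbs part of income's effect
partial_slope = ols(haz, diet, income)[1]   # diet, holding income constant

print(f"bivariate diet slope: {bivariate_slope:.2f}")
print(f"partial diet slope:   {partial_slope:.2f}")
```

With these invented settings the diet-only slope comes out well above the true 0.10, while the slope that holds income constant lands close to it.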

Caveat

Control is not the same as causation

Holding observed variables constant removes those confounders — but only those. Anything you did not measure (motivation, local prices, unobserved health) can still bias the estimate.

Controlling for what you can measure is necessary but not sufficient for causal claims. Keep your conclusions honest about what remains unobserved.

Vocabulary

The words analysts use

Term	Also called	Meaning
Outcome	Dependent variable, Y	What you are trying to explain
Predictor	Independent variable, X, covariate	What you use to explain it
Coefficient	Slope, β	Effect of a predictor on the outcome
Control	Adjust for, condition on	Hold a variable constant
Residual	Error, e	What the model fails to predict

We will use 'predictor' and 'covariate' interchangeably; both just mean a right-hand-side variable.

Roadmap

Where this course goes

The workhorse

Multiple linear regression
Reading coefficients correctly
Model fit, inference, diagnostics

Beyond the basics

Interactions and non-linearity
Logistic regression for yes/no outcomes
PCA & factor analysis; pitfalls; tools

Examples lean on NFHS, PLFS and the kind of data you actually meet at work.

02

Section Two

Multiple Linear Regression

The Model

Predicting Y from several Xs

Multiple linear regression models the outcome as a straight-line combination of predictors plus an error term:

Y = β₀ + β₁X₁ + β₂X₂ + … + β_kX_k + e

It is the same machinery as simple regression — just with several Xs at once. Everything that follows builds on this one line.

The Pieces

What each symbol means

Symbol	Name	What it is
Y	Outcome	What you predict (e.g. test score)
β₀	Intercept	Predicted Y when every X = 0
β₁…β_k	Slope coefficients	Partial effect of each predictor
X₁…X_k	Predictors	Your covariates
e	Error / residual	Everything the model misses

The intercept is rarely interesting on its own — 'all Xs = 0' is often impossible (a household with zero adults). The slopes carry the story.

Worked Setup

An example we will reuse

Suppose we model a child's height-for-age z-score (an anthropometric outcome) on three predictors:

HAZ = β₀ + β₁(income) + β₂(mother's years of schooling) + β₃(improved toilet) + e

All coefficients in this deck are illustrative — chosen to teach interpretation, not reported as real NFHS findings.

The Geometry

Fitting a plane, not a line

With one predictor, regression fits a line through a cloud of points. With two predictors, it fits a plane; with many, a hyper-plane you cannot draw. The idea is unchanged: find the surface closest to the data.

'Closest' has a precise meaning — the surface that makes the prediction errors as small as possible, in a specific sense.

How It's Fit

Ordinary least squares (OLS)

Ordinary least squares

The method that chooses the coefficients minimising the sum of the squared residuals — the squared vertical gaps between each observed Y and the value the model predicts.

Squaring the residuals penalises big misses heavily and treats over- and under-prediction symmetrically. The software solves it instantly; your job is to set up and read the model well.
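
To see what 'minimising the sum of squared residuals' means in practice, here is a minimal sketch using NumPy's least-squares solver on simulated data. The data, the coefficients used to generate it and the `ssr` helper are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)   # invented coefficients

X = np.column_stack([np.ones(n), x1, x2])   # the column of ones carries the intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)


def ssr(beta):
    """Sum of squared residuals for a candidate coefficient vector."""
    residuals = y - X @ beta
    return float(residuals @ residuals)


best = ssr(beta_hat)
nudged = ssr(beta_hat + np.array([0.0, 0.1, 0.0]))   # move one slope slightly
print("OLS beats the nudged coefficients:", best < nudged)
```

Any other coefficient vector, such as the nudged one here, gives a larger sum of squared residuals than the OLS solution.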

Key Idea

Each slope is a partial effect

In our HAZ model, β₂ is the change in height-for-age associated with one more year of the mother's schooling, holding income and toilet access constant.

Compared with a bivariate estimate, the multivariate slope strips out the part of schooling's apparent effect that was really income or sanitation in disguise.

Fitted vs Residual

Prediction = signal + leftover

For each child the model gives a fitted value (its best guess of HAZ) and a residual (observed − fitted). The residual is what the predictors could not explain.

Observed Y − Fitted Y (the model's guess) = Residual (the leftover) → diagnostics live in the residuals

Why Linear First

Start simple, then complicate

Linear regression is the default workhorse because it is transparent, fast, and surprisingly flexible — you can add interactions, squared terms and log transforms without leaving the framework.

Master the linear model and you have the scaffolding for almost every method that follows, including logistic regression.

03

Section Three

Interpreting Coefficients

The Golden Sentence

How to read any coefficient

Read every slope with one template: 'A one-unit increase in X is associated with a β-unit change in Y, holding the other variables constant.'

Get this sentence right and most interpretation errors vanish. The three load-bearing phrases are one-unit, associated with, and holding the others constant.

Units Matter

A coefficient is glued to its units

If income is measured in rupees, β_income is the effect of one extra rupee — a tiny number. Measure income in thousands of rupees and the coefficient is 1,000 times larger. The relationship is identical; only the scale changed.

Always state units. A coefficient of '0.002' is meaningless until you know 0.002 of what, per what.
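
The rescaling claim is easy to check on simulated data. In this sketch the income values, the HAZ relationship and the `slope` helper are invented; only the factor of 1,000 between the two fits is the point.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
income_rupees = rng.uniform(5_000, 50_000, size=n)   # hypothetical monthly income
haz = -2.0 + 0.00002 * income_rupees + rng.normal(scale=0.5, size=n)


def slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


b_rupees = slope(income_rupees, haz)              # change in HAZ per extra rupee
b_thousands = slope(income_rupees / 1_000, haz)   # change in HAZ per extra 1,000 rupees
ratio = b_thousands / b_rupees
print(ratio)
```

The fitted values and residuals are the same in both fits; only the coefficient's scale differs.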

Continuous Predictors

Worked reading: a continuous X

Illustrative result: β_schooling = 0.06 in our HAZ model.

Read it as: 'Each additional year of the mother's schooling is associated with a 0.06 higher height-for-age z-score, holding income and toilet access constant.' Five extra years ≈ 0.30 z-score.

Illustrative figure — not a reported NFHS coefficient.

Dummy Variables

Encoding categories as 0/1

Dummy variable

A 0/1 indicator standing for a category — e.g. improved toilet = 1, else 0. Its coefficient is the gap between that category and the baseline.

For a dummy, 'a one-unit increase' simply means switching from 0 to 1 — from the baseline group to the indicated group.

Baseline Categories

Every dummy needs something to compare to

A categorical variable with k categories becomes k−1 dummies; the omitted one is the baseline. For religion with Hindu/Muslim/Christian/Other, if 'Hindu' is omitted, each coefficient is that group's gap relative to Hindu.

Never interpret a dummy without naming the baseline. 'β = −0.4 for urban' means nothing until you know it is relative to rural.
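
The k−1 rule can be sketched in a few lines of plain Python. The `make_dummies` helper and the six-household sample are hypothetical; statistical packages do the same job with their own functions.

```python
def make_dummies(values, baseline):
    """Turn a categorical list into k-1 dummy columns, omitting the baseline."""
    categories = sorted(set(values))
    if baseline not in categories:
        raise ValueError(f"baseline {baseline!r} not found in data")
    return {
        category: [1 if v == category else 0 for v in values]
        for category in categories
        if category != baseline
    }


religion = ["Hindu", "Muslim", "Christian", "Hindu", "Other", "Muslim"]
dummies = make_dummies(religion, baseline="Hindu")
for name, column in dummies.items():
    print(name, column)
```

Four categories produce three columns. A Hindu household is the row where all three are 0, which is why every coefficient reads as a gap relative to Hindu.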

Dummy Example

Reading a 0/1 coefficient

Illustrative result: β_toilet = 0.18 for an improved toilet (baseline = no improved toilet).

Read it as: 'Children in households with an improved toilet have, on average, a 0.18 higher height-for-age z-score than otherwise-similar children without one.' The phrase 'otherwise-similar' is the controls doing their work.

Illustrative figure.

Standardised

Unstandardised vs standardised coefficients

	Unstandardised (b)	Standardised (β)
Units	Original (rupees, years)	Standard deviations
Reads as	Effect of 1 real unit	Effect of a 1-SD change
Comparable across Xs?	No — different scales	Yes — common scale
Best for	Real-world meaning	Ranking relative importance

Standardised coefficients let you ask 'which predictor matters most?' — but lose the plain-language 'per rupee' meaning. Report unstandardised for interpretation, standardised for comparison.
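
One common way to move between the two columns of the table is to rescale the unstandardised slope by the ratio of standard deviations. The sketch below assumes that conversion; the `standardise` helper and the six data points are invented for illustration.

```python
from statistics import stdev


def standardise(b, x_values, y_values):
    """Convert an unstandardised slope b into a standardised one.

    Uses beta = b * sd(X) / sd(Y), i.e. the change in Y (in SDs)
    for a one-SD change in X.
    """
    return b * stdev(x_values) / stdev(y_values)


schooling_years = [0, 2, 5, 8, 10, 12]            # hypothetical sample
haz_scores = [-2.4, -2.1, -1.8, -1.2, -1.0, -0.3]  # hypothetical sample
beta_std = standardise(0.06, schooling_years, haz_scores)
print(round(beta_std, 2))
```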

Sign & Size

Direction first, then magnitude

Sign: + means Y rises with X; − means Y falls as X rises
Size: how much Y moves per unit of X — in context
Always ask 'is this big?' — a 0.02 z-score gain may be real but trivial; a 0.4 gain may be programme-changing

A coefficient is not 'important' just because it is non-zero. Judge magnitude against what would matter for your decision.

04

Section Four

Model Fit & Inference

How Good Is the Fit?

R²: variance explained

R² (R-squared)

The share of the variation in the outcome that the model's predictors explain — ranging from 0 (explains nothing) to 1 (explains everything).

An R² of 0.34 means the predictors account for 34% of the variation in the outcome; the other 66% is residual — unmeasured causes and noise.

See It

R² as a slice of total variation

[Chart: variation in the outcome, explained vs unexplained (illustrative)]

In social data, R² values of 0.1–0.4 are common and not shameful — human behaviour is genuinely noisy. A high R² is not the goal; an honest, well-specified model is.

The Catch

R² only ever goes up

Add any predictor — even random noise — and R² will rise or stay flat; it can never fall. So a bigger R² does not prove the new variable belongs in the model.
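
A quick simulation shows the ratchet at work. The sketch below starts with one genuine predictor and then adds ten columns of pure noise, one at a time, recording in-sample R² after each. The data and the `r_squared` helper are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)   # only x truly matters


def r_squared(y, predictors):
    """In-sample R² from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    total = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / total


predictors = x.reshape(-1, 1)
history = [r_squared(y, predictors)]
for _ in range(10):   # add ten columns of pure noise, one at a time
    predictors = np.column_stack([predictors, rng.normal(size=n)])
    history.append(r_squared(y, predictors))

print([round(float(v), 3) for v in history])
```

The printed sequence never steps down, even though none of the added columns has anything to do with the outcome.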

Chasing R² by piling in predictors is a recipe for overfitting. You need a measure that penalises needless complexity.

The Fix

Adjusted R²

A version of R² that penalises each extra predictor. It rises only when a new variable improves the model by more than chance would predict — and can fall when you add a useless one.

Compare models on adjusted R², not raw R². If adjusted R² drops when you add a variable, that variable is earning its place by less than it costs.
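
The usual formula behind adjusted R² is 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors. A minimal sketch, with a hypothetical helper name and invented sample sizes:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for a model with k predictors fitted to n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical comparison: a fourth predictor nudges raw R² from 0.340 to 0.341.
three_predictors = adjusted_r2(0.340, n=100, k=3)
four_predictors = adjusted_r2(0.341, n=100, k=4)
print(round(three_predictors, 4), round(four_predictors, 4))
```

Raw R² went up, but the adjusted figure goes down: the extra variable did not explain enough to pay for itself.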

Is the Model Worth Anything?

The F-test

The F-test asks a whole-model question: do the predictors together explain more than nothing? Its null hypothesis is that all slope coefficients are zero at once.

A small F-test p-value says 'this model beats a model with no predictors'. It does not tell you which predictor matters — that is the t-tests' job.

Is This Predictor Real?

t-tests on each coefficient

Each coefficient gets its own t-test: is this particular slope distinguishable from zero, given its uncertainty? The output is a t-statistic and a p-value per predictor.

Standard error

The estimated uncertainty in a coefficient. t = coefficient ÷ standard error; a large t (small p) means an estimate this far from zero would be unlikely if the true slope were zero.

The Honest Range

Confidence intervals beat stars

A coefficient of 0.06 with a 95% confidence interval of 0.02 to 0.10 says the plausible effect lies in that band. The interval shows both direction and precision — far more than a lone p-value or a row of asterisks.

If a 95% interval comfortably excludes zero, the effect is 'significant' at the 5% level — but always read the width. A wide interval means you really do not know much.
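
The arithmetic linking a coefficient, its standard error, the t-statistic and the interval is short enough to write out. This sketch assumes a large sample, so it uses the normal multiplier 1.96; small samples would use a t-distribution value instead. The standard error of 0.02 is a hypothetical figure chosen to roughly match the interval quoted above.

```python
def t_statistic(coef, se):
    """t = coefficient divided by its standard error."""
    return coef / se


def ci95(coef, se, z=1.96):
    """Large-sample 95% interval: coefficient plus or minus 1.96 standard errors."""
    return coef - z * se, coef + z * se


coef, se = 0.06, 0.02   # se is an assumed value, for illustration only
t = t_statistic(coef, se)
low, high = ci95(coef, se)
print(round(t, 2), round(low, 3), round(high, 3))
```

Halve the standard error and the interval halves in width; the point estimate does not move.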

See It

Coefficients with confidence intervals

[Chart: illustrative coefficients on height-for-age, with 95% CIs; not reported NFHS estimates]

Bars are the point estimates; the listed intervals (e.g. toilet: 0.07–0.29) are the 95% CIs. None crosses zero here, so each is 'significant' — but the toilet effect is the least precise.

Significance ≠ Importance

Statistically real, practically tiny

With a large sample — and NFHS has hundreds of thousands of records — even a microscopic effect can be 'statistically significant'. Significance is about certainty, not size.

Always report the effect size and its units alongside the p-value. 'Significant' answers 'is it real?', never 'does it matter?'

05

Section Five

Assumptions & Residual Diagnostics

The Fine Print

OLS rests on assumptions

Least-squares estimates are trustworthy only if a few assumptions roughly hold. Violations do not always bias the coefficients, but they can wreck the standard errors — and so the p-values and intervals.

Most assumptions are checked by looking at the residuals — the model's leftovers. Plotting them is non-negotiable.

The Four

What linear regression assumes

Assumption	Plain meaning	Check with
Linearity	The true relationship is a straight line	Residual vs fitted plot
Independence	Observations don't lean on each other	Study design; clustering
Homoscedasticity	Error spread is constant across X	Residual vs fitted plot
Normal residuals	Errors are roughly bell-shaped	Q–Q plot, histogram

Note what is not required: the predictors themselves need not be normal, and Y need not be normal — only the residuals.

Linearity

Is a straight line the right shape?

If the real relationship curves — income's effect on nutrition flattening at high incomes — a straight-line model will systematically over- and under-predict in patterns.

Diagnosis: plot residuals against fitted values. A curve or a smile in that plot says the linearity assumption is failing — consider a transform or a squared term (Section 7).

Independence

When observations cluster

Children in the same village share water, clinics and shocks, so their outcomes are correlated. Treating them as fully independent makes standard errors look smaller than they are — falsely confident results.

Survey data like NFHS is clustered by design. Use clustered or survey-adjusted standard errors, or you will overstate significance.

Homoscedasticity

Constant error spread

Homoscedasticity

The residuals have roughly the same spread across all fitted values. Its opposite, heteroscedasticity, is a fan or funnel shape in the residual plot.

Income data is a classic offender: the rich vary far more in spending than the poor, so residuals widen as predicted spending rises. Coefficients stay unbiased, but the standard errors are wrong.

See It

Good residuals vs a heteroscedastic fan

[Chart: residual vs fitted, even band (good) vs widening fan (bad); illustrative]

The green band stays flat; the red points fan out as the fitted value grows. The fan is the warning sign — reach for robust standard errors or a log transform.

Normal Residuals

The Q–Q plot

Inference (t-tests, CIs) assumes the residuals are roughly normal. Check with a Q–Q plot: if the residuals are normal, the points hug the diagonal line; fat tails or skew show as departures at the ends.

Good news: with a large sample, mild non-normality barely matters thanks to the central limit theorem. Worry most about it in small samples.

The Habit

Plot first, conclude later

Residual vs fitted: checks linearity and equal spread at once
Q–Q plot: checks normality of residuals
Leverage / influence plot: finds the points bending the fit
Scale–location plot: a sharper look at spread

Anscombe's lesson applies here too: numbers alone hide trouble. Run the diagnostic plots on every model before you trust a single coefficient.

06

Section Six

Multicollinearity

The Problem

When predictors overlap

Multicollinearity

When two or more predictors are highly correlated with each other, so they carry overlapping information about the outcome.

Household income, monthly expenditure and asset ownership all measure roughly the same thing — living standards. Put all three in a model and the regression struggles to separate their individual effects.

Why It Hurts

Unstable, imprecise coefficients

When predictors overlap, the model cannot tell which one deserves the credit. The result: coefficients with huge standard errors, wild swings if you add or drop a variable, and sometimes nonsensical signs.

Crucially, multicollinearity does not bias the coefficients — it makes them imprecise. The overall prediction can still be fine; the individual effects become untrustworthy.

The Symptoms

How to spot it

A high overall R² and significant F-test, but no individual predictor is significant
Coefficients flip sign or size when you add/remove a variable
Implausibly large standard errors on variables you expected to matter
Predictors you know are related (income & expenditure) are both in the model

These symptoms point you toward a formal diagnostic — the variance inflation factor.

The Diagnostic

The Variance Inflation Factor (VIF)

VIF

For each predictor, how much its coefficient's variance is inflated by correlation with the other predictors. VIF = 1 means no overlap; higher means more.

It is computed by regressing each predictor on all the others. The more predictable a variable is from the rest, the higher its VIF — and the shakier its coefficient.
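
Written out, the calculation is VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. The sketch below implements that directly with NumPy on simulated data; the `vif` helper and the three variables are invented for illustration.

```python
import numpy as np


def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()   # R² of predictor j on the rest
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


rng = np.random.default_rng(3)
n = 1000
income = rng.normal(size=n)
expenditure = 0.95 * income + 0.2 * rng.normal(size=n)   # nearly a copy of income
schooling = rng.normal(size=n)                           # generated independently

vifs = vif(np.column_stack([income, expenditure, schooling]))
print(vifs.round(1))
```

In this simulated setup income and expenditure both show large VIFs, while the independently generated schooling variable sits close to 1.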

Rules of Thumb

How high is too high?

[Chart: illustrative VIFs; expenditure and income overlap badly]

Common rule of thumb: VIF above 5–10 signals a problem. Here income and expenditure both blow past it — they are measuring the same underlying wealth.

The Remedies

What to do about it

Remedy	How	When
Drop one	Remove a redundant predictor	Two variables measure the same thing
Combine	Build one index (e.g. PCA)	Several proxies for one concept
Centre / rescale	Subtract the mean before squaring/interacting	Collinearity from interaction terms
Get more data	Larger / more varied sample	Overlap is mild, not structural
Accept it	Keep the model, accept wider CIs	You only care about prediction

If two predictors are basically the same concept, do not agonise — keep one, or fold them into a single index. The wealth index in Section 9 is exactly this move.

A Subtlety

Don't 'fix' collinearity you need

If two predictors are genuinely distinct concepts that happen to correlate — education and income — dropping one can reintroduce confounding. The cure may be worse than the disease.

Decide based on your question. If you need the partial effect of education net of income, keep both and accept wider intervals rather than deleting a real confounder.

Recap

Multicollinearity in one breath

It is about predictors overlapping with each other, not with Y
It inflates standard errors; it does not bias coefficients
Diagnose with VIF (watch for > 5–10) and unstable signs
Remedy by dropping, combining, or simply accepting wider uncertainty

07

Section Seven

Interactions & Non-Linearity

Beyond Straight Lines

When one slope isn't enough

The plain model assumes each predictor's effect is the same for everyone and constant at every level. Reality is richer: effects can depend on other variables, or change as a variable grows.

Interaction

X's effect depends on another variable (e.g. schooling helps more where toilets exist).

Non-linearity

X's effect changes with its own level (e.g. income matters more at the bottom).

Interactions

An effect that depends on context

Interaction term

A predictor formed by multiplying two variables (X₁ × X₂). Its coefficient captures how the effect of one variable changes across levels of the other.

Model: HAZ = … + β₁(schooling) + β₂(toilet) + β₃(schooling × toilet). β₃ tells you whether schooling's payoff differs for households with and without a toilet.

See It

Two slopes from one interaction

[Chart: schooling's effect on HAZ, by toilet access (illustrative)]

The steeper green line means schooling helps more where sanitation is in place — an interaction. Two different slopes, one model. Without the interaction term you would force them to be parallel.

Reading Interactions

The main effect changes meaning

Once you include X₁×X₂, the coefficient on X₁ alone is no longer 'the effect of X₁'. It is the effect of X₁ when X₂ = 0. The full effect of X₁ is β₁ + β₃·X₂.

This trips people up constantly. In a model with interactions, never read a main-effect coefficient in isolation — always say 'at what level of the other variable?'
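
Applied to the schooling × toilet model, the rule β₁ + β₃·X₂ gives one schooling slope per toilet status. The coefficient values and the helper name below are invented for illustration.

```python
def schooling_effect(beta_schooling, beta_interaction, toilet):
    """Effect of one more year of schooling at a given toilet value (0 or 1)."""
    return beta_schooling + beta_interaction * toilet


b1, b3 = 0.04, 0.03   # illustrative: main effect and interaction coefficient
without_toilet = schooling_effect(b1, b3, toilet=0)
with_toilet = schooling_effect(b1, b3, toilet=1)
print(round(without_toilet, 2), round(with_toilet, 2))
```

Here the 'main effect' of 0.04 is only the schooling slope for households without a toilet; with a toilet the slope is b1 + b3.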

Non-Linearity

Quadratic and polynomial terms

Add a squared term (β₁X + β₂X²) to bend the line into a curve. This captures diminishing returns (income's effect flattening) or hump shapes (earnings rising with age, then falling).

Centre X before squaring to tame the collinearity between X and X², and resist going past a quadratic — high-order polynomials wiggle wildly and overfit.

Log Transforms

Taming skew, reading elasticities

Logging a right-skewed variable like income compresses its long tail and often straightens a curved relationship. It also changes how you read the coefficient — in percentage terms.

Elasticity

In a log–log model, the coefficient is an elasticity: the percentage change in Y for a 1% change in X. A coefficient of 0.4 means a 1% rise in X is linked to a 0.4% rise in Y.

Log Cheat-Sheet

Four ways logs change the reading

Model	Coefficient β reads as…
Y on X (level–level)	β-unit change in Y per 1-unit X
Y on log X (level–log)	β/100 change in Y per 1% rise in X
log Y on X (log–level)	~100·β % change in Y per 1-unit X
log Y on log X (log–log)	β % change in Y per 1% change in X (elasticity)

Logs are the practitioner's friend for money variables: they fix skew, ease heteroscedasticity, and give intuitive percentage interpretations.
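
The '~' in the log–level row matters once β gets large. The exact percentage change is 100·(e^β − 1); 100·β is a shortcut that works well only for small coefficients. A short check, with a hypothetical helper name:

```python
import math


def pct_change_log_level(beta):
    """Exact % change in Y per one-unit rise in X when the outcome is log Y."""
    return 100 * (math.exp(beta) - 1)


for beta in (0.02, 0.10, 0.50):
    shortcut = 100 * beta
    exact = pct_change_log_level(beta)
    print(f"beta={beta:.2f}  shortcut={shortcut:.1f}%  exact={exact:.1f}%")
```

For a dummy predictor with a sizeable coefficient, the exact version is the safer one to report.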

08

Section Eight

Logistic Regression

A Different Outcome

When Y is yes/no

Many development outcomes are binary: a child is fully immunised or not; a woman delivered in a facility or not; a household is below the poverty line or not. Linear regression is the wrong tool for these.

Fit a straight line to a 0/1 outcome and it will happily predict probabilities below 0 or above 1 — nonsense. We need a model that stays inside 0–1.

The Solution

Model the probability with an S-curve

Logistic regression models the probability of 'yes' using an S-shaped (logistic) curve that flattens near 0 and 1, so predictions always stay in the valid range.

The trick: instead of modelling the probability directly, it models the log-odds of the outcome as a linear function of the predictors. That keeps the familiar 'linear in the Xs' structure.

Odds & Log-Odds

From probability to log-odds

Quantity	Definition	Range
Probability p	Chance of 'yes'	0 to 1
Odds	p ÷ (1 − p)	0 to ∞
Log-odds (logit)	ln(odds)	−∞ to +∞

A probability of 0.8 is odds of 4 (4-to-1 on) and log-odds of about 1.39. Logistic regression's coefficients live on this log-odds scale — which is why they are not directly readable.
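
The three scales in the table convert into one another with two lines of arithmetic each. A minimal sketch; the function names are chosen here for readability and are not from any particular library.

```python
import math


def odds(p):
    """Odds of 'yes' for a probability p (0 < p < 1)."""
    return p / (1 - p)


def log_odds(p):
    """Log-odds (logit) of a probability p."""
    return math.log(odds(p))


def probability(logit):
    """Inverse of the logit: map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-logit))


p = 0.8
print(round(odds(p), 2), round(log_odds(p), 2), round(probability(log_odds(p)), 2))
```

The last function is the S-curve itself: any log-odds value, however large or small, maps back to a probability strictly between 0 and 1.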

The Key Move

Coefficients are in log-odds — exponentiate them

A logistic coefficient β is the change in log-odds per one-unit increase in X. That is hard to feel. So we take exp(β) to get an odds ratio — a multiplicative effect on the odds.

This is the most-mangled idea in applied statistics: the raw coefficient is log-odds; exp(coefficient) is the odds ratio. Never read the raw logit coefficient as a probability.

Reading Odds Ratios

Above 1, below 1, equal to 1

OR > 1: the predictor raises the odds of 'yes' (OR 1.5 → odds 50% higher)
OR < 1: the predictor lowers the odds (OR 0.7 → odds 30% lower)
OR = 1: no association — this is the null value

Because odds ratios are multiplicative, the null is 1, not 0. A 95% CI for an OR that includes 1 is 'not significant'.

See It

Odds ratios for facility delivery

[Chart: illustrative odds ratios for facility delivery (vs home), patterned on NFHS-style predictors]

Read across the line at OR = 1: urban residence and wealth raise the odds of a facility delivery; high birth order (OR 0.62) lowers them by about 38%. All illustrative.

Careful Reading

Odds ratios are not risk ratios

An odds ratio of 2 does not mean the probability doubles. Odds and probability diverge sharply when outcomes are common. For a rare outcome the two are close; for a common one they are not.

For an odds ratio of 1.85, say 'the odds are 85% higher', not '85% more likely'. If your audience needs probabilities, report predicted probabilities at chosen covariate values instead.
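
A few lines of arithmetic show how far odds ratios and risk ratios can drift apart. The sketch applies the same odds ratio of 2 to a rare outcome and to a common one; the baseline probabilities and the helper name are chosen for illustration.

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    new_odds = odds_ratio * p_baseline / (1 - p_baseline)
    return new_odds / (1 + new_odds)


for p0 in (0.01, 0.50):
    p1 = apply_odds_ratio(p0, 2.0)
    print(f"baseline p={p0:.2f}  new p={p1:.3f}  risk ratio={p1 / p0:.2f}")
```

At a 1% baseline the probability almost doubles, so the odds ratio and the risk ratio nearly agree. At a 50% baseline the same odds ratio moves the probability to about two-thirds, a risk ratio of only about 1.33.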

Same Discipline

Everything else carries over

Coefficients are still partial effects — net of the other predictors
Each odds ratio still comes with a confidence interval — report it
Multicollinearity, interactions and dummies all behave as before
Fit is judged differently (pseudo-R², classification, AUC), not by OLS R²

Logistic regression is the same way of thinking on a new scale — learn the log-odds/odds-ratio translation and you are most of the way there.

09

Section Nine

Data Reduction — PCA & Factor Analysis

The Problem

Too many indicators, one concept

You may have twenty asset and housing variables — fridge, TV, motorcycle, pucca walls, electricity, toilet type — all proxying one idea: household living standards. Putting all twenty in a regression is messy and collinear.

Data reduction compresses many correlated indicators into a few summary scores that capture most of their shared information.

PCA

Principal Component Analysis

A technique that re-expresses many correlated variables as a smaller set of uncorrelated 'components', ordered so the first captures the most variance, the second the next-most, and so on.

The first principal component is the single weighted combination of the variables that explains the largest share of their joint variation — usually the 'size' or 'level' they share.

How Much to Keep

The scree plot

[Chart: variance explained by each component (illustrative scree)]

The first component dwarfs the rest — the classic signal that one dimension (wealth) underlies the asset variables. Keep components up to the 'elbow' where the bars level off.

The Canonical Use

The DHS / NFHS wealth index

The standard DHS/NFHS wealth index is built with PCA. Asset, housing and utility variables go in; the first principal component becomes each household's wealth score — the textbook example of PCA in development.

Households are then ranked and split into five wealth quintiles. This is the 'wealth quintile' you see throughout NFHS tables — a PCA score in disguise.
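
The mechanics of 'first principal component, then quintiles' can be sketched with NumPy alone. This is a simplified illustration, not the official DHS/NFHS procedure: it uses six simulated continuous proxies, whereas the real inputs are mostly 0/1 asset indicators, which produce tied scores and need more careful handling. All names and numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
wealth = rng.normal(size=n)   # hidden living standard (simulated, normally unobserved)

# Six hypothetical proxies, each a noisy reflection of the hidden living standard.
loadings = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
indicators = wealth[:, None] * loadings + rng.normal(size=(n, 6))

# Standardise, then take the leading eigenvector of the correlation matrix.
z = (indicators - indicators.mean(axis=0)) / indicators.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
first = eigvecs[:, -1]              # eigh returns eigenvalues in ascending order
if first.sum() < 0:                 # the sign is arbitrary; make "more" mean "richer"
    first = -first

score = z @ first                   # wealth score = first principal component
share = eigvals[-1] / eigvals.sum() # share of variance the first component captures

# Cut the continuous score into five equal-sized groups (1 = poorest, 5 = richest).
cutpoints = np.quantile(score, [0.2, 0.4, 0.6, 0.8])
quintile = np.searchsorted(cutpoints, score, side="right") + 1

print(round(float(share), 2), np.bincount(quintile)[1:])
```

Because this is a simulation, the score can be compared with the hidden `wealth` variable it is meant to recover; with real survey data no such check is available.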

See It

The wealth-index distribution

[Chart: distribution of a PCA wealth score, split into quintiles (illustrative)]

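By construction each quintile holds about 20% of households: the index is a continuous score cut at its 20th, 40th, 60th and 80th percentiles, so the five bands are equal in population, not in width. The underlying score is itself continuous and, in this illustration, right-skewed, with a long tail of better-off households.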

PCA vs Factor Analysis

Cousins, not twins

	PCA	Factor analysis
Goal	Summarise / compress variance	Find latent underlying factors
Direction	Variables → components	Factors → cause the variables
Model of error	None — pure re-expression	Separates shared vs unique variance
Typical use	Indices (wealth index)	Psychometrics, attitude scales

In practice they often give similar results for index-building. PCA is the default for wealth indices; factor analysis suits questionnaire scales where a true latent trait is assumed.

Watch-Outs

Reduction has costs

Components can be hard to name — what does PC2 mean?
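PCA is scale-sensitive: variables measured in large units dominate the components, so standardise first (equivalently, run PCA on the correlation matrix rather than the covariance matrix)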
An asset index reflects what is in the basket — urban-skewed assets bias it
Compressing always discards some information by design

Used well, data reduction tames multicollinearity and yields one clean predictor. Used blindly, it buries the very structure you wanted to study.
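The scale-sensitivity warning is easy to see in a small Python sketch. The two indicators below (income in rupees, schooling in years) are simulated with made-up parameters; the only point is the contrast between PCA on raw units and PCA on standardised variables.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Hypothetical indicators on very different scales, both driven by one latent factor.
latent = rng.normal(size=n)
income_rupees = 20000 + 8000 * latent + rng.normal(scale=6000, size=n)
schooling_years = 6 + 2 * latent + rng.normal(scale=1.5, size=n)
X = np.column_stack([income_rupees, schooling_years])

def first_component(matrix):
    """Return the absolute weights of the first principal component."""
    eigvals, eigvecs = np.linalg.eigh(matrix)    # ascending eigenvalues
    return np.abs(eigvecs[:, -1])                # sign is arbitrary; show magnitudes

w_cov = first_component(np.cov(X, rowvar=False))         # raw units
w_corr = first_component(np.corrcoef(X, rowvar=False))   # standardised

print("weights on (income, schooling), raw units:   ", np.round(w_cov, 4))
print("weights on (income, schooling), standardised:", np.round(w_corr, 4))
```

In raw units the first component is almost entirely income, simply because rupees have a far larger variance than years. After standardising, the two indicators carry equal weight.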

Recap

Data reduction in one breath

Many correlated indicators → a few uncorrelated summary scores
PCA's first component = the dominant shared dimension (often 'level' or 'wealth')
The NFHS/DHS wealth index is the first principal component of assets
Quintiles are that continuous score cut into five equal groups

10

Section Ten

Pitfalls

Pitfall 1

Overfitting

Throw enough predictors at a small sample and the model fits the noise, not the signal. It looks brilliant in your data and fails on the next dataset — it memorised rather than learned.

Symptoms: many predictors relative to observations, dazzling in-sample R², coefficients that change wildly across samples. Cure: simpler models, adjusted R², out-of-sample testing.
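A small simulation makes the symptoms concrete. This is a sketch with assumed numbers: 40 observations, one predictor that truly matters, and 30 irrelevant ones. The same model is then scored on a fresh sample of 1,000, standing in for "the next dataset".

```python
import numpy as np

rng = np.random.default_rng(3)

def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def simulate(n_obs, n_noise):
    """One real predictor plus n_noise irrelevant ones; return (train R2, test R2)."""
    def draw(n):
        X = rng.normal(size=(n, 1 + n_noise))
        y = 0.5 * X[:, 0] + rng.normal(size=n)        # only column 0 matters
        return np.column_stack([np.ones(n), X]), y    # prepend an intercept
    X_train, y_train = draw(n_obs)
    X_test, y_test = draw(1000)                       # fresh sample, same process
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return r_squared(y_train, X_train @ beta), r_squared(y_test, X_test @ beta)

lean = simulate(n_obs=40, n_noise=0)
bloated = simulate(n_obs=40, n_noise=30)
print("1 predictor:   in-sample R2 = %.2f, out-of-sample R2 = %.2f" % lean)
print("31 predictors: in-sample R2 = %.2f, out-of-sample R2 = %.2f" % bloated)
```

The bloated model should show a much higher in-sample R² and a much worse out-of-sample R², which can even turn negative (worse than predicting the mean).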

Pitfall 2

Omitted-variable bias

Leave out a real confounder and its effect contaminates the variables you did include. This is the mirror image of controlling: omit income and the diet coefficient absorbs income's effect.

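The direction of the bias depends on two relationships: how the omitted variable relates to X and how it affects Y. If both have the same sign, the included coefficient is pushed up; if the signs differ, it is pushed down. Unlike multicollinearity, which only widens standard errors, omitted-variable bias shifts the coefficients themselves, which makes it the more dangerous problem.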
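The diet and income example can be simulated in a few lines of Python. All coefficients below are assumptions chosen for illustration (a true diet effect of 0.3 and an income effect of 0.5). The last line checks the textbook identity: the short-regression coefficient equals the full one plus the omitted variable's coefficient times the slope from regressing the omitted variable on the included one.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000

# Hypothetical data-generating process (all numbers are assumed):
income = rng.normal(size=n)
diet = 0.6 * income + rng.normal(size=n)              # richer households eat better
nutrition = 0.3 * diet + 0.5 * income + rng.normal(size=n)

def ols(y, *predictors):
    """OLS with an intercept; returns the slope coefficients only."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

full = ols(nutrition, diet, income)     # income controlled
short = ols(nutrition, diet)            # income omitted
delta = ols(income, diet)[0]            # how the omitted variable tracks diet

print("diet coefficient, income controlled:", round(float(full[0]), 3))
print("diet coefficient, income omitted:   ", round(float(short[0]), 3))
print("full diet coef + income coef * delta:", round(float(full[0] + full[1] * delta), 3))
```

With income omitted, the diet coefficient lands well above 0.3 because it absorbs part of income's effect. Here both relationships are positive, so the bias is upward.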

A False Trade-Off

Multicollinearity vs causality

The cures for multicollinearity and omitted-variable bias pull in opposite directions. Dropping a collinear predictor cures the imprecision, but if that predictor was a real confounder, dropping it creates omitted-variable bias.

The resolution is your question. For an unbiased causal estimate, keep the confounder and tolerate wide intervals. For pure prediction, collinearity may not matter at all.

Pitfall 3

Controlling for a mediator

Mediator

A variable on the causal path between X and Y — X causes the mediator, which in turn causes Y. It transmits the effect rather than confounding it.

Education raises income, and income improves child nutrition. Income is a mediator of education's effect — not a confounder to be controlled away.

Why It's a Trap

'Controlling for' can bias the answer

If you 'control for' income while estimating education's effect on nutrition, you block the very pathway through which education works, and so understate its total effect. More controls are not always better.

Rule: control for confounders (common causes of X and Y). Do not control for mediators (intermediate effects) when you want the total effect, and do not control for things caused by the outcome. Draw the causal story before choosing controls.
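A simulated version of the education, income and nutrition story shows the size of the trap. The path strengths are assumptions: education raises income by 0.7, income raises nutrition by 0.5, and education has a small direct effect of 0.1, so the total effect is 0.1 + 0.7 × 0.5 = 0.45. The simulation has no confounders, which is what lets the simple regression recover the total effect.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000

# Hypothetical causal chain: education -> income -> nutrition (plus a small direct path).
education = rng.normal(size=n)
income = 0.7 * education + rng.normal(size=n)
nutrition = 0.1 * education + 0.5 * income + rng.normal(size=n)

def ols(y, *predictors):
    """OLS with an intercept; returns the slope coefficients only."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

total = ols(nutrition, education)[0]             # mediator left alone
direct = ols(nutrition, education, income)[0]    # mediator "controlled for"

print("education coefficient, income not controlled:", round(float(total), 3))
print("education coefficient, income controlled:    ", round(float(direct), 3))
```

Controlling for income shrinks the education coefficient to roughly the direct path alone. That number is not wrong, but it answers a different question from "what is education's overall effect?"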

Pitfall 4

Extrapolation

A model fitted on households earning ₹5,000–₹40,000 a month says nothing reliable about a household earning ₹5 lakh. Predicting outside the range of your data assumes the line keeps going — it rarely does.

Stay within the support of your data. The neat straight line is an artefact of the range you observed, not a law of nature beyond it.
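To see how far a straight line can drift outside its range, here is a sketch in plain Python. It assumes, purely for illustration, that the true outcome rises with the logarithm of income (diminishing returns), fits a line on incomes of ₹5,000 to ₹40,000 only, and then predicts at ₹5 lakh.

```python
import math

# Assumed true relationship: outcome = log(income in thousands of rupees).
def true_outcome(income_thousands):
    return math.log(income_thousands)

# Observed range only: Rs 5,000 to Rs 40,000 a month, in thousands.
xs = list(range(5, 41))
ys = [true_outcome(x) for x in xs]

# Fit a straight line by ordinary least squares.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

def predict(income_thousands):
    return intercept + slope * income_thousands

for x in (20, 40, 500):          # 500 = Rs 5 lakh, far outside the data
    print(x, "predicted:", round(predict(x), 2), "true:", round(true_outcome(x), 2))
```

Inside the observed range the line tracks the curve closely; at ₹5 lakh the prediction is several times the true value, because the line keeps climbing while the curve flattens.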

Pitfall 5

p-hacking & the forking paths

Try enough specifications — add this control, drop that one, split by region, redefine the outcome — and something will cross p < 0.05 by chance. Reporting only the winning run manufactures false findings.

Defend against it: pre-specify your model, report what you tried, show robustness across specifications, and trust replication over a single surprising result.
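A back-of-envelope calculation in Python shows why trying many specifications is risky. It rests on a simplifying assumption: the tests are independent and no real effect exists. Real specifications on the same data are correlated, so treat the numbers as a rough guide, not an exact rate.

```python
def chance_of_false_positive(n_specs, alpha=0.05):
    """P(at least one of n_specs independent null tests gives p < alpha)."""
    return 1 - (1 - alpha) ** n_specs

for k in (1, 5, 20, 60):
    print(k, "specifications ->", round(chance_of_false_positive(k), 2))
```

Under these assumptions, twenty tries give roughly a 64% chance of at least one 'significant' result even when nothing is going on.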

Pitfalls Recap

Six traps, one habit

Pitfall	What goes wrong	Guard
Overfitting	Fits noise, fails out of sample	Simplicity, adjusted R², hold-out
Omitted-variable bias	Confounder contaminates coefficients	Include real confounders
Collinearity confusion	Drop a confounder to 'fix' VIF	Let the question decide
Controlling a mediator	Blocks the causal pathway	Don't control intermediates
Extrapolation	Predicts beyond the data	Stay within range
p-hacking	Chance result sold as finding	Pre-specify, replicate

11

Section Eleven

Reporting & Tools

A Workflow

How to build a model, in order

01 ASK: a precise question & outcome
02 DRAW: the causal story (confounders vs mediators)
03 FIT: choose linear or logistic; add controls
04 CHECK: residuals, VIF, fit
05 REPORT: effects, CIs, caveats

Notice the thinking happens before the software runs. Drawing the causal story first is what tells you which variables to control — and which to leave alone.

The Regression Table

What a results table must show

Predictor	Coefficient	95% CI	p
Income (per ₹1,000)	0.03	0.01 – 0.05	<0.01
Mother's schooling (yr)	0.06	0.02 – 0.10	<0.01
Improved toilet (vs none)	0.18	0.07 – 0.29	<0.01
Intercept	−1.42	—	—

Always report coefficients with confidence intervals, the sample size, R²/adjusted R², and the units. Illustrative figures shown.
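Why the units matter: the short Python sketch below turns the illustrative coefficients from the table above into a prediction. The slide does not name the outcome, so the result is in whatever units the outcome is measured in; the example household is hypothetical.

```python
# Illustrative coefficients copied from the table above.
coef = {
    "intercept": -1.42,
    "income_per_1000": 0.03,          # per Rs 1,000 of monthly income
    "mother_schooling_years": 0.06,   # per year of schooling
    "improved_toilet": 0.18,          # improved toilet vs none
}

def predict(income_rupees, schooling_years, improved_toilet):
    """Predicted outcome, in the (unspecified) units of the outcome variable."""
    return (coef["intercept"]
            + coef["income_per_1000"] * (income_rupees / 1000)
            + coef["mother_schooling_years"] * schooling_years
            + coef["improved_toilet"] * int(improved_toilet))

base = predict(20000, 8, False)       # hypothetical household
richer = predict(30000, 8, False)     # same household, Rs 10,000 more income
print("baseline prediction:", round(base, 2))
print("change for Rs 10,000 more income:", round(richer - base, 2))
```

The income coefficient of 0.03 only makes sense once the reader knows it is per ₹1,000: ₹10,000 more income moves the prediction by 0.30, not 0.03.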

Good Reporting

Habits that earn trust

State your question and model before the numbers
Give effect sizes with units, not just stars
Show confidence intervals; note the sample size and any weights
Report the diagnostics you ran and what you found
Be explicit about which findings are associations and which are causal claims

The Toolbox

R, Python and Stata

Tool	Strengths	Note
R	Stats-first, superb diagnostics & graphics	Free; lm(), glm(), broom
Python (statsmodels)	Cleaning + modelling in one place	Free; pandas, scikit-learn
Stata	Survey data, clustered SEs, ubiquitous in econ	Paid; svy: prefix
SPSS	Menu-driven, common in academia	Paid; gentle on-ramp

For NFHS/PLFS work, whatever you choose must handle survey weights and clustered standard errors: R's survey package and Stata's svy commands model the full survey design, while Python's statsmodels covers weighted regression and cluster-robust standard errors.
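What 'weights and clustered standard errors' mean in practice can be sketched with NumPy alone. This is a hand-rolled illustration on simulated villages, not a substitute for a survey package: it uses the basic cluster-robust sandwich formula with no small-sample correction, ignores strata, and every number in the simulated survey is made up.

```python
import numpy as np

rng = np.random.default_rng(6)
n_clusters, per_cluster = 200, 25
cluster = np.repeat(np.arange(n_clusters), per_cluster)

# Hypothetical survey: a village-level predictor, a shared village shock,
# and unequal sampling weights.
x = rng.normal(size=n_clusters)[cluster]
shock = rng.normal(size=n_clusters)[cluster]
y = 1.0 + 0.5 * x + shock + rng.normal(size=cluster.size)
weights = rng.uniform(0.5, 2.0, size=cluster.size)

# Weighted least squares point estimates.
X = np.column_stack([np.ones_like(x), x])
XtWX_inv = np.linalg.inv(X.T @ (weights[:, None] * X))
beta = XtWX_inv @ X.T @ (weights * y)
resid = y - X @ beta

# Naive standard errors: every household treated as independent.
sigma2 = np.sum(weights * resid ** 2) / (len(y) - X.shape[1])
se_naive = np.sqrt(np.diag(sigma2 * XtWX_inv))

# Cluster-robust sandwich: add up weighted score vectors within each village first.
scores = X * (weights * resid)[:, None]
cluster_sums = np.zeros((n_clusters, X.shape[1]))
np.add.at(cluster_sums, cluster, scores)
meat = cluster_sums.T @ cluster_sums
se_cluster = np.sqrt(np.diag(XtWX_inv @ meat @ XtWX_inv))

print("slope:", round(float(beta[1]), 3))
print("naive SE:    ", round(float(se_naive[1]), 4))
print("clustered SE:", round(float(se_cluster[1]), 4))
```

Because households in the same village share a shock and the predictor varies only between villages, the clustered standard error comes out several times larger than the naive one. Ignoring the clustering would make the estimate look far more precise than it is.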
