Bivariate Analysis 101

ImpactMojo · Bivariate Analysis 101 · www.impactmojo.in

ImpactMojo 101 Series · Free Forever

How Two Variables Move Together — Cross-Tabs, Correlation, Group Comparisons & Simple Regression for Development Practitioners in South Asia

Research-Backed · South Asia Focus · ~90 Slides · Free Access

Agenda

What We Cover

01 What Bivariate Analysis Is
02 Choosing a Method by Variable Type
03 Cross-Tabulation & Contingency Tables
04 Visualising Relationships
05 Correlation
06 Comparing Two Groups
07 Comparing Several Groups (ANOVA)
08 Association Between Categories (Chi-Square)
09 Simple Linear Regression
10 Pitfalls
11 Reporting, Practice & Tools

01

Section One

What Bivariate Analysis Is

Definition

From one variable to two

Univariate analysis describes one variable at a time — the average household size, the spread of incomes. Bivariate analysis is the very next step: it asks how two variables relate to each other.

Bivariate analysis

The study of the relationship between two variables — whether they move together, how strongly, and in what direction. 'Bi' = two; 'variate' = variable.

Most interesting development questions are bivariate at heart: does this go with that? Does schooling go with earnings? Does the programme go with better outcomes?

The Step After Describing

Where it sits in the analysis ladder

01 UNIVARIATE: describe one variable (centre, spread, shape)
→ 02 BIVARIATE: relate two variables (direction, strength)
→ 03 MULTIVARIATE: many variables at once (control, adjust)

This course lives squarely in the middle rung. Master it before reaching for regression with twenty controls — the two-variable picture is where most reasoning errors are caught.

Two Questions

Direction and strength

Direction

Do they move the same way (positive) or opposite ways (negative)? More literacy, fewer births — that is a negative direction.

Strength

How tightly do they track each other? A loose cloud is weak; a near-straight line is strong.

Almost every bivariate tool you will meet is just a precise way to answer these two questions — plus a third: could this be chance?

Vocabulary

Explanatory and response variables

Explanatory variable (X)

The variable you think does the explaining or predicting — also called the independent or predictor variable. Conventionally on the horizontal axis.

Response variable (Y)

The outcome you want to understand or predict — also called the dependent variable. Conventionally on the vertical axis.

Naming X and Y does not prove X causes Y. It only states which one you are treating as the outcome. The causal claim must be earned separately.

Why It Matters

The questions practitioners actually ask

Do villages with self-help groups have higher women's savings?
Is anaemia more common among Adivasi women than others?
Does distance to a health centre predict institutional delivery?
Did test scores differ between the treated and control schools?
Is caste associated with whether a household has a toilet?

Each is a relationship between two variables — and each maps to a specific bivariate method, which Section Two helps you choose.

Warning Up Front

A relationship is a clue, not a verdict

Bivariate analysis can reveal a relationship, quantify its strength, and tell you whether it is likely real or just noise. What it cannot do, on its own, is prove that one variable causes the other.

The plural of anecdote is not data; and the presence of correlation is not the presence of cause.

— a working principle of careful analysis

Roadmap

How this course is built

Tools

Cross-tabs for category × category
Correlation for number × number
t-tests and ANOVA for comparing groups
Chi-square and simple regression

Judgement

Choosing the right method
Reading effect size, not just p-values
Spotting confounders and fallacies
Reporting a relationship honestly

Examples come from India and the wider region — the data you actually meet at work.

02

Section Two

Choosing a Method by Variable Type

The Master Move

First, classify both variables

The single most useful habit in bivariate analysis: before choosing any method, ask what kind of variable each of your two is — categorical or numeric. The pair of answers points straight to the right tool.

Categorical

Labels or groups — sex, caste, religion, district, yes/no. Includes ordered categories (wealth quintile, Likert scale).

Numeric

Counts and measurements — age, income, test score, fertility rate, distance in km. You can meaningfully average them.

The Decision Table

Method by the two variable types

X type	Y type	Describe with	Test with
Categorical	Categorical	Cross-tab, % & bars	Chi-square test
Categorical (2 groups)	Numeric	Group means, box plots	t-test
Categorical (3+ groups)	Numeric	Group means, box plots	One-way ANOVA
Numeric	Numeric	Scatter plot	Correlation / regression

Pin this table to your wall. Nine times out of ten, classifying your two variables tells you exactly which method to reach for.
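The decision table can also be written as a small lookup. This is only a sketch: the function name `choose_method` and its string labels are hypothetical, and the logic covers nothing beyond the four rows above.

```python
def choose_method(x_type: str, y_type: str, n_groups: int = 2) -> str:
    """Return the test suggested by the decision table.

    x_type and y_type are "categorical" or "numeric". n_groups is the
    number of categories in X and only matters when X is categorical
    and Y is numeric.
    """
    if x_type == "categorical" and y_type == "categorical":
        return "chi-square test"
    if x_type == "categorical" and y_type == "numeric":
        if n_groups < 2:
            raise ValueError("need at least two groups to compare")
        return "t-test" if n_groups == 2 else "one-way ANOVA"
    if x_type == "numeric" and y_type == "numeric":
        return "correlation / regression"
    raise ValueError("combination not covered by the decision table")


# Caste group (categorical) vs has a toilet (categorical)
print(choose_method("categorical", "categorical"))
# Wealth quintile (5 groups) vs height-for-age z-score (numeric)
print(choose_method("categorical", "numeric", n_groups=5))
```

A numeric X with a categorical Y is deliberately left to raise an error, because the table does not cover that pairing.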

Cat × Cat

Two categories: is membership linked?

When both variables are categories — say caste group and has a household toilet (yes/no) — you ask whether knowing one tells you anything about the other.

Describe it with a cross-tabulation and percentages; test it with a chi-square test of independence. Covered in Sections Three and Eight.

Cat × Num

A group label and a number: compare means

When one variable is a group label and the other is numeric — treatment vs control and test score — you compare the numeric outcome across the groups.

Two groups

Difference in means — the t-test idea. Section Six.

Three or more groups

One-way ANOVA compares all the group means at once. Section Seven.

Num × Num

Two numbers: do they track each other?

When both variables are numeric — female literacy and fertility rate across districts — plot one against the other on a scatter and measure how tightly they move together.

Describe the strength with correlation (Section Five); model the line with simple linear regression (Section Nine).

A Caution

The method serves the question, not the reverse

Do not pick a fancy test and then hunt for variables to feed it. Start from the real question, classify the two variables it involves, and let the decision table hand you the method.

Far better an approximate answer to the right question than an exact answer to the wrong one.

— John Tukey

03

Section Three

Cross-Tabulation & Contingency Tables

The Workhorse

Counting two categories together

Cross-tabulation (contingency table)

A table that counts how many cases fall into each combination of two categorical variables — rows for one variable, columns for the other.

It is the most-used tool in applied development analysis: a single table that shows, at a glance, how two group memberships line up.

A 2×2 Table

Toilet ownership by location (raw counts)

	Has toilet	No toilet	Total
Rural	320	280	600
Urban	340	60	400
Total	660	340	1,000

Illustrative figures. The four inner cells are the joint counts; the right and bottom margins are the marginal totals for each variable on its own.

Why Percentages

Raw counts mislead when groups differ in size

Rural and urban have different totals (600 vs 400), so comparing raw counts is unfair. To compare fairly, convert to percentages — but in which direction?

This is the single most common cross-tab error: percentaging the wrong way and reading a relationship backwards. Get the direction right and the table tells the truth.

Row Percentages

Percent within each row

	Has toilet	No toilet	Row total
Rural	53%	47%	100%
Urban	85%	15%	100%

Each row sums to 100%. This answers: of rural households, what share have a toilet? 53% rural vs 85% urban — a clear location gap. (Illustrative.)

Column Percentages

Percent within each column

	Has toilet	No toilet
Rural	48%	82%
Urban	52%	18%
Column total	100%	100%

Each column sums to 100%. This answers a different question: of households without a toilet, what share are rural? 82%. Same table, different story.
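Both ways of percentaging can be reproduced from the raw counts in a few lines. The sketch below uses the illustrative toilet-by-location table; the helper names `row_percentages` and `column_percentages` are made up for this example.

```python
# Raw counts from the illustrative 2x2 table (rows = location).
counts = {
    "Rural": {"Has toilet": 320, "No toilet": 280},
    "Urban": {"Has toilet": 340, "No toilet": 60},
}


def row_percentages(table):
    """Percent within each row: every row sums to 100."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result


def column_percentages(table):
    """Percent within each column: every column sums to 100."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }


row_pct = row_percentages(counts)
col_pct = column_percentages(counts)

# Of rural households, what share have a toilet?
print(round(row_pct["Rural"]["Has toilet"]))
# Of households without a toilet, what share are rural?
print(round(col_pct["Rural"]["No toilet"]))
```

The two printed figures answer different questions even though they come from the same four counts, which is why the direction of percentaging has to be chosen deliberately.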

The Rule

Percentage in the direction of the cause

Convention: percentage within categories of the explanatory variable, then compare across them. If location (X) might shape toilet ownership (Y), percentage within rural and within urban — that is row percentaging here.

01 Put X in the rows
→ 02 Percentage so each row = 100%
→ 03 Compare the same Y-column across rows

Reading a 2×2

What a difference in percentages means

85% − 53%

= 32 percentage-point gap in toilet ownership, urban vs rural

Direction

Urban households far more likely to have a toilet — a positive urban–toilet link

A gap in the table suggests an association. Whether it is bigger than chance is what the chi-square test in Section Eight decides.

04

Section Four

Visualising Relationships

Plot First

Always look before you compute

Before any coefficient or test, draw the relationship. A picture reveals direction, strength, curvature, clusters and outliers that a single number can hide entirely.

The greatest value of a picture is when it forces us to notice what we never expected to see.

— John Tukey

Match Plot to Types

Which chart for which pair

Variable pair	Best plot	Shows
Numeric × numeric	Scatter plot	Direction, strength, shape
Categorical × numeric	Box plot by group	Spread & median per group
Categorical × numeric	Grouped / clustered bars	Mean per group
Categorical × categorical	Stacked / grouped bars	Shares within groups

Scatter Plots

The workhorse for two numbers

A scatter plot puts the explanatory variable on the X-axis, the response on the Y-axis, and one dot per case. The cloud's tilt shows direction; its tightness shows strength.

Read it like this: upward cloud = positive; downward = negative; round blob = no linear relationship; tight line = strong; fat cloud = weak.

See It

Three scatters: positive, negative, none

Same axes, three relationships

Illustrative

Green rises, red falls, indigo wanders. Your eye reads direction and strength instantly — before any number.

Real-Feeling Data

Female literacy vs child mortality, by state

Female literacy (%) vs under-5 mortality (per 1,000), major states

Illustrative, patterned on Census 2011 & NFHS-5

A strong negative pattern: states with higher female literacy tend to have lower child mortality. Illustrative, but it mirrors the real Census–NFHS picture closely.

Box Plots

Comparing a number across groups

For a categorical X and numeric Y, a box plot per group beats a single average. Each box shows the median (the line), the middle 50% (the box), the reach of the non-outlying values (the whiskers) and outliers (the dots).

Box plots let you see not just whether the centres differ between groups, but whether the spreads do — two distributions with the same mean can look completely different.

Grouped Bars

A mean compared across groups

Mean monthly earnings by education level (illustrative, ₹000)

Illustrative, patterned on PLFS-style data

A clean grouped bar shows the mean of a numeric outcome rising across ordered categories — the visual form of a category × number relationship.

Honest Charts

The same rules still apply

Start bar axes at zero — a truncated axis fakes a gap
Label both axes with units
Note the denominator and sample size
Put the source and date on the chart
Do not let one outlier dominate a scatter unremarked

A bivariate chart is an argument about a relationship. Truncated axes and missing denominators turn an honest pattern into a misleading one.

05

Section Five

Correlation

The Number

Putting a value on 'move together'

Correlation coefficient (r)

A single number summarising how strongly and in which direction two numeric variables move together in a straight-line sense. It ranges from −1 to +1.

Where a scatter shows the relationship, r measures it — one number for direction and strength combined.

The Scale

r runs from −1 to +1

−1

Perfect negative — points fall exactly on a downward line

0

No linear relationship — a shapeless cloud

+1

Perfect positive — points fall exactly on an upward line

The sign gives direction; the distance from zero gives strength. r = −0.8 and r = +0.8 are equally strong, opposite in direction.

Pearson's r

The default: Pearson correlation

The everyday correlation is Pearson's r. Crucially, it measures only the linear (straight-line) component of a relationship between two numeric variables.

If the true relationship is curved — rising then falling — Pearson's r can be near zero even when the two variables are tightly related. Pearson sees lines, not curves.
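A small sketch makes the point concrete. The data are invented: one series lies exactly on a straight line, the other rises and then falls symmetrically. `pearson_r` is written out by hand here so the formula is visible; it is not taken from any particular library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the two spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = list(range(-5, 6))            # -5, -4, ..., 5
line = [2 * x + 1 for x in xs]     # exact straight line
curve = [25 - x * x for x in xs]   # rises, peaks at x = 0, then falls

r_line = pearson_r(xs, line)
r_curve = pearson_r(xs, curve)
print(r_line, r_curve)
```

The curved series is perfectly determined by X, yet its Pearson r is zero, because the rising half and the falling half cancel out.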

Rough Guide

Interpreting the size of r

|r|	Rough strength	What the scatter looks like
0.0 – 0.1	Negligible	Shapeless cloud
0.1 – 0.3	Weak	Faint tilt
0.3 – 0.5	Moderate	Clear tilt, wide scatter
0.5 – 0.7	Strong	Tight tilt
0.7 – 1.0	Very strong	Near a straight line

These bands are conventions, not laws. In messy social data, an r of 0.3 can be a genuinely important signal.

Worked Example

Literacy and fertility, quantified

Female literacy (%) vs total fertility rate, major states

Illustrative, patterned on Census 2011 & NFHS-5

This cloud has a clear downward tilt — an r near −0.9 (illustrative). Strong and negative. But strength is not the same as cause: literacy may proxy for income, health and much else.

Spearman

When to use rank correlation instead

Spearman's rank correlation (ρ)

Pearson's correlation computed on the ranks of the data rather than the raw values. It measures whether two variables move together in the same order, not strictly in a straight line.

Use it for ordinal data — wealth quintiles, Likert scales
Use it for skewed data — income, landholding
Use it when the relationship is monotonic but curved
It is robust to outliers that would swing Pearson's r

Pearson vs Spearman

Two correlations, side by side

	Pearson r	Spearman ρ
Measures	Linear association	Monotonic (rank) association
Best data	Numeric, roughly symmetric	Ordinal or skewed numeric
Outlier-sensitive?	Yes: one point can swing it	Much less: it works on ranks
Range	−1 to +1	−1 to +1

When Pearson and Spearman disagree sharply, suspect non-linearity or an outlier — and go back to the scatter.
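To see the two coefficients disagree, take a relationship that always rises but is sharply curved. The sketch below uses invented data (Y is the cube of X) and a simple `ranks` helper that assumes no tied values; real survey data with ties would need average ranks instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from first principles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) upward. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 3 for x in xs]   # always rising, but strongly curved

r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print(round(r, 2), round(rho, 2))
```

Spearman reports a perfect monotonic link while Pearson falls short of 1, which is the signature of a curved but consistently rising relationship.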

Strength ≠ Significance

A big r and a small p are different things

Strength (effect size)

How big is the relationship? That is what r itself tells you — the practical magnitude.

Significance (p-value)

How sure are we it is not zero by chance? That depends heavily on sample size.

With thousands of cases, a tiny, uninteresting r = 0.05 can be 'statistically significant'. Always report the size, not just the star.

r-squared Preview

Square it for 'variance explained'

Square the correlation and you get R² — the share of the variation in one variable that is statistically accounted for by the other. r = 0.7 means R² = 0.49: about half the variation is shared.

We return to R² properly in the regression section — it is the natural bridge from correlation to a fitted line.

06

Section Six

Comparing Two Groups

The Setup

One group label, one number

A categorical X with two values (treatment/control, girls/boys, SHG/non-SHG) and a numeric Y (score, income, weight). The question: do the two group means differ — and is the difference real?

01 Split the data into two groups by X
→ 02 Compute each group's mean of Y
→ 03 Ask: is the gap bigger than chance?

Difference in Means

The quantity of interest

Mean test score: control vs treatment schools (illustrative)

Illustrative

The raw difference is 7 points (61 − 54). But two questions remain: could a 7-point gap arise by chance, and is 7 points big enough to matter?

The t-test Idea

Signal divided by noise

t-test

A test of whether the difference between two group means is larger than we would expect from sampling variation alone — essentially, the size of the gap relative to how noisy the data are.

Intuitively, t ≈ difference in means ÷ uncertainty in that difference. A big, clean gap gives a big t; a small gap drowned in scatter gives a small t.

What Drives It

Three things make a gap convincing

A larger difference between the two means
Less spread (variation) within each group
A larger sample in each group

The same 7-point gap is unconvincing with 20 noisy pupils per arm, but compelling with 2,000 tightly clustered ones. Sample size and spread decide the verdict.
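The signal-over-noise idea can be sketched from summary statistics alone. The version below is the unequal-variance (Welch) form of the t statistic, which is one common choice rather than the only one. The means of 61 and 54 come from the example; the standard deviations and sample sizes are hypothetical, picked only to contrast the two scenarios.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t = difference in means / standard error of that difference."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Same 7-point gap in both cases. SDs and sample sizes are assumed values.
t_small_noisy = welch_t(61, 20, 20, 54, 20, 20)        # 20 noisy pupils per arm
t_large_tight = welch_t(61, 10, 2000, 54, 10, 2000)    # 2,000 tightly clustered

print(round(t_small_noisy, 2), round(t_large_tight, 2))
```

The gap is identical in both calls. Only the spread and the sample size change, and they move t from roughly 1 to above 20.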

The p-value

What a p-value does and doesn't say

p-value

The probability of seeing a difference at least this large if there were truly no difference between the groups. A small p (say < 0.05) means a gap this large would be unlikely if chance alone were at work.

A p-value is not the probability the effect is real, nor a measure of its size. It only addresses 'could this be chance?' — nothing about importance.

Effect Size

How big, in plain terms

Significance asks whether there is a difference; effect size asks how big. Report it in real units (7 marks, ₹400/month) or as a standardised measure (Cohen's d) so others can judge whether it matters.

p-value

Is the gap larger than chance alone would plausibly produce?

Effect size

Is the gap big enough to act on?
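Cohen's d is simply the gap in means divided by a pooled standard deviation. A minimal sketch follows; the group means are from the test-score example, while the standard deviation of 14 marks and the 100 pupils per arm are assumed values for illustration.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_variance = (
        (n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2
    ) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_variance)


# Treatment mean 61, control mean 54. SD = 14 and n = 100 are hypothetical.
d = cohens_d(61, 14, 100, 54, 14, 100)
print(round(d, 2))
```

Unlike t, this figure does not grow as the sample gets larger, which is what makes it a measure of size rather than of certainty.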

The Trap

Significant but trivial — and vice versa

Significant but trivial

Huge sample → a 0.3-mark difference is 'significant' but means nothing for any child.

Real but 'not significant'

Tiny pilot → a promising 9-point gap fails the test purely for lack of sample. Absence of evidence is not evidence of absence.

Always read the p-value and the effect size together. Neither alone tells the whole story.

Assumptions

When the simple t-test is appropriate

The two groups are independent (different people)
Y is roughly symmetric within each group (or n is large)
For paired data — same people before/after — use a paired t-test instead
For severely skewed data, consider a rank-based alternative

07

Section Seven

Comparing Several Groups (ANOVA)

The Setup

When there are three or more groups

A categorical X with three or more values — four caste groups, five wealth quintiles, six districts — and a numeric Y. We want to know whether the group means differ overall.

Tempting shortcut: run a t-test on every pair. Don't. With many pairs, the chance of a false 'significant' result piles up fast — the multiple-comparisons problem.
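How fast the false-alarm risk piles up can be shown with simple arithmetic. The sketch below assumes each pairwise test uses a 5% threshold and, as a simplification, that the tests are independent of one another; real pairwise tests share data, so treat the figures as a rough guide.

```python
import math


def familywise_error(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes independent tests, which is a simplification.
    """
    n_pairs = math.comb(n_groups, 2)
    return n_pairs, 1 - (1 - alpha) ** n_pairs


for k in (3, 4, 5, 6):
    pairs, risk = familywise_error(k)
    print(f"{k} groups: {pairs} pairwise tests, false-alarm risk about {risk:.0%}")
```

With five wealth quintiles there are already ten pairs, and under these assumptions the chance of at least one spurious 'significant' result is around 40%.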

The Idea

One-way ANOVA, conceptually

One-way ANOVA

Analysis of Variance — a single test of whether the means of three or more groups differ by more than chance, by comparing variation BETWEEN groups to variation WITHIN groups.

Despite the name, ANOVA is about means. It uses variances as the yardstick for deciding whether those means are really apart.

Between vs Within

The heart of ANOVA

Between-group variation

How far apart the group means are from each other — the signal.

Within-group variation

How much individuals scatter inside each group — the noise.

The F-ratio = between ÷ within. If groups sit far apart relative to their internal scatter, F is large — evidence the means genuinely differ.
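In the standard calculation, 'between' and 'within' are each averaged over their degrees of freedom before dividing. A sketch with two tiny invented datasets: both have the same within-group scatter, but the group means sit much further apart in the first.

```python
def f_ratio(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Signal: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Noise: how far individuals sit from their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


far_apart = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]   # means 5, 8, 11
close_by = [[4, 5, 6], [5, 6, 7], [6, 7, 8]]       # means 5, 6, 7

f_far = f_ratio(far_apart)
f_close = f_ratio(close_by)
print(f_far, f_close)
```

The scatter inside each group is identical in the two datasets, so the whole difference in F comes from how far apart the group means are.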

See It

Mean nutrition score by wealth quintile

Mean child height-for-age z-score by wealth quintile (illustrative)

Illustrative, patterned on NFHS-5

Five group means rising steadily across quintiles. ANOVA asks: taken together, is this spread of means bigger than the scatter within each quintile? (Illustrative; the real gradient is well documented.)

After ANOVA

It says 'somewhere', not 'where'

A significant ANOVA tells you the groups are not all equal — but not which ones differ. To locate the differences, follow up with post-hoc pairwise comparisons that correct for multiple testing.

Think of ANOVA as the smoke alarm: it tells you there is a fire somewhere, then post-hoc tests find the room.

Assumptions

When one-way ANOVA fits

Groups are independent
Y is roughly symmetric within each group (or n is large)
Group spreads are not wildly different
For badly skewed data, a rank-based alternative (Kruskal–Wallis) may suit

ANOVA is the natural extension of the two-group t-test to many groups — same logic, scaled up.

Recap

Comparing groups, at a glance

Groups	Y type	Method
2 (independent)	Numeric	Two-sample t-test
2 (same units, paired)	Numeric	Paired t-test
3 or more	Numeric	One-way ANOVA
3+, skewed data	Numeric	Kruskal–Wallis

08

Section Eight

Association Between Categories (Chi-Square)

The Setup

Two categories: are they independent?

Back to two categorical variables and a cross-tab. The chi-square test of independence asks: is the pattern in this table bigger than we would expect if the two variables were completely unrelated?

Independence

Two variables are independent if knowing one tells you nothing about the other — the distribution of Y is the same within every category of X.

The Core Idea

Observed vs expected counts

Chi-square compares what you observed in each cell with what you would expect if the two variables were independent. Big gaps between observed and expected = evidence of association.

01 Compute expected counts (if independent)
→ 02 Compare with observed counts, cell by cell
→ 03 Sum the scaled squared gaps → χ²
→ 04 Large χ² → reject independence

Expected Counts

What 'no relationship' would predict

Each cell's expected count is (row total × column total) ÷ grand total. It is the count you'd see if Y were distributed identically across every category of X.

Example: with 660 of 1,000 households owning a toilet (66%), independence predicts 66% of the 600 rural households — 396 — would own one. We observed only 320. That gap is the signal.

Observed vs Expected

Side by side (illustrative)

Cell	Observed	Expected	Gap
Rural, toilet	320	396	−76
Rural, no toilet	280	204	+76
Urban, toilet	340	264	+76
Urban, no toilet	60	136	−76

Rural areas have far fewer toilets than independence predicts, urban far more. Consistent gaps like these push χ² up. (Illustrative figures.)
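The expected counts and the χ² statistic for this table can be reproduced directly. The sketch below uses the plain Pearson formula, the sum of (observed − expected)² ÷ expected, with no continuity correction; some software applies a correction to 2×2 tables by default, so its output may differ slightly.

```python
observed = {
    "Rural": {"Has toilet": 320, "No toilet": 280},
    "Urban": {"Has toilet": 340, "No toilet": 60},
}

# Marginal totals for each variable on its own.
row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for col, n in cells.items():
        col_totals[col] = col_totals.get(col, 0) + n
grand_total = sum(row_totals.values())

# Expected count = (row total x column total) / grand total.
expected = {
    row: {col: row_totals[row] * col_totals[col] / grand_total for col in cells}
    for row, cells in observed.items()
}

# Sum the squared gaps, each scaled by its expected count.
chi_square = sum(
    (observed[row][col] - expected[row][col]) ** 2 / expected[row][col]
    for row in observed
    for col in observed[row]
)

print(expected["Rural"]["Has toilet"])   # 396.0
print(round(chi_square, 2))
```

Each gap is ±76, but it is divided by a different expected count, so the sparse 'Urban, no toilet' cell contributes the most to the total.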

Reading the Result

What chi-square tells you

Small p

Reject independence — the two variables ARE associated

Large p

No evidence of association — consistent with independence

Chi-square tells you whether there is an association, not how strong it is. For strength, report a measure like Cramér's V alongside the test.
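Cramér's V rescales χ² by the sample size and the table's smaller dimension, giving a strength measure between 0 and 1. A minimal sketch, applied to the illustrative toilet-by-location counts (the function name and the list-of-rows input are choices made for this example):

```python
import math


def cramers_v(table):
    """Cramér's V for a table of raw counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            chi_square += (obs - exp) ** 2 / exp

    # min(rows, columns) - 1 caps V at 1 for any table shape.
    smaller_dim = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square / (n * smaller_dim))


# Rows: rural, urban. Columns: has toilet, no toilet.
v = cramers_v([[320, 280], [340, 60]])
print(round(v, 2))
```

Because V is divided by n, doubling every count leaves it unchanged even though χ² itself would double.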

Validity

When chi-square is valid

Each case counted once (independent observations)
Cells hold counts, not percentages or means
Expected count ≥ 5 in (almost) every cell
Categories are mutually exclusive and exhaustive

The expected-count rule is the one most often broken: with thin cells the chi-square approximation fails. Use Fisher's exact test for small tables instead.

Common Mistakes

Four ways chi-square goes wrong

Feeding it percentages instead of raw counts
Ignoring tiny expected counts in sparse cells
Reading a significant result as proof of causation
Forgetting it says nothing about the strength of association

In Practice

Chi-square in a development workflow

It is the natural partner to the cross-tab: you describe the two categories with row percentages, then test whether the visible gap is bigger than chance with chi-square.

Worked flow

Caste × toilet ownership

Build the cross-tab → percentage within caste → eyeball the gap → check expected counts ≥ 5 → run chi-square → report the test and a strength measure → resist the leap to 'caste causes…'.

Recap

The category × category toolkit

Step	Tool	Answers
Describe	Cross-tab + %	What does the pattern look like?
Test	Chi-square	Is it bigger than chance?
Quantify strength	Cramér's V	How strong is it?
Small table?	Fisher's exact	Same question, thin cells

09

Section Nine

Simple Linear Regression

From Cloud to Line

Drawing the best line through a scatter

Correlation gives a number for a numeric–numeric relationship. Simple linear regression goes one step further: it fits the single straight line of best fit through the scatter, summarising the relationship as an equation.

Simple linear regression

Modelling a numeric response Y as a straight-line function of one numeric predictor X: Y = intercept + slope × X, plus error.

See It

A regression line over a scatter

Years of schooling vs monthly earnings, with line of best fit

Illustrative

The red line is the one that sits closest to all the points at once — it minimises the total squared vertical distance from points to line (least squares). Illustrative data.

The Equation

Y = a + bX

a (intercept)

Predicted Y when X = 0 — where the line crosses the vertical axis

b (slope)

Change in Y for each 1-unit rise in X — the rate of the relationship

The slope carries the action. Its sign matches the correlation's sign; its size is in real units of Y per unit of X.
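The least-squares values of a and b have a closed form: b is the co-variation of X and Y divided by the variation of X, and a makes the line pass through the point of means. A sketch with five hypothetical data points (schooling in years, earnings in ₹000 per month); the numbers are invented for illustration, not taken from any survey.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # line passes through the means
    return intercept, slope


years = [0, 4, 8, 12, 16]                  # hypothetical years of schooling
earnings = [4.5, 10.5, 15.0, 20.5, 24.5]   # hypothetical earnings, in Rs 000

a, b = fit_line(years, earnings)
print(a, b)
```

With these made-up points the slope is read as 'about ₹1,250 more monthly earnings per extra year of schooling', because Y is measured in thousands of rupees.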

Intercept Care

When the intercept is meaningless

The intercept is the predicted Y at X = 0 — but X = 0 may be nonsensical or far outside your data. The earnings at zero years of schooling may be a mathematical artefact, not a real group.

Interpret the intercept only when X = 0 is both meaningful and within the range of your data. Otherwise treat it as a mathematical anchor for the line.

Reading the Slope

What b means in plain words

In our example the line rises about ₹1,250 per extra year of schooling (illustrative). So b ≈ 1.25 in ₹000: ‘each additional year of schooling is associated with about ₹1,250 more monthly earnings.’

Note the phrase ‘associated with’, not ‘causes’. Regression fits a line; it does not, by itself, license a causal claim.

R-squared

How much variation the line explains

R² (coefficient of determination)

The proportion of the variation in Y that is explained by X through the fitted line. It runs from 0 (line explains nothing) to 1 (line explains everything). In simple regression, R² = r².

R² = 0.8

Line explains 80% of the variation in Y; 20% is left unexplained

R² = 0.1

Line explains just 10% — X is a weak guide to Y
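The identity R² = r² can be checked numerically by computing R² from the fitted line's residuals and comparing it with the squared correlation. The five data points below are hypothetical (schooling in years, earnings in ₹000 per month).

```python
xs = [0, 4, 8, 12, 16]                 # hypothetical years of schooling
ys = [4.5, 10.5, 15.0, 20.5, 24.5]     # hypothetical earnings, in Rs 000

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in Y

# Fit the least-squares line.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R-squared from the fit: 1 - unexplained variation / total variation.
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r2_from_fit = 1 - ss_residual / syy

# Squared Pearson correlation.
r_squared = sxy ** 2 / (sxx * syy)

print(round(r2_from_fit, 4), round(r_squared, 4))
```

The two routes give the same number, which holds for any simple (one-predictor) least-squares line but not in general once more predictors are added.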

R-squared Caution

High R² is not always good

A high R² does not prove the model is correct or causal
In social data, an R² of 0.2 can still be a useful finding
A low R² with a clear slope can still matter for policy
R² says nothing about whether a straight line fits

Report R² for context, but never let it crowd out the slope, its uncertainty, and a look at the scatter.

Prediction vs Explanation

Two different goals for one line

Explanation

Understand the relationship: how does Y change with X, and how strong is it? The slope is the prize.

Prediction

Estimate Y for a new X you haven't measured. Accuracy, not interpretation, is the prize.

Be clear which you are doing — the same line, read for different purposes, demands different cautions.

Extrapolation

Don't predict beyond your data

Extrapolation — using the line far outside the range of X you actually observed — is one of regression's classic traps. A line fitted on 0–16 years of schooling says nothing reliable about 30 years.

The relationship may bend, plateau or break entirely outside your data. Predict within the range you measured, and flag any step beyond it.

10

Section Ten

Pitfalls

The Big One

Correlation is not causation

A relationship between two variables can arise for several reasons, only one of which is ‘X causes Y’. This is the single most important caution in all of bivariate analysis.

Reverse causation: Y might cause X
Confounding: a third factor C drives both
Selection: how the sample was chosen creates the link
Chance: with enough variables, some correlate by luck

Confounding

The lurking third variable

A confounder is a variable linked to both X and Y that creates — or distorts — the relationship between them. It is the commonest reason a bivariate link is misleading.

01

Household wealth (confounder C)

→

02

drives girls' schooling (X)

→

03

AND drives child survival (Y)

→

04

so X and Y correlate — partly via C
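
The chain above can be sketched with a tiny constructed dataset. All numbers are hypothetical: schooling and survival are arbitrary index scores, built so that X and Y are unrelated within each wealth group yet strongly correlated once the groups are pooled.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores: wealth (C) shifts both schooling (X) and survival (Y) upward.
poor = {"schooling": [1, 1, 3, 3], "survival": [1, 3, 1, 3]}
rich = {"schooling": [6, 6, 8, 8], "survival": [6, 8, 6, 8]}

r_poor = pearson_r(poor["schooling"], poor["survival"])
r_rich = pearson_r(rich["schooling"], rich["survival"])
r_pooled = pearson_r(poor["schooling"] + rich["schooling"],
                     poor["survival"] + rich["survival"])

print(f"within poor: {r_poor:.2f}, within rich: {r_rich:.2f}, pooled: {r_pooled:.2f}")
```

In this extreme case the pooled correlation of about 0.86 is created entirely by wealth; in real data the confounder usually explains part of the link rather than all of it.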


Why It Matters Here

Bivariate links almost always hide confounders

Recall our literacy–fertility scatter. Female literacy really does track lower fertility — but richer states also have better health systems, later marriage and more contraception. Wealth and development confound the simple two-variable picture.

This is exactly why bivariate analysis is a starting point. To isolate one variable's effect you need to control for confounders — the job of multivariate regression.


Ecological Fallacy

Group patterns ≠ individual truths

Ecological fallacy

Wrongly inferring something about individuals from a relationship observed only at the group (district, state) level.

States with higher average literacy have lower average fertility, but that does not mean the literate women within a state are the ones with fewer children. A relationship across states or districts need not hold across people.
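
A small constructed example shows how this can happen. The records below are hypothetical: two states, four women each, arranged so that the more literate state has lower fertility while literate and non-literate women within each state have the same number of children.

```python
from statistics import fmean

# Hypothetical individual records: (state, is_literate, number_of_children).
women = [
    ("State A", True, 4), ("State A", False, 4), ("State A", False, 4), ("State A", False, 4),
    ("State B", True, 2), ("State B", True, 2), ("State B", True, 2), ("State B", False, 2),
]

def state_summary(records, state):
    """Return (literacy rate, mean children, literate-minus-non-literate gap) for one state."""
    rows = [r for r in records if r[0] == state]
    literacy = fmean(1 if lit else 0 for _, lit, _ in rows)
    mean_children = fmean(kids for _, _, kids in rows)
    gap = (fmean(kids for _, lit, kids in rows if lit)
           - fmean(kids for _, lit, kids in rows if not lit))
    return literacy, mean_children, gap

for state in ("State A", "State B"):
    literacy, mean_children, gap = state_summary(women, state)
    print(f"{state}: literacy {literacy:.0%}, mean children {mean_children:.1f}, "
          f"within-state gap {gap:+.1f}")
```

The state-level pattern (more literacy, fewer children) is real, but in these made-up records it says nothing about which women within a state have fewer children.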


Outliers & Leverage

One point can swing everything

A single extreme point can dominate a correlation or drag a regression line toward itself. Points with unusual X values have the most pull (high leverage), so the whole ‘relationship’ may rest on one unusual case.

Always check: does the pattern survive if I remove the most extreme point? If r or the slope collapses, the finding was that one point, not a real trend.
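
The check can be sketched in a few lines. The data are hypothetical, and 'most extreme' is taken here to mean the point furthest from the mean of X, which is one simple choice among several.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def drop_most_extreme(x, y):
    """Remove the point whose x value lies furthest from the mean of x."""
    mx = fmean(x)
    i = max(range(len(x)), key=lambda k: abs(x[k] - mx))
    return x[:i] + x[i + 1:], y[:i] + y[i + 1:]

# Hypothetical data: four unrelated points plus one extreme case at (20, 20).
x = [1, 1, 3, 3, 20]
y = [1, 3, 1, 3, 20]

r_all = pearson_r(x, y)
r_trimmed = pearson_r(*drop_most_extreme(x, y))
print(f"r with all points: {r_all:.2f}, r without the extreme point: {r_trimmed:.2f}")
```

Here r falls from about 0.98 to zero once the single extreme case is set aside, so the apparent trend was that one point.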


Non-linearity

Straight-line tools miss curves

Pearson's r and linear regression assume a straight line. Curved relationships are common in real data: a U-shaped one can show a near-zero correlation even though the two variables are strongly, but non-linearly, related, and a saturating one will have its strength understated by r.

The fix is the cheapest in statistics: plot the scatter. A curve is obvious to the eye and invisible to r.
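
A minimal illustration, using a made-up perfect U-shape: Y is completely determined by X, yet Pearson's r is zero.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical U-shape: y depends on x exactly, but not along a straight line.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(f"r = {r:.2f}")  # the falling left half and rising right half cancel out
```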


Anscombe

Four datasets, identical statistics

Anscombe-style: same r, four very different shapes

After F. J. Anscombe (1973)

Anscombe's quartet (1973): four datasets with near-identical means, variances and correlation, yet utterly different shapes. The lesson, still true: always plot your data.
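
The quartet is small enough to check by hand. The sketch below types in the four datasets as published by Anscombe (1973) and computes the mean of Y and the correlation for each; `pearson_r` is an illustrative helper.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I, II and III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: (fmean(y), pearson_r(x, y)) for name, (x, y) in quartet.items()}
for name, (mean_y, r) in results.items():
    print(f"Dataset {name}: mean of y = {mean_y:.2f}, r = {r:.3f}")
```

All four report a mean of about 7.50 and r of about 0.816, which is exactly why the summary numbers cannot stand in for the plots.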


Simpson's Paradox

A trend can reverse when you split

Simpson's paradox: a relationship in the pooled data can flip direction within every subgroup. A scheme can look worse overall yet be better in every district, if it operates mostly in the districts with the poorest baseline outcomes.

Always disaggregate before concluding. The aggregate two-variable relationship can point the opposite way to the truth.
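
A worked example with made-up counts shows the reversal. In this sketch the scheme has a higher success rate than the comparison group in both districts, but most scheme participants live in the district with the weaker baseline, so the pooled rates point the other way.

```python
# Hypothetical counts: (successes, total) for each arm in each district.
data = {
    "District A": {"scheme": (18, 20), "no scheme": (68, 80)},   # strong baseline
    "District B": {"scheme": (32, 80), "no scheme": (6, 20)},    # weak baseline
}

def rate(successes, total):
    return successes / total

# Within-district success rates.
by_district = {
    district: {arm: rate(*counts) for arm, counts in arms.items()}
    for district, arms in data.items()
}

# Pooled success rates: add up successes and totals across districts first.
pooled = {
    arm: rate(sum(data[d][arm][0] for d in data), sum(data[d][arm][1] for d in data))
    for arm in ("scheme", "no scheme")
}

for district, rates in by_district.items():
    print(f"{district}: scheme {rates['scheme']:.0%} vs no scheme {rates['no scheme']:.0%}")
print(f"Pooled: scheme {pooled['scheme']:.0%} vs no scheme {pooled['no scheme']:.0%}")
```

The scheme wins 90% to 85% in District A and 40% to 30% in District B, yet loses 50% to 74% when the districts are pooled.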


Pitfall Checklist

Before you believe a relationship

Did I plot it — is the shape really linear?
Could a confounder explain it?
Does it survive removing the outlier?
Is this a group pattern I'm reading onto individuals?
Does it reverse when I disaggregate?
Is there a plausible mechanism, or just a coincidence?


11

Section Eleven

Reporting, Practice & Tools


Report Honestly

How to report a relationship

State the direction and strength in plain words
Give the effect size in real units, not just a p-value
Report the uncertainty — confidence interval or range
Name the sample size and how it was drawn
Flag the obvious confounders you could not rule out
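
As a sketch of what such a report might contain, the code below computes a slope in the units of the data, an approximate 95% confidence interval, and the sample size. The data are hypothetical, and the interval uses the normal critical value, which is an assumption that only holds well for large samples; with a small n like this one, a t critical value would give a wider interval.

```python
from math import sqrt
from statistics import NormalDist, fmean

def slope_with_ci(x, y, level=0.95):
    """Least-squares slope with an approximate (normal-based) confidence interval."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    std_err = sqrt(ss_res / (n - 2) / sxx)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return slope, slope - z * std_err, slope + z * std_err, n

# Hypothetical data: y = 5 + 2x with scatter of +/-1 around the line.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [5 + 2 * a + e for a, e in zip(x, [1, -1, -1, 1, 1, -1, -1, 1])]

slope, ci_low, ci_high, n = slope_with_ci(x, y)
print(f"Each extra unit of X goes with {slope:.2f} more units of Y "
      f"(approx. 95% CI {ci_low:.2f} to {ci_high:.2f}; n = {n})")
```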


Causal Language

Mind your verbs

Safe phrasing

‘is associated with’
‘tends to go with’
‘is correlated with’

Reserve for evidence

‘causes’
‘leads to’
‘increases / reduces’

Causal verbs need a causal design — an experiment or a careful quasi-experiment. A bivariate correlation does not earn them.


Common Sins

What to avoid in write-ups

Reporting a p-value with no effect size
Claiming causation from a correlation
Hiding the scatter behind a single coefficient
Testing many pairs and reporting only the one that worked
Using a state-level relationship to make individual claims


A Worked Habit

A reliable bivariate workflow

01

CLASSIFY: what type is each variable?

→

02

PLOT: scatter, box plot or bars

→

03

DESCRIBE: direction, strength, shape

→

04

TEST: the matching method

→

05

INTERPRET: effect size + uncertainty

→

06

CAUTION: confounders, fallacies

Six steps, every time. The discipline is what turns a number into a defensible finding.
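
The CLASSIFY and TEST steps can be sketched as a simple lookup that follows the methods covered in this course. `suggest_method` and its type labels are illustrative names, and the mapping is a starting point rather than a complete decision rule.

```python
def suggest_method(type_x, type_y, n_groups=None):
    """Suggest a plot and a method from the two variable types.

    Types are 'numeric' or 'categorical'. n_groups is the number of categories
    and is needed only when exactly one variable is categorical.
    """
    types = {type_x, type_y}
    if types == {"numeric"}:
        return "scatter plot, then correlation or simple linear regression"
    if types == {"categorical"}:
        return "cross-tabulation, then chi-square test"
    if types == {"numeric", "categorical"}:
        if n_groups == 2:
            return "box plot, then two-group comparison (t-test)"
        if n_groups is not None and n_groups > 2:
            return "box plot, then ANOVA"
        raise ValueError("n_groups (2 or more) is required when one variable is categorical")
    raise ValueError("types must be 'numeric' or 'categorical'")

print(suggest_method("numeric", "numeric"))
print(suggest_method("categorical", "numeric", n_groups=2))
print(suggest_method("numeric", "categorical", n_groups=4))
print(suggest_method("categorical", "categorical"))
```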


Tools

Software for bivariate analysis

Tool	Good for	Note
Excel / Google Sheets	Cross-tabs, scatter, CORREL, t-test	Start here; pivot tables for cross-tabs
R	Every test, publication graphics	Free; cor.test, t.test, aov, chisq.test, lm
Python (pandas, scipy, statsmodels)	Cleaning + tests + regression	Free, general-purpose
Jamovi / JASP	Point-and-click stats	Free, friendly for learners
Stata / SPSS	Survey data, weights	Common in research shops

Tools matter less than habits: classify, plot, then test. A clear scatter in a spreadsheet beats a misread coefficient in any package.
