ImpactMojoBivariate Analysis 101www.impactmojo.in
ImpactMojo 101 Series · Free Forever
Bivariate Analysis 101
How Two Variables Move Together — Cross-Tabs, Correlation, Group Comparisons & Simple Regression for Development Practitioners in South Asia
Research-Backed · South Asia Focus · ~90 Slides · Free Access
What We Cover
01 What Bivariate Analysis Is
02 Choosing a Method by Variable Type
03 Cross-Tabulation & Contingency Tables
04 Visualising Relationships
05 Correlation
06 Comparing Two Groups
07 Comparing Several Groups (ANOVA)
08 Association Between Categories (Chi-Square)
09 Simple Linear Regression
10 Pitfalls
11 Reporting, Practice & Tools
01
Section One
What Bivariate Analysis Is
From one variable to two
Univariate analysis describes one variable at a time — the average household size, the spread of incomes. Bivariate analysis is the very next step: it asks how two variables relate to each other.
Bivariate analysis
The study of the relationship between two variables — whether they move together, how strongly, and in what direction. 'Bi' = two; 'variate' = variable.
Most interesting development questions are bivariate at heart: does this go with that? Does schooling go with earnings? Does the programme go with better outcomes?
Where it sits in the analysis ladder
01 UNIVARIATE: describe one variable (centre, spread, shape)
02 BIVARIATE: relate two variables (direction, strength)
03 MULTIVARIATE: many variables at once (control, adjust)
This course lives squarely in the middle rung. Master it before reaching for regression with twenty controls — the two-variable picture is where most reasoning errors are caught.
Direction and strength
Direction
Do they move the same way (positive) or opposite ways (negative)? More literacy, fewer births — that is a negative direction.
Strength
How tightly do they track each other? A loose cloud is weak; a near-straight line is strong.
Almost every bivariate tool you will meet is just a precise way to answer these two questions — plus a third: could this be chance?
Explanatory and response variables
Explanatory variable (X)
The variable you think does the explaining or predicting — also called the independent or predictor variable. Conventionally on the horizontal axis.
Response variable (Y)
The outcome you want to understand or predict — also called the dependent variable. Conventionally on the vertical axis.
Naming X and Y does not prove X causes Y. It only states which one you are treating as the outcome. The causal claim must be earned separately.
The questions practitioners actually ask
  • Do villages with self-help groups have higher women's savings?
  • Is anaemia more common among Adivasi women than others?
  • Does distance to a health centre predict institutional delivery?
  • Did test scores differ between the treated and control schools?
  • Is caste associated with whether a household has a toilet?
Each is a relationship between two variables — and each maps to a specific bivariate method, which Section Two helps you choose.
A relationship is a clue, not a verdict
Bivariate analysis can reveal a relationship, quantify its strength, and tell you whether it is likely real or just noise. What it cannot do, on its own, is prove that one variable causes the other.
The plural of anecdote is not data; and the presence of correlation is not the presence of cause.
— a working principle of careful analysis
How this course is built
Tools
  • Cross-tabs for category × category
  • Correlation for number × number
  • t-tests and ANOVA for comparing groups
  • Chi-square and simple regression
Judgement
  • Choosing the right method
  • Reading effect size, not just p-values
  • Spotting confounders and fallacies
  • Reporting a relationship honestly
Examples come from India and the wider region — the data you actually meet at work.
02
Section Two
Choosing a Method by Variable Type
First, classify both variables
The single most useful habit in bivariate analysis: before choosing any method, ask what kind of variable each of your two is — categorical or numeric. The pair of answers points straight to the right tool.
Categorical
Labels or groups — sex, caste, religion, district, yes/no. Includes ordered categories (wealth quintile, Likert scale).
Numeric
Counts and measurements — age, income, test score, fertility rate, distance in km. You can meaningfully average them.
Method by the two variable types
X type | Y type | Describe with | Test with
Categorical | Categorical | Cross-tab, % & bars | Chi-square test
Categorical (2 groups) | Numeric | Group means, box plots | t-test
Categorical (3+ groups) | Numeric | Group means, box plots | One-way ANOVA
Numeric | Numeric | Scatter plot | Correlation / regression
Pin this table to your wall. Nine times out of ten, classifying your two variables tells you exactly which method to reach for.
Two categories: is membership linked?
When both variables are categories — say caste group and has a household toilet (yes/no) — you ask whether knowing one tells you anything about the other.
Describe it with a cross-tabulation and percentages; test it with a chi-square test of independence. Covered in Sections Three and Eight.
A group label and a number: compare means
When one variable is a group label and the other is numeric — treatment vs control and test score — you compare the numeric outcome across the groups.
Two groups
Difference in means — the t-test idea. Section Six.
Three or more groups
One-way ANOVA compares all the group means at once. Section Seven.
Two numbers: do they track each other?
When both variables are numeric — female literacy and fertility rate across districts — plot one against the other on a scatter and measure how tightly they move together.
Describe the strength with correlation (Section Five); model the line with simple linear regression (Section Nine).
The method serves the question, not the reverse
Do not pick a fancy test and then hunt for variables to feed it. Start from the real question, classify the two variables it involves, and let the decision table hand you the method.
Far better an approximate answer to the right question than an exact answer to the wrong one.
— John Tukey
03
Section Three
Cross-Tabulation & Contingency Tables
Counting two categories together
Cross-tabulation (contingency table)
A table that counts how many cases fall into each combination of two categorical variables — rows for one variable, columns for the other.
It is among the most-used tools in applied development analysis: a single table that shows, at a glance, how two group memberships line up.
Toilet ownership by location (raw counts)
Location | Has toilet | No toilet | Total
Rural | 320 | 280 | 600
Urban | 340 | 60 | 400
Total | 660 | 340 | 1,000
Illustrative figures. The four inner cells are the joint counts; the right and bottom margins are the marginal totals for each variable on its own.
Raw counts mislead when groups differ in size
Rural and urban have different totals (600 vs 400), so comparing raw counts is unfair. To compare fairly, convert to percentages — but in which direction?
This is the single most common cross-tab error: percentaging the wrong way and reading a relationship backwards. Get the direction right and the table tells the truth.
Percent within each row
Location | Has toilet | No toilet | Row total
Rural | 53% | 47% | 100%
Urban | 85% | 15% | 100%
Each row sums to 100%. This answers: of rural households, what share have a toilet? 53% rural vs 85% urban — a clear location gap. (Illustrative.)
Percent within each column
Location | Has toilet | No toilet
Rural | 48% | 82%
Urban | 52% | 18%
Column total | 100% | 100%
Each column sums to 100%. This answers a different question: of households without a toilet, what share are rural? 82%. Same table, different story.
Percentage in the direction of the cause
Convention: percentage within categories of the explanatory variable, then compare across them. If location (X) might shape toilet ownership (Y), percentage within rural and within urban — that is row percentaging here.
01 Put X in the rows
02 Percentage so each row = 100%
03 Compare the same Y-column across rows
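The two directions of percentaging can be reproduced in a few lines of Python. This is a minimal sketch using the illustrative toilet-ownership counts; the dictionary layout and the function names `row_percentages` and `column_percentages` are assumptions made for this example, not part of any library.

```python
# Illustrative counts from the toilet-ownership cross-tab (rows = location).
counts = {
    "Rural": {"Has toilet": 320, "No toilet": 280},
    "Urban": {"Has toilet": 340, "No toilet": 60},
}

def row_percentages(table):
    """Percentage within each row, so every row sums to 100."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result

def column_percentages(table):
    """Percentage within each column, so every column sums to 100."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }

row_pct = row_percentages(counts)
col_pct = column_percentages(counts)

# Row percentages answer: of rural (or urban) households, what share have a toilet?
for location in counts:
    print(location, {col: round(p) for col, p in row_pct[location].items()})

# The gap to compare across rows, in percentage points.
gap = row_pct["Urban"]["Has toilet"] - row_pct["Rural"]["Has toilet"]
print("Urban minus rural gap:", round(gap), "percentage points")
```

Swapping `row_pct` for `col_pct` in the loop gives the column view, which answers the other question (of households without a toilet, what share are rural).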
What a difference in percentages means
85% − 53%
= 32 percentage-point gap in toilet ownership, urban vs rural
Direction
Urban households far more likely to have a toilet — a positive urban–toilet link
A gap in the table suggests an association. Whether it is bigger than chance is what the chi-square test in Section Eight decides.
04
Section Four
Visualising Relationships
Always look before you compute
Before any coefficient or test, draw the relationship. A picture reveals direction, strength, curvature, clusters and outliers that a single number can hide entirely.
The greatest value of a picture is when it forces us to notice what we never expected to see.
— John Tukey
Which chart for which pair
Variable pair | Best plot | Shows
Numeric × numeric | Scatter plot | Direction, strength, shape
Categorical × numeric | Box plot by group | Spread & median per group
Categorical × numeric | Grouped / clustered bars | Mean per group
Categorical × categorical | Stacked / grouped bars | Shares within groups
The workhorse for two numbers
A scatter plot puts the explanatory variable on the X-axis, the response on the Y-axis, and one dot per case. The cloud's tilt shows direction; its tightness shows strength.
Read it like this: upward cloud = positive; downward = negative; round blob = no linear relationship; tight line = strong; fat cloud = weak.
Three scatters: positive, negative, none
Chart: Same axes, three relationships (illustrative).
Green rises, red falls, indigo wanders. Your eye reads direction and strength instantly — before any number.
Female literacy vs child mortality, by state
Chart: Female literacy (%) vs under-5 mortality (per 1,000), major states. Illustrative, patterned on Census 2011 & NFHS-5.
A strong negative pattern: states with higher female literacy tend to have lower child mortality. Illustrative, but it mirrors the real Census–NFHS picture closely.
Comparing a number across groups
For a categorical X and numeric Y, a box plot per group beats a single average. Each box shows the median (the line), the middle 50% (the box), the range (the whiskers) and outliers (the dots).
Box plots let you see not just whether the centres differ between groups, but whether the spreads do — two distributions with the same mean can look completely different.
A mean compared across groups
Chart: Mean monthly earnings by education level (₹000). Illustrative, patterned on PLFS-style data.
A clean grouped bar shows the mean of a numeric outcome rising across ordered categories — the visual form of a category × number relationship.
The same rules still apply
  • Start bar axes at zero — a truncated axis fakes a gap
  • Label both axes with units
  • Note the denominator and sample size
  • Put the source and date on the chart
  • Do not let one outlier dominate a scatter unremarked
A bivariate chart is an argument about a relationship. Truncated axes and missing denominators turn an honest pattern into a misleading one.
05
Section Five
Correlation
Putting a value on 'move together'
Correlation coefficient (r)
A single number summarising how strongly and in which direction two numeric variables move together in a straight-line sense. It ranges from −1 to +1.
Where a scatter shows the relationship, r measures it — one number for direction and strength combined.
r runs from −1 to +1
−1
Perfect negative — points fall exactly on a downward line
0
No linear relationship — a shapeless cloud
+1
Perfect positive — points fall exactly on an upward line
The sign gives direction; the distance from zero gives strength. r = −0.8 and r = +0.8 are equally strong, opposite in direction.
The default: Pearson correlation
The everyday correlation is Pearson's r. Crucially, it measures only the linear (straight-line) component of a relationship between two numeric variables.
If the true relationship is curved — rising then falling — Pearson's r can be near zero even when the two variables are tightly related. Pearson sees lines, not curves.
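The point that Pearson sees lines and not curves can be checked directly. The sketch below computes r from its textbook definition on two made-up series: one exactly linear, one that rises then falls. The data and the helper name `pearson_r` are assumptions for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
line = [2 * x + 1 for x in xs]      # exact straight line
curve = [9 - x ** 2 for x in xs]    # rises, peaks at x = 0, then falls

r_line = pearson_r(xs, line)
r_curve = pearson_r(xs, curve)

print("straight line r:", round(r_line, 3))
print("rise-then-fall r:", round(r_curve, 3))
```

In the second series y is fully determined by x, yet r comes out at zero because the upward half and the downward half cancel.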
Interpreting the size of r
Absolute value of r | Rough strength | What the scatter looks like
0.0 – 0.1 | Negligible | Shapeless cloud
0.1 – 0.3 | Weak | Faint tilt
0.3 – 0.5 | Moderate | Clear tilt, wide scatter
0.5 – 0.7 | Strong | Tight tilt
0.7 – 1.0 | Very strong | Near a straight line
These bands are conventions, not laws. In messy social data, an r of 0.3 can be a genuinely important signal.
Literacy and fertility, quantified
Chart: Female literacy (%) vs total fertility rate, major states. Illustrative, patterned on Census 2011 & NFHS-5.
This cloud has a clear downward tilt — an r near −0.9 (illustrative). Strong and negative. But strength is not the same as cause: literacy may proxy for income, health and much else.
When to use rank correlation instead
Spearman's rank correlation (ρ)
Pearson's correlation computed on the ranks of the data rather than the raw values. It measures whether two variables move together in the same order, not strictly in a straight line.
  • Use it for ordinal data — wealth quintiles, Likert scales
  • Use it for skewed data — income, landholding
  • Use it when the relationship is monotonic but curved
  • It is robust to outliers that would swing Pearson's r
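Because Spearman's ρ is Pearson's r applied to ranks, it can be sketched by ranking each variable and reusing the Pearson formula. The example below is a minimal illustration with made-up values; the helper names and the `landholding` and `income` series are hypothetical. Ties receive the average of the ranks they span, which is the usual convention.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: income rises with landholding, but along a curve.
landholding = [1, 2, 3, 4, 5, 6]
income = [x ** 3 for x in landholding]

rho = spearman_rho(landholding, income)
r = pearson_r(landholding, income)
print("Spearman rho:", round(rho, 3))
print("Pearson r:", round(r, 3))
```

Here the ordering is perfectly preserved, so ρ reaches 1, while Pearson's r stays below 1 (about 0.94) because the points do not sit on a straight line.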
Two correlations, side by side
Aspect | Pearson r | Spearman ρ
Measures | Linear association | Monotonic (rank) association
Best data | Numeric, roughly symmetric | Ordinal or skewed numeric
Outlier-sensitive? | Yes: one point can swing it | Much less: it uses ranks
Range | −1 to +1 | −1 to +1
When Pearson and Spearman disagree sharply, suspect non-linearity or an outlier — and go back to the scatter.
A big r and a small p are different things
Strength (effect size)
How big is the relationship? That is what r itself tells you — the practical magnitude.
Significance (p-value)
How sure are we it is not zero by chance? That depends heavily on sample size.
With thousands of cases, a tiny, uninteresting r = 0.05 can be 'statistically significant'. Always report the size, not just the star.
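To see how sample size alone can make a tiny correlation 'significant', one standard approach converts r into a t statistic, t = r × √((n − 2) / (1 − r²)). The sketch below applies it to the same r = 0.05 at two sample sizes; the sample sizes are made up, and the threshold of roughly 2 is only the usual rule of thumb for the 5% level.

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Same weak correlation, two hypothetical sample sizes.
t_small = t_for_correlation(0.05, 50)
t_large = t_for_correlation(0.05, 5000)

print("n = 50:   t =", round(t_small, 2))   # well below 2
print("n = 5000: t =", round(t_large, 2))   # well above 2
```

The strength of the relationship is identical in both cases; only the certainty that it is not exactly zero has changed.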
Square it for 'variance explained'
Square the correlation and you get R², the share of the variation in one variable that is statistically accounted for by the other. r = 0.7 gives R² = 0.49: about half the variation is shared.
We return to R² properly in the regression section — it is the natural bridge from correlation to a fitted line.
06
Section Six
Comparing Two Groups
One group label, one number
A categorical X with two values (treatment/control, girls/boys, SHG/non-SHG) and a numeric Y (score, income, weight). The question: do the two group means differ — and is the difference real?
01 Split the data into two groups by X
02 Compute each group's mean of Y
03 Ask: is the gap bigger than chance?
The quantity of interest
Chart: Mean test score, control vs treatment schools (illustrative).
The raw difference is 7 points (61 − 54). But two questions remain: could a 7-point gap arise by chance, and is 7 points big enough to matter?
Signal divided by noise
t-test
A test of whether the difference between two group means is larger than we would expect from sampling variation alone — essentially, the size of the gap relative to how noisy the data are.
Intuitively, t ≈ difference in means ÷ uncertainty in that difference. A big, clean gap gives a big t; a small gap drowned in scatter gives a small t.
Three things make a gap convincing
  • A larger difference between the two means
  • Less spread (variation) within each group
  • A larger sample in each group
The same 7-point gap is unconvincing with 20 noisy pupils per arm, but compelling with 2,000 tightly clustered ones. Sample size and spread decide the verdict.
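The effect of spread and sample size on the same 7-point gap can be sketched from summary statistics alone. The code below uses the unequal-variance (Welch) form of the t statistic, gap ÷ √(s₁²/n₁ + s₂²/n₂). The means 54 and 61 are the illustrative ones from the slides; the standard deviations of 20 and 8 are assumptions chosen to stand in for 'noisy' and 'tightly clustered'.

```python
import math

def t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference in means divided by the standard error of that difference."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_b - mean_a) / standard_error

# 20 noisy pupils per arm (assumed SD of 20 marks).
t_noisy = t_from_summary(54, 20, 20, 61, 20, 20)

# 2,000 tightly clustered pupils per arm (assumed SD of 8 marks).
t_tight = t_from_summary(54, 8, 2000, 61, 8, 2000)

print("small, noisy sample: t =", round(t_noisy, 2))
print("large, tight sample: t =", round(t_tight, 2))
```

The numerator is 7 in both calls; only the denominator, the uncertainty in the gap, differs.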
What a p-value does and doesn't say
p-value
The probability of seeing a difference at least this large if there were truly no difference between the groups. A small p (say < 0.05) means a gap this large would be unusual if chance alone were at work.
A p-value is not the probability the effect is real, nor a measure of its size. It only addresses 'could this be chance?' — nothing about importance.
How big, in plain terms
Significance asks whether there is a difference; effect size asks how big. Report it in real units (7 marks, ₹400/month) or as a standardised measure (Cohen's d) so others can judge whether it matters.
p-value
Could a gap this large arise by chance alone?
Effect size
Is the gap big enough to act on?
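Cohen's d expresses the gap in units of the pooled standard deviation. A minimal sketch, assuming the illustrative means of 54 and 61 and a hypothetical standard deviation of 14 marks with 200 pupils in each arm:

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_variance = (
        (n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2
    ) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_variance)

# Assumed spread and sample sizes; only the two means come from the slides.
d = cohens_d(54, 14, 200, 61, 14, 200)
print("Cohen's d:", round(d, 2))
```

Cohen's own rough benchmarks (0.2 small, 0.5 medium, 0.8 large) are conventions, so the real-unit gap of 7 marks is worth reporting alongside d.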
Significant but trivial — and vice versa
Significant but trivial
Huge sample → a 0.3-mark difference is 'significant' but means nothing for any child.
Real but 'not significant'
Tiny pilot → a promising 9-point gap fails the test purely for lack of sample. Absence of evidence is not evidence of absence.
Always read the p-value and the effect size together. Neither alone tells the whole story.
When the simple t-test is appropriate
  • The two groups are independent (different people)
  • Y is roughly symmetric within each group (or n is large)
  • For paired data — same people before/after — use a paired t-test instead
  • For severely skewed data, consider a rank-based alternative
07
Section Seven
Comparing Several Groups (ANOVA)
When there are three or more groups
A categorical X with three or more values — four caste groups, five wealth quintiles, six districts — and a numeric Y. We want to know whether the group means differ overall.
Tempting shortcut: run a t-test on every pair. Don't. With many pairs, the chance of a false 'significant' result piles up fast — the multiple-comparisons problem.
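How fast the false-positive risk piles up can be approximated with 1 − (1 − α)^k, where k is the number of pairwise tests. The sketch below treats the tests as independent, which pairwise comparisons that share groups are not, so read the result as a rough guide and not an exact rate. The function name is hypothetical.

```python
from math import comb

def familywise_error(groups, alpha=0.05):
    """Number of pairs and the approximate chance of at least one false positive."""
    pairs = comb(groups, 2)
    return pairs, 1 - (1 - alpha) ** pairs

for g in (3, 4, 5, 6):
    pairs, risk = familywise_error(g)
    print(f"{g} groups: {pairs} pairs, about {risk:.0%} chance of a false alarm")

pairs_six, risk_six = familywise_error(6)
```

With six districts there are 15 pairs, and under this approximation the chance of at least one spurious 'significant' result is a little over one in two.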
One-way ANOVA, conceptually
One-way ANOVA
Analysis of Variance — a single test of whether the means of three or more groups differ by more than chance, by comparing variation BETWEEN groups to variation WITHIN groups.
Despite the name, ANOVA is about means. It uses variances as the yardstick for deciding whether those means are really apart.
The heart of ANOVA
Between-group variation
How far apart the group means are from each other — the signal.
Within-group variation
How much individuals scatter inside each group — the noise.
The F-ratio = between ÷ within, with each variation averaged over its degrees of freedom. If groups sit far apart relative to their internal scatter, F is large, which is evidence the means genuinely differ.
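The between ÷ within idea can be sketched in a few lines. In the standard calculation each sum of squares is divided by its degrees of freedom (groups − 1, and cases − groups) before the ratio is taken. The three small groups below are made-up numbers chosen so the arithmetic is easy to follow by hand.

```python
def one_way_f(groups):
    """F-ratio: mean square between groups over mean square within groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of cases

    # Signal: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Noise: how far individuals sit from their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups for v in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical scores for three groups with means 5, 8 and 11.
scores = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
f_ratio = one_way_f(scores)
print("F =", f_ratio)
```

Here the between sum of squares is 54 on 2 degrees of freedom and the within sum of squares is 6 on 6, giving F = 27: group means far apart relative to the scatter inside each group.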
Mean nutrition score by wealth quintile
Chart: Mean child height-for-age z-score by wealth quintile. Illustrative, patterned on NFHS-5.
Five group means rising steadily across quintiles. ANOVA asks: taken together, is this spread of means bigger than the scatter within each quintile? (Illustrative; the real gradient is well documented.)
It says 'somewhere', not 'where'
A significant ANOVA tells you the groups are not all equal — but not which ones differ. To locate the differences, follow up with post-hoc pairwise comparisons that correct for multiple testing.
Think of ANOVA as the smoke alarm: it tells you there is a fire somewhere, then post-hoc tests find the room.
When one-way ANOVA fits
  • Groups are independent
  • Y is roughly symmetric within each group (or n is large)
  • Group spreads are not wildly different
  • For badly skewed data, a rank-based alternative (Kruskal–Wallis) may suit
ANOVA is the natural extension of the two-group t-test to many groups — same logic, scaled up.
Comparing groups, at a glance
Groups | Y type | Method
2 (independent) | Numeric | Two-sample t-test
2 (same units, paired) | Numeric | Paired t-test
3 or more | Numeric | One-way ANOVA
3+, skewed data | Numeric | Kruskal–Wallis
08
Section Eight
Association Between Categories (Chi-Square)
Two categories: are they independent?
Back to two categorical variables and a cross-tab. The chi-square test of independence asks: is the pattern in this table bigger than we would expect if the two variables were completely unrelated?
Independence
Two variables are independent if knowing one tells you nothing about the other — the distribution of Y is the same within every category of X.
Observed vs expected counts
Chi-square compares what you observed in each cell with what you would expect if the two variables were independent. Big gaps between observed and expected = evidence of association.
01 Compute expected counts (if independent)
02 Compare with observed counts, cell by cell
03 Sum the scaled squared gaps → χ²
04 Large χ² → reject independence
What 'no relationship' would predict
Each cell's expected count is (row total × column total) ÷ grand total. It is the count you'd see if Y were distributed identically across every category of X.
Example: with 660 of 1,000 households owning a toilet (66%), independence predicts 66% of the 600 rural households — 396 — would own one. We observed only 320. That gap is the signal.
Side by side (illustrative)
Cell | Observed | Expected | Gap
Rural, toilet | 320 | 396 | −76
Rural, no toilet | 280 | 204 | +76
Urban, toilet | 340 | 264 | +76
Urban, no toilet | 60 | 136 | −76
Rural areas have far fewer toilets than independence predicts, urban far more. Consistent gaps like these push χ² up. (Illustrative figures.)
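The expected counts and the χ² statistic for this table can be computed by hand. The sketch below uses the illustrative observed counts and the formula (row total × column total) ÷ grand total; it applies no continuity correction, so a statistics package that does apply one to 2×2 tables will report a slightly smaller value. The function names are hypothetical.

```python
# Illustrative observed counts: rows = Rural, Urban; columns = toilet, no toilet.
observed = [[320, 280], [340, 60]]

def expected_counts(table):
    """Counts predicted under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

expected = expected_counts(observed)
chi2 = chi_square(observed)

print("expected:", expected)
print("chi-square:", round(chi2, 2))
```

For a 2×2 table there is one degree of freedom, and the 5% critical value is 3.84; the statistic here, a little over 107, is far beyond it.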
What chi-square tells you
Small p
Reject independence — the two variables ARE associated
Large p
No evidence of association — consistent with independence
Chi-square tells you whether there is an association, not how strong it is. For strength, report a measure like Cramér's V alongside the test.
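Cramér's V rescales χ² to a 0-to-1 strength measure: V = √(χ² ÷ (n × (min(rows, columns) − 1))). A minimal sketch on the illustrative toilet-ownership counts, with χ² computed without a continuity correction; the function name is an assumption for this example.

```python
import math

# Illustrative observed counts: rows = Rural, Urban; columns = toilet, no toilet.
observed = [[320, 280], [340, 60]]

def cramers_v(table):
    """Cramér's V: chi-square rescaled by sample size and table dimensions."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (o - e) ** 2 / e
    smaller_dim = min(len(row_totals), len(col_totals))
    return math.sqrt(chi2 / (n * (smaller_dim - 1)))

v = cramers_v(observed)
print("Cramér's V:", round(v, 2))
```

A V of about 0.33 describes a moderate association, even though the χ² test itself is overwhelmingly significant with 1,000 households.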
When chi-square is valid
  • Each case counted once (independent observations)
  • Cells hold counts, not percentages or means
  • Expected count ≥ 5 in (almost) every cell
  • Categories are mutually exclusive and exhaustive
The expected-count rule is the one most often broken: with thin cells the chi-square approximation fails. Use Fisher's exact test for small tables instead.
Four ways chi-square goes wrong
  • Feeding it percentages instead of raw counts
  • Ignoring tiny expected counts in sparse cells
  • Reading a significant result as proof of causation
  • Forgetting it says nothing about the strength of association
Chi-square in a development workflow
It is the natural partner to the cross-tab: you describe the two categories with row percentages, then test whether the visible gap is bigger than chance with chi-square.
Worked flow
Caste × toilet ownership
Build the cross-tab → percentage within caste → eyeball the gap → check expected counts ≥ 5 → run chi-square → report the test and a strength measure → resist the leap to 'caste causes…'.
The category × category toolkit
Step | Tool | Answers
Describe | Cross-tab + % | What does the pattern look like?
Test | Chi-square | Is it bigger than chance?
Quantify strength | Cramér's V | How strong is it?
Small table? | Fisher's exact | Same question, thin cells
09
Section Nine
Simple Linear Regression
Drawing the best line through a scatter
Correlation gives a number for a numeric–numeric relationship. Simple linear regression goes one step further: it fits the single straight line of best fit through the scatter, summarising the relationship as an equation.
Simple linear regression
Modelling a numeric response Y as a straight-line function of one numeric predictor X: Y = intercept + slope × X, plus error.
A regression line over a scatter
Chart: Years of schooling vs monthly earnings, with line of best fit (illustrative).
The red line is the one that sits closest to all the points at once — it minimises the total squared vertical distance from points to line (least squares). Illustrative data.
Y = a + bX
a (intercept)
Predicted Y when X = 0 — where the line crosses the vertical axis
b (slope)
Change in Y for each 1-unit rise in X — the rate of the relationship
The slope carries the action. Its sign matches the correlation's sign; its size is in real units of Y per unit of X.
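The intercept a and slope b of the least-squares line follow from two short formulas: b = Σ(x − x̄)(y − ȳ) ÷ Σ(x − x̄)², and a = ȳ − b·x̄. The sketch below fits them to five made-up points, with schooling in years and earnings in ₹000, chosen so that the slope comes out near the illustrative ₹1,250 per year; none of these values are real survey data.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: years of schooling and monthly earnings (₹000).
schooling = [0, 4, 8, 12, 16]
earnings = [6, 8, 15, 22, 24]

intercept, slope = fit_line(schooling, earnings)
print(f"earnings = {intercept:.2f} + {slope:.2f} x schooling (₹000)")

# Reading the slope: one more year goes with about slope * 1,000 rupees more.
print("predicted at 10 years:", intercept + slope * 10)
```

The slope is in units of Y per unit of X, so 1.25 here reads as about ₹1,250 more monthly earnings per additional year of schooling.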
When the intercept is meaningless
The intercept is the predicted Y at X = 0 — but X = 0 may be nonsensical or far outside your data. The earnings at zero years of schooling may be a mathematical artefact, not a real group.
Interpret the intercept only when X = 0 is both meaningful and within the range of your data. Otherwise treat it as a mathematical anchor for the line.
What b means in plain words
In our example the line rises about ₹1,250 per extra year of schooling (illustrative). So b ≈ 1.25 in ₹000: ‘each additional year of schooling is associated with about ₹1,250 more monthly earnings.’
Note the phrase ‘associated with’, not ‘causes’. Regression fits a line; it does not, by itself, license a causal claim.
How much variation the line explains
R² (coefficient of determination)
The proportion of the variation in Y that is explained by X through the fitted line. It runs from 0 (line explains nothing) to 1 (line explains everything). In simple regression, R² = r².
R² = 0.8
Line explains 80% of the variation in Y; 20% is left unexplained
R² = 0.1
Line explains just 10% — X is a weak guide to Y
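That R² equals r² in simple regression can be checked numerically. The sketch below computes R² as 1 − (unexplained variation ÷ total variation) from a fitted line and compares it with the squared Pearson correlation, using made-up schooling and earnings values (₹000).

```python
import math

# Hypothetical data: years of schooling and monthly earnings (₹000).
xs = [0, 4, 8, 12, 16]
ys = [6, 8, 15, 22, 24]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in Y

# Least-squares line and its residuals.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - ss_residual / syy   # share of variation explained by the line
r = sxy / math.sqrt(sxx * syy)      # Pearson correlation

print("R-squared from the line:", round(r_squared, 4))
print("r squared from correlation:", round(r ** 2, 4))
```

Both routes give the same figure, about 0.96 for these tidy invented points; real social data would usually sit far lower.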
High R² is not always good
  • A high R² does not prove the model is correct or causal
  • In social data, an R² of 0.2 can still be a useful finding
  • A low R² with a clear slope can still matter for policy
  • R² says nothing about whether a straight line fits
Report R² for context, but never let it crowd out the slope, its uncertainty, and a look at the scatter.
Two different goals for one line
Explanation
Understand the relationship: how does Y change with X, and how strong is it? The slope is the prize.
Prediction
Estimate Y for a new X you haven't measured. Accuracy, not interpretation, is the prize.
Be clear which you are doing — the same line, read for different purposes, demands different cautions.
Don't predict beyond your data
Extrapolation — using the line far outside the range of X you actually observed — is one of regression's classic traps. A line fitted on 0–16 years of schooling says nothing reliable about 30 years.
The relationship may bend, plateau or break entirely outside your data. Predict within the range you measured, and flag any step beyond it.
10
Section Ten
Pitfalls
Correlation is not causation
A relationship between two variables can arise for several reasons, only one of which is ‘X causes Y’. This is the single most important caution in all of bivariate analysis.
  • Reverse causation: Y might cause X
  • Confounding: a third factor C drives both
  • Selection: how the sample was chosen creates the link
  • Chance: with enough variables, some correlate by luck
The lurking third variable
A confounder is a variable linked to both X and Y that creates — or distorts — the relationship between them. It is the commonest reason a bivariate link is misleading.
01 Household wealth (confounder C)
02 drives girls' schooling (X)
03 AND drives child survival (Y)
04 so X and Y correlate, partly via C
Bivariate links almost always hide confounders
Recall our literacy–fertility scatter. Female literacy really does track lower fertility — but richer states also have better health systems, later marriage and more contraception. Wealth and development confound the simple two-variable picture.
This is exactly why bivariate analysis is a starting point. To isolate one variable's effect you need to control for confounders — the job of multivariate regression.
Group patterns ≠ individual truths
Ecological fallacy
Wrongly inferring something about individuals from a relationship observed only at the group (district, state) level.
States with higher average literacy have lower average fertility — that does not mean the literate women within a state are the ones with fewer children. A relationship across districts need not hold across people.
One point can swing everything
A single extreme point can dominate a correlation or drag a regression line toward itself — high leverage. The whole ‘relationship’ may rest on one unusual case.
Always check: does the pattern survive if I remove the most extreme point? If r or the slope collapses, the finding was that one point, not a real trend.
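The 'remove the most extreme point' check is easy to automate. The sketch below uses six made-up points: five with no linear relationship at all, plus one far-off case. Dropping the point furthest from the mean of X is one simple rule chosen for this illustration; other definitions of 'most extreme' are equally reasonable.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def drop_most_extreme_x(xs, ys):
    """Remove the case whose X value lies furthest from the mean of X."""
    mean_x = sum(xs) / len(xs)
    worst = max(range(len(xs)), key=lambda i: abs(xs[i] - mean_x))
    kept_x = [x for i, x in enumerate(xs) if i != worst]
    kept_y = [y for i, y in enumerate(ys) if i != worst]
    return kept_x, kept_y

# Hypothetical data: the last point is the unusual case.
xs = [1, 2, 3, 4, 5, 20]
ys = [3, 4, 2, 4, 3, 20]

r_all = pearson_r(xs, ys)
r_trimmed = pearson_r(*drop_most_extreme_x(xs, ys))

print("r with every point:", round(r_all, 2))
print("r without the extreme point:", round(r_trimmed, 2))
```

With all six points r is above 0.9; without the extreme case it falls to zero, so the apparent relationship rested entirely on that one point.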
Straight-line tools miss curves
Pearson's r and linear regression assume a straight line. A U-shaped relationship, common in real data, can show a near-zero correlation even though the two variables are strongly related; a saturating one is typically understated.
The fix is the cheapest in statistics: plot the scatter. A curve is obvious to the eye and invisible to r.
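A small check of how a curve hides from r, using a made-up U-shape where y is exactly x squared. Y is fully determined by x, yet the linear correlation is zero; correlating y with x squared recovers the link.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up U-shaped data: x runs from -5 to 5 and y = x squared.
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]

r_linear = pearson(xs, ys)                     # straight-line view
r_squared_term = pearson([x ** 2 for x in xs], ys)  # after transforming x

print(f"r(x, y)   = {r_linear:.2f}")
print(f"r(x^2, y) = {r_squared_term:.2f}")
```

The zero comes from the symmetry of this example: the falling half cancels the rising half. Real curves are rarely this tidy, which is one more reason to look at the scatter first.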
Four datasets, identical statistics
[Figure: four Anscombe-style scatter plots with the same r and four very different shapes. After F. J. Anscombe (1973).]
Anscombe's quartet (1973): four datasets with near-identical means, variances and correlation, yet utterly different shapes. The lesson, still true: always plot your data.
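The quartet is small enough to check by hand. The sketch below uses the values as commonly reproduced from Anscombe (1973) and computes the mean of y and the correlation for each dataset; the `pearson` helper is written out here rather than taken from any package.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet, as commonly reproduced from the 1973 paper.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pearson(x, y) for name, (x, y) in quartet.items()}
y_means = {name: sum(y) / len(y) for name, (_, y) in quartet.items()}

for name in quartet:
    print(f"Dataset {name}: mean y = {y_means[name]:.2f}, r = {r_values[name]:.3f}")
```

All four correlations come out at about 0.82 and all four y means at about 7.5. Nothing in those numbers reveals that one dataset is a curve and another rests on a single point.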
A trend can reverse when you split
Simpson's paradox: a relationship in the pooled data can run in the opposite direction within every subgroup. A scheme can look worse overall yet be better in every district, for example when it mostly reached the districts with the weakest baselines.
Always disaggregate before concluding. The aggregate two-variable relationship can point the opposite way to the truth.
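A worked example of the reversal, with hypothetical counts. The scheme is assumed to have been rolled out mostly in District A, where outcomes are poor to begin with, while most comparison households sit in the better-off District B.

```python
# Hypothetical counts: (households with a good outcome, households in total).
counts = {
    "District A": {"scheme": (80, 400), "comparison": (10, 100)},
    "District B": {"scheme": (70, 100), "comparison": (240, 400)},
}


def rate(successes, total):
    """Share of households with a good outcome."""
    return successes / total


# Success rate per arm inside each district.
by_district = {
    name: {arm: rate(*cell) for arm, cell in arms.items()}
    for name, arms in counts.items()
}

# Success rate per arm after pooling both districts.
pooled = {}
for arm in ("scheme", "comparison"):
    successes = sum(counts[d][arm][0] for d in counts)
    total = sum(counts[d][arm][1] for d in counts)
    pooled[arm] = rate(successes, total)

for name, rates in by_district.items():
    print(f"{name}: scheme {rates['scheme']:.0%}, comparison {rates['comparison']:.0%}")
print(f"Pooled: scheme {pooled['scheme']:.0%}, comparison {pooled['comparison']:.0%}")
```

The scheme does better in both districts (20% against 10%, and 70% against 60%) but worse when pooled (30% against 50%), because four-fifths of scheme households are in the low-baseline district.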
Before you believe a relationship
  • Did I plot it — is the shape really linear?
  • Could a confounder explain it?
  • Does it survive removing the outlier?
  • Is this a group pattern I'm reading onto individuals?
  • Does it reverse when I disaggregate?
  • Is there a plausible mechanism, or just a coincidence?
11
Section Eleven
Reporting, Practice & Tools
How to report a relationship
  • State the direction and strength in plain words
  • Give the effect size in real units, not just a p-value
  • Report the uncertainty — confidence interval or range
  • Name the sample size and how it was drawn
  • Flag the obvious confounders you could not rule out
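One way to attach uncertainty to a correlation is an approximate confidence interval from the Fisher z-transformation. The sketch below assumes a simple random sample and roughly bivariate-normal data; the function name and the example values (r = 0.5, n = 103) are illustrative only.

```python
import math


def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for Pearson's r via Fisher's z.

    z_crit = 1.96 gives a 95% interval. Needs n > 3 and -1 < r < 1.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)              # transform r to the z scale
    se = 1 / math.sqrt(n - 3)      # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Illustrative values only: r = 0.5 from a sample of 103.
low, high = r_confidence_interval(0.5, 103)
print(f"r = 0.50, 95% CI roughly {low:.2f} to {high:.2f}, n = 103")
```

A sentence built from this might read: "Schooling is moderately and positively associated with earnings (r = 0.50, 95% CI 0.34 to 0.63, n = 103)." Survey weights or clustered sampling would call for a different interval than this simple one.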
Mind your verbs
Safe phrasing
  • ‘is associated with’
  • ‘tends to go with’
  • ‘is correlated with’
Reserve for causal evidence
  • ‘causes’
  • ‘leads to’
  • ‘increases / reduces’
Causal verbs need a causal design — an experiment or a careful quasi-experiment. A bivariate correlation does not earn them.
What to avoid in write-ups
  • Reporting a p-value with no effect size
  • Claiming causation from a correlation
  • Hiding the scatter behind a single coefficient
  • Testing many pairs and reporting only the one that worked
  • Using a state-level relationship to make individual claims
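Why testing many pairs and reporting the winner misleads can be shown with a little arithmetic. The sketch assumes the tests are independent and every null hypothesis is true, which is a simplification; the function names are made up for this example. The Bonferroni adjustment shown is one common, conservative correction.

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(k, alpha=0.05):
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / k


for k in (1, 5, 10, 20):
    print(
        f"{k:>2} tests: P(at least one false positive) = "
        f"{familywise_error(k):.2f}, "
        f"Bonferroni threshold = {bonferroni_alpha(k):.4f}"
    )
```

With 20 unrelated pairs tested at the 0.05 level, the chance that at least one looks "significant" is about 64%. Reporting every pair you tested lets the reader make this adjustment.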
A reliable bivariate workflow
01
CLASSIFY: what type is each variable?
02
PLOT: scatter, box plot or bars
03
DESCRIBE: direction, strength, shape
04
TEST: the matching method
05
INTERPRET: effect size + uncertainty
06
CAUTION: confounders, fallacies
Six steps, every time. The discipline is what turns a number into a defensible finding.
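The CLASSIFY and TEST steps can be sketched as a small lookup, following the methods this deck covers. The function name, the type labels "numeric" and "categorical", and the returned wording are assumptions made for illustration; a real choice also depends on sample size, distribution shape and how the data were drawn.

```python
def choose_method(x_type, y_type, n_groups=None):
    """Suggest a plot and a method from the two variable types.

    x_type, y_type: "numeric" or "categorical".
    n_groups: number of categories, needed when one variable is categorical
    and the other numeric.
    """
    allowed = {"numeric", "categorical"}
    if x_type not in allowed or y_type not in allowed:
        raise ValueError("types must be 'numeric' or 'categorical'")

    types = {x_type, y_type}
    if types == {"numeric"}:
        return "scatter plot; correlation or simple linear regression"
    if types == {"categorical"}:
        return "cross-tabulation with bars; chi-square test"

    # One categorical and one numeric variable: compare groups.
    if n_groups is None or n_groups < 2:
        raise ValueError("n_groups must be 2 or more for a group comparison")
    if n_groups == 2:
        return "box plot; two-group comparison (t-test)"
    return "box plot; ANOVA"


print(choose_method("numeric", "numeric"))
print(choose_method("categorical", "numeric", n_groups=2))
print(choose_method("numeric", "categorical", n_groups=4))
print(choose_method("categorical", "categorical"))
```

The lookup only covers the first, second and fourth steps of the workflow. Describing, interpreting and checking for confounders still need a person reading the plot.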
Software for bivariate analysis
Tool | Good for | Note
Excel / Google Sheets | Cross-tabs, scatter, CORREL, t-test | Start here; pivot tables for cross-tabs
R | Every test, publication graphics | Free; cor.test, t.test, aov, chisq.test, lm
Python (pandas, scipy, statsmodels) | Cleaning + tests + regression | Free, general-purpose
Jamovi / JASP | Point-and-click stats | Free, friendly for learners
Stata / SPSS | Survey data, weights | Common in research shops
Tools matter less than habits: classify, plot, then test. A clear scatter in a spreadsheet beats a misread coefficient in any package.
A short, honest reading list
  • Statistics — Freedman, Pisani & Purves (the gold standard primer)
  • The Art of Statistics — David Spiegelhalter
  • How to Lie with Statistics — Darrell Huff
  • Naked Statistics — Charles Wheelan
  • Mostly Harmless Econometrics — Angrist & Pischke (going causal)
Pair this deck with ImpactMojo's Data Literacy, Exploratory Data Analysis and Research Methods 101 courses.
If you remember five things
  • Classify both variables first — it picks the method
  • Plot before you compute — shape, outliers, curves
  • Report effect size, not just the p-value
  • Correlation is not causation — hunt the confounder
  • Group patterns are not individual truths — mind the fallacies
Bivariate Analysis 101 · Complete
Two variables.
One honest story.
CC BY-NC-ND 4.0 · Free Forever · ImpactMojo 101 Series