Item Response Theory 101

ImpactMojo · Item Response Theory 101 · www.impactmojo.in

ImpactMojo 101 Series · Free Forever

Measuring Latent Traits Well — a Foundational Course on IRT for Assessment & M&E Practitioners in South Asia

Research-Backed · South Asia Focus · 100 Slides · Free Access

Agenda

What We Cover

01  The Measurement Problem
02  Classical Test Theory & Its Limits
03  The Big Idea of IRT
04  The Item Characteristic Curve
05  Item Difficulty: the b Parameter
06  Item Discrimination: the a Parameter
07  Guessing & the Model Family
08  Information & Precision
09  Validating Scales & DIF
10  Applications in Development
11  Assumptions, Limits & Tools

01

Section One

The Measurement Problem

The Core Problem

You cannot see what you most want to measure

Development work is full of things we care about but cannot observe directly: a child's reading ability, a woman's empowerment, a household's food insecurity. We only ever see responses — answers to items, ticks on a scale.

Latent trait

An unobservable characteristic of a person — ability, attitude, deprivation — that we infer from their answers to a set of items. By convention it is written as theta (θ).

From Trait to Answer

We observe answers, not the trait

01  LATENT TRAIT: reading ability (θ), unseen
02  ITEMS: a child reads words, sentences, a paragraph
03  RESPONSES: correct / incorrect on each item
04  INFERENCE: a measurement model estimates θ from the pattern

A measurement model is the bridge: it links the unseen trait to the seen responses, with explicit assumptions you can check.

Three Examples

The same problem, three development settings

Ability

ASER-style early-grade reading & arithmetic — a child either can or cannot do each task

Empowerment

A woman's decision-making, mobility & asset control — agree / disagree items

Food insecurity

FIES — eight yes/no experiences from worry to going a whole day without eating

In each case the trait is on a hidden continuum and the items are graded markers along it.

Why A Model

Why not just count right answers?

The obvious approach — add up correct answers, or count 'yes' responses — treats every item as equal and every score gap as the same size. But a hard item is not worth the same as an easy one, and the jump from 4 to 5 correct is not the same as 8 to 9.

A measurement model lets items differ — in difficulty, in how well they sort people — and places people and items on one common scale.

Two Traditions

Two families of measurement theory

Classical Test Theory (CTT)

Works on the total score. Simple, familiar, everywhere — but its statistics depend on the particular sample and the particular test.

Item Response Theory (IRT)

Models each item and each person separately, on a shared scale. More demanding, but the properties travel across samples and forms.

This course starts from CTT — what it is, and exactly where it strains — then builds IRT as the answer.

What Good Measurement Buys You

Why measurement quality is not a luxury

Comparability: compare a child this year with last year, or one state's empowerment score with another's, on the same ruler
Fairness: detect items that behave differently for girls, or for one language group
Precision where it matters: know exactly where the test measures well and where it is guessing
Honest scores: a number that means the same thing for everyone

Roadmap

How this course is built

Foundations

The measurement problem & CTT's limits
The big idea of IRT and the ICC
Difficulty (b), discrimination (a), guessing (c)

Using IRT well

Information, precision and test targeting
Validating scales: fit, dimensionality, DIF
Applications, assumptions and tools

Examples are drawn from learning assessments, empowerment, wealth and food-security scales used across the region.

02

Section Two

Classical Test Theory & Its Limits

The CTT Score

The familiar sum or percentage score

Classical Test Theory is the world of the total score: number correct, percentage right, the count of 'yes' responses on a scale. It is what nearly every report and dashboard already uses.

01  Observed score = True score + Error
02  X = T + E
03  Goal: estimate T, shrink E

Reliability

Reliability: how much is signal, not noise?

Reliability

The share of the variation in scores that reflects true differences between people rather than measurement error. It runs from 0 (all noise) to 1 (no error).

A reliable test gives nearly the same score if a person sits it twice. Low reliability means scores wobble for reasons that have nothing to do with the trait.
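
In symbols, a sketch under the usual CTT assumption that error is uncorrelated with the true score. The variance symbols σ² and the reliability symbol ρ are introduced here for illustration and do not appear on the slides:

```latex
\sigma_X^2 = \sigma_T^2 + \sigma_E^2,
\qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Read this way, reliability of 0 means all observed variance is error, and reliability of 1 means none of it is.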

Cronbach's Alpha

Cronbach's alpha — the workhorse statistic

Cronbach's alpha estimates reliability from a single sitting by asking how consistently the items hang together. Rules of thumb put 'acceptable' around 0.70 and up — but the number is widely misread.

Alpha rises simply by adding more items, and high alpha does not prove the scale measures one thing. It is a useful summary, not a certificate of quality.
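
For readers who want to see the arithmetic, here is a minimal sketch of the standard alpha formula, k/(k−1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the five-person data set are hypothetical, chosen only for illustration.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: one row per person, each row a list of item scores."""
    k = len(responses[0])  # number of items
    # Variance of each item across people
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of the total (sum) score across people
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 people x 4 right/wrong items
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(data)
print(f"alpha = {alpha:.2f}")
```

Population variance is used for both the items and the total; sample variance would give the same ratio as long as it is used consistently.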

Limit 1

CTT statistics are sample-dependent

An item's CTT 'difficulty' is just the proportion who got it right (the p-value). Give the same item to a high-ability school and a struggling one and that proportion changes — so the item looks 'easier' or 'harder' depending on who sat it.

The item did not change. The sample did. Yet the headline statistic moved — that is the problem.

Sample-Dependence Illustrated

Same item, two samples, two p-values

Group	% correct on Item Q	CTT verdict
High-ability school	88%	An easy item
Mixed government school	61%	A moderate item
Struggling school	34%	A hard item

Illustrative figures. One physical item, three contradictory CTT difficulties — because the p-value confounds the item with the people who answered it.
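
The confounding can be sketched in a few lines. The code below assumes a logistic response curve for one fixed item and three hypothetical groups whose ability is normally distributed around different means (+1.5, 0, −1.5). These values are invented for illustration and are not meant to reproduce the percentages in the table.

```python
import math

def icc(theta, a=1.0, b=0.0):
    """Logistic probability of a correct response for one fixed item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_p_value(group_mean, group_sd=1.0, n_points=201):
    """Average P(correct) over a normal ability distribution (simple grid integration)."""
    lo = group_mean - 5 * group_sd
    step = 10 * group_sd / (n_points - 1)
    num = den = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * ((theta - group_mean) / group_sd) ** 2)
        num += weight * icc(theta)
        den += weight
    return num / den

# Hypothetical group means on the theta scale
groups = {"high-ability": 1.5, "mixed": 0.0, "struggling": -1.5}
p_values = {name: expected_p_value(mu) for name, mu in groups.items()}
for name, p in p_values.items():
    print(f"{name}: proportion correct = {p:.2f}")
```

The item's parameters never change inside the loop; only the group mean does, yet the proportion correct moves with it.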

Limit 2

CTT statistics are test-dependent

A person's CTT ability is their score on this particular test. Put them on an easier form and the score climbs; a harder form and it falls. The person's standing is tangled up with the difficulty of the form they happened to take.

Two children with the same true ability can get different scores purely because they sat different forms. Comparing across forms or years becomes guesswork.

Limit 3

One error figure for the whole scale

CTT reports a single standard error of measurement for everyone. But in reality a test measures a middling student far more precisely than it measures a top scorer (who finds every item easy) or a very weak one (who finds every item hard).

Precision genuinely varies across the ability range. CTT pretends it is constant — IRT will let it vary, which is closer to the truth.

Where CTT Strains

Three jobs CTT cannot do cleanly

You want to…	CTT problem	IRT answer (coming)
Compare items fairly across samples	p-values shift with the sample	Item params are sample-free
Compare people across forms/years	Scores depend on the form	θ is on a common scale
Know precision at each level	One SEM for all	Information varies along θ
Build an adaptive or short form	Hard — scores not comparable	Items & people share a metric

None of this means CTT is wrong — just that it asks too much of one total score. IRT splits the job apart.

03

Section Three

The Big Idea of IRT

One Scale

Put people and items on the same ruler

The central move of IRT is deceptively simple: place each person's trait (θ) and each item's difficulty on one shared continuum. A person is somewhere on the line; so is every item.

If a person sits above an item on the scale, they are likely to get it right or endorse it. Below it, unlikely. The distance between them sets the probability.

The Logit Scale

A trait scale centred on zero

By convention θ is scaled to have mean 0 and standard deviation 1 in the reference group, usually running from about −3 to +3. Negative means lower ability / less of the trait; positive means more.

θ ≈ −2

low trait — struggles with most items

θ ≈ 0

average for the reference group

θ ≈ +2

high trait — succeeds on most items

The Probability Statement

IRT predicts a probability, not a yes/no

IRT never says a person will get an item right. It gives the probability of a correct (or endorsing) response, as a function of where the person and the item sit on the scale.

P(correct | θ)

The probability that a person with trait level theta answers an item correctly (for a test) or endorses it (for an attitude / experience scale). It rises smoothly as theta rises.

The Logistic Form

The S-shaped probability curve

That probability follows a smooth logistic function of θ. Far below the item, P is near 0; far above, near 1; in between it climbs through an S-shape. This curve is the heart of IRT.

Crucially the curve is monotonic: more trait always means a higher chance of success. It never dips.

First Look

One item's curve across the trait range

Figure: Probability of a correct response rises with θ (item at b=0). Illustrative logistic ICC (a=1, b=0).

At θ = 0 the chance is 50%. Move up the scale and success becomes near-certain; move down and it fades to near zero.
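
A minimal sketch of such a curve, assuming the common two-parameter logistic form without a scaling constant. The function name `icc_2pl` is hypothetical, and the slides do not state which exact parameterisation they plot.

```python
import math

def icc_2pl(theta, a=1.0, b=0.0):
    """Probability of a correct/endorsing response under a logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Walk along the trait scale for an item with a=1, b=0
for theta in (-3, -1, 0, 1, 3):
    print(f"theta={theta:+d}  P={icc_2pl(theta):.3f}")
```

Only the distance θ − b enters the formula, which is the sense in which the gap between person and item sets the probability.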

What Travels

Why this fixes CTT's headaches

Item properties are sample-independent: the curve describes the item itself, not the group that sat it (when the model fits)
Person estimates are test-independent: θ means the same thing whatever items you used
Items & people share a metric: you can match a test to a person, link forms, and build short or adaptive tests

These gains hold only when the model's assumptions hold — a promise we will scrutinise later.

Reading An Endorsement Scale

The same idea for attitudes & experiences

For a food-insecurity or empowerment scale there is no 'right' answer — the curve gives the probability of endorsing the item ('yes, we worried about food'). Severe items are endorsed only by households high on the insecurity continuum.

'Could you visit a health centre alone?' and 'did you go a whole day without eating?' are items placed at very different points on their respective scales.

The Promise

What the rest of the course unpacks

Everything that follows is detail on that one S-curve: where it sits (difficulty, b), how steep it is (discrimination, a), whether it has a floor (guessing, c), how much information it carries, and whether it behaves the same for everyone (DIF).

04

Section Four

The Item Characteristic Curve

Definition

The Item Characteristic Curve (ICC)

Item Characteristic Curve (ICC)

The graph of the probability of a correct / endorsing response (y, 0 to 1) against the latent trait theta (x). One curve per item; its shape encodes everything the model says about that item.

Also called the item response function. Read it left to right: as the person's trait increases, the chance of success increases.

How To Read It

Reading an ICC in three moves

Pick a θ on the x-axis — a person's trait level
Go up to the curve, then across — read the probability
The whole curve tells you how the item behaves for everyone

Two numbers define the basic curve: where it crosses 50% (difficulty) and how steeply it rises there (discrimination).

Anatomy

The three landmarks of an ICC

Lower tail

P near 0 (or near c) for very low θ — the item is too hard for them

Inflection

the steepest point — where the item best separates people

Upper tail

P near 1 for very high θ — the item is trivially easy for them

The Curve

A single, well-behaved ICC

Figure: One item's ICC (a=1, b=0): monotonic, S-shaped, 0 to 1. Illustrative logistic ICC.

The 50% point sits at θ = 0 here — that is this item's difficulty. The slope through that point is its discrimination.

Monotonic

An ICC only ever goes up

A valid ICC is monotonically increasing: more of the trait never lowers the chance of a correct response. If an item's empirical (observed) curve dips or wiggles, something is wrong: a mis-keyed item, a trick question, or a broken assumption.

A non-monotonic empirical curve is a red flag, not a finding. Investigate the item before trusting it.

Comparing Items

Several items on one set of axes

Figure: Four items overlaid: easy/hard differ in position, steep/flat in slope. Illustrative ICCs.

Overlaying ICCs is how psychometricians read an item bank at a glance — left/right tells you difficulty, steepness tells you discrimination.

Probability To Score

From curves to a person's estimate

Given a person's pattern of right and wrong answers and the fitted ICCs, the software finds the θ that makes that pattern most likely. That θ — not the raw count — is the person's IRT score.

01  Item ICCs (a, b, c) estimated
02  Person's response pattern observed
03  Find θ that best fits the pattern
04  θ with a standard error = the score
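
The four steps above can be sketched as a grid-search maximum-likelihood estimate. This is an illustration under stated assumptions, not what any particular package does: it uses a logistic model with no guessing parameter, hypothetical item parameters, and the relation SE = 1/√information that the course introduces in the Information & Precision section.

```python
import math

def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-probability of the observed right/wrong pattern at a given theta."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = icc(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def estimate_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Return (theta_hat, standard_error) by searching a grid of theta values."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    theta_hat = max(grid, key=lambda t: log_likelihood(t, items, responses))
    info = sum(a * a * icc(theta_hat, a, b) * (1 - icc(theta_hat, a, b))
               for a, b in items)
    return theta_hat, 1.0 / math.sqrt(info)

# Hypothetical (a, b) pairs for five items, and one child's answers
items = [(1.0, -2.0), (1.0, -0.8), (1.0, 0.2), (1.0, 1.0), (1.0, 1.8)]
responses = [1, 1, 1, 0, 0]
theta_hat, se = estimate_theta(items, responses)
print(f"theta = {theta_hat:.2f}, SE = {se:.2f}")
```

A pattern of all-correct or all-wrong answers has no finite maximum-likelihood estimate; real software handles that case with a prior or other adjustments, which this sketch leaves out.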

Why Curves Beat Counts

The same raw score, different meaning

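Two children each get 6 of 12 right. One passed the six easy items; the other passed six hard ones and missed the easy ones. CTT calls them equal. Under a 2PL or 3PL model, IRT reads which items were passed and can place the two children at different θ. Under Rasch the raw score alone fixes θ, but the second child's odd pattern still shows up as person misfit.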

The response pattern carries information that the raw total throws away.
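
One way to see this, sketched under a two-parameter logistic model. The symbols x_i (1 if item i is passed, 0 otherwise) and P_i(θ) are introduced here for illustration. Setting the derivative of the log-likelihood to zero gives:

```latex
\sum_i a_i \bigl(x_i - P_i(\theta)\bigr) = 0
\quad\Longleftrightarrow\quad
\sum_i a_i x_i = \sum_i a_i P_i(\theta),
\qquad
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

The estimate therefore depends on the discrimination-weighted score Σ a_i x_i, so two patterns with the same raw count give different θ whenever the items passed carry different a. When every a_i is equal, as in Rasch, the weighted score reduces to the raw count.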

05

Section Five

Item Difficulty — the b Parameter

Definition

b locates the item on the trait scale

Difficulty (b)

The point on the theta scale where the item's success probability is 50% (for a 2PL/1PL model). It is measured in the same units as theta, so an item and a person can be compared directly.

Mantra: higher b = harder item. A hard item's curve sits to the right; you need more trait to have a 50–50 chance.

The Key Picture

Easy, medium and hard items

Figure: Three items, same a, different b: the curve slides right as b rises. Illustrative ICCs (a=1; b = −1, 0, +1).

Read off the 50% line: the easy item crosses it at θ = −1, the hard item at θ = +1. That crossing point is b.

Same Units

b and θ share one scale

Because b is expressed in θ units, you can ask: is this person above or below this item? A child at θ = 0.5 is comfortably above an easy item (b = −1) but below a hard one (b = +1).

This shared metric is what lets you target a test — choose items whose b's surround the people you most need to measure.
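
To put rough numbers on the child at θ = 0.5, here is a small sketch assuming a logistic curve with a = 1 for both items. The slope value and the function name `p_success` are assumptions made for illustration.

```python
import math

def p_success(theta, b, a=1.0):
    """Logistic success probability for a person at theta on an item at b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

child_theta = 0.5
p_easy = p_success(child_theta, b=-1.0)  # child is 1.5 units above the item: about 0.82
p_hard = p_success(child_theta, b=+1.0)  # child is 0.5 units below the item: about 0.38
print(f"easy item: {p_easy:.2f}, hard item: {p_hard:.2f}")
```

Being above an item means better than even odds; being below it means worse than even odds.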

Worked Example

Ordering items by difficulty

Item (early-grade reading)	b (illustrative)	Interpretation
Recognise a letter	−2.0	Very easy — most children pass
Read a familiar word	−0.8	Easy
Read a simple sentence	0.2	Moderate
Read a short paragraph	1.0	Hard
Answer a comprehension question	1.8	Very hard

Illustrative values. The ordering — letter to comprehension — is what an ASER-style ladder captures, and b puts numbers on it.

Difficulty For Scales

For attitude scales, b means 'severity'

On a food-insecurity scale, b is better read as severity: how much insecurity it takes before a household endorses the item. 'We worried about food' has a low b; 'a household member went a whole day without eating' has a high b.

A well-built scale spreads its items' b's from mild to severe, so it can place households all along the continuum.

Item Map

Mapping items along the trait scale

Figure: Item-difficulty map: where five items 'bite' on the θ scale. Illustrative b values on the θ axis.

Gaps in the map are blind spots: between b = −0.8 and 0.2 the test thins out, so it measures children there less well.

Common Confusion

Difficulty is about the item, not the topic

b is empirical: it is set by how people actually respond, not by how hard the item looks. A question that seems advanced may be easy if everyone was taught it; a 'simple' item may be hard if the wording confuses people.

Never assign difficulty by intuition. Estimate it from data, then sanity-check the surprises — they often reveal a flawed item.

06

Section Six

Item Discrimination — the a Parameter

Definition

a measures how sharply an item sorts people

Discrimination (a)

How steeply the ICC rises at its midpoint — how well the item distinguishes people just below its difficulty from those just above it. Higher a means a steeper curve and a sharper distinction.

Mantra: higher a = steeper ICC = better at separating low-trait from high-trait people near its b.

The Key Picture

A steep item and a flat item

Figure: Same difficulty (b=0), different discrimination: steep vs flat. Illustrative ICCs (b=0; a = 1.8 vs 0.5).

Both cross 50% at θ = 0 (same b). The steep item leaps from low to high probability over a narrow band — it sorts people sharply right there. The flat item barely distinguishes anyone.
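
A quick numeric check of the contrast, assuming the logistic form and the two illustrative slopes from the figure (a = 1.8 and a = 0.5). It compares two people half a unit either side of the item.

```python
import math

def icc(theta, a, b=0.0):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Gap in success probability between theta = +0.5 and theta = -0.5
gaps = {}
for label, a in (("steep", 1.8), ("flat", 0.5)):
    below, above = icc(-0.5, a), icc(0.5, a)
    gaps[label] = above - below
    print(f"{label}: P(-0.5)={below:.2f}  P(+0.5)={above:.2f}  gap={gaps[label]:.2f}")
```

The steep item separates these two people by roughly 42 percentage points, the flat item by roughly 12.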

Why Steep Helps

A steep item is a sharp ruler

Near its difficulty, a high-a item changes a person's success probability a lot for a small change in θ. That sensitivity is exactly what lets it tell two nearby people apart — it carries more information (next section).

High discrimination is generally desirable — but only near that item's b. Away from b, even a steep item is flat and tells you little.

Low Discrimination

When an item barely sorts anyone

A low-a (flat) item gives almost the same success probability to people across a wide range of θ. Low-trait and high-trait people answer it similarly — so it adds little to distinguishing them.

Very low or negative a is a warning: the item may be ambiguous, off-topic, or mis-keyed. Items that do not discriminate are candidates for revision or removal.

Negative Discrimination

A curve that goes the wrong way

If an item shows negative discrimination, higher-trait people do worse on it than lower-trait people — the ICC slopes downward. That should never happen for a sound item.

Usual culprits: the answer key is wrong, the item measures something else, or it is a trick question. Fix or drop it — do not leave it scoring people backwards.

a and b Together

Difficulty and discrimination are independent

	Low a (flat)	High a (steep)
Low b (easy)	Easy, sorts weakly	Easy, sorts low-trait people sharply
High b (hard)	Hard, sorts weakly	Hard, sorts high-trait people sharply

b says where the item works; a says how well it works there. A good item bank mixes b's to cover the range and favours high a's at each level.

Scale Items

Discrimination on an empowerment scale

On an empowerment scale, a high-a item is one whose endorsement cleanly separates more- from less-empowered women near its severity. A low-a item — one answered similarly regardless of empowerment — is dead weight.

Selecting high-a items is how scale developers shorten a questionnaire without losing measurement quality.

Caution

Steeper is not always better

An extremely steep item measures superbly — but only in a razor-thin band of θ. A test built only of very steep items can measure one narrow region brilliantly and everywhere else poorly.

Balance matters: you want high discrimination spread across the range you care about, not piled at one point.

07

Section Seven

Guessing & the Model Family

The Floor

On multiple-choice items, no one scores zero

On a 4-option multiple-choice item, even a child who knows nothing has roughly a 1-in-4 chance of being right. So the ICC should not fall to zero at low θ — it should flatten out at a lower asymptote.

Guessing (c)

The lower asymptote of the ICC — the success probability for someone with very low theta. It is the floor created by guessing or by partial cues.

The Key Picture

A guessing floor lifts the lower tail

Figure: With guessing (c=0.25) the curve floors near 0.25, not 0. Illustrative ICCs (a=1, b=0; c=0 vs c=0.25).

Note the amber curve never drops below ~0.25. A very low-ability student still has a one-in-four chance — that is the guessing floor.
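
A minimal sketch of a curve with a guessing floor, assuming the usual three-parameter logistic form in which c lifts the lower asymptote and the logistic part is squeezed into the remaining 1 − c. The function name `icc_3pl` is hypothetical.

```python
import math

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Logistic curve with a lower asymptote c (c=0 gives the no-guessing curve)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

for theta in (-6, -2, 0, 2, 6):
    no_guess = icc_3pl(theta)
    with_guess = icc_3pl(theta, c=0.25)
    print(f"theta={theta:+d}  c=0: {no_guess:.3f}  c=0.25: {with_guess:.3f}")
```

One consequence worth noting: with c above zero, the probability at θ = b is (1 + c)/2 rather than 50%, which is why the 50% reading of b is stated for the 1PL and 2PL models.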

Why It Matters

Ignoring guessing biases difficulty

If you fit a model with no guessing parameter to multiple-choice data, the floor created by guessing gets misread as the item being 'easier' than it is — biasing b and distorting low-ability scores.

c matters most at the bottom of the scale — precisely where many development assessments most need to measure well.

The Model Family

1PL, 2PL, 3PL: what each one adds

Model	Free parameters	Adds…
1PL / Rasch	b only (a fixed equal)	Difficulty differs; all items equally discriminating
2PL	a and b	Items also differ in discrimination
3PL	a, b and c	Plus a guessing floor

Each step adds realism — and demands more data to estimate the extra parameters reliably.

Rasch / 1PL

The Rasch model: every item equally discriminating

The 1PL / Rasch model fixes a to be the same for every item, so items differ only in difficulty (b). The ICCs are parallel S-curves — same shape, shifted left or right.

Why people love it

Simple, stable, needs less data
Raw score is a sufficient statistic
Elegant measurement properties

The trade-off

It assumes equal discrimination. Where items genuinely differ in a, Rasch will misfit some of them.

2PL

The 2PL model: let discrimination vary

The 2PL frees a, so each item has its own slope as well as its own difficulty. ICCs can now be steep or flat, crossing one another. It fits more datasets but needs more respondents.

Use 2PL when items plainly differ in how well they sort people — common for attitude and empowerment scales.

3PL

The 3PL model: add a guessing floor

The 3PL adds c, the guessing asymptote — built for multiple-choice ability tests where lucky guesses are real. It is the most flexible of the three but the hungriest for data and the trickiest to estimate.

c is notoriously hard to pin down because it lives in the sparsely populated low-θ tail. Large samples are essential; without them, c is often fixed at 1/(number of options).

Choosing

Which model should you use?

Start simple. Rasch/1PL if items are similar and data is limited — many large-scale assessments use it deliberately
Move to 2PL when discrimination clearly varies and you have the sample size
Reserve 3PL for multiple-choice ability tests with guessing and very large samples
Let fit and theory decide — not the wish for the fanciest model

08

Section Eight

Information & Precision

The Idea

Information = measurement precision

Item information

How much an item reduces uncertainty about theta at each point on the scale. An item is most informative around its own difficulty b, and more so the higher its discrimination a.

Information is the IRT replacement for one blanket reliability figure: it tells you where on the scale the item measures well.

Item Information

An item informs most around its difficulty

Figure: Item information function (item at b=0): peaks at its difficulty. Illustrative item information (a=1, b=0).

The peak sits at θ = 0 — the item's b. An item tells you most about people whose trait is near its difficulty, and little about those far from it.
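
For a logistic item without guessing, item information has the closed form a² · P · (1 − P). The sketch below assumes that form (the slides show only the plotted curve) and locates the peak on a coarse grid.

```python
import math

def icc(theta, a=1.0, b=0.0):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a=1.0, b=0.0):
    """Information for a logistic item with no guessing: a^2 * P * (1 - P)."""
    p = icc(theta, a, b)
    return a * a * p * (1.0 - p)

# Search theta from -3.0 to 3.0 in steps of 0.1 for the most informative point
grid = [x / 10 for x in range(-30, 31)]
peak_theta = max(grid, key=item_information)
print(f"peak at theta={peak_theta}, information={item_information(peak_theta):.2f}")
```

The peak height is a²/4, so doubling a quadruples the information at the item's difficulty.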

Test Information

Add items, add information

The test information function is just the sum of the item information functions. Where many items pile up, the test measures precisely; where items are sparse, it measures poorly.

Figure: Test information (3 items at b = −1, 0, +1): broad, peaks near 0. Illustrative test information function.

Information And SEM

More information means less error

Precision and information are two sides of one coin: the standard error of measurement at any θ is 1 / √information. High information → small standard error; low information → large error.

Figure: Standard error varies along θ, lowest where information is highest. Illustrative SEM = 1 / √(test information).

The Contrast

Precision is not constant — and IRT shows it

CTT

One standard error for everyone — pretends the test is equally precise across the whole range.

IRT

Error is lowest where information peaks (usually mid-range) and rises sharply in the tails. Honest, and actionable.

If you must measure the very weak or very strong precisely, you need items targeted there — the middle-heavy test will fail them.

Targeting

Targeting a test where it matters

Figure: Two tests, same length: a 'high-θ' form moves the information peak right. Illustrative test information (items centred at θ≈0 vs θ≈+1.5).

Choosing items by their b is how you target a form — e.g. to grade top performers, or to pinpoint a pass/fail cut-score.

Cut-Scores

Put your information where the decision is

If a programme classifies children as 'at grade level' or not, the decision happens at one θ — the cut-score. That is exactly where you want maximum information and minimum error.

Pack items with b's near the cut-score. A test can be short yet decisive if its information is concentrated where the call is actually made.

Adaptive Testing

How computer-adaptive tests use information

Because items and people share a scale, a computer-adaptive test can pick each next item to be maximally informative at the test-taker's current θ estimate, homing in fast.

01  Estimate θ so far
02  Pick the most informative unused item near that θ
03  Update θ from the answer
04  Stop when the standard error is small enough
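
The loop above can be sketched as follows. Several choices here are assumptions made to keep the example short, not features of any particular adaptive-testing system. θ is estimated as a posterior mode with a standard-normal prior, which keeps it finite after the first answer. The item bank is hypothetical. The simulated examinee answers deterministically, correct whenever their true θ exceeds the item's b.

```python
import math

def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    p = icc(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(items, answers):
    """Grid-search posterior mode under a N(0, 1) prior (an assumption of this sketch)."""
    grid = [i / 50 for i in range(-200, 201)]  # -4.00 to 4.00
    def log_posterior(t):
        total = -0.5 * t * t  # log of the normal prior, up to a constant
        for (a, b), x in zip(items, answers):
            p = icc(t, a, b)
            total += math.log(p if x else 1.0 - p)
        return total
    return max(grid, key=log_posterior)

def run_cat(bank, answer_fn, se_target=0.6, max_items=10):
    theta, se = 0.0, 1.0  # start at the prior mean and prior SD
    used, answers, remaining = [], [], list(bank)
    while remaining and len(used) < max_items and se > se_target:
        # Step 2: most informative unused item at the current estimate
        item = max(remaining, key=lambda it: information(theta, *it))
        remaining.remove(item)
        used.append(item)
        answers.append(answer_fn(item))
        # Step 3: update theta; precision = prior (1.0) + test information
        theta = estimate_theta(used, answers)
        precision = 1.0 + sum(information(theta, a, b) for a, b in used)
        se = 1.0 / math.sqrt(precision)
    return theta, se, len(used)

# Hypothetical bank: nine items with a = 1.5 and b from -2 to +2
bank = [(1.5, b / 2) for b in range(-4, 5)]
true_theta = 1.0
theta_hat, se, n_used = run_cat(bank, lambda item: 1 if true_theta > item[1] else 0)
print(f"theta={theta_hat:.2f}, SE={se:.2f}, items used={n_used}")
```

Operational adaptive tests add content balancing, exposure control and probabilistic responses, none of which are modelled here.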

09

Section Nine

Validating Scales & DIF

Earn The Benefits

IRT's promises hold only if assumptions do

Sample-free items, test-free scores, a clean common metric — these gifts depend on the model actually fitting the data. Validation is the work of checking that it does.

Does the model fit the responses?
Is the scale unidimensional?
Are responses locally independent?
Does each item behave the same across groups (no DIF)?

Model Fit

Checking model fit

Fit asks whether the observed responses match what the fitted ICCs predict — overall and item by item. A badly fitting item's empirical curve departs from its modelled S-curve.

Use item-fit statistics and, crucially, plot the empirical vs modelled ICC. A picture catches misfit that a single index can hide — and points to the offending item.

Unidimensionality

Assumption 1: one trait at a time

Unidimensionality

The assumption that a single latent trait accounts for the responses. The items should all tap the same underlying continuum — one theta, not several.

A 'reading' test that secretly also measures vocabulary and reasoning violates this. Check with factor analysis before trusting a one-dimensional θ.

Local Independence

Assumption 2: items don't lean on each other

Local independence

Once you account for theta, responses to different items are independent. Knowing the answer to one item should give no extra clue to another, beyond what theta already explains.

Violated by item chains — e.g. several questions about one reading passage, where missing the passage sinks them all together. Bundle or rewrite such items.

DIF Defined

Differential Item Functioning (DIF)

When two people with the SAME theta but from different groups (e.g. girls vs boys, one region vs another) have different probabilities of answering an item correctly. The item behaves differently across groups.

DIF is potential item bias. Same trait, different odds — the item is reading something other than the trait for one group.

Seeing DIF

DIF makes one item's ICC split by group

Figure: Same item, two groups: the ICCs diverge → DIF. Illustrative ICCs for one item across two groups.

At any given θ, Group B's success probability is lower — the item is effectively harder for them at the same trait level. That is uniform DIF.
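
Uniform DIF can be sketched as one item whose difficulty differs by group while its slope stays the same. The 0.6-unit shift and the group labels below are hypothetical values chosen for illustration.

```python
import math

def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a = 1.0
b_group_a, b_group_b = 0.0, 0.6  # hypothetical: item is 0.6 units harder for Group B

# Compare the two groups at matched trait levels
gaps = []
for theta in (-2, -1, 0, 1, 2):
    p_a = icc(theta, a, b_group_a)
    p_b = icc(theta, a, b_group_b)
    gaps.append(p_a - p_b)
    print(f"theta={theta:+d}  Group A={p_a:.2f}  Group B={p_b:.2f}  gap={p_a - p_b:.2f}")
```

The gap has the same sign at every θ, which is what makes this DIF uniform. If the slopes also differed, the curves could cross and the gap would change sign.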

DIF vs Real Gaps

DIF is not the same as a group difference

If girls genuinely have higher reading ability than boys, they will score higher — that is a real trait difference, not DIF. DIF is when girls and boys at the same ability still differ on a specific item.

DIF conditions on θ. It isolates item bias from true group differences — a distinction CTT cannot make cleanly.

Acting On DIF

What to do when an item shows DIF

Investigate the content: a word, context or example unfamiliar to one group (an urban example, a gendered scenario)
Check the translation: DIF across language versions often signals a poor or unequal translation
Revise or remove biased items before reporting scores
Document the DIF review — fairness is part of validity

10

Section Ten

Applications in Development

Learning Assessments

ASER- and NAS-style learning assessments

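Large learning assessments, such as India's NAS and cross-national studies, lean on IRT to place children on a single proficiency scale and to compare across grades, states and years. Citizen-led ASER reports simple reading and arithmetic levels rather than IRT scale scores, but the same logic of items ordered by difficulty applies.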

IRT is what lets 'reading at grade-2 level' mean the same thing whether a child sat form A or form B, this year or last.

Equating

Equating: comparing across forms and years

Equating / linking

Placing different test forms onto a common theta scale — usually via shared 'anchor' items — so scores from different forms or years are directly comparable.

Because IRT item parameters are, when the model fits, invariant across samples up to a linear rescaling of the θ scale, anchor items let you estimate that rescaling and stitch separate forms into one continuous scale.
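
A minimal sketch of the simplest case, mean-mean linking under the Rasch model. With discrimination fixed, two separate calibrations differ only in where zero sits, so one shift estimated from the anchor items is enough. All difficulty values and item names are hypothetical; with 2PL or 3PL you would also need a slope, which this sketch does not cover.

```python
from statistics import mean

def rasch_link_shift(anchor_b_ref, anchor_b_new):
    """Constant that moves the new form's scale onto the reference scale."""
    if not anchor_b_ref or len(anchor_b_ref) != len(anchor_b_new):
        raise ValueError("anchor lists must be non-empty and the same length")
    return mean(anchor_b_ref) - mean(anchor_b_new)

# The same three anchor items, calibrated once on each form (hypothetical values).
anchor_b_ref = [-0.5, 0.2, 1.0]   # difficulties on the reference form's scale
anchor_b_new = [-0.9, -0.2, 0.6]  # difficulties on the new form's scale

shift = rasch_link_shift(anchor_b_ref, anchor_b_new)

# Apply the shift to the new form's non-anchor items (and to its theta estimates).
new_form_b = {"new_item_1": -0.3, "new_item_2": 0.8}
linked_b = {name: b + shift for name, b in new_form_b.items()}

print(f"shift = {shift:.2f}")
print({name: round(b, 2) for name, b in linked_b.items()})
```

In practice, check that each anchor item moves by roughly the same amount. An anchor that drifts much more than the others is a candidate for removal from the link.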

Why Equating Matters

Measuring change without changing the ruler

To track learning over years you must change the questions (security, age-appropriateness) without changing what the score means. Equating via anchor items keeps the ruler fixed while the items rotate.

Without equating, a 'rise in scores' could just be an easier form. Equating separates real learning gains from changes in the test.

Food Insecurity

FIES: a global IRT-based scale

The Food Insecurity Experience Scale (FIES) — eight yes/no experience items — is modelled with a Rasch/1PL approach so that severity is comparable across countries and languages, underpinning SDG indicator 2.1.2.

8 items

from 'worried' to 'a whole day without eating'

Severity = b

items ordered from mild to severe insecurity

Equated

calibrated to a global reference scale
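
To see what "severity = b" means in practice, here is a minimal Rasch sketch. The three item labels are shorthand and the severity values are hypothetical, chosen only to show the ordering; they are not the published FIES calibration.

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability of answering 'yes' to an item of severity b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical severities, ordered from mild to severe (NOT official values).
ITEMS = [
    ("worried about food", -1.5),
    ("skipped a meal", 0.3),
    ("a whole day without eating", 2.0),
]

def profile(theta):
    """Probability of a 'yes' on each item for a household at severity theta."""
    return [(label, rasch_p(theta, b)) for label, b in ITEMS]

household_theta = 0.5  # hypothetical household, moderately food insecure
for label, p in profile(household_theta):
    print(f"{label:30s} P(yes) = {p:.2f}")

# Expected raw score: the sum of the item probabilities.
expected_raw = sum(p for _, p in profile(household_theta))
print(f"expected raw score = {expected_raw:.2f} of {len(ITEMS)}")
```

For any household, the milder item is always the more likely 'yes'. That fixed ordering is what makes the raw count of 'yes' answers a usable severity measure under the Rasch model.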

Empowerment

Empowerment and agency scales

Women's empowerment indices and agency scales combine items on mobility, decision-making and asset control. IRT checks whether they measure one coherent trait, ranks items by severity, and flags items that work differently across regions or castes.

It turns a bag of agree/disagree items into a calibrated scale — and reveals which items actually discriminate between more- and less-empowered women.

Wealth Scales

Asset and wealth indices

Asset-based wealth indices (the kind behind NFHS/DHS wealth quintiles) ask whether a household owns particular assets. The standard DHS index is built with principal components analysis; IRT and related latent-trait methods do the same job from a different angle, placing households on a wealth continuum from the pattern of ownership.

A motorcycle and a mud floor sit at very different points on the wealth scale — just as easy and hard items sit at different b's. The logic is identical.

Attitude Scales

Attitude, stigma and knowledge scales

Health knowledge: grade items from basic to advanced and measure understanding precisely where a campaign targets it
Stigma / attitude scales: order statements by how much prejudice it takes to endorse them
Quality-of-life & depression screeners: many are now built and validated with IRT

The Payoff

What IRT gives a development programme

Shorter instruments: keep the most informative items, cut respondent burden
Comparable numbers: across forms, years, regions and languages
Fairer measures: DIF screening removes biased items
Targeted precision: measure best exactly where decisions are made
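
The first and last points can be sketched together: rank items by how much information they carry at the θ where a decision is made, then keep the top few. This assumes the standard 2PL item information, a² · P · (1 − P). The item bank and the cut-point are hypothetical.

```python
import math

def item_information(theta, a, b):
    """2PL item information at theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def most_informative(items, theta_cut, k):
    """Names of the k items that are most informative at theta_cut."""
    ranked = sorted(
        items,
        key=lambda it: item_information(theta_cut, it["a"], it["b"]),
        reverse=True,
    )
    return [it["name"] for it in ranked[:k]]

# Hypothetical calibrated item bank.
ITEMS = [
    {"name": "item_1", "a": 1.0, "b": 0.0},
    {"name": "item_2", "a": 2.0, "b": 0.0},
    {"name": "item_3", "a": 1.5, "b": 2.0},  # sharp, but far from the cut-point
    {"name": "item_4", "a": 0.5, "b": 0.1},  # close to the cut-point, but weak
]

short_form = most_informative(ITEMS, theta_cut=0.0, k=2)
print(short_form)
```

An item earns its place by being both discriminating and located near the decision point. Item 3 is sharp but aimed at the wrong part of the scale, so it is dropped here.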

11

Section Eleven

Assumptions, Limits & Tools

Recap The Assumptions

IRT is powerful, not magic

Unidimensionality — one trait drives the responses
Local independence — items don't lean on each other
Correct model — the chosen 1PL/2PL/3PL actually fits
Monotonicity — more trait, higher success probability

Break an assumption and the elegant guarantees — sample-free items, comparable scores — quietly stop holding.

Sample Size

IRT is data-hungry

Model	Rough sample-size guidance	Note
1PL / Rasch	~200+ respondents	Most forgiving
2PL	~500+ respondents	Estimating the discrimination parameter a needs more data
3PL	~1,000+ respondents	The guessing parameter c is hard to estimate; often fixed

Illustrative rules of thumb only — needs depend on test length, item quality and how spread the sample is. Small, homogeneous samples can defeat even a simple model.

Other Limits

Where IRT can mislead

Garbage items in, garbage scale out — IRT cannot rescue badly written items
A neat θ can hide a contested concept — 'empowerment' is political, not just psychometric
Multidimensional traits forced onto one scale lose meaning
Black-box scores are harder for non-specialists to interpret than a percentage

Software

Tools for fitting IRT models

Tool	Good for	Note
R: mirt	Uni- & multidimensional IRT, all common models	Free, powerful, well documented
R: ltm	1PL/2PL/3PL for dichotomous items; graded response model for ordinal items	Free, gentle entry point
R: TAM / eRm	Large-scale & Rasch modelling	Free; TAM mirrors big assessments
Stata: irt suite	IRT within a familiar stats package	Built-in irt commands
jMetrik / IRTPRO	Point-and-click psychometrics	Lower coding barrier

A Workflow

A sensible IRT workflow

01

CHECK dimensionality & local independence

→

02

FIT the simplest defensible model (start Rasch)

→

03

EXAMINE item fit, a, b, (c) and information

→

04

SCREEN for DIF across key groups

→

05

REVISE items, then score & report θ with its error

Communicating

Translating θ for non-specialists

A θ of 1.2 means nothing to a programme officer or a parent. Part of doing IRT well is translating the scale back into language people act on — proficiency bands, 'can read a paragraph', percentile, or a clear cut-score.

Always report the standard error alongside the score, and describe what the cut-points mean in real-world terms. A precise number nobody understands helps no one.
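
A minimal sketch of that reporting step. It assumes a 2PL test and the usual large-sample approximation SE(θ) ≈ 1 / √(test information). The item parameters, cut-points and band labels are hypothetical; real bands should come from a standard-setting exercise.

```python
import math

def total_information(theta, items):
    """Sum of 2PL item informations at theta; items is a list of (a, b)."""
    total = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)
    return total

def standard_error(theta, items):
    """Approximate standard error of the theta estimate."""
    return 1.0 / math.sqrt(total_information(theta, items))

# Hypothetical cut-points: (upper bound of band, plain-language label).
BANDS = [(-0.5, "reads words"), (0.8, "reads sentences")]
TOP_BAND = "reads a paragraph"

def band_for(theta):
    """Translate theta into the plain-language band it falls in."""
    for upper, label in BANDS:
        if theta < upper:
            return label
    return TOP_BAND

def report(theta, items):
    """One-line summary: band, theta and an approximate 95% interval."""
    se = standard_error(theta, items)
    low, high = theta - 1.96 * se, theta + 1.96 * se
    return f"{band_for(theta)} (theta = {theta:.1f}, 95% interval {low:.1f} to {high:.1f})"

ITEMS = [(1.0, 1.2)] * 4  # hypothetical 4-item test, all items located at theta = 1.2
print(report(1.2, ITEMS))
```

With only four items the interval is wide enough to span more than one band, which is exactly what a reader needs to see before acting on the label.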
