ImpactMojo · Item Response Theory 101 · www.impactmojo.in
ImpactMojo 101 Series · Free Forever
Item Response Theory 101
Measuring Latent Traits Well — a Foundational Course on IRT for Assessment & M&E Practitioners in South Asia
Research-Backed · South Asia Focus · 100 Slides · Free Access
What We Cover
01
The Measurement Problem
02
Classical Test Theory & Its Limits
03
The Big Idea of IRT
04
The Item Characteristic Curve
05
Item Difficulty — the b Parameter
06
Item Discrimination — the a Parameter
07
Guessing & the Model Family
08
Information & Precision
09
Validating Scales & DIF
10
Applications in Development
11
Assumptions, Limits & Tools
01
Section One
The Measurement Problem
You cannot see what you most want to measure
Development work is full of things we care about but cannot observe directly: a child's reading ability, a woman's empowerment, a household's food insecurity. We only ever see responses — answers to items, ticks on a scale.
Latent trait
An unobservable characteristic of a person — ability, attitude, deprivation — that we infer from their answers to a set of items. By convention it is written as theta (θ).
We observe answers, not the trait
01 LATENT TRAIT: reading ability (θ), unseen
02 ITEMS: a child reads words, sentences, a paragraph
03 RESPONSES: correct / incorrect on each item
04 INFERENCE: a measurement model estimates θ from the pattern
A measurement model is the bridge: it links the unseen trait to the seen responses, with explicit assumptions you can check.
The same problem, three development settings
Ability
ASER-style early-grade reading & arithmetic — a child either can or cannot do each task
Empowerment
A woman's decision-making, mobility & asset control — agree / disagree items
Food insecurity
FIES — eight yes/no experiences from worry to going a whole day without eating
In each case the trait is on a hidden continuum and the items are graded markers along it.
Why not just count right answers?
The obvious approach — add up correct answers, or count 'yes' responses — treats every item as equal and every score gap as the same size. But a hard item is not worth the same as an easy one, and the jump from 4 to 5 correct is not the same as 8 to 9.
A measurement model lets items differ — in difficulty, in how well they sort people — and places people and items on one common scale.
Two families of measurement theory
Classical Test Theory (CTT)
Works on the total score. Simple, familiar, everywhere — but its statistics depend on the particular sample and the particular test.
Item Response Theory (IRT)
Models each item and each person separately, on a shared scale. More demanding, but the properties travel across samples and forms.
This course starts from CTT — what it is, and exactly where it strains — then builds IRT as the answer.
Why measurement quality is not a luxury
  • Comparability: compare a child this year with last year, or one state's empowerment score with another's, on the same ruler
  • Fairness: detect items that behave differently for girls, or for one language group
  • Precision where it matters: know exactly where the test measures well and where it is guessing
  • Honest scores: a number that means the same thing for everyone
How this course is built
Foundations
  • The measurement problem & CTT's limits
  • The big idea of IRT and the ICC
  • Difficulty (b), discrimination (a), guessing (c)
Using IRT well
  • Information, precision and test targeting
  • Validating scales: fit, dimensionality, DIF
  • Applications, assumptions and tools
Examples are drawn from learning assessments, empowerment, wealth and food-security scales used across the region.
02
Section Two
Classical Test Theory & Its Limits
The familiar sum or percentage score
Classical Test Theory is the world of the total score: number correct, percentage right, the count of 'yes' responses on a scale. It is what nearly every report and dashboard already uses.
01 Observed score = True score + Error
02 X = T + E
03 Goal: estimate T, shrink E
Reliability: how much is signal, not noise?
Reliability
The share of the variation in scores that reflects true differences between people rather than measurement error. It runs from 0 (all noise) to 1 (no error).
A reliable test gives nearly the same score if a person sits it twice. Low reliability means scores wobble for reasons that have nothing to do with the trait.
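In symbols, the definition above is a variance ratio. This assumes, as CTT conventionally does, that true score and error are uncorrelated; the variance symbols are introduced here for illustration and do not appear on the slide.

```latex
\[
\sigma_X^2 = \sigma_T^2 + \sigma_E^2,
\qquad
\text{reliability} = \frac{\sigma_T^2}{\sigma_X^2}
                   = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
\]
```

With no error variance the ratio is 1; with no true-score variance it is 0, matching the range stated above.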
Cronbach's alpha — the workhorse statistic
Cronbach's alpha estimates reliability from a single sitting by asking how consistently the items hang together. Rules of thumb put 'acceptable' around 0.70 and up — but the number is widely misread.
Alpha rises simply by adding more items, and high alpha does not prove the scale measures one thing. It is a useful summary, not a certificate of quality.
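A minimal sketch of how alpha is usually computed, using the standard formula alpha = k/(k−1) × (1 − sum of item variances / variance of the total score). The response matrix is hypothetical, made up only to show the arithmetic.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row per person, one score per item)."""
    k = len(responses[0])                                   # number of items
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])   # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 people, 4 right/wrong items (1 = correct).
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(data)
print(f"alpha = {alpha:.2f}")
```

A value like this says the items covary; on its own it does not show that they measure a single trait.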
CTT statistics are sample-dependent
An item's CTT 'difficulty' is just the proportion who got it right (the p-value). Give the same item to a high-ability school and a struggling one and that proportion changes — so the item looks 'easier' or 'harder' depending on who sat it.
The item did not change. The sample did. Yet the headline statistic moved — that is the problem.
Same item, two samples, two p-values
Group | % correct on Item Q | CTT verdict
High-ability school | 88% | An easy item
Mixed government school | 61% | A moderate item
Struggling school | 34% | A hard item
Illustrative figures. One physical item, three contradictory CTT difficulties — because the p-value confounds the item with the people who answered it.
CTT statistics are test-dependent
A person's CTT ability is their score on this particular test. Put them on an easier form and the score climbs; a harder form and it falls. The person's standing is tangled up with the difficulty of the form they happened to take.
Two children with the same true ability can get different scores purely because they sat different forms. Comparing across forms or years becomes guesswork.
One error figure for the whole scale
CTT reports a single standard error of measurement for everyone. But in reality a test measures a middling student far more precisely than it measures a top scorer (who finds every item easy) or a very weak one (who finds every item hard).
Precision genuinely varies across the ability range. CTT pretends it is constant — IRT will let it vary, which is closer to the truth.
Four jobs CTT cannot do cleanly
You want to… | CTT problem | IRT answer (coming)
Compare items fairly across samples | p-values shift with the sample | Item params are sample-free
Compare people across forms/years | Scores depend on the form | θ is on a common scale
Know precision at each level | One SEM for all | Information varies along θ
Build an adaptive or short form | Hard: scores not comparable | Items & people share a metric
None of this means CTT is wrong — just that it asks too much of one total score. IRT splits the job apart.
03
Section Three
The Big Idea of IRT
Put people and items on the same ruler
The central move of IRT is deceptively simple: place each person's trait (θ) and each item's difficulty on one shared continuum. A person is somewhere on the line; so is every item.
If a person sits above an item on the scale, they are likely to get it right or endorse it. Below it, unlikely. The distance between them sets the probability.
A trait scale centred on zero
By convention θ is scaled to have mean 0 and standard deviation 1 in the reference group, usually running from about −3 to +3. Negative means lower ability / less of the trait; positive means more.
θ ≈ −2
low trait — struggles with most items
θ ≈ 0
average for the reference group
θ ≈ +2
high trait — succeeds on most items
IRT predicts a probability, not a yes/no
IRT never says a person will get an item right. It gives the probability of a correct (or endorsing) response, as a function of where the person and the item sit on the scale.
P(correct | θ)
The probability that a person with trait level theta answers an item correctly (for a test) or endorses it (for an attitude / experience scale). It rises smoothly as theta rises.
The S-shaped probability curve
That probability follows a smooth logistic function of θ. Far below the item, P is near 0; far above, near 1; in between it climbs through an S-shape. This curve is the heart of IRT.
Crucially the curve is monotonic: more trait always means a higher chance of success. It never dips.
One item's curve across the trait range
Figure: Probability of a correct response rises with θ (item at b=0). Illustrative logistic ICC (a=1, b=0).
At θ = 0 the chance is 50%. Move up the scale and success becomes near-certain; move down and it fades to near zero.
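The curve described above can be sketched in a few lines. This assumes the plain logistic form P = 1 / (1 + exp(−a(θ − b))) with no extra scaling constant; the function name `icc` is chosen here for illustration.

```python
import math

def icc(theta, a=1.0, b=0.0):
    """Logistic ICC: probability of a correct/endorsing response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Walk along the trait scale for the illustrative item (a=1, b=0).
for theta in (-3, -2, -1, 0, 1, 2, 3):
    print(f"theta = {theta:+d}   P(correct) = {icc(theta):.3f}")
```

The printed probabilities rise steadily from near 0 to near 1 and pass through 0.5 at θ = b.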
Why this fixes CTT's headaches
  • Item properties are sample-independent: the curve describes the item itself, not the group that sat it (when the model fits)
  • Person estimates are test-independent: θ means the same thing whatever items you used
  • Items & people share a metric: you can match a test to a person, link forms, and build short or adaptive tests
These gains hold only when the model's assumptions hold — a promise we will scrutinise later.
The same idea for attitudes & experiences
For a food-insecurity or empowerment scale there is no 'right' answer — the curve gives the probability of endorsing the item ('yes, we worried about food'). Severe items are endorsed only by households high on the insecurity continuum.
'Could you visit a health centre alone?' and 'did you go a whole day without eating?' are items placed at very different points on their respective scales.
What the rest of the course unpacks
Everything that follows is detail on that one S-curve: where it sits (difficulty, b), how steep it is (discrimination, a), whether it has a floor (guessing, c), how much information it carries, and whether it behaves the same for everyone (DIF).
04
Section Four
The Item Characteristic Curve
The Item Characteristic Curve (ICC)
Item Characteristic Curve (ICC)
The graph of the probability of a correct / endorsing response (y, 0 to 1) against the latent trait theta (x). One curve per item; its shape encodes everything the model says about that item.
Also called the item response function. Read it left to right: as the person's trait increases, the chance of success increases.
Reading an ICC in three moves
  • Pick a θ on the x-axis — a person's trait level
  • Go up to the curve, then across — read the probability
  • The whole curve tells you how the item behaves for everyone
Two numbers define the basic curve: where it crosses 50% (difficulty) and how steeply it rises there (discrimination).
The three landmarks of an ICC
Lower tail
P near 0 (or near c) for very low θ — the item is too hard for them
Inflection
the steepest point — where the item best separates people
Upper tail
P near 1 for very high θ — the item is trivially easy for them
A single, well-behaved ICC
Figure: One item's ICC (a=1, b=0): monotonic, S-shaped, 0 to 1. Illustrative logistic ICC.
The 50% point sits at θ = 0 here — that is this item's difficulty. The slope through that point is its discrimination.
An ICC only ever goes up
A valid ICC is monotonically increasing: more of the trait never lowers the chance of a correct response. If a fitted curve dips or wiggles, something is wrong — a mis-keyed item, a trick question, or a broken assumption.
A non-monotonic empirical curve is a red flag, not a finding. Investigate the item before trusting it.
Several items on one set of axes
Figure: Four items overlaid: easy/hard differ in position, steep/flat in slope. Illustrative ICCs.
Overlaying ICCs is how psychometricians read an item bank at a glance — left/right tells you difficulty, steepness tells you discrimination.
From curves to a person's estimate
Given a person's pattern of right and wrong answers and the fitted ICCs, the software finds the θ that makes that pattern most likely. That θ — not the raw count — is the person's IRT score.
01 Item ICCs (a, b, c) estimated
02 Person's response pattern observed
03 Find θ that best fits the pattern
04 θ with a standard error = the score
The same raw score, different meaning
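Two children each get 6 of 12 right. One passed the six easy items; the other passed six hard ones and missed easy items. CTT calls them equal. IRT looks at which items were passed: under a 2PL or 3PL model, where items carry different weights, the two patterns can yield different θ. (Under Rasch the raw score is sufficient, so the two θ estimates coincide, but the second child's unexpected pattern shows up as person misfit.)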
The response pattern carries information that the raw total throws away.
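A minimal sketch of "find the θ that makes the pattern most likely", using a simple grid search over θ rather than the iterative routines real software uses. The four-item bank and its (a, b) values are hypothetical. The sketch also shows the raw-score point: two children with the same total get different θ under a 2PL with unequal slopes, and the same θ when all slopes are equal (Rasch).

```python
import math

def icc(theta, a, b):
    """2PL logistic ICC (no guessing floor)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, pattern):
    """Log-likelihood of a 0/1 response pattern at a given theta."""
    total = 0.0
    for (a, b), x in zip(items, pattern):
        p = icc(theta, a, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def estimate_theta(items, pattern, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum-likelihood estimate of theta."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, pattern))

# Hypothetical bank of (a, b) pairs: harder items are also more discriminating.
items_2pl = [(0.6, -1.5), (0.8, -0.5), (1.4, 0.5), (1.8, 1.5)]
items_rasch = [(1.0, b) for _, b in items_2pl]  # same difficulties, equal slopes

child_1 = [1, 1, 0, 0]  # passed the two easy items
child_2 = [0, 0, 1, 1]  # passed the two hard items, missed the easy ones

for label, items in (("2PL", items_2pl), ("Rasch", items_rasch)):
    t1 = estimate_theta(items, child_1)
    t2 = estimate_theta(items, child_2)
    print(f"{label}: child 1 theta = {t1:+.2f}, child 2 theta = {t2:+.2f}")
```

Note that a pattern of all-correct or all-wrong answers has no finite maximum-likelihood θ; the grid search would simply return the edge of the grid, which is one reason operational software often uses a prior.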
05
Section Five
Item Difficulty — the b Parameter
b locates the item on the trait scale
Difficulty (b)
The point on the theta scale where the item's success probability is 50% (for a 2PL/1PL model). It is measured in the same units as theta, so an item and a person can be compared directly.
Mantra: higher b = harder item. A hard item's curve sits to the right; you need more trait to have a 50–50 chance.
Easy, medium and hard items
Figure: Three items, same a, different b: the curve slides right as b rises. Illustrative ICCs (a=1; b = −1, 0, +1).
Read off the 50% line: the easy item crosses it at θ = −1, the hard item at θ = +1. That crossing point is b.
b and θ share one scale
Because b is expressed in θ units, you can ask: is this person above or below this item? A child at θ = 0.5 is comfortably above an easy item (b = −1) but below a hard one (b = +1).
This shared metric is what lets you target a test — choose items whose b's surround the people you most need to measure.
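To put numbers on "above" and "below", here is the child at θ = 0.5 against the two items, assuming a = 1 and the plain logistic curve with no scaling constant (both assumptions, since the slide gives only the b values).

```latex
\[
P(\theta = 0.5,\ b = -1) = \frac{1}{1 + e^{-(0.5 - (-1))}} = \frac{1}{1 + e^{-1.5}} \approx 0.82
\]
\[
P(\theta = 0.5,\ b = +1) = \frac{1}{1 + e^{-(0.5 - 1)}} = \frac{1}{1 + e^{0.5}} \approx 0.38
\]
```

So the child would be expected to pass the easy item about four times in five and the hard item a little over one time in three.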
Ordering items by difficulty
Item (early-grade reading) | b (illustrative) | Interpretation
Recognise a letter | −2.0 | Very easy; most children pass
Read a familiar word | −0.8 | Easy
Read a simple sentence | 0.2 | Moderate
Read a short paragraph | 1.0 | Hard
Answer a comprehension question | 1.8 | Very hard
Illustrative values. The ordering — letter to comprehension — is what an ASER-style ladder captures, and b puts numbers on it.
For attitude scales, b means 'severity'
On a food-insecurity scale, b is better read as severity: how much insecurity it takes before a household endorses the item. 'We worried about food' has a low b; 'a household member went a whole day without eating' has a high b.
A well-built scale spreads its items' b's from mild to severe, so it can place households all along the continuum.
Mapping items along the trait scale
Figure: Item-difficulty map: where five items 'bite' on the θ scale. Illustrative b values on the θ axis.
Gaps in the map are blind spots: between b = −0.8 and 0.2 the test thins out, so it measures children there less well.
Difficulty is about the item, not the topic
b is empirical: it is set by how people actually respond, not by how hard the item looks. A question that seems advanced may be easy if everyone was taught it; a 'simple' item may be hard if the wording confuses people.
Never assign difficulty by intuition. Estimate it from data, then sanity-check the surprises — they often reveal a flawed item.
06
Section Six
Item Discrimination — the a Parameter
a measures how sharply an item sorts people
Discrimination (a)
How steeply the ICC rises at its midpoint — how well the item distinguishes people just below its difficulty from those just above it. Higher a means a steeper curve and a sharper distinction.
Mantra: higher a = steeper ICC = better at separating low-trait from high-trait people near its b.
A steep item and a flat item
Figure: Same difficulty (b=0), different discrimination: steep vs flat. Illustrative ICCs (b=0; a = 1.8 vs 0.5).
Both cross 50% at θ = 0 (same b). The steep item leaps from low to high probability over a narrow band — it sorts people sharply right there. The flat item barely distinguishes anyone.
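The link between a and steepness can be made exact for the plain logistic curve with no guessing floor (an assumption; the slide does not state the functional form). Differentiating the ICC gives the slope at any θ, and at θ = b, where P = 1/2, the slope is a/4.

```latex
\[
\frac{dP}{d\theta} = a\,P(\theta)\,\bigl(1 - P(\theta)\bigr),
\qquad
\left.\frac{dP}{d\theta}\right|_{\theta = b} = \frac{a}{4}
\]
\[
a = 1.8 \;\Rightarrow\; \text{slope} = 0.45,
\qquad
a = 0.5 \;\Rightarrow\; \text{slope} = 0.125
\]
```

Near θ = 0, then, a small step up the scale moves the steep item's success probability between three and four times as fast as the flat item's.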
A steep item is a sharp ruler
Near its difficulty, a high-a item changes a person's success probability a lot for a small change in θ. That sensitivity is exactly what lets it tell two nearby people apart — it carries more information (next section).
High discrimination is generally desirable — but only near that item's b. Away from b, even a steep item is flat and tells you little.
When an item barely sorts anyone
A low-a (flat) item gives almost the same success probability to people across a wide range of θ. Low-trait and high-trait people answer it similarly — so it adds little to distinguishing them.
Very low or negative a is a warning: the item may be ambiguous, off-topic, or mis-keyed. Items that do not discriminate are candidates for revision or removal.
A curve that goes the wrong way
If an item shows negative discrimination, higher-trait people do worse on it than lower-trait people — the ICC slopes downward. That should never happen for a sound item.
Usual culprits: the answer key is wrong, the item measures something else, or it is a trick question. Fix or drop it — do not leave it scoring people backwards.
Difficulty and discrimination are independent
Difficulty / discrimination | Low a (flat) | High a (steep)
Low b (easy) | Easy, sorts weakly | Easy, sorts low-trait people sharply
High b (hard) | Hard, sorts weakly | Hard, sorts high-trait people sharply
b says where the item works; a says how well it works there. A good item bank mixes b's to cover the range and favours high a's at each level.
Discrimination on an empowerment scale
On an empowerment scale, a high-a item is one whose endorsement cleanly separates more- from less-empowered women near its severity. A low-a item — one answered similarly regardless of empowerment — is dead weight.
Selecting high-a items is how scale developers shorten a questionnaire without losing measurement quality.
Steeper is not always better
An extremely steep item measures superbly — but only in a razor-thin band of θ. A test built only of very steep items can measure one narrow region brilliantly and everywhere else poorly.
Balance matters: you want high discrimination spread across the range you care about, not piled at one point.
07
Section Seven
Guessing & the Model Family
On multiple-choice items, no one scores zero
On a 4-option multiple-choice item, even a child who knows nothing has roughly a 1-in-4 chance of being right. So the ICC should not fall to zero at low θ — it should flatten out at a lower asymptote.
Guessing (c)
The lower asymptote of the ICC — the success probability for someone with very low theta. It is the floor created by guessing or by partial cues.
A guessing floor lifts the lower tail
Figure: With guessing (c=0.25) the curve floors near 0.25, not 0. Illustrative ICCs (a=1, b=0; c=0 vs c=0.25).
Note the amber curve never drops below ~0.25. A very low-ability student still has a one-in-four chance — that is the guessing floor.
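The guessing floor can be sketched by adding c to the logistic curve in the usual 3PL form, P = c + (1 − c) / (1 + exp(−a(θ − b))). The function name `icc_3pl` is illustrative, and the plain logistic with no scaling constant is assumed.

```python
import math

def icc_3pl(theta, a, b, c):
    """3PL ICC: a guessing floor c plus a logistic rise toward 1."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Compare the same item (a=1, b=0) with and without a guessing floor.
for theta in (-4, -2, 0, 2, 4):
    no_guess = icc_3pl(theta, 1.0, 0.0, 0.0)
    guess = icc_3pl(theta, 1.0, 0.0, 0.25)
    print(f"theta = {theta:+d}   c=0: {no_guess:.3f}   c=0.25: {guess:.3f}")
```

One consequence worth noticing: with c = 0.25 the curve passes through (1 + c)/2 = 0.625 at θ = b, not 0.5, which is why b is defined as the 50% point only for the 1PL and 2PL.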
Ignoring guessing biases difficulty
If you fit a model with no guessing parameter to multiple-choice data, the floor created by guessing gets misread as the item being 'easier' than it is — biasing b and distorting low-ability scores.
c matters most at the bottom of the scale — precisely where many development assessments most need to measure well.
1PL, 2PL, 3PL: what each one adds
Model | Free parameters | Adds…
1PL / Rasch | b only (a fixed equal) | Difficulty differs; all items equally discriminating
2PL | a and b | Items also differ in discrimination
3PL | a, b and c | Plus a guessing floor
Each step adds realism — and demands more data to estimate the extra parameters reliably.
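The three models in the table are nested: each is the one below it with a parameter switched off. Written for item i, with the subscript notation introduced here for illustration and the plain logistic form assumed:

```latex
\[
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}}
\]
% 3PL: a_i, b_i and c_i are all free
% 2PL: set c_i = 0
% 1PL / Rasch: set c_i = 0 and a_i = a (one common slope) for every item
```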
The Rasch model: every item equally discriminating
The 1PL / Rasch model fixes a to be the same for every item, so items differ only in difficulty (b). The ICCs are parallel S-curves — same shape, shifted left or right.
Why people love it
  • Simple, stable, needs less data
  • Raw score is a sufficient statistic
  • Elegant measurement properties
The trade-off
It assumes equal discrimination. Where items genuinely differ in a, Rasch will misfit some of them.
The 2PL model: let discrimination vary
The 2PL frees a, so each item has its own slope as well as its own difficulty. ICCs can now be steep or flat, crossing one another. It fits more datasets but needs more respondents.
Use 2PL when items plainly differ in how well they sort people — common for attitude and empowerment scales.
The 3PL model: add a guessing floor
The 3PL adds c, the guessing asymptote — built for multiple-choice ability tests where lucky guesses are real. It is the most flexible of the three but the hungriest for data and the trickiest to estimate.
c is notoriously hard to pin down — it lives in the sparsely-populated low-θ tail. Large samples are essential, or c is often fixed to 1/(number of options).
Which model should you use?
  • Start simple. Rasch/1PL if items are similar and data is limited — many large-scale assessments use it deliberately
  • Move to 2PL when discrimination clearly varies and you have the sample size
  • Reserve 3PL for multiple-choice ability tests with guessing and very large samples
  • Let fit and theory decide — not the wish for the fanciest model
08
Section Eight
Information & Precision
Information = measurement precision
Item information
How much an item reduces uncertainty about theta at each point on the scale. An item is most informative around its own difficulty b, and more so the higher its discrimination a.
Information is the IRT replacement for one blanket reliability figure: it tells you where on the scale the item measures well.
An item informs most around its difficulty
Figure: Item information function (item at b=0): peaks at its difficulty. Illustrative item information (a=1, b=0).
The peak sits at θ = 0 — the item's b. An item tells you most about people whose trait is near its difficulty, and little about those far from it.
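For a 2PL item (no guessing floor, plain logistic form assumed), the information function has a simple closed form, which makes both claims above visible: the peak is at θ = b, and its height grows with the square of a.

```latex
\[
I_i(\theta) = a_i^{2}\,P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr),
\qquad
\max_{\theta} I_i(\theta) = I_i(b_i) = \frac{a_i^{2}}{4}
\]
```

For the illustrative item (a = 1, b = 0) the peak information is 0.25; doubling a to 2 would quadruple it to 1.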
Add items, add information
The test information function is just the sum of the item information functions. Where many items pile up, the test measures precisely; where items are sparse, it measures poorly.
Figure: Test information (3 items at b = −1, 0, +1): broad, peaks near 0. Illustrative test information function.
More information means less error
Precision and information are two sides of one coin: the standard error of measurement at any θ is 1 / √information. High information → small standard error; low information → large error.
Figure: Standard error varies along θ, lowest where information is highest. Illustrative SEM = 1 / √(test information).
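A short sketch tying the last two ideas together for the three-item example (b = −1, 0, +1, all with a = 1): test information is the sum of item information, and the standard error is 1 / √information. The 2PL information formula a² P (1 − P) is assumed, and the function names are illustrative.

```python
import math

def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = icc(theta, a, b)
    return a * a * p * (1.0 - p)

def total_information(theta, items):
    """Test information: the sum of the item information functions."""
    return sum(item_information(theta, a, b) for a, b in items)

def sem(theta, items):
    """Standard error of measurement at theta: 1 / sqrt(test information)."""
    return 1.0 / math.sqrt(total_information(theta, items))

items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]  # (a, b) pairs from the example
for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"theta = {theta:+.1f}   information = {total_information(theta, items):.3f}"
          f"   SEM = {sem(theta, items):.2f}")
```

With only three items the standard error is large everywhere, but the shape is the point: it is smallest near θ = 0 and roughly doubles by θ = ±3.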
Precision is not constant — and IRT shows it
CTT
One standard error for everyone — pretends the test is equally precise across the whole range.
IRT
Error is lowest where information peaks (usually mid-range) and rises sharply in the tails. Honest, and actionable.
If you must measure the very weak or very strong precisely, you need items targeted there — the middle-heavy test will fail them.
Targeting a test where it matters
Figure: Two tests, same length: a 'high-θ' form moves the information peak right. Illustrative test information (items centred at θ≈0 vs θ≈+1.5).
Choosing items by their b is how you target a form — e.g. to grade top performers, or to pinpoint a pass/fail cut-score.
Put your information where the decision is
If a programme classifies children as 'at grade level' or not, the decision happens at one θ — the cut-score. That is exactly where you want maximum information and minimum error.
Pack items with b's near the cut-score. A test can be short yet decisive if its information is concentrated where the call is actually made.
How computer-adaptive tests use information
Because items and people share a scale, a computer-adaptive test can pick each next item to be maximally informative at the test-taker's current θ estimate, homing in fast.
01 Estimate θ so far
02 Pick the most informative unused item near that θ
03 Update θ from the answer
04 Stop when the standard error is small enough
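The four steps above can be sketched as a small loop. Everything specific here is an assumption for illustration: a hypothetical 30-item 2PL bank, a simulated test-taker, a stopping threshold of 0.4, and a posterior-mean (EAP) update with a standard-normal prior in place of maximum likelihood, chosen so the estimate stays finite after the first few answers. Operational adaptive tests add exposure control and content balancing that this sketch leaves out.

```python
import math
import random

def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = icc(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [-4.0 + 0.1 * i for i in range(81)]  # theta grid from -4 to +4

def posterior_mean_sd(bank, answered):
    """EAP estimate: posterior mean and sd of theta on a grid, N(0, 1) prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalised standard-normal prior
        for idx, x in answered:
            p = icc(t, *bank[idx])
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer_fn, se_target=0.4):
    """Administer items one at a time until the standard error is small enough."""
    answered = []
    theta, se = posterior_mean_sd(bank, answered)           # step 1: estimate so far
    while se > se_target and len(answered) < len(bank):     # step 4: stopping rule
        used = {idx for idx, _ in answered}
        nxt = max((i for i in range(len(bank)) if i not in used),
                  key=lambda i: item_information(theta, *bank[i]))  # step 2
        answered.append((nxt, answer_fn(nxt)))
        theta, se = posterior_mean_sd(bank, answered)       # step 3: update theta
    return theta, se, [idx for idx, _ in answered]

# Hypothetical bank: 30 items, a = 1.5, b spread from -2.9 to +2.9.
bank = [(1.5, -2.9 + 0.2 * i) for i in range(30)]

# Simulated test-taker with true theta = 1.0 (seeded for repeatability).
rng = random.Random(0)
true_theta = 1.0
answer = lambda i: 1 if rng.random() < icc(true_theta, *bank[i]) else 0

theta_hat, se_hat, used = run_cat(bank, answer)
print(f"theta = {theta_hat:+.2f}, SE = {se_hat:.2f}, items used = {len(used)}")
```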
09
Section Nine
Validating Scales & DIF
IRT's promises hold only if assumptions do
Sample-free items, test-free scores, a clean common metric — these gifts depend on the model actually fitting the data. Validation is the work of checking that it does.
  • Does the model fit the responses?
  • Is the scale unidimensional?
  • Are responses locally independent?
  • Does each item behave the same across groups (no DIF)?
Checking model fit
Fit asks whether the observed responses match what the fitted ICCs predict — overall and item by item. A badly fitting item's empirical curve departs from its modelled S-curve.
Use item-fit statistics and, crucially, plot the empirical vs modelled ICC. A picture catches misfit that a single index can hide — and points to the offending item.
Assumption 1: one trait at a time
Unidimensionality
The assumption that a single latent trait accounts for the responses. The items should all tap the same underlying continuum — one theta, not several.
A 'reading' test that secretly also measures vocabulary and reasoning violates this. Check with factor analysis before trusting a one-dimensional θ.
Assumption 2: items don't lean on each other
Local independence
Once you account for theta, responses to different items are independent. Knowing the answer to one item should give no extra clue to another, beyond what theta already explains.
Violated by item chains — e.g. several questions about one reading passage, where missing the passage sinks them all together. Bundle or rewrite such items.
Differential Item Functioning (DIF)
Differential Item Functioning (DIF)
When two people with the SAME theta but from different groups (e.g. girls vs boys, one region vs another) have different probabilities of answering an item correctly. The item behaves differently across groups.
DIF is potential item bias. Same trait, different odds — the item is reading something other than the trait for one group.
DIF makes one item's ICC split by group
Figure: Same item, two groups: the ICCs diverge → DIF. Illustrative ICCs for one item across two groups.
At any given θ, Group B's success probability is lower — the item is effectively harder for them at the same trait level. That is uniform DIF.
DIF is not the same as a group difference
If girls genuinely have higher reading ability than boys, they will score higher — that is a real trait difference, not DIF. DIF is when girls and boys at the same ability still differ on a specific item.
DIF conditions on θ. It isolates item bias from true group differences — a distinction CTT cannot make cleanly.
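A small numerical sketch of the distinction, using made-up trait levels and item parameters. Item 1 has the same ICC in both groups, so the groups differ on it only because their θ distributions differ. Item 2 is assumed to be 0.5 units harder for group B, which is uniform DIF: the gap remains even when θ is matched.

```python
import math

def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mean_p(thetas, a, b):
    """Expected proportion correct in a group with the given trait levels."""
    return sum(icc(t, a, b) for t in thetas) / len(thetas)

# Hypothetical trait levels: group A is higher on the trait on average.
group_a = [-0.5, 0.0, 0.5, 1.0, 1.5]
group_b = [-1.5, -1.0, -0.5, 0.0, 0.5]

# Item 1: identical parameters in both groups (no DIF).
raw_gap_item1 = mean_p(group_a, 1.0, 0.0) - mean_p(group_b, 1.0, 0.0)
matched_gap_item1 = icc(0.0, 1.0, 0.0) - icc(0.0, 1.0, 0.0)

# Item 2: b is 0.5 higher for group B (an assumed shift, i.e. uniform DIF).
matched_gap_item2 = icc(0.0, 1.0, 0.0) - icc(0.0, 1.0, 0.5)

print(f"Item 1: raw group gap = {raw_gap_item1:.3f}, gap at matched theta = {matched_gap_item1:.3f}")
print(f"Item 2: gap at matched theta = {matched_gap_item2:.3f}")
```

The raw gap on Item 1 is real but is not bias; only the matched-θ gap on Item 2 is DIF.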
What to do when an item shows DIF
  • Investigate the content: a word, context or example unfamiliar to one group (an urban example, a gendered scenario)
  • Check the translation: DIF across language versions often signals a poor or unequal translation
  • Revise or remove biased items before reporting scores
  • Document the DIF review — fairness is part of validity
10
Section Ten
Applications in Development
ASER- and NAS-style learning assessments
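Large-scale learning assessments such as India's NAS and cross-national studies lean on IRT to place children on a single proficiency scale and to compare across grades, states and years. The citizen-led ASER reports simpler mastery levels, but its task ladder follows the same easy-to-hard ordering logic.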
IRT is what lets 'reading at grade-2 level' mean the same thing whether a child sat form A or form B, this year or last.
Equating: comparing across forms and years
Equating / linking
Placing different test forms onto a common theta scale — usually via shared 'anchor' items — so scores from different forms or years are directly comparable.
Because IRT item parameters are (when the model fits) sample-independent, anchor items let you stitch separate forms into one continuous scale.
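One simple way to do the stitching is mean-sigma linking, sketched below; it is one of several linking methods and is not claimed to be what any particular assessment uses. The anchor items' difficulties, estimated separately on the new form and on the reference scale, give a linear transformation θ_ref = A·θ_new + B. All numbers are hypothetical.

```python
from statistics import mean, stdev

def mean_sigma_link(b_new, b_ref):
    """Return (A, B) such that b_ref is approximately A * b_new + B on the anchors."""
    A = stdev(b_ref) / stdev(b_new)
    B = mean(b_ref) - A * mean(b_new)
    return A, B

def to_reference_scale(theta_new, A, B):
    """Move a person estimate (or an item difficulty) onto the reference scale."""
    return A * theta_new + B

def slope_to_reference_scale(a_new, A):
    """Discriminations rescale inversely to the unit of the scale."""
    return a_new / A

# Hypothetical anchor-item difficulties as estimated in each calibration.
b_ref = [-1.0, 0.0, 1.0]    # on the reference (last year's) scale
b_new = [-1.5, -0.5, 0.5]   # same items, estimated from this year's form

A, B = mean_sigma_link(b_new, b_ref)
print(f"A = {A:.2f}, B = {B:.2f}")
print(f"theta 0.0 on the new form = {to_reference_scale(0.0, A, B):+.2f} on the reference scale")
```

Here the anchors look 0.5 units easier in the new calibration, so every new-form estimate is shifted up by 0.5 to land on the reference ruler.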
Measuring change without changing the ruler
To track learning over years you must change the questions (security, age-appropriateness) without changing what the score means. Equating via anchor items keeps the ruler fixed while the items rotate.
Without equating, a 'rise in scores' could just be an easier form. Equating separates real learning gains from changes in the test.
FIES: a global IRT-based scale
The Food Insecurity Experience Scale (FIES) — eight yes/no experience items — is modelled with a Rasch/1PL approach so that severity is comparable across countries and languages, underpinning SDG indicator 2.1.2.
  • 8 items: from 'worried' to 'a whole day without eating'
  • Severity = b: items ordered from mild to severe insecurity
  • Equated: calibrated to a global reference scale
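For reference, the Rasch model behind a scale of this kind can be written as below. The subscripts p (respondent) and i (item) are notation assumed here for illustration: θ_p is the severity of the respondent's food insecurity and b_i is the severity of the experience described by item i.

```latex
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
```

Because every item shares the same slope, the number of 'yes' answers carries all the information about θ_p. That is why a simple raw score can be mapped onto the severity scale once the item severities are calibrated.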
Empowerment and agency scales
Women's empowerment indices and agency scales combine items on mobility, decision-making and asset control. IRT checks whether they measure one coherent trait, ranks items by severity, and flags items that work differently across regions or castes.
It turns a bag of agree/disagree items into a calibrated scale — and reveals which items actually discriminate between more- and less-empowered women.
Asset and wealth indices
Asset-based wealth indices (the kind behind NFHS/DHS wealth quintiles) ask whether a household owns particular assets. The standard DHS index is built with principal components analysis; IRT and related latent-trait methods offer an alternative that places households on a wealth continuum from the pattern of ownership.
A motorcycle and a mud floor sit at very different points on the wealth scale — just as easy and hard items sit at different b's. The logic is identical.
Attitude, stigma and knowledge scales
  • Health knowledge: grade items from basic to advanced and measure understanding precisely where a campaign targets it
  • Stigma / attitude scales: order statements by how much prejudice it takes to endorse them
  • Quality-of-life & depression screeners: many are now built and validated with IRT
What IRT gives a development programme
  • Shorter instruments: keep the most informative items, cut respondent burden
  • Comparable numbers: across forms, years, regions and languages
  • Fairer measures: DIF screening removes biased items
  • Targeted precision: measure best exactly where decisions are made
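The first and last points above can be sketched together: rank items by their information at the θ where decisions are made, then keep the top few. The sketch assumes a 2PL model, and the item bank (names, a and b values) is hypothetical.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a * a * p * (1 - p)

def pick_items(items, theta_target, k):
    """Return the names of the k items that are most informative at theta_target."""
    ranked = sorted(
        items,
        key=lambda it: item_information(theta_target, it["a"], it["b"]),
        reverse=True,
    )
    return [it["name"] for it in ranked[:k]]

# Hypothetical item bank: a = discrimination, b = difficulty.
bank = [
    {"name": "q1", "a": 0.6, "b": -2.0},
    {"name": "q2", "a": 1.8, "b": -0.1},
    {"name": "q3", "a": 1.2, "b": 0.2},
    {"name": "q4", "a": 1.5, "b": 2.5},
    {"name": "q5", "a": 0.9, "b": 0.0},
]

short_form = pick_items(bank, theta_target=0.0, k=2)
print(short_form)
```

Items with high discrimination and difficulty near the target θ win. A highly discriminating item far from the target (q4 here) contributes little at that point. A real short form would also need to preserve content coverage, which information alone does not guarantee.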
11
Section Eleven
Assumptions, Limits & Tools
IRT is powerful, not magic
  • Unidimensionality — one trait drives the responses
  • Local independence — items don't lean on each other
  • Correct model — the chosen 1PL/2PL/3PL actually fits
  • Monotonicity — more trait, higher success probability
Break an assumption and the elegant guarantees — sample-free items, comparable scores — quietly stop holding.
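A rough first screen for unidimensionality is to look at the eigenvalues of the inter-item correlation matrix; a dominant first eigenvalue is often read as informal evidence that one trait drives most of the shared variance. The sketch below simulates data from a single trait, so the pattern is clean by construction. All parameters are hypothetical, and plain Pearson correlations on 0/1 items are used only for simplicity (tetrachoric correlations or a formal factor analysis are the usual choice in practice).

```python
import numpy as np

# Simulate 2PL responses driven by a single trait (all values hypothetical).
rng = np.random.default_rng(1)
n_persons, n_items = 2000, 10
theta = rng.normal(0, 1, n_persons)
a = np.full(n_items, 1.5)                    # equal discriminations
b = np.linspace(-1.5, 1.5, n_items)          # spread of difficulties
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
responses = (rng.random(p.shape) < p).astype(int)

# Eigenvalues of the inter-item correlation matrix, largest first.
corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
ratio = eigenvalues[0] / eigenvalues[1]
print(np.round(eigenvalues[:3], 2), "first/second ratio:", round(ratio, 2))
```

This is a screen, not a test. It says nothing about local independence, which needs its own check on the residuals after a model has been fitted.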
IRT is data-hungry
Model | Rough sample-size guidance | Note
1PL / Rasch | ~200+ respondents | Most forgiving
2PL | ~500+ respondents | Estimating a needs more data
3PL | ~1,000+ respondents | c is hard to estimate; often fixed
Illustrative rules of thumb only — needs depend on test length, item quality and how spread the sample is. Small, homogeneous samples can defeat even a simple model.
Where IRT can mislead
  • Garbage items in, garbage scale out — IRT cannot rescue badly written items
  • A neat θ can hide a contested concept — 'empowerment' is political, not just psychometric
  • Multidimensional traits forced onto one scale lose meaning
  • Black-box scores are harder for non-specialists to interpret than a percentage
Tools for fitting IRT models
Tool | Good for | Note
R: mirt | Uni- & multidimensional IRT, all common models | Free, powerful, well documented
R: ltm | 1PL/2PL/3PL for dichotomous & graded items | Free, gentle entry point
R: TAM / eRm | Large-scale & Rasch modelling | Free; TAM mirrors big assessments
Stata: irt suite | IRT within a familiar stats package | Built-in irt commands
jMetrik / IRTPRO | Point-and-click psychometrics | Lower coding barrier
A sensible IRT workflow
01  CHECK dimensionality & local independence
02  FIT the simplest defensible model (start Rasch)
03  EXAMINE item fit, a, b, (c) and information
04  SCREEN for DIF across key groups
05  REVISE items, then score & report θ with its error
Translating θ for non-specialists
A θ of 1.2 means nothing to a programme officer or a parent. Part of doing IRT well is translating the scale back into language people act on — proficiency bands, 'can read a paragraph', percentile, or a clear cut-score.
Always report the standard error alongside the score, and describe what the cut-points mean in real-world terms. A precise number nobody understands helps no one.
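One way to make that translation routine is a small helper that turns θ and its standard error into a labelled band plus an interval. The cut-points and labels below are hypothetical; real ones would come from a standard-setting exercise, and the interval uses a normal approximation.

```python
def describe_score(theta, se, cut_points):
    """Turn theta and its standard error into a plain-language report line.

    cut_points: list of (lower_bound, label), sorted by lower_bound ascending.
    """
    label = cut_points[0][1]
    for lower, name in cut_points:
        if theta >= lower:
            label = name
    low, high = theta - 1.96 * se, theta + 1.96 * se
    return f"{label} (theta = {theta:.2f}, 95% interval {low:.2f} to {high:.2f})"

# Hypothetical proficiency bands on the theta scale.
bands = [
    (float("-inf"), "Cannot yet read words"),
    (-1.0, "Can read words"),
    (0.0, "Can read sentences"),
    (1.0, "Can read a paragraph"),
]

print(describe_score(1.2, 0.35, bands))
```

In this example the interval dips below the paragraph cut-point of 1.0, so the band label is a best guess rather than a certainty. That caveat belongs in the report alongside the label.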
Where to go deeper
  • Item Response Theory for Psychologists — Embretson & Reise (the standard, readable introduction)
  • The Theory and Practice of Item Response Theory — de Ayala
  • Fundamentals of Item Response Theory — Hambleton, Swaminathan & Rogers
  • Applying the Rasch Model — Bond & Fox (Rasch-focused)
Pair this deck with ImpactMojo's Data Literacy, Survey Design and Monitoring & Evaluation 101 courses.
If you remember five things
  • The trait is hidden — IRT models the probability of each response from θ
  • Higher b = harder; higher a = steeper; c is the guessing floor
  • Information, not one reliability number — precision varies along θ, so target your test
  • DIF = same θ, different odds across groups — check for it
  • The guarantees hold only if the assumptions do — validate, don't assume
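As a compact reminder, the five points map onto two formulas. The item subscript i is notation assumed here; the symbols θ, a, b and c are the ones used throughout the deck.

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\bigl(-a_i(\theta - b_i)\bigr)}
\qquad
\mathrm{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{I(\hat{\theta})}}
```

Setting c_i = 0 gives the 2PL, and additionally holding a_i equal across items gives the 1PL/Rasch form. The second expression is why more test information at a given θ means a more precise score there.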
Item Response Theory 101 · Complete
Now measure the unmeasurable, well.
CC BY-NC-ND 4.0 · Free Forever · ImpactMojo 101 Series