| HH ID | Person line | Age | Sex | Education |
|---|---|---|---|---|
| 10231 | 1 | 44 | Male | Secondary |
| 10231 | 2 | 40 | Female | Primary |
| 10231 | 3 | 16 | Female | Secondary |
| 10231 | 4 | 11 | Male | Primary |
| 10232 | 1 | 67 | Female | None |
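
The person-level layout above can be rolled up to one row per household. The following is a minimal sketch using the five rows shown; the column names (`hh_id`, `person_line`, `n_members`, `oldest_age`) are illustrative choices, not names taken from any survey file.

```python
import pandas as pd

# The five person records from the table; column names are illustrative.
persons = pd.DataFrame({
    "hh_id": [10231, 10231, 10231, 10231, 10232],
    "person_line": [1, 2, 3, 4, 1],
    "age": [44, 40, 16, 11, 67],
    "sex": ["Male", "Female", "Female", "Male", "Female"],
    "education": ["Secondary", "Primary", "Secondary", "Primary", "None"],
})

# A person is identified by household ID plus person line, so that pair
# should never repeat.
key_is_unique = not persons.duplicated(subset=["hh_id", "person_line"]).any()

# Collapse to one row per household.
households = persons.groupby("hh_id").agg(
    n_members=("person_line", "count"),
    oldest_age=("age", "max"),
)
print(households)
```

Here "None" in the education column is kept as a text category (no schooling), not as a missing value.
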

| Survey | What it covers | Who runs it | Frequency |
|---|---|---|---|
| NSS (consumption, etc.) | Consumption expenditure, employment, social consumption | NSSO / MoSPI | Rounds (subject rotates) |
| PLFS | Labour force: work, unemployment, wages | NSSO / MoSPI | Annual since 2017–18 |
| NFHS | Health, nutrition, fertility, anaemia, women's status | IIPS / MoHFW | ~5 yrs (NFHS-5: 2019–21) |
| CMIE-CPHS | Household income, consumption, sentiment | CMIE (private) | Continuous, 3 waves/yr |

| Question | What you are checking |
|---|---|
| How many rows & columns? | Size and whether the file is complete |
| What type is each column? | Numeric, text, date, categorical codes |
| What is the range of each? | Min, max, plausibility |
| How many distinct values? | Categorical levels, accidental duplicates |
| How many missing per column? | Where the gaps are |
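
Each of the five questions maps onto a single pandas call. The sketch below assumes a small person-level frame; the `income` column and its blanks are hypothetical, added only so that the missing-value count has something to report.

```python
import numpy as np
import pandas as pd

# Illustrative person-level extract; `income` is a hypothetical column.
df = pd.DataFrame({
    "hh_id": [10231, 10231, 10231, 10231, 10232],
    "person_line": [1, 2, 3, 4, 1],
    "age": [44, 40, 16, 11, 67],
    "sex": ["Male", "Female", "Female", "Male", "Female"],
    "education": ["Secondary", "Primary", "Secondary", "Primary", "None"],
    "income": [12000.0, np.nan, np.nan, np.nan, 3000.0],
})

n_rows, n_cols = df.shape                     # how many rows and columns?
col_types = df.dtypes                         # what type is each column?
ranges = df.describe().loc[["min", "max"]]    # range of each numeric column
n_distinct = df.nunique()                     # distinct values per column
n_missing = df.isna().sum()                   # missing values per column

print(n_rows, n_cols)
print(col_types, ranges, n_distinct, n_missing, sep="\n\n")
```

`describe()` covers numeric columns only by default; text columns are better summarised with `value_counts()`.
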

| Codebook field | Why EDA needs it |
|---|---|
| Variable label | What it actually measures |
| Value codes | What each number stands for, e.g. 1 = yes, 2 = no, 9 = missing |
| Units | Rupees? months? per week? |
| Universe / who answers | Only women 15–49? only workers? |
| Reference period | Last 7 days? last 30? last year? |
| Skip patterns | Why a block is blank for some rows |
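
One way to put the codebook to work is to hold its value codes in a dictionary and decode the raw column from it. This is a sketch under assumptions: the variable name `has_bank_account`, its label, and the dictionary layout are all hypothetical; only the 1 / 2 / 9 coding comes from the table.

```python
import pandas as pd

# Hypothetical codebook entry; the structure is an illustrative choice.
codebook = {
    "has_bank_account": {
        "label": "Respondent has a bank account",
        "values": {1: "Yes", 2: "No"},
        "missing_codes": [9],
    }
}

raw = pd.Series([1, 2, 9, 1, 2, 2], name="has_bank_account")
entry = codebook[raw.name]

# Codes absent from the value map (including 9) become NaN.
decoded = raw.map(entry["values"])

# Anything that is NaN but was not a declared missing code is undocumented
# and worth investigating before analysis.
is_declared_missing = raw.isin(entry["missing_codes"])
undocumented = raw[decoded.isna() & ~is_declared_missing]

print(decoded.value_counts(dropna=False))
print("Undocumented codes:", undocumented.tolist())
```
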

| Level | Meaning | Survey example | Valid maths |
|---|---|---|---|
| Nominal | Labels, no order | Religion, state, ration-card type | Counts, mode |
| Ordinal | Ordered, gaps not necessarily equal | Education level, wealth quintile | Median, rank |
| Interval | Equal gaps, no true zero | Year of birth | Mean, difference |
| Ratio | Equal gaps, true zero | Age, income, consumption | All, ratios |
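
The "valid maths" column can be illustrated on three variables of different levels. The sketch below assumes a three-step education ordering (None < Primary < Secondary), which is a simplification for illustration; real surveys use longer scales.

```python
import pandas as pd

sex = pd.Series(["Male", "Female", "Female", "Male", "Female"])   # nominal
age = pd.Series([44, 40, 16, 11, 67])                             # ratio
levels = ["None", "Primary", "Secondary"]                         # assumed order
education = pd.Categorical(
    ["Secondary", "Primary", "Secondary", "Primary", "None"],
    categories=levels,
    ordered=True,
)                                                                 # ordinal

# Nominal: only counts and the mode are meaningful.
mode_sex = sex.mode().iloc[0]

# Ordinal: take the median of the rank codes, then map back to the label.
# interpolation="lower" keeps the result on an actual category.
rank_codes = pd.Series(education.codes)
median_level = levels[int(rank_codes.quantile(0.5, interpolation="lower"))]

# Ratio: the arithmetic mean is meaningful.
mean_age = age.mean()

print(mode_sex, median_level, mean_age)
```

Averaging the education codes would also run without error, but the result would assume equal gaps between levels, which an ordinal scale does not guarantee.
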

| Code | Often means | Risk if treated as a value |
|---|---|---|
| 99 / 999 | Not known / not stated | Inflates the mean enormously |
| 97 / 98 | Refused / not applicable | Phantom category in tables |
| 0 | Sometimes a real zero, sometimes 'none' | Ambiguous — check codebook |
| Blank | Skip pattern or true missing | Silent loss of cases |
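
The effect of leaving a special code in the data is easy to demonstrate. The sketch below uses a made-up "years of schooling" column in which 99 stands for "not stated" and 0 is a genuine zero; both the variable and the values are hypothetical.

```python
import pandas as pd

# Hypothetical years of schooling; 99 = not stated, 0 = no schooling (real).
schooling = pd.Series([10, 5, 8, 0, 12, 99, 99])

naive_mean = schooling.mean()            # treats 99 as 99 years of schooling

# Turn the special code into NaN; the genuine 0 is left untouched.
clean = schooling.mask(schooling == 99)
clean_mean = clean.mean()                # NaN is skipped by default

print(f"naive mean: {naive_mean:.2f}")
print(f"clean mean: {clean_mean:.2f}")
print(f"recoded to missing: {clean.isna().sum()}")
```

Two coded cases out of seven are enough to lift the mean from 7.0 to about 33.3, which is why the codebook has to be read before any summary statistic is trusted.
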

| Sanitation facility | Households | % |
|---|---|---|
| Improved, not shared | 6,420 | 64.2 |
| Improved, shared | 1,180 | 11.8 |
| Unimproved | 910 | 9.1 |
| Open defecation | 1,390 | 13.9 |
| Not stated | 100 | 1.0 |
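
The percentage column can be reproduced from the household counts, and the same counts show how the shares shift if "Not stated" is dropped from the denominator. This is a sketch from the aggregated figures in the table, not from microdata; whether to exclude "Not stated" is an analyst's choice.

```python
import pandas as pd

counts = pd.Series(
    {
        "Improved, not shared": 6420,
        "Improved, shared": 1180,
        "Unimproved": 910,
        "Open defecation": 1390,
        "Not stated": 100,
    },
    name="households",
)

# Shares of all households, as in the table.
pct = (100 * counts / counts.sum()).round(1)

# Shares among households with a stated answer only.
valid = counts.drop("Not stated")
valid_pct = (100 * valid / valid.sum()).round(1)

print(pd.DataFrame({"households": counts, "pct": pct, "valid_pct": valid_pct}))
```
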

| Problem | Survey example | How EDA spots it |
|---|---|---|
| Missing values | Income blank for some | Count of NA per variable |
| Impossible values | Age = 230, −3 children | Min/max range check |
| Special codes as data | 99 = 'not stated' | Spike at 99 in histogram |
| Top-coding | Income capped at a max | Wall of cases at the ceiling |
| Heaping | Ages bunched at values ending in 0 or 5 | Comb pattern in histogram |
| Duplicates | Same HH twice | Duplicate household IDs |
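
Three of these checks (duplicates, impossible values, heaping) can be sketched in a few lines. The data below are invented for illustration, and the 0 to 110 age range is an assumed plausibility bound.

```python
import pandas as pd

# Hypothetical person records with deliberate problems.
df = pd.DataFrame({
    "hh_id": [1, 1, 2, 3, 3, 4],
    "person_line": [1, 2, 1, 1, 1, 1],
    "age": [40, 35, 230, 50, 50, 45],
})

# Duplicates: the same household/person key appearing more than once.
dupes = df[df.duplicated(subset=["hh_id", "person_line"], keep=False)]

# Impossible values: ages outside the assumed plausible range.
out_of_range = df[~df["age"].between(0, 110)]

# Heaping: share of plausible ages whose last digit is 0 or 5.
valid_age = df.loc[df["age"].between(0, 110), "age"]
heaped_share = valid_age.mod(10).isin([0, 5]).mean()

print(dupes, out_of_range, f"share ending in 0 or 5: {heaped_share:.0%}", sep="\n\n")
```

If last digits were spread evenly, roughly one age in five would end in 0 or 5, so a share far above that suggests rounding by respondents or enumerators.
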

| Rule | Violation to flag |
|---|---|
| Age between 0 and ~110 | Age = 230, age = −1 |
| Percentages in 0–100 | Vaccination = 140% |
| Members ≥ earners | 8 earners in a 4-person household |
| Mother older than child | Mother 14, child 10 |
| Consumption > 0 | MPCE = 0 with members present |
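
Rules like these can be written as named boolean conditions and applied in one pass, so that each flagged household shows which rule it broke. The sketch below covers two of the rules; the household data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical household-level file.
hh = pd.DataFrame({
    "hh_id": [1, 2, 3],
    "members": [4, 5, 3],
    "earners": [8, 2, 1],
    "mpce": [2500.0, 0.0, 1800.0],
})

# Each rule is expressed as the condition that marks a violation.
rules = {
    "earners_exceed_members": hh["earners"] > hh["members"],
    "zero_mpce_with_members": (hh["mpce"] <= 0) & (hh["members"] > 0),
}

flags = pd.DataFrame(rules)
flags.insert(0, "hh_id", hh["hh_id"])

# Keep only households that break at least one rule.
violations = flags[flags[list(rules)].any(axis=1)]
print(violations)
```

Flagging instead of deleting keeps the decision about each case (correct, set to missing, or drop) explicit and reviewable.
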

| Sector | Has toilet | No toilet | % with toilet |
|---|---|---|---|
| Rural | 5,180 | 2,020 | 71.9 |
| Urban | 2,610 | 190 | 93.2 |
| All | 7,790 | 2,210 | 77.9 |
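
The row percentages and the "All" row follow directly from the four cell counts. The sketch below rebuilds them from the aggregated figures in the table; the column labels are illustrative.

```python
import pandas as pd

table = pd.DataFrame(
    {"has_toilet": [5180, 2610], "no_toilet": [2020, 190]},
    index=["Rural", "Urban"],
)

# Add the marginal row, then compute the share with a toilet in each row.
table.loc["All"] = table.sum()
row_total = table.sum(axis=1)
pct_with_toilet = (100 * table["has_toilet"] / row_total).round(1)

print(table.assign(pct_with_toilet=pct_with_toilet))
```

With household-level microdata, `pd.crosstab(df["sector"], df["toilet"], normalize="index")` gives the same row shares directly, assuming columns with those (hypothetical) names.
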

| Tool | Good for | Note |
|---|---|---|
| R + tidyverse | Cleaning, plotting, reproducible analysis | Free; survey & srvyr packages handle weights |
| Python + pandas | Cleaning, large data, automation | Free; samplics for survey design, statsmodels for weighted statistics |
| Stata | Standard for official microdata | svyset built-in for weights & design; widely used |
| Spreadsheets | Quick first look, small tables | Fine to start; not for weighted survey estimates |
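
Why weights matter can be shown with a toy example in which the unweighted and weighted shares differ. The five households and their weights below are invented; the sketch gives a point estimate only and says nothing about standard errors.

```python
import numpy as np

# Hypothetical sample: 1 = has toilet, 0 = no toilet.
has_toilet = np.array([1, 1, 0, 1, 0])

# Hypothetical survey weights: households each sample unit represents.
weight = np.array([100.0, 100.0, 400.0, 100.0, 300.0])

unweighted_share = has_toilet.mean()
weighted_share = np.average(has_toilet, weights=weight)

print(f"unweighted: {unweighted_share:.1%}")
print(f"weighted:   {weighted_share:.1%}")
```

The sample share is 60%, but the households without a toilet carry larger weights, so the estimated population share is 30%. Standard errors additionally depend on strata and clusters, which is the part the survey packages in the table take care of.
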