| | Primary | Secondary |
|---|---|---|
| Source | You collect it | Someone else collected it |
| Example | Your baseline survey, FGDs | Census, NFHS, district HMIS |
| Control | High — you design it | Low — you take it as given |
| Cost / time | High | Usually low |
| Risk | Fieldwork error, bias | May not fit your question |

| Source | What it covers | Frequency |
|---|---|---|
| Census of India | Every person — population, literacy, housing, migration | Decennial (2011 latest) |
| NFHS | Health, nutrition, fertility, anaemia, women's status | ~5 years (NFHS-5: 2019–21) |
| NSS / PLFS | Consumption, employment, unemployment | PLFS annual since 2017–18 |
| SRS | Birth & death rates, infant mortality, life expectancy | Annual |
| HMIS | Facility-level health service delivery | Monthly |
| SECC 2011 | Socio-economic & caste deprivation indicators | One-off (2011) |

| Level | Meaning | Example | Valid maths |
|---|---|---|---|
| Nominal | Labels, no order | District, caste, religion | Counts, mode |
| Ordinal | Ordered, gaps unknown or unequal | Wealth quintile, Likert scale | Median, rank |
| Interval | Equal gaps, no true zero | Temperature (°C), calendar year | Mean, difference |
| Ratio | Equal gaps, true zero | Income, age, children ever born | All, ratios |
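
To see why ratios are only safe at the ratio level, here is a small worked example. The two temperatures are made-up illustrative values and the helper names `c_to_f` and `c_to_k` are not from the table. Differences survive a change of scale, but "twice as hot" does not.

```python
# Illustrative temperatures in degrees Celsius (made-up values).
a_c, b_c = 10.0, 20.0

def c_to_f(c):
    """Convert Celsius to Fahrenheit, another interval scale."""
    return c * 9 / 5 + 32

def c_to_k(c):
    """Convert Celsius to kelvin, a ratio scale with a true zero."""
    return c + 273.15

# The naive ratio changes with the arbitrary zero point of the scale.
ratio_c = b_c / a_c                    # 2.0 in Celsius
ratio_f = c_to_f(b_c) / c_to_f(a_c)    # 68 / 50 = 1.36 in Fahrenheit
ratio_k = c_to_k(b_c) / c_to_k(a_c)    # about 1.035 in kelvin

# Differences stay meaningful: a 10 degree Celsius gap is an 18 degree Fahrenheit gap.
diff_c = b_c - a_c
diff_f = c_to_f(b_c) - c_to_f(a_c)
```

Only the kelvin ratio describes a physical quantity, because kelvin has a true zero. The same logic explains why "twice the income" is a valid statement while "twice the calendar year" is not.
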

| Measure | What it is | Best when |
|---|---|---|
| Mean | Arithmetic average | Roughly symmetric data, no wild outliers |
| Median | Middle value when sorted | Skewed data — income, land, wealth |
| Mode | Most frequent value | Categories — commonest crop, caste, response |
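
A minimal sketch of how the three measures behave, using Python's standard `statistics` module. The incomes and crop names are hypothetical values, chosen so that one very rich household skews the data.

```python
from statistics import mean, median, mode

# Hypothetical monthly household incomes in rupees; the last one is an outlier.
incomes = [8000, 9000, 10000, 11000, 12000, 250000]

avg = mean(incomes)      # pulled upward by the single rich household
mid = median(incomes)    # middle of the sorted values, barely affected

# Hypothetical main crop reported by each household (nominal data).
crops = ["paddy", "wheat", "paddy", "millet", "paddy", "wheat"]
commonest = mode(crops)  # the only sensible "average" for categories

print(f"mean={avg:.0f} median={mid:.0f} mode={commonest}")
# prints: mean=50000 median=10500 mode=paddy
```

Five of the six households earn 12,000 or less, yet the mean is 50,000. The median of 10,500 is a far better description of a typical household here.
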

| You want to show… | Use | Avoid |
|---|---|---|
| Change over time | Line chart | Pie chart |
| Comparison across categories | Bar chart | 3-D anything |
| Composition / shares of a whole | Stacked bar (or 1 pie, few slices) | Many pies |
| Relationship between two variables | Scatter plot | Dual-axis tricks |
| Distribution of one variable | Histogram / box plot | Single average |
| Geographic pattern | Choropleth map | Map coloured by raw counts |

| Method | How | Use when |
|---|---|---|
| Simple random | Every unit equal chance | You have a full list |
| Systematic | Every k-th unit from a list | Ordered list, no hidden cycle |
| Stratified | Split into groups, sample each | You must represent subgroups |
| Cluster | Sample whole groups (villages) | People are geographically spread |
| Multistage | Clusters, then units within | Large national surveys (NFHS) |
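
The first three methods can be sketched with Python's standard `random` module. The sampling frame of 100 households, the rural/urban split, the seed, and the function names are all assumptions made for illustration. Cluster and multistage designs are not shown.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 100 households, 70 rural and 30 urban.
frame = [{"id": i, "area": "rural" if i < 70 else "urban"} for i in range(100)]

def simple_random(units, n):
    """Every unit has the same chance of selection."""
    return rng.sample(units, n)

def systematic(units, n):
    """Pick a random start, then take every k-th unit from the list."""
    k = len(units) // n
    start = rng.randrange(k)
    return units[start::k][:n]

def stratified(units, key, fraction):
    """Sample the same fraction independently within each group."""
    groups = defaultdict(list)
    for unit in units:
        groups[unit[key]].append(unit)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

srs = simple_random(frame, 10)
sys_sample = systematic(frame, 10)
strat = stratified(frame, "area", 0.1)
```

With a 10% fraction the stratified sample always contains 7 rural and 3 urban households, whereas the simple random sample can over- or under-represent either group by chance.
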

| Problem | Example | Risk |
|---|---|---|
| Missing values | Blank income field | Biased averages if values are not missing at random |
| Duplicates | Same beneficiary twice | Inflated counts |
| Inconsistent codes | 'F' / 'Female' / '2' | Broken grouping |
| Outliers / impossible values | Age = 200, −5 children | Distorted statistics |
| Format drift | DD/MM vs MM/DD dates | Silent miscalculation |
| Typos in keys | Misspelt village name | Failed merges |
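
A rough sketch of handling four of these problems in plain Python. The records, the `SEX_CODES` codebook, the `max_age` limit of 120, and the `clean` function are hypothetical. The sketch also assumes that duplicate records share the same `id`, which real data may not guarantee.

```python
# Hypothetical raw records showing several of the problems above.
raw = [
    {"id": "B001", "sex": "F", "age": "34", "income": "12000"},
    {"id": "B002", "sex": "Female", "age": "200", "income": ""},
    {"id": "B001", "sex": "F", "age": "34", "income": "12000"},  # duplicate
    {"id": "B003", "sex": "2", "age": "41", "income": "9000"},
]

# Assumed codebook: this mapping is an illustration, not a standard.
SEX_CODES = {"f": "female", "female": "female", "2": "female",
             "m": "male", "male": "male", "1": "male"}

def clean(records, max_age=120):
    """Return (cleaned rows, ids flagged for manual review)."""
    seen, cleaned, flagged = set(), [], []
    for rec in records:
        if rec["id"] in seen:          # drop repeated beneficiary ids
            continue
        seen.add(rec["id"])
        row = dict(rec)
        # Map every spelling to one code; unknown spellings become None.
        row["sex"] = SEX_CODES.get(rec["sex"].strip().lower())
        row["age"] = int(rec["age"])
        # Keep missing income as None rather than silently treating it as 0.
        row["income"] = float(rec["income"]) if rec["income"] else None
        if not 0 <= row["age"] <= max_age:
            flagged.append(row["id"])  # flag for review, do not auto-correct
        cleaned.append(row)
    return cleaned, flagged

cleaned, flagged = clean(raw)
```

Impossible values are flagged rather than overwritten, so that someone can check the original form before deciding what the correct value is.
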

| Tool | Good for | Note |
|---|---|---|
| Spreadsheets (Excel, Google Sheets) | Most everyday analysis | Start here; learn pivot tables |
| R | Statistics, reproducible analysis, graphics | Free, powerful, steeper curve |
| Python (pandas) | Cleaning, large data, automation | Free, general-purpose |
| KoboToolbox / ODK | Mobile survey data collection | Free, offline-capable |
| QGIS | Maps and spatial data | Free, open-source GIS |
| Power BI / Looker Studio | Dashboards | Quick visual reporting |