Sampling Foundations for Statistical Inference

Learn what samples and populations are, and how sampling techniques are used for statistical inference.

In statistics and predictive modeling, we rarely have the luxury of analyzing an entire Population (N). Instead, we rely on a Sample (n): a smaller subset used to make inferences about the larger group.

What is a Sample?

A Sample is a smaller, manageable subset of a larger population. The goal of sampling is to collect evidence from this subset to make mathematical inferences about the whole group.

  • Population (N): the entire group under study.
  • Sample (n): the subset selected for analysis.

1. Probability Sampling Techniques

Probability sampling uses random selection, ensuring every unit has a known, non-zero probability of being chosen. This enables calculation of Sampling Error.

Simple Random Sampling (SRS)

Every individual has an equal probability of selection.

Formula:

P = n / N

Where:

  • P = probability of selecting any individual unit
  • n = sample size
  • N = population size

The sample mean is:

x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ

Where:

  • x̄ = sample mean
  • n = number of observations in the sample
  • xᵢ = each individual observation
  • μ = population mean (the parameter being estimated)

Population (N = 12): 🔴 🔵 🟡 🟢 / 🔵 🔴 🟢 🟡 / 🟡 🟢 🔴 🔵
→ Random Selection →
Sample (n = 3): 🔴 🟢 🟡
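As a minimal sketch (standard library only; the population values are hypothetical), SRS can be written as:

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement; every unit has the same
    inclusion probability P = n / N."""
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.sample(population, n)

population = list(range(12))                      # N = 12
sample = simple_random_sample(population, 3, 42)  # n = 3
print(sample)                 # three distinct units drawn at random
print(3 / len(population))    # inclusion probability P = 0.25
```

`random.sample` draws without replacement, which matches the equal-probability selection described above.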

Systematic Sampling

Selecting units at fixed intervals from an ordered population, starting from a random position.

Formula:

k = N / n

Where:

  • k = sampling interval
  • N = population size
  • n = desired sample size

A random starting point r is chosen such that 1 ≤ r ≤ k, and the sample consists of:

r, r + k, r + 2k, …

Population (N = 12): ⚪ ⚪ ⚪ ⚪ 🔴 / ⚪ ⚪ ⚪ 🔴 ⚪ ⚪
→ Pick every 5th item →
Sample (n = 2): 🔴 🔴
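A sketch of systematic sampling under the same assumptions (standard library only; the population is hypothetical):

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick every k-th unit (k = N // n) starting from a random
    position r with 1 <= r <= k."""
    N = len(population)
    k = N // n                                # sampling interval
    r = random.Random(seed).randint(1, k)     # random start, 1-indexed
    return [population[i] for i in range(r - 1, N, k)][:n]

pop = list(range(1, 13))                 # N = 12 units labelled 1..12
s = systematic_sample(pop, 2, seed=0)    # n = 2, so k = 6
print(s)                                 # two units exactly k apart
```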

Stratified Sampling

Dividing the population into Strata and sampling proportionally from each group.

Formula:

n_h = (N_h / N) × n

Where:

  • n_h = sample size from stratum h
  • N_h = population size of stratum h
  • N = total population size
  • n = total sample size

Population (strata): 🔴 🔴 🔴 🔴 / 🔵 🔵 🔵 🔵 / 🟢 🟢 🟢 🟢
→ Randomly sample n_h from each →
Sample (representative): 🔴 🔵 🟢

The Logic: The population is organized into homogeneous subgroups (strata) based on a specific characteristic (like M&M color). By sampling from every group, we ensure that the sample reflects the true diversity of the population, even for smaller minority groups.
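The proportional-allocation rule above can be sketched as follows (standard library only; the colour strata are hypothetical):

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n, seed=None):
    """Proportional allocation: draw n_h = (N_h / N) * n units at
    random from each stratum h (rounded to the nearest integer)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)       # group units by stratum
    N = len(units)
    sample = []
    for members in strata.values():
        n_h = round(len(members) / N * n)     # stratum's proportional share
        sample.extend(rng.sample(members, n_h))
    return sample

units = ["R"] * 4 + ["B"] * 4 + ["G"] * 4     # three equal strata, N = 12
s = stratified_sample(units, lambda u: u, 3, seed=1)
print(s)   # one unit from each colour, since each stratum holds 1/3 of N
```

Note that rounding can make the total differ slightly from n when strata sizes are uneven; production implementations handle the remainder explicitly.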


Cluster Sampling

Dividing the population into Clusters, randomly selecting clusters, and sampling all units within them.

Formula:

Ŷ = (M / m) Σᵢ₌₁ᵐ yᵢ

Where:

  • Ŷ = estimated population total
  • M = total number of clusters in the population
  • m = number of clusters sampled
  • yᵢ = total value from cluster i

Population (clusters): [🔴 🔵] [🟡 🟢] / [🔵 🟡] [🟢 🔴]
→ Randomly select m clusters →
Sample (selected clusters): [🔵 🟡] [🟢 🔴]

The Logic: The population is divided into heterogeneous groups (clusters) that each represent the whole. Instead of sampling individuals, we randomly select entire clusters and study every unit within them. This is highly cost-effective when the population is geographically spread out.
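A sketch of the estimator above (standard library only; the clusters and their values are hypothetical):

```python
import random

def cluster_total_estimate(clusters, m, seed=None):
    """Randomly select m of the M clusters, measure every unit in the
    chosen clusters, and scale up: Y-hat = (M / m) * sum(y_i)."""
    rng = random.Random(seed)
    M = len(clusters)                            # total number of clusters
    sampled = rng.sample(clusters, m)            # one-stage cluster sample
    return M / m * sum(sum(c) for c in sampled)  # scale sampled totals to M

clusters = [[2, 3], [4, 1], [3, 2], [1, 4]]      # M = 4, true total = 20
est = cluster_total_estimate(clusters, m=2, seed=7)
print(est)   # → 20.0 (every cluster totals 5 here, so the estimate is exact)
```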


2. Non-Probability Sampling Techniques

Non-probability sampling does not use random selection. The probability of selection is unknown, which means sampling error cannot be formally calculated.

Quota Sampling

Selecting units until predefined category targets are met.

This method ensures representation of certain characteristics but does not rely on randomness.

Population (categorized): 🔴 🔴 🔴 🔴 / 🔵 🔵 🔵 🔵 / 🟢 🟢 🟢 🟢
→ Fill quota: 2 red, 2 blue →
Sample (target met): 🔴 🔴 🔵 🔵

Note: Unlike stratified sampling, selection within each group is not random; it often depends on who is easiest to reach first.
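That difference is easy to see in code: a quota sampler simply takes units in the order they arrive (a hypothetical stream below), with no random draw at all:

```python
def quota_sample(stream, quotas):
    """Take units in arrival order until each category's quota is
    filled; everything after that is ignored."""
    filled = {category: 0 for category in quotas}
    sample = []
    for unit, category in stream:   # no randomness: first come, first taken
        if category in quotas and filled[category] < quotas[category]:
            sample.append(unit)
            filled[category] += 1
    return sample

stream = [(1, "red"), (2, "red"), (3, "blue"),
          (4, "red"), (5, "blue"), (6, "blue")]
print(quota_sample(stream, {"red": 2, "blue": 2}))   # → [1, 2, 3, 5]
```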


Judgmental Sampling

The researcher selects units based on expertise or subjective judgment.

This is often used when specialized knowledge is required.

Population (diverse): 🔴 ⚪ ⚪ ⚪ / ⚪ ⚪ 🔴 ⚪ / ⚪ 🔴 ⚪ ⚪
→ Researcher picks specific units →
Sample (expert selection): 🔴 🔴 🔴

The Logic: Selection is driven by a specific purpose or research objective. Instead of relying on a random draw, the researcher deliberately picks individuals who are "most informative" or possess specific expertise relevant to the study.


Snowball Sampling

Existing participants recruit future participants.

This is commonly used when studying hard-to-reach populations.

Population (hidden): ⚪ ⚪ ⚪ ⚪ ⚪ / ⚪ 🔴 ⚪ 🔴 ⚪ / ⚪ ⚪ 🔴 ⚪ ⚪ / ⚪ 🔴 ⚪ 🔴 ⚪
→ Referrals →

          🔴
        ↙    ↘
      🔴        🔴
     ↙ ↘      ↙ ↘
    🔴   🔴  🔴   🔴

Final sample (the chain): the seed, its two referrals, and their four referrals (7 subjects).

The Logic: Selection starts with a "seed" subject who meets the criteria. That subject then recruits or refers others from their network who also fit the criteria. This creates a branching tree structure, which is the only effective way to reach niche populations without a formal database.
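The referral chain can be sketched as a breadth-first traversal over a hypothetical referral network:

```python
from collections import deque

def snowball_sample(seed_subject, referrals, max_size):
    """Start from a seed subject and follow referrals breadth-first
    until max_size subjects are recruited or the chain dies out."""
    sample = [seed_subject]
    queue = deque([seed_subject])
    while queue and len(sample) < max_size:
        current = queue.popleft()
        for contact in referrals.get(current, []):
            if contact not in sample and len(sample) < max_size:
                sample.append(contact)    # recruited via referral
                queue.append(contact)
    return sample

# hypothetical network: who each subject is able to refer
referrals = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
chain = snowball_sample("A", referrals, max_size=7)
print(chain)   # → ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```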


Convenience Sampling

Selecting units that are easiest to access.

This method is fast and inexpensive but highly prone to bias.

Population (all): 🔴 ⚪ 🔴 ⚪ / ⚪ ⚪ ⚪ ⚪ / 🔴 🔴 🔴 🔴 / ⚪ ⚪ ⚪ ⚪
→ Pick the closest units →
Sample (the nearest): 🔴 🔴 🔴 🔴

The Logic: This is the "handiest" method where the researcher samples whoever is nearby or currently available. While it is the fastest and cheapest method, it carries the highest risk of Selection Bias because it ignores everyone outside the immediate vicinity.


Sampling in the ML Pipeline

  • In Machine Learning, sampling is the foundation of model generalization.
  • By using Stratified Sampling to handle class imbalance and Cluster Sampling logic to prevent data leakage during train-test splits, we ensure that models perform accurately on unseen data.
  • A sophisticated algorithm cannot fix a biased dataset; this is where careful sampling helps stabilize the model.
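As a sketch of the cluster-style split mentioned above (standard library only; the patient-id grouping is a hypothetical example), keeping every record of a group on one side of the split prevents leakage:

```python
import random

def group_train_test_split(records, group_of, test_frac=0.25, seed=None):
    """Split by group so that all records sharing a group id land on
    the same side of the train/test split (cluster-sampling logic)."""
    rng = random.Random(seed)
    groups = sorted({group_of(r) for r in records})  # unique group ids
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))  # groups held out
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_of(r) not in test_groups]
    test = [r for r in records if group_of(r) in test_groups]
    return train, test

# hypothetical records: (patient_id, measurement)
records = [("p1", 0.1), ("p1", 0.2), ("p2", 0.3), ("p3", 0.4), ("p4", 0.5)]
train, test = group_train_test_split(records, lambda r: r[0], 0.25, seed=3)
print(train, test)   # no patient appears in both partitions
```

Splitting by rows instead of by groups would let two records from the same patient straddle the train/test boundary, which inflates evaluation scores.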

Contributors

lv1607