Simple Random Sampling vs. PPS Sampling

TESTING A question came up on one of our evaluations on whether we should use simple random sampling (SRS) or probability proportional to size (PPS) sampling when selecting villages (our primary sampling units) for a matching study. Under SRS, you randomly select primary sampling units (PSUs) until you reach your desired sample size. With PPS sampling, you select your PSUs using some measure of size. PPS is often used in a first stage of a two-stage sampling design because if you use PPS to select PSUs and then select a fixed number of units (households in our case) per PSU in the second stage of sampling, the probability of selection will be identical for all units. (To see this, note that the probability of selecting each PSU is $ n_1*w_i$) where $ n_1$ is the number of PSUs you select and the probability of selecting a household in a village conditional on selecting the village in the first stage is $\frac{n_2}{w_i}$ where $n_2$ is the number of households sampled per village. Note that this depends on your estimate of size being 100% accurate. More details here. Code to perform PPS without replacement here.

A quick way to answer this question is to use something called the “design effect.” The design effect is defined as the ratio of the variance of your estimate under a given sampling scheme to the variance of your estimate under SRS of final units. (Note that performing SRS on your final units, here households, is not the same as performing SRS of PSUs and then selecting a fixed number of units per PSU.) Design effects are typically used to estimate sample size requirements for population-based surveys that don’t use SRS. For example, you may know that for a household survey in India looking at consumption expenditure and using a certain sampling strategy, the design effect is likely to be around 2. To estimate the required sample size for this survey, you first estimate the sample size required under SRS and then double it. (Remember that the variance of a mean of a sample drawn using SRS is $\frac{\sigma^2}{N}$ so if you multiple the variance by X you also need to multiply the sample size by X to get the same variance.)

If we used SRS to select villages and then selected a fixed number of households per village, we might want to weight each household by the number of households in the village, or $ \frac{w_i}{\sum{w_i}} $, when performing our final analysis so that our estimates are representative of the entire population. (I say “might” because there are differing opinions on this. Stay tuned for a book club discussion on this topic.) If we do use weights in our analysis, and if we assume constant variance $ \sigma_y^2 $ per village and do a little hand waving, the variance of our estimate of the mean of a variable in the treatment group would be roughly:

$Var(\bar{y}_{w})=Var\left(\sum{\frac{w_iy_i}{\sum{w_i}}}\right)=\frac{\sum{w_i^2}}{\left(\sum{w_i}\right)^2}\sigma_y^2$

Where $ w_i $ is the size of village i.

If we sample using PPS, our variance would be roughly the variance for an estimate under SRS of final units (which, again, is different from SRS of PSUs) which would just be $ \frac{\sigma_y^2}{N} $. Thus, our design effect is…

$\frac{\sum{w_i^2}}{\left(\sum{w_i}\right)^2}N$

To estimate the final sample size required if using weights, we first calculate the sample size required using standard power calculations and then multiply this by our estimate of the design effect. Note that this is a pretty rough calculation (for example, I’m not taking into account the fact that both sampling schemes involve two stages), but it gives you an approximate idea of how the sampling scheme will affect power.

Another consideration in choosing between the two sampling schemes for this evaluation is that we have to do a full household listing in each village. On average, the villages selected using PPS will be significantly larger than under simple random sampling (where the expected value of the village sizes would be equal to the average village size). The formula for the expected size of the PSU under PPS (assuming we are just selecting one PSU or selecting with replacement) is…

$E[w_i]=\frac{\sum{w_i^2}}{\sum{w_i}}$

To derive this, note that the expected value of a variable is the sum of the probability of selecting each value of the variable times the value. Under PPS, the probability of selecting each village is $\frac{w_i}{\sum{w_i}} $.