LABEL-- an identifier of the municipality running from 1 to 284.
P85-- 1985 population in thousands.
SS82-- number of Social Democratic seats in the municipal council.
REV84-- real estate values according to the 1984 assessment (in millions
of kronor).
CL-- an identifier for the cluster the municipality belongs to, where
clusters are constructed by aggregating neighboring municipalities.
There are 50 clusters in the country.
NCLUS-- the number of municipalities in the cluster to which the municipality
belongs.
We want to estimate the totals of P85, SS82, and REV84 for all of Sweden. The sampled municipalities in swedish_munis were selected in two stages: 1) fifteen clusters were selected without replacement under a probability-proportional-to-size (pps) scheme, where the size measure is the number of municipalities in the cluster; 2) two municipalities were selected in each sampled cluster by simple random sampling (SRS).
a) Explain how you calculate the weight for each unit.
b) Use STATA to estimate the total, standard error of the total, and
the design effect (DEFF) for each variable.
c) The estimate of the standard error is not exactly right. Recall
that STATA does not incorporate the second-stage variance in its estimate,
and that it uses the simplified formula for with-replacement sampling at
the first stage rather than the more complex formulas for without-replacement
sampling. Based on the data, do you think that STATA is likely to underestimate
or overestimate the variance? (Hint: construct a boxplot of the data
by cluster to see roughly how much within-cluster variability exists relative
to between-cluster variability. This gives you a sense of the effect
of not incorporating the second stage variance. Now, consider the
variance of the cluster totals--if it is large, then the with-replacement
variance greatly overestimates the variance of the pps estimator.)
d) Interpret the values of the design effects for these variables.
Would an SRS have been more accurate than this pps, two-stage cluster sample?
Posit a reason why Statistics Sweden might have used a pps, two-stage
cluster sample instead of the SRS.
Relevant Stata commands:
To estimate parameters in pps cluster designs in STATA, add the psu(clustername) option to the svytotal, svymean, or svyratio command. For example, to estimate the totals of var1 and var2 in a cluster sample, use:
svytotal var1 var2 [weight=wtsname], psu(clustername)
Notice that there is no finite population correction factor for pps designs. STATA estimates the variance by using the with-replacement variance formula (see Lohr, Equation 6.7). The without-replacement variance formula (Lohr, Equation 6.14) is impractical here because it requires the joint inclusion probabilities, pi_ij, for every pair of sampled clusters.
For two-stage cluster sampling, you use the same command structure, but make sure that the weights reflect two-stage sampling. That is, make sure that the weight for each unit is the inverse of the product of the inclusion probability for the psu (the cluster) and the ssu (the unit within the cluster).
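The weight rule just described can be sketched numerically. The following is a minimal Python illustration of the arithmetic only; it is not Stata code. The function name two_stage_weight, its parameters, and the numbers are hypothetical and are not taken from the data set. It assumes that the first-stage inclusion probability is the number of sampled clusters times the cluster's share of the size measure (the usual target of a pps without-replacement scheme) and that units within a sampled cluster are drawn by SRS.

```python
from fractions import Fraction


def two_stage_weight(n_psu, size_measure, total_size_measure, m_ssu, units_in_psu):
    """Sampling weight for one unit in a pps-then-SRS two-stage design.

    Assumptions (hypothetical helper, not part of Stata):
      - first-stage inclusion probability = n_psu * size_measure / total_size_measure
      - second-stage inclusion probability = m_ssu / units_in_psu
    """
    pi_psu = Fraction(n_psu * size_measure, total_size_measure)
    pi_ssu = Fraction(m_ssu, units_in_psu)
    if not (0 < pi_psu <= 1 and 0 < pi_ssu <= 1):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    # The weight is the inverse of the product of the two inclusion probabilities.
    return 1 / (pi_psu * pi_ssu)


# Made-up design: 3 clusters sampled; this cluster has size measure 20 out of 120
# and contains 8 units, of which 2 are sampled.
example_weight = two_stage_weight(3, 20, 120, 2, 8)
print(example_weight)
```

In this made-up case the cluster is included with probability 1/2 and a unit within it with probability 1/4, so each sampled unit stands for 8 units in the population.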
For two-stage cluster sampling, STATA estimates the variance of the pi-estimator by ignoring the variability due to the second stage of sampling. STATA determines variances this way to simplify calculations, so that variances in different types of two-stage sampling designs can be estimated the same way.
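To make the with-replacement calculation concrete, here is a minimal Python sketch of a variance estimate that uses only the weighted PSU totals, which is the general idea described above. It is an illustration under stated assumptions, not Stata's implementation: it assumes a single stratum, no finite population correction, and weights that already reflect both stages. The function name total_and_wr_variance and the data are hypothetical.

```python
from collections import defaultdict


def total_and_wr_variance(records):
    """Estimate a total and its with-replacement variance.

    records: iterable of (psu_id, weight, y) tuples for the sampled units.
    Only the weighted PSU totals z_i = sum(weight * y) within PSU i are used;
    the spread of y inside a PSU does not enter as a separate term.
    """
    z = defaultdict(float)
    for psu_id, weight, y in records:
        z[psu_id] += weight * y
    n = len(z)
    if n < 2:
        raise ValueError("need at least two sampled PSUs")
    total = sum(z.values())
    z_bar = total / n
    # With-replacement formula: n / (n - 1) * sum over PSUs of (z_i - z_bar)^2
    variance = n / (n - 1) * sum((z_i - z_bar) ** 2 for z_i in z.values())
    return total, variance


# Made-up sample: three PSUs, two units each, every unit with weight 5.
sample = [
    ("A", 5.0, 2.0), ("A", 5.0, 4.0),
    ("B", 5.0, 1.0), ("B", 5.0, 3.0),
    ("C", 5.0, 6.0), ("C", 5.0, 2.0),
]
est_total, est_variance = total_and_wr_variance(sample)
print(est_total, est_variance, est_variance ** 0.5)
```

Here the weighted PSU totals are 30, 20, and 40, so the estimated total is 90 and the variance estimate is (3/2)(0 + 100 + 100) = 300. Because the calculation needs nothing beyond one weighted total per PSU, the same code applies no matter how units were drawn within clusters.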