Math 443: The Mathematics and Statistics of Surveys

Assignment 6 


Due Date: November 17.
Reading: Lohr, Chapters 6 and 7.

 
Problems:

 
  1. Lohr, Chapter 6, Problem 1.

  2. For this problem, you will use the data set swedish_munis, which is in the Math 443 data sets folder.  The data set contains information on a sample of Swedish municipalities, each defined as a town and its surrounding area.  There are 284 such municipalities in Sweden, and they vary in characteristics.  The data set contains the following variables:

    LABEL-- an identifier of the municipality running from 1 to 284.
    P85-- 1985 population in thousands.
    SS82-- number of Social Democratic seats in the municipal council.
    REV84-- real estate values according to the 1984 assessment (in millions of kronor).
    CL-- an identifier for the cluster the municipality belongs to, where clusters are constructed by aggregating neighboring municipalities.  There are 50 clusters in the country.
    NCLUS-- the number of municipalities in the cluster to which the municipality belongs.

    We want to estimate the totals of P85, SS82, and REV84 for all of Sweden. The sampled municipalities in swedish_munis were selected in two stages: 1) fifteen clusters were selected without replacement with probability proportional to size (pps), where the size measure is the number of municipalities in the cluster; 2) two municipalities were selected in each sampled cluster by simple random sampling.

    a)  Explain how you calculate the weight for each unit.


    b) Use STATA to estimate the total, standard error of the total, and the design effect (DEFF) for each variable.


    c) The estimate of the standard error is not exactly right.  Recall that STATA does not incorporate the second-stage variance in its estimate, and that it uses the simplified formula for with-replacement sampling at the first stage rather than the more complex formulas for without-replacement sampling. Based on the data, do you think that STATA is likely to underestimate or overestimate the variance?  (Hint: construct a boxplot of the data by cluster to see roughly how much within-cluster variability exists relative to between-cluster variability.  This gives you a sense of the effect of not incorporating the second-stage variance.  Now, consider the variance of the cluster totals: if it is large, then the with-replacement variance greatly overestimates the variance of the pps estimator.)


    d) Interpret the values of the design effects for these variables.  Would an SRS have been more accurate than this pps, two-stage cluster sample? Posit a reason why Statistics Sweden might have used a pps, two-stage cluster sample instead of an SRS.
     

  3. Lohr, Chapter 7, Problem 3.

  4. Lohr, Chapter 7, Problem 9.

  5. Lohr, Chapter 7, Problems 13 and 14.  For Problem 14, you do not need to graph the data. I suggest using EXCEL for Problem 14.  I have included an EXCEL file of the data set (syc.excel) in the course folder as well.  We will learn how to estimate variances of these quantities in Chapter 9.

     
     
     
     
     

    Relevant Stata commands:

    To estimate parameters in pps cluster designs in STATA, add the psu(clustername) option to a svytotal, svymean, or svyratio command.  For example, to estimate the totals of var1 and var2 in a cluster sample, use:

    svytotal var1 var2 [weight=wtsname], psu(clustername)

    Notice that there is no finite population correction factor for pps designs.  STATA estimates the variance by using the with-replacement variance formula (see Lohr, Equation 6.7).  The without-replacement variance formula (Lohr, Equation 6.14) requires keeping track of all the joint inclusion probabilities, pi_ij, which is too much effort in practice.
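    To make the with-replacement calculation concrete, here is a minimal sketch in Python (not STATA) of the standard with-replacement pps estimator of a total and its standard error.  The function name, the notation, and the numbers are hypothetical and may not match Lohr's notation exactly; the sketch also assumes the cluster totals are known, whereas in two-stage sampling they would themselves be estimates.

```python
from math import sqrt

def pps_wr_total(cluster_totals, draw_probs):
    """Estimate a population total and its standard error from a pps sample,
    treating the n sampled clusters as if they were drawn with replacement."""
    n = len(cluster_totals)
    # Each sampled cluster yields its own estimate of the population total:
    # its total divided by its per-draw selection probability.
    u = [t / p for t, p in zip(cluster_totals, draw_probs)]
    t_hat = sum(u) / n
    # With-replacement variance: sample variance of the u_i, divided by n.
    var_hat = sum((ui - t_hat) ** 2 for ui in u) / (n * (n - 1))
    return t_hat, sqrt(var_hat)

# Hypothetical data: 3 sampled clusters with their totals and the
# probability of selecting each cluster on a single draw.
totals = [20.0, 33.0, 12.0]
probs = [0.10, 0.15, 0.05]
t_hat, se = pps_wr_total(totals, probs)
print(t_hat, se)
```

    With these made-up numbers the three per-cluster estimates are about 200, 220, and 240, so the estimated total is about 220 with a standard error of roughly 11.5.  No joint inclusion probabilities appear anywhere, which is why this form is so much easier to compute.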

    For two-stage cluster sampling, you use the same command structure, but make sure that the weights reflect two-stage sampling.  That is, make sure that the weight for each unit is the inverse of the product of the inclusion probability for the psu (the cluster) and the ssu (the unit within the cluster).
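    The weight calculation described above can be sketched in Python (not STATA) as follows.  The function name and the design numbers are hypothetical.  The sketch also assumes that the pps scheme gives each cluster an inclusion probability of n times its share of the size measure, which holds for some without-replacement pps schemes but not all, so check it against the scheme actually used.

```python
def two_stage_weight(n_psu, cluster_size, population_size, m_sampled):
    """Weight = 1 / (P(cluster sampled) * P(unit sampled | cluster sampled))."""
    # Assumption: cluster inclusion probability is n * M_i / M, where M_i is
    # the cluster's size measure and M is the total size in the population.
    pi_psu = n_psu * cluster_size / population_size
    # Second stage: SRS of m units from the M_i units in the sampled cluster.
    pi_ssu = m_sampled / cluster_size
    return 1.0 / (pi_psu * pi_ssu)

# Hypothetical design: 4 clusters sampled from a population of 120 units,
# with 3 units sampled by SRS within each sampled cluster.
w_small = two_stage_weight(4, 5, 120, 3)    # unit in a cluster of 5 units
w_large = two_stage_weight(4, 12, 120, 3)   # unit in a cluster of 12 units
print(w_small, w_large)
```

    Under these assumptions the cluster size cancels in the product of the two inclusion probabilities, so units from small and large clusters end up with the same weight.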

    For two-stage cluster sampling, STATA estimates the variance of the pi-estimator by ignoring the variability due to the second stage of sampling.  STATA determines variances this way to simplify calculations, so that variances in different types of two-stage sampling designs can be estimated the same way.

     Stata handout.

     
     
       



Jerome P. Reiter

Sat Sep 25 20:29:13 EDT 1999