Global Social Data

Includes data from sources including the CIA World Factbook, the World Bank, the Association of Religion Data Archives, the United Nations Office on Drugs and Crime, the International Centre for Prison Studies, and the Stockholm International Peace Research Institute.

  • Let's imagine this data represents our population

  • We want to identify the mean percentage of women holding national office.

library(haven)
library(dplyr)
global13_full <- read_sav("/shared/isma5720@pacificu.edu/global13sdss_0.sav")
global13 <- global13_full %>% select(country, FEMALEOFFICE)

Plotting the population data

library(ggplot2)
global13 %>% ggplot(mapping = aes(x = FEMALEOFFICE)) +
  geom_histogram(color = "white", bins = 10)

Sampling

  • Let's use sampling to produce an estimate
  • Let's choose a sample of size 25
library(mosaic); set.seed(2016)
sample1 <- global13 %>% resample(size = 25, replace = FALSE)
(mean_s1_f <- sample1 %>% summarize(mean_f = mean(FEMALEOFFICE)))
## # A tibble: 1 x 1
##   mean_f
##    <dbl>
## 1 21.352
  • But this is just one sample. We could produce many samples, each with a different mean. Which one is correct?

Sampling distribution

sample_means <- do(200) * 
  global13 %>% resample(size = 25, replace = FALSE) %>% 
  summarize(mean_f = mean(FEMALEOFFICE))
sample_means
##     mean_f
## 1   23.072
## 2   18.092
## 3   22.960
## 4   20.848
## 5   20.284
## 6   22.756
## 7   16.404
## 8   17.856
## 9   23.352
## 10  23.268
## 11  21.624
## 12  22.088
## 13  19.212
## 14  20.136
## 15  21.852
## 16  18.888
## 17  23.780
## 18  16.584
## 19  17.572
## 20  19.852
## 21  20.704
## 22  19.376
## 23  21.464
## 24  16.832
## 25  21.564
## 26  19.772
## 27  19.100
## 28  20.888
## 29  22.312
## 30  16.160
## 31  19.564
## 32  17.376
## 33  19.496
## 34  19.584
## 35  22.092
## 36  20.100
## 37  20.604
## 38  20.656
## 39  18.080
## 40  22.196
## 41  21.080
## 42  18.928
## 43  20.252
## 44  20.856
## 45  19.444
## 46  19.384
## 47  20.612
## 48  21.448
## 49  21.816
## 50  22.000
## 51  18.284
## 52  17.988
## 53  20.608
## 54  20.548
## 55  19.592
## 56  18.468
## 57  23.256
## 58  20.360
## 59  21.880
## 60  20.196
## 61  20.160
## 62  17.600
## 63  20.344
## 64  22.612
## 65  20.160
## 66  22.416
## 67  22.496
## 68  19.600
## 69  23.804
## 70  20.052
## 71  21.140
## 72  19.764
## 73  18.456
## 74  21.568
## 75  21.400
## 76  15.408
## 77  20.708
## 78  22.504
## 79  22.064
## 80  18.308
## 81  19.352
## 82  16.896
## 83  17.620
## 84  17.504
## 85  17.908
## 86  20.420
## 87  20.064
## 88  21.236
## 89  19.772
## 90  22.924
## 91  19.276
## 92  19.432
## 93  19.148
## 94  20.380
## 95  22.536
## 96  21.004
## 97  17.420
## 98  22.044
## 99  19.368
## 100 22.180
## 101 22.640
## 102 16.608
## 103 19.892
## 104 22.388
## 105 20.400
## 106 19.880
## 107 21.424
## 108 21.172
## 109 18.116
## 110 19.736
## 111 19.640
## 112 21.176
## 113 19.860
## 114 19.896
## 115 20.364
## 116 19.948
## 117 19.436
## 118 19.368
## 119 18.240
## 120 18.200
## 121 21.140
## 122 20.652
## 123 20.944
## 124 22.828
## 125 20.876
## 126 20.988
## 127 18.064
## 128 20.144
## 129 22.084
## 130 21.008
## 131 22.032
## 132 20.832
## 133 17.532
## 134 18.564
## 135 17.132
## 136 22.496
## 137 20.740
## 138 20.876
## 139 19.204
## 140 17.752
## 141 18.748
## 142 21.008
## 143 22.020
## 144 22.508
## 145 19.364
## 146 19.800
## 147 20.188
## 148 22.812
## 149 17.692
## 150 19.876
## 151 20.188
## 152 20.252
## 153 21.272
## 154 16.736
## 155 18.516
## 156 20.728
## 157 25.028
## 158 22.668
## 159 18.464
## 160 17.720
## 161 19.612
## 162 21.740
## 163 22.008
## 164 22.928
## 165 20.900
## 166 19.392
## 167 20.432
## 168 21.300
## 169 23.844
## 170 19.748
## 171 18.728
## 172 23.512
## 173 19.904
## 174 22.016
## 175 21.228
## 176 15.228
## 177 21.312
## 178 17.304
## 179 19.876
## 180 21.072
## 181 21.904
## 182 19.988
## 183 21.404
## 184 20.604
## 185 20.372
## 186 19.344
## 187 21.036
## 188 17.324
## 189 19.248
## 190 25.068
## 191 20.480
## 192 20.732
## 193 18.732
## 194 19.944
## 195 19.080
## 196 19.036
## 197 16.468
## 198 17.244
## 199 21.908
## 200 18.872

Plot of sampling distribution

library(ggplot2)
sample_means %>% ggplot(mapping = aes(x = mean_f)) +
  geom_histogram(color = "white", bins = 10)

Mean of the sampling distribution

The mean of the sampling distribution of means provides an estimate for our population mean:

sample_means %>% summarize(mean_samp_dist = mean(mean_f))
##   mean_samp_dist
## 1        20.2159

Population mean

global13 %>% summarize(pop_mean = mean(FEMALEOFFICE))
## # A tibble: 1 x 1
##   pop_mean
##      <dbl>
## 1 20.38286

Standard deviation of the sampling distribution

The standard deviation of the sampling distribution of means (also known as the standard error) provides an estimate of how much variability we can expect in the means as we go from one sample to another.

sample_means %>% summarize(std_error = sd(mean_f))
##   std_error
## 1  1.847808
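
As a rough check on a simulated standard error, the sketch below compares the standard deviation of many sample means with the textbook formula for sampling without replacement from a finite population, (sigma / sqrt(n)) * sqrt((N - n) / (N - 1)). Everything here is an assumption for illustration: the population `pop` is simulated (170 right-skewed values), not the global13 data, and base R's `sample()` and `replicate()` stand in for `resample()` and `do()`.

```r
set.seed(2016)

# Hypothetical right-skewed population (NOT the global13 data)
N   <- 170
pop <- rgamma(N, shape = 4, scale = 5)
n   <- 25

# Simulated sampling distribution: 5000 samples of size n, without replacement
sample_means <- replicate(5000, mean(sample(pop, size = n, replace = FALSE)))
se_simulated <- sd(sample_means)

# Theoretical standard error with the finite population correction
sigma      <- sqrt(mean((pop - mean(pop))^2))  # population sd (divides by N)
se_formula <- (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))

c(simulated = se_simulated, formula = se_formula)
```

The two values should agree closely, which suggests the simulation is estimating the same quantity the formula describes.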

How could we improve our estimate of the population mean?

  • Increase the sample size
  • Run more simulations (Instead of do()ing 200 we could do() 500, for example)
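
The two suggestions above act differently, which a small simulation can illustrate. This is a minimal sketch on a simulated population (hypothetical, not the global13 data); `sim_se()` is a helper name invented here, and base R's `sample()` and `replicate()` are used in place of `resample()` and `do()`.

```r
set.seed(2016)

# Hypothetical right-skewed population (NOT the global13 data)
pop <- rgamma(170, shape = 4, scale = 5)

# Helper (hypothetical name): sd of `reps` sample means for samples of size n
sim_se <- function(n, reps) {
  sd(replicate(reps, mean(sample(pop, size = n, replace = FALSE))))
}

se_n25_200  <- sim_se(n = 25,  reps = 200)   # same setup as the slides
se_n25_5000 <- sim_se(n = 25,  reps = 5000)  # more simulations, same n
se_n100     <- sim_se(n = 100, reps = 5000)  # larger sample size

c(n25_200reps = se_n25_200, n25_5000reps = se_n25_5000, n100 = se_n100)
```

A larger sample size shrinks the spread of the sample means. More simulations leave the spread roughly where it was and mainly make the simulated picture of the sampling distribution more stable.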

What if we didn't have the whole population?

We could resample one randomly selected sample, with replacement, to create a guess as to what the sampling distribution might look like.

Important note: The bootstrap distribution is an approximation of the sampling distribution.

The bootstrap distribution will always be centered near the original sample statistic rather than the population parameter, so it is better used as an estimate of the shape of the sampling distribution and as an estimate of the standard error.
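
The centering behavior can be seen in a small simulation. The sketch below assumes a simulated population and sample (hypothetical names `pop` and `samp`, not global13 or sample1) and uses base R's `sample()` with `replace = TRUE` as a stand-in for `resample()`.

```r
set.seed(2016)

# Hypothetical right-skewed population (NOT the global13 data)
pop <- rgamma(170, shape = 4, scale = 5)

# One random sample of size 25, playing the role of sample1
samp <- sample(pop, size = 25, replace = FALSE)

# Bootstrap: resample the sample WITH replacement, same size, many times
boot_means <- replicate(
  2000,
  mean(sample(samp, size = length(samp), replace = TRUE))
)

c(pop_mean    = mean(pop),
  sample_mean = mean(samp),
  boot_center = mean(boot_means))
```

The center of the bootstrap means tracks `sample_mean`, not `pop_mean`, so any gap between the sample and the population carries over into the bootstrap distribution.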

Bootstrap our original sample

  • Let's use our original sample from global13 which we called sample1.
  • Let's look to see what one such resample() of sample1 gives us:
(boot1 <- sample1 %>% resample(orig.id = TRUE))
## # A tibble: 25 x 3
##        country FEMALEOFFICE orig.id
##          <chr>        <dbl>   <chr>
## 1  Netherlands         39.1      16
## 2       Greece         14.7       4
## 3  Switzerland         27.2      19
## 4    Australia         29.7       9
## 5      Iceland         33.3       2
## 6     Colombia          9.7      25
## 7    Australia         29.7       9
## 8      Germany         31.1       6
## 9  Switzerland         27.2      19
## 10 Switzerland         27.2      19
## # ... with 15 more rows

Computing the bootstrap mean

boot1 %>% summarize(boot_mean = mean(FEMALEOFFICE))
## # A tibble: 1 x 1
##   boot_mean
##       <dbl>
## 1    24.104

Repeating this process

Repeating this process over and over again, we can create a bootstrap distribution from the means of the bootstrap samples.

boot_means <- do(200) * 
  sample1 %>% resample(orig.id = TRUE) %>% 
  summarize(boot_mean = mean(FEMALEOFFICE))

Plotting the bootstrap distribution

boot_means %>% ggplot(mapping = aes(x = boot_mean)) +
  geom_histogram(color = "white", bins = 10)

Using the bootstrap distribution

To estimate the population mean

boot_means %>% summarize(mean_boot_dist = mean(boot_mean))
##   mean_boot_dist
## 1       21.43202

To estimate the standard error

boot_means %>% summarize(sd_boot_dist = sd(boot_mean))
##   sd_boot_dist
## 1     1.716578

Summary

  • This activity was meant to show you how the concepts of sampling and resampling interact

  • Bootstrapping works by allowing us to create a guess as to what the sampling distribution looks like.
    • This allows us to test hypotheses and create confidence intervals on unknown population values.
    • It also allows us to not necessarily worry about formulas and mathematical probability.
  • It does still require us to have a random sample from our population as our original sample though. Why is that?
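
One common way to turn a bootstrap distribution into a confidence interval is the percentile method: take the middle 95% of the bootstrap statistics. The sketch below is an illustration under assumptions, not necessarily the method used later in this course; the sample `samp` is simulated and base R's `sample()`, `replicate()`, and `quantile()` are used.

```r
set.seed(2016)

# Hypothetical sample of 25 values (NOT the sample1 used above)
samp <- rgamma(25, shape = 4, scale = 5)

# Bootstrap distribution of the mean
boot_means <- replicate(
  2000,
  mean(sample(samp, size = length(samp), replace = TRUE))
)

# 95% percentile interval: the 2.5th and 97.5th percentiles
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```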

Reason to be concerned

  • Remember that the original population distribution was right-skewed. The mean is pulled toward the long right tail, so it may not be the best measure of center here and may be causing some problems.

  • Repeat the sampling and bootstrapping steps above, this time using the median and the IQR instead of the mean and standard deviation.

  • Discuss the results for Wednesday.
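
As a starting point for the exercise, the pattern below swaps `median()` in for `mean()` and `IQR()` in for `sd()`. It is only a sketch on a simulated sample (hypothetical `samp`, not sample1), written in base R; the same substitution should carry over to the `resample()` and `do()` code above.

```r
set.seed(2016)

# Hypothetical sample of 25 values (NOT the sample1 used above)
samp <- rgamma(25, shape = 4, scale = 5)

# Bootstrap distribution of the median
boot_medians <- replicate(
  2000,
  median(sample(samp, size = length(samp), replace = TRUE))
)

# Summarize center and spread with the median and IQR
c(center = median(boot_medians), spread = IQR(boot_medians))
```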

To do for next time