```{r setup, include=FALSE}
pkg <- c("tidyr", "dplyr", "ggplot2",
"knitr", "rmarkdown", "readr",
"DT","devtools", "broom")
new.pkg <- pkg[!(pkg %in% installed.packages())]
if (length(new.pkg)) {
install.packages(new.pkg, repos = "http://cran.rstudio.com")
}
lapply(pkg, library, character.only = TRUE)
if(!require(oilabs))
devtools::install_github("ismayc/oilabs")
options(digits = 5)
```
# Problem Statement
Students at Virginia Tech studied which vehicles come to a complete stop at an intersection with four-way
stop signs, selecting at random the cars to observe. They looked at several factors to see which (if any) were associated with coming to a
complete stop. (They defined a complete stop as “the speed of the vehicle will become zero at least for an
[instant]”). Some of these variables included the age of the driver, how many passengers were in the
vehicle, and type of vehicle. The variable we are going to investigate is the arrival position of vehicles
approaching an intersection all traveling in the same direction. They classified this arrival pattern into
three groups: whether the vehicle arrives alone, is the lead in a group of vehicles, or is a follower in a
group of vehicles. The students studied one specific intersection in Northern Virginia at a variety of
different times. Because random assignment was not used, this is an observational study. Also note that
no vehicle from one group is paired with a vehicle from another group. In other words, there is
independence between the different groups of vehicles. [Tweaked a bit from @isi2014 [p. 8-2 - 8-13]]
# Competing Hypotheses
## In words
- Null hypothesis: There is no association between the arrival position of the vehicle and whether
or not it comes to a complete stop.
- Alternative hypothesis: There is an association between the arrival position of the vehicle and
whether or not it comes to a complete stop.
## Another way in words
- Null hypothesis: The long-run probability that a single vehicle will stop is the **same** as the long-run
probability a lead vehicle will stop, which is the **same** as the long-run probability that a
following vehicle will stop. In other words all three long-run probabilities are actually the same.
- Alternative hypothesis: At least one of these parameter (long-run) probabilities is different from the others
## In symbols (with annotations)
- $H_0: p_{single} = p_{lead} = p_{follow}$, where $p$ represents the long-run probability a vehicle will stop.
- $H_A$: At least one of these parameter probabilities is different from the others
## Set $\alpha$
It's important to set the significance level before starting the testing using the data. Let's set the significance level at 5\% here.
# Exploring the sample data
--------------------------------------------------------------------------------------------------------
Single Vehicle Lead Vehicle Following Vehicle TOTAL
----------------------- --------------------- --------------- ---------------------- -------------------
Complete Stop 151 (0.858) 38 (0.905) 76 (0.776) **265**
Not Complete Stop 25 (0.142) 4 (0.095) 22 (0.224) **51**
**Total** **176** **42** **98** **_316_**
----------------------- --------------------- --------------- ---------------------- -------------------
Table: Observed Counts and (Conditional Probabilities)
```{r bargraph}
stop <- c(rep("complete", 265), rep("not_complete", 52))
vehicle_type <- c(rep("single", 151), rep("lead", 38), rep("follow", 76),
rep("single", 25), rep("lead", 5), rep("follow", 22))
df <- data.frame(stop, vehicle_type)
ggplot(data = df, mapping = aes(x = vehicle_type, fill = stop)) +
geom_bar(position = "fill", color = "black") +
xlab("\nArrival Position of Vehicle") +
ylab("Conditional Probability\n")
```
## Guess about statistical significance
We are looking to see if a difference exists in the heights of the bars corresponding to `complete`. Based solely on the picture, we have reason to believe that a difference exists since the `follow` bar seems to be lower than the other two by quite a big margin. BUT...it's important to use statistics to see if that difference is actually statistically significant!
# Check conditions
Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met.
1. _Independence_: Each case that contributes a count to the table must be independent of all the other cases in the table.
This condition is met since cars were selected at random to observe.
2. _Sample size_: Each cell count must have at least 5 expected cases.
This is met by observing the table above.
3. _Degrees of freedom_: We need 3 or more columns in the table.
This is met by observing the table above.
# Test statistic
The test statistic is a random variable based on the sample data. Here, we want to look for deviations from what we would expect cells in the table if the null hypothesis were true. This requires us to calculate expected counts via
$$\text{Expected Count}_{\text{row } i, \text{col } j} = \dfrac{\text{row } i \text{ total} \times \text{column } j \text{ total}}{\text{table total}}$$
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Single Vehicle Lead Vehicle Following Vehicle TOTAL
----------------------- ---------------------------------------- ---------------------------------------- ---------------------------------------- ----------------------------------------
Complete Stop 151 (147.59) 38 (35.22) 76 (82.18) **265**
Not Complete Stop 25 (28.41) 4 (6.78) 22 (15.82) **51**
**Total** **176** **42** **98** **_316_**
----------------------- ---------------------------------------- ---------------------------------------- ---------------------------------------- ----------------------------------------
Table: Observed Counts and (Expected Counts)
$X^2 = \sum_{\text{all cells in the table}} \dfrac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$
Assuming the conditions outlined above are met, $X^2 \sim \chi^2(df = (R - 1) \times (C - 1))$ where $R$ is the number of rows in the table and $C$ is the number of columns.
## Observed test statistic
While one could compute this observed test statistic by "hand", the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the `inference` function in the `oilabs` package to perform this analysis for us.
```{r infer}
inference(data = df, y = stop,
x = vehicle_type,
statistic = "proportion",
type = "ht",
alternative = "greater",
method = "theoretical",
show_eda_plot = FALSE,
show_inf_plot = FALSE)
```
We see here that the $x^2_{obs}$ value is around 4 with $df = (2 - 1)(3 - 1) = 2$.
# Compute $p$-value
The $p$-value---the probability of observing a $\chi^2_{df = 2}$ value of 4 or more in our null distribution---is (to one decimal place) 10%. This can also be calculated in R directly:
```{r pval}
1 - pchisq(3.9476, df = 2)
```
Note that we could also do this test directly without invoking the `inference` function using the `chisq.test` function.
```{r chisq.test}
chisq.test(x = table(df$vehicle_type, df$stop), correct = FALSE)
```
# State conclusion
We, therefore, do not have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference existed in the proportions was not backed up by this statistical analysis. We do not have evidence to suggest that there is a dependency between the arrival position of the vehicle and
whether or not it comes to a complete stop.
---