Problem 1
What does the select function do?
- A. Pick rows based on conditions about their values
- B. Sort the rows based on one or more variables
- C. Choose variables/columns by their names
- D. Choose observational units by their names
Problem 2
If you wanted to find the standard deviation for population in 1972, which code would be correct?
- A. gap >%> filter(year == 1972) >%> summarize(sd_pop = sd(pop))
- B. gap >%> filter(year = 1972) >%> summarize(sd_pop = sd(pop))
- C. gap %>% filter(year == 1972) %>% summarize(sd_pop == sd(pop))
- D. gap %>% filter(year == 1972) %>% summarize(sd_pop = sd(pop))
Problem 3
What does the filter function do?
- A. Pick rows based on conditions about their values
- B. Sort the rows based on one or more variables
- C. Choose variables/columns by their names
- D. Choose observational units by their names
Problem 4
What is faceting?
- A. Creates small multiples of the same plot over a different categorical variable
- B. Creates small multiples of the same plot over a different numerical variable
- C. Create large multiples of the same plot over a different categorical variable
- D. Create small variables of the same plot over a similar numerical variable
Problem 5
What is the pipe operator?
Problem 6
In a tidy data set, what does a row refer to?
- A. The observational unit of the data set
- B. The variables
- C. There are no rows in tidy data sets, only in messy data sets
- D. Each row represents a different observational unit
Problem 7
chr refers to what type of variables?
- A. Categorical
- B. Chronological
- C. Continuous
- D. Conclusive
Problem 8
What type of plot is usually preferred for an explanatory categorical variable and a response categorical variable?
- A. Faceted boxplot
- B. Side-by-side barplot
- C. Stacked barplot
- D. Faceted barplot
Problem 9
If you have produced a scatter plot, this means your data involves:
- A. Explanatory data that is continuous and response data that is numerical
- B. Multiple response values per explanatory value
- C. A single response value per explanatory value
- D. Both A and B
Problem 10
What does the %>% do?
- A. %>% does nothing, you made this up.
- B. Acts as a ~ similar to ggplot
- C. Chains together dplyr functions
- D. It is used similarly to the ! in the dplyr package
Problem 11
In a ‘Tidy Data’ set…
- A. each observation forms a column, each variable forms a table, and each type of observational unit forms a row.
- B. each variable forms a column, each observation forms a row, and each observational unit forms a table.
- C. each observation forms a table, each variable forms a row, and each observational unit forms a column.
- D. None of the above.
Problem 12
The first thing you should do when given a data set is to
- A. count the columns and the rows.
- B. use the ‘View’ function to view the data.
- C. identify the observation unit, specify the variables, and give the types of variables you are presented with.
- D. All of the above.
Problem 13
The ‘select’ function allows you to
- A. Pick rows based on conditions about their values.
- B. Sort the rows based on one or more variables.
- C. Make a new variable in the data frame.
- D. Choose variables/columns by their names.
Problem 14
The ‘filter’ function allows you to
- A. Pick rows based on conditions about their values.
- B. Choose variables/columns by their names.
- C. Sort the rows based on one or more variables.
- D. None of the above.
Problem 15
A scatter plot is most appropriate when
- A. looking at the relationship between two categorical variables.
- B. looking at the relationship between two continuous variables.
- C. looking at the relationship between one continuous variable and one categorical variable.
- D. looking at one continuous variable.
Problem 16
A barplot would best be utilized in the case of
- A. One categorical predictor and one categorical response.
- B. One numerical predictor and one numerical response.
- C. One categorical predictor and one numerical response.
- D. One categorical variable.
Problem 17
What produces the same type of plot as geom_point?
- A. geom_alpha
- B. geom_scattered
- C. geom_jitter
- D. Both A and C
Problem 18
Which of these is true?
- A. != corresponds to ‘not equal to’
- B. >= corresponds to ‘greater than’
- C. > functions as ‘+’
- D. <= labels a vector
Problem 19
Given the ‘weather’ dataset, what function would we use to choose only the variables of humidity and precipitation?
- A. arrange(weather, humid, precip)
- B. select(weather, humid, precip)
- C. filter(weather, humid, precip)
- D. summarize(weather, humid) %>% group_by(precip)
Problem 20
What is a useful function of the %>% ?
- A. It makes functions easier to read.
- B. It saves us from confusing parentheses.
- C. It emphasizes sequential breakdown.
- D. All of the above.
Problem 21
What is the process of decomposing frames into less redundant tables without losing info?
- A. framing
- B. decomposing
- C. tabling
- D. normalizing
Problem 22
What graph is most useful for a single categorical variable?
- A. Barplot
- B. Line-graph
- C. Pie chart
- D. Faceted barplot
Problem 23
What can we do in dplyr if we wish to chain together our codes?
- A. %>%
- B. +
- C. ()
- D. >
Problem 24
How can we bring two different data frames together in dplyr?
- A. inner_join()
- B. select()
- C. rename()
- D. color()
Problem 25
Which of the following is not a part of the Five Main Verbs (FMV)?
- A. Mutate
- B. Summarize
- C. Mean
- D. Arrange
Problem 26
What package would you use the pipe in?
- A. dplyr
- B. ggplot
- C. A and B
- D. None of the above
Problem 27
What does the ‘mutate’ function do?
- A. Makes a new variable in the data frame
- B. Changes a specific variable slightly
- C. Turns your variable into a mutant
- D. Two of the above
Problem 28
What R code would you use to find the mean and standard deviation of temperature in the weather data set?
- A. summarize(weather, mean = mean(temp), std_dev = sd(temp))
- B. summarize(weather, mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE))
- C. summarize(weather, mean = (mean = temp), std_dev = (sd = temp))
- D. summarize(weather, mean = (mean == temp), std_dev = (sd == temp))
Problem 29
What does ! correspond to?
- A. Less than or equal to
- B. A very happy code
- C. Not equal to
- D. Greater than or equal to
Problem 30
What code will make a plot showing the number of Strongly Autocratic and Mildly Autocratic countries in Africa over year, in each region? Color by subregion, with white borders.
- A.
gap %>% filter(region == "Africa") %>%
filter(dem_rank == "Strongly Autocratic" | "Mildly Autocratic") %>%
ggplot(mapping = aes(x = year, fill = subRegion)) +
geom_histogram(position = "dodge", bins = 10, color = "white")
- B.
gap %>% filter(region == "Africa") %>%
filter(dem_rank == "Strongly Autocratic" | dem_rank == "Mildly Autocratic") %>%
ggplot(mapping = aes(x = year, fill = subRegion)) +
geom_histogram(position = "dodge", bins = 10, color = "white")
- C.
gap %>% filter(region == Africa) %>%
filter(dem_rank = "Strongly Autocratic" | dem_rank = "Mildly Autocratic") %>%
ggplot(mapping = aes(x = year, fill = subRegion)) +
geom_histogram(position = "dodge", bins = 10, color = "white")
- D.
gap %>% filter(region == "Africa") %>%
filter(dem_rank == "Strongly Autocratic", "Mildly Autocratic") %>%
ggplot(mapping = aes(x = dem_rank, fill = subRegion)) +
geom_histogram(position = "dodge", bins = 10, color = "white")
Problem 31
Find the mean gdpPercap of each region for each year
- A. region_perCap <- gap %>% group_by(year) %>% summarize(mean_perCap =mean(gdpPercap))
- B. region_perCap <- gap %>% summarize(region, year, mean_perCap =mean(gdpPercap))
- C. region_perCap <- gap %>% group_by(region, year) %>% summarize(mean_perCap =mean(gdpPercap))
- D. region_perCap <- gap %>% group_by(region) %>% summarize(mean_perCap =mean(gdpPercap))
Problem 32
Now plot the mean gdpPercap per year of each region, fill color by region.
- A. gap %>% group_by(region) %>% summarize(mean_perCap =mean(gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()
- B. gap %>% group_by(region, year) %>% summarize(mean_perCap =mean(gdpPercap)) %>% ggplot(mapping = aes(x = region, y = mean_perCap, fill = region)) + geom_boxplot()
- C. gap %>% group_by(region, year) %>% summarize(mean_perCap =(gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()
- D. gap %>% group_by(region, year) %>% summarize(mean_perCap =mean(gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()
Problem 33
What is the mean dep_delay of flights leaving NYC in 2013? (write the code to find the answer)
- A. 12.63904
- B. 12.54918
- C. 12.54997
- D. 12.63907
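As a minimal sketch of the kind of code this problem asks for, the example below computes a mean on a small made-up table rather than on the real flights data. The data frame toy_flights and its values are hypothetical and only illustrate how missing values affect summarize(); running the same pattern on flights from nycflights13 is left to the reader.

```r
library(dplyr)

# Hypothetical toy data standing in for a flights-style table; values are made up.
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(5, NA, -3, 20, 0)
)

# Without na.rm = TRUE, a single missing value makes the mean NA.
with_na <- toy_flights %>%
  summarize(mean_delay = mean(dep_delay))

# With na.rm = TRUE, missing values are dropped before averaging.
toy_summary <- toy_flights %>%
  summarize(mean_delay = mean(dep_delay, na.rm = TRUE))

with_na
toy_summary
```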
Problem 34
Create a data frame showing only American Airlines flights that were delayed. (no on-time flights or early arrivals!)
- A.
AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > 2)
- B.
AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > 0)
- C.
AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > -1)
- D.
AA_delayed <- flights %>% filter(carrier == "AA", dep_delay < 0)
Problem 35
When presented with a dataset, what are the first couple of things you should do?
- A. Specify the variables
- B. Identify the observation unit
- C. Give the types of variables you are presented with
- D. All of the above.
Problem 36
Why can piping be more beneficial than the non-piped way of doing things that you saw throughout this chapter?
- A. Saves you from confusing parentheses!
- B. Emphasizes the sequential breaking down of tasks.
- C. Makes it more readable.
- D. All of the above.
- E. None of the above.
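For comparison, here is a small sketch that writes the same computation twice, once with nested function calls and once with the pipe. The data frame toy_gap and its values are hypothetical stand-ins, not the course's gap data.

```r
library(dplyr)

# Hypothetical toy data; the column names echo the gap data but the values are made up.
toy_gap <- data.frame(
  year = c(1952, 1952, 1972, 1972),
  pop  = c(10, 20, 30, 50)
)

# Nested form: read from the inside out.
nested <- summarize(filter(toy_gap, year == 1972), sd_pop = sd(pop))

# Piped form: read top to bottom as "take toy_gap, then filter, then summarize".
piped <- toy_gap %>%
  filter(year == 1972) %>%
  summarize(sd_pop = sd(pop))

# Both forms compute the same summary.
all.equal(nested, piped)
```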
Problem 37
What can be done when missing values from a data set need to be excluded from analysis?
- A. narm = TRUE
- B. na.rm = TRUE
- C. na.rm = FALSE
- D. na = TRUE
Problem 38
If you wanted to remove the dem_rank variable from the gap data set, which verb/function would be most important to utilize?
- A. select()
- B. arrange()
- C. filter()
- D. mutate()
Problem 39
What are critical functions you would need in order to find the mean for dem_rank in the year 1952 in the gap data frame?
- A. select(), filter(), mean_demrank(), mean()
- B. filter(), summarize(), mean(), na.rm = TRUE
- C. filter(), summarize(), mean()
- D. select(), summarize(), filter(), mean()
Problem 40
What do int, num, and chr stand for when describing values in a data table?
- A. Integer, numeric, character
- B. Interval, numeric, character
- C. Integer, numeric, category
- D. Interval, numeric, category
Problem 41
If there is an N/A in the data set, what should we include in the R code to get rid of it?
- A. na.rm = TRUE
- B. na.rm = FALSE
- C. n/a = TRUE
- D. n/a = FALSE
Problem 42
If you were making a histogram, what would the ending part of the code be to make the outline color black and the bars red?
- A. geom_histogram(bins = 5, color = "black", fill = "red")
- B. geom_histogram(bins = 5, color = black, fill = red)
- C. geom_histogram(bins = 5, color = "red", fill = "black")
- D. geom_histogram(bins = 5, color = red, fill = black)
Problem 43
In Chapter 5, one of the R chunks given is portland_flights <- filter(flights, dest == "PDX"). What is the portland_flights <- for?
- A. That is to create and name the new data set.
- B. It’s to look in the already created data set portland_flights.
- C. It automatically extracts all the portland flights in the data set.
- D. None of the above.
Problem 44
In the flights data set, how do you find the mean and standard deviation for all the air times of the flights?
- A. tim_air_flights <- summarise(flights, mean = mean(air_time, na.rm = TRUE), std_dev = sd(air_time, na.rm = TRUE))
- B. tim_air_flights <- summarize(flights, mean = mean(air_time, na.rm = FALSE), std_dev = sd(air_time, na.rm = FALSE))
- C. tim_air_flights <- summarise(air_time, mean = mean(air_time), std_dev = sd(air_time))
- D. tim_air_flights <- summarise(flights, mean = mean(air_time), std_dev = sd(air_time))
Problem 45
In a histogram, what does changing the number of bins from 25 to 50 do?
- A. Shows how data is grouped into bins.
- B. Gives a more detailed view of the distribution of values in the data frame.
- C. Increase the amount of the data included in the data frame.
- D. All of the above.
Problem 46
What needs to be added to the following code in order to create a side-by-side bar plot?
ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
- A. geom_bar(position = dodge)
- B. geom_sidebyside
- C. geom_bar
- D. geom_bar(position = "dodge")
Problem 47
What situation below favors the use of a scatter plot?
- A. A graph showing relationship of two categorical variables.
- B. The relationship of one continuous variable across different levels of one categorical variable.
- C. The relationship between two continuous variables.
- D. The relationship of one continuous variable and two categorical variables.
Problem 48
What are the FMV (five main verbs)?
- A. select, filter, group, change, rename.
- B. select, filter, summarize, mutate, arrange.
- C. filter, select, group_by, extract, mutate.
- D. None of the above.
Problem 49
What is the proper AND THEN sentence for the following code?
named_freq_dests <- ten_freq_dests %>%
inner_join(airports_small, by = c("dest" = "faa")) %>%
rename(airport_name = name)
named_freq_dests
- A. Make a ten_freq_dests and then join variables on new column and then rename column airport_name.
- B. From the named_freq_dests data frame create a data frame ten_freq_dests and then join the columns to bring data frames together and then rename that new column from name to airport_name.
- C. Create a new data frame and then join the variables into one large variable and then name it airport_name.
- D. None of the above.
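To see what each step of a pipeline like this does, the sketch below runs the same two verbs on two tiny made-up tables. The names toy_dests, toy_airports, and toy_named and all values are hypothetical; only the by = c("dest" = "faa") key mapping and the rename() call mirror the code in the problem.

```r
library(dplyr)

# Hypothetical toy tables; column names mirror the problem, values are made up.
toy_dests <- data.frame(
  dest = c("ORD", "ATL", "ZZZ"),
  n    = c(3, 2, 1)
)
toy_airports <- data.frame(
  faa  = c("ORD", "ATL", "LAX"),
  name = c("Chicago O'Hare", "Atlanta", "Los Angeles")
)

toy_named <- toy_dests %>%
  # Match dest in the left table to faa in the right table; unmatched rows are dropped.
  inner_join(toy_airports, by = c("dest" = "faa")) %>%
  # rename() takes new_name = old_name.
  rename(airport_name = name)

toy_named
```

Note that the two toy tables have different numbers of rows and columns, and only the destinations found in both tables survive the inner join.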
Problem 50
What makes a data set “tidy”?
- A. Each variable forms a row, each observation forms a column, each observational unit forms the table.
- B. Each variable forms a column, each observation forms a row, each observational unit forms the table.
- C. The observational units form the columns, each observation forms a row, each observation forms the table.
- D. Each table forms a column, each observation forms a row, each variable forms the table.
Problem 51
library(nycflights13)
What is the code for this graph?
- A. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bin width = 10, color = "black", fill = "lightblue", na.rm = TRUE)
- B. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins = 10, color = "black", fill = "lightblue", na.rm = TRUE)
- C. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins = 10, color = black, fill = lightblue, na.rm = TRUE)
- D. ggplot(data = weather, mapping = aes(x = temp)) + geom_boxplot(bins = 10, color = "black", fill = "lightblue", na.rm = TRUE)
Problem 52
Is the “weather” dataset tidy? How should we fix it if it’s not?
- A. Yes, it’s tidy. It’s very neat and organized.
- B. No, the date variables should be combined.
- C. No, the wind variables should be combined.
- D. No, the dataset is too big to be tidy.
Problem 53
What is the standard deviation for population in 1952?
- A.
gap %>% select(year == 1952) %>% summarize(sd_pop = sd(pop))
- B.
gap %>% filter(year == 1952) %>% summarize(sd_pop = sd(pop))
- C.
gap %>% summarize(year == 1952, sd_pop = sd(pop))
- D.
gap %>% filter(year = 1952) %>% summarize(sd_pop = sd(pop))
Problem 54
Select the delay variables from the flights dataset.
- A. flights_delay <- filter(flights, contains("delay"))
- B. flights_delay == select(flights, contains("delay"))
- C. flights_delay <- select(flights, contains(delay))
- D. flights_delay <- select(flights, contains("delay"))
Problem 55
What does the select() verb correspond with?
- A. Values
- B. Observational units
- C. Rows
- D. Columns by variable names
Problem 56
What does the str function do?
- A. Straightens up data
- B. Stores data
- C. Sorts variables
- D. Specifies variables
Problem 57
What are the five named graphs?
- A. line, scatter, histogram, boxplot, barplot
- B. line, scatter, histogram, faceted barplot, boxplot
- C. line, scatter, histogram, pie, boxplot
- D. line, scatter, histogram, faceted histogram, boxplot
Problem 58
Which is not an element of the Grammar of Graphics?
- A. geom
- B. aes
- C. pin
- D. stat
Problem 59
What does the pipe %>% stand for in words?
- A. because
- B. then this
- C. and, then
- D. plus
Problem 60
What are the five main verbs used in manipulating data?
- A. select, filter, summarize, arrange, mutate
- B. select, organize, arrange, summarize, change
- C. library, filter, view, arrange, mutate
- D. select, packages, summarize, arrange, change
Problem 61
Which of these is a correct aspect of tidy data:
- A. Each observational unit forms a column
- B. Each value forms a row
- C. Two or more variables are in each column
- D. Each variable forms a column
Problem 62
If I wanted to plot a numeric variable over a categorical variable, which plot(s) would I use?
- A. Barplot
- B. Scatterplot and linegraph
- C. Stacked barplot
- D. Faceted histogram and Boxplot
Problem 63
What plot would I use if I wanted to plot a single, DISCRETE variable?
- A. Histogram
- B. Boxplot
- C. Scatterplot
- D. Barplot
Problem 64
What is wrong with my coding here:
ggplot(data = flights, x = carrier, fill = origin) + geom_bar()
- A. origin is not in quotes
- B. You can’t use fill with geom_bar
- C. You should have used y instead of x
- D. You forgot to wrap your variables in aes
Problem 65
What is the correct way to rename?
- A. “New before, old after”
- B. “Old before, new after”
- C. “Just new before”
- D. It doesn’t matter
Problem 66
Which of the following is FALSE?
- A. Each variable forms a column
- B. Each observation forms a row
- C. Each type of observational unit forms a table
- D. Each type of observational unit forms a column
Problem 67
Which code is correct for producing a histogram with the rain variable with specified bins and color with no warning messages?
- A. ggplot(data = weather, mapping = aes(x = rain, y = humid)) + geom_histogram(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)
- B. ggplot(data = weather, mapping = aes(x = rain)) + geom_boxplot(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)
- C. ggplot(data = weather, mapping = aes(x = rain)) + geom_histogram(fill = "limegreen", color = "black", na.rm = TRUE)
- D. ggplot(data = weather, mapping = aes(x = rain)) + geom_histogram(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)
Problem 68
What are the five named graphics (5NG)?
- A. Histograms, Boxplots, Pie chart, Scatter plot, Line graph
- B. Scatter plot, Histogram, Boxplot, Pie chart, Pictograph
- C. Boxplot, Scatter plot, Line plot, Histogram, Barplot
- D. None of the above
Problem 69
Which of the following is correct about the five main verbs?
- A. select - Pick rows based on conditions about their values
- B. arrange - Sort the rows based on one or more variables
- C. mutate - Create summary measures of variables (or groups of observations on variables using group_by)
- D. None of the above
Problem 70
What are the FMV we can use in piping?
- A. mean, median, standard deviation, summarize, order
- B. select, filter, summarize, mutate, arrange
- C. order, top, mutate, change, mean
- D. select, filter, find, arrange, summarize
Problem 71
Which of the following would give you an appropriate plot based on the variables?
- A. two continuous variables - bar plot
- B. two categorical variables - faceted histogram
- C. one categorical variable - bar plot
- D. one continuous variable - faceted box plot
Problem 72
What function will show you the mean and SD?
- A. weather %>% filter(year == 2013) %>% summarize(mean_humid = mean(humid, na.rm=TRUE), sd_humid = sd(humid, na.rm=TRUE))
- B. weather %>% filter summarize(mean_humid = mean(humid, na.rm=TRUE), sd_humid = sd(humid, na.rm=TRUE))
- C. weather %>% filter(year == 2013) %>% summarize(mean_humid = mean(humid), sd_humid = sd(humid))
- D. weather %>% filter(year == 2013) %>% summarize(mean_humid <- mean(humid, na.rm=TRUE), sd_humid <- sd(humid, na.rm=TRUE))
Problem 73
What does the ‘!’ do in an R chunk?
- A. adds emphasis and excitement!
- B. it means “not equal to” and gives you everything except those specific variables given
- C. ignore the variables selected with the ‘!’ function
- D. there is no meaning to ‘!’
Problem 74
Create a pipe function using ‘GAP’ data to extract the top 10 populations.
- A. Gap %>% filter(origin == "region") %>% top_n(n = 100)
- B. filter(origin = "region") %>% top_n(n = 100, wt = pop)
- C. Gap %>% filter(origin == "region") %>% top_n(100)
- D. Gap %>% filter(origin == "region") %>% top_n(n = 100, wt = pop)
Problem 75
What is the purpose of %>%?
- A. Does the same thing as adding a plus sign
- B. can chain together dplyr functions
- C. chain together plotting code
- D. both B and C
Problem 76
How do you get the standard deviation?
- A. by using “summarize”
- B. by using “std_dev”
- C. Using “gather”
- D. None of the above
Problem 77
How do you create a new variable?
- A. “mutate” function
- B. “create” function
- C. “summarize” function
- D. Both A and B
Problem 78
What is not a graph we can make?
- A. histograms
- B. boxplots
- C. barplots
- D. scatter histograms
Problem 79
Is it possible to merge data frames?
- A. No
- B. Yes
- C. Yes, only if the data frames have the same number of rows.
- D. Yes, only if the data frames have the same number of columns.