Problem 1

What does the select function do?

  • A. Pick rows based on conditions about their values
  • B. Sort the rows based on one or more variables
  • C. Choose variables/columns by their names
  • D. Choose observational units by their names

Problem 2

If you wanted to find the standard deviation for population in 1972, which code would be correct?

  • A.
gap >%> filter(year == 1972) >%>
  summarize(sd_pop = sd(pop))
  • B.
gap >%> filter(year = 1972) >%>
  summarize(sd_pop = sd(pop))
  • C.
gap %>% filter(year == 1972) %>%
  summarize(sd_pop == sd(pop))
  • D.
gap %>% filter(year == 1972) %>%
  summarize(sd_pop = sd(pop))

Problem 3

What does the filter function do?

  • A. Pick rows based on conditions about their values
  • B. Sort the rows based on one or more variables
  • C. Choose variables/columns by their names
  • D. Choose observational units by their names

Problem 4

What is faceting?

  • A. Creates small multiples of the same plot over a different categorical variable
  • B. Creates small multiples of the same plot over a different numerical variable
  • C. Create large multiples of the same plot over a different categorical variable
  • D. Create small variables of the same plot over a similar numerical variable

Problem 5

What is the pipe operator?

  • A. %>%
  • B. >%>
  • C. >>%
  • D. >>%>>

Problem 6

In a tidy data set, what does a row refer to?

  • A. The observational unit of the data set
  • B. The variables
  • C. There are no rows in tidy data sets, only in messy data sets
  • D. Each row represents a different observational unit

Problem 7

chr refers to what type of variables?

  • A. Categorical
  • B. Chronological
  • C. Continuous
  • D. Conclusive

Problem 8

What type of plot is usually preferred for an explanatory categorical variable and a response categorical variable?

  • A. Faceted boxplot
  • B. Side-by-side barplot
  • C. Stacked barplot
  • D. Faceted barplot

Problem 9

If you have produced a scatter plot, this means your data involves:

  • A. Explanatory data that is continuous and response data that is numerical
  • B. Multiple response values per explanatory value
  • C. A single response value per explanatory value
  • D. Both A and B

Problem 10

What does the %>% do?

  • A. %>% does nothing, you made this up.
  • B. Acts as a ~ similar to ggplot
  • C. Chains together dplyr functions
  • D. It is used similarly to the ! in the dplyr package

Problem 11

In a ‘Tidy Data’ set…

  • A. each observation forms a column, each variable forms a table, and each type of observational unit forms a row.
  • B. each variable forms a column, each observation forms a row, and each observational unit forms a table.
  • C. each observation forms a table, each variable forms a row, and each observational unit forms a column.
  • D. None of the above.

Problem 12

The first thing you should do when given a data set is to

  • A. count the columns and the rows.
  • B. use the ‘View’ function to view the data.
  • C. identify the observation unit, specify the variables, and give the types of variables you are presented with.
  • D. All of the above.

Problem 13

The ‘select’ function allows you to

  • A. Pick rows based on conditions about their values.
  • B. Sort the rows based on one or more variables.
  • C. Make a new variable in the data frame.
  • D. Choose variables/columns by their names.

Problem 14

The ‘filter’ function allows you to

  • A. Pick rows based on conditions about their values.
  • B. Choose variables/columns by their names.
  • C. Sort the rows based on one or more variables.
  • D. none of the above.

Problem 15

A scatter plot is most appropriate when

  • A. looking at the relationship between two categorical variables.
  • B. looking at the relationship between two continuous variables.
  • C. looking at the relationship between one continuous variable and one categorical variable.
  • D. looking at one continuous variable.

Problem 16

A barplot would best be utilized in the case of

  • A. One categorical predictor and one categorical response.
  • B. One numerical predictor and one numerical response.
  • C. One categorical predictor and one numerical response.
  • D. One categorical variable.

Problem 17

What produces the same type of plot as geom_point?

  • A. geom_alpha
  • B. geom_scattered
  • C. geom_jitter
  • D. Both A and C

Problem 18

Which of these is true?

  • A. != corresponds to ‘not equal to’
  • B. >= corresponds to ‘greater than’
  • C. > functions as ‘+’
  • D. <= labels a vector

Problem 19

Given the ‘weather’ dataset, what function would we use to choose only the variables of humidity and precipitation?

  • A. arrange(weather, humid, precip)
  • B. select(weather, humid, precip)
  • C. filter(weather, humid, precip)
  • D. summarize(weather, humid) %>% group_by(precip)

Problem 20

What is a useful function of the %>% ?

  • A. It makes functions easier to read.
  • B. It saves us from confusing parentheses.
  • C. It emphasizes sequential breakdown.
  • D. All of the above.

Problem 21

What is the process of decomposing data frames into less redundant tables without losing information?

  • A. framing
  • B. decomposing
  • C. tabling
  • D. normalizing

Problem 22

What graph is most useful for a single categorical variable?

  • A. Barplot
  • B. Line-graph
  • C. Pie chart
  • D. Faceted barplot

Problem 23

What can we use in dplyr if we wish to chain together our code?

  • A. %>%
  • B. +
  • C. ()
  • D. >

Problem 24

How can we bring two different data frames together in dplyr?

  • A. inner_join()
  • B. select()
  • C. rename()
  • D. color()

Problem 25

Which of the following is not a part of the Five Main Verbs (FMV)?

  • A. Mutate
  • B. Summarize
  • C. Mean
  • D. Arrange

Problem 26

What package would you use the pipe in?

  • A. dplyr
  • B. ggplot
  • C. A and B
  • D. None of the above

Problem 27

What does the ‘mutate’ function do?

  • A. Makes a new variable in the data frame
  • B. Changes a specific variable slightly
  • C. Turns your variable into a mutant
  • D. Two of the above

Problem 28

What R code would you use to find the mean and standard deviation of temperature in the weather data set?

  • A. summarize(weather, mean = mean(temp), std_dev = sd(temp))
  • B. summarize(weather, mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE))
  • C. summarize(weather, mean = (mean = temp), std_dev = (sd = temp))
  • D. summarize(weather, mean = (mean == temp), std_dev = (sd == temp))

Problem 29

What does ! correspond to?

  • A. Less than or equal to
  • B. A very happy code
  • C. Not equal to
  • D. Greater than or equal to

Problem 30

Which code will make a plot showing the number of Strongly Autocratic and Mildly Autocratic countries in Africa over the years, in each region? Color by subregion, with white borders.

  • A.
gap %>% filter(region == "Africa") %>%
  filter(dem_rank == "Strongly Autocratic" | "Mildly Autocratic") %>%
  ggplot(mapping = aes(x = year, fill = subRegion)) +
  geom_histogram(position = "dodge", bins = 10, color = "white")
  • B.
gap %>% filter(region == "Africa") %>%
  filter(dem_rank == "Strongly Autocratic" | dem_rank == "Mildly Autocratic") %>%
  ggplot(mapping = aes(x = year, fill = subRegion)) +
  geom_histogram(position = "dodge", bins = 10, color = "white")
  • C.
gap %>% filter(region == Africa) %>%
  filter(dem_rank = "Strongly Autocratic" | dem_rank = "Mildly Autocratic") %>%
  ggplot(mapping = aes(x = year, fill = subRegion)) +
  geom_histogram(position = "dodge", bins = 10, color = "white")
  • D.
gap %>% filter(region == "Africa") %>%
  filter(dem_rank == "Strongly Autocratic", "Mildly Autocratic") %>%
  ggplot(mapping = aes(x = dem_rank, fill = subRegion)) +
  geom_histogram(position = "dodge", bins = 10, color = "white")

Problem 31

Find the mean gdpPercap of each region for each year

  • A. region_perCap <- gap %>% group_by(year) %>% summarize(mean_perCap =mean(gdpPercap))
  • B. region_perCap <- gap %>% summarize(region, year, mean_perCap =mean(gdpPercap))
  • C. region_perCap <- gap %>% group_by(region, year) %>% summarize(mean_perCap =mean(gdpPercap))
  • D. region_perCap <- gap %>% group_by(region) %>% summarize(mean_perCap =mean(gdpPercap))

Problem 32

Now plot the mean gdpPercap per year of each region, fill color by region.

  • A. gap %>% group_by(region) %>% summarize(mean_perCap =mean(gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()
  • B. gap %>% group_by(region, year) %>% summarize(mean_perCap =mean(gdpPercap)) %>% ggplot(mapping = aes(x = region, y = mean_perCap, fill = region)) + geom_boxplot()
  • C. gap %>% group_by(region, year) %>% summarize(mean_perCap =(gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()
  • D. gap %>% group_by(region, year) %>% summarize(mean_perCap =mean(gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()

Problem 33

What is the mean dep_delay of flights leaving NYC in 2013? (write the code to find the answer)

  • A. 12.63904
  • B. 12.54918
  • C. 12.54997
  • D. 12.63907

Problem 34

Create a data frame showing only American Airlines flights that were delayed. (no on-time flights or early arrivals!)

  • A. AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > 2)
  • B. AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > 0)
  • C. AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > -1)
  • D. AA_delayed <- flights %>% filter(carrier == "AA", dep_delay < 0)

Problem 35

When presented with a dataset, what are the first couple of things you should do?

  • A. Specify the variables
  • B. Identify the observation unit
  • C. Give the types of variables you are presented with
  • D. All of the above.

Problem 36

How is piping more beneficial than the other approaches you saw throughout this chapter?

  • A. Saves you from confusing parentheses!
  • B. Emphasizes the sequential breaking down of tasks.
  • C. Makes it more readable.
  • D. All of the above.
  • E. None of the above.

Problem 37

What can be done when missing values from a data set need to be excluded from analysis?

  • A. narm = TRUE
  • B. na.rm = TRUE
  • C. na.rm = FALSE
  • D. na = TRUE

Problem 38

If you wanted to remove the dem_rank variable from the gap data set, which verb/function would be most important to utilize?

  • A. select()
  • B. arrange()
  • C. filter()
  • D. mutate()

Problem 39

What are critical functions you would need in order to find:

the mean for dem_rank in the year 1952 in the gap data frame?

  • A. select(), filter(), mean_demrank(), mean()
  • B. filter(), summarize(), mean(), na.rm = TRUE
  • C. filter(), summarize(), mean()
  • D. select(), summarize(), filter(), mean()

Problem 40

What do int, num, and chr stand for when describing values in a data table?

  • A. Integer, numeric, character
  • B. Interval, numeric, character
  • C. Integer, numeric, category
  • D. Interval, numeric, category

Problem 41

If there is an N/A in the data set, what should we include in the R code to get rid of it?

  • A. na.rm = TRUE
  • B. na.rm = FALSE
  • C. n/a = TRUE
  • D. n/a = FALSE

Problem 42

If you were making a histogram, what would the ending part of the code be to make the outline color black and the bars red?

  • A. geom_histogram(bins = 5, color = "black", fill = "red")
  • B. geom_histogram(bins = 5, color = black, fill = red)
  • C. geom_histogram(bins = 5, color = "red", fill = "black")
  • D. geom_histogram(bins = 5, color = red, fill = black)

Problem 43

In Chapter 5, one of the R chunks given is portland_flights <- filter(flights, dest == "PDX"). What is the portland_flights <- for?

  • A. That is to create and name the new data set.
  • B. It’s to look in the already created data set portland_flights.
  • C. It automatically extracts all the portland flights in the data set.
  • D. None of the above.

Problem 44

In the flights data set, how do you find the mean and standard deviation of the air times of all flights?

  • A. tim_air_flights <- summarise(flights, mean = mean(air_time, na.rm = TRUE), std_dev = sd(air_time, na.rm = TRUE))
  • B. tim_air_flights <- summarize(flights, mean = mean(air_time, na.rm = FALSE), std_dev = sd(air_time, na.rm = FALSE))
  • C. tim_air_flights <- summarise(air_time, mean = mean(air_time), std_dev = sd(air_time))
  • D. tim_air_flights <- summarise(flights, mean = mean(air_time), std_dev = sd(air_time))

Problem 45

In a histogram, what does changing the number of bins from 25 to 50 do to a data set?

  • A. Shows how data is grouped into bins.
  • B. Gives a more detailed view of the distribution of values in the data frame.
  • C. Increases the amount of data included in the data frame.
  • D. All of the above.

Problem 46

What needs to be added to the following code in order to create a side-by-side bar plot?

ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
  • A. geom_bar(position = dodge)
  • B. geom_sidebyside
  • C. geom_bar
  • D. geom_bar(position = "dodge")

Problem 47

What situation below favors the use of a scatter plot?

  • A. A graph showing relationship of two categorical variables.
  • B. The relationship of one continuous variable across different levels of one categorical variable.
  • C. The relationship between two continuous variables.
  • D. The relationship of one continuous variable and two categorical variables.

Problem 48

What are the Five Main Verbs (FMV)?

  • A. select, filter, group, change, rename.
  • B. select, filter, summarize, mutate, arrange.
  • C. filter, select, group_by, extract, mutate.
  • D. None of the above.

Problem 49

What is the proper AND THEN sentence for the following code?

named_freq_dests <- ten_freq_dests %>%
  inner_join(airports_small, by = c("dest" = "faa")) %>%
  rename(airport_name = name)
named_freq_dests
  • A. Make a ten_freq_dests and then join variables on new column and then rename column airport_name.
  • B. From the named_freq_dests data frame create a data frame ten_freq_dests and then join the columns to bring data frames together and then rename that new column from name to airport_name.
  • C. Create a new data frame and then join the variables into one large variable and then name it airport_name.
  • D. None of the above.

Problem 50

What makes a data set “tidy”?

  • A. Each variable forms a row, each observation forms a column, each observational unit forms the table.
  • B. Each variable forms a column, each observation forms a row, each observational unit forms the table.
  • C. The observational units form the columns, each observation forms a row, each observation forms the table.
  • D. Each table forms a column, each observation forms a row, each variable forms the table.

Problem 51

library(nycflights13)

What is the code for this graph?

  • A. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bin width=10,color="black", fill="lightblue", na.rm=TRUE)
  • B. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins=10,color="black", fill="lightblue", na.rm=TRUE)
  • C. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins=10,color=black, fill=lightblue, na.rm=TRUE)
  • D. ggplot(data = weather, mapping = aes(x = temp)) + geom_boxplot(bins=10,color="black", fill="lightblue", na.rm=TRUE)

Problem 52

Is the “weather” dataset tidy? How should we fix it if it’s not?

  • A. Yes, it’s tidy. It’s very neat and organized.
  • B. No, the date variables should be combined.
  • C. No, the wind variables should be combined.
  • D. No, the dataset is too big to be tidy.

Problem 53

What is the standard deviation for population in 1952?

  • A. gap %>% select(year == 1952) %>% summarize(sd_pop = sd(pop))
  • B. gap %>% filter(year == 1952) %>% summarize(sd_pop = sd(pop))
  • C. gap %>% summarize(year == 1952, sd_pop = sd(pop))
  • D. gap %>% filter(year = 1952) %>% summarize(sd_pop = sd(pop))

Problem 54

Select the delay variables from the flights dataset.

  • A. flights_delay <- filter(flights, contains("delay"))
  • B. flights_delay == select(flights, contains("delay"))
  • C. flights_delay <- select(flights, contains(delay))
  • D. flights_delay <- select(flights, contains("delay"))

Problem 55

What does the select() verb correspond with?

  • A. Values
  • B. Observational units
  • C. Rows
  • D. Columns by variable names

Problem 56

What does the str function do?

  • A. Straightens up data
  • B. Stores data
  • C. Sorts variables
  • D. Specifies variables

Problem 57

What are the five named graphs?

  • A. line, scatter, histogram, boxplot, barplot
  • B. line, scatter, histogram, faceted barplot, boxplot
  • C. line, scatter, histogram, pie, boxplot
  • D. line, scatter, histogram, faceted histogram, boxplot

Problem 58

Which is not an element of the Grammar of Graphics?

  • A. geom
  • B. aes
  • C. pin
  • D. stat

Problem 59

What does the pipe %>% stand for in words?

  • A. because
  • B. then this
  • C. and, then
  • D. plus

Problem 60

What are the five main verbs used in manipulating data?

  • A. select, filter, summarize, arrange, mutate
  • B. select, organize, arrange, summarize, change
  • C. library, filter, view, arrange, mutate
  • D. select, packages, summarize, arrange, change

Problem 61

Which of these is a correct aspect of tidy data:

  • A. Each observational unit forms a column
  • B. Each value forms a row
  • C. Two or more variables are in each column
  • D. Each variable forms a column

Problem 62

If I wanted to plot a numeric variable over a categorical variable, which plot(s) would I use?

  • A. Barplot
  • B. Scatterplot and linegraph
  • C. Stacked barplot
  • D. Faceted histogram and Boxplot

Problem 63

What plot would I use if I wanted to plot a single, DISCRETE variable?

  • A. Histogram
  • B. Boxplot
  • C. Scatterplot
  • D. Barplot

Problem 64

What is wrong with my code here?

ggplot(data = flights, x = carrier, fill = origin) + geom_bar()
  • A. origin is not in quotes
  • B. You can’t use fill with geom_bar
  • C. You should have used y instead of x
  • D. You forgot to wrap your variables in aes

Problem 65

What is the correct way to rename?

  • A. “New before, old after”
  • B. “Old before, new after”
  • C. “Just new before”
  • D. It doesn’t matter

Problem 66

Which of the following is FALSE?

  • A. Each variable forms a column
  • B. Each observation forms a row
  • C. Each type of observational unit forms a table
  • D. Each type of observational unit forms a column

Problem 67

Which code is correct for producing a histogram with the rain variable with specified bins and color with no warning messages?

  • A. ggplot(data = weather, mapping = aes(x = rain, y = humid)) + geom_histogram(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)
  • B. ggplot(data = weather, mapping = aes(x = rain)) + geom_boxplot(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)
  • C. ggplot(data = weather, mapping = aes(x = rain)) + geom_histogram(fill = "limegreen", color = "black", na.rm = TRUE)
  • D. ggplot(data = weather, mapping = aes(x = rain)) + geom_histogram(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)

Problem 68

What are the five named graphics (5NG)?

  • A. Histograms, Boxplots, Pie chart, Scatter plot, Line graph
  • B. Scatter plot, Histogram, Boxplot, Pie chart, Pictograph
  • C. Boxplot, Scatter plot, Line plot, Histogram, Barplot
  • D. None of the above

Problem 69

Which of the following is correct about the five main verbs?

  • A. select - Pick rows based on conditions about their values
  • B. arrange - Sort the rows based on one or more variables
  • C. mutate - Create summary measures of variables (or groups of observations on variables using group_by)
  • D. None of the above

Problem 70

What are the FMV we can use in piping?

  • A. mean, median, standard deviation, summarize, order
  • B. select, filter, summarize, mutate, arrange
  • C. order, top, mutate, change, mean
  • D. select, filter, find, arrange, summarize

Problem 71

Which of the following would give you an appropriate plot based on the variables?

  • A. two continuous variables - bar plot
  • B. two categorical - faceted histogram
  • C. one categorical - bar plot
  • D. one continuous variable - faceted box plot

Problem 72

Which code will show you the mean and SD?

  • A. weather %>% filter(year == 2013) %>% summarize(mean_humid = mean(humid, na.rm=TRUE), sd_humid = sd(humid, na.rm=TRUE))
  • B. weather %>% filter summarize(mean_humid = mean(humid, na.rm=TRUE), sd_humid = sd(humid, na.rm=TRUE))
  • C. weather %>% filter(year == 2013) %>% summarize(mean_humid = mean(humid), sd_humid = sd(humid))
  • D. weather %>% filter(year == 2013) %>% summarize(mean_humid <- mean(humid, na.rm=TRUE), sd_humid <- sd(humid, na.rm=TRUE))

Problem 73

What does the ‘!’ do in an R chunk?

  • A. adds emphasis and excitement!
  • B. it means “not equal to”; it gives you everything except the specific variables given
  • C. ignore the variables selected with the ‘!’ function
  • D. there is no meaning to ‘!’

Problem 74

Create a pipe function using ‘GAP’ data to extract the top 10 populations.

  • A. Gap %>% filter(origin == "region") %>% top_n(n = 100)
  • B. filter(origin = "region") %>% top_n(n = 100, wt = pop)
  • C. Gap %>% filter(origin == "region") %>% top_n(100)
  • D. Gap %>% filter(origin == "region") %>% top_n(n = 100, wt = pop)

Problem 75

What is the purpose of %>%?

  • A. Does the same thing as adding a plus sign
  • B. can chain together dplyr functions
  • C. chain together plotting code
  • D. both B and C

Problem 76

How do you get the standard deviation?

  • A. by using “summarize”
  • B. by using “std_dev”
  • C. Using “gather”
  • D. None of the above

Problem 77

How do you create a new variable?

  • A. “mutate” function
  • B. “create” function
  • C. “summarize” function
  • D. Both A and B

Problem 78

What is not a graph we can make?

  • A. histograms
  • B. boxplots
  • C. barplots
  • D. scatter histograms

Problem 79

Is it possible to merge data frames?

  • A. No
  • B. Yes
  • C. Yes, only if the data frames have the same number of rows.
  • D. Yes, only if the data frames have the same number of columns.
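
Review sketch

For review after working the problems, the verbs and the pipe that come up throughout can be tried on a small made-up data frame. This is only a minimal sketch: the data frame toy and its columns (name, group, score, doubled) are hypothetical and are not any of the data sets (gap, flights, weather) used above. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration only
toy <- data.frame(
  name  = c("a", "b", "c", "d"),
  group = c("x", "x", "y", "y"),
  score = c(10, NA, 30, 50),
  stringsAsFactors = FALSE
)

# Chain verbs with the pipe: each step feeds its result to the next
result <- toy %>%
  filter(!is.na(score)) %>%           # keep only rows with a non-missing score
  mutate(doubled = score * 2) %>%     # add a new column computed from score
  select(group, score, doubled) %>%   # keep only these three columns
  arrange(desc(score))                # sort rows from highest to lowest score

# Summarize by group, ignoring missing values inside mean()
summary_tbl <- toy %>%
  group_by(group) %>%
  summarize(mean_score = mean(score, na.rm = TRUE))

print(result)
print(summary_tbl)
```

Reading each %>% aloud as "and then" gives the sentence for the first chain: take toy, and then keep the rows with a score, and then add doubled, and then keep three columns, and then sort by score.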