Problem 1

What does the select function do?

A. Pick rows based on conditions about their values
B. Sort the rows based on one or more variables
C. Choose variables/columns by their names
D. Choose observational units by their names

Problem 2

If you wanted to find the standard deviation for population in 1972, which code would be correct?

A. gap >%> filter(year == 1972) >%>
  summarize(sd_pop = sd(pop))

B. gap >%> filter(year = 1972) >%>
  summarize(sd_pop = sd(pop))

C. gap %>% filter(year == 1972) %>%
  summarize(sd_pop == sd(pop))

D. gap %>% filter(year == 1972) %>%
  summarize(sd_pop = sd(pop))

Problem 3

What does the filter function do?

A. Pick rows based on conditions about their values
B. Sort the rows based on one or more variables
C. Choose variables/columns by their names
D. Choose observational units by their names

Problem 4

What is faceting?

A. Creates small multiples of the same plot over a different categorical variable
B. Creates small multiples of the same plot over a different numerical variable
C. Create large multiples of the same plot over a different categorical variable
D. Create small variables of the same plot over a similar numerical variable

Problem 5

What is the pipe operator?

A. %>%
B. >%>
C. >>%
D. >>%>>

Problem 6

In a tidy data set, what does a row refer to?

A. The observational unit of the data set
B. The variables
C. There are no rows in tidy data sets, only in messy data sets
D. Each row represents a different observational unit

Problem 7

chr refers to what type of variables?

A. Categorical
B. Chronological
C. Continuous
D. Conclusive

Problem 8

What type of plot is usually preferred for an explanatory categorical variable and a response categorical variable?

A. Faceted boxplot
B. Side-by-side barplot
C. Stacked barplot
D. Faceted barplot

Problem 9

If you have produced a scatter plot, this means your data involves:

A. Explanatory data that is continuous and response data that is numerical
B. Multiple response values per explanatory value
C. A single response value per explanatory value
D. Both A and B

Problem 10

What does the %>% do?

A. %>% does nothing, you made this up.
B. Acts as a ~ similar to ggplot
C. Chains together dplyr functions
D. It is used similarly to the ! in the dplyr package

Problem 11

In a ‘Tidy Data’ set…

A. each observation forms a column, each variable forms a table, and each type of observational unit forms a row.
B. each variable forms a column, each observation forms a row, and each observational unit forms a table.
C. each observation forms a table, each variable forms a row, and each observational unit forms a column.
D. None of the above.

Problem 12

The first thing you should do when given a data set is to

A. count the columns and the rows.
B. use the ‘View’ function to view the data.
C. identify the observation unit, specify the variables, and give the types of variables you are presented with.
D. All of the above.

Problem 13

The ‘select’ function allows you to

A. Pick rows based on conditions about their values.
B. Sort the rows based on one or more variables.
C. Make a new variable in the data frame.
D. Choose variables/columns by their names.

Problem 14

The ‘filter’ function allows you to

A. Pick rows based on conditions about their values.
B. Choose variables/columns by their names.
C. Sort the rows based on one or more variables.
D. none of the above.

Problem 15

A scatter plot is most appropriate when

A. looking at the relationship between two categorical variables.
B. looking at the relationship between two continuous variables.
C. looking at the relationship between one continuous variable and one categorical variable.
D. looking at one continuous variable.

Problem 16

A barplot would best be utilized in the case of

A. One categorical predictor and one categorical response.
B. One numerical predictor and one numerical response.
C. One categorical predictor and one numerical response.
D. One categorical variable.

Problem 17

What produces the same type of plot as geom_point?

A. geom_alpha
B. geom_scattered
C. geom_jitter
D. Both A and C

Problem 18

Which of these is true?

A. != corresponds to ‘not equal to’
B. >= corresponds to ‘greater than’
C. > functions as ‘+’
D. <= labels a vector
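
Worked note: the comparison operators named in this problem can be checked directly in base R. The sketch below is only an illustration; the vector x and its values are hypothetical and not part of the problem set.

```r
# Comparison operators return TRUE or FALSE for each element compared.
x <- c(1, 5, 10)        # hypothetical example values

not_five   <- x != 5    # "not equal to"
at_least_5 <- x >= 5    # "greater than or equal to"
at_most_5  <- x <= 5    # "less than or equal to"
above_5    <- x > 5     # "greater than"

print(not_five)
print(at_least_5)
```

The same operators are what go inside filter() conditions when picking rows.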

Problem 19

Given the ‘weather’ dataset, what function would we use to choose only the variables of humidity and precipitation?

A. arrange(weather, humid, precip)
B. select(weather, humid, precip)
C. filter(weather, humid, precip)
D. summarize(weather, humid) %>% group_by(precip)

Problem 20

What is a useful feature of %>%?

A. It makes functions easier to read.
B. It saves us from confusing parentheses.
C. It emphasizes sequential breakdown.
D. All of the above.

Problem 21

What is the process of decomposing data frames into less redundant tables without losing information?

A. framing
B. decomposing
C. tabling
D. normalizing

Problem 22

What graph is most useful for a single categorical variable?

A. Barplot
B. Line-graph
C. Pie chart
D. Faceted barplot

Problem 23

What can we use in dplyr if we wish to chain together our code?

A. %>%
B. +
C. ()
D. >

Problem 24

How can we bring two different data frames together in dplyr?

A. inner_join()
B. select()
C. rename()
D. color()

Problem 25

Which of the following is not a part of the Five Main Verbs (FMV)?

A. Mutate
B. Summarize
C. Mean
D. Arrange

Problem 26

What package would you use the pipe in?

A. dplyr
B. ggplot
C. A and B
D. None of the above

Problem 27

What does the ‘mutate’ function do?

A. Makes a new variable in the data frame
B. Changes a specific variable slightly
C. Turns your variable into a mutant
D. Two of the above

Problem 28

What R code would you use to find the mean and standard deviation of temperature in the weather data set?

A. summarize(weather, mean = mean(temp), std_dev = sd(temp))
B. summarize(weather, mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE))
C. summarize(weather, mean = (mean = temp), std_dev = (sd = temp))
D. summarize(weather, mean = (mean == temp), std_dev = (sd == temp))

Problem 29

What does ! correspond to?

A. Less than or equal to
B. A very happy code
C. Not equal to
D. Greater than or equal to

Problem 30

What code will make a plot showing the number of Strongly Autocratic and Mildly Autocratic countries in Africa over the years, in each region? Color by subregion, with white borders.

A. gap %>% filter(region == "Africa") %>%
  filter(dem_rank == "Strongly Autocratic" | "Mildly Autocratic") %>%
  ggplot(mapping = aes(x = year, fill = subRegion)) +
  geom_histogram(position = "dodge", bins = 10, color = "white")

B. gap %>% filter(region == "Africa") %>%
  filter(dem_rank == "Strongly Autocratic" | dem_rank == "Mildly Autocratic") %>%
  ggplot(mapping = aes(x = year, fill = subRegion)) +
  geom_histogram(position = "dodge", bins = 10, color = "white")

C. gap %>% filter(region == Africa) %>%
  filter(dem_rank = "Strongly Autocratic" | dem_rank = "Mildly Autocratic") %>%
  ggplot(mapping = aes(x = year, fill = subRegion)) +
  geom_histogram(position = "dodge", bins = 10, color = "white")

D. gap %>% filter(region == "Africa") %>%
  filter(dem_rank == "Strongly Autocratic", "Mildly Autocratic") %>%
  ggplot(mapping = aes(x = dem_rank, fill = subRegion)) +
  geom_histogram(position = "dodge", bins = 10, color = "white")

Problem 31

Find the mean gdpPercap of each region for each year.

A. region_perCap <- gap %>% group_by(year) %>% summarize(mean_perCap = mean(gdpPercap))
B. region_perCap <- gap %>% summarize(region, year, mean_perCap = mean(gdpPercap))
C. region_perCap <- gap %>% group_by(region, year) %>% summarize(mean_perCap = mean(gdpPercap))
D. region_perCap <- gap %>% group_by(region) %>% summarize(mean_perCap = mean(gdpPercap))
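
Worked note: how grouping by more than one variable behaves can be seen on a tiny made-up data frame. This is a minimal sketch, assuming the dplyr package is installed; toy_gap and its values are hypothetical stand-ins, not the course's gap data.

```r
library(dplyr)

# Hypothetical toy data: two regions, two years, made-up values.
toy_gap <- data.frame(
  region    = c("Africa", "Africa", "Asia", "Asia"),
  year      = c(1952, 1952, 1952, 1957),
  gdpPercap = c(100, 300, 500, 700)
)

# Grouping by two variables yields one summary row per
# region-year combination that appears in the data.
region_means <- toy_gap %>%
  group_by(region, year) %>%
  summarize(mean_perCap = mean(gdpPercap), .groups = "drop")

print(region_means)
```

Here the two Africa rows for 1952 collapse into a single row, while each Asia year keeps its own row.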

Problem 32

Now plot the mean gdpPercap per year of each region, fill color by region.

A. gap %>% group_by(region) %>% summarize(mean_perCap = mean(gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()
B. gap %>% group_by(region, year) %>% summarize(mean_perCap = mean(gdpPercap)) %>% ggplot(mapping = aes(x = region, y = mean_perCap, fill = region)) + geom_boxplot()
C. gap %>% group_by(region, year) %>% summarize(mean_perCap = (gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()
D. gap %>% group_by(region, year) %>% summarize(mean_perCap = mean(gdpPercap)) %>% ggplot(mapping = aes(x = mean_perCap, y = region, fill = region)) + geom_boxplot()

Problem 33

What is the mean dep_delay of flights leaving NYC in 2013? (write the code to find the answer)

A. 12.63904
B. 12.54918
C. 12.54997
D. 12.63907

Problem 34

Create a data frame showing only American Airlines flights that were delayed. (no on-time flights or early arrivals!)

A. AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > 2)
B. AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > 0)
C. AA_delayed <- flights %>% filter(carrier == "AA", dep_delay > -1)
D. AA_delayed <- flights %>% filter(carrier == "AA", dep_delay < 0)

Problem 35

When presented with a dataset, what are the first couple things you should do?

A. Specify the variables
B. Identify the observation unit
C. Give the types of variables you are presented with
D. All of the above.

Problem 36

How is piping more beneficial than the other way of doing things that you saw throughout this chapter?

A. Saves you from confusing parentheses!
B. Emphasizes the sequential breaking down of tasks.
C. Makes it more readable.
D. All of the above.
E. None of the above.

Problem 37

What can be done when missing values from a data set need to be excluded from analysis?

A. narm = TRUE
B. na.rm = TRUE
C. na.rm = FALSE
D. na = TRUE
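
Worked note: the effect of missing values on summary functions can be checked in base R. The sketch below uses a hypothetical vector, temps, with made-up values; it is an illustration, not data from the course.

```r
# Hypothetical temperatures with one missing value.
temps <- c(50, NA, 70)

# By default, a single NA makes the result NA.
mean_default <- mean(temps)

# With na.rm = TRUE the missing value is dropped before computing.
mean_clean <- mean(temps, na.rm = TRUE)
sd_clean   <- sd(temps, na.rm = TRUE)

print(mean_default)
print(mean_clean)
```

The same argument can be passed to mean() and sd() inside summarize().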

Problem 38

If you wanted to remove the dem_rank variable from the gap data set, which verb/function would be most important to utilize?

A. select()
B. arrange()
C. filter()
D. mutate()

Problem 39

What are the critical functions you would need in order to find the mean for dem_rank in the year 1952 in the gap data frame?

A. select(), filter(), mean_demrank(), mean()
B. filter(), summarize(), mean(), na.rm = TRUE
C. filter(), summarize(), mean()
D. select(), summarize(), filter(), mean()

Problem 40

What do int, num, and chr stand for when describing values in a data table?

A. Integer, numeric, character
B. Interval, numeric, character
C. Integer, numeric, category
D. Interval, numeric, category

Problem 41

If there is an NA (missing value) in the data set, what should we include in the R code to exclude it?

A. na.rm = TRUE
B. na.rm = FALSE
C. n/a = TRUE
D. n/a = FALSE

Problem 42

If you were making a histogram, what would the final part of the code be to make the outline color black and the bars red?

A. geom_histogram(bins = 5, color = "black", fill = "red")
B. geom_histogram(bins = 5, color = black, fill = red)
C. geom_histogram(bins = 5, color = "red", fill = "black")
D. geom_histogram(bins = 5, color = red, fill = black)

Problem 43

In Chapter 5, one of the R chunks given is portland_flights <- filter(flights, dest == "PDX"). What is the portland_flights <- for?

A. That is to create and name the new data set.
B. It’s to look in the already created data set portland_flights.
C. It automatically extracts all the portland flights in the data set.
D. None of the above.

Problem 44

In the flights data set, how do you find the mean and standard deviation for all the air times of the flights?

A. tim_air_flights <- summarise(flights, mean = mean(air_time, na.rm = TRUE), std_dev = sd(air_time, na.rm = TRUE))
B. tim_air_flights <- summarize(flights, mean = mean(air_time, na.rm = FALSE), std_dev = sd(air_time, na.rm = FALSE))
C. tim_air_flights <- summarise(air_time, mean = mean(air_time), std_dev = sd(air_time))
D. tim_air_flights <- summarise(flights, mean = mean(air_time), std_dev = sd(air_time))

Problem 45

In a histogram, what does changing the number of bins from 25 to 50 do?

A. Shows how data is grouped into bins.
B. Gives a more detailed view of the distribution of values in the data frame.
C. Increases the amount of data included in the data frame.
D. All of the above.

Problem 46

What needs to be added to the following code in order to create a side-by-side bar plot?

ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +

A. geom_bar(position = dodge)
B. geom_sidebyside
C. geom_bar
D. geom_bar(position = "dodge")

Problem 47

What situation below favors the use of a scatter plot?

A. A graph showing relationship of two categorical variables.
B. The relationship of one continuous variable across different levels of one categorical variable.
C. The relationship between two continuous variables.
D. The relationship of one continuous variable and two categorical variables.

Problem 48

What are the Five Main Verbs (FMV)?

A. select, filter, group, change, rename.
B. select, filter, summarize, mutate, arrange.
C. filter, select, group_by, extract, mutate.
D. None of the above.

Problem 49

What is the proper AND THEN sentence for the following code?

named_freq_dests <- ten_freq_dests %>%
  inner_join(airports_small, by = c("dest" = "faa")) %>%
  rename(airport_name = name)
named_freq_dests

A. Make a ten_freq_dests and then join variables on new column and then rename column airport_name.
B. From the named_freq_dests data frame create a data frame ten_freq_dests and then join the columns to bring data frames together and then rename that new column from name to airport_name.
C. Create a new data frame and then join the variables into one large variable and then name it airport_name.
D. None of the above.

Problem 50

What makes a data set “tidy”?

A. Each variable forms a row, each observation forms a column, each observational unit forms the table.
B. Each variable forms a column, each observation forms a row, each observational unit forms the table.
C. The observational units form the columns, each observation forms a row, each observation forms the table.
D. Each table forms a column, each observation forms a row, each variable forms the table.

Problem 51

library(nycflights13)

What is the code for this graph?

A. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bin width = 10, color = "black", fill = "lightblue", na.rm = TRUE)
B. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins = 10, color = "black", fill = "lightblue", na.rm = TRUE)
C. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins = 10, color = black, fill = lightblue, na.rm = TRUE)
D. ggplot(data = weather, mapping = aes(x = temp)) + geom_boxplot(bins = 10, color = "black", fill = "lightblue", na.rm = TRUE)

Problem 52

Is the “weather” dataset tidy? How should we fix it if it’s not?

A. Yes, it’s tidy. It’s very neat and organized.
B. No, the date variables should be combined.
C. No, the wind variables should be combined.
D. No, the dataset is too big to be tidy.

Problem 53

What is the standard deviation for population in 1952?

A. gap %>% select(year == 1952) %>% summarize(sd_pop = sd(pop))
B. gap %>% filter(year == 1952) %>% summarize(sd_pop = sd(pop))
C. gap %>% summarize(year == 1952, sd_pop = sd(pop))
D. gap %>% filter(year = 1952) %>% summarize(sd_pop = sd(pop))

Problem 54

Select the “delay” variables from the “flights” dataset.

A. flights_delay <- filter(flights, contains("delay"))
B. flights_delay == select(flights, contains("delay"))
C. flights_delay <- select(flights, contains(delay))
D. flights_delay <- select(flights, contains("delay"))

Problem 55

What does the select() verb correspond with?

A. Values
B. Observational units
C. Rows
D. Columns by variable names

Problem 56

What does the str function do?

A. Straightens up data
B. Stores data
C. Sorts variables
D. Specifies variables

Problem 57

What are the five named graphs?

A. line, scatter, histogram, boxplot, barplot
B. line, scatter, histogram, faceted barplot, boxplot
C. line, scatter, histogram, pie, boxplot
D. line, scatter, histogram, faceted histogram, boxplot

Problem 58

Which is not an element of the Grammar of Graphics?

A. geom
B. aes
C. pin
D. stat

Problem 59

What does the pipe %>% stand for in words?

A. because
B. then this
C. and, then
D. plus

Problem 60

What are the five main verbs used in manipulating data?

A. select, filter, summarize, arrange, mutate
B. select, organize, arrange, summarize, change
C. library, filter, view, arrange, mutate
D. select, packages, summarize, arrange, change

Problem 61

Which of these is a correct aspect of tidy data:

A. Each observational unit forms a column
B. Each value forms a row
C. Two or more variables are in each column
D. Each variable forms a column

Problem 62

If I wanted to plot a numeric variable over a categorical variable, which plot(s) would I use?

A. Barplot
B. Scatterplot and linegraph
C. Stacked barplot
D. Faceted histogram and Boxplot

Problem 63

What plot would I use if I wanted to plot a single, DISCRETE variable?

A. Histogram
B. Boxplot
C. Scatterplot
D. Barplot

Problem 64

What is wrong with my code here?

ggplot(data = flights, x = carrier, fill = origin) + geom_bar()

A. origin is not in quotes
B. You can’t use fill with geom_bar
C. You should have used y instead of x
D. You forgot to wrap your variables in aes

Problem 65

What is the correct way to rename?

A. “New before, old after”
B. “Old before, new after”
C. “Just new before”
D. It doesn’t matter

Problem 66

Which of the following is FALSE?

A. Each variable forms a column
B. Each observation forms a row
C. Each type of observational unit forms a table
D. Each type of observational unit forms a column

Problem 67

Which code is correct for producing a histogram with the rain variable with specified bins and color with no warning messages?

A. ggplot(data = weather, mapping = aes(x = rain, y = humid)) + geom_histogram(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)
B. ggplot(data = weather, mapping = aes(x = rain)) + geom_boxplot(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)
C. ggplot(data = weather, mapping = aes(x = rain)) + geom_histogram(fill = "limegreen", color = "black", na.rm = TRUE)
D. ggplot(data = weather, mapping = aes(x = rain)) + geom_histogram(binwidth = 10, fill = "limegreen", color = "black", na.rm = TRUE)

Problem 68

What are the five named graphics (5NG)?

A. Histograms, Boxplots, Pie chart, Scatter plot, Line graph
B. Scatter plot, Histogram, Boxplot, Pie chart, Pictograph
C. Boxplot, Scatter plot, Line plot, Histogram, Barplot
D. None of the above

Problem 69

Which of the following is correct about the five main verbs?

A. select - Pick rows based on conditions about their values
B. arrange - Sort the rows based on one or more variables
C. mutate - Create summary measures of variables (or groups of observations on variables using group_by)
D. None of the above

Problem 70

What are the FMV we can use in piping?

A. mean, median, standard deviation, summarize, order
B. select, filter, summarize, mutate, arrange
C. order, top, mutate, change, mean
D. select, filter, find, arrange, summarize

Problem 71

Which of the following would give you an appropriate plot based on the variables?

A. Two continuous variables: bar plot
B. Two categorical variables: faceted histogram
C. One categorical variable: bar plot
D. One continuous variable: faceted box plot

Problem 72

What function will show you the mean and SD?

A. weather %>% filter(year == 2013) %>% summarize(mean_humid = mean(humid, na.rm = TRUE), sd_humid = sd(humid, na.rm = TRUE))
B. weather %>% filter summarize(mean_humid = mean(humid, na.rm = TRUE), sd_humid = sd(humid, na.rm = TRUE))
C. weather %>% filter(year == 2013) %>% summarize(mean_humid = mean(humid), sd_humid = sd(humid))
D. weather %>% filter(year == 2013) %>% summarize(mean_humid <- mean(humid, na.rm = TRUE), sd_humid <- sd(humid, na.rm = TRUE))

Problem 73

What does the ‘!’ do in an R chunk?

A. Adds emphasis and excitement!
B. It means “not equal to” and gives you everything except the specified variables
C. Ignores the variables selected with the ‘!’ function
D. There is no meaning to ‘!’

Problem 74

Create a pipe function using ‘GAP’ data to extract the top 10 populations.

A. Gap %>% filter(origin == "region") %>% top_n(n = 100)
B. filter(origin = "region") %>% top_n(n = 100, wt = pop)
C. Gap %>% filter(origin == "region") %>% top_n(100)
D. Gap %>% filter(origin == "region") %>% top_n(n = 100, wt = pop)

Problem 75

What is the purpose of %>%?

A. Does the same thing as adding a plus sign
B. can chain together dplyr functions
C. chain together plotting code
D. both B and C

Problem 76

How do you get the standard deviation?

A. by using “summarize”
B. by using “std_dev”
C. Using “gather”
D. None of the above

Problem 77

How do you create a new variable?

A. “mutate” function
B. “create” function
C. “summarize” function
D. Both A and B

Problem 78

What is not a graph we can make?

A. histograms
B. boxplots
C. barplots
D. scatter histograms

Problem 79

Is it possible to merge data frames?

A. No
B. Yes
C. Yes, but only if the data frames have the same number of rows.
D. Yes, but only if the data frames have the same number of columns.

All practice problems from PS10