STAT 19000: Project 3 — Fall 2020

Motivation: `data.frame`s are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.

Context: In the previous project we got our feet wet: we ran our first R code and learned about accessing data inside vectors. In this project we will continue to reinforce what we’ve already learned and introduce a new, flexible data structure called the `data.frame`.

Scope: R, data.frames, recycling, factors

Learning Objectives
  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.
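
As a small illustration of the first objective, the sketch below shows recycling on made-up vectors. The names x, y, recycled_sum, and every_other are chosen here for illustration and are not part of the project.

```r
# Made-up vectors for illustration only.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(10, 20)

# Arithmetic recycling: y is reused as 10, 20, 10, 20, 10, 20.
recycled_sum <- x + y
print(recycled_sum)

# Logical indexing also recycles: c(TRUE, FALSE) keeps every other element.
every_other <- x[c(TRUE, FALSE)]
print(every_other)
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, so it is worth checking lengths before relying on this behavior.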

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/disney

Questions

Question 1

Read the dataset /class/datamine/data/disney/splash_mountain.csv into a data.frame called splash_mountain. How many columns, or features, are in the dataset? How many rows, or observations?

Items to submit
  • R code used to solve the problem.

  • How many columns (features) and rows (observations) are in the dataset?
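
A minimal sketch of reading a csv file and checking its shape, assuming a tiny made-up file written to a temporary location. The file contents, the column names date and wait, and the object names toy_path and toy are hypothetical; the project itself uses the Scholar path given above.

```r
# Write a tiny, made-up csv to a temporary file so the example is self-contained.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("date,wait", "01/01/2015,5", "01/02/2015,10", "01/03/2015,15"),
  toy_path
)

# read.csv returns a data.frame; the first line of the file becomes the column names.
toy <- read.csv(toy_path)

print(dim(toy))   # number of rows, then number of columns
print(nrow(toy))  # rows (observations) only
print(ncol(toy))  # columns (features) only
str(toy)          # compact summary of each column and its type
```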

Question 2

Splash Mountain is a fan favorite ride at Disney World’s Magic Kingdom theme park. splash_mountain contains a series of dates and datetimes. For each datetime, splash_mountain contains a posted minimum wait time, SPOSTMIN, and an actual minimum wait time, SACTMIN. What is the average posted minimum wait time for Splash Mountain? What is the standard deviation? Based on the fact that SPOSTMIN represents the posted minimum wait time for our ride, do our mean and standard deviation make sense? Explain. (You might look ahead to Question 3 before writing the answer to Question 2.)

If you got NA or NaN as a result, see here.
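
One common reason for an NA result is that the vector itself contains missing values. The sketch below, on a made-up vector named toy_wait (not the project data), shows how the na.rm argument of mean and sd changes the result.

```r
# Made-up wait times with one missing value.
toy_wait <- c(5, 10, NA, 15, 10)

# By default, a single NA makes the whole summary NA.
default_mean <- mean(toy_wait)
print(default_mean)

# na.rm = TRUE removes missing values before computing the statistic.
toy_mean <- mean(toy_wait, na.rm = TRUE)
toy_sd <- sd(toy_wait, na.rm = TRUE)
print(toy_mean)
print(toy_sd)
```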

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences explaining why or why not the results make sense.

Question 3

In Question 2, we got some peculiar values for the mean and standard deviation. If you read the "attractions" tab in the file /class/datamine/data/disney/touringplans_data_dictionary.xlsx, you will find that -999 is used as a value in SPOSTMIN and SACTMIN to indicate that the ride was closed. Recalculate the mean and standard deviation of SPOSTMIN, excluding values that are -999. Does this seem to have fixed our problem?

Items to submit
  • R code used to solve this problem.

  • The result of running the R code.

  • A statement indicating whether or not the values look reasonable now.
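
Excluding a sentinel value is a natural use of logical indexing. The sketch below uses a made-up data.frame named toy with a hypothetical column wait; only the -999 code comes from the data dictionary described above.

```r
# Made-up data: -999 marks "closed", NA marks a missing reading.
toy <- data.frame(wait = c(10, -999, 20, NA, 30, -999))

# Without filtering, the sentinel drags the mean far below zero.
naive_mean <- mean(toy$wait, na.rm = TRUE)
print(naive_mean)

# Logical indexing keeps only the elements that are not -999.
# The NA element stays NA here, so na.rm = TRUE is still needed.
open_wait <- toy$wait[toy$wait != -999]
open_mean <- mean(open_wait, na.rm = TRUE)
open_sd <- sd(open_wait, na.rm = TRUE)
print(open_mean)
print(open_sd)
```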

Question 4

SPOSTMIN and SACTMIN aren’t the greatest feature/column names. An outsider looking at the data.frame wouldn’t be able to immediately get the gist of what they represent. Change SPOSTMIN to posted_min_wait_time and SACTMIN to actual_wait_time.

Hint: You can always use hard-coded integers to change names manually. However, if you use which, you can get the index of the column name that you would like to change. For data.frames like splash_mountain, this is a lot more efficient than manually counting which column is the one with a certain name.
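
The hint can be sketched as follows on a made-up data.frame. The object toy and the column names OLDNAME and new_name are hypothetical placeholders, not the names used in the project.

```r
# Made-up data.frame with one column whose name we want to change.
toy <- data.frame(
  date = c("01/01/2015", "01/02/2015"),
  OLDNAME = c(5, 10)
)

# which() returns the position(s) where the comparison is TRUE.
idx <- which(names(toy) == "OLDNAME")
print(idx)

# Assign the new name at that position; other column names are untouched.
names(toy)[idx] <- "new_name"
print(names(toy))
```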

Items to submit
  • R code used to solve the problem.

  • The output from executing names(splash_mountain) or colnames(splash_mountain).

Question 5

Use the cut function to create a new vector called quarter that breaks the date column up by quarter. Use the labels argument in the factor function to label the quarters "q1", "q2", …, "qX" where X is the last quarter. Add quarter as a column named quarter in splash_mountain. How many quarters are there?

If you have 2 years of data, this will result in 8 quarters: "q1", …, "q8".

We can generate sequential data using seq and paste0:

paste0("item", seq(1, 5))

or

paste0("item", 1:5)

Items to submit
  • R code used to solve the problem.

  • The head and tail of splash_mountain.

  • The number of quarters in the new quarter column.
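
A minimal sketch of the cut and factor steps, assuming a made-up data.frame named toy whose date column is stored as text in month/day/year form. The date format, the dates themselves, and the helper names toy_dates, by_quarter, and n_quarters are assumptions for illustration; the format of the real date column may differ.

```r
# Made-up dates spanning five consecutive quarters.
toy <- data.frame(
  date = c("01/15/2015", "04/02/2015", "08/20/2015", "11/05/2015", "02/10/2016")
)

# Convert text to Date objects; the format string must match how the dates are written.
toy_dates <- as.Date(toy$date, format = "%m/%d/%Y")

# cut() on Dates accepts breaks = "quarter" and returns a factor
# whose levels are the start dates of each quarter.
by_quarter <- cut(toy_dates, breaks = "quarter")
n_quarters <- nlevels(by_quarter)

# Relabel the levels as "q1", "q2", ... using factor() and paste0().
# Passing levels explicitly keeps the labels aligned even if a quarter is empty.
quarter <- factor(
  by_quarter,
  levels = levels(by_quarter),
  labels = paste0("q", seq_len(n_quarters))
)

# Add the new factor as a column.
toy$quarter <- quarter
print(toy)
print(n_quarters)
```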

Question 5 is intended to be a little more challenging, so we worked through the exact same steps with two other data sets. These worked examples step you through everything in Question 5. If you work through them, all you will need to do to solve Question 5 is follow the example and change two things: the data set itself (in the read.csv call) and the format of the date.

We hope that these are helpful resources for you! We appreciate you very much and we are here to support you! You would not know how to solve this question on your own, because we are just getting started, but we like to sometimes put in a question like this, in which you get introduced to several new things. We will dive deeper into these ideas as we push ahead.

Question 6

Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." If you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.