What is “Tidy Data”?

Note

FYI: Here is the official page for tidy data. That page is based on the original paper describing tidy data, “Tidy Data”, which Hadley Wickham wrote for the Journal of Statistical Software.

Tidy data is a standard way of organizing data that makes it easy to work with. The concept was popularized by the tidyverse in R. The tidyverse is, in a sense, a “library” of several packages that are designed to work together and to work with tidy data. It includes tools for reading in data (the readr package), manipulating data (the dplyr package), reshaping data (the tidyr package), and visualizing data (the ggplot2 package), among others. As you learn more about R, you will encounter several of these packages. Because these tools expect tidy data, it is important to understand what tidy data is and how to organize data in a tidy format.


Hadley Wickham, the author of the tidyverse packages, defines tidy data as follows:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

The purpose of creating tidy data sets is to make data easier to analyze and visualize. Tidy data follows a consistent structure, so if we know that a data set is tidy, we can use the tidyverse packages to work with it.

This consistency makes it easier to read in, clean, transform, and visualize data. Because every tidy data set has the same layout, we can use the same tools each time and don’t have to learn new tools for each new data set.

When data is not tidy, it can be difficult to work with. For example, if the values of a single variable are spread across multiple columns, it can be difficult to analyze and visualize the data, and you may have to rewrite your code or create a completely new script for each data set. Organizing data in a tidy format avoids this because the tidyverse packages are built to work with data formatted as “tidy”.

Example

First, make sure the tidyverse is installed and loaded:

# If needed, install the tidyverse package:
# install.packages("tidyverse")

# Load the tidyverse package so its functions are available:
library(tidyverse)


Let’s create a data frame that is not tidy. We will create a data frame with three rows and four columns. The first column contains the names of three people, and the other columns contain data for the years 2010, 2011, and 2012. Each row represents a person, and each year column holds that person’s age during that year.

df <- tibble(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  `2010` = c(25, 30, 35),
  `2011` = c(26, 31, 36),
  `2012` = c(27, 32, 37)
)

df
# A tibble: 3 × 4
  name         `2010` `2011` `2012`
  <chr>         <dbl>  <dbl>  <dbl>
1 John Smith       25     26     27
2 Jane Doe         30     31     32
3 Mary Johnson     35     36     37


In this case, the data is not tidy because the values of one variable, the year, are stored as column names, and the ages are spread across those three columns. In order for this to be considered “tidy” data, we would need to think about the data in a different way.

The variables we are using are name, year, and age. In order for the data to be tidy, we want each observation (row) to contain a name, a year, and the age. This tells us we would need to have a column for each of these variables. In this case, we would need to have a column for the name of the person, a column for the year that the data was collected, and a column for the age of the person the year the data was collected.

Here is what the tidy data would look like if we ordered the data in a tidy format by name, year, and age:

df_tidy <- tibble(
  name = c("John Smith", "John Smith", "John Smith", "Jane Doe", "Jane Doe", 
           "Jane Doe", "Mary Johnson", "Mary Johnson", "Mary Johnson"),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012),
  age = c(25, 26, 27, 30, 31, 32, 35, 36, 37)
)

df_tidy
# A tibble: 9 × 3
  name          year   age
  <chr>        <dbl> <dbl>
1 John Smith    2010    25
2 John Smith    2011    26
3 John Smith    2012    27
4 Jane Doe      2010    30
5 Jane Doe      2011    31
6 Jane Doe      2012    32
7 Mary Johnson  2010    35
8 Mary Johnson  2011    36
9 Mary Johnson  2012    37

We have now cleaned the data so that we can work with it in a tidy format.

As an example of how you could use the tools from the tidyverse, you could use the pivot_longer() function from the tidyr package to convert the original data frame to a tidy data frame instead of retyping the values by hand. The old column names become the values of the new year column, so year is stored as character (<chr>) rather than as a number. You will perform more advanced cleaning operations as you learn more about the tidyverse packages.

df_tidy <- df %>% 
  pivot_longer(cols = -name, names_to = "year", values_to = "age")

df_tidy
# A tibble: 9 × 3
  name         year    age
  <chr>        <chr> <dbl>
1 John Smith   2010     25
2 John Smith   2011     26
3 John Smith   2012     27
4 Jane Doe     2010     30
5 Jane Doe     2011     31
6 Jane Doe     2012     32
7 Mary Johnson 2010     35
8 Mary Johnson 2011     36
9 Mary Johnson 2012     37

Now the data is tidy because each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
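Because pivot_longer() builds the year column from column names, that column comes back as text. The following is a minimal sketch, assuming tidyr 1.1.0 or later, of one way to convert it to integers during the pivot by using the names_transform argument. The name df_tidy_int is only an illustration and is not part of the original example.

```r
library(tibble)
library(tidyr)

# The non-tidy data frame from the example above
df <- tibble(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  `2010` = c(25, 30, 35),
  `2011` = c(26, 31, 36),
  `2012` = c(27, 32, 37)
)

# Pivot to long format and convert the year labels to integers on the way
df_tidy_int <- pivot_longer(
  df,
  cols = -name,
  names_to = "year",
  values_to = "age",
  names_transform = list(year = as.integer)
)

df_tidy_int
```

With year stored as a number, it can be used directly in arithmetic or on a numeric plot axis.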

Is Non-Tidy Data Bad?

Can you work with non-tidy data? Yes.

Can you analyze non-tidy data? Yes.

Can you visualize non-tidy data? Yes.

While tidy data has some nice properties that make it easier to work with, it is not the only way to work with data. You can work with non-tidy data, but it may be more difficult to do so. You may have to write more code or create a completely new script to work with the data.

When we think about ease of use and reproducibility, having a consistent structure for the data can make life easier. If we know going into the coding that the data is tidy, we know that we have an entire library dedicated to working with data that is set up in this particular way.
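As a small illustration of that point, here is a sketch of a dplyr summary on the tidy version of the age data. It assumes the tidy layout with name, year, and age columns; the name mean_age_by_year is hypothetical.

```r
library(tibble)
library(dplyr)

# Tidy version of the age data: one row per person per year
df_tidy <- tibble(
  name = rep(c("John Smith", "Jane Doe", "Mary Johnson"), each = 3),
  year = rep(c(2010, 2011, 2012), times = 3),
  age  = c(25, 26, 27, 30, 31, 32, 35, 36, 37)
)

# Because year is its own column, grouping by it takes a single step
mean_age_by_year <- df_tidy %>%
  group_by(year) %>%
  summarise(mean_age = mean(age))

mean_age_by_year
```

The same group_by() and summarise() pattern works on any tidy data set, whatever its variables are called.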

Does this mean that you will only work with tidy data? No. You will work with non-tidy data. In fact, there may be some projects where you don’t want to turn the data into a tidy dataset – and that’s OK. Your job will be to determine the best way to work with the data that you have.
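If a project does call for the wide layout, for example a table meant for people to read, tidy data can be reshaped back. The sketch below uses pivot_wider() from tidyr; the name df_wide is only illustrative.

```r
library(tibble)
library(tidyr)

# Tidy version of the age data: one row per person per year
df_tidy <- tibble(
  name = rep(c("John Smith", "Jane Doe", "Mary Johnson"), each = 3),
  year = rep(c(2010, 2011, 2012), times = 3),
  age  = c(25, 26, 27, 30, 31, 32, 35, 36, 37)
)

# Spread the years back out into one column per year
df_wide <- pivot_wider(df_tidy, names_from = year, values_from = age)

df_wide
```

Moving between the two layouts is cheap, so choosing one for a given task does not lock you in.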

Summary

Tidy data follows three rules: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. When data is not tidy, for example when the values of one variable are spread across multiple columns, it can be harder to analyze and visualize, and you may have to write new code for each data set. Organizing data in a tidy format makes it easier to work with because the tidyverse packages are designed for data formatted as “tidy”.


Practice Problems

In this assignment, you will identify whether a given dataset is in tidy format. Each problem will present a dataset from a different background along with a brief description. Your task is to determine if the dataset is tidy. If it is not tidy, describe why and provide a tidy version of the dataset.

Problem 1: Weather Data

The following dataset contains weather data for three cities.

City Jan_Temp Feb_Temp Mar_Temp
New York 30 32 45
Los Angeles 58 60 65
Chicago 25 28 40

The dataset is not tidy because the months are stored in the column names. There are three variables: City, Month, and Temperature. Here is the tidy version:

City Month Temperature
New York Jan 30
Los Angeles Jan 58
Chicago Jan 25
New York Feb 32
Los Angeles Feb 60
Chicago Feb 28
New York Mar 45
Los Angeles Mar 65
Chicago Mar 40
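For readers who want to check an answer like this in R, here is a minimal sketch using pivot_longer() with the names_pattern argument to strip the _Temp suffix from the column names. The object names weather and weather_tidy are hypothetical.

```r
library(tibble)
library(tidyr)

# The wide weather table from Problem 1
weather <- tibble(
  City = c("New York", "Los Angeles", "Chicago"),
  Jan_Temp = c(30, 58, 25),
  Feb_Temp = c(32, 60, 28),
  Mar_Temp = c(45, 65, 40)
)

# Keep only the part of each column name before "_Temp" as the month
weather_tidy <- pivot_longer(
  weather,
  cols = -City,
  names_to = "Month",
  names_pattern = "(.*)_Temp",
  values_to = "Temperature"
)

weather_tidy
```

pivot_longer() lists all months for one city before moving to the next city, so the rows come out in a different order than in the table above. Row order does not affect whether the data is tidy.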


Problem 2: Student Grades

The following dataset contains grades for students in three subjects.

Student Math Science History
Alice 90 88 84
Bob 85 92 78
Charlie 95 89 91

The dataset is not tidy. The three variables are Student, Subject, and Grade. Here is the tidy version:

Student Subject Grade
Alice Math 90
Bob Math 85
Charlie Math 95
Alice Science 88
Bob Science 92
Charlie Science 89
Alice History 84
Bob History 78
Charlie History 91


Problem 3: Sales Data

The following dataset contains monthly sales data for different products.

Product Jan_Sales Feb_Sales Mar_Sales
A 100 110 120
B 150 160 170
C 200 210 220

The dataset is not tidy. The variables are Product, Month, and Sales. Here is the tidy version:

Product Month Sales
A Jan 100
B Jan 150
C Jan 200
A Feb 110
B Feb 160
C Feb 210
A Mar 120
B Mar 170
C Mar 220

Problem 4: Patient Health Data

The following dataset contains health data for patients.

Patient Height Weight Age
John 170 70 30
Jane 160 55 25
Doe 180 80 40

The dataset is tidy.

Problem 5: Financial Data

The following dataset contains quarterly financial data for companies.

Company Q1_Revenue Q2_Revenue Q3_Revenue
X 1000 1100 1200
Y 2000 2100 2200
Z 3000 3100 3200

The dataset is not tidy. The three variables are Company, Quarter, and Revenue. Here is the tidy version:

Company Quarter Revenue
X Q1 1000
Y Q1 2000
Z Q1 3000
X Q2 1100
Y Q2 2100
Z Q2 3100
X Q3 1200
Y Q3 2200
Z Q3 3200

Problem 6: Sports Statistics

The following dataset contains statistics for players in a sports team.

Player Goals Assists Saves
Player1 5 3 2
Player2 8 5 1
Player3 7 4 3

The dataset is tidy.

Problem 7: Movie Ratings

The following dataset contains ratings for movies by different critics.

Movie Critic1 Critic2 Critic3
Movie A 4.5 4.0 4.7
Movie B 3.8 3.9 4.0
Movie C 4.7 4.8 4.9

The dataset is not tidy. The three variables are Movie, Critic, and Rating. Here is the tidy version:

Movie Critic Rating
Movie A Critic1 4.5
Movie B Critic1 3.8
Movie C Critic1 4.7
Movie A Critic2 4.0
Movie B Critic2 3.9
Movie C Critic2 4.8
Movie A Critic3 4.7
Movie B Critic3 4.0
Movie C Critic3 4.9

Problem 8: Employee Salary Data

The following dataset contains salary data for employees in different departments.

Employee Dept1_Salary Dept2_Salary Dept3_Salary
E1 50000 52000 54000
E2 55000 57000 59000
E3 60000 62000 64000

The dataset is not tidy. The variables are Employee, Department, and Salary. Here is the tidy version:

Employee Department Salary
E1 Dept1 50000
E2 Dept1 55000
E3 Dept1 60000
E1 Dept2 52000
E2 Dept2 57000
E3 Dept2 62000
E1 Dept3 54000
E2 Dept3 59000
E3 Dept3 64000

Problem 9: Product Reviews

The following dataset contains reviews for products.

Product   Review1    Review2    Review3
Product1  Good       Very Good  Excellent
Product2  Average    Good       Good
Product3  Excellent  Very Good  Good

The dataset is not tidy. The variables are Product, Review_Number, and Review. Here is the tidy version:

Product Review_Number Review
Product1 1 Good
Product2 1 Average
Product3 1 Excellent
Product1 2 Very Good
Product2 2 Good
Product3 2 Very Good
Product1 3 Excellent
Product2 3 Good
Product3 3 Good
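One way to produce a numeric Review_Number in R is sketched below, assuming tidyr 1.1.0 or later. The names_prefix argument removes the text "Review" from each column name, and names_transform converts what is left to an integer. The object names reviews and reviews_tidy are hypothetical.

```r
library(tibble)
library(tidyr)

# The wide review table from Problem 9
reviews <- tibble(
  Product = c("Product1", "Product2", "Product3"),
  Review1 = c("Good", "Average", "Excellent"),
  Review2 = c("Very Good", "Good", "Very Good"),
  Review3 = c("Excellent", "Good", "Good")
)

# Strip the "Review" prefix and store the remaining digit as an integer
reviews_tidy <- pivot_longer(
  reviews,
  cols = -Product,
  names_to = "Review_Number",
  names_prefix = "Review",
  names_transform = list(Review_Number = as.integer),
  values_to = "Review"
)

reviews_tidy
```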

Problem 10: Course Enrollment Data

The following dataset contains enrollment data for courses.

Course Semester1 Semester2 Semester3
Course1 30 35 32
Course2 25 28 26
Course3 20 22 24

The dataset is not tidy. The variables are Course, Semester, and Enrollment. Here is the tidy version:

Course Semester Enrollment
Course1 1 30
Course2 1 25
Course3 1 20
Course1 2 35
Course2 2 28
Course3 2 22
Course1 3 32
Course2 3 26
Course3 3 24

Problem 11: Sales Data

The following table shows the monthly sales data for three products.

Product January February March
Product A 100 120 130
Product B 150 160 170
Product C 200 220 230

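The dataset is not tidy. The variables are Product, Month, and Sales. Here is the tidy version.

Product Month Sales
Product A January 100
Product B January 150
Product C January 200
Product A February 120
Product B February 160
Product C February 220
Product A March 130
Product B March 170
Product C March 230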

Problem 12: Survey Data

The following table represents the results of a survey where respondents rated their satisfaction with three services.

Respondent Service1_Satisfaction Service2_Satisfaction Service3_Satisfaction
R1 5 4 3
R2 4 3 2
R3 3 2 1

The dataset is not tidy. The variables are Respondent, Service, and Satisfaction. Here is the tidy version.

Respondent Service Satisfaction
R1 Service1 5
R2 Service1 4
R3 Service1 3
R1 Service2 4
R2 Service2 3
R3 Service2 2
R1 Service3 3
R2 Service3 2
R3 Service3 1

Problem 13: Weather Data

The table below shows the temperature readings at different times of the day from Monday through Friday.

Day Morning Noon Evening
Monday 20 25 22
Tuesday 21 26 23
Wednesday 19 24 21
Thursday 22 27 24
Friday 20 25 22

The dataset is not tidy. The variables are Day, Time, and Temperature. Here is the tidy version.

Day Time Temperature
Monday Morning 20
Tuesday Morning 21
Wednesday Morning 19
Thursday Morning 22
Friday Morning 20
Monday Noon 25
Tuesday Noon 26
Wednesday Noon 24
Thursday Noon 27
Friday Noon 25
Monday Evening 22
Tuesday Evening 23
Wednesday Evening 21
Thursday Evening 24
Friday Evening 22

Problem 14: Exam Scores

The following table lists the scores of students in three subjects.

Student Math Science History
Student1 85 88 80
Student2 90 92 85
Student3 95 96 90

The dataset is not tidy. The variables are Student, Subject, and Score. Here is the tidy version.

Student Subject Score
Student1 Math 85
Student2 Math 90
Student3 Math 95
Student1 Science 88
Student2 Science 92
Student3 Science 96
Student1 History 80
Student2 History 85
Student3 History 90

Problem 15: Hospital Data

The table below shows the number of patients admitted to different wards of a hospital over three months.

Ward January February March
Ward A 30 35 40
Ward B 25 30 35
Ward C 20 25 30

The dataset is not tidy. The variables are Ward, Month, and Patients. Here is the tidy version.

Ward Month Patients
Ward A January 30
Ward B January 25
Ward C January 20
Ward A February 35
Ward B February 30
Ward C February 25
Ward A March 40
Ward B March 35
Ward C March 30

Problem 16: Marketing Data

The following table represents the results of a marketing campaign showing the number of leads generated from different channels.

Channel Week1 Week2 Week3
Email 50 55 60
Social Media 60 65 70
SEO 70 75 80

The dataset is not tidy. The variables are Channel, Week, and Leads. Here is the tidy version.

Channel Week Leads
Email Week1 50
Social Media Week1 60
SEO Week1 70
Email Week2 55
Social Media Week2 65
SEO Week2 75
Email Week3 60
Social Media Week3 70
SEO Week3 80

Problem 17: Fitness Data

The table below shows the workout times for three athletes on three days of one week.

Athlete Monday Wednesday Friday
Athlete1 30 35 40
Athlete2 40 45 50
Athlete3 50 55 60

The dataset is not tidy. The variables are Athlete, Day, and Workout_Time. Here is the tidy version.

Athlete Day Workout_Time
Athlete1 Monday 30
Athlete2 Monday 40
Athlete3 Monday 50
Athlete1 Wednesday 35
Athlete2 Wednesday 45
Athlete3 Wednesday 55
Athlete1 Friday 40
Athlete2 Friday 50
Athlete3 Friday 60

Problem 18: Financial Data

The following table shows the quarterly profits for three companies.

Company Q1 Q2 Q3 Q4
Company A 10000 12000 13000 14000
Company B 15000 16000 17000 18000
Company C 20000 22000 23000 24000

The dataset is not tidy. The variables are Company, Quarter, and Profit. Here is the tidy version.

Company Quarter Profit
Company A Q1 10000
Company B Q1 15000
Company C Q1 20000
Company A Q2 12000
Company B Q2 16000
Company C Q2 22000
Company A Q3 13000
Company B Q3 17000
Company C Q3 23000
Company A Q4 14000
Company B Q4 18000
Company C Q4 24000
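The same reshaping can be sketched in R by selecting the quarter columns as a range (Q1:Q4) and then sorting by quarter to match the row order shown above. The object names profits and profits_tidy are hypothetical.

```r
library(tibble)
library(tidyr)
library(dplyr)

# The wide profit table from Problem 18
profits <- tibble(
  Company = c("Company A", "Company B", "Company C"),
  Q1 = c(10000, 15000, 20000),
  Q2 = c(12000, 16000, 22000),
  Q3 = c(13000, 17000, 23000),
  Q4 = c(14000, 18000, 24000)
)

# Gather the four quarter columns, then order the rows by quarter
profits_tidy <- profits %>%
  pivot_longer(cols = Q1:Q4, names_to = "Quarter", values_to = "Profit") %>%
  arrange(Quarter)

profits_tidy
```

Three companies across four quarters give twelve rows, one per company per quarter.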

Problem 19: Attendance Data

The table below shows the attendance numbers for different events over three days.

Event Day1 Day2 Day3
Event A 100 110 120
Event B 150 160 170
Event C 200 210 220

The dataset is not tidy. The variables are Event, Day, and Attendance. Here is the tidy version.

Event Day Attendance
Event A Day1 100
Event A Day2 110
Event A Day3 120
Event B Day1 150
Event B Day2 160
Event B Day3 170
Event C Day1 200
Event C Day2 210
Event C Day3 220

Problem 20: Production Data

The following table represents the production output of different products over three shifts.

Product Shift1 Shift2 Shift3
Product X 300 350 400
Product Y 400 450 500
Product Z 500 550 600

The dataset is not tidy. The variables are Product, Shift, and Production. Here is the tidy version.

Product Shift Production
Product X Shift1 300
Product X Shift2 350
Product X Shift3 400
Product Y Shift1 400
Product Y Shift2 450
Product Y Shift3 500
Product Z Shift1 500
Product Z Shift2 550
Product Z Shift3 600