For reference, the official tidy data page is based on the original paper in which Hadley Wickham described the concept, "Tidy Data", published in the Journal of Statistical Software.
Tidy data is a standard way of organizing data that makes it easy to work with. The concept was popularized by the tidyverse in R, a collection of packages that are designed to work together and to work with tidy data. As you learn more about R, you will meet several of these packages, including readr for reading in data, tidyr for tidying and reshaping data, dplyr for transforming data, and ggplot2 for visualizing data. Because these tools expect tidy data, it is important to understand what tidy data is and how to organize data in a tidy format.
Hadley Wickham, the author of the tidyverse packages, defines tidy data as follows:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
The purpose of creating tidy data sets is to make data easier to analyze and visualize. Every tidy data set follows the same structure, so once we know a data set is tidy, we can use the tidyverse packages on it directly.
This consistency makes it easier to read in, clean, transform, and visualize data, because the same tools apply every time. We don’t have to learn new tools for each new data set that we work with.
When data is not tidy, it can be difficult to work with. For example, if the values of a single variable are spread across multiple columns, the data is harder to analyze and visualize, and you may have to rewrite your code or create a completely new script to handle it. Organizing data in a tidy format avoids this, because the tidyverse packages are built to work with data formatted as “tidy”.
Example
Consider the following data:
# If needed, install the tidyverse package
# install.packages("tidyverse")
# If it is already installed, make sure it is loaded:
# Load the tidyverse package
library(tidyverse)
Let’s create a data frame that is not tidy. We will create a data frame with three rows and four columns. The first column contains the names of three people, and the other columns contain data for the years 2010, 2011, and 2012. Each row represents a person, and each column represents their age during that year.
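The code that builds this data frame is not shown on the page, so here is a minimal sketch of one way to create it. The object name `ages_wide` is an assumption made for illustration, and `tribble()` from the tibble package is just one convenient constructor; `tibble()` or `data.frame()` would work as well.

```r
library(tibble)

# Wide (non-tidy) layout: one row per person, one column per year.
# `ages_wide` is a hypothetical name chosen for this sketch.
ages_wide <- tribble(
  ~name,          ~`2010`, ~`2011`, ~`2012`,
  "John Smith",   25,      26,      27,
  "Jane Doe",     30,      31,      32,
  "Mary Johnson", 35,      36,      37
)

# Printing the object displays it as a 3 x 4 tibble.
ages_wide
```

The year columns need backticks because `2010` is not a syntactic R name, which is already a hint that these column headers are really data values.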
# A tibble: 3 × 4
  name         `2010` `2011` `2012`
  <chr>         <dbl>  <dbl>  <dbl>
1 John Smith       25     26     27
2 Jane Doe         30     31     32
3 Mary Johnson     35     36     37
In this case, the data is not tidy because the years are spread across columns. In order for this to be considered “tidy” data, we would need to think about the data in a different way.
The variables we are using are name, year, and age. In order for the data to be tidy, we want each observation (row) to contain a name, a year, and the age. This tells us we would need to have a column for each of these variables. In this case, we would need to have a column for the name of the person, a column for the year that the data was collected, and a column for the age of the person the year the data was collected.
Here is what the tidy data would look like if we ordered the data in a tidy format by name, year, and age:
# A tibble: 9 × 3
  name          year   age
  <chr>        <dbl> <dbl>
1 John Smith    2010    25
2 John Smith    2011    26
3 John Smith    2012    27
4 Jane Doe      2010    30
5 Jane Doe      2011    31
6 Jane Doe      2012    32
7 Mary Johnson  2010    35
8 Mary Johnson  2011    36
9 Mary Johnson  2012    37
We have now cleaned the data so that we can work with it in a tidy format.
As an example of how you could use the tools in the tidyverse, the pivot_longer() function from the tidyr package converts the original data frame to a tidy data frame in a single call. This is a more elegant way to reshape the data than rebuilding it by hand. Note that in the output below the year column is stored as text (<chr>) and the measurement column has the default name value rather than age. You will perform more advanced cleaning operations as you learn more about the tidyverse packages.
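The call itself is not shown on the page. A minimal sketch that should give a result of this shape is below, assuming the wide data frame is stored in a hypothetical object named `ages_wide`. Because `values_to` is left at its default, the measurement column is called `value`, and because column headers are text, `year` comes back as a character column.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data frame matching the example in the text.
ages_wide <- tribble(
  ~name,          ~`2010`, ~`2011`, ~`2012`,
  "John Smith",   25,      26,      27,
  "Jane Doe",     30,      31,      32,
  "Mary Johnson", 35,      36,      37
)

ages_long <- pivot_longer(
  ages_wide,
  cols = -name,       # pivot every column except `name`
  names_to = "year"   # old column headers become values of `year`
)

ages_long
```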
# A tibble: 9 × 3
  name         year  value
  <chr>        <chr> <dbl>
1 John Smith   2010     25
2 John Smith   2011     26
3 John Smith   2012     27
4 Jane Doe     2010     30
5 Jane Doe     2011     31
6 Jane Doe     2012     32
7 Mary Johnson 2010     35
8 Mary Johnson 2011     36
9 Mary Johnson 2012     37
Now the data is tidy because each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
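If the result should match the hand-built tidy table shown earlier (a numeric year and a column named age), pivot_longer() has arguments for both. The sketch below is one possible way to do it, again assuming a hypothetical `ages_wide` object; `values_to` renames the measurement column and `names_transform` converts the former column headers to numbers.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data frame matching the example in the text.
ages_wide <- tribble(
  ~name,          ~`2010`, ~`2011`, ~`2012`,
  "John Smith",   25,      26,      27,
  "Jane Doe",     30,      31,      32,
  "Mary Johnson", 35,      36,      37
)

ages_tidy <- pivot_longer(
  ages_wide,
  cols = -name,
  names_to = "year",
  values_to = "age",                          # name the measurement column
  names_transform = list(year = as.numeric)   # "2010" (text) -> 2010 (number)
)

ages_tidy
```

Storing year as a number matters later on, for example when filtering with `year > 2010` or plotting year on a continuous axis.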
Is Non-Tidy Data Bad?
Can you work with non-tidy data? Yes.
Can you analyze non-tidy data? Yes.
Can you visualize non-tidy data? Yes.
While tidy data has some nice properties that make it easier to work with, it is not the only way to work with data. You can work with non-tidy data, but it may be more difficult to do so. You may have to write more code or create a completely new script to work with the data.
When we think about ease of use and reproducibility, having a consistent structure to the data can make life easier. If we know before we start coding that the data is tidy, we know that we have an entire collection of packages dedicated to working with data that is set up in this particular way.
Does this mean that you will only work with tidy data? No. You will work with non-tidy data. In fact, there may be some projects where you don’t want to turn the data into a tidy dataset – and that’s OK. Your job will be to determine the best way to work with the data that you have.
Summary
Tidy data follows three rules: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. When data is not tidy, for example when the values of one variable are spread across multiple columns, it is harder to analyze and visualize, and you may have to rewrite your code for each new layout. Organizing data in a tidy format makes it easier to work with, because the tidyverse packages are built for data formatted as “tidy”.
Practice Problems
In this assignment, you will identify whether a given dataset is in tidy format. Each problem will present a dataset from a different background along with a brief description. Your task is to determine if the dataset is tidy. If it is not tidy, describe why and provide a tidy version of the dataset.
Problem 1: Weather Data
The following dataset contains weather data for three cities.
| City | Jan_Temp | Feb_Temp | Mar_Temp |
| --- | --- | --- | --- |
| New York | 30 | 32 | 45 |
| Los Angeles | 58 | 60 | 65 |
| Chicago | 25 | 28 | 40 |
Answer
The dataset is not tidy. There are three variables: City, Month, and Temperature. Here is the tidy version:
| City | Month | Temperature |
| --- | --- | --- |
| New York | Jan | 30 |
| Los Angeles | Jan | 58 |
| Chicago | Jan | 25 |
| New York | Feb | 32 |
| Los Angeles | Feb | 60 |
| Chicago | Feb | 28 |
| New York | Mar | 45 |
| Los Angeles | Mar | 65 |
| Chicago | Mar | 40 |
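A reshape like the one in this answer can also be done in code rather than by hand. The following is a sketch, not the only approach: the object names `weather_wide` and `weather_long` are assumptions, and `names_pattern` is used here to keep only the month part of headers such as `Jan_Temp`.

```r
library(tibble)
library(tidyr)

# The Problem 1 data, entered by hand for this sketch.
weather_wide <- tribble(
  ~City,         ~Jan_Temp, ~Feb_Temp, ~Mar_Temp,
  "New York",    30,        32,        45,
  "Los Angeles", 58,        60,        65,
  "Chicago",     25,        28,        40
)

weather_long <- pivot_longer(
  weather_wide,
  cols = ends_with("_Temp"),     # the three month columns
  names_to = "Month",
  names_pattern = "(.*)_Temp",   # keep the part before "_Temp"
  values_to = "Temperature"
)

weather_long
```

pivot_longer() returns the rows grouped by city rather than by month, so the row order differs from the answer table. The order of rows does not affect whether a data set is tidy.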
Problem 2: Student Grades
The following dataset contains grades for students in three subjects.
| Student | Math | Science | History |
| --- | --- | --- | --- |
| Alice | 90 | 88 | 84 |
| Bob | 85 | 92 | 78 |
| Charlie | 95 | 89 | 91 |
Answer
The dataset is not tidy. The three variables are Student, Subject, and Grade. Here is the tidy version:
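| Student | Subject | Grade |
| --- | --- | --- |
| Alice | Math | 90 |
| Bob | Math | 85 |
| Charlie | Math | 95 |
| Alice | Science | 88 |
| Bob | Science | 92 |
| Charlie | Science | 89 |
| Alice | History | 84 |
| Bob | History | 78 |
| Charlie | History | 91 |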
Problem 3: Sales Data
The following dataset contains monthly sales data for different products.
| Product | Jan_Sales | Feb_Sales | Mar_Sales |
| --- | --- | --- | --- |
| A | 100 | 110 | 120 |
| B | 150 | 160 | 170 |
| C | 200 | 210 | 220 |
Answer
The dataset is not tidy. The variables are Product, Month, and Sales. Here is the tidy version:
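| Product | Month | Sales |
| --- | --- | --- |
| A | Jan | 100 |
| B | Jan | 150 |
| C | Jan | 200 |
| A | Feb | 110 |
| B | Feb | 160 |
| C | Feb | 210 |
| A | Mar | 120 |
| B | Mar | 170 |
| C | Mar | 220 |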
Problem 4: Patient Health Data
The following dataset contains health data for patients.
| Patient | Height | Weight | Age |
| --- | --- | --- | --- |
| John | 170 | 70 | 30 |
| Jane | 160 | 55 | 25 |
| Doe | 180 | 80 | 40 |
Answer
The dataset is tidy.
Problem 5: Financial Data
The following dataset contains quarterly financial data for companies.
| Company | Q1_Revenue | Q2_Revenue | Q3_Revenue |
| --- | --- | --- | --- |
| X | 1000 | 1100 | 1200 |
| Y | 2000 | 2100 | 2200 |
| Z | 3000 | 3100 | 3200 |
Answer
The dataset is not tidy. The three variables are Company, Quarter, and Revenue. Here is the tidy version:
| Company | Quarter | Revenue |
| --- | --- | --- |
| X | Q1 | 1000 |
| Y | Q1 | 2000 |
| Z | Q1 | 3000 |
| X | Q2 | 1100 |
| Y | Q2 | 2100 |
| Z | Q2 | 3100 |
| X | Q3 | 1200 |
| Y | Q3 | 2200 |
| Z | Q3 | 3200 |
Problem 6: Sports Statistics
The following dataset contains statistics for players in a sports team.
| Player | Goals | Assists | Saves |
| --- | --- | --- | --- |
| Player1 | 5 | 3 | 2 |
| Player2 | 8 | 5 | 1 |
| Player3 | 7 | 4 | 3 |
Answer
The dataset is tidy.
Problem 7: Movie Ratings
The following dataset contains ratings for movies by different critics.
| Movie | Critic1 | Critic2 | Critic3 |
| --- | --- | --- | --- |
| Movie A | 4.5 | 4.0 | 4.7 |
| Movie B | 3.8 | 3.9 | 4.0 |
| Movie C | 4.7 | 4.8 | 4.9 |
Answer
The dataset is not tidy. The three variables are Movie, Critic, and Rating. Here is the tidy version:
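| Movie | Critic | Rating |
| --- | --- | --- |
| Movie A | Critic1 | 4.5 |
| Movie B | Critic1 | 3.8 |
| Movie C | Critic1 | 4.7 |
| Movie A | Critic2 | 4.0 |
| Movie B | Critic2 | 3.9 |
| Movie C | Critic2 | 4.8 |
| Movie A | Critic3 | 4.7 |
| Movie B | Critic3 | 4.0 |
| Movie C | Critic3 | 4.9 |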
Problem 8: Employee Salary Data
The following dataset contains salary data for employees in different departments.
| Employee | Dept1_Salary | Dept2_Salary | Dept3_Salary |
| --- | --- | --- | --- |
| E1 | 50000 | 52000 | 54000 |
| E2 | 55000 | 57000 | 59000 |
| E3 | 60000 | 62000 | 64000 |
Answer
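The dataset is not tidy. The column headers Dept1_Salary, Dept2_Salary, and Dept3_Salary are values of a variable, not variables themselves. The variables are Employee, Department, and Salary. Here is the tidy version:
| Employee | Department | Salary |
| --- | --- | --- |
| E1 | Dept1 | 50000 |
| E2 | Dept1 | 55000 |
| E3 | Dept1 | 60000 |
| E1 | Dept2 | 52000 |
| E2 | Dept2 | 57000 |
| E3 | Dept2 | 62000 |
| E1 | Dept3 | 54000 |
| E2 | Dept3 | 59000 |
| E3 | Dept3 | 64000 |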
Problem 9: Product Reviews
The following dataset contains reviews for products.
| Product | Review1 | Review2 | Review3 |
| --- | --- | --- | --- |
| Product1 | Good | Very Good | Excellent |
| Product2 | Average | Good | Good |
| Product3 | Excellent | Very Good | Good |
Answer
The dataset is not tidy. The variables are Product, Review_Number, and Review. Here is the tidy version:
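| Product | Review_Number | Review |
| --- | --- | --- |
| Product1 | 1 | Good |
| Product2 | 1 | Average |
| Product3 | 1 | Excellent |
| Product1 | 2 | Very Good |
| Product2 | 2 | Good |
| Product3 | 2 | Very Good |
| Product1 | 3 | Excellent |
| Product2 | 3 | Good |
| Product3 | 3 | Good |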
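In this answer the headers Review1, Review2, and Review3 become the plain numbers 1, 2, and 3. One way to get that in code is sketched below, under the assumption that the data is stored in a hypothetical object named `reviews_wide`: `names_prefix` strips the leading "Review" text and `names_transform` converts what is left to an integer.

```r
library(tibble)
library(tidyr)

# The Problem 9 data, entered by hand for this sketch.
reviews_wide <- tribble(
  ~Product,   ~Review1,    ~Review2,    ~Review3,
  "Product1", "Good",      "Very Good", "Excellent",
  "Product2", "Average",   "Good",      "Good",
  "Product3", "Excellent", "Very Good", "Good"
)

reviews_long <- pivot_longer(
  reviews_wide,
  cols = starts_with("Review"),
  names_to = "Review_Number",
  names_prefix = "Review",                              # "Review2" -> "2"
  names_transform = list(Review_Number = as.integer),   # "2" -> 2L
  values_to = "Review"
)

reviews_long
```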
Problem 10: Course Enrollment Data
The following dataset contains enrollment data for courses.
| Course | Semester1 | Semester2 | Semester3 |
| --- | --- | --- | --- |
| Course1 | 30 | 35 | 32 |
| Course2 | 25 | 28 | 26 |
| Course3 | 20 | 22 | 24 |
Answer
The dataset is not tidy. The variables are Course, Semester, and Enrollment. Here is the tidy version:
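| Course | Semester | Enrollment |
| --- | --- | --- |
| Course1 | 1 | 30 |
| Course2 | 1 | 25 |
| Course3 | 1 | 20 |
| Course1 | 2 | 35 |
| Course2 | 2 | 28 |
| Course3 | 2 | 22 |
| Course1 | 3 | 32 |
| Course2 | 3 | 26 |
| Course3 | 3 | 24 |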
Problem 11: Sales Data
The following table shows the monthly sales data for three products.
| Product | January | February | March |
| --- | --- | --- | --- |
| Product A | 100 | 120 | 130 |
| Product B | 150 | 160 | 170 |
| Product C | 200 | 220 | 230 |
Answer
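The dataset is not tidy. The variables are Product, Month, and Sales. Here is the tidy version.
| Product | Month | Sales |
| --- | --- | --- |
| Product A | January | 100 |
| Product B | January | 150 |
| Product C | January | 200 |
| Product A | February | 120 |
| Product B | February | 160 |
| Product C | February | 220 |
| Product A | March | 130 |
| Product B | March | 170 |
| Product C | March | 230 |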
Problem 12: Survey Data
The following table represents the results of a survey where respondents rated their satisfaction with three services.
| Respondent | Service1_Satisfaction | Service2_Satisfaction | Service3_Satisfaction |
| --- | --- | --- | --- |
| R1 | 5 | 4 | 3 |
| R2 | 4 | 3 | 2 |
| R3 | 3 | 2 | 1 |
Answer
The dataset is not tidy. The variables are Respondent, Service, and Satisfaction. Here is the tidy version.
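| Respondent | Service | Satisfaction |
| --- | --- | --- |
| R1 | Service1 | 5 |
| R2 | Service1 | 4 |
| R3 | Service1 | 3 |
| R1 | Service2 | 4 |
| R2 | Service2 | 3 |
| R3 | Service2 | 2 |
| R1 | Service3 | 3 |
| R2 | Service3 | 2 |
| R3 | Service3 | 1 |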
Problem 13: Weather Data
The table below shows the temperature readings at different times of the day for a week.
| Day | Morning | Noon | Evening |
| --- | --- | --- | --- |
| Monday | 20 | 25 | 22 |
| Tuesday | 21 | 26 | 23 |
| Wednesday | 19 | 24 | 21 |
| Thursday | 22 | 27 | 24 |
| Friday | 20 | 25 | 22 |
Answer
The dataset is not tidy. The variables are Day, Time, and Temperature. Here is the tidy version.
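| Day | Time | Temperature |
| --- | --- | --- |
| Monday | Morning | 20 |
| Tuesday | Morning | 21 |
| Wednesday | Morning | 19 |
| Thursday | Morning | 22 |
| Friday | Morning | 20 |
| Monday | Noon | 25 |
| Tuesday | Noon | 26 |
| Wednesday | Noon | 24 |
| Thursday | Noon | 27 |
| Friday | Noon | 25 |
| Monday | Evening | 22 |
| Tuesday | Evening | 23 |
| Wednesday | Evening | 21 |
| Thursday | Evening | 24 |
| Friday | Evening | 22 |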
Problem 14: Exam Scores
The following table lists the scores of students in three subjects.
| Student | Math | Science | History |
| --- | --- | --- | --- |
| Student1 | 85 | 88 | 80 |
| Student2 | 90 | 92 | 85 |
| Student3 | 95 | 96 | 90 |
Answer
The dataset is not tidy. The variables are Student, Subject, and Score. Here is the tidy version.
| Student | Subject | Score |
| --- | --- | --- |
| Student1 | Math | 85 |
| Student2 | Math | 90 |
| Student3 | Math | 95 |
| Student1 | Science | 88 |
| Student2 | Science | 92 |
| Student3 | Science | 96 |
| Student1 | History | 80 |
| Student2 | History | 85 |
| Student3 | History | 90 |
Problem 15: Hospital Data
The table below shows the number of patients admitted to different wards of a hospital over three months.
| Ward | January | February | March |
| --- | --- | --- | --- |
| Ward A | 30 | 35 | 40 |
| Ward B | 25 | 30 | 35 |
| Ward C | 20 | 25 | 30 |
Answer
The dataset is not tidy. The variables are Ward, Month, and Patients. Here is the tidy version.
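| Ward | Month | Patients |
| --- | --- | --- |
| Ward A | January | 30 |
| Ward B | January | 25 |
| Ward C | January | 20 |
| Ward A | February | 35 |
| Ward B | February | 30 |
| Ward C | February | 25 |
| Ward A | March | 40 |
| Ward B | March | 35 |
| Ward C | March | 30 |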
Problem 16: Marketing Data
The following table represents the results of a marketing campaign showing the number of leads generated from different channels.
| Channel | Week1 | Week2 | Week3 |
| --- | --- | --- | --- |
| Email | 50 | 55 | 60 |
| Social Media | 60 | 65 | 70 |
| SEO | 70 | 75 | 80 |
Answer
The dataset is not tidy. The variables are Channel, Week, and Leads. Here is the tidy version.
| Channel | Week | Leads |
| --- | --- | --- |
| Email | Week1 | 50 |
| Social Media | Week1 | 60 |
| SEO | Week1 | 70 |
| Email | Week2 | 55 |
| Social Media | Week2 | 65 |
| SEO | Week2 | 75 |
| Email | Week3 | 60 |
| Social Media | Week3 | 70 |
| SEO | Week3 | 80 |
Problem 17: Fitness Data
The table below shows the workouts completed by three athletes over a week.
| Athlete | Monday | Wednesday | Friday |
| --- | --- | --- | --- |
| Athlete1 | 30 | 35 | 40 |
| Athlete2 | 40 | 45 | 50 |
| Athlete3 | 50 | 55 | 60 |
Answer
The dataset is not tidy. The variables are Athlete, Day, and Workout_Time. Here is the tidy version.
| Athlete | Day | Workout_Time |
| --- | --- | --- |
| Athlete1 | Monday | 30 |
| Athlete2 | Monday | 40 |
| Athlete3 | Monday | 50 |
| Athlete1 | Wednesday | 35 |
| Athlete2 | Wednesday | 45 |
| Athlete3 | Wednesday | 55 |
| Athlete1 | Friday | 40 |
| Athlete2 | Friday | 50 |
| Athlete3 | Friday | 60 |
Problem 18: Financial Data
The following table shows the quarterly profits for three companies.
| Company | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| Company A | 10000 | 12000 | 13000 | 14000 |
| Company B | 15000 | 16000 | 17000 | 18000 |
| Company C | 20000 | 22000 | 23000 | 24000 |
Answer
The dataset is not tidy. The variables are Company, Quarter, and Profit. Here is the tidy version.
| Company | Quarter | Profit |
| --- | --- | --- |
| Company A | Q1 | 10000 |
| Company B | Q1 | 15000 |
| Company C | Q1 | 20000 |
| Company A | Q2 | 12000 |
| Company B | Q2 | 16000 |
| Company C | Q2 | 22000 |
| Company A | Q3 | 13000 |
| Company B | Q3 | 17000 |
| Company C | Q3 | 23000 |
| Company A | Q4 | 14000 |
| Company B | Q4 | 18000 |
| Company C | Q4 | 24000 |
Problem 19: Attendance Data
The table below shows the attendance numbers for different events over three days.
| Event | Day1 | Day2 | Day3 |
| --- | --- | --- | --- |
| Event A | 100 | 110 | 120 |
| Event B | 150 | 160 | 170 |
| Event C | 200 | 210 | 220 |
Answer
The dataset is not tidy. The variables are Event, Day, and Attendance. Here is the tidy version.
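| Event | Day | Attendance |
| --- | --- | --- |
| Event A | Day1 | 100 |
| Event A | Day2 | 110 |
| Event A | Day3 | 120 |
| Event B | Day1 | 150 |
| Event B | Day2 | 160 |
| Event B | Day3 | 170 |
| Event C | Day1 | 200 |
| Event C | Day2 | 210 |
| Event C | Day3 | 220 |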
Problem 20: Production Data
The following table represents the production output of different products over three shifts.
| Product | Shift1 | Shift2 | Shift3 |
| --- | --- | --- | --- |
| Product X | 300 | 350 | 400 |
| Product Y | 400 | 450 | 500 |
| Product Z | 500 | 550 | 600 |
Answer
The dataset is not tidy. The variables are Product, Shift, and Production. Here is the tidy version.