What is “Tidy Data”?

Note

FYI: Here is the official page for tidy data. That page is based on the original paper describing tidy data that Hadley Wickham wrote for the Journal of Statistical Software: Tidy Data.

Tidy data is a standard way of organizing data that makes it easy to work with. The concept was popularized by the tidyverse in R. The tidyverse is, in a sense, a “library” of several packages that are designed to work together and to work with tidy data. As you learn more about R, you will discover several of these packages, including tools for reading in data (the readr package), tidying data (the tidyr package), transforming data (the dplyr package), and visualizing data (the ggplot2 package). Because these tools are designed around tidy data, it is important to understand what tidy data is and how to organize data in a tidy format.

Hadley Wickham, the author of the tidyverse packages, defines tidy data as follows:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

The purpose of creating tidy data sets is to make data easier to analyze and visualize. Tidy data follows a consistent structure, so if we know that a data set is tidy, we can use the tidyverse packages, whose tools are built around that structure.

This consistency makes it easier to read in data, to clean data, to transform data, and to visualize data. When data is tidy, it is easier to work with because we can use the same tools to work with the data. This means we don’t have to learn new tools for each new data set that we work with.
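As a minimal sketch of what "the same tools" means in practice, the block below builds a small tidy tibble and applies one group_by( ) and summarise( ) pattern to two different grouping variables. The object names (ages, mean_age_by_year, mean_age_by_name) are illustrative assumptions, not part of the lesson.

```r
library(dplyr)
library(tibble)

# Illustrative tidy data: one row per person per year
ages <- tibble(
  name = c("John Smith", "John Smith", "Jane Doe", "Jane Doe"),
  year = c(2010, 2011, 2010, 2011),
  age  = c(25, 26, 30, 31)
)

# Average age within each year
mean_age_by_year <- ages %>%
  group_by(year) %>%
  summarise(mean_age = mean(age))

# Same pattern, different grouping variable: average age per person
mean_age_by_name <- ages %>%
  group_by(name) %>%
  summarise(mean_age = mean(age))
```

Only the grouping column changes between the two summaries. Because every variable is its own column, no reshaping is needed to switch between them.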

When data is not tidy, it can be difficult to work with. For example, if the values of a single variable are spread across multiple columns, the data can be difficult to analyze and visualize, and you may have to rewrite your code or create a completely new script for each data set. Organizing data in a tidy format makes it easier to work with and analyze, because these packages were created for data sets that have been formatted as “tidy”.

Example

First, make sure the tidyverse is installed and loaded:

# If needed, install the tidyverse package:
# install.packages("tidyverse")

# Load the tidyverse package
library(tidyverse)


Let’s create a data frame that is not tidy. We will create a data frame with three rows and four columns. The first column contains the names of three people, and the other columns contain data for the years 2010, 2011, and 2012. Each row represents a person, and each year column holds that person’s age during that year.

df <- tibble(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  `2010` = c(25, 30, 35),
  `2011` = c(26, 31, 36),
  `2012` = c(27, 32, 37)
)

df

# A tibble: 3 × 4
  name         `2010` `2011` `2012`
  <chr>         <dbl>  <dbl>  <dbl>
1 John Smith       25     26     27
2 Jane Doe         30     31     32
3 Mary Johnson     35     36     37

In this case, the data is not tidy because the years are spread across columns. In order for this to be considered “tidy” data, we would need to think about the data in a different way.

The variables we are using are name, year, and age. In order for the data to be tidy, we want each observation (row) to contain a name, a year, and the age. This tells us we would need to have a column for each of these variables. In this case, we would need to have a column for the name of the person, a column for the year that the data was collected, and a column for the age of the person the year the data was collected.

Here is what the tidy data would look like if we ordered the data in a tidy format by name, year, and age:

df_tidy <- tibble(
  name = c("John Smith", "John Smith", "John Smith", "Jane Doe", "Jane Doe", 
           "Jane Doe", "Mary Johnson", "Mary Johnson", "Mary Johnson"),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012),
  age = c(25, 26, 27, 30, 31, 32, 35, 36, 37)
)

df_tidy

# A tibble: 9 × 3
  name          year   age
  <chr>        <dbl> <dbl>
1 John Smith    2010    25
2 John Smith    2011    26
3 John Smith    2012    27
4 Jane Doe      2010    30
5 Jane Doe      2011    31
6 Jane Doe      2012    32
7 Mary Johnson  2010    35
8 Mary Johnson  2011    36
9 Mary Johnson  2012    37

We have now cleaned the data so that we can work with it in a tidy format.

As an example of how you could use the tools from the tidyverse, you could use the pivot_longer( ) function from the tidyr package to convert the original data frame to a tidy data frame. This is just an example to show you a more elegant way to convert the data to a tidy format. You will perform more advanced cleaning operations as you learn more about the tidyverse packages.

df_tidy <- df %>% 
  pivot_longer(cols = -name, names_to = "year", values_to = "value")

df_tidy

# A tibble: 9 × 3
  name         year  value
  <chr>        <chr> <dbl>
1 John Smith   2010     25
2 John Smith   2011     26
3 John Smith   2012     27
4 Jane Doe     2010     30
5 Jane Doe     2011     31
6 Jane Doe     2012     32
7 Mary Johnson 2010     35
8 Mary Johnson 2011     36
9 Mary Johnson 2012     37

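Now the data is tidy because each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Note that year is stored as character (<chr>) here because it was built from the column names, and the ages sit in a column named value because of the values_to argument.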
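If you want the pivoted result to match the hand-built version (a numeric year and a column named age), one possible variation is sketched below. It assumes a tidyr version that supports the names_transform argument of pivot_longer( ) (1.1.0 or later).

```r
library(tidyr)
library(tibble)

# The original (non-tidy) data frame from the example
df <- tibble(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  `2010` = c(25, 30, 35),
  `2011` = c(26, 31, 36),
  `2012` = c(27, 32, 37)
)

df_tidy <- df %>%
  pivot_longer(
    cols = -name,                              # pivot every column except name
    names_to = "year",                         # old column names go here
    values_to = "age",                         # cell values go here
    names_transform = list(year = as.integer)  # "2010" -> 2010L
  )
```

Here year ends up as an integer column rather than character, which is usually more convenient for sorting, filtering, and plotting.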

Is Non-Tidy Data Bad?

Can you work with non-tidy data? Yes.

Can you analyze non-tidy data? Yes.

Can you visualize non-tidy data? Yes.

While tidy data has some nice properties that make it easier to work with, it is not the only way to work with data. You can work with non-tidy data, but it may be more difficult to do so. You may have to write more code or create a completely new script to work with the data.

When we think about ease of use and reproducibility, having a structure to the data can make life easier. If we know going into the coding that the data is tidy, we know that we have an entire library dedicated to working with data that is set up in this particular way.

Does this mean that you will only work with tidy data? No. You will work with non-tidy data. In fact, there may be some projects where you don’t want to turn the data into a tidy dataset – and that’s OK. Your job will be to determine the best way to work with the data that you have.
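As one illustration of a case where a wide (non-tidy) layout can be convenient, the sketch below uses pivot_wider( ) from tidyr to go from the tidy age data back to one column per year, then computes each person's change in age between 2010 and 2012 with a single subtraction. The change column and the df_wide name are assumptions made for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Tidy version of the age data from the example
df_tidy <- tibble(
  name = rep(c("John Smith", "Jane Doe", "Mary Johnson"), each = 3),
  year = rep(c(2010, 2011, 2012), times = 3),
  age  = c(25, 26, 27, 30, 31, 32, 35, 36, 37)
)

# Back to wide format: one column per year, then a column-wise difference
df_wide <- df_tidy %>%
  pivot_wider(names_from = year, values_from = age) %>%
  mutate(change = `2012` - `2010`)
```

Neither layout is "correct" in general. The point is that pivot_longer( ) and pivot_wider( ) let you move between them as the task requires.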

Summary

Practice Problems

In this assignment, you will identify whether a given dataset is in tidy format. Each problem will present a dataset from a different background along with a brief description. Your task is to determine if the dataset is tidy. If it is not tidy, describe why and provide a tidy version of the dataset.

Problem 1: Weather Data

The following dataset contains weather data for three cities.

City	Jan_Temp	Feb_Temp	Mar_Temp
New York	30	32	45
Los Angeles	58	60	65
Chicago	25	28	40

Answer

The dataset is not tidy. The three variables are City, Month, and Temperature. Here is the tidy version:

City	Month	Temperature
New York	Jan	30
Los Angeles	Jan	58
Chicago	Jan	25
New York	Feb	32
Los Angeles	Feb	60
Chicago	Feb	28
New York	Mar	45
Los Angeles	Mar	65
Chicago	Mar	40
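If you want to produce this tidy version in R rather than by hand, a possible sketch with pivot_longer( ) follows. The names_pattern argument keeps only the part of each column name before _Temp, so Jan_Temp becomes Jan. The object names weather and weather_tidy are assumptions for this example.

```r
library(tidyr)
library(tibble)

# The non-tidy weather table from Problem 1
weather <- tibble(
  City = c("New York", "Los Angeles", "Chicago"),
  Jan_Temp = c(30, 58, 25),
  Feb_Temp = c(32, 60, 28),
  Mar_Temp = c(45, 65, 40)
)

weather_tidy <- weather %>%
  pivot_longer(
    cols = ends_with("_Temp"),     # the three month columns
    names_to = "Month",
    names_pattern = "(.*)_Temp",   # capture the text before "_Temp"
    values_to = "Temperature"
  )
```

The rows come out grouped by city rather than by month, so the order differs from the table above. Row order does not affect whether the data is tidy.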

Problem 2: Student Grades

The following dataset contains grades for students in three subjects.

Student	Math	Science	History
Alice	90	88	84
Bob	85	92	78
Charlie	95	89	91

Answer

The dataset is not tidy. The three variables are Student, Subject, and Grade. Here is the tidy version:

Student	Subject	Grade
Alice	Math	90
Bob	Math	85
Charlie	Math	95
Alice	Science	88
Bob	Science	92
Charlie	Science	89
Alice	History	84
Bob	History	78
Charlie	History	91

Problem 3: Sales Data

The following dataset contains monthly sales data for different products.

Product	Jan_Sales	Feb_Sales	Mar_Sales
A	100	110	120
B	150	160	170
C	200	210	220

Answer

The dataset is not tidy. The variables are Product, Month, and Sales. Here is the tidy version:

Product	Month	Sales
A	Jan	100
B	Jan	150
C	Jan	200
A	Feb	110
B	Feb	160
C	Feb	210
A	Mar	120
B	Mar	170
C	Mar	220

Problem 4: Patient Health Data

The following dataset contains health data for patients.

Patient	Height	Weight	Age
John	170	70	30
Jane	160	55	25
Doe	180	80	40

Answer

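The dataset is tidy. Height, Weight, and Age are three different variables, each in its own column, and each row is one patient.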

Problem 5: Financial Data

The following dataset contains quarterly financial data for companies.

Company	Q1_Revenue	Q2_Revenue	Q3_Revenue
X	1000	1100	1200
Y	2000	2100	2200
Z	3000	3100	3200

Answer

The dataset is not tidy. The three variables are Company, Quarter, and Revenue. Here is the tidy version:

Company	Quarter	Revenue
X	Q1	1000
Y	Q1	2000
Z	Q1	3000
X	Q2	1100
Y	Q2	2100
Z	Q2	3100
X	Q3	1200
Y	Q3	2200
Z	Q3	3200

Problem 6: Sports Statistics

The following dataset contains statistics for players in a sports team.

Player	Goals	Assists	Saves
Player1	5	3	2
Player2	8	5	1
Player3	7	4	3

Answer

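The dataset is tidy. Goals, Assists, and Saves are three different variables, each in its own column, and each row is one player.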

Problem 7: Movie Ratings

The following dataset contains ratings for movies by different critics.

Movie	Critic1	Critic2	Critic3
Movie A	4.5	4.0	4.7
Movie B	3.8	3.9	4.0
Movie C	4.7	4.8	4.9

Answer

The dataset is not tidy. The three variables are Movie, Critic, and Rating. Here is the tidy version:

Movie	Critic	Rating
Movie A	Critic1	4.5
Movie B	Critic1	3.8
Movie C	Critic1	4.7
Movie A	Critic2	4.0
Movie B	Critic2	3.9
Movie C	Critic2	4.8
Movie A	Critic3	4.7
Movie B	Critic3	4.0
Movie C	Critic3	4.9

Problem 8: Employee Salary Data

The following dataset contains salary data for employees in different departments.

Employee	Dept1_Salary	Dept2_Salary	Dept3_Salary
E1	50000	52000	54000
E2	55000	57000	59000
E3	60000	62000	64000

Answer

The dataset is not tidy. The variables are Employee, Department, and Salary. Here is the tidy version:

Employee	Department	Salary
E1	Dept1	50000
E2	Dept1	55000
E3	Dept1	60000
E1	Dept2	52000
E2	Dept2	57000
E3	Dept2	62000
E1	Dept3	54000
E2	Dept3	59000
E3	Dept3	64000

Problem 9: Product Reviews

The following dataset contains reviews for products.

Product	Review1	Review2	Review3
Product1	Good	Very Good	Excellent
Product2	Average	Good	Good
Product3	Excellent	Very Good	Good

Answer

The dataset is not tidy. The variables are Product, Review_Number, and Review. Here is the tidy version:

Product	Review_Number	Review
Product1	1	Good
Product2	1	Average
Product3	1	Excellent
Product1	2	Very Good
Product2	2	Good
Product3	2	Very Good
Product1	3	Excellent
Product2	3	Good
Product3	3	Good

Problem 10: Course Enrollment Data

The following dataset contains enrollment data for courses.

Course	Semester1	Semester2	Semester3
Course1	30	35	32
Course2	25	28	26
Course3	20	22	24

Answer

The dataset is not tidy. The variables are Course, Semester, and Enrollment. Here is the tidy version:

Course	Semester	Enrollment
Course1	1	30
Course2	1	25
Course3	1	20
Course1	2	35
Course2	2	28
Course3	2	22
Course1	3	32
Course2	3	26
Course3	3	24
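The answer above stores Semester as a number (1, 2, 3) rather than as the original column name. A hedged sketch of how that could be done with pivot_longer( ): names_prefix strips the text Semester from each column name, and names_transform converts what is left to an integer. The object names are assumptions for this example.

```r
library(tidyr)
library(tibble)

# The non-tidy enrollment table from Problem 10
enrollment <- tibble(
  Course = c("Course1", "Course2", "Course3"),
  Semester1 = c(30, 25, 20),
  Semester2 = c(35, 28, 22),
  Semester3 = c(32, 26, 24)
)

enrollment_tidy <- enrollment %>%
  pivot_longer(
    cols = starts_with("Semester"),
    names_to = "Semester",
    names_prefix = "Semester",                      # "Semester1" -> "1"
    names_transform = list(Semester = as.integer),  # "1" -> 1L
    values_to = "Enrollment"
  )
```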

Problem 11: Sales Data

The following table shows the monthly sales data for three products.

Product	January	February	March
Product A	100	120	130
Product B	150	160	170
Product C	200	220	230

Answer

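The dataset is not tidy. The variables are Product, Month, and Sales. Here is the tidy version.

Product	Month	Sales
Product A	January	100
Product B	January	150
Product C	January	200
Product A	February	120
Product B	February	160
Product C	February	220
Product A	March	130
Product B	March	170
Product C	March	230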

Problem 12: Survey Data

The following table represents the results of a survey where respondents rated their satisfaction with three services.

Respondent	Service1_Satisfaction	Service2_Satisfaction	Service3_Satisfaction
R1	5	4	3
R2	4	3	2
R3	3	2	1

Answer

The dataset is not tidy. The variables are Respondent, Service, and Satisfaction. Here is the tidy version.

Respondent	Service	Satisfaction
R1	Service1	5
R2	Service1	4
R3	Service1	3
R1	Service2	4
R2	Service2	3
R3	Service2	2
R1	Service3	3
R2	Service3	2
R3	Service3	1

Problem 13: Weather Data

The table below shows the temperature readings at different times of the day for a week.

Day	Morning	Noon	Evening
Monday	20	25	22
Tuesday	21	26	23
Wednesday	19	24	21
Thursday	22	27	24
Friday	20	25	22

Answer

The dataset is not tidy. The variables are Day, Time, and Temperature. Here is the tidy version.

Day	Time	Temperature
Monday	Morning	20
Tuesday	Morning	21
Wednesday	Morning	19
Thursday	Morning	22
Friday	Morning	20
Monday	Noon	25
Tuesday	Noon	26
Wednesday	Noon	24
Thursday	Noon	27
Friday	Noon	25
Monday	Evening	22
Tuesday	Evening	23
Wednesday	Evening	21
Thursday	Evening	24
Friday	Evening	22

Problem 14: Exam Scores

The following table lists the scores of students in three subjects.

Student	Math	Science	History
Student1	85	88	80
Student2	90	92	85
Student3	95	96	90

Answer

The dataset is not tidy. The variables are Student, Subject, and Score. Here is the tidy version.

Student	Subject	Score
Student1	Math	85
Student2	Math	90
Student3	Math	95
Student1	Science	88
Student2	Science	92
Student3	Science	96
Student1	History	80
Student2	History	85
Student3	History	90

Problem 15: Hospital Data

The table below shows the number of patients admitted to different wards of a hospital over three months.

Ward	January	February	March
Ward A	30	35	40
Ward B	25	30	35
Ward C	20	25	30

Answer

The dataset is not tidy. The variables are Ward, Month, and Patients. Here is the tidy version.

Ward	Month	Patients
Ward A	January	30
Ward B	January	25
Ward C	January	20
Ward A	February	35
Ward B	February	30
Ward C	February	25
Ward A	March	40
Ward B	March	35
Ward C	March	30

Problem 16: Marketing Data

The following table represents the results of a marketing campaign showing the number of leads generated from different channels.

Channel	Week1	Week2	Week3
Email	50	55	60
Social Media	60	65	70
SEO	70	75	80

Answer

The dataset is not tidy. The variables are Channel, Week, and Leads. Here is the tidy version.

Channel	Week	Leads
Email	Week1	50
Social Media	Week1	60
SEO	Week1	70
Email	Week2	55
Social Media	Week2	65
SEO	Week2	75
Email	Week3	60
Social Media	Week3	70
SEO	Week3	80

Problem 17: Fitness Data

The table below shows the workout times recorded for three athletes on three days of one week.

Athlete	Monday	Wednesday	Friday
Athlete1	30	35	40
Athlete2	40	45	50
Athlete3	50	55	60

Answer

The dataset is not tidy. The variables are Athlete, Day, and Workout_Time. Here is the tidy version.

Athlete	Day	Workout_Time
Athlete1	Monday	30
Athlete2	Monday	40
Athlete3	Monday	50
Athlete1	Wednesday	35
Athlete2	Wednesday	45
Athlete3	Wednesday	55
Athlete1	Friday	40
Athlete2	Friday	50
Athlete3	Friday	60

Problem 18: Financial Data

The following table shows the quarterly profits for three companies.

Company	Q1	Q2	Q3	Q4
Company A	10000	12000	13000	14000
Company B	15000	16000	17000	18000
Company C	20000	22000	23000	24000

Answer

The dataset is not tidy. The variables are Company, Quarter, and Profit. Here is the tidy version.

Company	Quarter	Profit
Company A	Q1	10000
Company B	Q1	15000
Company C	Q1	20000
Company A	Q2	12000
Company B	Q2	16000
Company C	Q2	22000
Company A	Q3	13000
Company B	Q3	17000
Company C	Q3	23000
Company A	Q4	14000
Company B	Q4	18000
Company C	Q4	24000
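To show what the tidy layout buys you here, the sketch below reshapes the profit table with pivot_longer( ) and then totals the four quarters per company with group_by( ) and summarise( ). The names profits, profits_tidy, annual, and Total are assumptions made for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# The non-tidy profit table from Problem 18
profits <- tibble(
  Company = c("Company A", "Company B", "Company C"),
  Q1 = c(10000, 15000, 20000),
  Q2 = c(12000, 16000, 22000),
  Q3 = c(13000, 17000, 23000),
  Q4 = c(14000, 18000, 24000)
)

# One row per company per quarter
profits_tidy <- profits %>%
  pivot_longer(cols = Q1:Q4, names_to = "Quarter", values_to = "Profit")

# Total profit per company across all quarters
annual <- profits_tidy %>%
  group_by(Company) %>%
  summarise(Total = sum(Profit))
```

If a fifth column of data were added later, the summary code would not need to change. Only the cols argument in pivot_longer( ) would.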

Problem 19: Attendance Data

The table below shows the attendance numbers for different events over three days.

Event	Day1	Day2	Day3
Event A	100	110	120
Event B	150	160	170
Event C	200	210	220

Answer

The dataset is not tidy. The variables are Event, Day, and Attendance. Here is the tidy version.

Event	Day	Attendance
Event A	Day1	100
Event A	Day2	110
Event A	Day3	120
Event B	Day1	150
Event B	Day2	160
Event B	Day3	170
Event C	Day1	200
Event C	Day2	210
Event C	Day3	220

Problem 20: Production Data

The following table represents the production output of different products over three shifts.

Product	Shift1	Shift2	Shift3
Product X	300	350	400
Product Y	400	450	500
Product Z	500	550	600

Answer

The dataset is not tidy. The variables are Product, Shift, and Production. Here is the tidy version.

Product	Shift	Production
Product X	Shift1	300
Product X	Shift2	350
Product X	Shift3	400
Product Y	Shift1	400
Product Y	Shift2	450
Product Y	Shift3	500
Product Z	Shift1	500
Product Z	Shift2	550
Product Z	Shift3	600