Advanced Data Piping


We have seen examples of how we can finish off a task by using multiple lines of code. For example, we can read in a dataset, filter it, and then create a new variable all in separate lines of code. However, we can also use the pipe operator %>% to connect these tasks together in a single line of code. This is called “piping”.

Note

Note that there are two ways to use the piping command. You can use %>% from the dplyr package, or you can use the native R pipe |> which is built into R. In this lesson, we will use the %>% operator from dplyr but feel free to use either one.


Let’s look at an example of how we can use piping to accomplish the same task we did before, but in a more concise way. We will use the built-in mtcars dataset to demonstrate this.

mtcars Example

In this example, we first filter the mtcars dataset to only include cars with 6 cylinders. Then, we create a new variable hp_per_wt which is the horsepower divided by weight. Finally, we summarise the data to get the average horsepower per weight.

# Without piping

# filter mtcars to only include vehicles with 6 cylinders and store the result
# in the variable mtcars_filtered

mtcars_filtered <- filter(mtcars, cyl == 6)

# create a new variable hp_per_wt, which is hp divided by wt, and add it to mtcars_filtered
# as a new column called hp_per_wt

mtcars_filtered <- mutate(mtcars_filtered, hp_per_wt = hp / wt)

# use the summarise command on the data to get the average hp_per_wt and store 
# it in the variable mtcars_summary

mtcars_summary <- summarise(mtcars_filtered, avg_hp_per_wt = mean(hp_per_wt))

# print the result

mtcars_summary
  avg_hp_per_wt
1      39.92794

We could achieve the same result by chaining the commands together using the pipe operator %>%.

# With piping
mtcars %>%
  filter(cyl == 6) %>%
  mutate(hp_per_wt = hp / wt) %>%
  summarise(avg_hp_per_wt = mean(hp_per_wt))
  avg_hp_per_wt
1      39.92794

We can see that the result is the same, but the code is more concise and easier to read. It also makes more sense to read as a series of steps:

  • First, start with the mtcars dataset and send (pipe) it to the next command
  • Second, filter the cyl variable and choose the cars with 6 cylinders, then take that output and send it to the next command
  • Third, take the output and use the mutate command to create a new variable called
    hp_per_wt, which is horsepower divided by weight, then send that output to the next command
  • Lastly, use the summarise command to get the average of the new hp_per_wt variable.

As our analysis of data gets more complex, piping can help us keep our code organized and easier to read. You will be creating code that will easily exceed 10 - 20 lines and being able to use piping will help you manage that complexity.

Piping with Logical Operators

If you recall, in the Step 2 - Intro to R section, you were introduced to logical operators.

These operators can be very useful when filtering data. Here is a quick refresher on the common logical operators:

  • AND : &
  • OR : |
  • NOT : !
  • EQUALS :==
  • NOT EQUALS :!=

If needed, please review the Step 2 - Intro to R section to remind yourself the meaning of these operators and how to use them.

We can use these logical operators in conjunction with piping to filter data based on multiple conditions.

airquality Example

In this example, we will use the built-in airquality dataset to demonstrate how to use logical operators with piping. We will filter the dataset to only include days where the temperature is above 80 degrees and the wind speed is below 10 mph.

# Using piping with logical operators to filter airquality dataset

# Recall the & (AND) operator requires both conditions to be TRUE

# We will use the head() command at the end so we only see the first 6 rows of
# the filtered data.

airquality %>%
  filter(Temp > 80 & Wind < 10) %>%
  head()
  Ozone Solar.R Wind Temp Month Day
1    NA     186  9.2   84     6   4
2    NA     220  8.6   85     6   5
3    29     127  9.7   82     6   7
4    NA     273  6.9   87     6   8
5    NA     250  9.2   92     6  12
6    23     148  8.0   82     6  13

If you examine the output, you will see that all the rows have a temperature greater than 80 and a wind speed less than 10 mph.

We could add even more conditions to our filter. For example, we could filter the data to only include days where the temperature is above 80 degrees, the wind speed is below 10 mph, and the ozone level is above 50.

airquality %>%
  filter(Temp > 80 & Wind < 10 & Ozone > 50) %>%
  head()
  Ozone Solar.R Wind Temp Month Day
1   135     269  4.1   84     7   1
2    64     175  4.6   83     7   5
3    77     276  5.1   88     7   7
4    97     267  6.3   92     7   8
5    97     272  5.7   92     7   9
6    85     175  7.4   89     7  10

We can also mix and match the logical operators,as needed. For example, we could filter the data to only include days where the temperature is above 80 degrees OR the wind speed is below 10 mph, AND the ozone level is above 50.

Notice the command we want to carry out. It is in two parts: 1. (Temp > 80 OR Wind < 10) 2. AND Ozone > 50

We will need to use parentheses to group the first part of the command. We will follow that up with the AND operator and the second part of the command.

airquality %>%
  filter( (Temp > 80 | Wind < 10) & Ozone > 50) %>%
  head()
  Ozone Solar.R Wind Temp Month Day
1   115     223  5.7   79     5  30
2    71     291 13.8   90     6   9
3   135     269  4.1   84     7   1
4    64     175  4.6   83     7   5
5    77     276  5.1   88     7   7
6    97     267  6.3   92     7   8

Let’s pause for a moment to look at the output.

  • On the first line, the temperature is 79 which is NOT greater than 80, but the wind speed is 7.4 which IS less than 10. Since we used the OR operator, only one of those conditions needs to be TRUE for that part of the command to be TRUE.

    The ozone level is 115 which is greater than 50, so the entire command is TRUE and that row is included in the output.

  • On the second line, the temperature is 90 which IS greater than 80, so the first part of the command is TRUE even though the Wind speed is 13.8.

    The ozone level is 71 which is greater than 50, so the entire command is TRUE and that row is included in the output.

This verifies for us that for the first part of the command, as long as one of the conditions is TRUE then the entire part is considered TRUE. Then, if the second part is also TRUE, the entire command is TRUE and that row is included in the output.

Important

Notice that the use of parenthesis is very important when mixing logical operators. Without the parentheses, R would not know how to group the conditions and the output would be incorrect.

# Incorrect output without parentheses

airquality %>%
  filter( Temp > 80 | Wind < 10 & Ozone > 50) %>%
  head()
  Ozone Solar.R Wind Temp Month Day
1    45     252 14.9   81     5  29
2   115     223  5.7   79     5  30
3    NA     186  9.2   84     6   4
4    NA     220  8.6   85     6   5
5    29     127  9.7   82     6   7
6    NA     273  6.9   87     6   8

You can clearly see that the output from the second command is different from the first command. In this example, we did NOT use parentheses, and R evaluated the command in a way in which we do not intend. It followed the command as we entered it in, but we were incorrect in how we should have set this up. This will lead us to results that are incorrect so make sure you understand the order of the operations and the use of parenthesis.

We can also use the NOT operator ! to filter data. For example, we could filter the airquality dataset to only include days where the temperature is NOT above 80 degrees.

airquality %>%
  filter( !(Temp > 80) ) %>%
  head()
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

In this example, we used the NOT operator ! to negate the condition Temp > 80. This means that we are filtering the dataset to only include rows where the temperature is less than or equal to 80 degrees.

There are also other logical operators we can use, such as EQUALS == and NOT EQUALS !=. For example, we could filter the airquality dataset to only include days where the month is equal to 6 (June).

airquality %>%
  filter(Month == 6) %>%
  head()
  Ozone Solar.R Wind Temp Month Day
1    NA     286  8.6   78     6   1
2    NA     287  9.7   74     6   2
3    NA     242 16.1   67     6   3
4    NA     186  9.2   84     6   4
5    NA     220  8.6   85     6   5
6    NA     264 14.3   79     6   6

Consider the following example where we use multiple logical operators together.

Example
airquality %>%
  filter( (Temp > 80 | Wind < 10) & Ozone != 75 ) %>%
  head()

What is happening in this command?

  • First, we are filtering the airquality dataset to only include rows where the temperature is greater than 80 degrees OR the wind speed is less than 10 mph.
  • Second, we are using the AND operator to further filter the dataset to only include rows where the ozone level is NOT equal to 75.
  • Finally, we are using the head() command to only show the first 6 rows of the filtered dataset.
airquality %>%
  filter( (Temp > 80 | Wind < 10) & Ozone != 75 ) %>%
  head()
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    23     299  8.6   65     5   7
4     7      NA  6.9   74     5  11
5    16     256  9.7   69     5  12
6    11     290  9.2   66     5  13

Logical Operators with Character Variables

We can also use logical operators with character variables. For example, we could filter the built-in iris dataset to only include rows where the species is either “setosa” or “versicolor”.

iris %>%
  filter(Species == "setosa" | Species == "versicolor") %>%
  head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In the iris dataset, the Species variable is a character variable with three possible values: “setosa”, “versicolor”, and “virginica”.

In the previous example, we used the OR operator | to filter the dataset to only include rows where the species is either “setosa” or “versicolor”. We could also achieve the same result by using the NOT EQUALS operator != to filter the dataset to only include rows where the species is NOT “virginica”.

iris %>%
  filter(Species != "virginica") %>%
  head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

We can also mix and match logical operators with character variables. For example, we could filter the iris dataset to only include rows where the species is “setosa” AND the sepal length is greater than 5.

iris %>%
  filter(Species == "setosa" & Sepal.Length > 5) %>%
  head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          5.4         3.9          1.7         0.4  setosa
3          5.4         3.7          1.5         0.2  setosa
4          5.8         4.0          1.2         0.2  setosa
5          5.7         4.4          1.5         0.4  setosa
6          5.4         3.9          1.3         0.4  setosa

If we wanted all of the entries where the Petal.Length is less than 14 AND the Sepal.Width is greater than 2.5 followed by filtering the Species is “virginica”, we would say the following :

iris %>%
  filter( (Petal.Length < 14 & Sepal.Width > 2.5) & Species == "virginica") %>%
  head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          6.3         3.3          6.0         2.5 virginica
2          5.8         2.7          5.1         1.9 virginica
3          7.1         3.0          5.9         2.1 virginica
4          6.3         2.9          5.6         1.8 virginica
5          6.5         3.0          5.8         2.2 virginica
6          7.6         3.0          6.6         2.1 virginica

Using %in% with Piping and Logical Operators

We can also use the %in% operator to filter data based on multiple values. The %in% command allows us to check if a value is present in a vector of values. For example, we could filter the iris dataset to only include rows where the species is either “setosa” or “versicolor” using the %in% operator.

iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

We can combine %in% with several commands and logical operators to make a longer pipe

iris %>%
  filter( (Species %in% c("setosa", "versicolor")) & (Sepal.Length > 5) ) %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) %>%
  summarise(Avg.Sepal.Ratio = mean(Sepal.Ratio))
  Avg.Sepal.Ratio
1        1.926324

In this example, we are filtering the iris dataset to only include rows where the species is either “setosa” or “versicolor” AND the sepal length is greater than 5. Then, we are creating a new variable called Sepal.Ratio which is the sepal length divided by the sepal width. Finally, we are summarising the data to get the average sepal ratio.

Piping and Grouping

We can also use piping in conjunction with the group_by() command to group data by one or more variables. For example, we could group the iris dataset by species and then calculate the average sepal length for each species.

iris %>%
  group_by(Species) %>%
  summarise(Avg.Sepal.Length = mean(Sepal.Length))
# A tibble: 3 × 2
  Species    Avg.Sepal.Length
  <fct>                 <dbl>
1 setosa                 5.01
2 versicolor             5.94
3 virginica              6.59

In this example, we are grouping the iris dataset by the Species variable and then using the summarise() command to calculate the average sepal length for each species. In other words, we looked at all of the setosa entries and calculated the average sepal length for that species, then we did the same for versicolor and virginica.

We can also combine grouping with filtering and mutating. For example, we could filter the iris dataset to only include rows where the sepal length is greater than 5, group the data by species, and then create a new variable called Sepal.Ratio which is the sepal length divided by the sepal width.

iris %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species) %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) %>%
  summarise(Avg.Sepal.Ratio = mean(Sepal.Ratio))
# A tibble: 3 × 2
  Species    Avg.Sepal.Ratio
  <fct>                <dbl>
1 setosa                1.44
2 versicolor            2.16
3 virginica             2.24

You do need to be careful when grouping data. After you are finished doing all of your calculations with grouped data, you should always ungroup the data before proceeding with any further analysis. This is because if you do not ungroup the data, any further analysis you do will be done on the grouped data, which may not be what you intend.

You can do this by using the ungroup() command. For example, we could group the iris dataset by species, create a new variable called Sepal.Ratio which is the sepal length divided by the sepal width, ungroup the data, and then calculate the average sepal ratio for the entire dataset.

iris %>%
  group_by(Species) %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) %>%
  ungroup() %>%
  summarise(Avg.Sepal.Ratio = mean(Sepal.Ratio))
# A tibble: 1 × 1
  Avg.Sepal.Ratio
            <dbl>
1            1.95

Always make sure you are carrying out the commands as you intend. In the example above, the result of 1.95 is the average of the three grouped values in the previous example (1.44, 2.16, 2.24). If the groups are all the same size, then this is the same as the average of all the Sepal.Ratio values. However, if the groups are not the same size, then this will NOT be the same as the average of all the Sepal.Ratio values.

How could we see the count of each species in the iris dataset? We could use the n() function within the summarise() command to get the count of each species.

iris %>%
  group_by(Species) %>%
  summarise(Count = n())
# A tibble: 3 × 2
  Species    Count
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

This command groups the iris dataset by the Species variable and then uses the summarise() command to create a variable called Count that will calculate the count of each species using the n() command.

So in this case, if I wanted a general average of the Sepal.Ratio for all of the entries in the iris dataset, I could just use mutate() to create the new variable Sepal.Ratio and then use summarise() to calculate the average. Note that since all the groups were the same size, this should give the same result as before.

iris %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) %>%
  summarise(Avg.Sepal.Ratio = mean(Sepal.Ratio))
  Avg.Sepal.Ratio
1        1.953681

Exercises

Now it’s your turn to practice using piping with logical operators and grouping.

  1. Using the built-in mtcars dataset, filter the data to only include cars with 4 cylinders AND a horsepower greater than 100. Then, create a new variable called hp_per_wt which is the horsepower divided by weight. Finally, summarise the data to get the average horsepower per weight.

Think this through one step at a time.

  1. Start with the dataset mtcars
  2. Use the filter() command with the appropriate logical operators to filter the data to only include cars with 4 cylinders AND a horsepower greater than 100.
  3. Use the mutate() command to create a new variable called hp_per_wt which is the horsepower divided by weight.
  4. Use the summarise() command to get the average horsepower per weight.

Solution

# Your code here
mtcars %>%
  filter(cyl == 4 & hp > 100) %>%
  mutate(hp_per_wt = hp / wt) %>%
  summarise(avg_hp_per_wt = mean(hp_per_wt))
  1. Using the built-in iris dataset, filter the data to only include rows where the species is either “versicolor” OR “virginica” AND the petal length is greater than 4. Then, create a new variable called Petal.Ratio which is the petal length divided by the petal width. Finally, summarise the data to get the average petal ratio.
  1. Start with the dataset iris
  2. Use the filter() command with the appropriate logical operators to filter the data to only include rows where the species is either “versicolor” OR “virgin ica” AND the petal length is greater than 4.
  3. Use the mutate() command to create a new variable called Petal.R atio which is the petal length divided by the petal width.
  4. Use the summarise() command to get the average petal ratio.

Solution


iris %>%
  filter( (Species == "versicolor" | Species == "virginica") &
          Petal.Length > 4 ) %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width) %>%
  summarise(Avg.Petal.Ratio = mean(Petal.Ratio))
  1. Using the built-in airquality dataset, filter the data to only include days where the temperature is above 85 degrees OR the wind speed is below 8 mph, AND the ozone level is NOT equal to 100. Then, group the data by month and create a new variable called Temp_Wind_Ratio which is the temperature divided by the wind speed. Finally, summarise the data to get the average temp-wind ratio for each month.
  1. Start with the dataset airquality
  2. Use the filter() command with the appropriate logical operators to filter the data to only include days where the temperature is above 85 degrees OR the wind speed is below 8 mph, AND the ozone level is NOT equal to 100
  3. Use the group_by() command to group the data by month
  4. Use the mutate() command to create a new variable called Temp_Wind _Ratio which is the temperature divided by the wind speed.
  5. Use the summarise() command to get the average temp-wind ratio for each month.

Solution


airquality %>%
  filter( (Temp > 85 | Wind < 8) & Ozone != 100 ) %>%
  group_by(Month) %>%
  mutate(Temp_Wind_Ratio = Temp / Wind) %>%
  summarise(Avg.Temp_Wind_Ratio = mean(Temp_Wind_Ratio))
  1. Using the built-in mtcars dataset, filter the data to only include cars with 8 cylinders OR a weight greater than 3.5. Then, group the data by the number of cylinders and create a new variable called mpg_per_wt which is the miles per gallon divided by weight. Finally, summarise the data to get the average mpg per weight for each group of cylinders.
  1. Start with the dataset mtcars
  2. Use the filter() command with the appropriate logical operators to filter the data to only include cars with 8 cylinders OR a weight greater than 3.5
  3. Use the group_by() command to group the data by the number of cylinders
  4. Use the mutate() command to create a new variable called mpg_per _wt which is the miles per gallon divided by weight.
  5. Use the summarise() command to get the average mpg per weight for each group of cylinders.

Solution

mtcars %>%
  filter(cyl == 8 | wt > 3.5) %>%
  group_by(cyl) %>%
  mutate(mpg_per_wt = mpg / wt) %>%
  summarise(avg_mpg_per_wt = mean(mpg_per_wt))
  1. Using the built-in iris dataset, filter the data to only include rows where the species is NOT “setosa” AND the sepal width is greater than 2.5. Then, group the data by species and create a new variable called Sepal.Area which is the sepal length multiplied by the sepal width. Finally, summarise the data to get the average sepal area for each species.
  1. Start with the dataset iris
  2. Use the filter() command with the appropriate logical operators to filter the data to only include rows where the species is NOT “setosa” AND the sepal width is greater than 2.5
  3. Use the group_by() command to group the data by species
  4. Use the mutate() command to create a new variable called Sepal.Area which is the sepal length multiplied by the sepal width.
  5. Use the summarise() command to get the average sepal area for each species.

Solution

iris %>%
  filter(Species != "setosa" & Sepal.Width > 2.5)
  group_by(Species) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  summarise(Avg.Sepal.Area = mean(Sepal.Area))

Here is a practice problem using %in%

  1. Using the built-in iris dataset, use the %>% command to filter the data to only include rows where the species is in the set of “setosa” and “virginica” AND the petal width is less than 1.5. Then, group the data by species and create a new variable called Petal.Area which is the petal length multiplied by the petal width. Finally, summarise the data to get the average petal area for each species.
  1. Start with the dataset iris
  2. Use the filter() command with the appropriate logical operators to filter the data to only include rows where the species is in the set of “setosa” and “virginica” AND the petal width is less than 1.5
  3. Use the group_by() command to group the data by species
  4. Use the mutate() command to create a new variable called Petal.Area which is the petal length multiplied by the petal width.
  5. Use the summarise() command to get the average petal area for each species.

Solution

iris %>%
  filter(Species %in% c("setosa", "virginica") & Petal.Width < 1.5) %>%
  group_by(Species) %>%
  mutate(Petal.Area = Petal.Length * Petal.Width) %>%
  summarise(Avg.Petal.Area = mean(Petal.Area))