dplyr Package in R

dplyr is one of the most popular R packages for data manipulation. Developed by Hadley Wickham, it provides a consistent set of "verbs": functions that each perform one common data manipulation task on a data frame. This tutorial introduces the core verbs, along with chaining, grouping, and joins.

1. Install and Load the dplyr package:

If you haven't already installed it, do so with:

install.packages("dplyr")

Load the package:

library(dplyr)

2. Basic Verbs in dplyr:

The main verbs in dplyr are:

  • select(): Choose variables (columns) from a dataset.
  • filter(): Filter rows based on some criteria.
  • arrange(): Reorder rows.
  • mutate(): Create or transform columns.
  • summarise(): Collapse rows into summary statistics (summarize() is an equivalent spelling).

3. Working with the dplyr Verbs:

a. select():

Choose specific columns from a dataset.

data(mtcars)
select(mtcars, mpg, hp)
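
Besides bare column names, select() accepts column ranges, negative selections, and helper functions such as starts_with(). The following is a minimal sketch on mtcars; the result names (range_cols, without_cols, d_cols) are illustrative only.

```r
library(dplyr)

# Columns from mpg through hp, in their existing order
range_cols <- select(mtcars, mpg:hp)

# Every column except cyl and disp
without_cols <- select(mtcars, -cyl, -disp)

# Columns whose names start with "d" (disp and drat in mtcars)
d_cols <- select(mtcars, starts_with("d"))

names(range_cols)   # "mpg" "cyl" "disp" "hp"
names(d_cols)       # "disp" "drat"
```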

b. filter():

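Keep only the rows that satisfy one or more conditions. Conditions separated by commas are combined with AND, so the call below returns cars with mpg above 20 and hp below 100.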

filter(mtcars, mpg > 20, hp < 100)
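
Conditions can also be combined explicitly. The sketch below, using illustrative thresholds and result names, shows that the comma form matches &, and how | and %in% express OR-style conditions.

```r
library(dplyr)

# Comma-separated conditions are combined with AND
both <- filter(mtcars, mpg > 20, hp < 100)
same <- filter(mtcars, mpg > 20 & hp < 100)

# Use | for OR, and %in% to match any of several values
either      <- filter(mtcars, mpg > 30 | hp > 250)
four_or_six <- filter(mtcars, cyl %in% c(4, 6))

all.equal(both, same)      # the two AND forms select the same rows
unique(four_or_six$cyl)    # contains only 4 and 6
```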

c. arrange():

Sort rows by one or more columns. Rows are sorted in ascending order by default; wrap a column in desc() for descending order.

arrange(mtcars, mpg)          # Ascending order
arrange(mtcars, desc(mpg))    # Descending order
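
arrange() also accepts several sort keys, with later keys breaking ties in earlier ones. This is a small sketch; the result name sorted is illustrative.

```r
library(dplyr)

# Sort by cyl ascending, then by mpg descending within each cyl value
sorted <- arrange(mtcars, cyl, desc(mpg))

# Inspect the first few rows of the two sort columns
head(select(sorted, cyl, mpg))
```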

d. mutate():

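Create a new column or modify an existing one. mutate() returns a new data frame and leaves the original unchanged, so assign the result if you want to keep it.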

mutate(mtcars, efficiency = mpg/hp)
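
The sketch below illustrates two points about mutate(): the input data frame is not modified, and a column created in one argument can be used by later arguments in the same call. The names cars_eff, cars_eff2, and efficiency_pct are illustrative.

```r
library(dplyr)

# mutate() returns a new data frame; mtcars itself is unchanged
cars_eff <- mutate(mtcars, efficiency = mpg / hp)

"efficiency" %in% names(cars_eff)   # TRUE
"efficiency" %in% names(mtcars)     # FALSE

# Later arguments can refer to columns created earlier in the same call
cars_eff2 <- mutate(mtcars,
                    efficiency     = mpg / hp,
                    efficiency_pct = efficiency * 100)
```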

e. summarise():

Collapse the data into a single row of summary statistics, one column per named expression.

summarise(mtcars, avg_mpg = mean(mpg), max_hp = max(hp))

4. Chaining (%>% operator):

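The %>% (pipe) operator, which dplyr re-exports from the magrittr package, passes the result on its left as the first argument of the function on its right. This lets you chain several operations in the order they are performed.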

mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, hp) %>%
  arrange(desc(hp))

This code filters the rows where mpg is more than 20, selects the mpg and hp columns, and arranges them in descending order based on hp.
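
To see what the pipe does, it can help to compare the chain with the equivalent nested calls. This sketch assumes nothing beyond the verbs already introduced; piped and nested are illustrative names.

```r
library(dplyr)

piped <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, hp) %>%
  arrange(desc(hp))

# The same steps written as nested calls, read from the inside out
nested <- arrange(select(filter(mtcars, mpg > 20), mpg, hp), desc(hp))

all.equal(piped, nested)   # both forms produce the same data frame
```

The piped form reads top to bottom in execution order, which is the main reason it is preferred for longer sequences.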

5. Working with Groups (group_by()):

group_by() splits a data frame into groups defined by one or more columns. Verbs applied afterwards, such as summarise(), then operate on each group separately.

mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

This groups the mtcars dataset by the cyl column and calculates the average mpg for each group of cylinders.
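
A grouped summarise() can compute several statistics at once. The sketch below adds a row count per group with n(); the column names n_cars, avg_mpg, and max_hp are illustrative.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars  = n(),         # number of rows in each group
    avg_mpg = mean(mpg),   # mean mpg within the group
    max_hp  = max(hp)      # largest hp within the group
  )

by_cyl$cyl      # 4 6 8
by_cyl$n_cars   # 11 7 14
```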

6. Joining Data:

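dplyr also supports several types of joins, which work similarly to SQL joins. inner_join() keeps only rows whose key appears in both tables, left_join() keeps every row of the first table, right_join() keeps every row of the second table, and full_join() keeps all rows from both. Columns with no match are filled with NA.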

df1 <- data.frame(id = 1:3, name = c("A", "B", "C"))
df2 <- data.frame(id = 2:4, score = c(85, 90, 78))

inner_join(df1, df2, by = "id")
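
Running all four joins on the same two data frames makes the differences concrete. This sketch reuses df1 and df2 from above; the result names are illustrative.

```r
library(dplyr)

df1 <- data.frame(id = 1:3, name = c("A", "B", "C"))
df2 <- data.frame(id = 2:4, score = c(85, 90, 78))

inner <- inner_join(df1, df2, by = "id")  # ids 2, 3
left  <- left_join(df1, df2, by = "id")   # ids 1, 2, 3; score is NA for id 1
right <- right_join(df1, df2, by = "id")  # ids 2, 3, 4; name is NA for id 4
full  <- full_join(df1, df2, by = "id")   # ids 1, 2, 3, 4
```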

Conclusion:

dplyr simplifies data manipulation in R and makes the resulting code easier to read. This tutorial covered only the basics; the package documentation and vignettes describe many more functions.

  1. dplyr package in R:

    • Description: The dplyr package is a powerful tool for data manipulation in R, providing a set of functions that simplify and streamline common data manipulation tasks.
    • Code:
      # Install and load the dplyr package
      install.packages("dplyr")
      library(dplyr)
      
  2. Data manipulation with dplyr:

    • Description: Use dplyr functions to manipulate data, making tasks like filtering, summarizing, and arranging more intuitive and readable.
    • Code:
      # Sample data frame
      data <- data.frame(
        ID = c(1, 2, 3),
        Name = c("Alice", "Bob", "Charlie"),
        Age = c(25, 30, 22)
      )
      
      # Filter data using dplyr
      filtered_data <- data %>% filter(Age > 25)
      
  3. Filtering data with dplyr in R:

    • Description: Use the filter() function in dplyr to subset data based on specified conditions.
    • Code:
      # Filter data for individuals older than 25
      filtered_data <- data %>% filter(Age > 25)
      
  4. Grouping and summarizing with dplyr:

    • Description: Employ the group_by() and summarize() functions to group data by one or more variables and calculate summary statistics.
    • Code:
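      # Group rows by whether Age exceeds 24 and calculate the mean Age in each group
      # (grouping by Age itself would put each row in its own group)
      summarized_data <- data %>% group_by(Over24 = Age > 24) %>% summarize(mean_age = mean(Age))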
      
  5. Joining tables with dplyr in R:

    • Description: Use left_join(), right_join(), inner_join(), or other join functions in dplyr to combine tables based on common columns.
    • Code:
      # Sample data frames to join
      df1 <- data.frame(ID = c(1, 2), Value1 = c(10, 20))
      df2 <- data.frame(ID = c(2, 3), Value2 = c(30, 40))
      
      # Left join based on ID
      joined_data <- left_join(df1, df2, by = "ID")
      
  6. Mutating variables with dplyr:

    • Description: Use the mutate() function to create or modify variables (columns) in a data frame.
    • Code:
      # Add a new variable 'IsAdult' based on Age
      mutated_data <- data %>% mutate(IsAdult = ifelse(Age >= 18, "Yes", "No"))
      
  7. Arranging and selecting columns with dplyr in R:

    • Description: Use arrange() to sort rows based on one or more columns, and select() to choose specific columns.
    • Code:
      # Arrange data by Age in descending order
      arranged_data <- data %>% arrange(desc(Age))
      
      # Select only 'Name' and 'Age' columns
      selected_data <- data %>% select(Name, Age)