dplyr is one of the most popular packages in R for data manipulation. Developed by Hadley Wickham, it provides a coherent system to operate on datasets using a set of "verbs" that perform common data manipulation tasks. This tutorial will introduce you to some of the core functionalities of dplyr.
Installing and loading the dplyr package:
If you haven't already installed it, do so with:
install.packages("dplyr")
Load the package:
library(dplyr)
Core verbs of dplyr:
The main verbs in dplyr are:
select(): Choose variables (columns) from a dataset.
filter(): Filter rows based on some criteria.
arrange(): Reorder rows.
mutate(): Create or transform columns.
summarise(): Summarize data.
dplyr Verbs:
select(): Choose specific columns from a dataset.
data(mtcars)
select(mtcars, mpg, hp)
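Beyond naming columns one by one, select() also accepts ranges, negation, and selection helpers. The following is a small sketch on mtcars; the result names (range_cols, no_carb, d_cols) are illustrative and not part of the tutorial.

```r
library(dplyr)

# A contiguous range of columns, from mpg through hp
range_cols <- select(mtcars, mpg:hp)

# Every column except carb
no_carb <- select(mtcars, -carb)

# Columns whose names start with "d" (disp and drat in mtcars)
d_cols <- select(mtcars, starts_with("d"))
```

In each case the result is still a data frame, so it can be passed on to the other verbs.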
filter(): Select rows based on a condition. When several conditions are separated by commas, a row is kept only if all of them hold.
filter(mtcars, mpg > 20, hp < 100)
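Comma-separated conditions are combined with "and". For "or" conditions, or for matching against a set of values, the usual R operators can be used inside filter(). A minimal sketch, with illustrative result names (extremes, four_or_six):

```r
library(dplyr)

# Rows where mpg is above 25 OR hp is above 250
extremes <- filter(mtcars, mpg > 25 | hp > 250)

# Rows whose cyl value is one of a given set
four_or_six <- filter(mtcars, cyl %in% c(4, 6))
```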
arrange(): Sort the data based on a column. Use desc() for descending order.
arrange(mtcars, mpg)        # Ascending order
arrange(mtcars, desc(mpg))  # Descending order
mutate(): Create a new column or modify an existing one.
mutate(mtcars, efficiency = mpg/hp)
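A single mutate() call can create several columns, and later expressions can refer to columns created earlier in the same call. The sketch below builds on the efficiency column above; high_eff and cars2 are hypothetical names chosen for illustration.

```r
library(dplyr)

cars2 <- mutate(
  mtcars,
  efficiency = mpg / hp,
  # Uses the efficiency column defined one line above
  high_eff = efficiency > median(efficiency)
)
```

Note that mutate() returns a new data frame; mtcars itself is left unchanged.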
summarise(): Create a summary of your data.
summarise(mtcars, avg_mpg = mean(mpg), max_hp = max(hp))
Chaining (%>% operator):
dplyr offers a chaining mechanism using %>% (pipe operator) to combine multiple operations. Each step passes its result to the next function as that function's first argument.
mtcars %>% filter(mpg > 20) %>% select(mpg, hp) %>% arrange(desc(hp))
This code filters the rows where mpg is more than 20, selects the mpg and hp columns, and sorts the result in descending order of hp.
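To see what the pipe is doing, it can help to compare a piped chain with the equivalent nested calls. This is only an illustrative sketch; the names piped and nested are not from the tutorial.

```r
library(dplyr)

# Piped form: read top to bottom
piped <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, hp) %>%
  arrange(desc(hp))

# Equivalent nested form: read from the inside out
nested <- arrange(select(filter(mtcars, mpg > 20), mpg, hp), desc(hp))

# Both forms should produce the same data frame
same_result <- identical(piped, nested)
```

The piped version performs the same steps but lists them in the order they happen, which is the main readability argument for %>%.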
Grouping (group_by()):
Grouping is a powerful tool in dplyr, allowing you to split data and operate on each group.
mtcars %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))
This groups the mtcars dataset by the cyl column and calculates the average mpg for each group of cylinders.
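A grouped summarise() can compute several statistics at once. The sketch below adds a row count with n() and a maximum next to the mean; the column names n_cars, avg_mpg, and max_hp are illustrative.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars  = n(),        # number of rows in each group
    avg_mpg = mean(mpg),  # mean mpg within each group
    max_hp  = max(hp)     # largest hp within each group
  )
```

The result has one row per distinct value of cyl.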
dplyr also supports various types of joins such as inner_join(), left_join(), right_join(), and full_join(). They work similarly to SQL joins.
df1 <- data.frame(id = 1:3, name = c("A", "B", "C"))
df2 <- data.frame(id = 2:4, score = c(85, 90, 78))
inner_join(df1, df2, by = "id")
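The join types differ in which unmatched rows they keep. As a rough sketch using the same two data frames (redefined here so the block runs on its own), a left join keeps every row of the first table and a full join keeps every id from both; the result names left and full are illustrative.

```r
library(dplyr)

df1 <- data.frame(id = 1:3, name = c("A", "B", "C"))
df2 <- data.frame(id = 2:4, score = c(85, 90, 78))

# Keeps all rows of df1; id 1 has no match in df2, so its score is NA
left <- left_join(df1, df2, by = "id")

# Keeps ids from both tables; id 4 has no match in df1, so its name is NA
full <- full_join(df1, df2, by = "id")
```

The inner join above, by contrast, returns only the ids present in both tables (2 and 3).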
dplyr simplifies data manipulation tasks in R, making the code readable and efficient. While this tutorial covered the basics, dplyr offers much more functionality, which can be explored further in its documentation and vignettes.
dplyr package in R:
# Install and load the dplyr package
install.packages("dplyr")
library(dplyr)
Data manipulation with dplyr:
# Sample data frame
data <- data.frame(
  ID = c(1, 2, 3),
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22)
)
# Filter data using dplyr
filtered_data <- data %>% filter(Age > 25)
Filtering data with dplyr in R:
Use the filter() function in dplyr to subset data based on specified conditions.
# Filter data for individuals older than 25
filtered_data <- data %>% filter(Age > 25)
Grouping and summarizing with dplyr:
Use the group_by() and summarize() functions to group data by one or more variables and calculate summary statistics.
# Group data by Age and calculate mean for each group
# (every Age in the sample data is unique, so each group has a single row)
summarized_data <- data %>%
  group_by(Age) %>%
  summarize(mean_age = mean(Age))
Joining tables with dplyr in R:
Use left_join(), right_join(), inner_join(), or other join functions in dplyr to combine tables based on common columns.
# Sample data frames to join
df1 <- data.frame(ID = c(1, 2), Value1 = c(10, 20))
df2 <- data.frame(ID = c(2, 3), Value2 = c(30, 40))
# Left join based on ID
joined_data <- left_join(df1, df2, by = "ID")
Mutating variables with dplyr:
Use the mutate() function to create or modify variables (columns) in a data frame.
# Add a new variable 'IsAdult' based on Age
mutated_data <- data %>% mutate(IsAdult = ifelse(Age >= 18, "Yes", "No"))
Arranging and selecting columns with dplyr in R:
Use arrange() to sort rows based on one or more columns, and select() to choose specific columns.
# Arrange data by Age in descending order
arranged_data <- data %>% arrange(desc(Age))
# Select only 'Name' and 'Age' columns
selected_data <- data %>% select(Name, Age)