Data frame manipulation is one of the most common tasks in R, because data frames are the de facto data structure for most tabular data operations. This tutorial introduces the basics of data frame manipulation using both base R and the dplyr package.
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 80)
)
df$Gender <- c("Female", "Male", "Male")
Using row and column indices:
subset <- df[1:2, c("Name", "Score")]
Using conditions:
subset <- df[df$Age > 28, ]
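Conditions can also be combined. The following is a minimal sketch, assuming the same example data frame as above; the variable names older_high and older_high_alt are made up for illustration. It filters on two conditions at once and shows base R's subset() function as an alternative spelling of the same filter.

```r
# Self-contained copy of the tutorial's example data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 80)
)

# Combine conditions with & (and) or | (or), and keep only two columns
older_high <- df[df$Age > 28 & df$Score >= 90, c("Name", "Score")]

# subset() expresses the same filter without repeating df$
older_high_alt <- subset(df, Age > 28 & Score >= 90, select = c(Name, Score))

print(older_high)
```

Bracket indexing is usually the safer choice inside functions, while subset() tends to read better in interactive work.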
Sort by Age:
df_sorted <- df[order(df$Age), ]
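order() accepts several keys, and negating a numeric key reverses its direction. The sketch below assumes a slightly extended copy of the example data; the fourth row ("Dana") is hypothetical and exists only to create a tie on Age.

```r
# Example data with a tie on Age (the "Dana" row is made up for illustration)
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Dana"),
  Age = c(25, 30, 35, 30),
  Score = c(85, 90, 80, 95)
)

# Sort by Age descending, breaking ties by Score descending
df_desc <- df[order(-df$Age, -df$Score), ]

print(df_desc)
```

Negation only works for numeric columns; for a single key of any type, order(df$Name, decreasing = TRUE) is an alternative.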
First, you need to install and load the dplyr package:
install.packages("dplyr")
library(dplyr)
df_selected <- select(df, Name, Age)
df_filtered <- filter(df, Age > 28)
Using mutate:
df <- mutate(df, Age_Next_Year = Age + 1)
Using summarise:
# Named age_summary to avoid masking base R's summary() function
age_summary <- summarise(df, Average_Age = mean(Age, na.rm = TRUE))
Using group_by:
grouped_summary <- df %>%
  group_by(Gender) %>%
  summarise(Average_Score = mean(Score, na.rm = TRUE))
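A grouped summary can compute several statistics in one call. The following sketch assumes dplyr is installed and rebuilds the example data frame so that it runs on its own; the output column names (Count, Max_Age) are chosen for illustration.

```r
library(dplyr)

# Self-contained copy of the example data, including the Gender column
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 80),
  Gender = c("Female", "Male", "Male")
)

# One row per Gender, with a row count and two aggregates
grouped_stats <- df %>%
  group_by(Gender) %>%
  summarise(
    Count = n(),
    Average_Score = mean(Score, na.rm = TRUE),
    Max_Age = max(Age, na.rm = TRUE)
  )

print(grouped_stats)
```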
df_sorted <- arrange(df, Age)
For descending order:
df_sorted_desc <- arrange(df, desc(Age))
df1 <- data.frame(A = 1:3, B = 4:6)
df2 <- data.frame(A = 7:9, B = 10:12)
combined <- rbind(df1, df2)
df3 <- data.frame(C = 13:15)
combined_cols <- cbind(df1, df3)
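rbind() expects both data frames to have the same column names. When they differ, dplyr's bind_rows() is one option: it matches columns by name and fills the gaps with NA. This is a sketch under the assumption that dplyr is installed; df4 is a hypothetical data frame that does not appear elsewhere in this tutorial.

```r
library(dplyr)

df1 <- data.frame(A = 1:3, B = 4:6)
# Hypothetical data frame sharing only column A with df1
df4 <- data.frame(A = 7:8, C = c("x", "y"))

# Columns are matched by name; missing cells become NA
combined_mixed <- bind_rows(df1, df4)

print(combined_mixed)
```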
Data frame manipulation is foundational in R, and mastering these operations is essential for data analysis. While base R provides robust data manipulation capabilities, the dplyr package offers a more intuitive syntax and a range of powerful functions, making it a popular choice among R users.
Data wrangling with dplyr in R:
dplyr is a powerful package for data manipulation. It provides verbs like filter(), mutate(), select(), arrange(), group_by(), and summarize() for efficient data wrangling.
# Install and load the dplyr package
install.packages("dplyr")
library(dplyr)
# Example data frame
my_data <- data.frame(
  ID = 1:5,
  Name = c("John", "Alice", "Bob", "Eva", "Mike"),
  Age = c(25, 30, 22, 28, 35)
)
# Filter data
filtered_data <- my_data %>% filter(Age > 25)
# Mutate data
mutated_data <- my_data %>% mutate(AgeGroup = ifelse(Age > 30, "Old", "Young"))
R code for subsetting data frames:
# Subsetting data frame based on conditions
subset_data <- my_data[my_data$Age > 25, ]
Joining and merging data frames in R:
# Creating two data frames
df1 <- data.frame(ID = 1:3, Value = c("A", "B", "C"))
df2 <- data.frame(ID = 2:4, Score = c(10, 15, 20))
# Inner join (dplyr)
inner_join(df1, df2, by = "ID")
# Full outer join (base R)
merge(df1, df2, by = "ID", all = TRUE)
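The join types differ in which unmatched rows they keep. The sketch below uses only base R's merge() and rebuilds the two example data frames under the hypothetical names left_df and right_df so that it is self-contained.

```r
# Copies of the two example data frames (names chosen for illustration)
left_df <- data.frame(ID = 1:3, Value = c("A", "B", "C"))
right_df <- data.frame(ID = 2:4, Score = c(10, 15, 20))

# Inner join: only IDs present in both data frames
inner <- merge(left_df, right_df, by = "ID")

# Left join: every row of left_df, with NA where right_df has no match
left <- merge(left_df, right_df, by = "ID", all.x = TRUE)

# Full outer join: every ID from either data frame
full <- merge(left_df, right_df, by = "ID", all = TRUE)

print(left)
```

Here the inner join keeps IDs 2 and 3, the left join also keeps ID 1 with a missing Score, and the full join additionally keeps ID 4 with a missing Value.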
Reshaping and transforming data frames in R:
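# Reshaping data frame: spread() comes from the tidyr package
# (tidyr now recommends pivot_wider() as its replacement)
library(tidyr)
reshaped_data <- spread(my_data, key = Age, value = Name)
# Transposing data frame: t() returns a matrix, not a data frame;
# with mixed column types, every value is converted to character
transposed_data <- t(my_data)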
Filtering and selecting columns in R data frames:
# Filtering rows based on a condition
filtered_rows <- my_data %>% filter(Age > 25)
# Selecting specific columns
selected_cols <- my_data %>% select(Name, Age)
Sorting and ordering data frames in R:
# Sorting data frame by Age in ascending order
sorted_data <- my_data %>% arrange(Age)
# Sorting data frame by Age in descending order
sorted_desc_data <- my_data %>% arrange(desc(Age))
Handling missing values in R data frames:
# Removing rows with missing values
cleaned_data <- na.omit(my_data)
# Imputing missing values with mean
my_data$Age[is.na(my_data$Age)] <- mean(my_data$Age, na.rm = TRUE)
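The example data frame used so far contains no missing values, so the effect of these calls is easier to see on data that does. The following sketch assumes a variant of the example data with two NA ages inserted purely for illustration; the name data_with_na is hypothetical.

```r
# Variant of the example data with two missing ages (illustrative values)
data_with_na <- data.frame(
  ID = 1:5,
  Name = c("John", "Alice", "Bob", "Eva", "Mike"),
  Age = c(25, NA, 22, 28, NA)
)

# Count missing values per column
na_counts <- colSums(is.na(data_with_na))

# Option 1: drop every row that contains at least one NA
cleaned <- na.omit(data_with_na)

# Option 2: replace missing ages with the mean of the observed ages
imputed <- data_with_na
imputed$Age[is.na(imputed$Age)] <- mean(imputed$Age, na.rm = TRUE)

print(na_counts)
print(imputed)
```

Dropping rows shrinks the data set from five rows to three, while mean imputation keeps all five rows at the cost of reducing the variance of Age.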
Aggregating and summarizing data frames in R:
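# Aggregating data by age group and calculating the mean age
# (taking the mean of the character column Name would return NA with a warning)
summarised_data <- my_data %>%
  mutate(AgeGroup = ifelse(Age > 30, "Old", "Young")) %>%
  group_by(AgeGroup) %>%
  summarise(mean_age = mean(Age, na.rm = TRUE))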
Working with time series data frames in R:
Packages such as zoo or xts are commonly used for time series data.
# Creating a time series data frame
time_series_data <- data.frame(
  Date = seq(as.Date("2022-01-01"), as.Date("2022-01-05"), by = "days"),
  Value = c(10, 15, 20, 25, 30)
)
# Converting to time series object
library(xts)
time_series_xts <- xts(time_series_data$Value, order.by = as.Date(time_series_data$Date))
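Once the data is an xts object, it can be indexed by date. The sketch below assumes the xts package is installed and rebuilds the series under the hypothetical name ts_xts; it selects a date range with an ISO 8601 style string and computes day-over-day changes with diff().

```r
library(xts)

# Self-contained copy of the example series
dates <- seq(as.Date("2022-01-01"), as.Date("2022-01-05"), by = "day")
ts_xts <- xts(c(10, 15, 20, 25, 30), order.by = dates)

# Select a date range using "start/end" string indexing
mid_window <- ts_xts["2022-01-02/2022-01-04"]

# Day-over-day change; the first element is NA because it has no predecessor
daily_change <- diff(ts_xts)

print(mid_window)
print(daily_change)
```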