DataFrame Manipulation in R

Data frames are the standard structure for tabular data in R, so manipulating them is one of the most common tasks you will perform. This tutorial introduces the basics of data frame manipulation using both base R and the dplyr package.

1. Base R Methods

1.1. Creating a Data Frame

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 80)
)
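Before manipulating a data frame, it often helps to check its shape and column types. A minimal sketch using base R inspection functions on the same data; the variable names `n_rows`, `n_cols`, and `col_types` are only illustrative:

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 80)
)

n_rows <- nrow(df)              # number of rows
n_cols <- ncol(df)              # number of columns
col_types <- sapply(df, class)  # class of each column, by column name

str(df)      # compact overview of the structure
head(df, 2)  # first two rows
```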

1.2. Adding a New Column

df$Gender <- c("Female", "Male", "Male")
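A new column can also be computed from existing columns rather than typed in by hand. The sketch below rebuilds the data frame from section 1.1 and derives two hypothetical columns, `Passed` and `Age_Months`; the 85-point threshold is an assumption made only for illustration.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 80)
)

# Logical column derived from a comparison (threshold is illustrative)
df$Passed <- df$Score >= 85

# transform() adds a computed column and returns the modified data frame
df <- transform(df, Age_Months = Age * 12)
```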

1.3. Subsetting a Data Frame

Using row and column indices:

first_two <- df[1:2, c("Name", "Score")]

Using conditions (the result is stored under a name that does not mask the base R subset() function):

older_than_28 <- df[df$Age > 28, ]
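Conditions can be combined with `&` (and) or `|` (or), and row and column selection can happen in one step. The sketch below shows the bracket form next to the base R `subset()` function; the result names `high_scorers` and `high_scorers2` are illustrative. One difference worth knowing: if the condition evaluates to NA for some row, bracket indexing returns a row of NAs, while `subset()` drops that row.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 80)
)

# Rows where Age > 28 AND Score >= 85, keeping only two columns
high_scorers <- df[df$Age > 28 & df$Score >= 85, c("Name", "Score")]

# The same selection with subset(); column names are used without df$
high_scorers2 <- subset(df, Age > 28 & Score >= 85, select = c(Name, Score))
```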

1.4. Ordering a Data Frame

Sort by Age:

df_sorted <- df[order(df$Age), ]
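`order()` also handles descending sorts and multiple sort keys. A short sketch follows; the fourth row ("Dana") is a hypothetical addition so that two rows share the same Age and the tie-breaking key has something to do.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Dana"),  # "Dana" is a made-up extra row
  Age = c(25, 30, 35, 30),
  Score = c(85, 90, 80, 95)
)

# Sort by Age, largest first
df_desc <- df[order(df$Age, decreasing = TRUE), ]

# Sort by Age ascending; break ties by Score descending
# (negating a numeric column reverses its sort direction)
df_multi <- df[order(df$Age, -df$Score), ]
```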

2. dplyr Methods

First, install the dplyr package (this is only needed once) and load it in each session:

install.packages("dplyr")  # one-time setup; skip if dplyr is already installed
library(dplyr)

2.1. Selecting Columns

df_selected <- select(df, Name, Age)

2.2. Filtering Rows

df_filtered <- filter(df, Age > 28)

2.3. Adding New Columns

Using mutate:

df <- mutate(df, Age_Next_Year = Age + 1)

2.4. Summarizing Data

Using summarise:

age_summary <- summarise(df, Average_Age = mean(Age, na.rm = TRUE))

2.5. Grouping and Summarizing

Using group_by:

grouped_summary <- df %>%
  group_by(Gender) %>%
  summarise(Average_Score = mean(Score, na.rm = TRUE))
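A single `summarise()` call can compute several statistics per group. The sketch below assumes dplyr is installed and rebuilds the data frame with the Gender column from section 1.2; the output column names `Count` and `Max_Age` are illustrative.

```r
library(dplyr)

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(85, 90, 80),
  Gender = c("Female", "Male", "Male")
)

gender_stats <- df %>%
  group_by(Gender) %>%
  summarise(
    Count = n(),                                # rows in each group
    Average_Score = mean(Score, na.rm = TRUE),  # mean score per group
    Max_Age = max(Age)                          # oldest member per group
  )
```

Each group produces one row in the result, with one column per summary expression.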

2.6. Arranging (Sorting) Data

df_sorted <- arrange(df, Age)

For descending order:

df_sorted_desc <- arrange(df, desc(Age))

3. Combining Data Frames

3.1. Binding Rows

df1 <- data.frame(A = 1:3, B = 4:6)
df2 <- data.frame(A = 7:9, B = 10:12)

combined <- rbind(df1, df2)

3.2. Binding Columns

df3 <- data.frame(C = 13:15)

combined_cols <- cbind(df1, df3)
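`rbind()` expects both data frames to have the same column names. When the columns only partly overlap, dplyr's `bind_rows()` is one option: it matches columns by name and fills the gaps with NA. A sketch, assuming dplyr is installed; `df_extra` is a hypothetical data frame invented for this example.

```r
library(dplyr)

df1 <- data.frame(A = 1:3, B = 4:6)
df_extra <- data.frame(A = 10:11, C = c("x", "y"))  # hypothetical: has C but no B

# Columns are matched by name; missing cells become NA
stacked <- bind_rows(df1, df_extra)
```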

Conclusion

Data frame manipulation is foundational in R, and mastering these operations is essential for data analysis. While base R provides robust data manipulation capabilities, the dplyr package offers a more intuitive syntax and a range of powerful functions, making it a popular choice among R users.

  1. Data wrangling with dplyr in R:

    • dplyr is a powerful package for data manipulation. It provides verbs like filter(), mutate(), select(), arrange(), group_by(), and summarize() for efficient data wrangling.
    # Install and load the dplyr package
    install.packages("dplyr")
    library(dplyr)
    
    # Example data frame
    my_data <- data.frame(ID = 1:5,
                          Name = c("John", "Alice", "Bob", "Eva", "Mike"),
                          Age = c(25, 30, 22, 28, 35))
    
    # Filter data
    filtered_data <- my_data %>% filter(Age > 25)
    
    # Mutate data
    mutated_data <- my_data %>% mutate(AgeGroup = ifelse(Age > 30, "Old", "Young"))
    
  2. R code for subsetting data frames:

    # Subsetting data frame based on conditions
    subset_data <- my_data[my_data$Age > 25, ]
    
  3. Joining and merging data frames in R:

    # Creating two data frames
    df1 <- data.frame(ID = 1:3, Value = c("A", "B", "C"))
    df2 <- data.frame(ID = 2:4, Score = c(10, 15, 20))
    
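    # Inner join with dplyr: keeps only the IDs present in both data frames (2 and 3)
    inner_join(df1, df2, by = "ID")
    
    # Full outer join with base R: keeps all IDs (1 to 4) and fills the gaps with NA
    merge(df1, df2, by = "ID", all = TRUE)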
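Between the inner and full joins sit the left and right joins, which keep every row from one side only. A sketch using base R `merge()` on the same two data frames; dplyr offers `left_join()` and `right_join()` for the same purpose. The result names are illustrative.

```r
df1 <- data.frame(ID = 1:3, Value = c("A", "B", "C"))
df2 <- data.frame(ID = 2:4, Score = c(10, 15, 20))

# Left join: keep every row of df1; Score is NA where df2 has no match
left_joined <- merge(df1, df2, by = "ID", all.x = TRUE)

# Right join: keep every row of df2; Value is NA where df1 has no match
right_joined <- merge(df1, df2, by = "ID", all.y = TRUE)
```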
    
  4. Reshaping and transforming data frames in R:

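    # Reshaping to wide format: one column per Age value, holding the matching Name
    # (pivot_wider() comes from the tidyr package and supersedes spread())
    library(tidyr)
    reshaped_data <- pivot_wider(my_data, names_from = Age, values_from = Name)
    
    # Transposing a data frame; with mixed column types t() returns a character matrix
    transposed_data <- t(my_data)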
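Reshaping usually means moving between long format (one row per observation) and wide format (one column per category). A sketch of the round trip with tidyr's `pivot_wider()` and `pivot_longer()`, assuming tidyr is installed; the `scores_long` table is hypothetical data made up for this example.

```r
library(tidyr)

# Hypothetical long-format data: one row per student and subject
scores_long <- data.frame(
  ID = c(1, 1, 2, 2),
  Subject = c("Math", "Art", "Math", "Art"),
  Score = c(90, 80, 70, 60)
)

# Long to wide: one column per Subject
scores_wide <- pivot_wider(scores_long, names_from = Subject, values_from = Score)

# Wide back to long
scores_back <- pivot_longer(scores_wide, cols = c(Math, Art),
                            names_to = "Subject", values_to = "Score")
```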
    
  5. Filtering and selecting columns in R data frames:

    # Filtering rows based on a condition
    filtered_rows <- my_data %>% filter(Age > 25)
    
    # Selecting specific columns
    selected_cols <- my_data %>% select(Name, Age)
    
  6. Sorting and ordering data frames in R:

    # Sorting data frame by Age in ascending order
    sorted_data <- my_data %>% arrange(Age)
    
    # Sorting data frame by Age in descending order
    sorted_desc_data <- my_data %>% arrange(desc(Age))
    
  7. Handling missing values in R data frames:

    # Removing rows with missing values
    cleaned_data <- na.omit(my_data)
    
    # Imputing missing values with mean
    my_data$Age[is.na(my_data$Age)] <- mean(my_data$Age, na.rm = TRUE)
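The example data frame above contains no missing values, so these calls leave it unchanged. The sketch below uses a hypothetical `people` data frame with one missing Age to show the effect of counting, dropping, and imputing NAs with base R.

```r
# Hypothetical data with one missing Age
people <- data.frame(
  Name = c("John", "Alice", "Bob", "Eva"),
  Age = c(25, NA, 22, 28)
)

# Count missing values per column
na_counts <- colSums(is.na(people))

# Keep only rows with no missing values
complete_rows <- people[complete.cases(people), ]

# Replace missing Age values with the mean of the observed ones
imputed <- people
imputed$Age[is.na(imputed$Age)] <- mean(imputed$Age, na.rm = TRUE)
```

Mean imputation is a simple default; whether it is appropriate depends on why the values are missing.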
    
  8. Aggregating and summarizing data frames in R:

    # Aggregating by AgeGroup (added in mutated_data above) and calculating the mean Age
    summarised_data <- mutated_data %>% group_by(AgeGroup) %>% summarise(mean_age = mean(Age))
    
  9. Working with time series data frames in R:

    • Use time-related functions and packages like zoo or xts for time series data.
    # Creating a time series data frame
    time_series_data <- data.frame(Date = seq(as.Date("2022-01-01"), as.Date("2022-01-05"), by = "days"),
                                    Value = c(10, 15, 20, 25, 30))
    
    # Converting to an xts time series object (requires the xts package to be installed)
    library(xts)
    time_series_xts <- xts(time_series_data$Value, order.by = time_series_data$Date)
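Several common time series operations also work directly on a data frame with a Date column, without any extra package. A base R sketch on the same values; the data frame name `ts_df` and the columns `Change` and `Month` are illustrative.

```r
ts_df <- data.frame(
  Date = seq(as.Date("2022-01-01"), as.Date("2022-01-05"), by = "day"),
  Value = c(10, 15, 20, 25, 30)
)

# Make sure rows are in chronological order before computing differences
ts_df <- ts_df[order(ts_df$Date), ]

# Day-over-day change; the first row has no previous value
ts_df$Change <- c(NA, diff(ts_df$Value))

# Year-month label, useful as a grouping key
ts_df$Month <- format(ts_df$Date, "%Y-%m")

# Filter by date: rows strictly after 2 January 2022
after_jan2 <- ts_df[ts_df$Date > as.Date("2022-01-02"), ]
```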