Data Wrangling - Data Transformation in R

Data wrangling, a term often used interchangeably with data transformation, means cleaning, structuring, and enriching raw data until it is in a format suitable for analysis. R's tidyverse suite of packages, and dplyr in particular, provides a small, consistent set of functions for this work.

1. Installing and Loading Required Packages:

# Install once per R installation
install.packages("tidyverse")
# Load in every new session
library(tidyverse)

2. Basic Data Manipulation with dplyr:

The examples below use the mpg dataset, which ships with the ggplot2 package (attached as part of the tidyverse):

data(mpg)
head(mpg)

a. Selecting Columns with select():

Keep only the manufacturer, model, and hwy columns:

select(mpg, manufacturer, model, hwy)

b. Filtering Rows with filter():

Keep only four-cylinder cars with highway mileage above 30:

filter(mpg, cyl == 4 & hwy > 30)
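
As a small sketch of other ways to write filter conditions (the result names below are illustrative and not part of the original tutorial): comma-separated conditions are combined with AND, %in% matches a column against several values, and | expresses OR.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Comma-separated conditions are combined with AND,
# so this is equivalent to cyl == 4 & hwy > 30
efficient_4cyl <- filter(mpg, cyl == 4, hwy > 30)

# %in% keeps rows whose value matches any element of the vector
four_or_six <- filter(mpg, cyl %in% c(4, 6))

# | keeps rows that satisfy at least one of the conditions
city_or_hwy <- filter(mpg, cty > 25 | hwy > 35)
```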

c. Arranging Rows with arrange():

Sort data by the hwy column:

arrange(mpg, hwy)

For descending order:

arrange(mpg, desc(hwy))

d. Creating or Modifying Columns with mutate():

Create a new column holding the ratio of city (cty) to highway (hwy) miles per gallon:

mutate(mpg, mpg_ratio = cty / hwy)
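
A single mutate() call can create several columns, and later columns may refer to ones created earlier in the same call. The sketch below assumes an arbitrary cut-off of 25 miles per gallon; the column names avg_mpg and efficiency are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

mpg_extra <- mpg %>%
  mutate(
    mpg_ratio  = cty / hwy,
    # Average of city and highway mileage
    avg_mpg    = (cty + hwy) / 2,
    # avg_mpg was defined one line above and can be used immediately;
    # the threshold of 25 is an arbitrary illustrative choice
    efficiency = if_else(avg_mpg >= 25, "high", "standard")
  )

head(select(mpg_extra, model, mpg_ratio, avg_mpg, efficiency))
```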

e. Summarizing Data with summarise():

Calculate the mean highway miles per gallon:

summarise(mpg, mean_hwy = mean(hwy, na.rm = TRUE))

f. Grouped Operations with group_by():

Calculate the mean highway miles per gallon for each number of cylinders:

mpg %>%
  group_by(cyl) %>%
  summarise(mean_hwy = mean(hwy, na.rm = TRUE))
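
Grouping also works on more than one column, and summarise() can return several statistics at once. The following is a sketch rather than part of the original example; it assumes dplyr 1.0.0 or later for the .groups argument, and the name by_cyl_drv is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

by_cyl_drv <- mpg %>%
  group_by(cyl, drv) %>%
  summarise(
    n_cars   = n(),                       # rows in each group
    mean_hwy = mean(hwy, na.rm = TRUE),
    max_hwy  = max(hwy, na.rm = TRUE),
    .groups  = "drop"                     # return an ungrouped tibble
  )

by_cyl_drv
```

Dropping the grouping at the end avoids surprises when the result is passed to further dplyr calls.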

3. Joining Data:

Join operations combine data from multiple datasets. Let's consider two example data frames:

df1 <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = c(2, 3, 4), score = c(90, 85, 70))

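a. Inner Join:

Keep only the rows whose id appears in both data frames (here, ids 2 and 3):

inner_join(df1, df2, by = "id")

b. Left Join:

Keep every row of df1 and fill score with NA where df2 has no matching id (here, id 1):

left_join(df1, df2, by = "id")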
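
Two further dplyr joins are often useful alongside these. The sketch below recreates the same two data frames so that it runs on its own; the result names all_rows and unmatched are illustrative.

```r
library(dplyr)

df1 <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = c(2, 3, 4), score = c(90, 85, 70))

# full_join keeps every id from both sides (1 to 4) and fills gaps with NA
all_rows <- full_join(df1, df2, by = "id")

# anti_join keeps the rows of df1 that have no match in df2 (id 1)
unmatched <- anti_join(df1, df2, by = "id")

all_rows
unmatched
```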

4. Chaining Operations:

You can chain multiple operations using the pipe (%>%) operator, which passes the result on its left as the first argument of the function on its right:

mpg %>%
  filter(cyl == 4) %>%
  mutate(mpg_ratio = cty / hwy) %>%
  select(manufacturer, model, mpg_ratio) %>%
  arrange(desc(mpg_ratio))
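
To see what the pipe is doing, it can help to compare a piped chain with the equivalent nested calls. This is a minimal sketch with illustrative object names; both versions perform the same steps, but the piped one reads top to bottom.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Piped version: each step receives the previous result as its first argument
piped <- mpg %>%
  filter(cyl == 4) %>%
  select(manufacturer, model, hwy) %>%
  arrange(desc(hwy))

# Nested version: the same steps, read from the inside out
nested <- arrange(
  select(filter(mpg, cyl == 4), manufacturer, model, hwy),
  desc(hwy)
)

# Compare the two results
all.equal(piped, nested)
```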

5. Working with String Data:

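The stringr package, which library(tidyverse) attaches automatically, offers string manipulation functions. Loading it explicitly, as below, is only needed when the tidyverse itself is not attached: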

library(stringr)

# Convert to uppercase
str_to_upper(c("a", "b", "c"))

# Replace every occurrence of a pattern in a string
str_replace_all("abcabc", "a", "z")
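
stringr functions are vectorised, so they can be used inside dplyr verbs. The sketch below applies them to the mpg dataset; the pattern "4wd" and the result name four_wheel are illustrative choices, not part of the original tutorial.

```r
library(dplyr)
library(stringr)
library(ggplot2)  # provides the mpg dataset

four_wheel <- mpg %>%
  # Keep rows whose model name contains the text "4wd"
  filter(str_detect(model, "4wd")) %>%
  # Capitalise the first letter of each word in the manufacturer name
  mutate(manufacturer = str_to_title(manufacturer)) %>%
  # One row per manufacturer/model combination
  distinct(manufacturer, model)

four_wheel
```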

6. Working with Date Data:

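The lubridate package makes working with dates easier. It is attached by library(tidyverse) from tidyverse 2.0.0 onward; with older versions, load it explicitly: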

library(lubridate)

# Parse a year-month-day string into a Date
ymd("20230101")

# Extract the day of the month
day(ymd("20230515"))
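
A slightly longer sketch, using the same two dates, shows a second parser, more component extractors, and simple date arithmetic. The variable names are illustrative.

```r
library(lubridate)

start <- ymd("20230101")
end   <- ymd("20230515")

# Parser names follow the order of the components in the string:
# dmy() expects day, then month, then year
same_end <- dmy("15-05-2023")

# Extract individual components
year(end)
month(end)

# Subtracting two dates gives a difftime; convert it to a plain number of days
days_between <- as.numeric(end - start, units = "days")
days_between
```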

Conclusion:

This tutorial covered the core data wrangling operations in the tidyverse: selecting, filtering, arranging, mutating, summarising, grouping, joining, chaining with the pipe, and basic string and date handling. For more depth and advanced techniques, consult each package's documentation and vignettes.

  1. Data transformation techniques in R:

    • Data transformation involves modifying, aggregating, or reformatting data.
    # Using dplyr to add a log-transformed column
    # (original_data and NumericColumn are placeholders for your own data)
    library(dplyr)
    transformed_data <- original_data %>%
      mutate(TransformedColumn = log(NumericColumn))
    
  2. Tidying data in R:

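    • Tidying data means arranging it so that each variable is a column, each observation is a row, and each value has its own cell.
    # Using tidyr to stack every column except ID into Key/Value pairs
    # (pivot_longer() supersedes the older gather())
    library(tidyr)
    tidy_data <- original_data %>%
      pivot_longer(cols = -ID, names_to = "Key", values_to = "Value")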
    
  3. Reshaping data in R:

    • Reshaping data changes its layout between wide and long form with pivot_longer() and pivot_wider(), which supersede the older gather() and spread().
    # Using tidyr to reshape wide data into long form
    library(tidyr)
    long_data <- pivot_longer(
      original_wide_data,
      cols = -ID,
      names_to = "Variable",
      values_to = "Value"
    )
    
  4. Data cleaning and preprocessing in R:

    • Data cleaning involves handling missing values and outliers and otherwise ensuring data quality.
    # Using dplyr to drop rows with a missing NumericColumn, then sort by ID
    library(dplyr)
    cleaned_data <- original_data %>%
      filter(!is.na(NumericColumn)) %>%
      arrange(ID)
    
  5. Manipulating data frames in R:

    • Use dplyr functions for efficient manipulation of data frames.
    # Keep two columns, keep rows where Column1 > 5, then add a derived column
    library(dplyr)
    manipulated_data <- original_data %>%
      select(Column1, Column2) %>%
      filter(Column1 > 5) %>%
      mutate(NewColumn = Column2 * 2)
    
  6. Data wrangling with dplyr and tidyr:

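    • dplyr and tidyr are commonly combined in a single pipeline for data wrangling tasks.
    # Reshape to long form, drop missing values, then return to wide form
    # (assumes the non-ID columns share a common type).
    # Note: pivot_wider() fills the dropped cells with NA again,
    # so only rows whose values were all missing disappear.
    library(dplyr)
    library(tidyr)
    wrangled_data <- original_data %>%
      pivot_longer(cols = -ID, names_to = "Key", values_to = "Value") %>%
      filter(!is.na(Value)) %>%
      pivot_wider(names_from = Key, values_from = Value)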
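
The snippets in this list use placeholder objects such as original_data. As a runnable sketch, the example below builds a small hypothetical data frame (the ID, height, and weight columns are invented for illustration) and walks through a log transformation, a wide-to-long reshape that drops missing values, and a reshape back to wide form.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per ID, one column per measurement
original_data <- tibble(
  ID     = 1:3,
  height = c(150, NA, 172),
  weight = c(55, 70, NA)
)

# Transformation: add a log-transformed column (NA stays NA)
transformed_data <- original_data %>%
  mutate(log_height = log(height))

# Wide to long, dropping missing values during the reshape
long_data <- original_data %>%
  pivot_longer(
    cols = -ID,
    names_to = "Key",
    values_to = "Value",
    values_drop_na = TRUE
  )

# Long back to wide: cells with no long-form row are filled with NA
wide_again <- long_data %>%
  pivot_wider(names_from = Key, values_from = Value)

long_data
wide_again
```

In this sketch the long form has one row per non-missing measurement, while the wide form always has one row per ID that still has at least one measurement.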