Data wrangling, also known as data transformation, involves cleaning, structuring, and enriching raw data into a format that's suitable for analysis. R, with its tidyverse suite of packages, especially dplyr, makes data wrangling intuitive and efficient.
install.packages("tidyverse")
library(tidyverse)
The examples below apply the core dplyr verbs to the mpg dataset from the ggplot2 package:
data(mpg)
head(mpg)
select(): Keep only the named columns:
select(mpg, manufacturer, model, hwy)
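filter(): Keep only the rows that satisfy a condition, here four-cylinder cars with highway mileage above 30:
filter(mpg, cyl == 4 & hwy > 30)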
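As a rough illustration of how filter() combines conditions, the sketch below uses a small made-up tibble (the cars table and its values are hypothetical stand-ins for mpg, not data from the tutorial). It shows that comma-separated conditions behave like &, and that %in% and | can express alternatives.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for mpg
cars <- tibble(
  model = c("a4", "civic", "corolla", "mustang"),
  cyl   = c(4, 4, 4, 8),
  hwy   = c(29, 33, 35, 23)
)

# Conditions separated by commas are combined with AND,
# so this is equivalent to cyl == 4 & hwy > 30
efficient <- filter(cars, cyl == 4, hwy > 30)

# %in% matches any of several values; | expresses OR
small_or_thrifty <- filter(cars, cyl %in% c(4, 6) | hwy > 30)

print(efficient)
print(small_or_thrifty)
```

Only the civic and corolla rows pass the first filter; the second keeps every row except the eight-cylinder mustang.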
arrange(): Sort data by the hwy column:
arrange(mpg, hwy)
For descending order:
arrange(mpg, desc(hwy))
mutate(): Create a new column holding the ratio of city (cty) to highway (hwy) miles per gallon:
mutate(mpg, mpg_ratio = cty / hwy)
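A single mutate() call can create several columns, and later columns may refer to ones defined earlier in the same call. The sketch below assumes a hypothetical two-row tibble; the avg_mpg and thrifty columns and the threshold of 30 are illustrative choices, not part of the mpg dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with city and highway mileage
cars <- tibble(
  model = c("a4", "civic"),
  cty   = c(18, 28),
  hwy   = c(27, 35)
)

cars2 <- cars %>%
  mutate(
    mpg_ratio = cty / hwy,                       # city-to-highway ratio
    avg_mpg   = (cty + hwy) / 2,                 # simple average of the two
    thrifty   = if_else(avg_mpg > 30, "yes", "no")  # uses avg_mpg defined just above
  )

print(cars2)
```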
summarise(): Calculate the mean highway miles per gallon:
summarise(mpg, mean_hwy = mean(hwy, na.rm = TRUE))
group_by(): Calculate the mean highway miles per gallon for each number of cylinders:
mpg %>%
  group_by(cyl) %>%
  summarise(mean_hwy = mean(hwy, na.rm = TRUE))
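Grouped summaries often compute several statistics at once. The following sketch, on a hypothetical five-row tibble, counts rows per group with n() and drops the grouping afterwards with .groups = "drop"; the column names n_cars and max_hwy are made up for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: cylinders and highway mileage
cars <- tibble(
  cyl = c(4, 4, 6, 6, 8),
  hwy = c(30, 34, 26, 24, 20)
)

by_cyl <- cars %>%
  group_by(cyl) %>%
  summarise(
    n_cars   = n(),        # number of rows in each group
    mean_hwy = mean(hwy),
    max_hwy  = max(hwy),
    .groups  = "drop"      # return an ungrouped tibble
  )

print(by_cyl)
```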
Join operations combine data from multiple datasets. Let's consider two example data frames:
df1 <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = c(2, 3, 4), score = c(90, 85, 70))
inner_join(df1, df2, by = "id")
left_join(df1, df2, by = "id")
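To make the difference between the join types concrete, the sketch below rebuilds the two example data frames and compares which ids survive each join. full_join() and anti_join() are additional dplyr functions not mentioned above, included here only for contrast.

```r
library(dplyr)
library(tibble)

df1 <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = c(2, 3, 4), score = c(90, 85, 70))

# inner_join keeps only ids present in both tables (2 and 3)
inner <- inner_join(df1, df2, by = "id")

# left_join keeps every row of df1; Alice has no match, so her score is NA
left <- left_join(df1, df2, by = "id")

# full_join keeps ids from either table (1 through 4)
full <- full_join(df1, df2, by = "id")

# anti_join returns the rows of df1 that have no match in df2
unmatched <- anti_join(df1, df2, by = "id")

print(inner)
print(left)
print(full)
print(unmatched)
```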
You can chain multiple operations using the pipe (%>%) operator:
mpg %>%
  filter(cyl == 4) %>%
  mutate(mpg_ratio = cty / hwy) %>%
  select(manufacturer, model, mpg_ratio) %>%
  arrange(desc(mpg_ratio))
The stringr package in the tidyverse offers string manipulation functions:
library(stringr)
# Convert to uppercase
str_to_upper(c("a", "b", "c"))
# Replace characters in a string
str_replace_all("abcabc", "a", "z")
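A typical use of these functions is cleaning inconsistent text fields. The sketch below works on a hypothetical vector of names (raw_names is invented for the example) and uses str_trim(), str_to_title(), str_detect(), and str_extract() from stringr.

```r
library(stringr)

# Hypothetical messy input: stray whitespace and inconsistent case
raw_names <- c("  alice smith", "BOB JONES ", "Charlie Brown")

# Trim surrounding whitespace, then normalise to title case
clean <- str_to_title(str_trim(raw_names))

# Test each string against a pattern
has_jones <- str_detect(clean, "Jones")

# Extract the first word (leading run of word characters)
first_names <- str_extract(clean, "^\\w+")

print(clean)
print(has_jones)
print(first_names)
```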
The lubridate package makes working with dates easier:
library(lubridate)
# Parse dates
ymd("20230101")
# Extract components
day(ymd("20230515"))
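Beyond parsing and extracting components, lubridate supports date arithmetic. The following is a minimal sketch with an arbitrarily chosen date; the variable names are illustrative only.

```r
library(lubridate)

d <- ymd("20230515")

# Extract further components
yr <- year(d)
mo <- month(d)

# Add a period of 30 days
later <- d + days(30)

# Number of days between two dates
days_left <- as.numeric(ymd("20231231") - d)

print(yr)
print(mo)
print(later)
print(days_left)
```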
This tutorial provides an overview of essential data wrangling and transformation techniques in R using the tidyverse packages. Mastering these operations is vital for efficient and robust data analysis in R. Always consult the package documentation and vignettes for more in-depth knowledge and advanced techniques.
Data transformation techniques in R:
# Using dplyr for data transformation
library(dplyr)
transformed_data <- original_data %>%
  mutate(TransformedColumn = log(NumericColumn))
Tidying data in R:
# Using tidyr for tidying data
library(tidyr)
tidy_data <- original_data %>%
  gather(Key, Value, -ID)
Reshaping data in R:
Reshape data with gather and spread, or with pivot_longer and pivot_wider, which supersede them in current versions of tidyr.
# Using tidyr for reshaping data
library(tidyr)
long_data <- pivot_longer(original_wide_data, cols = -ID, names_to = "Variable", values_to = "Value")
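To see what reshaping does to an actual table, the sketch below takes a hypothetical wide tibble (the height and weight columns are invented), converts it to long form with pivot_longer(), and converts it back with pivot_wider().

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per ID, one column per measurement
wide <- tibble(
  ID     = 1:2,
  height = c(170, 180),
  weight = c(65, 80)
)

# Wide to long: one row per ID and measurement
long <- pivot_longer(wide, cols = -ID, names_to = "Variable", values_to = "Value")

# Long back to wide: one column per distinct value of Variable
wide_again <- pivot_wider(long, names_from = Variable, values_from = Value)

print(long)
print(wide_again)
```

The long table has four rows (two IDs times two measurements), and the round trip should recover the original columns.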
Data cleaning and preprocessing in R:
# Using dplyr for data cleaning
library(dplyr)
cleaned_data <- original_data %>%
  filter(!is.na(NumericColumn)) %>%
  arrange(ID)
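A slightly fuller cleaning pass might also remove duplicate rows and fill in missing labels. This sketch assumes a hypothetical raw tibble with the same ID and NumericColumn names used above plus an invented Group column; distinct() comes from dplyr and replace_na() from tidyr.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical raw data with a duplicate row and missing values
raw <- tibble(
  ID            = c(3, 1, 2, 2),
  NumericColumn = c(10, NA, 5, 5),
  Group         = c(NA, "a", "b", "b")
)

cleaned <- raw %>%
  distinct() %>%                                  # drop exact duplicate rows
  filter(!is.na(NumericColumn)) %>%               # drop rows with a missing measurement
  mutate(Group = replace_na(Group, "unknown")) %>%  # label missing groups explicitly
  arrange(ID)

print(cleaned)
```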
Manipulating data frames in R:
Use dplyr functions for efficient manipulation of data frames.
# Using dplyr for data frame manipulation
library(dplyr)
manipulated_data <- original_data %>%
  select(Column1, Column2) %>%
  filter(Column1 > 5) %>%
  mutate(NewColumn = Column2 * 2)
Data wrangling with dplyr and tidyr:
dplyr and tidyr are powerful packages for data wrangling tasks.
# Using dplyr and tidyr for data wrangling
library(dplyr)
library(tidyr)
wrangled_data <- original_data %>%
  gather(Key, Value, -ID) %>%
  filter(!is.na(Value)) %>%
  spread(Key, Value)