Data munging, also known as data wrangling, is the process of transforming and mapping data from its raw form into another format to make it more appropriate for various downstream purposes, such as analytics.
In R, several packages can help with data munging, but the most popular and comprehensive is the tidyverse collection, especially dplyr and tidyr.
This tutorial will cover some fundamental data munging tasks using these packages.
install.packages("tidyverse")
library(tidyverse)
a. Selecting Columns:
df <- tibble(x1 = 1:5, x2 = 6:10, x3 = 11:15)
selected_df <- df %>% select(x1, x3)
b. Filtering Rows:
filtered_df <- df %>% filter(x1 > 3)
c. Adding New Columns:
new_df <- df %>% mutate(new_column = x1 + x2)
d. Sorting Data:
sorted_df <- df %>% arrange(desc(x1))
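The four verbs above are usually chained rather than run one at a time. The following is a minimal sketch of that pattern on the same toy data; the column name `total` and the threshold `x1 > 2` are illustrative choices, not part of the original examples.

```r
library(tidyverse)

# Same toy data as above, rebuilt so the example is self-contained.
df <- tibble(x1 = 1:5, x2 = 6:10, x3 = 11:15)

# filter -> mutate -> select -> arrange in a single pipeline.
result <- df %>%
  filter(x1 > 2) %>%            # keep rows where x1 is 3, 4, or 5
  mutate(total = x1 + x2) %>%   # add a derived column (hypothetical name)
  select(x1, total) %>%         # keep only the columns of interest
  arrange(desc(total))          # largest total first

print(result)
```

Each step receives the output of the previous one, so no intermediate data frames need to be named.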
To demonstrate, let's create a data frame with NA values:
df_na <- tibble(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 5))
a. Removing Missing Values:
df_no_na <- df_na %>% drop_na()
b. Replacing Missing Values:
Replace NA in column a with 0:
df_replace_na <- df_na %>% replace_na(list(a = 0))
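Before dropping or replacing values, it often helps to see how many are missing in each column. The sketch below counts NA values per column and then fills column `b` with the mean of its observed values, which is one possible strategy and not the only reasonable one. The names `na_counts` and `df_filled` are introduced here for illustration.

```r
library(tidyverse)

df_na <- tibble(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 5))

# Count missing values per column before choosing a strategy.
na_counts <- df_na %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Fill `b` with the mean of its non-missing values, then fill `a` with 0.
df_filled <- df_na %>%
  mutate(b = if_else(is.na(b), mean(b, na.rm = TRUE), b)) %>%
  replace_na(list(a = 0))

print(na_counts)
print(df_filled)
```

Whether a constant, a mean, or row removal is appropriate depends on why the values are missing, so treat the choice here as an assumption.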
# Group rows by category, then compute the mean and total of value within each group
df_group <- tibble(category = c("A", "B", "A", "B", "B"), value = c(10, 20, 30, 40, 50))
summary_df <- df_group %>%
  group_by(category) %>%
  summarise(mean_value = mean(value), total_value = sum(value))
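Grouped summaries are not limited to means and sums. As a small sketch on the same data, the example below adds a row count and a maximum per group and uses `.groups = "drop"` so the result is returned ungrouped. The output column names `n_rows` and `max_value` are illustrative.

```r
library(tidyverse)

df_group <- tibble(
  category = c("A", "B", "A", "B", "B"),
  value = c(10, 20, 30, 40, 50)
)

summary_counts <- df_group %>%
  group_by(category) %>%
  summarise(
    n_rows = n(),             # number of rows in each group
    max_value = max(value),   # largest value in each group
    .groups = "drop"          # return an ungrouped tibble
  )

# count() is a shortcut for group_by() followed by summarise(n = n()).
category_counts <- df_group %>% count(category)

print(summary_counts)
print(category_counts)
```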
a. Spreading:
Converts long-format data into wide format.
df_long <- tibble(id = c(1, 1, 2, 2), key = c("A", "B", "A", "B"), value = c(10, 20, 30, 40))
df_wide <- df_long %>% spread(key = key, value = value)
b. Gathering:
Converts wide-format data into long format.
df_wide %>% gather(key = "key", value = "value", -id)
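In tidyr 1.0.0 and later, `spread()` and `gather()` still work but are superseded by `pivot_wider()` and `pivot_longer()`. The sketch below repeats the same reshaping with the newer functions, assuming a tidyr version that provides them; `df_long_again` is a name introduced here for the round-trip result.

```r
library(tidyverse)

df_long <- tibble(
  id = c(1, 1, 2, 2),
  key = c("A", "B", "A", "B"),
  value = c(10, 20, 30, 40)
)

# Long to wide: one column per distinct value of `key`.
df_wide <- df_long %>%
  pivot_wider(names_from = key, values_from = value)

# Wide to long: stack every column except `id` back into key/value pairs.
df_long_again <- df_wide %>%
  pivot_longer(cols = -id, names_to = "key", values_to = "value")

print(df_wide)
print(df_long_again)
```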
a. Inner Join:
df1 <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = 2:4, score = c(90, 80, 70))
joined_df <- df1 %>% inner_join(df2, by = "id")
Other joins available include left_join(), right_join(), full_join(), and anti_join().
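To make the difference between these joins concrete, here is a short sketch using the same two tables. The result names `left_df`, `full_df`, and `anti_df` are introduced only for this illustration.

```r
library(tidyverse)

df1 <- tibble(id = 1:3, name = c("Alice", "Bob", "Charlie"))
df2 <- tibble(id = 2:4, score = c(90, 80, 70))

# left_join: keep every row of df1; Alice has no match, so her score is NA.
left_df <- df1 %>% left_join(df2, by = "id")

# full_join: keep ids from both tables; unmatched cells become NA.
full_df <- df1 %>% full_join(df2, by = "id")

# anti_join: keep only the rows of df1 that have no match in df2.
anti_df <- df1 %>% anti_join(df2, by = "id")

print(left_df)
print(full_df)
print(anti_df)
```

An anti join is a convenient way to find records that failed to match before committing to an inner join.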
a. Separating:
Split a single column into multiple columns.
df_sep <- tibble(name_age = c("Alice_25", "Bob_30", "Charlie_35"))
df_sep %>% separate(name_age, into = c("name", "age"), sep = "_")
b. Uniting:
Combine multiple columns into a single column.
df_unite <- tibble(name = c("Alice", "Bob", "Charlie"), age = c(25, 30, 35))
df_unite %>% unite(name_age, name, age, sep = "_")
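One detail worth knowing is that `separate()` returns character columns by default, so `age` above would be text. The sketch below passes `convert = TRUE` to have tidyr guess column types, then reverses the split with `unite()`. The names `df_parts` and `df_back` are illustrative.

```r
library(tidyverse)

df_sep <- tibble(name_age = c("Alice_25", "Bob_30", "Charlie_35"))

# convert = TRUE asks separate() to convert the new columns to suitable types,
# so `age` becomes numeric instead of character.
df_parts <- df_sep %>%
  separate(name_age, into = c("name", "age"), sep = "_", convert = TRUE)

# unite() pastes the columns back together with the same separator.
df_back <- df_parts %>%
  unite(name_age, name, age, sep = "_")

print(df_parts)
print(df_back)
```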
This tutorial offers a brief overview of the many data munging capabilities that dplyr and tidyr provide. For more in-depth exploration, refer to their respective vignettes and documentation. Remember, efficient data munging often requires a good understanding of the data at hand, so always make it a point to explore and understand your datasets.
Cleaning and transforming data in R:
# Using dplyr for cleaning and transforming
library(dplyr)
cleaned_transformed_data <- original_data %>%
  filter(!is.na(Column1)) %>%
  mutate(NewColumn = log(Column2))
R tidyr and dplyr for data munging:
# Using tidyr and dplyr for data munging
library(dplyr)
library(tidyr)
# gather() creates the Key and Value columns, so spread() must refer to those same names
munged_data <- original_data %>%
  gather(Key, Value, -ID) %>%
  spread(Key, Value)
Handling messy data in R:
# Handling messy data with stringr
library(stringr)
cleaned_messy_data <- original_messy_data %>%
  mutate(FormattedColumn = str_replace_all(UnformattedColumn, "[^0-9]", ""))
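The template above relies on placeholder names, so here is a runnable sketch with a small invented data set. The tibble `messy`, its columns, and its values are all hypothetical; only the stringr functions themselves come from the package.

```r
library(tidyverse)

# Hypothetical messy input: inconsistent phone formats and untidy names.
messy <- tibble(
  phone = c("(555) 123-4567", "555.987.6543 ", " 5551112222"),
  name = c("  alice  smith", "BOB JONES", "Charlie   Brown")
)

cleaned <- messy %>%
  mutate(
    # Drop every character that is not a digit.
    phone_digits = str_replace_all(phone, "[^0-9]", ""),
    # Collapse repeated whitespace, trim the ends, then apply title case.
    name_clean = str_to_title(str_squish(name))
  )

print(cleaned)
```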
Dealing with inconsistent data in R:
# Dealing with inconsistent data
library(dplyr)
consistent_data <- original_inconsistent_data %>%
  mutate(
    ColumnName = ifelse(is.na(ColumnName), "Unknown", ColumnName),
    NumericColumn = as.numeric(as.character(NumericColumn))
  )
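As a concrete sketch of the same idea, the example below standardises inconsistent labels and converts a factor that holds numbers stored as text. The data set `inconsistent` and the mapping of labels to "yes" and "no" are assumptions made for illustration. The `as.character()` step matters because calling `as.numeric()` directly on a factor returns its internal level codes, not the values shown.

```r
library(tidyverse)

# Hypothetical input: mixed spellings, a missing label, and numbers stored as a factor.
inconsistent <- tibble(
  status = c("yes", "Yes", "Y", NA, "no"),
  amount = factor(c("10", "20", "n/a", "40", "50"))
)

consistent <- inconsistent %>%
  mutate(
    status = case_when(
      is.na(status) ~ "Unknown",
      str_to_lower(status) %in% c("yes", "y") ~ "yes",
      str_to_lower(status) %in% c("no", "n") ~ "no",
      TRUE ~ "Unknown"
    ),
    # Convert via character; entries that are not numbers (here "n/a") become NA.
    amount = suppressWarnings(as.numeric(as.character(amount)))
  )

print(consistent)
```

`suppressWarnings()` only hides the coercion warning for the unparseable entry; in real data it is usually worth inspecting which values became NA.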