Handling data is a fundamental task in R. This tutorial will introduce you to essential techniques for data handling in R, from importing/exporting data to data manipulation.
R can import data from various sources. Here's how you can import some of the most common data formats:
CSV Files:
my_data <- read.csv("path_to_file.csv", header=TRUE)
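The options passed to read.csv() decide how missing values and text columns come in. As a minimal sketch, the example below reads inline CSV text instead of a file so it can run anywhere; the data and the name demo_data are made up for illustration.

```r
# Inline CSV text stands in for a file on disk (illustrative data)
csv_text <- "id,name,score
1,Ana,90
2,Ben,
3,Cy,NA"

demo_data <- read.csv(
  text = csv_text,
  header = TRUE,
  na.strings = c("", "NA"),   # treat empty cells and "NA" as missing
  stringsAsFactors = FALSE    # keep text columns as character vectors
)

# Count the missing scores after import
missing_scores <- sum(is.na(demo_data$score))
```

With a real file, replace the text argument with the file path as in the call above.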
Excel Files:
First, you'll need to install and load the readxl package:
install.packages("readxl")
library(readxl)
Then, read in the Excel file:
my_data <- read_excel("path_to_file.xlsx")
Databases:
Connect to databases (e.g., MySQL, SQLite) using the DBI package together with a database-specific driver package such as RMariaDB or RSQLite. The connection details and reading process vary based on the database.
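As one possible illustration, the sketch below uses an in-memory SQLite database. It assumes the DBI and RSQLite packages are installed; the table name scores and its contents are hypothetical.

```r
library(DBI)

# Assumes the RSQLite package is installed; ":memory:" creates a temporary database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a small illustrative table, then read part of it back with SQL
dbWriteTable(con, "scores", data.frame(id = 1:3, score = c(90, 75, 82)))
high_scores <- dbGetQuery(con, "SELECT id, score FROM scores WHERE score > 80")

# Always close the connection when finished
dbDisconnect(con)
```

For a server database such as MySQL, the dbConnect() call would instead take a different driver plus host and credential arguments, while dbGetQuery() is used the same way.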
Once you've imported the data, it's essential to understand its structure:
View the first/last few rows:
head(my_data)
tail(my_data)
Structure and Summary:
str(my_data)
summary(my_data)
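To try these inspection functions without importing anything, the built-in mtcars dataset can stand in for my_data. This is only a sketch; the variable names are chosen for illustration.

```r
# mtcars ships with R, so no import is needed
first_rows   <- head(mtcars, 3)          # first 3 rows instead of the default 6
dimensions   <- dim(mtcars)              # number of rows, number of columns
column_types <- sapply(mtcars, class)    # class of each column
na_per_col   <- colSums(is.na(mtcars))   # missing values per column

str(mtcars)      # compact overview of types and first values
summary(mtcars)  # per-column summary statistics
```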
Selecting Columns:
subset_data <- my_data$column_name                  # Single column (returned as a vector)
subset_data <- my_data[, c("column1", "column2")]   # Multiple columns (returned as a data frame)
Filtering Rows:
filtered_data <- my_data[my_data$column_name > value, ]
Adding New Columns:
my_data$new_column <- calculated_values
Sorting Data:
sorted_data <- my_data[order(my_data$column_name), ]
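One detail worth seeing in action: when the filtered column contains NA, a bare comparison keeps a row of NAs. The sketch below uses a made-up data frame (people and its columns are illustrative) to show the effect, then adds a column and sorts.

```r
# Illustrative data; the column names are made up for this example
people <- data.frame(
  name = c("Ana", "Ben", "Cy", "Di"),
  age  = c(34, NA, 28, 41),
  city = c("Lima", "Oslo", "Rome", "Kyiv")
)

# A bare comparison keeps an all-NA row for Ben because NA > 30 is NA
with_na_row <- people[people$age > 30, ]

# which() keeps only the rows where the comparison is TRUE
over_30 <- people[which(people$age > 30), ]

# Add a derived column, then sort by age in descending order
over_30$age_in_months <- over_30$age * 12
sorted_over_30 <- over_30[order(over_30$age, decreasing = TRUE), ]
```

Wrapping the condition in which() is one way around the NA row; subset(people, age > 30) behaves the same way.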
Merging Data:
merged_data <- merge(data1, data2, by="common_column")
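By default merge() keeps only rows whose key appears in both data frames. The all, all.x, and all.y arguments change that. A small sketch with hypothetical customers and orders tables:

```r
# Illustrative tables sharing an "id" column
customers <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
orders    <- data.frame(id = c(2, 3, 4), total = c(20, 35, 50))

# Inner join (default): only ids present in both tables
inner_rows <- merge(customers, orders, by = "id")

# Left join: every customer, with NA totals where no order exists
left_rows <- merge(customers, orders, by = "id", all.x = TRUE)

# Full join: every id from either table
full_rows <- merge(customers, orders, by = "id", all = TRUE)
```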
Identify missing values:
is.na(my_data)
Remove rows with NA:
clean_data <- na.omit(my_data)
Replace NA with a value:
my_data[is.na(my_data)] <- replacement_value
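A single replacement value is not always appropriate for every column. As a hedged sketch, the example below counts missing values per column and then fills each numeric column with its own mean; the scores data frame is made up, and mean imputation is just one possible choice.

```r
# Illustrative data with one missing value per column
scores <- data.frame(math = c(90, NA, 70), art = c(NA, 60, 80))

# Count of missing values in each column
na_counts <- colSums(is.na(scores))

# Replace each column's NAs with that column's mean
imputed <- scores
for (col in names(imputed)) {
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}
```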
The apply() family of functions (apply(), lapply(), sapply(), and related functions) applies a function to each element, row, or column of a data structure:
Example:
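# Column means; restricted to numeric columns, because mean() returns NA with a warning for other types
average <- sapply(my_data[sapply(my_data, is.numeric)], mean, na.rm=TRUE)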
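The members of the family differ mainly in what they iterate over and what they return. A short sketch on made-up inputs:

```r
# apply() works over the rows or columns of a matrix
m <- matrix(1:6, nrow = 2)        # 2 rows, 3 columns
row_totals <- apply(m, 1, sum)    # MARGIN = 1 means rows
col_totals <- apply(m, 2, sum)    # MARGIN = 2 means columns

# lapply(), sapply(), and vapply() work over the elements of a list or vector
values <- list(a = 1:3, b = c(2, 4, NA))
as_list   <- lapply(values, mean, na.rm = TRUE)              # always returns a list
as_vector <- sapply(values, mean, na.rm = TRUE)              # simplifies when it can
checked   <- vapply(values, mean, numeric(1), na.rm = TRUE)  # enforces the result type
```

vapply() is often the safer choice inside functions because it fails loudly if a result does not match the declared type and length.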
Using the aggregate() function:
result <- aggregate(column_to_aggregate ~ grouping_column, data=my_data, FUN=function_name)
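To make the formula interface concrete, here is a small sketch that totals a made-up sales table by region. The data frame and its column names are illustrative.

```r
# Illustrative sales data
sales <- data.frame(
  region = c("North", "South", "North", "South", "North"),
  amount = c(10, 20, 30, 40, 50)
)

# Left of ~ is the column to summarize, right of ~ is the grouping column
totals <- aggregate(amount ~ region, data = sales, FUN = sum)
```

The result is a data frame with one row per region and the summed amount in the amount column.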
Exporting to CSV files:
write.csv(my_data, "path_to_save.csv", row.names=FALSE)   # row.names=FALSE leaves out the row-number column
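By default, write.csv() also writes the row names as an extra first column, which then shows up as an additional column when the file is read back. The sketch below demonstrates this with temporary files; the inventory data frame is made up.

```r
# Illustrative data
inventory <- data.frame(item = c("bolt", "nut", "washer"), qty = c(10, 25, 40))

default_path <- tempfile(fileext = ".csv")
clean_path   <- tempfile(fileext = ".csv")

write.csv(inventory, default_path)                    # row names included
write.csv(inventory, clean_path, row.names = FALSE)   # row names left out

# Read both files back and compare the columns
default_cols <- names(read.csv(default_path))  # has an extra leading column
clean_cols   <- names(read.csv(clean_path))    # matches the original columns

# Remove the temporary files
unlink(c(default_path, clean_path))
```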
Excel Files:
Use the writexl package:
install.packages("writexl")
library(writexl)
write_xlsx(my_data, "path_to_save.xlsx")
The tidyverse collection, especially dplyr and tidyr, offers advanced data manipulation capabilities. For example, with dplyr:
library(dplyr)
my_data %>%
  filter(column_name > value) %>%
  select(column1, column2) %>%
  arrange(column1)
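The same kind of pipeline can be run as-is on the built-in mtcars dataset. This sketch assumes dplyr is installed; the cutoff of 25 miles per gallon is arbitrary.

```r
library(dplyr)

# Keep fuel-efficient cars, keep three columns, and sort from most to least efficient
efficient_cars <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, cyl, hp) %>%
  arrange(desc(mpg))
```

Each step receives the result of the previous one, so the pipeline reads top to bottom in the order the operations are applied.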
This tutorial has introduced essential techniques for data handling in R. To deepen your knowledge, consider diving into specific packages like dplyr and tidyr, and exploring more specialized packages for unique data types and sources.
Data import and export in R:
# Import data from CSV using readr
library(readr)
dataset <- read_csv("path/to/your/file.csv")
# Export data to CSV
write_csv(dataset, "path/to/your/exported/file.csv")
Handling missing data in R:
# Remove rows with missing values using base R
cleaned_data <- original_data[complete.cases(original_data), ]
# Remove rows with missing values using tidyr (drop_na() comes from tidyr, not dplyr)
library(tidyr)
cleaned_data_tidyr <- drop_na(original_data)
Data transformation in R:
# Transform data using dplyr
library(dplyr)
transformed_data <- original_data %>%
  mutate(TransformedColumn = log(OriginalColumn))
Dealing with outliers in R:
# Identify and remove outliers using base R: keep rows within 2 standard deviations of the mean
outlier_threshold <- 2
z_scores <- as.vector(scale(original_data$NumericColumn))   # scale() returns a one-column matrix
no_outliers_data <- original_data[abs(z_scores) < outlier_threshold, ]
# Identify and remove outliers using dplyr
library(dplyr)
no_outliers_data_dplyr <- original_data %>%
  filter(abs(as.vector(scale(NumericColumn))) < outlier_threshold)
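The z-score filter above depends on the mean and standard deviation, which extreme values themselves distort. A common alternative, shown here only as a sketch and not as part of the approach above, is the interquartile range (IQR) rule with the conventional 1.5 multiplier. The values vector is made up.

```r
# Illustrative data with one obvious extreme value
values <- c(10, 12, 11, 13, 12, 95)

# Quartiles and interquartile range
q1 <- unname(quantile(values, 0.25))
q3 <- unname(quantile(values, 0.75))
iqr <- q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Keep only the values inside the fences
kept_values <- values[values >= lower_fence & values <= upper_fence]
```

On a data frame, the same logical condition can be used as the row index, just as in the z-score version.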