Data Handling in R

Data handling is a core task in R. This tutorial covers the essentials: importing and exporting data, exploring its structure, manipulating and aggregating it, and dealing with missing values.

1. Importing Data:

R can import data from various sources. Here's how you can import some of the most common data formats:

  • CSV Files:

    my_data <- read.csv("path_to_file.csv", header=TRUE)
    
  • Excel Files:

    First, install the readxl package (needed only once) and load it:

    install.packages("readxl")  # One-time installation
    library(readxl)
    

    Then, read in the Excel file:

    my_data <- read_excel("path_to_file.xlsx")
    
  • Databases:

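    Connect to databases (e.g., MySQL, SQLite) using the DBI package together with a database-specific backend package such as RMariaDB or RSQLite. The connection arguments vary by backend, but the DBI functions for querying, such as dbGetQuery(), are the same across databases.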
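
    As a minimal sketch of the database workflow, the following uses an in-memory SQLite database through the DBI and RSQLite packages. The choice of RSQLite is an assumption made so the example needs no server, and the table name `scores` and its columns are made up for illustration. Other databases need their own backend package and connection arguments.

```r
# Assumes the DBI and RSQLite packages are installed:
# install.packages(c("DBI", "RSQLite"))
library(DBI)

# Connect to a temporary in-memory SQLite database (nothing is written to disk)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table used only for this illustration
scores <- data.frame(id = 1:3, score = c(90, 75, 82))
dbWriteTable(con, "scores", scores)

# Run a query and read the result into a data frame
my_data <- dbGetQuery(con, "SELECT id, score FROM scores WHERE score > 80")
print(my_data)

# Close the connection when finished
dbDisconnect(con)
```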

2. Exploring Data:

Once you've imported the data, it's essential to understand its structure:

  • View the first/last few rows:

    head(my_data)
    tail(my_data)
    
  • Structure and Summary:

    str(my_data)
    summary(my_data)
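
To try these functions without importing a file first, the sketch below uses `mtcars`, a data frame that ships with R. Using `mtcars` as a stand-in for `my_data`, and adding `dim()` and `names()`, are choices made for this illustration.

```r
# mtcars is a built-in data frame (datasets package), used here as a stand-in
my_data <- mtcars

dim(my_data)        # Number of rows and columns
names(my_data)      # Column names
head(my_data, 3)    # First 3 rows
tail(my_data, 3)    # Last 3 rows
str(my_data)        # Type and first few values of every column
summary(my_data)    # Min, quartiles, mean, and max of numeric columns
```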
    

3. Data Manipulation:

Selecting Columns:

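subset_data <- my_data$column_name  # Single column, returned as a vector
subset_data <- my_data[, "column_name", drop = FALSE]  # Single column, kept as a data frame
subset_data <- my_data[, c("column1", "column2")]  # Multiple columns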

Filtering Rows:

# which() drops rows where the comparison is NA instead of returning all-NA rows
filtered_data <- my_data[which(my_data$column_name > value), ]

Adding New Columns:

my_data$new_column <- calculated_values

Sorting Data:

sorted_data <- my_data[order(my_data$column_name), ]

Merging Data:

# Inner join by default; use all=TRUE, all.x=TRUE, or all.y=TRUE for outer joins
merged_data <- merge(data1, data2, by="common_column")
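
The operations in this section can be tried end to end on a small example. This is a sketch in base R only; the `employees` and `departments` data frames and all their values are hypothetical.

```r
# Hypothetical data, made up for illustration
employees <- data.frame(
  id     = c(1, 2, 3, 4),
  name   = c("Ana", "Ben", "Cy", "Dee"),
  salary = c(52000, 48000, NA, 61000)
)
departments <- data.frame(
  id   = c(1, 2, 4),
  dept = c("Sales", "IT", "IT")
)

# Select columns
names_only <- employees[, c("id", "name")]

# Filter rows; which() drops the row whose salary is NA
high_paid <- employees[which(employees$salary > 50000), ]

# Add a column computed from an existing one
employees$salary_k <- employees$salary / 1000

# Sort by salary, highest first (NA values are placed last by default)
sorted <- employees[order(employees$salary, decreasing = TRUE), ]

# Merge on the shared id column (inner join: id 3 has no department, so it is dropped)
merged <- merge(employees, departments, by = "id")
```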

4. Handling Missing Data:

Identify missing values:

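is.na(my_data)           # Logical matrix: TRUE where a value is missing
colSums(is.na(my_data))  # Number of missing values per column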

Remove rows with NA:

clean_data <- na.omit(my_data)

Replace NA with a value:

my_data[is.na(my_data)] <- replacement_value
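
A single replacement value is rarely appropriate for every column. The sketch below shows one common alternative, filling each numeric column with its own mean. The data is made up, and mean imputation is an assumption chosen for illustration, not a recommendation for every dataset.

```r
# Hypothetical data with missing values, made up for illustration
my_data <- data.frame(
  age    = c(25, NA, 31, 40),
  income = c(3000, 4200, NA, 5100)
)

# Count missing values per column
colSums(is.na(my_data))

# Drop every row that contains at least one NA
clean_data <- na.omit(my_data)

# Alternatively, replace NA in each column with that column's mean
imputed <- my_data
for (col in names(imputed)) {
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}
```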

5. Applying Functions:

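The apply() family of functions applies a function to each element of a data structure without an explicit loop:

  • apply(): Over the rows or columns of a matrix or array
  • lapply(): Over each element of a list or vector; always returns a list
  • sapply(): Like lapply(), but simplifies the result to a vector or matrix when possible
  • tapply(): Over groups of a vector, where the groups are defined by a factor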

Example:

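# Column means of the numeric columns only (mean() returns NA with a warning for other types)
average <- sapply(my_data[sapply(my_data, is.numeric)], mean, na.rm=TRUE)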
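
To see how the four functions differ, here is a small sketch on made-up inputs. The matrix, list, and vector values are all hypothetical.

```r
# A small matrix, made up for illustration: 2 rows, 3 columns, filled column-wise
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)   # MARGIN = 1: apply over rows
col_sums <- apply(m, 2, sum)   # MARGIN = 2: apply over columns

values <- list(a = 1:3, b = 4:6)
as_list   <- lapply(values, mean)   # Returns a list
as_vector <- sapply(values, mean)   # Returns a named numeric vector

# tapply(): mean score within each group
scores <- c(80, 90, 70, 100)
groups <- factor(c("x", "y", "x", "y"))
group_means <- tapply(scores, groups, mean)
```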

6. Data Aggregation:

The aggregate() function summarizes one column for each group of another, using a formula interface:

result <- aggregate(column_to_aggregate ~ grouping_column, data=my_data, FUN=function_name)  # e.g., FUN=mean
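
As a concrete sketch of the formula interface, the example below computes a mean and a total per group. The `sales` data frame and its values are hypothetical.

```r
# Hypothetical sales data, made up for illustration
sales <- data.frame(
  region = c("North", "South", "North", "South", "North"),
  amount = c(100, 200, 150, 50, 50)
)

# Mean amount per region
mean_by_region <- aggregate(amount ~ region, data = sales, FUN = mean)

# Total amount per region
total_by_region <- aggregate(amount ~ region, data = sales, FUN = sum)
```

Each result is a data frame with one row per region and the summarized `amount` alongside it.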

7. Exporting Data:

  • CSV Files:

    write.csv(my_data, "path_to_save.csv", row.names=FALSE)  # row.names=FALSE omits the row-name column
    
  • Excel Files:

    Use the writexl package:

    install.packages("writexl")
    library(writexl)
    write_xlsx(my_data, "path_to_save.xlsx")
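
One way to check a CSV export is to read the file back and compare it with the original. The sketch below does this with a temporary file; the data frame is made up, and `row.names = FALSE` is used so that no extra row-name column appears on re-import.

```r
# Hypothetical data, made up for illustration
my_data <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Write to a temporary file so the example leaves nothing behind
csv_path <- tempfile(fileext = ".csv")
write.csv(my_data, csv_path, row.names = FALSE)

# Read the file back and compare it with the original
round_trip <- read.csv(csv_path)
all.equal(my_data, round_trip)

unlink(csv_path)  # Delete the temporary file
```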
    

8. Advanced Packages for Data Handling:

The tidyverse collection, especially dplyr and tidyr, offers advanced data manipulation capabilities. For example, with dplyr:

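library(dplyr)
result <- my_data %>%
  filter(column_name > value) %>%  # Keep rows matching the condition
  select(column1, column2) %>%     # Keep only these columns
  arrange(column1)                 # Sort by column1, ascending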

Conclusion:

This tutorial has introduced essential techniques for data handling in R. To deepen your knowledge, explore dplyr and tidyr in more detail, along with more specialized packages for particular data types and sources.

Additional Examples:

  1. Data import and export in R:

    # Import data from CSV using readr
    library(readr)
    dataset <- read_csv("path/to/your/file.csv")
    
    # Export data to CSV
    write_csv(dataset, "path/to/your/exported/file.csv")
    
  2. Handling missing data in R:

    # Remove rows with missing values using base R
    cleaned_data <- original_data[complete.cases(original_data), ]
    
    # Remove rows with missing values using tidyr (drop_na() is a tidyr function, not dplyr)
    library(dplyr)  # Provides the %>% pipe
    library(tidyr)
    cleaned_data_tidyr <- original_data %>%
      drop_na()
    
  3. Data transformation in R:

    # Add a log-transformed column using dplyr (log() is only meaningful for positive values)
    library(dplyr)
    transformed_data <- original_data %>%
      mutate(TransformedColumn = log(OriginalColumn))
    
  4. Dealing with outliers in R:

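    # Identify and remove outliers using base R: keep rows within 2 standard deviations of the mean
    outlier_threshold <- 2
    z_scores <- as.vector(scale(original_data$NumericColumn))  # scale() returns a one-column matrix
    no_outliers_data <- original_data[which(abs(z_scores) < outlier_threshold), ]
    
    # Identify and remove outliers using dplyr
    library(dplyr)
    no_outliers_data_dplyr <- original_data %>%
      filter(abs(as.vector(scale(NumericColumn))) < outlier_threshold)  # filter() expects a logical vector, not a matrix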
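
To make the z-score rule concrete, the sketch below computes the scores by hand on made-up data with one extreme value. The values and the threshold of 2 are illustrative assumptions; an appropriate cutoff depends on the data.

```r
# Hypothetical measurements with one extreme value, made up for illustration
original_data <- data.frame(
  NumericColumn = c(10, 11, 9, 10, 12, 11, 10, 9, 11, 100)
)

# z-score: distance from the mean in units of standard deviation
z_scores <- (original_data$NumericColumn - mean(original_data$NumericColumn)) /
  sd(original_data$NumericColumn)

outlier_threshold <- 2

# drop = FALSE keeps the result a data frame even though it has a single column
no_outliers_data <- original_data[abs(z_scores) < outlier_threshold, , drop = FALSE]
```

Because an extreme value also inflates the standard deviation, this rule can miss outliers in very small samples.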