R Tutorial


Clustering in R

Clustering is an unsupervised learning technique used to group similar data points together. R provides several packages and functions to perform clustering. In this tutorial, we'll explore some popular clustering methods: k-means, hierarchical, and DBSCAN.

1. K-means Clustering

K-means partitions the data into a pre-specified number k of distinct, non-overlapping groups (clusters) by minimizing the within-cluster sum of squares.

Example using kmeans function:

# Simulate some data
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol=2),
              matrix(rnorm(100, mean=3), ncol=2))

# Apply k-means clustering (nstart runs several random starts and keeps the best)
clusters <- kmeans(data, centers=2, nstart=25)

# Plot
plot(data, col=clusters$cluster)
points(clusters$centers, col=1:2, pch=8, cex=2)
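The fitted object returned by kmeans carries more than the cluster assignments; inspecting the sizes, centers, and within-cluster sum of squares is a quick sanity check. A minimal sketch, reusing the simulated data from above:

```r
# Re-create the simulated data from above
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))
clusters <- kmeans(data, centers = 2, nstart = 10)

clusters$size          # number of points per cluster
clusters$centers       # cluster centers
clusters$tot.withinss  # total within-cluster sum of squares
```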

2. Hierarchical Clustering

Hierarchical clustering builds a tree of clusters. You can visualize this tree using a dendrogram.

Example using hclust function:

# Compute distance matrix
dist_matrix <- dist(data)

# Hierarchical clustering (complete linkage by default)
h_cluster <- hclust(dist_matrix)

# Plot
plot(h_cluster)
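To see where a k-cluster cut would fall before committing to it, the base stats function rect.hclust draws boxes around the branches directly on the dendrogram. A short sketch, reusing the simulated data from above:

```r
# Re-create the simulated data and the tree from above
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))
h_cluster <- hclust(dist(data))

plot(h_cluster)
rect.hclust(h_cluster, k = 2, border = "red")  # box the 2-cluster cut
```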

To cut the tree into k clusters:

groups <- cutree(h_cluster, k=2)
table(groups)  # number of points in each cluster

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups together points that lie within a distance eps of one another, requiring at least minPts points to form a dense region; points that belong to no dense region are labelled noise (cluster 0).

Example using the dbscan package:

First, install and load the necessary package:

install.packages("dbscan")
library(dbscan)

Then apply DBSCAN:

db <- dbscan(data, eps=0.5, minPts=5)

# Plot (add 1 so that noise points, labelled 0, get a visible colour)
plot(data, col=db$cluster + 1L)
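Choosing eps is the hard part of DBSCAN. A common heuristic, supported by kNNdistplot from the dbscan package, is to plot each point's distance to its k-th nearest neighbour (with k matched to minPts) and set eps near the "knee" of the curve. In the sketch below, the 0.5 line is simply the value used above, not a derived optimum:

```r
library(dbscan)

set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))

kNNdistplot(data, k = 5)   # k matched to minPts
abline(h = 0.5, lty = 2)   # candidate eps (the value used above)
```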

4. Determining the Number of Clusters

For k-means, the Elbow method is commonly used:

wss <- numeric(10)
set.seed(123)
for (k in 1:10) {
  model <- kmeans(data, centers=k, nstart=10)
  wss[k] <- model$tot.withinss
}

plot(1:10, wss, type="b", xlab="Number of Clusters", ylab="WSS")

The 'elbow' of the curve, the point where adding another cluster stops producing a large drop in WSS, marks a reasonable choice of k (a balance between fit and model complexity).
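There is no built-in function that returns the elbow. One simple heuristic, an assumption here rather than a standard rule, is to pick the k at which the second difference of the WSS curve is largest, i.e. where the drop flattens most sharply:

```r
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))

wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)

# diff(diff(wss))[i] compares the drop into k = i + 1 with the drop out of it;
# the largest value marks the sharpest flattening of the curve
elbow <- which.max(diff(diff(wss))) + 1
elbow
```

For this simulated two-cluster data the heuristic should land near k = 2, but on real data it is worth checking the plot as well.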

5. Evaluation

For clustering, evaluation can be challenging when true labels are unknown. Silhouette analysis measures how similar each point is to its own cluster compared with the nearest neighbouring cluster; average silhouette widths close to 1 indicate compact, well-separated clusters.

library(cluster)
silhouette_score <- silhouette(groups, dist_matrix)
plot(silhouette_score)
mean(silhouette_score[, "sil_width"])  # average silhouette width

Summary:

Clustering is a powerful tool in unsupervised machine learning. R provides various methods and packages to perform clustering. Always ensure that you're preprocessing your data (e.g., scaling) appropriately and evaluating your clustering results using appropriate metrics or visual methods.

  1. How to Perform Clustering Analysis in R:

    # Load a sample dataset
    data(iris)
    
    # Select features for clustering
    features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
    
    # Standardize the features (if necessary)
    standardized_features <- scale(features)
    
  2. Unsupervised Learning for Clustering in R:

    # Clustering is unsupervised: no class labels are passed to the algorithm
    kmeans_model <- kmeans(standardized_features, centers = 3)
    clusters_kmeans <- kmeans_model$cluster
    
  3. Popular Clustering Packages in R:

    # Popular clustering packages
    library(cluster)
    library(factoextra)
    library(dbscan)
    
  4. K-Means Clustering in R:

    # Using k-means clustering with multiple random starts
    kmeans_model <- kmeans(standardized_features, centers = 3, nstart = 25)
    clusters_kmeans <- kmeans_model$cluster
    
  5. Hierarchical Clustering in R:

    # Using hierarchical clustering
    hierarchical_model <- hclust(dist(standardized_features))
    clusters_hierarchical <- cutree(hierarchical_model, k = 3)
    
  6. DBSCAN Clustering in R:

    # Using DBSCAN clustering
    dbscan_model <- dbscan(standardized_features, eps = 0.5, minPts = 5)
    clusters_dbscan <- dbscan_model$cluster
    
  7. Agglomerative Clustering in R:

    # Using agglomerative clustering (agnes from the cluster package)
    agglomerative_model <- agnes(standardized_features)
    clusters_agglomerative <- cutree(as.hclust(agglomerative_model), k = 3)
    
  8. Comparing Clustering Methods in R:

    # Comparing candidate cluster counts via average silhouette width (factoextra)
    fviz_nbclust(standardized_features, kmeans, method = "silhouette")
    
  9. Visualizing Clustering Results in R:

    # Visualizing clustering results
    fviz_cluster(list(data = standardized_features, cluster = clusters_kmeans))
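When ground-truth labels happen to exist, as with iris$Species, cross-tabulating them against the cluster assignments gives a quick external check on the result. A base-R sketch; note that the cluster numbers themselves are arbitrary, so only the grouping pattern matters:

```r
data(iris)
standardized_features <- scale(iris[, 1:4])

set.seed(123)
kmeans_model <- kmeans(standardized_features, centers = 3, nstart = 25)

# Rows: true species; columns: cluster labels (arbitrary numbering)
table(iris$Species, kmeans_model$cluster)
```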