Clustering is an unsupervised learning technique used to group similar data points together. R provides several packages and functions to perform clustering. In this tutorial, we'll explore some popular clustering methods: k-means, hierarchical, and DBSCAN.
K-means tries to partition data into k pre-defined, distinct, non-overlapping groups (clusters).
Example using the kmeans function:
# Simulate some data
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))

# Apply k-means clustering with two centers
clusters <- kmeans(data, centers = 2)

# Plot the points colored by cluster, with the centers marked
plot(data, col = clusters$cluster)
points(clusters$centers, col = 1:2, pch = 8, cex = 2)
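To sanity-check the fit, the kmeans object stores the cluster assignments and fitted centers, so you can inspect them directly:

# Cluster sizes and fitted centers
table(clusters$cluster)
clusters$centers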
Hierarchical clustering builds a tree of clusters. You can visualize this tree using a dendrogram.
Example using the hclust function:
# Compute the distance matrix
dist_matrix <- dist(data)

# Hierarchical (agglomerative) clustering
h_cluster <- hclust(dist_matrix)

# Plot the dendrogram
plot(h_cluster)
To cut the tree into k clusters:
groups <- cutree(h_cluster, k=2)
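You can also highlight the chosen clusters directly on the dendrogram with rect.hclust:

# Draw boxes around the two clusters on the dendrogram
plot(h_cluster)
rect.hclust(h_cluster, k = 2, border = "red")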
DBSCAN groups together points that are close to each other based on a distance measure and a minimum number of points.
Example using the dbscan package:
First, install and load the necessary package:
install.packages("dbscan")
library(dbscan)
Then apply DBSCAN:
set.seed(123)
db <- dbscan(data, eps = 0.5, minPts = 5)

# Plot; shift the ids by 1 so noise points (cluster 0) remain visible
plot(data, col = db$cluster + 1L)
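Choosing eps is usually guided by the k-nearest-neighbor distance plot that the dbscan package provides via kNNdistplot; the 'knee' of the curve suggests a candidate value (the 0.5 threshold drawn here is simply the value used above):

# k-NN distance plot to guide the choice of eps
kNNdistplot(data, k = 5)
abline(h = 0.5, lty = 2)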
To choose the number of clusters k for k-means, the Elbow method is commonly used:
# Total within-cluster sum of squares for k = 1..10
wss <- numeric(10)
for (k in 1:10) {
  model <- kmeans(data, centers = k)
  wss[k] <- model$tot.withinss
}
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters", ylab = "WSS")
The 'elbow' of the curve, where adding more clusters yields only small further reductions in WSS, suggests a reasonable number of clusters (a balance between fit and model complexity).
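Note that kmeans starts from random centers, so the WSS values can vary between runs; a common remedy (sketched here with the data from above) is to set nstart so the algorithm keeps the best of several random starts:

# More stable fit: keep the best of 25 random starts
model <- kmeans(data, centers = 2, nstart = 25)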
Evaluating a clustering can be challenging, especially when true labels are unknown. Silhouette analysis measures how close each point is to its own cluster compared with the nearest neighboring cluster; well-separated clusters yield higher silhouette widths.
library(cluster)
silhouette_score <- silhouette(groups, dist_matrix)
plot(silhouette_score)
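For a single summary number, the average silhouette width (closer to 1 is better) can be read off the silhouette object:

# Average silhouette width across all points
summary(silhouette_score)$avg.width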
Clustering is a powerful tool in unsupervised machine learning, and R provides many methods and packages for it. Always preprocess your data appropriately (e.g., by scaling) and evaluate your clustering results with suitable metrics or visual checks.
How to Perform Clustering Analysis in R:
# Load a sample dataset
data(iris)

# Select features for clustering
features <- iris[, c("Sepal.Length", "Sepal.Width",
                     "Petal.Length", "Petal.Width")]

# Standardize the features (if necessary)
standardized_features <- scale(features)
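As a quick check that the standardization worked, each column should now have (approximately) zero mean and unit standard deviation:

# Column means ~0 and standard deviations ~1 after scaling
colMeans(standardized_features)
apply(standardized_features, 2, sd)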
Popular Clustering Packages in R:
# Popular clustering packages
library(cluster)
library(factoextra)
library(dbscan)
K-Means Clustering in R:
# Using k-means clustering
kmeans_model <- kmeans(standardized_features, centers = 3)
clusters_kmeans <- kmeans_model$cluster
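Since iris comes with known species labels, a quick cross-tabulation shows how well the clusters recover them:

# Compare cluster assignments with the known species
table(clusters_kmeans, iris$Species)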
Hierarchical Clustering in R:
# Using hierarchical clustering
hierarchical_model <- hclust(dist(standardized_features))
clusters_hierarchical <- cutree(hierarchical_model, k = 3)
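If factoextra is loaded (as above), fviz_dend gives a colored dendrogram with the chosen clusters boxed:

# Dendrogram with the three clusters highlighted
fviz_dend(hierarchical_model, k = 3, rect = TRUE)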
DBSCAN Clustering in R:
# Using DBSCAN clustering
dbscan_model <- dbscan(standardized_features, eps = 0.5, minPts = 5)
clusters_dbscan <- dbscan_model$cluster
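DBSCAN labels noise points as cluster 0, so tabulating the assignments shows the cluster sizes along with how many points were left unassigned:

# Cluster sizes; cluster 0 collects the noise points
table(clusters_dbscan)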
Agglomerative Clustering in R:
# Using agglomerative clustering (agnes from the cluster package)
agglomerative_model <- agnes(standardized_features)
clusters_agglomerative <- cutree(as.hclust(agglomerative_model), k = 3)
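The agnes object also records an agglomerative coefficient, a rough measure of the strength of the clustering structure (closer to 1 is stronger):

# Agglomerative coefficient of the fit
agglomerative_model$ac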
Choosing the Number of Clusters in R:
# Determine the optimal number of clusters (average silhouette method)
fviz_nbclust(standardized_features, kmeans, method = "silhouette")
Visualizing Clustering Results in R:
# Visualizing clustering results
fviz_cluster(list(data = standardized_features, cluster = clusters_kmeans))
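Finally, to compare how two methods assign points (label permutations aside), a simple cross-tabulation works:

# Cross-tabulate k-means vs. hierarchical assignments
table(kmeans = clusters_kmeans, hclust = clusters_hierarchical)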