DBScan (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Unlike k-means, which tends to produce roughly spherical clusters and needs the number of clusters up front, DBScan can discover clusters of arbitrary shape. It also identifies noise points that don't belong to any cluster. Because it applies one global density threshold, it works best when the clusters have broadly similar densities.
Here's a tutorial on how to perform DBScan clustering in R:
You need the fpc package for DBScan:
install.packages("fpc")
library(fpc)
For this tutorial, let's generate a dataset with two distinct clusters:
set.seed(123)
cluster1 <- matrix(rnorm(100 * 2), ncol = 2)
cluster2 <- matrix(rnorm(100 * 2, mean = 3), ncol = 2)
data <- rbind(cluster1, cluster2)
plot(data, col = "blue", pch = 20, main = "Generated Data")
Use the dbscan() function to perform clustering:
dbscan_result <- dbscan(data, eps = 0.5, MinPts = 5)
Here:
eps is the radius of the neighborhood around a data point.
MinPts is the minimum number of points required to form a dense region.
Visualize the clusters and noise:
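# Shift labels by 1: colour 0 is the background colour, so noise (label 0) would otherwise be invisible
cluster_ids <- sort(unique(dbscan_result$cluster))
plot(data, col = dbscan_result$cluster + 1, pch = 20, main = "DBScan Clustering Results")
legend("topright", legend = cluster_ids, col = cluster_ids + 1, pch = 20)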
Points labeled with 0 are considered noise.
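To see how eps and MinPts lead to these labels, including 0 for noise, here is a minimal base R sketch of the algorithm. It is not the fpc or dbscan implementation: simple_dbscan() is a hypothetical helper written for illustration, it assumes Euclidean distance, counts a point as part of its own neighborhood, and builds the full distance matrix, so it is only practical for small datasets.

```r
# Minimal DBSCAN sketch in base R (illustrative only; O(n^2) time and memory).
# simple_dbscan() is a hypothetical helper, not part of fpc or dbscan.
simple_dbscan <- function(x, eps, min_pts) {
  n <- nrow(x)
  d <- as.matrix(dist(x))  # pairwise Euclidean distances
  # eps-neighbourhood of every point (each point is its own neighbour)
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbours) >= min_pts
  labels <- integer(n)     # 0 = noise or not yet assigned
  cluster_id <- 0L
  for (i in seq_len(n)) {
    # Only an unassigned core point can start a new cluster
    if (labels[i] != 0L || !is_core[i]) next
    cluster_id <- cluster_id + 1L
    labels[i] <- cluster_id
    queue <- neighbours[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (labels[j] != 0L) next  # already claimed by a cluster
      labels[j] <- cluster_id
      # Core points extend the cluster; border points do not
      if (is_core[j]) queue <- c(queue, neighbours[[j]])
    }
  }
  labels
}

# Two tight groups far apart, similar in spirit to the generated data above
set.seed(123)
toy <- rbind(matrix(rnorm(40, sd = 0.2), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.2), ncol = 2))
table(simple_dbscan(toy, eps = 0.5, min_pts = 5))
```

Points that are never reached from a core point keep the label 0, which is how noise arises.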
One way to evaluate the result is by checking the number of clusters and noise:
table(dbscan_result$cluster)
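The same counts can be pulled out as numbers, which is handy when comparing runs. The sketch below assumes only that labels are integers with 0 meaning noise; summarise_labels() and the example vector are hypothetical stand-ins for dbscan_result$cluster.

```r
# Summarise a vector of cluster labels where 0 marks noise.
# summarise_labels() is a hypothetical helper for illustration.
summarise_labels <- function(labels) {
  list(
    n_clusters  = length(setdiff(unique(labels), 0)),  # clusters, excluding noise
    n_noise     = sum(labels == 0),                    # number of noise points
    noise_share = mean(labels == 0)                    # fraction of points that are noise
  )
}

# Made-up labels standing in for dbscan_result$cluster
example_labels <- c(1, 1, 2, 0, 2, 1, 0, 2, 2, 1)
label_summary <- summarise_labels(example_labels)
str(label_summary)
```

A large noise share often suggests that eps is too small or MinPts too large for the data.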
DBScan's results depend heavily on the chosen eps and MinPts values. Different values might result in more or fewer clusters. A common technique is to use the k-distance plot to determine a suitable value for eps:
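# kNNdistplot() comes from the dbscan package (install it first if needed).
# Loading it masks fpc's dbscan(); call fpc::dbscan() explicitly afterwards.
library(dbscan)
kNNdistplot(data, k = 5)
abline(h = 0.5, col = "red")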
In the plot, a clear "bend" or "knee" can be used as a heuristic to set the eps value.
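To make the k-distance idea concrete, the following base R sketch computes each point's distance to its k-th nearest neighbor and sorts the values, which is roughly the curve a k-distance plot shows. It assumes Euclidean distance and regenerates the two-cluster data from earlier; the variable names are illustrative.

```r
# k-distance curve computed by hand in base R (Euclidean distance assumed)
set.seed(123)
data <- rbind(matrix(rnorm(100 * 2), ncol = 2),
              matrix(rnorm(100 * 2, mean = 3), ncol = 2))
k <- 5

d <- as.matrix(dist(data))
# After sorting a row, position 1 is the point itself (distance 0),
# so position k + 1 holds the distance to the k-th nearest neighbour
kth_dist <- apply(d, 1, function(row) sort(row)[k + 1])
k_dist_sorted <- sort(kth_dist)

# Quantiles give a numeric feel for where the curve starts to climb
quantile(k_dist_sorted, probs = c(0.5, 0.75, 0.9, 0.95))

# Draw the curve only in an interactive session
if (interactive()) {
  plot(k_dist_sorted, type = "l",
       xlab = "Points sorted by distance", ylab = "5-NN distance")
  abline(h = 0.5, col = "red", lty = 2)
}
```

Most points sit in the flat part of the curve; the few with much larger k-distances are the ones likely to be labeled as noise.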
DBScan is a powerful clustering algorithm suitable for datasets with clusters of arbitrary shapes and sizes. However, its results depend on the eps and MinPts parameters, which might require fine-tuning. It's also notable for its ability to detect noise in the dataset. DBScan can be more advantageous than other clustering methods like k-means when the clusters aren't spherical or when noise is present.
Density-based clustering with DBScan in R:
# Using the dbscan package for DBScan clustering
library(dbscan)
# Sample data
set.seed(123)
data <- matrix(rnorm(200), ncol = 2)
# DBScan clustering (the dbscan package names this argument minPts; MinPts is fpc's spelling)
dbscan_result <- dbscan(data, eps = 0.5, minPts = 5)
Using dbscan in R for clustering:
The dbscan package provides the dbscan() function for performing DBScan clustering. It takes the data matrix, the neighborhood radius eps, and the minimum number of points; the previous example shows a complete call.
Visualizing DBScan clusters in R:
# Visualizing DBScan clusters
library(dbscan)
library(ggplot2)
# Sample data and DBScan clustering
set.seed(123)
data <- matrix(rnorm(200), ncol = 2)
dbscan_result <- dbscan(data, eps = 0.5, minPts = 5)
# Plotting clusters; data.frame() names unnamed matrix columns X1 and X2
plot_df <- data.frame(data, Cluster = as.factor(dbscan_result$cluster))
ggplot(plot_df, aes(x = X1, y = X2, color = Cluster)) +
  geom_point() +
  ggtitle("DBScan Clustering")
DBScan parameters in R:
The two main parameters are eps (radius for defining the neighborhood) and minPts (minimum number of points in a neighborhood to form a core point; fpc spells it MinPts).
# DBScan parameters
library(dbscan)
# Sample data
set.seed(123)
data <- matrix(rnorm(200), ncol = 2)
# Trying different DBScan parameters
dbscan_result_1 <- dbscan(data, eps = 0.3, minPts = 5)
dbscan_result_2 <- dbscan(data, eps = 0.5, minPts = 10)
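One rough way to see how sensitive the result is to these two parameters is to count how many points would qualify as core points under each setting. The base R sketch below does this for a small grid; it assumes Euclidean distance and counts each point as its own neighbor, as the dbscan package documents for minPts. The grid values and the count_core() helper are illustrative choices, not recommendations.

```r
# Count core points for several (eps, minPts) combinations in base R
set.seed(123)
data <- matrix(rnorm(200), ncol = 2)
d <- as.matrix(dist(data))

# A point is a core point if at least min_pts points (itself included)
# lie within distance eps. count_core() is a hypothetical helper.
count_core <- function(eps, min_pts) sum(rowSums(d <= eps) >= min_pts)

# expand.grid() varies eps fastest, so rows are grouped by min_pts
grid <- expand.grid(eps = c(0.3, 0.5, 0.8), min_pts = c(5, 10))
grid$core_points <- mapply(count_core, grid$eps, grid$min_pts)
grid
```

A larger eps can only add core points, and a larger minPts can only remove them. Settings with very few core points will label most of the data as noise.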
Outlier detection with DBScan in R:
# Outlier detection with DBScan
library(dbscan)
# Sample data
set.seed(123)
data <- matrix(rnorm(200), ncol = 2)
# DBScan for outlier detection
dbscan_result <- dbscan(data, eps = 0.5, minPts = 5)
# Extracting outliers (noise, label 0); drop = FALSE keeps a matrix even if only one row matches
outliers <- data[dbscan_result$cluster == 0, , drop = FALSE]