R Tutorial

Hierarchical Clustering in R

Hierarchical clustering is a method of clustering that builds nested clusters by successively merging or splitting groups. The result of hierarchical clustering is a tree-like diagram called a dendrogram, which displays the sequence in which groups are merged or split.

Here's a basic tutorial on hierarchical clustering in R:

1. Sample Data

For this example, we'll use the built-in mtcars dataset:

data(mtcars)

2. Calculate Distance

Before performing hierarchical clustering, we need to compute the distance between each pair of observations. Euclidean distance is the most common choice, though other metrics (such as Manhattan distance) may suit some data better:

dist_matrix <- dist(mtcars, method = "euclidean")
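One caveat worth noting: the mtcars columns are on very different scales (disp runs in the hundreds while am is 0/1), so raw Euclidean distances are dominated by the large-magnitude variables. A common remedy, sketched below, is to standardize the columns with scale() before computing distances; dist_scaled could then be used in place of dist_matrix:

```r
# Standardize each column (mean 0, sd 1) so that no single
# large-magnitude variable dominates the Euclidean distance
data(mtcars)
mtcars_scaled <- scale(mtcars)
dist_scaled <- dist(mtcars_scaled, method = "euclidean")
```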

3. Hierarchical Clustering

We'll use the hclust() function to perform the hierarchical clustering:

hc <- hclust(dist_matrix, method = "complete")

There are several linkage methods available, such as "complete" (the distance between two clusters is the maximum pairwise distance between their members), "single" (the minimum pairwise distance), "average" (the mean pairwise distance), and "ward.D2" (merges chosen to minimize within-cluster variance). The linkage method determines how the distance between clusters is measured.

4. Plot Dendrogram

You can visualize the result of the clustering using a dendrogram:

plot(hc, main = "Hierarchical Clustering: MTCARS", xlab = "", sub = "", cex = 0.9)

5. Cut the Dendrogram to Form K Clusters

You can cut the dendrogram to create a specific number of clusters:

k <- 3
clusters <- cutree(hc, k)

This will assign each car in the mtcars dataset to one of the k clusters.
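A quick tabulation shows how the cars are distributed across the clusters (self-contained sketch, repeating the setup from the steps above):

```r
# Cluster mtcars and count how many cars landed in each cluster
data(mtcars)
hc <- hclust(dist(mtcars, method = "euclidean"), method = "complete")
clusters <- cutree(hc, k = 3)
table(clusters)  # cluster sizes
```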

6. Analyzing the Clusters

After creating clusters, you might want to examine the members of each cluster:

cluster1 <- mtcars[clusters == 1, ]
cluster2 <- mtcars[clusters == 2, ]
cluster3 <- mtcars[clusters == 3, ]
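A compact way to characterize the clusters is to compare per-cluster variable means; this sketch uses base R's aggregate() on the same clustering as above:

```r
# Per-cluster means of every mtcars variable
data(mtcars)
hc <- hclust(dist(mtcars), method = "complete")
clusters <- cutree(hc, k = 3)
cluster_means <- aggregate(mtcars, by = list(cluster = clusters), FUN = mean)
print(cluster_means)
```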

7. Advanced: Additional Packages

For advanced hierarchical clustering functionalities, you can explore packages like dendextend, dynamicTreeCut, and others.

  • dendextend: Provides tools for visualizing and comparing dendrogram trees.

  • dynamicTreeCut: Offers methods to cut the tree dynamically (not fixed at k clusters) and can be useful when the number of clusters isn't predefined.
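As a hedged sketch of what dendextend offers (it assumes the package is installed; color_branches() is the dendextend helper used here), branches can be colored by cluster membership:

```r
# Sketch assuming the dendextend package is available
if (requireNamespace("dendextend", quietly = TRUE)) {
  library(dendextend)
  data(mtcars)
  dend <- as.dendrogram(hclust(dist(mtcars)))
  dend <- color_branches(dend, k = 3)  # color branches by a 3-cluster cut
  plot(dend)
}
```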

Summary:

Hierarchical clustering is a versatile clustering method that can provide insights into the nested structure of data. One of its strengths is that, unlike K-means clustering, it doesn't require the user to specify the number of clusters beforehand. However, interpretation can become challenging when working with large datasets.

  1. Hierarchical clustering in R:

    • Description: Hierarchical clustering is a method for grouping similar observations into clusters. It creates a hierarchy of clusters that can be represented as a dendrogram.
    • Code:
      # Hierarchical clustering in R
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      hc_result <- hclust(dist(data_matrix))
      
  2. Dendrogram in R:

    • Description: A dendrogram is a tree diagram representing a hierarchical clustering result. It visually displays the arrangement of clusters.
    • Code:
      # Creating a dendrogram in R
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      hc_result <- hclust(dist(data_matrix))
      dendrogram <- as.dendrogram(hc_result)
      plot(dendrogram)
      
  3. Agglomerative clustering in R:

    • Description: Agglomerative clustering is a bottom-up approach where each observation starts as its own cluster, and clusters are successively merged.
    • Code:
      # Agglomerative clustering in R
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      hc_result <- hclust(dist(data_matrix))
      
  4. Hierarchical clustering methods in R:

    • Description: Different linkage methods (e.g., complete, average, single) can be used in hierarchical clustering. Each method determines how distances between clusters are calculated.
    • Code:
      # Hierarchical clustering with different linkage methods
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      
      # Complete linkage
      hc_complete <- hclust(dist(data_matrix), method = "complete")
      
      # Average linkage
      hc_average <- hclust(dist(data_matrix), method = "average")
      
      # Single linkage
      hc_single <- hclust(dist(data_matrix), method = "single")
      
  5. Using hclust() in R:

    • Description: The hclust() function in R is used to perform hierarchical clustering. It takes a distance matrix as input and returns a hierarchical clustering object.
    • Code:
      # Using hclust() in R
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      hc_result <- hclust(dist(data_matrix))
      
  6. Distance metrics for hierarchical clustering in R:

    • Description: The choice of distance metric influences how clusters are formed. Common distance metrics include Euclidean distance, Manhattan distance, and more.
    • Code:
      # Hierarchical clustering with different distance metrics
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      
      # Euclidean distance
      hc_euclidean <- hclust(dist(data_matrix))
      
      # Manhattan distance
      hc_manhattan <- hclust(dist(data_matrix, method = "manhattan"))
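dist() itself does not provide a correlation-based dissimilarity, but one is easy to build by hand; this is a common choice when the shape of an observation's profile matters more than its magnitude (sketch):

```r
# Correlation-based dissimilarity: 1 - Pearson correlation between rows
set.seed(123)
data_matrix <- matrix(rnorm(50), ncol = 5)
cor_dist <- as.dist(1 - cor(t(data_matrix)))
hc_cor <- hclust(cor_dist, method = "average")
```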
      
  7. Cutting dendrograms in R:

    • Description: Cutting a dendrogram at a specific height results in clusters. The cutree() function is often used to extract cluster assignments.
    • Code:
      # Cutting dendrograms in R
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      hc_result <- hclust(dist(data_matrix))
      
      # Cut the dendrogram into k clusters
      k_clusters <- cutree(hc_result, k = 3)
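cutree() can also cut at a height h on the dendrogram instead of asking for a fixed k; the number of clusters then falls out of the cut (the height 3 below is an arbitrary illustration):

```r
# Cut at a fixed height instead of a fixed number of clusters
set.seed(123)
data_matrix <- matrix(rnorm(50), ncol = 5)
hc_result <- hclust(dist(data_matrix))
h_clusters <- cutree(hc_result, h = 3)  # every merge above height 3 is cut
length(unique(h_clusters))              # clusters implied by the cut
```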
      
  8. Visualizing hierarchical clustering results in R:

    • Description: Visualizing hierarchical clustering results involves creating plots of the dendrogram or other representations to understand the structure of clusters.
    • Code:
      # Visualizing hierarchical clustering results in R
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      hc_result <- hclust(dist(data_matrix))
      
      # Plotting the dendrogram
      plot(hc_result, main = "Hierarchical Clustering Dendrogram")
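A common companion to plot() is rect.hclust(), which draws boxes around the k clusters directly on the dendrogram:

```r
# Highlight the 3-cluster solution on the dendrogram
set.seed(123)
data_matrix <- matrix(rnorm(50), ncol = 5)
hc_result <- hclust(dist(data_matrix))
plot(hc_result, main = "Hierarchical Clustering Dendrogram")
groups <- rect.hclust(hc_result, k = 3, border = "red")
```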
      
  9. Comparing different linkage methods in R:

    • Description: Comparing different linkage methods helps determine how clusters are formed and how sensitive the clustering result is to the choice of method.
    • Code:
      # Comparing different linkage methods in R
      set.seed(123)
      data_matrix <- matrix(rnorm(50), ncol = 5)
      
      # Complete linkage
      hc_complete <- hclust(dist(data_matrix), method = "complete")
      
      # Average linkage
      hc_average <- hclust(dist(data_matrix), method = "average")
      
      # Single linkage
      hc_single <- hclust(dist(data_matrix), method = "single")