Part 12 - Implementing Hierarchical Clustering in Python
Machine Learning Algorithms Series - Unsupervised Learning with scikit-learn
This article explains how to implement the Hierarchical Clustering algorithm and visualize it with a dendrogram in Python. It covers importing necessary libraries, preparing data, performing hierarchical clustering, plotting the dendrogram, and displaying the plot.
Introduction to Hierarchical Clustering
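Hierarchical Clustering is an unsupervised learning algorithm that builds a hierarchy of clusters. In its agglomerative (bottom-up) form, which this article implements, it starts with each data point as its own cluster and repeatedly merges the two closest clusters according to a distance measure. The divisive (top-down) form instead starts with a single cluster and splits it. Either way, the result is a tree-like structure that can be drawn as a dendrogram. The hierarchy can be used to choose a suitable number of clusters by cutting the dendrogram at a specific level.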
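The bottom-up merge loop can be sketched in a few lines of Python. This is a naive illustration only, not how SciPy implements clustering: it uses single linkage (the distance between the closest pair of members) rather than the Ward criterion used later in this article, and the names naive_agglomerative and demo_points are hypothetical, invented for this example.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters (single linkage) until n_clusters remain."""
    points = np.asarray(points, dtype=float)
    # Start with every point in its own cluster (stored as lists of row indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, then drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical sample points, chosen only for illustration.
demo_points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
result = naive_agglomerative(demo_points, n_clusters=3)
print(result)
```

The loop stops once the requested number of clusters remains. A full hierarchical clustering would keep merging down to a single cluster and record every merge along the way, which is what the linkage function does.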
Step-by-Step Implementation
Importing Libraries:
Import the dendrogram and linkage functions from scipy.cluster.hierarchy. The linkage function performs hierarchical (agglomerative) clustering, whereas dendrogram generates a dendrogram plot to visualize the hierarchical clustering.
Import matplotlib.pyplot for plotting graphs and visualizations.
Import numpy for handling arrays and performing numerical operations.
Preparing Data:
Create a NumPy array X representing the data points in 2D space. Each sublist is one data point, given as a pair of x and y coordinates.
Performing Hierarchical Clustering:
Use the linkage function to perform hierarchical clustering on the data X. Specify the method parameter to define the linkage criterion. The 'ward' method merges, at each step, the pair of clusters that gives the smallest increase in within-cluster variance.
The linkage function returns the hierarchical clustering result Z, an array with one row per merge (n - 1 rows for n data points). Each row holds the indices of the two clusters that were merged, the distance between them, and the number of original data points in the newly formed cluster.
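To make the layout of Z concrete, the sketch below prints each merge row in readable form. The array X_demo holds hypothetical sample points chosen only for illustration, and the names X_demo and Z_demo are not part of the article's code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical sample points, chosen only for illustration.
X_demo = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
                   [5.4, 5.5], [9.0, 1.0], [9.3, 1.8]])
Z_demo = linkage(X_demo, method='ward')

n = len(X_demo)
print(Z_demo.shape)  # one row per merge, four columns: (n - 1, 4)

for step, (a, b, dist, size) in enumerate(Z_demo):
    # Indices below n refer to original points;
    # index n + k refers to the cluster formed at merge step k.
    print(f"step {step}: merge {int(a)} and {int(b)} "
          f"at distance {dist:.3f} -> cluster {n + step} with {int(size)} points")
```

The last row always describes the final merge, so its fourth column equals the total number of data points.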
Plotting the Dendrogram:
Create a new figure for the plot using plt.figure(figsize=(8, 4)).
Plot the dendrogram using the dendrogram function, passing the hierarchical clustering result Z. The dendrogram visually represents the merging process of clusters, where each U-shaped link shows the distance at which clusters were merged.
Add a title to the plot using plt.title('Dendrogram for Hierarchical Clustering').
Set the labels for the x and y axes using plt.xlabel('Data Points') and plt.ylabel('Distance').
Display the dendrogram plot using plt.show().
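The link between the U-shaped links and the merge distances can be checked without drawing anything. The sketch below assumes the same hypothetical sample points as an illustration and calls dendrogram with no_plot=True, which returns the layout as a dictionary instead of rendering it. The names X_demo, Z_demo, and tree are invented for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical sample points, chosen only for illustration.
X_demo = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
                   [5.4, 5.5], [9.0, 1.0], [9.3, 1.8]])
Z_demo = linkage(X_demo, method='ward')

# no_plot=True computes the dendrogram layout without rendering a figure.
tree = dendrogram(Z_demo, no_plot=True)

# Each entry of 'dcoord' holds the four y-coordinates of one U-shaped link;
# the two middle values are the height of the horizontal bar.
link_heights = sorted(d[1] for d in tree['dcoord'])
merge_distances = sorted(Z_demo[:, 2])

print(np.allclose(link_heights, merge_distances))
print(tree['ivl'])  # leaf labels in left-to-right order along the x axis
```

The 'ivl' entry is also useful for reading the plot: by default the x-axis labels are the row indices of the data points, in the order the dendrogram arranges them.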
Complete Code Example
# Import necessary libraries
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np
# Prepare data: six 2D points (illustrative sample values)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
# Perform hierarchical clustering
Z = linkage(X, method='ward')
# Plot the dendrogram
plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
Conclusion
This article demonstrates how to perform hierarchical clustering on a set of 2D points and visualize the results with a dendrogram. The dendrogram shows how clusters are formed by merging data points and groups step by step. The height of each merge (Y-axis) indicates the distance between clusters, and cutting the dendrogram at different levels can yield different numbers of clusters.
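Cutting the dendrogram can also be done in code. The sketch below is one possible approach, using fcluster from scipy.cluster.hierarchy with criterion='distance' to cut the tree at three heights. The sample points and the names X_demo, Z_demo, and cluster_counts are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical sample points, chosen only for illustration.
X_demo = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
                   [5.4, 5.5], [9.0, 1.0], [9.3, 1.8]])
Z_demo = linkage(X_demo, method='ward')

top = Z_demo[-1, 2]          # height of the final merge
second = Z_demo[-2, 2]       # height of the merge before it
first = Z_demo[0, 2]         # height of the very first merge

cut_heights = [
    top + 1.0,               # above every merge: everything in one cluster
    (second + top) / 2.0,    # between the last two merges
    first / 2.0,             # below every merge: each point on its own
]

cluster_counts = []
for height in cut_heights:
    # criterion='distance' keeps together only clusters merged at or below `height`.
    labels = fcluster(Z_demo, t=height, criterion='distance')
    cluster_counts.append(len(set(labels)))
    print(f"cut at {height:.2f}: {len(set(labels))} clusters, labels = {labels}")
```

Lower cut heights give more, smaller clusters. A common heuristic is to cut inside the largest vertical gap between consecutive merge heights, although the right level ultimately depends on the data and the application.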