
Hierarchical Clustering

About the Algorythm Recipe 🍰

Imagine you're organizing a giant party with a bunch of people you don't know very well. Hierarchical clustering is a way to gradually group these people together based on how similar they seem.

Cookin' time! 🍳


Here's how it works:

  1. Start with Everyone Alone:  At first, everyone is in their own little group (of one!).

  2. Find the Closest Strangers:  Look for the two people who seem most alike based on some criteria, like favorite music or hobbies. Maybe they both love wearing hats and talk non-stop about their cats. These two become the first official "cluster."

  3. Keep Grouping Similar People:  Repeat step 2, but now consider pairs that could include individuals or existing groups. Maybe another person joins the hat-and-cat people because they also love cats!

  4. Branch Out, But Not Too Far:  As you keep grouping people, imagine building a tree-like structure. New groups connect to existing ones based on their similarities. But remember, the goal is to keep things organized, not just clump everyone together.

  5. Stop When You're Happy:  There's no one right answer for how many groups to make. You can keep clustering until everyone is in one giant group, or stop when you have a few well-defined clusters that make sense based on your goals for the party (maybe separating the cat lovers, the music lovers, and the quiet observers).

This is a simplified explanation, but it captures the essence of hierarchical clustering. It's a way of taking a large dataset and automatically organizing it into a hierarchy of increasingly similar groups (a rough code sketch of this merging loop follows the examples below). This can be useful for various tasks, like:

  • Understanding customer behavior: Grouping customers based on their shopping habits.

  • Recommending products: Recommending similar movies to people who liked a particular film.

  • Analyzing gene expressions: Grouping genes with similar functions in biological research.
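To make the party analogy concrete, here is a minimal from-scratch sketch of the merging loop. It assumes a handful of toy 2-D points, Euclidean distance, and single linkage (the distance between two groups is the distance between their closest members); it's meant to show the mechanics, not to replace a library implementation.

import math

# Step 1: everyone starts alone, in a cluster of one
points = [(1, 2), (5, 8), (8, 10), (1.5, 1.7), (1.8, 2.2), (8.8, 9.4)]
clusters = [[p] for p in points]

def cluster_distance(a, b):
    # Single linkage: distance between the closest pair of members
    return min(math.dist(p, q) for p in a for q in b)

# Steps 2-4: keep merging the two closest clusters...
# Step 5: ...until we're happy with the number of groups (2 here)
while len(clusters) > 2:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = cluster_distance(clusters[i], clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(clusters)  # two groups: the low-value points and the high-value points

Real implementations cache pairwise distances and update them incrementally, but the core idea is exactly this: repeatedly merge the two closest groups until you're satisfied with the number of clusters.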


 

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample data (replace with your actual data)
data = pd.DataFrame({
  'feature1': [1, 5, 8, 1.5, 1.8, 8.8, 1.2, 0.8, 7.7],
  'feature2': [2, 8, 10, 1.7, 2.2, 9.4, 1.1, 0.9, 8.1]
})

# Define distance metric (Euclidean is the common choice and the only
# metric compatible with Ward linkage)
distance_metric = 'euclidean'

# Perform hierarchical clustering with Ward linkage
# (scikit-learn versions before 1.2 call this parameter 'affinity' instead of 'metric')
clustering = AgglomerativeClustering(n_clusters=3, metric=distance_metric, linkage='ward')
clustering.fit_predict(data)

# Get cluster labels for each data point
cluster_labels = clustering.labels_

# Print cluster labels
print("Cluster labels:", cluster_labels)

# Optional: Visualize the clustering dendrogram
# (scipy's dendrogram needs a linkage matrix, so the merges are recomputed with scipy)
linkage_matrix = linkage(data, method='ward')
plt.figure(figsize=(10, 6))
dendrogram(linkage_matrix, labels=data.index.tolist())
plt.title("Hierarchical Clustering Dendrogram")
plt.show()
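
Note that Ward linkage only works with Euclidean distance. If your data calls for another metric, pair it with a different linkage method. A minimal variant, assuming scikit-learn 1.2+ (older versions name the parameter affinity) and purely illustrative parameter choices:

from sklearn.cluster import AgglomerativeClustering

# Manhattan distance requires a non-Ward linkage such as 'average' or 'complete'
alt_clustering = AgglomerativeClustering(n_clusters=3, metric='manhattan', linkage='average')
alt_labels = alt_clustering.fit_predict(data)  # reuses the DataFrame defined above
print("Cluster labels (manhattan + average linkage):", alt_labels)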

 

Decode the recipe:


  1. Import libraries: We import pandas for data manipulation, matplotlib for plotting, scikit-learn's AgglomerativeClustering class for performing hierarchical clustering, and scipy's linkage and dendrogram functions to build and visualize the clustering hierarchy.

  2. Sample data: This script includes sample data for demonstration purposes. Replace this with your actual data in a pandas DataFrame format.

  3. Distance metric: We define the distance metric used to measure similarity between data points. Euclidean distance is the common choice (and the only metric Ward linkage accepts); other metrics such as Manhattan or cosine can be used with other linkage methods, as in the variant sketch just above this list.

  4. Perform clustering: We create an AgglomerativeClustering object, specifying the number of desired clusters (3 in this case), the distance metric, and the linkage method ('ward', which at each step merges the pair of clusters that increases total within-cluster variance the least). The fit_predict method performs the clustering and assigns each data point to a cluster.

  5. Get cluster labels: We extract the cluster labels assigned to each data point, providing information about which cluster each data point belongs to.

  6. Print cluster labels: This line simply prints the cluster labels for each data point in your data.

  7. Optional: Visualize dendrogram: This section draws the clustering hierarchy as a dendrogram. Because scipy's dendrogram function expects a linkage matrix rather than a fitted scikit-learn estimator, the script recomputes the Ward merges with scipy's linkage function before plotting. The plot shows how the clusters were formed and at what distances they merged; if you'd rather choose clusters by cutting the tree at a height, see the sketch after this list.
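
Finally, if you'd rather pick clusters by cutting the dendrogram at a height than by fixing n_clusters up front, scipy's fcluster can do that from the linkage matrix. A minimal sketch, reusing the data from the recipe above; the 10.0 threshold is purely illustrative, so read a sensible cut height off your own dendrogram:

from scipy.cluster.hierarchy import fcluster, linkage

# Recompute the Ward merges, then cut the tree at a chosen distance:
# points that are still connected below the cut end up in the same cluster
linkage_matrix = linkage(data, method='ward')
labels_by_cut = fcluster(linkage_matrix, t=10.0, criterion='distance')
print("Cluster labels from cutting the tree:", labels_by_cut)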

