About the Algorithm Recipe 🍰
Imagine you're organizing a giant party with a bunch of people you don't know very well. Hierarchical clustering is a way to gradually group these people together based on how similar they seem.
Cookin' time! 🍳
Here's how it works:
Start with Everyone Alone: At first, everyone is in their own little group (of one!).
Find the Closest Strangers: Look for the two people who seem most alike based on some criteria, like favorite music or hobbies. Maybe they both love wearing hats and talk non-stop about their cats. These two become the first official "cluster."
Keep Grouping Similar People: Repeat step 2, but now consider pairs that could include individuals or existing groups. Maybe another person joins the hat-and-cat people because they also love cats!
Branch Out, But Not Too Far: As you keep grouping people, imagine building a tree-like structure. New groups connect to existing ones based on their similarities. But remember, the goal is to keep things organized, not just clump everyone together.
Stop When You're Happy: There's no one right answer for how many groups to make. You can keep clustering until everyone is in one giant group, or stop when you have a few well-defined clusters that make sense based on your goals for the party (maybe separating the cat lovers, the music lovers, and the quiet observers). The short code sketch just below walks through these merge steps one by one.
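If you want to see that bookkeeping in code, here's a minimal sketch using SciPy's linkage function on a tiny made-up "party guest" table (the numbers are placeholders, not real data). Each row of the result records one merge: which two clusters were joined, how far apart they were, and how big the new group is.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up "party guest" data: each row is a guest, each column an arbitrary trait
guests = np.array([
    [1.0, 2.0],   # guest 0
    [1.2, 1.8],   # guest 1
    [5.0, 8.0],   # guest 2
    [5.5, 8.2],   # guest 3
    [9.0, 9.5],   # guest 4
])

# Agglomerative clustering with Ward linkage: start with every guest alone,
# then repeatedly merge the two closest clusters
merges = linkage(guests, method='ward')

# Each row of the linkage matrix records one merge:
# [cluster_a, cluster_b, distance, size of the new cluster]
for step, (a, b, dist, size) in enumerate(merges, start=1):
    print(f"Step {step}: merge cluster {int(a)} and cluster {int(b)} "
          f"(distance {dist:.2f}, new cluster size {int(size)})")

Clusters numbered 0 through 4 are the original guests; larger numbers refer to groups created along the way, which is exactly the tree-like structure from step 4.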
This is a simplified explanation, but it captures the essence of hierarchical clustering. It's a way of taking a large dataset and automatically organizing it into a hierarchy of increasingly similar groups. This can be useful for various tasks, like:
Understanding customer behavior: Grouping customers based on their shopping habits.
Recommending products: Suggesting similar movies to people who liked a particular film.
Analyzing gene expressions: Grouping genes with similar functions in biological research.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample data (replace with your actual data)
data = pd.DataFrame({
    'feature1': [1, 5, 8, 1.5, 1.8, 8.8, 1.2, 0.8, 7.7],
    'feature2': [2, 8, 10, 1.7, 2.2, 9.4, 1.1, 0.9, 8.1]
})

# Define the distance metric (Euclidean distance is the common choice,
# and it's the only one Ward linkage supports)
distance_metric = 'euclidean'

# Perform hierarchical clustering with Ward linkage
# (older scikit-learn versions call the 'metric' parameter 'affinity')
clustering = AgglomerativeClustering(n_clusters=3, metric=distance_metric, linkage='ward')
clustering.fit(data)

# Get cluster labels for each data point
cluster_labels = clustering.labels_

# Print cluster labels
print("Cluster labels:", cluster_labels)

# Optional: Visualize the clustering hierarchy as a dendrogram
# (SciPy's dendrogram needs a linkage matrix, so we build one with the same settings)
linkage_matrix = linkage(data.values, method='ward')
plt.figure(figsize=(10, 6))
dendrogram(linkage_matrix, labels=list(data.index))
plt.title("Hierarchical Clustering Dendrogram")
plt.show()
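One handy follow-up, not part of the recipe itself but a common next step, is to attach the labels back onto the DataFrame so you can eyeball who landed in which party group:

# Attach the cluster labels to the original DataFrame and peek at each group
data_with_clusters = data.assign(cluster=cluster_labels)
print(data_with_clusters.sort_values('cluster'))

# Quick head count per cluster
print(data_with_clusters['cluster'].value_counts())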
Decode the recipe:
Import libraries: We import pandas for data manipulation, matplotlib for plotting, scikit-learn's AgglomerativeClustering class for performing hierarchical clustering, and the linkage and dendrogram functions from scipy to build and visualize the clustering hierarchy.
Sample data: This script includes sample data for demonstration purposes. Replace this with your actual data in a pandas DataFrame format.
Distance metric: We define the distance metric used to measure how similar two data points are. Euclidean distance is a common choice, and it's the one Ward linkage requires; other metrics can be used with other linkage methods depending on your data.
Perform clustering: We create an AgglomerativeClustering object, specifying the number of desired clusters (3 in this case), the distance metric, and the linkage method ('ward' is a commonly used choice). Calling fit performs the clustering and assigns each data point to a cluster.
Get cluster labels: We read the labels_ attribute to see which cluster each data point was assigned to.
Print cluster labels: This line simply prints the cluster labels for each data point in your data.
Optional: Visualize dendrogram: This part builds a SciPy linkage matrix (the dendrogram function expects one, rather than the fitted scikit-learn model) and draws the clustering hierarchy as a dendrogram. This is helpful for seeing how the clusters were formed and how far apart they are, and for deciding where to "stop when you're happy". The short sketch after this list shows one way to cut the tree once you've chosen that point.
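As a small follow-up sketch (reusing the linkage_matrix from the script above), SciPy's fcluster function lets you cut the same hierarchy either at a fixed number of clusters or at a distance threshold you read off the dendrogram. The threshold value below is just an illustrative guess, not a recommended setting.

from scipy.cluster.hierarchy import fcluster

# Cut the hierarchy into exactly 3 clusters...
labels_by_count = fcluster(linkage_matrix, t=3, criterion='maxclust')

# ...or cut wherever the merge distance exceeds a threshold you picked off the dendrogram
# (t=4.0 is only an example value)
labels_by_distance = fcluster(linkage_matrix, t=4.0, criterion='distance')

print("Labels with 3 clusters:", labels_by_count)
print("Labels with distance cut:", labels_by_distance)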