KMeans Clustering Explained for Beginners: A Step-by-Step Guide in Data Science


Introduction:

Clustering is an unsupervised learning technique used to divide data points into distinct groups, where points within each group are more similar to each other and less similar to those in other groups. It essentially involves grouping objects based on their similarities and differences.

KMeans clustering is an unsupervised machine learning algorithm that performs this grouping by partitioning 'n' observations into 'K' clusters based on their distances to cluster centers. The algorithm aims to minimize within-cluster variance, ensuring that similar observations are grouped together.

KMeans clustering assumes all variables are continuous due to its reliance on distance measures and requires prior selection of the number of clusters (K).


How does the algorithm work?

Before looking at KMeans itself, it helps to contrast it with hierarchical clustering, an approach that builds a hierarchy of clusters. It is typically used for smaller datasets, as it is computationally more intensive than KMeans. There are two main types:

  • Agglomerative (Bottom-Up) Clustering: This starts with each data point as its own cluster and merges clusters iteratively based on similarity until only one cluster remains or a stopping criterion is met. The most similar clusters are merged at each step, creating a tree-like structure called a dendrogram.
  • Divisive (Top-Down) Clustering: This begins with all data points in a single cluster, which is then split iteratively into smaller clusters based on dissimilarity, until each data point is its own cluster or a stopping criterion is achieved.

These methods do not require specifying the number of clusters in advance and are often visualized using dendrograms, which help in determining the natural clusters in the data. Hierarchical clustering is particularly useful when the data has a nested or hierarchical structure.
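
For readers who want to see hierarchical clustering in action, below is a minimal sketch using SciPy's agglomerative (bottom-up) implementation; the synthetic two-blob data is purely illustrative and not part of this article's dataset.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative data: two small synthetic groups of points
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Agglomerative clustering with Ward linkage (merges the pair of clusters
# that increases within-cluster variance the least at each step)
Z = linkage(X, method='ward')

# The dendrogram is the tree-like structure described above
plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.title('Dendrogram (Ward linkage)')
plt.xlabel('Data point index')
plt.ylabel('Merge distance')
plt.show()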

The KMeans algorithm is a popular method for clustering in machine learning, particularly for its simplicity and efficiency. Here's an overview of its working steps (a minimal from-scratch sketch follows the list):

  • Initialize Centroids: Begin by choosing K centroids randomly from the dataset, where K is the number of clusters specified in advance.
  • Assign Points to Clusters: For each data point, calculate the distance to each centroid. Assign the point to the cluster with the nearest centroid.
  • Update Centroids: Recalculate the centroid of each cluster based on the current members. The centroid is usually the mean position of all points in the cluster.
  • Repeat Until Convergence: Repeat the assignment and update steps until the centroids no longer change significantly, or a set number of iterations is reached.
  • Output Clusters: The algorithm concludes by outputting clusters where data points within the same cluster are more similar to each other than to points in other clusters.
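
To make these steps concrete, here is a minimal from-scratch sketch of the loop in NumPy. It is purely illustrative (random initialization, no handling of empty clusters) and is not how scikit-learn implements KMeans internally:

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k random points from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

In practice you would use scikit-learn's KMeans (as done later in this article), which adds smarter k-means++ initialization and multiple restarts.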

The Distance Metrics

In clustering algorithms like KMeans, choosing an appropriate distance metric is crucial, as it directly impacts how clusters are formed. Here are some commonly used distance metrics:


Euclidean Distance: This is the most commonly used metric in KMeans clustering. It calculates the straight-line distance between two points in multidimensional space. Given two points, A and B, with coordinates (x1,y1) and (x2,y2), the Euclidean distance is:

d(A, B) = √((x2 − x1)² + (y2 − y1)²)

   This metric works well with continuous data and clusters that have a spherical shape.

Manhattan Distance (City Block Distance): This metric calculates the distance between two points by summing the absolute differences along each dimension. For points A and B:

d(A, B) = |x2 − x1| + |y2 − y1|

   Manhattan distance is useful when dealing with high-dimensional data or clusters that are elongated rather than spherical.

Cosine Similarity: Rather than distance, cosine similarity measures the cosine of the angle between two vectors. It is commonly used when the magnitude of the points doesn’t matter as much as their direction, making it suitable for high-dimensional, sparse data (e.g., text data). The cosine similarity between points A and B is:

similarity(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)

   Higher cosine similarity indicates points in the same direction, though they may differ in magnitude.

Mahalanobis Distance: This distance metric considers the correlations between variables and scales distances based on the data's variance. It's particularly useful when variables are not independent or identically distributed. Given the covariance matrix Σ of the dataset, the Mahalanobis distance between points A and B is:

d(A, B) = √((A − B)ᵀ Σ⁻¹ (A − B))

Hamming Distance: This metric is used for categorical data, calculating the number of dimensions in which two points differ. It’s often applied in binary or categorical datasets, such as sequences or strings.

Choosing the right distance metric depends on the dataset and the clustering method. Euclidean distance is the default choice in KMeans clustering, but experimenting with different metrics can lead to better clustering results depending on the data’s structure and properties.
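
As a quick illustration, SciPy ships implementations of all of these metrics; the two points and the placeholder dataset below are arbitrary examples rather than values from this article's data:

import numpy as np
from scipy.spatial import distance

# Two arbitrary example points
A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(A, B))    # straight-line (Euclidean) distance
print(distance.cityblock(A, B))    # Manhattan / city block distance
print(1 - distance.cosine(A, B))   # cosine similarity (SciPy returns cosine distance)

# Mahalanobis distance needs the inverse covariance matrix of the dataset
data_placeholder = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data_placeholder, rowvar=False))
print(distance.mahalanobis(A, B, VI))

# Hamming distance for binary/categorical vectors
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # fraction of positions that differ

Note that scikit-learn's KMeans itself only minimizes squared Euclidean distance; the other metrics are typically used with different clustering algorithms or custom implementations.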


Properties of clusters

  • All data points within a cluster should exhibit homogeneity, meaning they should be similar to each other. The Within-Cluster Sum of Squares (WCSS) represents the total sum of the squared distances between each data point and its cluster's centroid. A lower WCSS value indicates that the data points are tightly grouped around the centroid.
  • Conversely, data points from different clusters should be heterogeneous, meaning they should differ significantly from one another. The Between-Cluster Sum of Squares (BCSS) measures the total sum of the squared distances between the centroids of different clusters. A higher BCSS value suggests that the clusters are more spread out, while a lower BCSS value indicates that the clusters are closer together.
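
As a small sketch of how these quantities can be computed with scikit-learn (using synthetic data purely for illustration): WCSS is exposed directly as the fitted model's inertia_, and BCSS can be obtained as the total sum of squares minus WCSS.

import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two loose blobs of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, random_state=42).fit(X)

# WCSS: squared distances of points to their own cluster centroid (lower = tighter clusters)
wcss = kmeans.inertia_

# BCSS: total sum of squares minus WCSS (higher = better-separated clusters)
tss = ((X - X.mean(axis=0)) ** 2).sum()
bcss = tss - wcss

print(f"WCSS = {wcss:.2f}, BCSS = {bcss:.2f}")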

How do we select the right K value?

Choosing the optimal number of clusters, K, is a critical step in KMeans clustering. Here are common methods to determine the best K value:


1. The Elbow Method: 

   - In this method, you calculate the Within-Cluster Sum of Squares (WCSS) for a range of K values (e.g., from 1 to 10).

   - Plot K values on the x-axis and the corresponding WCSS values on the y-axis.

   - Look for an "elbow point," where the decrease in WCSS becomes less significant. This point suggests the optimal K as it balances low WCSS with simplicity.


2. The Silhouette Score: 

   - The silhouette score measures how similar a point is to its cluster compared to other clusters, providing a value between -1 and 1.

   - A higher score indicates that clusters are well-defined and separated, while a lower score suggests overlap or poor clustering.

   - To use this method, calculate the average silhouette score for different K values and choose the one that maximizes this score.


3. Gap Statistic Method:

   - This method compares the WCSS for a range of K values to the expected WCSS of reference data with no real cluster structure (for example, points drawn uniformly over the data's range).

   - The "gap" statistic for each K is the expected log(WCSS) of the reference data minus the observed log(WCSS).

   - The optimal K is the one with the largest gap, indicating that the observed clustering structure is stronger than in random data.


4. Cross-Validation:

   - In some cases, you can split the dataset into training and testing sets to evaluate the performance of different K values.

   - Choose K based on which produces the best clustering consistency or generalizes best across subsets of data.


These methods provide guidelines, but the right K value ultimately depends on your data and the clustering goals. Experimentation and interpretation are essential for finding the most meaningful clusters.
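
To illustrate the silhouette approach (method 2 above), here is a minimal sketch using scikit-learn; the synthetic three-blob data is only a stand-in for your own feature matrix:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative data: three synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 3, 6)])

# Average silhouette score for each candidate K; pick the K that maximizes it
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(f"K = {k}: silhouette = {silhouette_score(X, labels):.3f}")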

Implementation of KMeans clustering in Python using the sklearn library

I will be using the crime_data.csv dataset for this implementation.

The dataset contains 50 entries and five columns: State, Murder, Assault, UrbanPop, and Rape:

- Murder: The murder rates across different states in the United States.  

- Assault: The assault rates across various states in the United States.  

- UrbanPop: The percentage of the urban population in each state in the United States.  

- Rape: The rape rates in different states of the United States.

Let’s begin by importing the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

Create a pandas DataFrame from crime_data.csv:

data = pd.read_csv("crime_data.csv")
data.head()

Check if there are any null values

data.isnull().any()

Drop the categorical feature (the state names column) and copy the remaining data to another DataFrame:

mydata = data.loc[:, data.columns != 'Unnamed: 0']
mydata.head()

Scale the data (using MinMaxScaler):

scaler = MinMaxScaler()
norm_mydata = mydata.copy()

def minmaxscaler(df):
    # Scale each column to the [0, 1] range
    for column_name, column_data in df.items():
        df[column_name] = scaler.fit_transform(np.array(column_data).reshape(-1, 1))

minmaxscaler(norm_mydata)
norm_mydata.head()

Scree plot or Elbow plot to find K

k_values = list(range(2, 11))
sum_of_squared_distances = []
for i in k_values:
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(norm_mydata)
    sum_of_squared_distances.append(kmeans.inertia_)  # inertia_ is the WCSS

plt.figure(figsize=(10, 5))
plt.plot(k_values, sum_of_squared_distances, 'go--')
plt.xlabel('Number of Clusters')
plt.ylabel('Within Cluster Sum of Squares')
plt.title('Elbow Curve to find optimum K')
plt.show()

Building the KMeans model with K=4 (Training and Predicting)

# Build the KMeans model with K=4 on the scaled data
kmeans_4 = KMeans(n_clusters=4, random_state=42)
kmeans_4.fit(norm_mydata)

# Predict the cluster label for each observation
norm_mydata['Cluster'] = kmeans_4.predict(norm_mydata)

# Display the first few rows with cluster assignments
print(norm_mydata.head())


Sample visualization of clusters

from sklearn.decomposition import PCA

# Apply PCA to reduce the scaled features to 2 dimensions for visualization
pca = PCA(n_components=2)
data_pca = pca.fit_transform(norm_mydata.drop(columns=['Cluster']))

# Create a DataFrame with the PCA components and the cluster labels
data_pca_df = pd.DataFrame(data_pca, columns=['PCA1', 'PCA2'])
data_pca_df['Cluster'] = norm_mydata['Cluster'].values

# Plot the clusters
plt.figure(figsize=(10, 8))
for cluster in sorted(data_pca_df['Cluster'].unique()):
    subset = data_pca_df[data_pca_df['Cluster'] == cluster]
    plt.scatter(subset['PCA1'], subset['PCA2'], label=f'Cluster {cluster}', alpha=0.6)

plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.title('Cluster Visualization using PCA')
plt.legend()
plt.show()


End Notes

In this blog, we explored how the KMeans algorithm works, the distance metrics commonly used in clustering, and how the Elbow curve and related methods help in determining the optimal K value. I hope you found this article helpful. Feel free to share it with your fellow learners!