DIY : k clustering (means / medians) via Python – pj – Medium

This is a quick walk through on setting up your own k clustering algorithm from scratch. This is meant to better understand the details behind the algorithm as well as areas that may allow for alternate solutions.

GOAL: Cluster “like minded” data points together.

3 main methods needed:

  1. Need a way to compute the distances between points.
  2. Need a way to assign clusters to points.
  3. Need a way to update the centroids of our clusters.
import math
import numpy as np
from sklearn import datasets
## slim it down to 2d,
X, y = datasets.load_iris(return_X_y=True)
X = X[:, 0:2]

1. L2 Norm (Euclidean Distance)

def euclidean_distance(p, q):    
if type(p) == int:
p = [ p ]
if type(q) == int:
q = [ q ]
    ## shortest distance to go, between two points,
assert len(p) == len(q)
return math.sqrt(
np.sum([ (x1 - x2) ** 2 for x1, x2 in zip(p, q) ])
)

2. Assign Cluster to Instances

def assign_instance_a_cluster(instance, clusters):
n_k = len(clusters)
return np.argmin([
euclidean_distance(instance, clusters[i])
for i in range(n_k)
])

## setup cluster to instances dictionary,
cluster_to_instances = dict([
(i, []) for i in range(len(clusters))
])

for row in X:
cluster_assignment = assign_instance_a_cluster(row, clusters)
cluster_to_instances[cluster_assignment].append(row.tolist())

3. Update Cluster Centroids (mean / median)

def update_cluster_centroid(cluster, use_kmeans=True):
_, c_l = X.shape;

if use_kmeans:
return [
np.mean(
np.array(cluster)[:, i])
for i in range(c_l)
]

return [
np.median(
np.array(cluster)[:, i])
for i in range(c_l)
]

use_means = True
clusters = [
update_cluster_centroid(cluster_to_instances[i], use_means)
for i in range(n_k)
]

The rest is straight forward. Setup a method to iterate over these three methods. You could create a stop rule, to exit early. I setup one that would exit early when all the centroids stop moving. Other useful methods exist. The Full Notebook can be found here.

In summary, immediately the methods that compute the distance between points as well as computing the cluster centroids stands out as areas that we can adjust to potentially achieve different results. Just swapping the mean for the median instantly changes our results (see below). Another, often overlooked facet, is our initial starting point for the centroids.

Result Using Mean:

Using Means — 15 iterations

Result using Median:

Using Medians — 4 iterations

References:

  1. https://en.wikipedia.org/wiki/K-means_clustering

LEAVE A REPLY

Please enter your comment!
Please enter your name here