### Python k-means Clustering With Example


K-means clustering is a clustering algorithm that aims to partition n observations into k clusters.

There are 3 steps:

1. Initialisation - K initial "means" (centroids) are generated at random.

2. Assignment - K clusters are created by associating each observation with the nearest centroid.

3. Update - The centroid of each cluster becomes the new mean.

Assignment and Update are repeated iteratively until convergence, that is, until the assignments stop changing.

The end result is a local minimum of the sum of squared errors between points and their respective centroids. Which minimum is reached depends on the initial centroids.
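To make the "sum of squared errors" concrete, here is a minimal sketch of how it could be computed with NumPy. The function name `within_cluster_sse` and the toy points are assumptions made for illustration and are not part of the walkthrough that follows.

```python
import numpy as np

def within_cluster_sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    points = np.asarray(points, dtype=float)        # shape (n, d)
    centroids = np.asarray(centroids, dtype=float)  # shape (k, d)
    labels = np.asarray(labels)                     # shape (n,), 0-based cluster index
    diffs = points - centroids[labels]              # offset of each point from its own centroid
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two pairs of points, one pair per cluster
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]

# Centroids at the cluster means: every point is 1 unit from its centroid
sse_at_means = within_cluster_sse(points, labels, [[0, 1], [10, 1]])

# Centroids placed on one point of each pair: the SSE is larger
sse_off_means = within_cluster_sse(points, labels, [[0, 0], [10, 0]])

print(sse_at_means, sse_off_means)
```

For a fixed assignment, placing each centroid at the mean of its cluster gives the smallest SSE, which is why the Update step uses the mean.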

We'll do this manually first, then show how it's done using scikit-learn.

Let's view it in action using k=3:

In [1]:

## Initialisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})

np.random.seed(200)
k = 3
# centroids[i] = [x, y]
centroids = {
    i+1: [np.random.randint(0, 80), np.random.randint(0, 80)]
    for i in range(k)
}

fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()
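The cell above draws the starting centroids uniformly at random from the plotting range. A common alternative, which is a different technique from the one used in this walkthrough, is to pick k distinct observations as the starting centroids (often called Forgy initialisation). The sketch below assumes a hypothetical helper `init_centroids_from_data` and a handful of sample points.

```python
import numpy as np

def init_centroids_from_data(points, k, seed=0):
    """Pick k distinct observations to serve as starting centroids."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Sample k row indices without replacement so no observation is picked twice
    chosen = rng.choice(len(points), size=k, replace=False)
    return points[chosen]

# Hypothetical sample points, used only for illustration
points = [[12, 39], [20, 36], [28, 30], [45, 59], [52, 70], [61, 8], [69, 7]]
start = init_centroids_from_data(points, k=3)
print(start)
```

Starting from real observations guarantees that every centroid begins inside the data, so no cluster starts out far from all points.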

In [2]:

## Assignment Stage
def assignment(df, centroids):
    for i in centroids.keys():
        # Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (df['x'] - centroids[i][0]) ** 2
                + (df['y'] - centroids[i][1]) ** 2
            )
        )
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    # Name of the column holding the smallest distance for each row
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    # Keep only the centroid number at the end of the column name
    df['closest'] = df['closest'].map(lambda x: int(x.split('_')[-1]))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df

df = assignment(df, centroids)
print(df.head())

fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

    x   y  distance_from_1  distance_from_2  distance_from_3  closest color
0  12  39        26.925824        56.080300        56.727418        1     r
1  20  36        20.880613        48.373546        53.150729        1     r
2  28  30        14.142136        41.761226        53.338541        1     r
3  18  52        36.878178        50.990195        44.102154        1     r
4  29  54        38.118237        40.804412        34.058773        3     b
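The same assignment can be done without one DataFrame column per centroid, using NumPy broadcasting. This is only a sketch: `assign_to_nearest` is a hypothetical helper, labels are 0-based here rather than 1-based, and the centroid coordinates are values chosen to be consistent with the distances printed in the table above, not values stated in the text.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return the index of the nearest centroid for each point, plus all distances."""
    points = np.asarray(points, dtype=float)        # shape (n, d)
    centroids = np.asarray(centroids, dtype=float)  # shape (k, d)
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d): every point against every centroid
    diffs = points[:, None, :] - centroids[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))   # shape (n, k)
    return distances.argmin(axis=1), distances

# First five observations from the table above
points = [[12, 39], [20, 36], [28, 30], [18, 52], [29, 54]]
# Assumed centroids, inferred from the printed distances
centroids = [[26, 16], [68, 42], [55, 76]]

labels, distances = assign_to_nearest(points, centroids)
print(labels)
print(distances.round(6))
```

With these assumed centroids, the first four points fall in cluster 0 and the fifth in cluster 2, matching the `closest` column (1, 1, 1, 1, 3) once the 0-based offset is taken into account.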

In [3]:

## Update Stage
import copy

old_centroids = copy.deepcopy(centroids)

def update(centroids):
    # Move each centroid to the mean of the points currently assigned to it
    for i in centroids.keys():
        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
    return centroids

centroids = update(centroids)

fig = plt.figure(figsize=(5, 5))
ax = plt.axes()
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
for i in old_centroids.keys():
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75
    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
    # Arrow from the old centroid towards the new one (dx and dy were otherwise unused)
    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])
plt.show()
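The update step is a per-cluster mean, so it can also be written as a pandas `groupby`. The sketch below uses a hypothetical helper `update_centroids` and a tiny made-up DataFrame. It returns a new dictionary instead of modifying the existing one in place.

```python
import pandas as pd

def update_centroids(df, label_col="closest"):
    """Return {cluster label: [mean x, mean y]} for the current assignment."""
    means = df.groupby(label_col)[["x", "y"]].mean()
    return {
        label: [float(row["x"]), float(row["y"])]
        for label, row in means.iterrows()
    }

# Hypothetical toy assignment: two points in cluster 1, two in cluster 2
toy = pd.DataFrame({
    "x": [0, 2, 10, 12],
    "y": [0, 2, 10, 14],
    "closest": [1, 1, 2, 2],
})

new_centroids = update_centroids(toy)
print(new_centroids)
```

One design point to keep in mind: a cluster with no assigned points does not appear in the `groupby` result at all, whereas taking `np.mean` of an empty selection yields `NaN`. Either way, empty clusters need explicit handling in a more robust implementation.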

In [4]:

## Repeat Assignment Stage
df = assignment(df, centroids)

# Plot results
fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

Note that one of the reds is now green and one of the blues is now red.

We are getting closer. We now repeat until there are no changes to any of the clusters.

In [5]:

# Continue until all assigned categories don't change any more
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    if closest_centroids.equals(df['closest']):
        break

fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()
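Putting the assignment and update steps together, the whole procedure fits in one short function. The following is a sketch under stated assumptions rather than a reference implementation: the name `kmeans`, the `max_iter` safety cap, and the rule of leaving an empty cluster's centroid where it is are all choices made here for illustration.

```python
import numpy as np

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Assignment: distance from every point to every centroid, shape (n, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Converged once no point changes cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update: move each centroid to the mean of its members
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical toy data: two well-separated pairs of points
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
final_labels, final_centroids = kmeans(toy_points, initial_centroids=[[0, 0], [10, 0]])
print(final_labels)
print(final_centroids)
```

The `max_iter` cap guards against a very long run on awkward data. The `while True` loop above relies on convergence alone.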

So we have 3 clear clusters with 3 means at the centre of these clusters.

We will now repeat the above using scikit-learn. We first fit the model to our data:

In [6]:

from sklearn.cluster import KMeans

df = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})

kmeans = KMeans(n_clusters=3)
kmeans.fit(df)

Out[6]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

Then we predict the cluster label of each observation and read off the fitted centroids:

In [7]:

labels = kmeans.predict(df)

centroids = kmeans.cluster_centers_

In [8]:

fig = plt.figure(figsize=(5, 5))
# Build a list (not a lazy map object) so matplotlib can use it as per-point colours
colors = [colmap[label + 1] for label in labels]
plt.scatter(df['x'], df['y'], color=colors, alpha=0.5, edgecolor='k')
for idx, centroid in enumerate(centroids):
    plt.scatter(*centroid, color=colmap[idx+1])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

We get the exact same result, albeit with the colours in a different order.
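The colour order changes because the fitted model was created with `random_state=None`, so cluster numbering can differ from run to run. A minimal sketch of a reproducible fit is shown below. The seed value `0` is an arbitrary choice made here, and `fit_predict` is used as a shortcut for calling `fit` followed by `predict` on the same data.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})

# random_state fixes the seed used for centroid initialisation (0 is arbitrary);
# n_init=10 reruns the algorithm from 10 different starts and keeps the best run
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(df[['x', 'y']])

centroids = kmeans.cluster_centers_  # array of shape (3, 2)
inertia = kmeans.inertia_            # sum of squared distances to the nearest centroid

print(labels)
print(centroids)
print(inertia)
```

The `inertia_` attribute is the sum of squared errors that the algorithm minimises, so it can be used to compare runs.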

Some things to take note of though:

1. k-means clustering is very sensitive to scale because it relies on Euclidean distance, so be sure to normalize the data if the features are measured on very different scales.

2. If there are symmetries in your data, some of the points may be mis-labelled.

3. It is recommended to run the same k-means several times with different initial centroids and take the most common label.
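The effect of scale in the first point can be seen with a few numbers. In the sketch below, the four observations are hypothetical: the first feature is in the thousands and the second is in single digits. `StandardScaler` and `make_pipeline` are scikit-learn utilities. The helper `nearest_to_first` is made up for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical observations: feature 0 is large-scale, feature 1 is small-scale
X = np.array([
    [1000.0, 1.0],   # P
    [1100.0, 1.2],   # Q: close to P in feature 1
    [1050.0, 9.0],   # R: close to P in feature 0 only
    [1250.0, 9.5],   # S
])

def nearest_to_first(data):
    """Row index of the observation closest to the first row."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(distances.argmin()) + 1

# On the raw data, feature 0 dominates the Euclidean distance
raw_nearest = nearest_to_first(X)

# After standardising each feature to zero mean and unit variance, both features count
X_scaled = StandardScaler().fit_transform(X)
scaled_nearest = nearest_to_first(X_scaled)

print(raw_nearest, scaled_nearest)

# Scaling and clustering chained so the same scaling is applied at fit and predict time
model = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(X)
print(labels)
```

On the raw data the nearest neighbour of P is R (index 2). After scaling it is Q (index 1). Regarding the third point, scikit-learn's `n_init` parameter reruns the algorithm from several starting points, but it keeps the run with the lowest inertia rather than taking the most common label.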


Happy Pythoning....!!
