Python k-means Clustering Example














































Python k-means Clustering Example



K-means Clustering in
Python

(With Example)



K-means Clustering in Python

K-means clustering is a clustering algorithm that aims to partition n observations into k clusters.

There are 3 steps:

        Initialisation %u2013 K initial %u201Cmeans%u201D (centroids) are generated at random

        Assignment %u2013 K clusters are created by associating each observation with the nearest centroid

        Update %u2013 The centroid of the clusters becomes the new mean

Assignment and Update are repeated iteratively until convergence

The end result is that the sum of squared errors is minimised between points and their respective centroids.

We%u2019ll do this manually first, then show how it%u2019s done using scikit-learn

Let%u2019s view it in action using k=3:

In [1]:

## Initialisation

 

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

 

df = pd.DataFrame({

    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],

    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]

})

 

 

np.random.seed(200)

k = 3

# centroids[i] = [x, y]

centroids = {

    i+1: [np.random.randint(0, 80), np.random.randint(0, 80)]

    for i in range(k)

}

   

fig = plt.figure(figsize=(5, 5))

plt.scatter(df['x'], df['y'], color='k')

colmap = {1: 'r', 2: 'g', 3: 'b'}

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

 

In [2]:

## Assignment Stage

 

def assignment(df, centroids):

    for i in centroids.keys():

        # sqrt((x1 - x2)^2 - (y1 - y2)^2)

        df['distance_from_{}'.format(i)] = (

            np.sqrt(

                (df['x'] - centroids[i][0]) ** 2

                + (df['y'] - centroids[i][1]) ** 2

            )

        )

    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]

    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)

    df['closest'] = df['closest'].map(lambda x: int(x.lstrip('distance_from_')))

    df['color'] = df['closest'].map(lambda x: colmap[x])

    return df

 

df = assignment(df, centroids)

print(df.head())

 

fig = plt.figure(figsize=(5, 5))

plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

    x   y  distance_from_1  distance_from_2  distance_from_3  closest color

0  12  39        26.925824        56.080300        56.727418        1     r

1  20  36        20.880613        48.373546        53.150729        1     r

2  28  30        14.142136        41.761226        53.338541        1     r

3  18  52        36.878178        50.990195        44.102154        1     r

4  29  54        38.118237        40.804412        34.058773        3     b

 

In [3]:

## Update Stage

 

import copy

 

old_centroids = copy.deepcopy(centroids)

 

def update(k):

    for i in centroids.keys():

        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])

        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])

    return k

 

centroids = update(centroids)

   

fig = plt.figure(figsize=(5, 5))

ax = plt.axes()

plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

for i in old_centroids.keys():

    old_x = old_centroids[i][0]

    old_y = old_centroids[i][1]

    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75

    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75

    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])

plt.show()

 

In [4]:

## Repeat Assigment Stage

 

df = assignment(df, centroids)

 

# Plot results

fig = plt.figure(figsize=(5, 5))

plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

 

Note that one of the reds is now green and one of the blues is now red.

We are getting closer.

We now repeat until there are no changes to any of the clusters.

In [5]:

# Continue until all assigned categories don't change any more

while True:

    closest_centroids = df['closest'].copy(deep=True)

    centroids = update(centroids)

    df = assignment(df, centroids)

    if closest_centroids.equals(df['closest']):

        break

 

fig = plt.figure(figsize=(5, 5))

plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

 

So we have 3 clear clusters with 3 means at the centre of these clusters.

We will now repeat the above using scikit-learn, we first fit to our data

In [6]:

df = pd.DataFrame({

    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],

    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]

})

 

from sklearn.cluster import KMeans

 

kmeans = KMeans(n_clusters=3)

kmeans.fit(df)

Out[6]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,

    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',

    random_state=None, tol=0.0001, verbose=0)

Then we learn the labels

In [7]:

labels = kmeans.predict(df)

centroids = kmeans.cluster_centers_

In [8]:

fig = plt.figure(figsize=(5, 5))

 

colors = map(lambda x: colmap[x+1], labels)

 

plt.scatter(df['x'], df['y'], color=colors, alpha=0.5, edgecolor='k')

for idx, centroid in enumerate(centroids):

    plt.scatter(*centroid, color=colmap[idx+1])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

 

We get the exact same result, albeit with the colours in a different order.

Some things to take note of though:

        k-means clustering is very sensitive to scale due to its reliance on Euclidean distance so be sure to normalize data if there are likely to be scaling problems.

        If there are some symmetries in your data, some of the labels may be mis-labelled

        It is recommended to do the same k-means with different initial centroids and take the most common label.

 

...



 Happy Pythoning....!!


More Articles of Aditi Kothiyal:

Name Views Likes
Python AdaBoost Mathematics Behind AdaBoost 465 1
Python PyCaret How to optimize the probability threshold % in binary classification 2240 0
Python K-means Predicting Iris Flower Species 1391 2
Python PyCaret How to ignore certain columns for model building 3243 0
Python PyCaret Experiment Logging 805 0
Python PyWin32 Open a File in Excel 1099 0
Python Guppy GSL Introduction 256 2
Python Usage of Guppy With Example 1176 2
Python Naive Bayes Tutorial 596 2
Python Guppy Recent Memory Usage of a Program 983 2
Introduction to AdaBoost 341 1
Python AdaBoost Implementation of AdaBoost 549 1
Python AdaBoost Advantages and Disadvantages of AdaBoost 3972 1
Python K-Means Clustering Applications 375 2
Python Random Forest Algorithm Decision Trees 486 0
Python K-means Clustering PREDICTING IRIS FLOWER SPECIES 513 1
Python Random Forest Algorithm Bootstrap 549 0
Python PyCaret Util Functions 495 0
Python K-means Music Genre Classification 1863 1
Python PyWin Attach an Excel file to Outlook 1731 0
Python Guppy GSL Document and Test Example 304 2
Python Random Forest Algorithm Bagging 439 0
Python AdaBoost An Example of How AdaBoost Works 339 1
Python PyWin32 Getting Started PyWin32 881 0
Python Naive Bayes in Machine Learning 413 2
Python PyCaret How to improve results from hyperparameter tuning by increasing "n_iter" 1822 0
Python PyCaret Getting Started with PyCaret 2.0 404 1
Python PyCaret Tune Model 1541 1
Python PyCaret Create your own AutoML software 387 0
Python PyCaret Intoduction to PyCaret 347 1
Python PyCaret Compare Models 2998 1
Python PyWin Copying Data into Excel 1250 0
Python Guppy Error: expected function body after function declarator 451 2
Python Coding Random forest classifier using xgBoost 282 0
Python PyCaret How to tune "n parameter" in unsupervised experiments 715 0
Python PyCaret How to programmatically define data types in the setup function 1476 0
Python PyCaret Ensemble Model 881 1
Python Random forest algorithm Introduction 257 0
Python k-means Clustering Example 376 1
Python PyCaret Plot Model 1382 1
Python Hamming Distance 769 0
Python Understanding Random forest algorithm 341 0
Python PyCaret Sort a Dictionary by Keys 279 0
Python Coding Random forest classifier using sklearn 389 0
Python Guppy Introduction 423 2
Python How to use Guppy/Heapy for tracking down Memory Usage 1154 2
Python AdaBoost Summary and Conclusion 270 1
Python PyCaret Create Model 408 1
Python k -means Clusturing Introduction 386 2
Python k-means Clustering With Example 397 2

Comments