Python k-means Clustering Example














































Python k-means Clustering Example



K-means Clustering in
Python

(With Example)



K-means Clustering in Python

K-means clustering is a clustering algorithm that aims to partition n observations into k clusters.

There are 3 steps:

·        Initialisation %u2013 K initial %u201Cmeans%u201D (centroids) are generated at random

·        Assignment %u2013 K clusters are created by associating each observation with the nearest centroid

·        Update %u2013 The centroid of the clusters becomes the new mean

Assignment and Update are repeated iteratively until convergence

The end result is that the sum of squared errors is minimised between points and their respective centroids.

We%u2019ll do this manually first, then show how it%u2019s done using scikit-learn

Let%u2019s view it in action using k=3:

In [1]:

## Initialisation

 

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

 

df = pd.DataFrame({

    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],

    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]

})

 

 

np.random.seed(200)

k = 3

# centroids[i] = [x, y]

centroids = {

    i+1: [np.random.randint(0, 80), np.random.randint(0, 80)]

    for i in range(k)

}

   

fig = plt.figure(figsize=(5, 5))

plt.scatter(df['x'], df['y'], color='k')

colmap = {1: 'r', 2: 'g', 3: 'b'}

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

 

In [2]:

## Assignment Stage

 

def assignment(df, centroids):

    for i in centroids.keys():

        # sqrt((x1 - x2)^2 - (y1 - y2)^2)

        df['distance_from_{}'.format(i)] = (

            np.sqrt(

                (df['x'] - centroids[i][0]) ** 2

                + (df['y'] - centroids[i][1]) ** 2

            )

        )

    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]

    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)

    df['closest'] = df['closest'].map(lambda x: int(x.lstrip('distance_from_')))

    df['color'] = df['closest'].map(lambda x: colmap[x])

    return df

 

df = assignment(df, centroids)

print(df.head())

 

fig = plt.figure(figsize=(5, 5))

plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

    x   y  distance_from_1  distance_from_2  distance_from_3  closest color

0  12  39        26.925824        56.080300        56.727418        1     r

1  20  36        20.880613        48.373546        53.150729        1     r

2  28  30        14.142136        41.761226        53.338541        1     r

3  18  52        36.878178        50.990195        44.102154        1     r

4  29  54        38.118237        40.804412        34.058773        3     b

 

In [3]:

## Update Stage

 

import copy

 

old_centroids = copy.deepcopy(centroids)

 

def update(k):

    for i in centroids.keys():

        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])

        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])

    return k

 

centroids = update(centroids)

   

fig = plt.figure(figsize=(5, 5))

ax = plt.axes()

plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

for i in old_centroids.keys():

    old_x = old_centroids[i][0]

    old_y = old_centroids[i][1]

    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75

    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75

    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])

plt.show()

 

In [4]:

## Repeat Assigment Stage

 

df = assignment(df, centroids)

 

# Plot results

fig = plt.figure(figsize=(5, 5))

plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

 

Note that one of the reds is now green and one of the blues is now red.

We are getting closer.

We now repeat until there are no changes to any of the clusters.

In [5]:

# Continue until all assigned categories don't change any more

while True:

    closest_centroids = df['closest'].copy(deep=True)

    centroids = update(centroids)

    df = assignment(df, centroids)

    if closest_centroids.equals(df['closest']):

        break

 

fig = plt.figure(figsize=(5, 5))

plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')

for i in centroids.keys():

    plt.scatter(*centroids[i], color=colmap[i])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

 

So we have 3 clear clusters with 3 means at the centre of these clusters.

We will now repeat the above using scikit-learn, we first fit to our data

In [6]:

df = pd.DataFrame({

    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],

    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]

})

 

from sklearn.cluster import KMeans

 

kmeans = KMeans(n_clusters=3)

kmeans.fit(df)

Out[6]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,

    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',

    random_state=None, tol=0.0001, verbose=0)

Then we learn the labels

In [7]:

labels = kmeans.predict(df)

centroids = kmeans.cluster_centers_

In [8]:

fig = plt.figure(figsize=(5, 5))

 

colors = map(lambda x: colmap[x+1], labels)

 

plt.scatter(df['x'], df['y'], color=colors, alpha=0.5, edgecolor='k')

for idx, centroid in enumerate(centroids):

    plt.scatter(*centroid, color=colmap[idx+1])

plt.xlim(0, 80)

plt.ylim(0, 80)

plt.show()

 

We get the exact same result, albeit with the colours in a different order.

Some things to take note of though:

·        k-means clustering is very sensitive to scale due to its reliance on Euclidean distance so be sure to normalize data if there are likely to be scaling problems.

·        If there are some symmetries in your data, some of the labels may be mis-labelled

·        It is recommended to do the same k-means with different initial centroids and take the most common label.

 

...



 Happy Pythoning....!!


More Articles of Aditi Kothiyal:

Name Views Likes
Python AdaBoost Mathematics Behind AdaBoost 421 1
Python PyCaret How to optimize the probability threshold % in binary classification 2071 0
Python K-means Predicting Iris Flower Species 1323 2
Python PyCaret How to ignore certain columns for model building 2635 0
Python PyCaret Experiment Logging 680 0
Python PyWin32 Open a File in Excel 941 0
Python Guppy GSL Introduction 220 2
Python Usage of Guppy With Example 1102 2
Python Naive Bayes Tutorial 552 2
Python Guppy Recent Memory Usage of a Program 893 2
Introduction to AdaBoost 290 1
Python AdaBoost Implementation of AdaBoost 513 1
Python AdaBoost Advantages and Disadvantages of AdaBoost 3714 1
Python K-Means Clustering Applications 333 2
Python Random Forest Algorithm Decision Trees 440 0
Python K-means Clustering PREDICTING IRIS FLOWER SPECIES 457 1
Python Random Forest Algorithm Bootstrap 476 0
Python PyCaret Util Functions 441 0
Python K-means Music Genre Classification 1763 1
Python PyWin Attach an Excel file to Outlook 1542 0
Python Guppy GSL Document and Test Example 248 2
Python Random Forest Algorithm Bagging 387 0
Python AdaBoost An Example of How AdaBoost Works 280 1
Python PyWin32 Getting Started PyWin32 603 0
Python Naive Bayes in Machine Learning 376 2
Python PyCaret How to improve results from hyperparameter tuning by increasing "n_iter" 1724 0
Python PyCaret Getting Started with PyCaret 2.0 357 1
Python PyCaret Tune Model 1325 1
Python PyCaret Create your own AutoML software 321 0
Python PyCaret Intoduction to PyCaret 297 1
Python PyCaret Compare Models 2697 1
Python PyWin Copying Data into Excel 1154 0
Python Guppy Error: expected function body after function declarator 413 2
Python Coding Random forest classifier using xgBoost 247 0
Python PyCaret How to tune "n parameter" in unsupervised experiments 659 0
Python PyCaret How to programmatically define data types in the setup function 1403 0
Python PyCaret Ensemble Model 805 1
Python Random forest algorithm Introduction 228 0
Python k-means Clustering Example 340 1
Python PyCaret Plot Model 1245 1
Python Hamming Distance 715 0
Python Understanding Random forest algorithm 311 0
Python PyCaret Sort a Dictionary by Keys 245 0
Python Coding Random forest classifier using sklearn 340 0
Python Guppy Introduction 368 2
Python How to use Guppy/Heapy for tracking down Memory Usage 1069 2
Python AdaBoost Summary and Conclusion 232 1
Python PyCaret Create Model 365 1
Python k -means Clusturing Introduction 326 2
Python k-means Clustering With Example 351 2

Comments