Python k-means Clustering Introduction

k-means Clustering

k-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori.

The main idea is to define k centers, one for each cluster. These centers should be placed in a cunning way, because different locations produce different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new center. A loop has been generated; as a result of this loop we may notice that the k centers change their location step by step until no more changes are done, or in other words the centers do not move any more. Finally, this algorithm aims at minimizing an objective function known as the squared error function, given by:

                            J(V) = Σ(i=1..c) Σ(j=1..ci) ( ||xi − vj|| )²

where,

            '||xi − vj||' is the Euclidean distance between xi and vj,

             'ci' is the number of data points in the ith cluster,

              'c' is the number of cluster centers.
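As a concrete illustration, the objective J can be computed directly once each point has an assigned center. The following is a small pure-Python sketch; the function name and data layout (points and centers as tuples, labels as a list of cluster indices) are illustrative, not from the original:

```python
import math

def squared_error(points, centers, labels):
    # J(V): sum of squared Euclidean distances between each data
    # point and the center of the cluster it is assigned to.
    return sum(math.dist(p, centers[lab]) ** 2
               for p, lab in zip(points, labels))
```

For example, with points (0, 0) and (3, 4) both assigned to a single center at the origin, J = 0² + 5² = 25.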

 

Algorithmic steps for k-means clustering

Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of centers.

1) Randomly select 'c' cluster centers.

2) Calculate the distance between each data point and cluster centers.

3) Assign each data point to the cluster center whose distance from it is the minimum over all the cluster centers.

4) Recalculate the new cluster centers using:

                            vi = (1/ci) Σ(j=1..ci) xj

where, 'ci' represents the number of data points in the ith cluster and the sum runs over the data points xj assigned to that cluster.

 

5) Recalculate the distance between each data point and new obtained cluster centers.

6) If no data point was reassigned then stop, otherwise repeat from step 3).
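The steps above can be sketched as a minimal pure-Python implementation. All names are illustrative; the sketch assumes points are tuples of floats and that no cluster ever becomes empty during the iterations:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch following the steps above."""
    rng = random.Random(seed)
    # 1) Randomly select k cluster centers from the data points.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iters):
        # 2)-3) Assign each point to its nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        # 4) Recalculate each center as the mean (centroid) of its cluster.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            new_centers.append(tuple(sum(coord) / len(members)
                                     for coord in zip(*members)))
        # 5)-6) Stop once no center moves; otherwise repeat.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

On a toy data set with two well-separated groups, the returned labels split the points into those two groups, matching the behaviour described in advantage 3) below.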

 

Advantages

1) Fast, robust and easy to understand.

2) Relatively efficient: O(t·k·n·d), where n is the number of objects, k the number of clusters, d the dimension of each object, and t the number of iterations. Normally, k, t, d << n.

3) Gives the best results when the data sets are distinct or well separated from each other.


Fig I: Showing the result of k-means for 'N' = 60 and 'c' = 3

Note: For a more detailed figure for the k-means algorithm, please refer to the k-means figure sub page.


Disadvantages

1) The learning algorithm requires a priori specification of the number of cluster centers.

2) The use of exclusive assignment: if two clusters overlap heavily, k-means will not be able to resolve that there are two clusters.

3) The learning algorithm is not invariant to non-linear transformations, i.e. with different representations of the data we get different results (data represented in Cartesian co-ordinates and polar co-ordinates will give different results).

4) Euclidean distance measures can unequally weight underlying factors.

5) The learning algorithm provides the local optima of the squared error function. 

6) Random choice of the initial cluster centers may not lead to a fruitful result (please refer to the figure).

7) Applicable only when mean is defined i.e. fails for categorical data.

8) Unable to handle noisy data and outliers.

9) The algorithm fails for non-linear data sets.
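A quick illustration of point 3) above: the same pair of points yields different Euclidean distances under Cartesian and polar representations, so clustering each representation can group the data differently. A toy sketch (the points and variable names are my own, for illustration only):

```python
import math

# The same two points on the unit circle, in two representations.
a_cart, b_cart = (1.0, 0.0), (-1.0, 0.0)        # Cartesian (x, y)
a_polar, b_polar = (1.0, 0.0), (1.0, math.pi)   # polar (r, theta)

d_cart = math.dist(a_cart, b_cart)    # 2.0
d_polar = math.dist(a_polar, b_polar) # pi, a different distance
```

Since k-means compares raw Euclidean distances, which pair of points looks "closest" depends on the co-ordinate system chosen.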

 

                                          

Fig II: Showing the non-linear data set where the k-means algorithm fails

    
Demonstration of the standard algorithm

1. k initial "means" (in this case k = 3) are randomly generated within the data domain (shown in color).

 

2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

 

3. The centroid of each of the k clusters becomes the new mean.

 

4. Steps 2 and 3 are repeated until convergence has been reached.



Happy Pythoning...!!


