Python K-Means Clustering Applications

k-means clustering is fairly easy to apply even to large data sets, particularly when using heuristics such as Lloyd's algorithm. It has been used successfully in market segmentation, computer vision, and astronomy, among many other domains. It is often used as a preprocessing step for other algorithms, for example to find a starting configuration.
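To make the reference to Lloyd's algorithm concrete, here is a minimal pure-Python sketch of its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its members). The function name `lloyd_kmeans` and the toy data are assumptions made for illustration. The initial centroids are passed in explicitly so the run is reproducible; real implementations usually pick them randomly or with k-means++.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, initial_centroids, max_iter=100):
    """Cluster `points` (a list of tuples) with Lloyd's algorithm.

    `lloyd_kmeans` is a hypothetical helper written for this example.
    Returns the final centroids and one cluster label per point.
    """
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = lloyd_kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Because each iteration only needs distances between points and k centroids, the cost per pass grows linearly with the number of points, which is one reason the method scales to large data sets.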

Vector quantization

k-means originates from signal processing and still finds use in this domain. For example, in computer graphics, color quantization is the task of reducing the color palette of an image to a fixed number of colors k. The k-means algorithm can easily be used for this task and produces competitive results. One use case for this approach is image segmentation. Other uses of vector quantization include non-random sampling, as k-means can easily be used to choose k different but prototypical objects from a large data set for further analysis.
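Once k-means has produced k centroid colors, the quantization itself is a nearest-centroid lookup. The sketch below shows only that lookup step. It assumes the palette has already been learned and hard-codes a hypothetical three-color palette in place of real k-means output; the function names and pixel values are also made up for illustration.

```python
def nearest_color(pixel, palette):
    """Return the palette color with the smallest squared RGB distance to `pixel`."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, c)))


def quantize(pixels, palette):
    """Replace every pixel by its nearest palette color."""
    return [nearest_color(p, palette) for p in pixels]


# Hypothetical palette: stands in for k = 3 centroids returned by k-means.
palette = [(250, 10, 10), (10, 240, 20), (15, 15, 230)]

# A tiny "image", flattened to a list of RGB tuples.
pixels = [
    (255, 0, 0), (240, 30, 25),
    (0, 255, 0), (30, 200, 40),
    (0, 0, 255), (40, 40, 200),
]

quantized = quantize(pixels, palette)
print(quantized)
print(len(set(quantized)))  # number of distinct colors after quantization
```

For a real image, the pixel list would come from an image library and the palette from running k-means on those pixels; both are left out here to keep the example dependency-free.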

Cluster analysis

In cluster analysis, the k-means algorithm can be used to partition the input data set into k partitions (clusters).

However, the pure k-means algorithm is not very flexible, and as such is of limited use (except when vector quantization, as above, is actually the desired use case). In particular, the parameter k is known to be hard to choose when it is not given by external constraints. Another limitation is that the algorithm cannot be used with arbitrary distance functions or on non-numerical data. For these use cases, many other algorithms are superior.

Feature learning

k-means clustering has been used as a feature learning (or dictionary learning) step, in either (semi-)supervised or unsupervised learning. The basic approach is first to train a k-means clustering representation using the input training data (which need not be labelled). Then, to project any input datum into the new feature space, an "encoding" function is applied to it. Choices include the thresholded matrix product of the datum with the centroid locations, the distance from the datum to each centroid, an indicator function for the nearest centroid, or some smooth transformation of the distance. Alternatively, transforming the sample-cluster distance through a Gaussian RBF yields the hidden layer of a radial basis function network.
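Three of the encodings mentioned above can be sketched in a few lines of pure Python: the raw distances, the nearest-centroid indicator, and a Gaussian RBF of the distances. The centroids are hard-coded as a stand-in for a trained k-means model, and the function names and the width parameter `gamma` are assumptions made for this example. The thresholded matrix-product variant is not shown.

```python
import math


def centroid_distances(x, centroids):
    """Euclidean distance from datum `x` to every centroid."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))) for c in centroids]


def one_hot_encoding(x, centroids):
    """Indicator vector: 1 for the nearest centroid, 0 elsewhere."""
    d = centroid_distances(x, centroids)
    nearest = d.index(min(d))
    return [1 if j == nearest else 0 for j in range(len(centroids))]


def rbf_encoding(x, centroids, gamma=0.5):
    """Gaussian RBF of each distance, exp(-gamma * d**2).

    `gamma` is an assumed width parameter, not a value taken from the article.
    """
    return [math.exp(-gamma * d ** 2) for d in centroid_distances(x, centroids)]


# Hypothetical centroids: stand in for the output of a trained k-means model.
centroids = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
x = (3.0, 0.0)

print(centroid_distances(x, centroids))  # one feature per centroid
print(one_hot_encoding(x, centroids))    # exactly one active feature
print(rbf_encoding(x, centroids))        # smooth features, largest for the nearest centroid
```

Each encoding turns a datum into a vector of length k, which can then be fed to a linear classifier. The one-hot version makes the "one feature per data point" limitation visible, while the distance and RBF versions keep information about all centroids.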

This use of k-means has been successfully combined with simple, linear classifiers for semi-supervised learning in NLP (specifically for named entity recognition) and in computer vision. On an object recognition task, it was found to perform comparably with more sophisticated feature learning approaches such as autoencoders and restricted Boltzmann machines. However, it generally requires more data for equivalent performance, because each data point contributes to only one "feature".




Software implementations

The following implementations are available under Free/Open Source Software licenses, with publicly available source code.

-     Accord.NET contains C# implementations for k-means, k-means++ and k-modes.

-     ALGLIB contains parallelized C++ and C# implementations for k-means and k-means++.

-     AOSP contains a Java implementation for k-means.

-     CrimeStat implements two spatial k-means algorithms, one of which allows the user to define the starting locations.

-     ELKI contains k-means (with Lloyd and MacQueen iteration, along with different initializations such as k-means++ initialization) and various more advanced clustering algorithms.

-     Julia contains a k-means implementation in the JuliaStats Clustering package.

-     KNIME contains nodes for k-means and k-medoids.

-     Mahout contains a MapReduce-based k-means.

-     mlpack contains a C++ implementation of k-means.

-     Octave contains k-means.

-     OpenCV contains a k-means implementation.

-     Orange includes a component for k-means clustering with automatic selection of k and cluster silhouette scoring.

-     PSPP contains k-means; the QUICK CLUSTER command performs k-means clustering on the dataset.

-     R contains three k-means variations.

-     SciPy and scikit-learn contain multiple k-means implementations.

-     Spark MLlib implements a distributed k-means algorithm.

-     Torch contains an unsup package that provides k-means clustering.

-     Weka contains k-means and x-means.


Happy Pythoning...!!!

