Python Scipy Sub-Packages

                                                          SCIPY CLUSTER

Clustering is the procedure of dividing a dataset into groups of similar data points, much as items are arranged by category in a shopping mall. Data points in the same group should be as similar as possible and should differ from the points in other groups. There are two types of clustering, which are:-

1. Central

2. Hierarchical

K-means clustering is a method for finding clusters and cluster centers in a set of unlabelled data. Intuitively, we might think of a cluster as a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster. Given an initial set of K centers, the K-means algorithm iterates the following two steps:-
1. For each center, the subset of training points (its cluster) that is closer to it than to any other center is identified.
2. The mean of each feature for the data points in each cluster is computed, and this mean vector becomes the new center for that cluster.

                                              K-MEANS ALGORITHM

The steps are as follows. Suppose we have input data points x1, x2, x3, ..., xn and a value K.

Step - 1:
Select K random points as cluster centers, called centroids. Suppose these are c1, c2, ..., ck; they can be written as follows:
                                                 C = c1, c2, ..., ck
C is the set of all centroids.

Step - 2:
Assign each input value xi to the nearest center by calculating the Euclidean (L2) distance between the point and each centroid.

Step - 3:
In this step, we get the new centroids by calculating the average of all the points assigned to each cluster.

Step - 4:
We repeat steps 2 and 3 until none of the cluster assignments change.

These two steps are iterated until the centers no longer move or the assignments no longer change. Then, a new point x can be assigned to the cluster of the closest prototype. The SciPy library provides a good implementation of the K-Means algorithm through the cluster package. 
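
The steps above can be sketched directly in NumPy. This is a minimal illustration only, not SciPy's implementation: the function name kmeans_sketch, its parameters, and the synthetic two-blob data are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain K-means on an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct input points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points;
        # a centroid with no assigned points keeps its previous position.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two synthetic blobs (made-up data for illustration).
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(points, k=2)
print(centroids)
```

The result depends on the random initial centroids, which is one reason library implementations usually repeat the procedure several times and keep the best run.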

                       K-MEANS IMPLEMENTATION IN SCIPY

We will now see how to implement K-Means in SciPy.

Import K-Means:-
We will see the implementation and usage of each imported function.

from scipy.cluster.vq import kmeans, vq, whiten

The whiten function normalizes a group of observations on a per-feature basis. Before running K-Means, it is beneficial to rescale each feature dimension of the observation set with whitening. Each feature is divided by its standard deviation across all observations to give it unit variance.

Whiten the data:-

We have to use the following code to whiten the data.

# whitening of data
data = whiten(data)
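
As a small check of what whitening does, the sketch below compares whiten with a manual division by the per-column standard deviation. The four-row data array is made up for illustration.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Made-up observations: 4 samples, 2 features on very different scales.
data = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
    [4.0, 800.0],
])

whitened = whiten(data)

# whiten divides each column by its standard deviation;
# it does not subtract the mean.
manual = data / data.std(axis=0)

print(np.allclose(whitened, manual))  # True
print(whitened.std(axis=0))           # each feature now has unit standard deviation
```

Because only the scale changes, features measured in large units no longer dominate the Euclidean distances that K-Means relies on.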


The K-Means algorithm iterates repeatedly, adjusting the centroids until sufficient progress cannot be made, i.e. the change in distortion since the last iteration is less than some threshold.

Consider the following example:-
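
A minimal sketch of such an example, assuming synthetic data because the original data set is not shown here. The blob locations, the random seed, and K = 3 are illustrative choices.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Made-up data: three 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(4.0, 4.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0.0, 4.0), scale=0.3, size=(50, 2)),
])
data = whiten(data)

# computing K-Means with K = 3
# kmeans returns the codebook (one centroid per row) and the mean distortion.
centroids, distortion = kmeans(data, 3)
```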


The above code performs K-Means on a set of observation vectors, forming K clusters. The K-Means algorithm adjusts the centroids until sufficient progress cannot be made, i.e. the change in distortion since the last iteration is less than some threshold. Here, we can observe the centroids of the clusters by printing the centroids variable using the code given below.
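
One way to inspect the centroids, sketched on made-up random data (the data, the seed, and K = 3 are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

# Made-up observations: 100 samples with 2 features.
rng = np.random.default_rng(1)
data = whiten(rng.normal(size=(100, 2)))

centroids, distortion = kmeans(data, 3)

# Each row of centroids is one cluster center in the whitened feature space.
print(centroids)
```

The printed values differ from run to run, because kmeans chooses its starting centroids at random.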


The vq function compares each observation vector in the 'M' by 'N' obs array with the centroids and assigns the observation to the closest cluster. It returns the cluster of each observation and the distortion. We can check the distortion as well.
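
A short sketch of vq on hand-picked values, chosen so the result is easy to verify by eye. The centroids and observations here are hypothetical, not taken from the example above.

```python
import numpy as np
from scipy.cluster.vq import vq

# Hypothetical centroids and observations.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
obs = np.array([[1.0, 0.0], [9.0, 10.0], [0.0, 2.0]])

# codes[i] is the index of the centroid closest to obs[i];
# dists[i] is the Euclidean distance from obs[i] to that centroid.
codes, dists = vq(obs, centroids)

print(codes)  # [0 1 0]
print(dists)  # [1. 1. 2.]
```

The first and third observations lie near the origin and are assigned to centroid 0, while the second is assigned to centroid 1.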

That covers the SciPy cluster subpackage.