*K*-means is one of the most popular unsupervised algorithms for cluster analysis.

It cannot determine the number of clusters (*k*) within the dataset on its own, therefore *k* has to be provided *prior to* the initialisation of the algorithm.

The basic idea of the *K*-means algorithm is that data points belonging to the same group lie near each other on the plane. Consequently, close points are likely to belong to the same cluster.

The *K*-means algorithm performs better than a hierarchical clustering (HC) algorithm in terms of speed: its execution takes less time, with a time complexity of O(n^{2}), lower than that of HC methods, which ranges between O(n^{2}\log n) and O(n^{3}).

On the other hand, HC provides better-quality results than *K*-means.

In general, the *K*-means algorithm is a good fit for large datasets, while HC is better suited to small datasets.

## The algorithm

Given a set of *n* points, where each point is a *d*-dimensional vector, the algorithm separates the *n* points into *k* sets (with k \leq n), forming clusters S = \{S_{1}, S_{2}, \ldots, S_{k}\} with centers \left( \mu_{1}, \ldots, \mu_{k} \right). The Within-Cluster Sum of Squares (WCSS) is defined as:

$$ WCSS = \min \sum_{i=1}^{k}\sum_{x_{j}\in S_{i}} \left\| x_{j}-\mu_{i}\right\|^{2} $$
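As an illustration, the WCSS can be computed directly from this definition. The sketch below is plain Python; the example points and centers are hypothetical, chosen only to make the arithmetic easy to check.

```python
def wcss(clusters, centers):
    """Within-Cluster Sum of Squares: for each cluster S_i with center mu_i,
    sum the squared Euclidean distances ||x_j - mu_i||^2 over its points."""
    total = 0.0
    for points, mu in zip(clusters, centers):
        for x in points:
            total += sum((xj - mj) ** 2 for xj, mj in zip(x, mu))
    return total

# Two hypothetical 2-D clusters and their centers
clusters = [[(1.0, 1.0), (1.0, 3.0)], [(8.0, 8.0), (10.0, 8.0)]]
centers = [(1.0, 2.0), (9.0, 8.0)]
print(wcss(clusters, centers))  # -> 4.0 (each point is at squared distance 1 from its center)
```

The *K*-means algorithm searches for the assignment of points to clusters that minimises this quantity.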

The algorithm is composed of four steps:

1. Choose *k* initial centroids (for example, *k* points picked at random from the dataset).
2. Assign each point to the cluster of its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.

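A minimal sketch of the four steps in plain Python, assuming random initialisation from the data; the sample points and the choice of k = 2 are illustrative assumptions, not part of the original post:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update until convergence."""
    rng = random.Random(seed)
    # Step 1: choose k initial centroids at random from the dataset.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data with two well-separated groups
data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (9.0, 9.0), (9.5, 9.0), (9.0, 9.5)]
centers, clusters = kmeans(data, k=2)
```

On well-separated data like this, the two recovered centroids land at the means of the two groups; in practice the result can depend on the random initialisation, which is why libraries typically run the algorithm several times and keep the best WCSS.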