K-Means Clustering Overview
A long time ago I attended a Coursera course on Machine Learning that presented an interesting application of clustering to image compression. In this post I will present the K-means clustering algorithm and demonstrate how it can be used to color-compress an image.
K-means clustering is an unsupervised learning algorithm that aims to identify clusters of similar objects in a data set. A clustering of objects into K clusters simply means partitioning your data set into K subsets. The algorithm assumes, like many others, that the objects in the data set can be directly mapped into a vector space equipped with a norm. The objective of the algorithm is to find a clustering such that objects belonging to the same cluster are close together and objects belonging to different clusters are distant.
Mathematically, given a partition of the data set $S$ into K disjoint subsets $S_1, S_2, \cdots, S_K$, the total intra-cluster distance is given by
$$
\sum_{k=1}^K\sum_{i,j \in S_k} d(x_i,x_j)
$$
where $d(x_i,x_j) := \|x_i-x_j\|^2$ is the squared Euclidean distance between the two points. The objective of K-means clustering is to find the choice of clusters that minimizes the total intra-cluster distance. Finding an exact solution to this optimization problem becomes computationally infeasible for large numbers of samples, so K-means clustering proceeds by making an initial guess and then iteratively improving it. The details of the algorithm are as follows:
Initialization: Randomly select $K$ points $\mu_1, \cdots, \mu_K$ from the data set of $n$ points and make them the cluster centers.
Repeat:
1. Cluster Assignment Step: For each point $x$ in the data set, assign it to the cluster $k \in \{1,\cdots,K\}$ that minimizes the distance $d(x, \mu_k)$.
2. Update Step: Recompute each cluster center by setting it to the mean of all points assigned to that cluster: $\mu'_k := \frac{1}{|S_k|}\sum_{i \in S_k}x_i$.
Termination: The algorithm terminates when the centers converge, that is, when the distance between the old and new centers falls below some specified threshold: $d(\mu'_k, \mu_k) < \epsilon$ for all $k$.
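To make these steps concrete, here is a minimal sketch of the algorithm operating on data points stored as plain double arrays. The class and method names are illustrative only, and the naive random initialization and empty-cluster handling are simplifying assumptions, not the only reasonable choices:

```java
import java.util.Random;

public class SimpleKMeans {

    // Runs K-means on points of dimension d and returns the K cluster centers.
    public static double[][] cluster(double[][] points, int k, double epsilon, long seed) {
        Random random = new Random(seed);
        int d = points[0].length;

        // Initialization: pick K random points from the data set as the initial centers
        // (this naive choice may pick the same point twice).
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = points[random.nextInt(points.length)].clone();
        }

        int[] assignment = new int[points.length];
        boolean converged = false;
        while (!converged) {
            // Cluster assignment step: each point goes to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = squaredDistance(points[i], centers[0]);
                for (int j = 1; j < k; j++) {
                    double dist = squaredDistance(points[i], centers[j]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = j;
                    }
                }
                assignment[i] = best;
            }

            // Update step: each center becomes the mean of its assigned points.
            double[][] newCenters = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int c = 0; c < d; c++) {
                    newCenters[assignment[i]][c] += points[i][c];
                }
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] == 0) {
                    newCenters[j] = centers[j].clone(); // keep empty clusters where they were
                } else {
                    for (int c = 0; c < d; c++) {
                        newCenters[j][c] /= counts[j];
                    }
                }
            }

            // Termination: stop once the squared distance moved by every center is below epsilon.
            converged = true;
            for (int j = 0; j < k; j++) {
                if (squaredDistance(newCenters[j], centers[j]) >= epsilon) {
                    converged = false;
                    break;
                }
            }
            centers = newCenters;
        }
        return centers;
    }

    // Squared Euclidean distance, matching the definition of d(x_i, x_j) above.
    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int c = 0; c < a.length; c++) {
            double diff = a[c] - b[c];
            sum += diff * diff;
        }
        return sum;
    }
}
```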
Implementing K-Means Clustering
It is possible to find several implementations of K-means clustering in Java, for example in open source libraries such as Apache Commons Math, Weka, and Mahout. Invariably, when trying to use these implementations you will find that you have to make your data conform to each library's API. Here, I provide a simple implementation that operates on data in the form of double arrays. The source code is available on GitHub.
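Using an interface of this kind looks roughly as follows. This is not the actual API of the GitHub code, just a sketch built on the SimpleKMeans class above; the point is that the input is nothing more than a double[][] array of observations:

```java
// Each row is one observation; here: 2-D points in two obvious groups.
double[][] data = {
    {1.0, 1.1}, {0.9, 1.0}, {1.2, 0.8},   // points near (1, 1)
    {8.0, 8.2}, {7.9, 8.1}, {8.3, 7.7}    // points near (8, 8)
};

// Two clusters, convergence threshold 1e-6, fixed seed for reproducibility.
double[][] centers = SimpleKMeans.cluster(data, 2, 1e-6, 42L);

for (double[] center : centers) {
    System.out.printf("center: (%.2f, %.2f)%n", center[0], center[1]);
}
```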
Application to Image Compression
K-means clustering can be used in a few creative ways to compress image content. In this particular case compression is achieved by using only a subset of all available colors. The algorithm proceeds as follows:
- Reshape the image into a list of width*height pixels. Each pixel is a 3-dimensional vector corresponding to its RGB value.
- Run K-means clustering with 16 clusters. The resulting centers correspond to the RGB values of the 16 selected colors.
- Create a look-up table with the resulting colors.
- Create an image where each pixel is represented by its index into the look-up table of step 3. This means that each pixel can be represented with only 4 bits instead of the 24 bits required for the original RGB value (a code sketch of the full pipeline follows this list).
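The following sketch wires these steps together using java.awt.image.BufferedImage for pixel access and the SimpleKMeans class from above. The file names, the threshold, and the seed are placeholder assumptions, not values from the original code:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class ColorQuantizer {

    public static void main(String[] args) throws Exception {
        BufferedImage image = ImageIO.read(new File("input.png")); // hypothetical input file
        int width = image.getWidth();
        int height = image.getHeight();

        // Step 1: reshape the image into a list of width*height RGB vectors.
        double[][] pixels = new double[width * height][3];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int i = y * width + x;
                pixels[i][0] = (rgb >> 16) & 0xFF; // red
                pixels[i][1] = (rgb >> 8) & 0xFF;  // green
                pixels[i][2] = rgb & 0xFF;         // blue
            }
        }

        // Step 2: run K-means with 16 clusters; the centers form the color look-up table.
        double[][] palette = SimpleKMeans.cluster(pixels, 16, 1e-3, 42L);

        // Steps 3 and 4: map every pixel to its nearest palette color.
        BufferedImage compressed = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                double[] p = pixels[y * width + x];
                int best = nearest(p, palette);
                int r = (int) Math.round(palette[best][0]);
                int g = (int) Math.round(palette[best][1]);
                int b = (int) Math.round(palette[best][2]);
                compressed.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        ImageIO.write(compressed, "png", new File("compressed.png"));
    }

    // Index of the palette color closest (in squared distance) to pixel p.
    private static int nearest(double[] p, double[][] palette) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < palette.length; j++) {
            double dist = 0;
            for (int c = 0; c < 3; c++) {
                double diff = p[c] - palette[j][c];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = j;
            }
        }
        return best;
    }
}
```

Note that writing a plain PNG as above stores the remapped pixels at full 24-bit depth; the space saving described in step 4 comes from storing only the 4-bit indices plus the 16-entry color table, for example via an indexed color model.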
*Original picture (983 KB)*

*Compressed image with 16 colors (75 KB)*