Clustering Algorithm Demonstration

In this project, we use the D3 visualization library to help visualize the K-means clustering algorithm, which is commonly used in the fields of data mining and signal processing.

Before going into the details of this algorithm, we’ll briefly show you what it does. When the animation is refreshed, centroids are randomly chosen, and each point connects to its closest centroid with a light blue line. After a few seconds, the centroids are recalculated, and each point is reconnected to its new closest centroid. This process repeats until there is no variation between the new centroids and the old centroids, i.e., until the algorithm reaches convergence.

The main point of this demonstration is that the original centroids are chosen at random, and this may significantly affect the number of steps it takes to reach convergence. Every time the animation is refreshed, the starting centroids are different, so the number of reassignment rounds needed until convergence varies depending on where the centroids start.

Algorithm:

First, let us explain visually what the K-means clustering algorithm looks like (a code sketch follows the steps below).

  1. Given a set of points $x_1, \ldots, x_n$ and an integer value $k$, representing the number of clusters we are trying to make.
  2. We first pick $k$ points out of $x_1, \ldots, x_n$ to use as our first centroids.
  3. Then we assign all the points to the $k$ clusters by finding the nearest centroid for each point.
  4. We find the new centroids using the clusters we now have (a centroid is just the mean of the points in its cluster).
  5. We check whether any of the points need to be reassigned to a different cluster based on the new centroids.
  6. If no reassignment happens, we are done. Otherwise, we repeat steps 4 and 5 until no reassignment happens, i.e., convergence.
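
To make these steps concrete, here is a minimal sketch of the algorithm in TypeScript. This is not the code behind the D3 animation; the names (kMeans, assign, recomputeCentroids) and the use of squared Euclidean distance are our own choices for illustration, and the sketch does not handle the corner case of a cluster becoming empty.

```typescript
type Point = [number, number];

// Squared Euclidean distance between two points.
const dist2 = (a: Point, b: Point): number =>
  (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2;

// Steps 3 and 5: assign each point to the index of its nearest centroid.
function assign(points: Point[], centroids: Point[]): number[] {
  return points.map(p => {
    let best = 0;
    for (let i = 1; i < centroids.length; i++) {
      if (dist2(p, centroids[i]) < dist2(p, centroids[best])) best = i;
    }
    return best;
  });
}

// Step 4: each new centroid is the mean of the points assigned to it.
function recomputeCentroids(points: Point[], labels: number[], k: number): Point[] {
  const sums = Array.from({ length: k }, () => [0, 0]);
  const counts = new Array(k).fill(0);
  points.forEach((p, i) => {
    sums[labels[i]][0] += p[0];
    sums[labels[i]][1] += p[1];
    counts[labels[i]]++;
  });
  return sums.map((s, i) => [s[0] / counts[i], s[1] / counts[i]] as Point);
}

// Steps 1-6: iterate until no point changes cluster, i.e. convergence.
function kMeans(points: Point[], k: number): { centroids: Point[]; labels: number[] } {
  // Step 2: use the first k points as initial centroids (the demo picks them randomly).
  let centroids = points.slice(0, k);
  let labels = assign(points, centroids);
  while (true) {
    centroids = recomputeCentroids(points, labels, k);      // step 4
    const newLabels = assign(points, centroids);            // step 5
    if (newLabels.every((l, i) => l === labels[i])) break;  // step 6: no reassignment
    labels = newLabels;
  }
  return { centroids, labels };
}
```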

Example:

Place your mouse over a particular step to see the corresponding visualization on the right.

Step 1: Suppose we have the following points:

  • {(1,2), (3,6), (4,2), (11,5), (9,9), (2,10), (12,1), (7,15), (20,20), (18,5)} and a k value of 2. This will give us this plot.

Step 2: Since we want to find 2 clusters, we randomly pick (3,6) as C1 and (7,15) as C2 to be our initial centroids.

  • The centroids are shown as the red squares.

Step 3: Now we assign each point to the proper cluster based on its nearest centroid.

  • The distance between (1,2) and C1 is $\sqrt{(6-2)^2+(3-1)^2} = \sqrt{20}$, while the distance between (1,2) and C2 is $\sqrt{(15-2)^2+(7-1)^2} = \sqrt{205}$. Since (1,2) is closer to C1, we will assign it to C1’s cluster.
  • After doing this for all the points, we end up with these two clusters (displayed by the background shading).

Step 4: We now need to recalculate the centroids after our reassignments.

  • This is done through the formula $centroid_{cluster} = \left(\dfrac{\sum{x}}{n}, \dfrac{\sum{y}}{n}\right)$
  • We will get these two new centroids (5,4) and (13,12)

Step 5: We now need to check whether the algorithm has converged.

  • This is done by reassigning all the points to our two new clusters and seeing whether any point has to be moved from one cluster to the other.
  • In this case, no point needs to be moved, so the algorithm has CONVERGED. (A code version of this walkthrough follows below.)
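
For reference, the walkthrough above can be reproduced with the sketch from the Algorithm section (reusing its Point, assign, and recomputeCentroids helpers, and seeding the same initial centroids as in Step 2); the variable names here are our own.

```typescript
const data: Point[] = [
  [1, 2], [3, 6], [4, 2], [11, 5], [9, 9],
  [2, 10], [12, 1], [7, 15], [20, 20], [18, 5],
];

// Seed with the same initial centroids as the worked example: C1 = (3,6), C2 = (7,15).
let centroids: Point[] = [[3, 6], [7, 15]];
let labels = assign(data, centroids);            // Step 3: assign points to C1/C2
centroids = recomputeCentroids(data, labels, 2); // Step 4: recompute the centroids
labels = assign(data, centroids);                // Step 5: check for reassignments
console.log(centroids, labels);
```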

Real Life Application:

This can be useful in the following fields:

  1. Marketing
    • Goal: Subdivide a market into distinct subsets of customers, such that each subset is conceivably a submarket which can be reached with a customized marketing mix.

    • Approach: Collect different attributes of customers based on their geographical and lifestyle-related information. Find clusters of similar customers. Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.

  2. Document clustering
    • Goal: Find groups of documents that are similar to each other based on the important terms appearing in them. Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

    • Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

  3. Biology: Taxonomy of living things: kingdom, phylum, class, order, family, genus and species
  4. Social networks: Recognize communities of people within a network

Why does it converge?

In order to prove convergence, we can specify a clustering-quality objective function and show that the algorithm converges to a (local) minimum of that function. The objective function K-means optimizes is the sum of squared distances from each data point to its assigned centroid.

Formally, the K-means objective is:
$L(z, \mu; D) = \sum_n \parallel x_n - \mu_{z_n} \parallel^2 = \sum_k \sum_{n: z_n = k} \parallel x_n - \mu_k \parallel^2$, where $z$ are the assignments, $\mu$ are the centroids, and $D$ is the dataset.
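
As a small aside, this objective is easy to compute directly; the helper below (our own addition, reusing the Point and dist2 definitions from the sketch in the Algorithm section) returns $L$ for a given assignment and set of centroids. Logging it after each assignment step and each centroid update of the main loop will never show an increase, which is exactly what the argument below establishes.

```typescript
// K-means objective: sum of squared distances from each point to its assigned centroid.
function objective(points: Point[], labels: number[], centroids: Point[]): number {
  return points.reduce((sum, p, i) => sum + dist2(p, centroids[labels[i]]), 0);
}
```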

There are only two points at which the K-means algorithm changes the values of $\mu$ or $z$:
1. When we assign data points to centroids.
2. When we recalculate centroids.

We will show that neither of these operations can ever increase the value of $L$.
After the first iteration, there are only finitely many possible assignments to $z$ and $\mu$, because $z$ is discrete and because $\mu$ can only take on a finite number of values: the means of some subsets of the data. Moreover, $L$ is lower-bounded by zero (a sum of squares cannot be negative). Together, this means that $L$ cannot decrease more than a finite number of times. Thus, it must stop decreasing at some point, and there the algorithm has converged.

For the first operation, consider a data point $n$, and suppose that the previous value of $z_n$ is $a$ and the new value is $b$. It must be the case that $\parallel x_n - \mu_b \parallel \leq \parallel x_n - \mu_a \parallel$, since otherwise the point would not have been reassigned. Thus, changing $z_n$ from $a$ to $b$ can only decrease $L$.
For the second operation, consider the second form of $L$. This operation computes $\mu_k$ as the mean of the data points for which $z_n = k$, which is precisely the point that minimizes the sum of squared distances to those data points. Thus, this update to $\mu_k$ can only decrease $L$.

How sensitive is it to initialization?

There are several aspects of K-means that are unfortunate. One of them is that it converges only to a local optimum of $L$, which depends on the initialization. So how should we initialize the centroids? One way is to pick some data points from the data set at random as centroids. This does not work well, because it is quite likely that several of the picked points will end up in the same cluster.

A better approach is called furthest-first, which initializes the first centroid randomly, but then initializes each subsequent centroid to be the data point that is furthest away from the centroids chosen so far. With this approach, we can ensure that the centroids are well spread out.

An even better approach is called K-means++. It works like the furthest-first approach, but instead of choosing the next centroid to be the point that is furthest away from all previously chosen centroids, it selects a point with probability proportional to the square of its distance to the nearest preceding centroid. By doing so, we add some randomness to the algorithm and make it more likely to get a centroid near the center of each true cluster. A sketch of this initialization is given below.
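
Here is a minimal sketch of K-means++ initialization, assuming the same Point and dist2 helpers as in the earlier sketch; the function names are our own. A comment notes how furthest-first differs.

```typescript
// Squared distance from a point to its nearest already-chosen centroid.
const nearestDist2 = (p: Point, centroids: Point[]): number =>
  Math.min(...centroids.map(c => dist2(p, c)));

// K-means++ initialization: pick the first centroid at random, then pick each
// next centroid with probability proportional to its squared distance to the
// nearest centroid chosen so far. (Furthest-first would instead deterministically
// pick the point with the maximum such distance.)
function kMeansPlusPlusInit(points: Point[], k: number): Point[] {
  const centroids: Point[] = [points[Math.floor(Math.random() * points.length)]];
  while (centroids.length < k) {
    const d2 = points.map(p => nearestDist2(p, centroids));
    const total = d2.reduce((a, b) => a + b, 0);
    // Sample an index with probability d2[i] / total.
    let r = Math.random() * total;
    let idx = 0;
    while (idx < points.length - 1 && r > d2[idx]) { r -= d2[idx]; idx++; }
    centroids.push(points[idx]);
  }
  return centroids;
}
```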

In practice, you should run it a couple of times with different initializations and pick the run with the minimal resulting $L$.

What should K be (Hyperparameter tuning)?

If someone just gives you a dataset and asks you to cluster it, how many clusters should you produce? As you might have realized, if we keep increasing K, the objective function $L$ keeps decreasing (until K exceeds N), so simply using $L$ as a notion of goodness is insufficient (much like overfitting in supervised learning). We can “regularize” K so that the model can’t grow to be too complicated (imagine practicing too much on the practice exams but failing to generalize the concepts).

The two most popular procedures are the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC), defined below in the context of K-means:
- BIC: $\arg\min_K L_K + K \log D$
- AIC: $\arg\min_K L_K + 2KD$

The idea behind these criteria is that increasing K will make $L$ go down; however, if it doesn’t go down “by enough,” the extra clusters are not worth adding. A sketch of this selection loop is shown below.
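
A minimal sketch of such a selection loop, reusing the kMeans and objective helpers defined earlier; reading $D$ as the dimensionality of the points (2 in this demo) is our assumption, and in practice each call to kMeans should also be repeated with several initializations as suggested above.

```typescript
// Choose K by minimizing the BIC score L_K + K * log(D) over K = 1..maxK.
function chooseK(points: Point[], maxK: number, D = 2): number {
  let bestK = 1;
  let bestScore = Infinity;
  for (let K = 1; K <= maxK; K++) {
    const { centroids, labels } = kMeans(points, K);
    const score = objective(points, labels, centroids) + K * Math.log(D); // BIC
    // AIC would instead use: objective(points, labels, centroids) + 2 * K * D
    if (score < bestScore) { bestScore = score; bestK = K; }
  }
  return bestK;
}
```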

Interactive Panel

Add your own points and run the K-means algorithm on them.
This will find two clusters among the entered points.
Please limit the range for both x and y to between 0 and 30, inclusive.
Make sure you enter at least two points.
Notice: Clearing should be done prior to clicking on “Find K-Means”; to redo the interaction, please reload the page!

The points that you’ve added are: []

Algorithm Analysis

  • Time Complexity:
    • O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
  • Limitations:
    • Need to specify k, the number of clusters
    • Sensitive to initial centroids, noisy data and outliers