In this project, we aim to visualize the K-means clustering algorithm, which is commonly used in data mining and signal processing.
Before going into the details of the algorithm, we’ll briefly show what it does. When the animation is refreshed, centroids are chosen at random, and each point connects to its closest centroid with a light blue line. After a few seconds, the centroids are recalculated, and each point is reconnected to its new closest centroid. This process repeats until the new centroids no longer differ from the old ones, i.e., the algorithm has converged.
The main point of this demonstration is that the initial centroids are randomized, and this can significantly affect the number of steps it takes to reach convergence. Every time the animation is refreshed, the starting centroids are different, so the number of reassignment rounds needed until convergence varies with their initial locations.
First of all, let us walk through visually what the K-means clustering algorithm looks like.
Step 1: Suppose we have the following points:
Step 2: Suppose we want to find 2 clusters, so we randomly pick (3, 6) and (7, 15) as our centroids, calling them C1 and C2.
Step 3: Now we assign each point to the proper cluster based on its closest centroid.
Step 4: We now recalculate the centroids as the means of their newly assigned clusters.
Step 5: We now check for convergence: if the recalculated centroids match the previous ones, we stop; otherwise we repeat from Step 3.
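The five steps above can be sketched in Python. The two starting centroids C1 = (3, 6) and C2 = (7, 15) come from the example; the full point set is not reproduced here, so a small made-up set is used:

```python
# Minimal K-means sketch following Steps 1-5. The two starting centroids
# come from the example; the extra points are made up for illustration.

def kmeans(points, centroids):
    while True:
        # Step 3: assign each point to the cluster of its closest centroid
        clusters = [[] for _ in centroids]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        # Step 4: recalculate each centroid as the mean of its cluster
        # (assumes no cluster ever empties out, true for this tiny example)
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
        # Step 5: converged when the centroids no longer move
        if new == centroids:
            return centroids, clusters
        centroids = new

points = [(3, 6), (4, 5), (7, 15), (8, 14), (6, 13)]   # made-up data
centroids, clusters = kmeans(points, [(3.0, 6.0), (7.0, 15.0)])
```

On this data the algorithm settles after a single recalculation, splitting the points into a low cluster and a high cluster.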
Example 1: Market segmentation
Goal: Subdivide a market into distinct subsets of customers, such that each subset is conceivably a submarket which can be reached with a customized marketing mix.
Approach: Collect different attributes of customers based on their geographical and lifestyle-related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Example 2: Document clustering
Goal: Find groups of documents that are similar to each other based on the important terms appearing in them. Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
In order to prove the convergence, we can specify a clustering quality objective function and show that the algorithm converges to a (local) minimum of that function. The objective function K-means is optimizing is the sum of squared distances from any data point to its assigned centroid.
Formally, the K-means objective is:

$$\mathcal{L}(z, \mu; D) = \sum_{n} \left\| x_n - \mu_{z_n} \right\|^2$$

where $z$ are the cluster assignments, $\mu$ are the centroids, and $D$ is the dataset.
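As a concrete check, the objective can be computed directly. The points, assignments, and centroids below are made up for illustration:

```python
# The K-means objective: sum of squared distances from every data point
# to its assigned centroid. All numbers here are illustrative.

def objective(points, assignments, centroids):
    return sum((x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2
               for (x, y), k in zip(points, assignments))

# Two points: the first sits 1 unit from its centroid, the second 0 units,
# so the objective is 1^2 + 0^2 = 1.
L = objective([(0, 0), (4, 0)], [0, 1], [(1, 0), (4, 0)])
```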
There are only two points at which the K-means algorithm changes the values of $z$ or $\mu$:
1. When we assign data points to centroids.
2. When we recalculate centroids.
We will show that neither of these operations can ever increase the value of $\mathcal{L}$.
After the first iteration, there are only finitely many possible assignments to $z$ and $\mu$, because $z$ is discrete and because $\mu$ can only take on a finite number of values: means of some subset of the data. Moreover, $\mathcal{L}$ is lower-bounded by zero (a sum of squares cannot be negative). Together, this means that $\mathcal{L}$ cannot decrease more than a finite number of times. Thus, it must stop decreasing at some point, and when it does, the algorithm has converged.
For the first operation, when looking at a data point $x_n$, suppose that the previous value of $z_n$ is $a$ and the new value is $b$. Since a point is only ever reassigned to a closer centroid, it must be the case that $\|x_n - \mu_b\|^2 \le \|x_n - \mu_a\|^2$. Thus, changing $z_n$ from $a$ to $b$ can only decrease $\mathcal{L}$.
For the second operation, consider how $\mu_k$ is updated: it is recomputed as the mean of the data points for which $z_n = k$, and the mean is precisely the point that minimizes the sum of squared distances to those points. Thus, this update to $\mu_k$ can only decrease $\mathcal{L}$.
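The argument can be checked empirically. The sketch below (synthetic data, names my own) runs K-means and records the objective after every assignment step; by the two arguments above, the recorded sequence should never increase:

```python
import random

# Empirical check of the convergence argument: run K-means on synthetic
# data and record the objective after every assignment step. The trace
# should be non-increasing.

def run_and_trace(points, centroids):
    trace = []
    while True:
        # assignment step: index of the closest centroid for each point
        z = [min(range(len(centroids)),
                 key=lambda k: (p[0] - centroids[k][0]) ** 2
                             + (p[1] - centroids[k][1]) ** 2)
             for p in points]
        trace.append(sum((p[0] - centroids[z[i]][0]) ** 2
                         + (p[1] - centroids[z[i]][1]) ** 2
                         for i, p in enumerate(points)))
        # recentering step: each centroid becomes the mean of its cluster
        new = []
        for k in range(len(centroids)):
            c = [p for i, p in enumerate(points) if z[i] == k]
            new.append((sum(x for x, _ in c) / len(c),
                        sum(y for _, y in c) / len(c)) if c else centroids[k])
        if new == centroids:
            return trace
        centroids = new

rng = random.Random(0)
pts = [(rng.uniform(0, 30), rng.uniform(0, 30)) for _ in range(40)]
trace = run_and_trace(pts, list(pts[:3]))
```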
There are several aspects of K-means that are unfortunate. One is that it converges only to a local optimum of $\mathcal{L}$, which depends on the initialization. So how should we initialize the centroids? One way is to pick some data points from the dataset at random as centroids. This does not work well, because it is very likely that some of the picked points will end up in the same cluster.
A better approach, called furthest-first, initializes the first centroid randomly, but then initializes each subsequent centroid to be the data point furthest away from the centroids chosen so far. With this approach, the centroids are guaranteed to be well spread out.
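A minimal sketch of furthest-first (the function name and the three-point dataset are my own illustration, not part of the demo):

```python
import random

# Furthest-first initialization: pick the first centroid at random, then
# repeatedly add the data point furthest from every centroid chosen so far.

def furthest_first(points, k, rng=None):
    rng = rng or random.Random()
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        def dist_to_nearest(p):
            return min((p[0] - cx) ** 2 + (p[1] - cy) ** 2
                       for cx, cy in centroids)
        centroids.append(max(points, key=dist_to_nearest))
    return centroids

# Whichever point is drawn first, (100, 0) ends up as a centroid, since
# it is furthest from both of the other points.
cents = furthest_first([(0, 0), (1, 0), (100, 0)], 2, random.Random(0))
```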
An even better approach is called K-means++. It works like furthest-first, but instead of deterministically choosing the point furthest from all previously chosen centroids, it selects each point with probability proportional to the square of its distance to the nearest preceding centroid. The added randomness makes it more likely to pick a point near the center of a true cluster rather than an outlier.
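A sketch of the K-means++ selection step (names and data my own). Note that an already-chosen centroid has squared distance 0 to itself and therefore can never be drawn again:

```python
import random

# K-means++ initialization sketch: the next centroid is sampled with
# probability proportional to the squared distance from each point to
# its nearest already-chosen centroid.

def kmeans_pp(points, k, rng=None):
    rng = rng or random.Random()
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids)
              for px, py in points]
        # weighted draw: far-away points are favoured; chosen points
        # (weight 0) are excluded automatically
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

cents = kmeans_pp([(0, 0), (1, 0), (5, 5), (6, 5)], 2, random.Random(1))
```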
In practice, you should run the algorithm a few times with different initializations and keep the run with the minimal resulting $\mathcal{L}$.
If someone just gives you a dataset and asks you to cluster it, how many clusters should you produce? As you might have realized, the objective keeps decreasing as we increase K (reaching zero once K = N), so simply using $\mathcal{L}$ as a notion of goodness is insufficient (much like overfitting in supervised learning). We can “regularize” K so that the model cannot grow too complicated (imagine practicing so much on practice exams that you fail to generalize the concepts).
The two most popular procedures are the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC), defined below in the context of K-means:
- BIC: $\arg\min_K \hat{\mathcal{L}}_K + K \log D$
- AIC: $\arg\min_K \hat{\mathcal{L}}_K + 2KD$

where $\hat{\mathcal{L}}_K$ is the objective achieved with $K$ clusters and $D$ is the dimensionality of the data.
The idea behind these criteria is that increasing K will make $\hat{\mathcal{L}}_K$ go down; however, if it doesn’t go down “by enough,” the penalty term tells us it’s not worth doing.
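One common instantiation for K-means penalizes the objective with $K \log D$ (BIC) and $2KD$ (AIC), where $D$ is the data dimensionality; the sketch below assumes those forms (exact constants vary by formulation), with made-up objective values:

```python
import math

# Model-selection sketch: score each K by the achieved objective plus a
# complexity penalty. Penalty forms assumed: K*log(D) for BIC and 2*K*D
# for AIC, where D is the dimensionality of the data.

def bic_score(L_K, K, D):
    return L_K + K * math.log(D)

def aic_score(L_K, K, D):
    return L_K + 2 * K * D

# Made-up objectives for K = 1..4 on 2-D data: the big drop from K=1 to
# K=2 is worth the penalty; the later small drops are not.
objectives = {1: 500.0, 2: 80.0, 3: 78.0, 4: 77.0}
best_K = min(objectives, key=lambda K: aic_score(objectives[K], K, 2))
```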
Add your own points and run the K-means algorithm on them.
This will find two clusters among the entered points
Please limit the range for both x and y between 0 and 30, inclusive
Make sure you enter at least two points
Notice: “Clear” should be used before clicking “Find K-Means”; to redo the interaction, please reload the page!
The points that you’ve added are: []