The K-Means Clustering Algorithm is one of the most straightforward clustering algorithms in data science. It aims at solving Vector Quantization problems. Given a set of sample points and the number of clusters k, the algorithm starts from k initial center points and repeats the following procedure:
1. Assignment: assign every sample point to the cluster whose center is nearest to it.
2. Update: recompute each cluster's center as the mean of the points assigned to it.
These two steps are repeated until the assignments no longer change (or an iteration limit is reached).
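To make the procedure concrete, below is a minimal, self-contained C++ sketch of the two steps. It is illustrative only: the struct and function names are assumptions for this sketch, not the framework's actual interface.

// Minimal K-Means sketch (illustrative only; names are not the framework's interface).
#include <cstdlib>
#include <vector>

struct Point { double x, y; int color; };

// Cluster 2-D points into k groups, repeating assignment + update until stable.
void run_kmeans(std::vector<Point>& pts, int k, int max_iters = 100) {
    std::vector<Point> centers(k);
    for (int c = 0; c < k; ++c)
        centers[c] = pts[rand() % pts.size()];          // start from k points of the data set

    for (int iter = 0; iter < max_iters; ++iter) {
        bool changed = false;

        // Assignment step: each point joins the cluster with the nearest center.
        for (auto& p : pts) {
            int best = 0;
            double best_d = 1e300;
            for (int c = 0; c < k; ++c) {
                double dx = p.x - centers[c].x, dy = p.y - centers[c].y;
                double d = dx * dx + dy * dy;           // squared distance is enough to compare
                if (d < best_d) { best_d = d; best = c; }
            }
            if (p.color != best) { p.color = best; changed = true; }
        }
        if (!changed) break;                            // converged: no point switched cluster

        // Update step: move each center to the mean of the points assigned to it.
        std::vector<double> sx(k, 0.0), sy(k, 0.0);
        std::vector<int> cnt(k, 0);
        for (const auto& p : pts) { sx[p.color] += p.x; sy[p.color] += p.y; ++cnt[p.color]; }
        for (int c = 0; c < k; ++c)
            if (cnt[c] > 0) { centers[c].x = sx[c] / cnt[c]; centers[c].y = sy[c] / cnt[c]; }
    }
}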
We will be dealing with K-Means on a 2-dimensional data set. For example, one such data set has p = 450,000 sample points and a desired number of clusters (colors) c = 15.
K-Means does not always give an elegant solution, but it is very straightforward and easy to understand.
The input to your algorithm will be a txt file of the following format:
450000 / 15
181.32907969, 251.07429196, 0
-189.28197631, -360.18581742, 0
333.06486750, 194.89272514, 0
175.15397040, 257.92706689, 0
-184.17403621, -420.23277421, 0
...
...
-139.93631124, -308.79563496, 0
173.08555561, -154.49741085, 0
188.38953238, 312.91741170, 0
247.30736986, 190.62523729, 0
46.01590445, 225.99338366, 0
Explanation of the input file:
The first line is p / c, separated by a slash, where p is the total number of sample points and c is the desired number of clusters.
Then follow p lines of the form x, y, 0, each describing a data point's coordinates (x, y) and its original color id 0 (the points have not been clustered yet, so we consider them all initially "black"). Both x and y are signed floating point numbers, each with 8 digits after the decimal point.
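As an illustration of how such a file can be read in C++ (the framework already handles I/O for you; the code below is only a sketch with assumed names, not the framework's actual reader):

// Illustrative reader for the input format above (assumed names; not the framework's code).
#include <cstdio>
#include <vector>

struct Point { double x, y; int color; };

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <input.txt>\n", argv[0]); return 1; }
    FILE* f = fopen(argv[1], "r");
    if (!f) { fprintf(stderr, "cannot open %s\n", argv[1]); return 1; }

    int p = 0, c = 0;
    fscanf(f, "%d / %d", &p, &c);                        // first line: "p / c"

    std::vector<Point> pts(p);
    for (int i = 0; i < p; ++i)                          // then p lines: "x, y, color"
        fscanf(f, "%lf, %lf, %d", &pts[i].x, &pts[i].y, &pts[i].color);
    fclose(f);

    printf("read %d points, %d clusters requested\n", p, c);
    return 0;
}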
We have implemented a simple sequential solution in C++. You can download the framework from here. Build, run, and clean it up with:
make
./kmeans <input.txt> <result.txt>
make clean
You should try your best to parallelize and optimize the given K-Means implementation with any techniques you can come up with. Unlike Homework 5, in Project 3 you may apply any shared-memory optimization techniques you wish. In kmeans.cpp and kmeans.h, you are only allowed to add / modify things in the indicated areas. (They have been clearly marked in the comments. Please read them carefully.) Feel free to modify the Makefile and add any flags that you need.
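As one concrete illustration of a shared-memory optimization (OpenMP is just an example here, not a requirement, and all names are placeholders rather than the framework's variables), the assignment step of K-Means parallelizes naturally because each point can be processed independently:

// Sketch: parallelizing the assignment step with OpenMP (compile with -fopenmp).
// All names (pts, centers, k) are placeholders, not the framework's actual variables.
#include <vector>

struct Point { double x, y; int color; };

void assign_step(std::vector<Point>& pts, const std::vector<Point>& centers, int k) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)pts.size(); ++i) {        // points are independent, so the loop splits safely
        int best = 0;
        double best_d = 1e300;
        for (int c = 0; c < k; ++c) {
            double dx = pts[i].x - centers[c].x, dy = pts[i].y - centers[c].y;
            double d = dx * dx + dy * dy;
            if (d < best_d) { best_d = d; best = c; }
        }
        pts[i].color = best;                             // each thread writes a distinct element
    }
}

The update step can be parallelized too, for example with per-thread partial sums that are combined afterwards, since naively accumulating into shared sums from multiple threads would race.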
The initial center points will be picked randomly among the data points, but the random seed is fixed to 5201314 (note the line of code srand(5201314); in kmeans.cpp). This ensures that, on the same execution environment and with the same input dataset, we always pick the same set of initial center points. This property is very important for our testing: we will compare your output with the reference output and check whether the clusters match. The same cluster may get a different color id in the reference solution and in your output, but as long as there is a one-to-one mapping, it is accepted.
Take two such clustering results: they may use different colors, but if every point actually goes to the correct cluster, the result is Accepted √
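For intuition only, a fixed-seed initialization of this kind might look like the sketch below; this is not the framework's exact code, and the authoritative version lives in kmeans.cpp.

// Illustration of fixed-seed initialization (NOT the framework's exact code; see kmeans.cpp).
#include <cstdlib>
#include <vector>

struct Point { double x, y; int color; };

std::vector<Point> pick_initial_centers(const std::vector<Point>& pts, int k) {
    srand(5201314);                                      // fixed seed -> same centers on the same environment
    std::vector<Point> centers(k);
    for (int c = 0; c < k; ++c)
        centers[c] = pts[rand() % pts.size()];           // centers are picked among the data points
    return centers;
}

Whatever you optimize, leave the initialization logic and the seed untouched so that your clusters remain comparable with the reference output.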
Your program will write the calculated clustering information into the specified output txt file. The content of the output file is similar to the input, except that the color information has been filled in. For example:
450000 / 15
181.32907969, 251.07429196, 11
-189.28197631, -360.18581742, 6
333.06486750, 194.89272514, 7
175.15397040, 257.92706689, 11
-184.17403621, -420.23277421, 6
...
...
-139.93631124, -308.79563496, 6
173.08555561, -154.49741085, 9
188.38953238, 312.91741170, 13
247.30736986, 190.62523729, 4
46.01590445, 225.99338366, 8
Each color id represents a unique color (i.e. the cluster that the data point finally belongs to).
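As a sketch of the output format only (the function below is an assumed helper for illustration; the framework writes the result file with its own code):

// Sketch of a writer for the output format (assumed helper; the framework has its own).
#include <cstdio>
#include <vector>

struct Point { double x, y; int color; };

void write_result(const char* path, const std::vector<Point>& pts, int c) {
    FILE* f = fopen(path, "w");
    if (!f) return;
    fprintf(f, "%zu / %d\n", pts.size(), c);                 // first line: "p / c"
    for (const auto& p : pts)
        fprintf(f, "%.8f, %.8f, %d\n", p.x, p.y, p.color);   // 8 digits after the decimal point
    fclose(f);
}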
Meanwhile, the console will report the total time elapsed for the algorithm execution. The timer only covers the K-Means algorithm itself; I/O is not included. So your optimization should focus exactly on the main body of the algorithm.
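If you want to reproduce a similar measurement in your own experiments, a minimal sketch with std::chrono is shown below; the framework's timing code may use a different mechanism, and run_kmeans is a hypothetical call standing in for the algorithm body.

// Sketch: timing only the algorithm body, excluding I/O (the framework may time differently).
#include <chrono>
#include <cstdio>

int main() {
    // ... read the input file here (not timed) ...

    auto t0 = std::chrono::steady_clock::now();
    // run_kmeans(pts, k);                               // hypothetical call: the part to optimize
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("k-means took %.3f ms\n", ms);

    // ... write the result file here (not timed) ...
    return 0;
}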
We provide two Python tools for you, generate.py and plot.py, in the framework. Make sure you have a working Python ≥ 3.5. Also, install these packages with pip3 if you do not have them yet:
scikit-learn
numpy
scipy
matplotlib
Use make gen FILE=<data-file.txt> to generate a data set. It will print the following prompts in the console and wait for you to give the parameters. Simply hit Enter if you want the default option.
Enter dataset parameters:
* # of samples (100,000 ~ 800,000) [default = 450,000]:
* # of theoretical clusters (1 ~ 20) [default = 15]:
* |Range| where centers can be (100. ~ 800.) [default = 450.]:
* Standard deviation of a cluster (10. ~ 100.) [default = 50.]:
It will generate a bunch of clusters with randomly picked centers and distribute the data points around them using a Gaussian distribution. |Range| is the limit on the absolute value of the x and y coordinates of every sample, and the standard deviation determines how dispersed the points will be.
Use make plot FILE=<result.txt> to plot the clustering result. It uses PyPlot to generate a plot in a pop-up window. With a large number of data points, the plotting may be VERY slow. Please be patient.
We will be using git for this project as well. Refer to the previous Git Usage Guide for basic git usage. Create your team's git repo for Project 3 on GitHub, initialize your framework directory as a git repository (which means it directly contains kmeans.cpp ...), and push your work onto your remote repository.
Submit your work to Autolab as an archive project3.tar, which is created through:
git clone --bare YOUR-TEAM-REPO-URL.git .git
tar czvf project3.tar .git
You can add your own code files, as long as they fit in the project and you have written the correct Makefile rules for integrating them.
Due to the poor performance of the Autolab server (a real "tiny water pipe" of a server, i.e. rather underpowered), we will only autograde your submission on data sets of 100,000 ~ 200,000 points and give you a score according to correctness. The execution time on Autolab will also be reported, but only as a reference.
Step 1: If you submit a correct algorithm to Autolab, and it makes meaningful use of multithreading, you get 60% of the total score. (A memory-leak check with valgrind is also included.)
Step 2: After the autograding deadline, we will run your algorithms on a much stronger and more stable server. We will run it on a huuuuuge data set (450,000 ~ 800,000 points) several times and take the average execution time t_avg. If you have earned the 60%, the remaining 40% will be given based on the performance of your algorithm on the strong server. The fastest 15 groups get the full 40%, groups ranked 16~35 get 30%, 36~55 get 20%, and the rest get 10%. NOTE: If your program crashes on this huge data set, you will get zero for this phase.
No hints for this project! @v@; I believe you can think of really great ideas for optimizing this classic algorithm. Just as a reference: if you have done the work properly, you should achieve at least a 3x speedup with 4 threads.
If you come up with a really incredible optimization approach that is accurate and highly parallelizable, please come to the TAs. We warmly welcome your sparkling ideas and may consider giving you a bonus.
Autolab Server (Just for your reference. Tune your performance to the Star Lab Server):