Project 4: Map Reduce
Introduction
Project 4 is, to some extent, an extension of Project 3. Students are required to implement KNN removal with map reduce in Python.
We assume students are already familiar with KNN removal (Project 3), Python (SI100) and Spark (discussion and lab); no extra information will be provided on the website.
Tasks:
- Implement KNN removal with map reduce in PySpark
- Run your code on the computer cluster (not provided yet)
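As a reminder of the map reduce pattern itself, here is a minimal sketch in plain Python. It does not implement KNN removal; it only shows a map step and an associative reduce step that compute a global mean and standard deviation. The function names `to_stats`, `merge_stats` and `mean_and_std` are made up for this illustration, and whether your KNN removal needs such global statistics depends on the Project 3 definition, which is not restated here.

```python
from functools import reduce
import math


def to_stats(x):
    # Map step: turn one value into a (count, sum, sum of squares) triple.
    return (1, x, x * x)


def merge_stats(a, b):
    # Reduce step: associative and commutative, so partial results
    # from different partitions can be merged in any order.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_std(values):
    # Plain-Python stand-in for a distributed map followed by a reduce.
    # Expects a non-empty iterable of numbers.
    n, s, sq = reduce(merge_stats, map(to_stats, values))
    mean = s / n
    var = max(sq / n - mean * mean, 0.0)  # clamp tiny negative rounding errors
    return mean, math.sqrt(var)
```

In PySpark, the same two functions could be passed to an RDD of floats as `rdd.map(to_stats).reduce(merge_stats)`; the reduce function must stay associative and commutative because partitions are merged in no fixed order.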
Dataset:
- Download from Project 3
Scripts
- utils/io.py, for bin2nparray and nparray2bin
- utils/knnRemoval.py, a Python reimplementation of the Project 3 reference (Updated June 7)
- utils/mapreduceKnnRemoval.py, your map reduce implementation
- demo.py, whose result you can treat as ground truth
- Image comparison and plot scripts will not be provided
There is one minor difference between the C++ and Python versions in how the mean is computed: the C++ version always divides by k, while the Python version takes the true mean. With the same setting, 30_1.5_15, on cropped.bin, the C++ result contains 73908 zeros while the Python result contains 73912 zeros.
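The two ways of computing the mean can be sketched as follows. This is an illustration only, not the code from either reference: the helper names are hypothetical, and it assumes the two results differ only when fewer than k neighbor distances are available for a point.

```python
def mean_divide_by_k(distances, k):
    # Behaviour described for the C++ version: always divide by k,
    # even if fewer than k distances are present.
    return sum(distances) / k


def mean_true(distances):
    # Behaviour described for the Python version: divide by the number
    # of distances actually present (the list must be non-empty).
    return sum(distances) / len(distances)


# Hypothetical point with only 3 neighbors found while k = 4.
distances = [1.0, 2.0, 3.0]
print(mean_divide_by_k(distances, 4))  # 1.5
print(mean_true(distances))            # 2.0
```

When exactly k distances are present, both functions return the same value, which is consistent with the two versions disagreeing on only a few points.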
The whole project can be found in proj4.zip. Please put your .bin files in ./data. (Updated June 7)
Submission:
Check into Autolab:
- mapreduceKnnRemoval.py: your fast implementation
Grading
We will use a small map to test your program. If your output is incompatible, you will receive 0 pts! We will show you the output of your program, so keep it short! This gives you a rough estimate of how fast you are compared to the other students. Keep in mind that Autolab is a shared resource, so those timings may vary a lot.
Your program will (hopefully) be run on a big cluster with many nodes. The speed of each program will be noted.
Programs at or below the slowest 33rd percentile will receive a score of 80%. Programs in the fastest 15 percent will receive a score of 100%. All other programs will receive a score that is linearly scaled between those values.
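One possible reading of this rule is sketched below. It assumes the score is linear in the speed percentile, counted from the slowest program (0) to the fastest (100), so that the fastest 15 percent starts at percentile 85. The function name `speed_score` is hypothetical, and the actual grading script may scale on runtime instead.

```python
def speed_score(percentile_from_slowest):
    """Return the score in percent for a speed percentile in [0, 100].

    Assumption: 0 means the slowest program and 100 the fastest.
    """
    p = percentile_from_slowest
    if not 0 <= p <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if p <= 33:
        return 80.0   # slowest 33rd percentile and below
    if p >= 85:
        return 100.0  # fastest 15 percent
    # Linear interpolation between (33, 80) and (85, 100).
    return 80.0 + 20.0 * (p - 33) / (85 - 33)
```

Under this reading, a program exactly halfway between the two thresholds (percentile 59) would receive 90%.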