Lab 11: Spark Exercise

Computer Architecture I, ShanghaiTech University

Before you start working, please make sure you have Python 3.6 and PySpark properly installed. It is highly recommended to demonstrate your results on your own notebook.
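If PySpark is not installed yet, installing it through pip is usually enough (the exact steps may vary with your environment):

pip3 install pyspark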

Exercise 1: Warm up with Word Count

In this exercise, we provide a sample to refresh your memory of the last discussion.

If you do not feel ready to continue with this lab, please find the PPT in Discussion10 and finish the PySpark tutorial from that discussion.

from pyspark import SparkContext

sc = SparkContext("local", "wordCount")
# textFile does not support URLs; please download http://shtech.org/course/ca/18s/ locally first
text_RDD = sc.textFile("filename")
counts_RDD = text_RDD.flatMap(lambda x: x.split(" ")) \
                     .map(lambda x: (x, 1)) \
                     .reduceByKey(lambda x, y: x + y)
print(counts_RDD.take(3))
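
As a quick illustration of what each transformation produces, consider a hypothetical file containing the two lines "to be" and "not to be":

# flatMap(split)    -> ["to", "be", "not", "to", "be"]
# map(x -> (x, 1))  -> [("to", 1), ("be", 1), ("not", 1), ("to", 1), ("be", 1)]
# reduceByKey(+)    -> [("to", 2), ("be", 2), ("not", 1)]
# take(3)           -> the first three of these pairs (the order is not guaranteed)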

Checkoff

  1. What is an RDD, in your understanding, and which of the variables used here are RDDs?
  2. Explain how counts_RDD is computed from text_RDD in the script above.

Exercise 2: Radix Sort

Radix sort is a non-comparative sorting algorithm that sorts an integer array by grouping and rearranging the numbers digit by digit, processing one digit position at a time.

One possible Python implementation can be found at wikipedia:Radix_sort; it simply consists of two operations at each digit position: putting the integers into buckets and extracting them from the buckets.

In this exercise, students are required to reimplement the radix sort algorithm with MapReduce in the TODO section of the file radix.py

You can run the code as provided, but the correctness comparison will fail until you complete the TODO -.-
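
To connect the bucket idea to MapReduce, here is a minimal sketch of a single bucketing pass in PySpark. It only illustrates the key-value view of one digit position; the names (sc, numbers, one_pass) are placeholders, it is not the structure expected in radix.py, and a full LSD radix sort additionally needs to keep the order within each bucket stable across passes.

from pyspark import SparkContext

sc = SparkContext("local", "radixPass")
numbers = sc.parallelize([170, 45, 75, 90, 802, 24, 2, 66])

def one_pass(rdd, exp):
    # "Put into buckets": key each integer by its digit at the current position.
    buckets = rdd.map(lambda n: (n // exp % 10, n)).groupByKey()
    # "Extract from buckets": read the buckets back in order 0..9.
    return buckets.sortByKey().flatMap(lambda kv: list(kv[1]))

# One pass over the least significant digit (exp = 1).
print(one_pass(numbers, 1).collect())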

Checkoff

  1. Introduce your TODO code. How does it work?
  2. Achieve the correct result while being much faster than the wiki code.

Exercise 3: Gradient Descent

In this exercise, given sample pairs $(x_i, y_i),\ i = 1, \dots, N$, we will implement a Gradient Descent optimizer on

    \min_{\theta} \; L(\theta) = \sum_{i=1}^{N} \big(y_i - f(x_i; \theta)\big)^2

with $f(x_i; \theta)$ the model prediction for sample $x_i$ (see gd.py for the concrete model).

In the above optimization problem, the loss function is a summation of squared errors, which makes it possible to train in parallel.

You can see that the gradient of the above loss function,

    \nabla_{\theta} L(\theta) = \sum_{i=1}^{N} \nabla_{\theta} \big(y_i - f(x_i; \theta)\big)^2,

is a summation of independent terms.

So by using MapReduce to compute the gradient in each iteration, the time cost can be much lower.
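
To make the parallel structure concrete, below is a minimal sketch of one way to run such an iteration with RDDs. It assumes a simple linear model y ≈ w*x + b and made-up sample pairs; the names (samples, w, b, lr) are placeholders, and this is not the actual structure of gd.py.

from pyspark import SparkContext

sc = SparkContext("local", "gdSketch")
# Hypothetical (x, y) sample pairs; in the lab these would come from GPA.txt.
samples = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)])

w, b, lr = 0.0, 0.0, 0.01
for _ in range(100):
    # Each sample contributes an independent gradient term, so the terms can be
    # computed with map and summed with reduce (the constant factor 2 from the
    # squared error is folded into the learning rate).
    grads = samples.map(lambda p: ((w * p[0] + b - p[1]) * p[0],  # d/dw term
                                   (w * p[0] + b - p[1])))        # d/db term
    gw, gb = grads.reduce(lambda a, c: (a[0] + c[0], a[1] + c[1]))
    w -= lr * gw
    b -= lr * gb

print(w, b)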

Code and data are available in gd.py and GPA.txt

You can run the code, but the current MapReduce version won't give you a good match :)

For a detailed introduction to gradient descent, please refer to Wikipedia.

Checkoff

  1. Introduce your TODO code. Why does it make sense?
  2. Achieve the correct result on the final plot while being much faster than my code (actually, you may not be faster than my implementation :) )