Before you start working, please make sure you have Python 3.6 and pySpark properly installed. It is highly recommended that you demonstrate your results on your notebook.
Exercise 1: Warm up with Word Count
In this exercise, we provide a sample to refresh your memory of the last discussion.
If you do not feel ready to continue with Lab 11, please find the PPT in Discussion10 and finish the pySpark tutorial from that discussion.
from pyspark import SparkContext

sc = SparkContext("local", "wordCount")
# textFile does not support URLs; please download http://shtech.org/course/ca/18s/ locally first
text_RDD = sc.textFile("filename")
counts_RDD = text_RDD.flatMap(lambda x: x.split(" ")) \
                     .map(lambda x: (x, 1)) \
                     .reduceByKey(lambda x, y: x + y)
print(counts_RDD.take(3))
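If it helps to see what each stage produces, the same pipeline can be broken into named steps as below; the intermediate variable names and the sample outputs in the comments are illustrative only, not taken from the course page.

words_RDD = text_RDD.flatMap(lambda x: x.split(" "))     # one element per word
pairs_RDD = words_RDD.map(lambda x: (x, 1))              # a (word, 1) pair for every word
counts_RDD = pairs_RDD.reduceByKey(lambda x, y: x + y)   # sum the 1s for each distinct word
print(words_RDD.take(3))    # e.g. ['some', 'page', 'text']
print(pairs_RDD.take(3))    # e.g. [('some', 1), ('page', 1), ('text', 1)]
print(counts_RDD.take(3))   # e.g. [('some', 4), ('page', 2), ('text', 7)]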
Checkoff
- What is an RDD, in your own words? Which variables used here are RDDs?
- Explain how counts_RDD is computed from text_RDD in the above script.
Exercise 2: Radix Sort
Radix Sort is a non-comparative sorting algorithm that sorts an integer array by grouping and rearranging the numbers on each digit position in turn, from the least significant to the most significant.
One possible implementation in Python can be found at wikipedia:Radix_sort; it simply consists of two operations at each digit position: put the integers into buckets and extract them from the buckets again.
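For reference, here is a minimal sketch of that bucket-based version (LSD, base 10, non-negative integers); it is only meant to recall the two per-digit operations and is not the wiki code verbatim.

def radix_sort(a):
    exp = 1
    while max(a) // exp > 0:
        # 1) put integers into buckets keyed by the current digit
        buckets = [[] for _ in range(10)]
        for n in a:
            buckets[(n // exp) % 10].append(n)
        # 2) extract integers from the buckets, in digit order
        a = [n for bucket in buckets for n in bucket]
        exp *= 10
    return a

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))  # [2, 24, 45, 66, 75, 90, 170, 802]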
In this exercise, you are required to reimplement the Radix Sort algorithm with MapReduce in the TODO section of the file radix.py.
You can run the code right away, but the correctness comparison will come out false until the TODO is finished -.- (a small warm-up hint follows below).
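Purely as a warm-up hint (this is not the radix.py solution, and the variable names are made up): counting how many numbers land in each digit bucket is itself a word-count-shaped MapReduce job.

from pyspark import SparkContext

sc = SparkContext("local", "digitHistogram")  # reuse your existing sc if one is already running
numbers = sc.parallelize([170, 45, 75, 90, 802, 24, 2, 66])
exp = 1  # 1 -> ones digit, 10 -> tens digit, ...
bucket_sizes = numbers.map(lambda n: ((n // exp) % 10, 1)) \
                      .reduceByKey(lambda x, y: x + y) \
                      .collect()
print(sorted(bucket_sizes))  # [(0, 2), (2, 2), (4, 1), (5, 2), (6, 1)]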
Checkoff
- Explain your TODO code. How does it work?
- Achieve the correct result while being much faster than the wiki code.
Exercise 3: Gradient Descent
In this exercise, given sample pairs (x_i, y_i), i = 1, ..., N, we will implement a Gradient Descent optimizer on

\min_\theta L(\theta)

with

L(\theta) = \sum_{i=1}^{N} (f_\theta(x_i) - y_i)^2,

where f_\theta(x_i) denotes the model prediction for x_i with parameters \theta. In the above optimization problem, the loss function is a summation of squared errors, which makes it possible to train in parallel.
You can see that the gradient of the above loss function,

\nabla L(\theta) = \sum_{i=1}^{N} 2 (f_\theta(x_i) - y_i) \nabla_\theta f_\theta(x_i),

is a summation of independent per-sample terms.
So by applying MapReduce to the gradient in each iteration, the time cost tends to be much lower.
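As a rough illustration of that map-then-reduce gradient step (this is only a sketch assuming a simple linear model y ≈ w*x + b and made-up sample data; the actual model, parameters, and structure of gd.py may differ):

from pyspark import SparkContext

sc = SparkContext("local", "gdSketch")  # reuse your existing sc if one is already running
samples = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)])  # hypothetical (x_i, y_i) pairs

w, b, lr = 0.0, 0.0, 0.01
for _ in range(100):
    # map: per-sample gradient of (w*x + b - y)^2 with respect to (w, b)
    grads = samples.map(lambda p: (2 * (w * p[0] + b - p[1]) * p[0],
                                   2 * (w * p[0] + b - p[1])))
    # reduce: the full gradient is just the sum of the independent per-sample terms
    gw, gb = grads.reduce(lambda g1, g2: (g1[0] + g2[0], g1[1] + g2[1]))
    w, b = w - lr * gw, b - lr * gb
print(w, b)  # the fitted slope and intercept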
Code and data are available in gd.py and GPA.txt.
You can run the code as provided, but the current MapReduce version won't give you a good match :)
For a detailed introduction to Gradient Descent, please turn to Wikipedia.
Checkoff
- Explain your TODO code. Why does it make sense?
- Achieve the correct result on the final plot while being much faster than my code (actually, you may not be faster than my implementation :) ).