Caches are typically one of the hardest topics for students in Computer Architecture to grasp at first. This exercise uses the cache visualization tools in Venus to get you more familiar with cache behavior and performance terminology, with the help of the file cache.s. At this point, read through cache.s to get a rough idea of what the program does.
To get started with each of the scenarios below:
Get familiar with the parameters in the cache window:
The Data Cache Simulator will show the state of your data cache. Please remember that the simulator runs alongside your code, so if you reset your code, the cache and memory state will be reset as well.
If you run the code all at once, you will only see the final state of the cache and the overall hit rate. You will probably benefit most from setting breakpoints at each memory access to see exactly where the hits and misses come from. To set a breakpoint in Venus, click the corresponding line in the simulator; if the line turns red, your program will stop whenever execution reaches that line.
Simulate the following scenarios and record the final cache hit rates. Try to reason out what the hit rate will be BEFORE running the code. After running each simulation, make sure you understand WHY you see what you see (the TAs will be asking)!
Do not hesitate to ask questions if you feel confused! This is perfectly normal, and the TAs are there to help you out!
Good questions to ask yourself as you do these exercises:
Scenario 1
Cache Parameters:
Program Parameters:
Checkoff
Scenario 2
Cache Parameters:
Program Parameters:
Checkoff
Scenario 3
Cache Parameters:
Program Parameters:
Checkoff
If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences.
In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (k), we see that it moves through A with stride n, through B with stride 1, and through C with stride 0. The loop order does not affect the correctness of the multiplication; however, the order in which we access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses exhibit spatial and temporal locality. Optimizing a program's memory access patterns is essential to obtaining good performance from the memory hierarchy.
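For instance, reordering the loops to j, k, i (one possible reordering; the variants in matrixMultiply.c may differ) turns the innermost accesses into stride-1 walks over A and C while B is held fixed:

// Same computation, loops reordered to j, k, i: the innermost
// loop (i) now walks C and A with stride 1 and holds B[k+j*n]
// fixed (stride 0), so consecutive iterations touch adjacent memory.
for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];

Because every iteration of the inner loop now touches adjacent elements of A and C, consecutive accesses fall in the same cache line far more often.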
Take a glance at matrixMultiply.c. You'll notice that the file contains multiple implementations of matrix multiply with 3 nested loops.
Compile this code using the '-O3' flag. It is important here to use '-O3' to turn on compiler optimizations. Then run the code and answer the questions that follow:
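For example, assuming a standalone gcc build (your lab setup may provide a Makefile instead, in which case use that):

$ gcc -O3 -o matrixMultiply matrixMultiply.c
$ ./matrixMultiply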
Checkoff: Be prepared to explain your responses to the above questions to your TA.
Note: The term cache blocking is not referring to the same thing as the cache blocks we talked about when discussing the structure of caches.
Sometimes, we wish to swap the rows and columns of a matrix. This operation is called a "transposition" and an efficient implementation can be quite helpful while performing more complicated linear algebra operations. The transpose of matrix A is often denoted as A^T.
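For reference, with column-major storage as above, a straightforward (non-blocked) transpose can be written as follows; transpose.c presumably provides a similar reference version:

// Naive transpose: B = A^T, both n-by-n and column-major.
// Element (i, j) of A lives at A[i + j*n].
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        B[j + i*n] = A[i + j*n];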
Returning to the matrix multiplication code above, note that we stride across the entire A and B matrices to compute a single value of C. As such, we are constantly accessing new values from memory and obtain very little reuse of cached data! We can improve the amount of data reuse in the caches by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition, we consider performing the transposition one block at a time.
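As a minimal sketch of the blocked-iteration pattern, shown here on a simple element-doubling pass rather than on the transpose itself (which is left for you to write), with bounds clamped so that n need not be a multiple of blocksize:

// Visit an n-by-n column-major array block by block.
// The clamped bounds handle partial blocks at the edges
// when n is not a multiple of blocksize.
void visit_blocked(double *A, int n, int blocksize) {
    for (int jj = 0; jj < n; jj += blocksize) {
        for (int ii = 0; ii < n; ii += blocksize) {
            int j_max = (jj + blocksize < n) ? jj + blocksize : n;
            int i_max = (ii + blocksize < n) ? ii + blocksize : n;
            for (int j = jj; j < j_max; j++)
                for (int i = ii; i < i_max; i++)
                    A[i + j*n] *= 2.0;  // stand-in for the real per-element work
        }
    }
}

All of the work inside the two inner loops touches at most a blocksize-by-blocksize region, so that region can stay resident in the cache while it is being processed.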
In the above image, we transpose each block A_ij of matrix A into its final location in the output matrix, one block at a time. With this scheme, we significantly reduce the size of the working set in the cache during the processing of any one block. This (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance.
Your task is to implement cache blocking in the transpose_blocking() function inside transpose.c. By default, the function does nothing, so the benchmark function will report an error. You may NOT assume that the matrix width (n) is a multiple of the blocksize. After you have implemented cache blocking, compile it with '-O3' and then execute it:
$ ./transpose <n> <blocksize>
Here n (the width of the matrix) and blocksize are parameters that you specify. You can verify that your code works by setting n=1000 and blocksize=33. Once your code is working, complete the following exercises and record your answers (we will ask about them during checkoff).
Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000. At what point does the cache-blocked version of transpose become faster than the non-cache-blocked version? Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache-blocked code?
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, and 5000. How does performance change as the blocksize increases? Why is this the case?
Checkoff: Be prepared to explain your responses to the above questions to your TA.