One of my Spring courses, taught in C++, covered the basics of high performance computing. We learned how to build a linear algebra solver with customized vector and matrix classes, which we then applied to solving partial differential equations. We used benchmarking to quantify different ways of improving the performance of linear algebra operations.
First, we constructed the vector and matrix classes, including addition and multiplication operators. We tested different approaches to matrix and matrix-vector multiplication, e.g. different loop orderings that take better advantage of the processor cache and SIMD vectorization, and we compared the performance of different memory layouts, including row-major and column-major matrices.
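To give a flavor of those experiments (a minimal sketch with illustrative names, not our actual class interfaces), compare two loop orderings for multiplying row-major dense matrices: ijk strides across B by whole rows in its inner loop, while ikj streams through rows of B and C contiguously, which is far friendlier to the cache and to SIMD auto-vectorization.

```cpp
#include <cstddef>
#include <vector>

// Row-major dense matrices stored as flat std::vector<double>.
// Naive ijk ordering: the inner loop reads B by column, touching
// elements n doubles apart on every iteration (poor locality).
void matmul_ijk(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// ikj ordering: the inner loop walks rows of B and C contiguously,
// so it is cache-friendly and easy for the compiler to vectorize.
// C must be zero-initialized on entry.
void matmul_ikj(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}
```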
Next, we created sparse matrix classes, including coordinate (COO), compressed sparse row (CSR), and compressed sparse column (CSC) formats, and compared their relative performance.
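A minimal sketch of the CSR idea (the field and function names here are illustrative, not our course's class design): only the nonzeros are stored, row by row, so a matrix-vector product walks the value and column-index arrays sequentially and touches each nonzero exactly once.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR representation: the nonzero values and their column
// indices, with row_ptr[i]..row_ptr[i+1] delimiting the nonzeros of row i.
struct CsrMatrix {
    std::size_t rows = 0;
    std::vector<double> values;
    std::vector<std::size_t> col_idx;
    std::vector<std::size_t> row_ptr;  // size rows + 1
};

// y = A * x: each row's nonzeros are contiguous in memory, so the
// traversal of `values` and `col_idx` is purely sequential.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[i] += A.values[k] * x[A.col_idx[k]];
    return y;
}
```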
Next, we focused on parallel programming. First we created threads manually; then we used OpenMP and its various pragma directives, emphasizing techniques for avoiding race conditions, as in the sketch below. We tested our operations using an Amazon Web Services account.
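The dot product is the textbook race-condition case: every thread wants to update the same accumulator. A sketch of the OpenMP fix using the standard `reduction` clause (compile with `-fopenmp`; the function name is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Racy anti-pattern, shown only in comments: with a plain
//   #pragma omp parallel for
// all threads would read-modify-write `sum` concurrently,
// producing nondeterministic results.

// Correct version: reduction(+ : sum) gives each thread a private
// accumulator and combines them safely at the end of the region.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += x[i] * y[i];
    return sum;
}
```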
Next, we focused on GPU programming, specifically CUDA, and tested our operations on a Tesla GPU server.
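As a taste of the CUDA model (a minimal sketch, not our actual project code): each GPU thread handles one element of an AXPY update, with the grid rounded up to cover the whole vector. This compiles with `nvcc` and uses unified memory to keep the example short.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// y = a * x + y, one element per GPU thread.
__global__ void axpy(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    // Unified memory keeps the sketch short; performance-minded code
    // might use explicit cudaMalloc/cudaMemcpy instead.
    cudaMallocManaged(&x, n * sizeof(double));
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    const int block = 256;
    const int grid = (n + block - 1) / block;  // round up to cover n
    axpy<<<grid, block>>>(n, 3.0, x, y);
    cudaDeviceSynchronize();

    std::printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```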
Finally, we focused on Message Passing Interface (MPI) programming and tested our operations with up to 16 nodes. As the culmination of our work, we applied our matrix and vector classes to solving partial differential equations.
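To illustrate the MPI style (a hedged sketch rather than our actual code; build with `mpicxx`, launch with `mpirun`): each rank computes a partial dot product over its local slice of the data, and `MPI_Allreduce` combines the partial sums so every rank ends up with the global result.

```cpp
#include <cstdio>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank owns a contiguous slice of the global vectors.
    const int local_n = 1000;
    std::vector<double> x(local_n, 1.0), y(local_n, 2.0);

    double local_sum = 0.0;
    for (int i = 0; i < local_n; ++i)
        local_sum += x[i] * y[i];

    // Sum the partial results across ranks; every rank gets the total.
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("dot = %f over %d ranks\n", global_sum, size);
    MPI_Finalize();
    return 0;
}
```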
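And for the PDE application itself, a minimal serial sketch (illustrative, and far simpler than our project): Jacobi iteration for the 1-D Poisson problem -u'' = f on [0,1] with zero boundary values, discretized by finite differences. This is exactly the kind of linear system our matrix and vector classes fed.

```cpp
#include <cstdio>
#include <vector>

// Jacobi iteration for -u'' = f on [0,1], u(0) = u(1) = 0,
// discretized on n interior points with spacing h.
int main() {
    const int n = 100;
    const double h = 1.0 / (n + 1);
    std::vector<double> u(n, 0.0), u_new(n, 0.0), f(n, 1.0);  // f == 1

    for (int iter = 0; iter < 50000; ++iter) {
        for (int i = 0; i < n; ++i) {
            const double left  = (i > 0)     ? u[i - 1] : 0.0;  // boundary
            const double right = (i < n - 1) ? u[i + 1] : 0.0;  // boundary
            // From (-u[i-1] + 2u[i] - u[i+1]) / h^2 = f[i].
            u_new[i] = 0.5 * (left + right + h * h * f[i]);
        }
        u.swap(u_new);
    }
    // Exact solution is u(x) = x(1 - x)/2, so u near x = 0.5 should
    // approach 0.125.
    std::printf("u(mid) = %f\n", u[n / 2]);
    return 0;
}
```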