Next: Known Bugs in Up: The Demo Examples. Previous: BM-3: The NAS

BM-4: the NAS Sparse CG Benchmark

A far more interesting benchmark in the NAS suite is the random sparse conjugate gradient computation. This benchmark requires the repeated solution of A*X = F, where A is a random sparse matrix. There are two ways to approach the parallelization of this task. The best, and most difficult, is to view the matrix as the connectivity graph for the elements of the vector. In a preprocessing step one can partition the vector into segments of nearly equal size that minimize the communication between subsets. The resulting algorithm can have the same low communication-to-computation ratio as our simple grid CG example above.

The second approach is not as efficient as the scheme that optimizes locality, but it is a more direct implementation of the benchmark. We shall represent the matrix as a distributed random sparse matrix. More specifically, we shall represent it as a Distributed Matrix of Sparse Matrices.

The DistributedMatrix collection class in the pC++ library allows any class that has the algebraic structure of a ring to be the element of the matrix. Furthermore, a p by p matrix whose elements are k by k matrices has all the same mathematical properties as a kp by kp matrix. Indeed, this is the key property exploited by the blocking algorithms used in most BLAS level 3 libraries. Consequently, we can select the element class to be a sparse matrix as shown below.

    class SparseMatrix {
     public:
      int rows, columns;
      data_ptr * r; // a vector of pointers to row lists
      data_ptr * c; // a vector of pointers to column lists

      void amux(Vector& Vout, Vector& Vin);
    };
SparseMatrix is a simple doubly linked sparse matrix with a special function amux(Vout, Vin) that can be used to compute the sparse matrix-vector product Vout = A*Vin. The collection object that is the core of the computation is then

         DistributedMatrix< SparseMatrix >  A(.... )
which is illustrated in Figure 6.4.

This distributed matrix of sparse matrices can then be used like an ordinary matrix. In particular, an efficient matrix multiply can be easily implemented that will multiply the distributed matrix by a distributed vector.

The benchmark kernel is then nearly identical to the Grid conjugate gradient algorithm from the first benchmark.

The primary communication consists of a broadcast of the Vin vector to each row of sparse blocks and then a reduction sum to compute the Vout results. The dot product reductions on the distributed vectors make up the remaining communication.

The standard NAS version of this benchmark comes in a test size of n = 1400 and a full size of n = 14000. We have run the full-size problem only on the CM-5. All other statistics report the results for the small version.

Mon Nov 21 09:49:54 EST 1994