A far more interesting benchmark in the NAS suite is the random
sparse conjugate gradient computation. This benchmark requires
the repeated solution of *Ax = F*, where *A* is a random sparse
matrix. There are two ways to approach the parallelization of
this task. The best, and most difficult, is to view the matrix as the
connectivity graph for the elements of the vector. In a preprocessing
step one can partition the vector into segments of nearly equal size
that minimize the communication between subsets. The resulting
algorithm can have the same low communication-to-computation ratio as
our simple grid CG example above.
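
As an illustration of that preprocessing step, the sketch below grows
segments of nearly equal size by breadth-first search over the
connectivity graph, where adj[i] lists the columns *j* with a nonzero
*A(i,j)*. A production code would use a genuine graph partitioner;
this greedy heuristic and all names in it are illustrative only.

    #include <queue>
    #include <vector>

    // Sketch: partition the vector's index set into nparts segments of
    // nearly equal size by growing each segment breadth-first, so that
    // elements coupled by nonzeros tend to land in the same segment.
    std::vector<int> partition_by_bfs(
            const std::vector<std::vector<int> >& adj, int nparts) {
        int n = (int) adj.size();
        int target = (n + nparts - 1) / nparts;   // desired segment size
        std::vector<int> part(n, -1);
        int seg = 0, filled = 0;
        // place vertex v in the current segment, opening a new one when full
        auto assign = [&](int v) {
            if (filled == target) { seg++; filled = 0; }
            part[v] = seg; filled++;
        };
        for (int start = 0; start < n; start++) {
            if (part[start] != -1) continue;      // already placed
            std::queue<int> q;
            assign(start);
            q.push(start);
            while (!q.empty()) {
                int v = q.front(); q.pop();
                for (int w : adj[v])
                    if (part[w] == -1) { assign(w); q.push(w); }
            }
        }
        return part;
    }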

The second approach is not as efficient as the scheme that optimizes locality, but it is a more direct implementation of the benchmark. We shall represent the matrix as a distributed random sparse matrix; more specifically, as a distributed matrix whose elements are themselves sparse matrices.

The *DistributedMatrix* collection class in the pC++ library allows
any class that has the algebraic structure of a ring to be the element
type of the matrix. Furthermore, a *p* by *p* matrix whose elements
are *k* by *k* matrices has all the same mathematical properties
as a *kp* by *kp* matrix. Indeed, this is the key property exploited
by the blocking algorithms in most BLAS level 3 libraries.
Consequently, we can select the element class to be a sparse matrix,
as shown below.

    class SparseMatrix {
        int rows, columns;
        data_ptr *r;    // a vector of pointers to row lists
        data_ptr *c;    // a vector of pointers to column lists
     public:
        ...
        void amux(Vector& Vout, Vector& Vin);
        ...
    };

This structure is illustrated in Figure 6.4. The full matrix is then declared as

    DistributedMatrix< SparseMatrix > A( .... );
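
For concreteness, the local block product might be implemented along
the following lines, assuming each row list is a linked list of
(column, value) entries. The Entry node type is an assumption made for
this sketch; the actual layout behind data_ptr is not spelled out here.

    // Hypothetical node type for one nonzero entry in a row list.
    struct Entry {
        int    col;     // column index of the nonzero
        double val;     // its value
        Entry* next;    // next nonzero in the same row
    };

    // Sketch of SparseMatrix::amux, computing Vout = A * Vin for the
    // local block by walking each row list and accumulating val * Vin[col].
    void SparseMatrix::amux(Vector& Vout, Vector& Vin) {
        for (int i = 0; i < rows; i++) {
            double sum = 0.0;
            for (Entry* e = (Entry*) r[i]; e != 0; e = e->next)
                sum += e->val * Vin[e->col];
            Vout[i] = sum;
        }
    }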

This distributed matrix of sparse matrices can then be used like an ordinary matrix. In particular, it is easy to implement an efficient matrix-vector multiply that applies the distributed matrix to a distributed vector.
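
Schematically, the product *Vout = A Vin* proceeds block row by block
row. The serial emulation below shows the data flow: in the parallel
version, handing segment *Vin[j]* to block *(i,j)* is the broadcast of
*Vin* across each block row, and the accumulation into *Vout[i]* is the
reduction sum along that row. It assumes a Vector type with indexing
and a size constructor; the blocks layout and names are assumptions of
this sketch, not the pC++ library interface.

    #include <vector>

    // Serial emulation of Vout = A * Vin over a p x p grid of sparse
    // blocks; blocks[i][j] is the submatrix owned by processor (i,j),
    // and each vector segment has length k.
    void dist_amux(std::vector<std::vector<SparseMatrix> >& blocks,
                   std::vector<Vector>& Vout, std::vector<Vector>& Vin,
                   int p, int k) {
        Vector partial(k);
        for (int i = 0; i < p; i++) {     // block row i builds segment i
            for (int x = 0; x < k; x++) Vout[i][x] = 0.0;
            for (int j = 0; j < p; j++) {
                // communication: segment Vin[j] is broadcast to block (i,j)
                blocks[i][j].amux(partial, Vin[j]);
                // communication: partial results are reduction-summed
                // across block row i
                for (int x = 0; x < k; x++) Vout[i][x] += partial[x];
            }
        }
    }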

The benchmark kernel is then nearly identical to the Grid conjugate gradient algorithm from the first benchmark.
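
For reference, the kernel has the standard conjugate gradient shape.
The sketch below uses plain C++ containers and a matvec callback
standing in for the distributed product above; all names are
illustrative. Note the two dot products per iteration, which become
global reductions in the distributed code.

    #include <cmath>
    #include <functional>
    #include <vector>

    typedef std::vector<double> Vec;

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
        return s;   // a distributed reduction in the parallel code
    }

    // Solve A*x = F by CG; matvec(q, p) computes q = A*p.
    // The caller supplies x initialized to zero, so r = F - A*x = F.
    void cg(std::function<void(Vec&, const Vec&)> matvec,
            Vec& x, const Vec& F, int maxit, double tol) {
        size_t n = F.size();
        Vec r = F, p = r, q(n);
        double rho = dot(r, r);
        for (int it = 0; it < maxit && std::sqrt(rho) > tol; it++) {
            matvec(q, p);                          // q = A * p
            double alpha = rho / dot(p, q);
            for (size_t i = 0; i < n; i++) {
                x[i] += alpha * p[i];              // advance the solution
                r[i] -= alpha * q[i];              // update the residual
            }
            double rho_new = dot(r, r);
            double beta = rho_new / rho;
            for (size_t i = 0; i < n; i++)
                p[i] = r[i] + beta * p[i];         // new search direction
            rho = rho_new;
        }
    }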

The primary communication consists of a broadcast of the *Vin*
vector to each row of sparse blocks, followed by a reduction sum
to compute the *Vout* results. The dot-product reductions on
the distributed vectors make up the remaining communication.

The standard NAS version of this benchmark comes in a test
size of *n = 1400* and a full size of *n = 14000*.
We have run the full-size problem only on the CM-5; all
other statistics report results for the small version.
