Let us fist revisit the matrix-matrix multiply example discussed in the SPMD programming section. We will build our first version out of matrices which are defined as distributed arrays of a simple class to represent a single matrix element.

class Double{ public: double value; Double(){ value = 0; } Double(double x){ value = x;} Double &operator *(Double &x){ return Double(value*x.value); } Double &operator +=(Double &x){ value += x.value; return *this;} };

The matrix collection, shown below, builds a matrix sized according to the dimension in the alignment object. The idea behind the matrix-matrix algorithm in computing

is to have each element of the matrix P grab the right row of M and the right column of N and form the dot product to obtain its value. Consequently, the operation is performed using ``method of element'' parallelismP = M * N

The complexity of this code should be compared with the message passing matMul() in the SPMD programming (section 3.2.4). Note that these two programs implement the same basic algorithm, the version above is much simpler. In addition, the message passing version was tied to the way the data was distributed to memory. In this version the distribution and alignment of the data are completely independent of the the code. In the main program below we have used a distribution and alignment scheme that places the rows of the second matrix along the columns of the first. Consequently, no communication takes place if the matrices are square. However, we emphasize that the code will run no matter how the data is aligned.Collection Matrix: DistributedArray{ // a constructor is needed. Matrix(Distribution *T, Align *A): DistributedArray(T, A){} MethodOfElement: void matMul(Matrix<ElementType> &M, Matrix<ElementType> &N){ int myrow = this->index1; // first index from DistributedArray int mycol = this->index2; // second index from DistributedArray int maxcol = M.dim2size; // should be your "n" parameter if(maxcol != N.dim1size) err("col dim of M must = row dim of N"); value = 0; for (int j; j < maxCol; j++) *this += M(myrow, j) * N(j, mycol); // Get_Element implicit }

Processor_Main() { Processor P; Distribution T(m, p, &P, BLOCK, BLOCK); Align A(m, n, "[ALIGN( domain[i][j], T[i][j])]" ); Align B(n, p, "[ALIGN( domain[i][j], T[j][i])]" ); Matrix<Double> M(&T, &A, m, n); Matrix<Double> N(&T, &B, n, p); Matrix<Double> P(&T, &A, m, p); P.matMul( M, N); }

One problem with this program is that it would run like a dog. There are too many function calls for each multiply-add. Here is how to make it fly. Let's replace Double as the element type with

Now by replacing ``Double" by ``BlockDouble" in the Processor_Main() and you have a serious blocked matrix-matrix multiply that will do flops for data moves. This can be very fast.#define N 100 class BlockDouble{ double data[N][N]; BlockDouble & operator *(BlockDouble &x){ call a linpack BLAS-3 dense matrix-matrix multiply tuned for the architecture } BlockDouble & operator +=(BlockDouble &x){ call the BLAS-3 dense matrix accumulate. } };

Mon Nov 21 09:49:54 EST 1994