Next: Example: Batcher's Bitonic Up: Programming with Collections. Previous: Programming with Collections.

Example: Matrix-Matrix Multiply. Ver 1.0+

Let us fist revisit the matrix-matrix multiply example discussed in the SPMD programming section. We will build our first version out of matrices which are defined as distributed arrays of a simple class to represent a single matrix element.

class Double{
  double value;
  Double(){ value = 0; }
  Double(double x){ value = x;}
  Double &operator *(Double &x){ return Double(value*x.value); }
  Double &operator +=(Double &x){ value += x.value; return *this;}

The matrix collection, shown below, builds a matrix sized according to the dimension in the alignment object. The idea behind the matrix-matrix algorithm in computing

   P = M * N
is to have each element of the matrix P grab the right row of M and the right column of N and form the dot product to obtain its value. Consequently, the operation is performed using ``method of element'' parallelism

Collection Matrix: DistributedArray{
    // a constructor is needed.
    Matrix(Distribution *T, Align *A): DistributedArray(T, A){}

    void matMul(Matrix<ElementType> &M, Matrix<ElementType> &N){
       int myrow = this->index1;  // first index from DistributedArray
       int mycol = this->index2;  // second index from DistributedArray
       int maxcol = M.dim2size;   // should be your "n" parameter
       if(maxcol != N.dim1size) err("col dim of M must = row dim of N");
       value = 0;
       for (int j; j < maxCol; j++)
              *this += M(myrow, j) * N(j, mycol); // Get_Element implicit
The complexity of this code should be compared with the message passing matMul() in the SPMD programming (section 3.2.4). Note that these two programs implement the same basic algorithm, the version above is much simpler. In addition, the message passing version was tied to the way the data was distributed to memory. In this version the distribution and alignment of the data are completely independent of the the code. In the main program below we have used a distribution and alignment scheme that places the rows of the second matrix along the columns of the first. Consequently, no communication takes place if the matrices are square. However, we emphasize that the code will run no matter how the data is aligned.

        Processor P;
        Distribution T(m, p, &P, BLOCK, BLOCK);
        Align A(m, n, "[ALIGN( domain[i][j], T[i][j])]" );
        Align B(n, p, "[ALIGN( domain[i][j], T[j][i])]" );
        Matrix<Double> M(&T, &A, m, n);
        Matrix<Double> N(&T, &B, n, p);
        Matrix<Double> P(&T, &A, m, p);
        P.matMul( M, N);

One problem with this program is that it would run like a dog. There are too many function calls for each multiply-add. Here is how to make it fly. Let's replace Double as the element type with

#define N 100
class BlockDouble{
   double data[N][N];
   BlockDouble &  operator *(BlockDouble &x){
         call a linpack BLAS-3 dense matrix-matrix multiply
         tuned for the architecture
   BlockDouble & operator +=(BlockDouble &x){
         call the BLAS-3 dense matrix accumulate.
Now by replacing ``Double" by ``BlockDouble" in the Processor_Main() and you have a serious blocked matrix-matrix multiply that will do flops for data moves. This can be very fast.

Next: Example: Batcher's Bitonic Up: Programming with Collections. Previous: Programming with Collections.
Mon Nov 21 09:49:54 EST 1994