Next: Collections. Ver 1.0+ Up: Thread Environment Classes. Previous: Constructors for TEClass

Encapsulating SPMD Libraries. Ver. 2.0

One of the reasons for including Thread Environment Classes in pC++ is to provide a mechanism to encapsulate code that is designed for execution in a message-passing SPMD environment. This includes many of the libraries designed at the national laboratories, such as Lapack++ and AMR++.

To understand how this works, consider an example: a matrix class Matrix defined as follows.


TEClass Matrix{
  double **data;
 public:
  int rows, cols;
  Matrix(int n, int m);
  void matMul(Matrix &A, Matrix &B);
  double &operator ()(int,int);
 };

In an SPMD style execution, the matrix object would be created on each processor participating in the computation, and the constructor, given global dimensions n by m, would automatically partition the data over the p processors. The interesting part of these libraries is the way interprocessor communication is managed. In a typical application every processor must participate in each parallel matrix operation. All communication is hidden within the class operators, and the resulting ``user code'' looks exactly like sequential code. (The version 1.0 pC++ compiler for distributed memory machines works exactly in this manner.)

Take, for example, the way the library designer would implement the operator matMul(). Let us assume that the library is designed so that rows are partitioned over the processors; that is, rows (0, n/p - 1) are on processor 0, rows (n/p, 2n/p - 1) on processor 1, etc. The SPMD code for matMul() would look something like the following. Each processor has part of three matrices: A, B and *this. The code below first has each processor send its block of a column of B to every thread; each thread then assembles the pieces of the column and computes the dot product of that column with its share of the rows of A.


void Matrix::matMul(Matrix &A, Matrix &B){
    int i, j, k, s, n, m, r, p, from;
    p = NumProc();  // NumProc() gives the number of processor threads
    k = A.cols;  m = cols;  n = rows/p;  r = k/p;
    double *colbuf = new double[k];
    double *buffer = new double[r];
    for(i = 0; i < m; i++){
       // broadcast the local block of column i of B to each processor
       for(j = 0; j < p; j++){
         for(s = 0; s < r; s++) buffer[s] = B.data[s][i]; 
         pCxx_send(j, r, buffer);
         }
       // assemble the column blocks into a full column of B
       for(j = 0; j < p; j++){
         pCxx_receive(&from, buffer);
         for(s = 0; s < r; s++) colbuf[from*r+s] = buffer[s];
         }
       // dot product of column i with each local row of A
       for(j = 0; j < n; j++)
           for(s = 0; s < k; s++)
                data[j][i] += A.data[j][s]*colbuf[s];
     }
    delete [] colbuf;  delete [] buffer;
}
This version of the program is not optimal (a blocked version should be used), but it is easy to understand and is typical of the style of SPMD libraries.

This function can now be called from a pC++ main program as follows.


Processor_Main(){
   Processors P;
   Matrix C(n,m), A(n,k), B(k,m);
   ....
   C.matMul(A, B);
}

A more interesting problem is the element reference operator. The job of the (int,int) operator is to make sure that any read or update from the main thread is propagated to the correct position in the distributed array. For example, if the main thread invokes


     x  = M(i,j);
then the thread that contains the element must return the correct value to the main thread. On the other hand, if we call

    M(i,j) = x;
then it is the job of the (...) operator to make sure the element on the correct processor is updated. This problem is complicated because we cannot be sure which thread may be invoking this operator. If it is a thread whose address space contains the requested data, there is no problem. However, if the address spaces are different, as when the main thread invokes this operation on each worker, we have a problem. To see the difficulty, consider the following possible implementation.


double dummy_buffer;
double &Matrix::operator()(int i, int j){

   double *z;
   int not_local = 1;
   int p = NumProc();
   int block = rows/p;   // rows per processor
   if ((MyProc()*block <= i) && (i < (MyProc()+1)*block)){
        // I have the desired row!
        z = &(data[i % block][j]);
        not_local = 0;
        }
   else z = &dummy_buffer;
   pCxx_BroadcastBytes(not_local, sizeof(double), z);
   return *z;
}
If each of the worker threads associated with the TEClass executes this operation, then the reference evaluation will be correct when called by the main thread only if the main thread shares its address space with one of the worker threads. (This is the case in the current version 1.0 pC++.) However, in future versions this may not hold. There are two solutions to this problem. One is to introduce the CC++ global data type qualifier, so that special pointers and references can be created that can be passed between address spaces; we are strongly considering this for version 2.0. The other is to introduce more explicit member functions for read and write operations.


TEClass Matrix{
  double **data;
  double &operator ()(int,int);
 public:
  int rows, cols;
  Matrix(int n, int m);
  void matMul(Matrix &A, Matrix &B);
  double read(int i, int j){ return (*this)(i,j); }
  void write(int i, int j, double value){ (*this)(i,j) = value; }
 };
This solution restricts the (...) operator to use within the TEClass thread environments only, and it allows only data values (rather than references or pointers) to be passed between address spaces.





beckman@cica.indiana.edu
Mon Nov 21 09:49:54 EST 1994