One of the reasons for including Thread Environment Classes in pC++ is to provide a mechanism to encapsulate code that is designed for execution in a message-passing SPMD environment. This includes many of the libraries that have been designed at the national laboratories, such as Lapack++ and AMR++.

To understand how this works, consider the example of a matrix class Matrix defined as follows:

   TEClass Matrix{
      double **data;
    public:
      int rows, cols;
      Matrix(int n, int m, int p);
      void matMul(Matrix &A, Matrix &B);
      double &operator()(int, int);
   };

In an SPMD style execution, the matrix object would be created
on each processor participating in the computation and the
constructor, given global dimensions *n* by *m*, would automatically
partition the data over the *p* processors.
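The row partitioning performed by such a constructor can be sketched as follows. This is an illustrative sketch, not the pC++ library code: it is written as plain sequential C++, with the thread count and thread identity passed in explicitly (in pC++ they would come from the runtime, e.g. NumProc()), and all names here are assumptions.

```cpp
#include <cassert>

// Sketch of a row-partitioned matrix block: each of the p worker threads
// holds a contiguous block of n/p global rows (destructor omitted).
struct LocalMatrix {
    int rows, cols;      // global dimensions
    int localRows;       // number of rows owned by this thread
    int firstRow;        // global index of the first locally held row
    double **data;       // the local block of rows

    LocalMatrix(int n, int m, int p, int myProc)
        : rows(n), cols(m)
    {
        localRows = n / p;              // assume p divides n evenly
        firstRow  = myProc * localRows; // contiguous block distribution
        data = new double*[localRows];
        for (int i = 0; i < localRows; i++)
            data[i] = new double[m]();  // zero-initialized local rows
    }

    // Does global row i live in this thread's address space?
    bool ownsRow(int i) const {
        return i >= firstRow && i < firstRow + localRows;
    }
};
```

For example, with *n* = 8 rows over *p* = 4 threads, thread 1 holds global rows 2 and 3.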
The interesting part of these libraries is the way processor
communication is managed. In a typical application every processor
must participate in each matrix operation done in parallel. All
communication is hidden within the class operators and the resulting
``user code'' looks exactly like sequential code.
(The version 1.0 pC++ compiler for distributed memory machines works
exactly in this manner.) Take for example the way the library designer
would implement the operator matMul(). Let us assume
that the library is designed so that rows are partitioned over the processors.
That is, rows *(0, n/p - 1)* are on processor *0*, rows *(n/p, 2n/p - 1)* on
processor *1*, etc. The SPMD code for matMul() would look
something like the following. Each processor has part of three matrices,
*A*, *B* and *this*. The code below first broadcasts a column
of *B* to each thread which then assembles the pieces of the column
and computes the appropriate dot product of that column with its share
of the rows of *A*.

This version of the program is not optimal (a blocked version should be used), but it is easy to understand and it is typical of the style of SPMD libraries.

   void Matrix::matMul(Matrix &A, Matrix &B){
      int i, j, k, s, n, m, r, p, from;
      p = NumProc();   // NumProc() gives the number of processor threads
      k = A.rows;
      m = cols; n = rows/p; r = k/p;
      double *rowbuf = new double[k];
      double *buffer = new double[r];
      for(i = 0; i < m; i++){
         // broadcast this thread's block of column i of B to each processor
         for(j = 0; j < p; j++){
            for(s = 0; s < r; s++) buffer[s] = B.data[s][i];
            pCxx_send(j, r, buffer);
         }
         // assemble the column blocks into the full column of B
         for(j = 0; j < p; j++){
            pCxx_receive(&from, buffer);
            for(s = 0; s < r; s++) rowbuf[from*r+s] = buffer[s];
         }
         // dot product of each local row of A with the assembled column
         for(j = 0; j < n; j++)
            for(s = 0; s < k; s++) data[j][i] += A.data[j][s]*rowbuf[s];
      }
      delete [] rowbuf;
      delete [] buffer;
   }
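The communication pattern above can be re-enacted sequentially to check the algorithm's shape. The sketch below is an illustrative simulation, not pC++ code: the per-thread "send" and "receive" of column blocks collapse into direct copies, since all simulated threads share one address space.

```cpp
#include <cassert>
#include <vector>

typedef std::vector<std::vector<double>> Mat;

// Sequential simulation of the SPMD matMul pattern: for each column i of B,
// every simulated thread contributes its r = k/p block of that column (the
// "broadcast"), the blocks are assembled into a full column, and each thread
// then updates its own rows of the product with a dot product.
Mat spmdMatMul(const Mat &A, const Mat &B, int p)
{
    int n = (int)A.size();        // rows of A (and of the result)
    int k = (int)B.size();        // rows of B == cols of A
    int m = (int)B[0].size();     // cols of B
    int r = k / p;                // rows of B held per simulated thread
    Mat C(n, std::vector<double>(m, 0.0));
    std::vector<double> colbuf(k);

    for (int i = 0; i < m; i++) {
        // "broadcast/assemble": thread j owns B rows [j*r, (j+1)*r)
        for (int j = 0; j < p; j++)
            for (int s = 0; s < r; s++)
                colbuf[j*r + s] = B[j*r + s][i];   // full column i of B
        // each thread computes its share of the rows of C
        for (int row = 0; row < n; row++)
            for (int s = 0; s < k; s++)
                C[row][i] += A[row][s] * colbuf[s];
    }
    return C;
}
```

Running it on a small example reproduces an ordinary matrix product, confirming that the column-assembly scheme computes the right values.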

This function can now be called from a pC++ main program as follows:

   Processor_Main(){
      Processors P;
      Matrix C(n, m), A(n, k), B(k, m);
      ....
      C.matMul(A, B);
   }

A more interesting problem is that of the element reference operator. The job of the operator()(int, int) is to make sure that any read or update from the main thread is propagated to the correct position in the distributed array. For example, if the main thread invokes

   x = M(i,j);

then the thread that contains the element must return the correct value to the main thread. On the other hand, if we call

   M(i,j) = x;

then it is the job of the (...) operator to make sure the element on the correct processor is updated. This problem is complicated because we cannot be sure which thread may be invoking this operator. If it is a thread for which the requested data reference lies in the same address space, there is no problem. However, if the address spaces differ, such as when the main thread invokes this operation on each worker, we have a problem. To see this difficulty, consider the following possible implementation.

   double dummy_buffer;

   double &Matrix::operator()(int i, int j){
      double *z;
      int not_local = 1;
      int p = NumProc();
      if((MyProc()*(rows/p) <= i) && (i < (MyProc()+1)*(rows/p))){
         // I have the desired row!
         z = &(data[i % (rows/p)][j]);
         not_local = 0;
      }
      else z = &dummy_buffer;
      pCxx_BroadcastBytes(not_local, sizeof(double), z);
      return *z;
   }

If each of the worker threads associated with the TEClass executes this operation, then the reference evaluation will be correct when called by the main thread only if the main thread shares its address space with one of the worker threads. (This is the case in the current version 1.0 of pC++.) However, in future versions this may not hold. There are two solutions to this problem. One solution is to introduce the CC++ global data type qualifier, so that special pointers and references can be created that can be passed between address spaces. We are strongly considering this for version 2.0. The other solution is to introduce more explicit member functions for read and write operations.

   TEClass Matrix{
      double **data;
      double &operator()(int, int);   // now private
    public:
      int rows, cols;
      Matrix(int n, int m, int p);
      void matMul(Matrix &A, Matrix &B);
      double read(int i, int j){ return (*this)(i,j); }
      void write(int i, int j, double value){ (*this)(i,j) = value; }
   };

This solution restricts the (...) operator so that it can be used only within the TEClass thread environment, and it allows only data values (rather than references or pointers) to be passed between address spaces.
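The collective semantics of such a read can be simulated sequentially. In the sketch below (illustrative only; the function name and data layout are assumptions, not pC++ API), every simulated thread evaluates the element reference, and only the thread owning global row *i* supplies the value, standing in for the broadcast step.

```cpp
#include <cassert>
#include <vector>

// blocks[proc][localRow][col] holds each simulated thread's block of rows.
// Every thread "executes" the reference; the owner of global row i fills in
// the value (mimicking BroadcastBytes), the others would use a dummy buffer.
double simulatedRead(const std::vector<std::vector<std::vector<double>>> &blocks,
                     int i, int j, int rowsPerProc)
{
    double value = 0.0;
    int p = (int)blocks.size();
    for (int proc = 0; proc < p; proc++) {           // each thread runs this
        bool local = (proc * rowsPerProc <= i) &&
                     (i < (proc + 1) * rowsPerProc); // block ownership test
        if (local)                                    // owner supplies value
            value = blocks[proc][i % rowsPerProc][j];
    }
    return value;                                     // all threads agree
}
```

With two threads each holding two rows, a read of global element (3, 1) correctly returns the value stored in thread 1's second local row.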

Mon Nov 21 09:49:54 EST 1994