PSTL Parallel Algorithms

As in the STL, the PSTL algorithms are generic -- the arguments to the algorithms are not the containers themselves, but iterators that access container elements. This approach has several advantages. First, the same algorithm can be used for any container that provides an iterator of sufficient capability. For example, an algorithm that applies a function to each element can be called for any container whose iterator supports the increment (++) operation. Second, algorithms can easily be applied to subranges of elements by passing as arguments the iterators that mark the beginning and one past the end of the subrange. In addition, through user-defined iterator adaptors, subgroups of elements, such as odd- or even-indexed elements or elements with values above a certain threshold, can be accessed.
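
As a purely standard (non-PSTL) illustration of the subrange idea, the following sketch negates only elements 2 through 5 of a vector by passing iterators that bracket the subrange:

#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

int main() {
   std::vector<int> v;
   for (int i = 0; i < 8; ++i)
      v.push_back(i);

   // negate only the subrange [v.begin()+2, v.begin()+6)
   std::transform(v.begin() + 2, v.begin() + 6,
                  v.begin() + 2, std::negate<int>());

   for (int i = 0; i < 8; ++i)
      std::cout << v[i] << ' ';   // prints: 0 1 -2 -3 -4 -5 6 7
   std::cout << '\n';
   return 0;
}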

There are three types of parallel algorithms in the PSTL:

The algorithms in the first group retain their STL names, but accept parallel iterator arguments in place of local iterators. These new algorithms are collective and have parallel semantics.

The algorithms in the second group are also versions of the STL algorithms, but with par_ prepended to their names. When invoked with parallel iterators, these algorithms are semantically equivalent to those in the first group (and are collective). When invoked with local iterators, they are not collective: execution is local to the calling context, but has parallel semantics (so a loop may be parallelized, for example).

The differentiation between the versions of these algorithms is accomplished using iterator tags. The body of each algorithm consists of a call to another function with an added argument: iterator_category(first). The local or parallel iterator version of the function is called, depending on the iterator tag. Note that this differentiation is done at compile time using function templates.
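
To make the mechanism concrete, here is a minimal, self-contained sketch of tag dispatch using the standard iterator tags; the PSTL applies the same technique with its own local and parallel iterator tags (whose exact names are not shown here):

#include <iostream>
#include <iterator>
#include <list>
#include <vector>

// overload selected for forward (and, by inheritance, bidirectional)
// iterators
template <class Iterator>
void describe(Iterator, std::forward_iterator_tag) {
   std::cout << "forward-iterator version selected\n";
}

// overload selected for random-access iterators
template <class Iterator>
void describe(Iterator, std::random_access_iterator_tag) {
   std::cout << "random-access version selected\n";
}

// public entry point: the added tag argument selects an overload
// at compile time
template <class Iterator>
void describe(Iterator it) {
   typedef typename std::iterator_traits<Iterator>::iterator_category tag;
   describe(it, tag());
}

int main() {
   std::vector<int> v(3);
   std::list<int> l(3);
   describe(v.begin());   // random-access version
   describe(l.begin());   // forward-iterator version
   return 0;
}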

The following example illustrates these algorithm versions.

// create a distributed vector of size 100 and
// initialize each element to 5.0
distributed_vector<double> myVec(100, 5.0);

// sum the local elements in this context 
double sum1 = accumulate(myVec.begin(), myVec.end(), 
                         0.0);

// sum all elements - collective operation
double sum2 = accumulate(myVec.parbegin(), 
                         myVec.parend(), 0.0);

// sum the local elements in this context 
//   using single-context parallelism
double sum3 = par_accumulate(myVec.begin(), 
                             myVec.end(), 0.0);

// sum all elements - collective operation
double sum4 = par_accumulate(myVec.parbegin(), 
                             myVec.parend(), 0.0);

In this example, let us assume that 3 contexts are involved. We first create a distributed vector with 100 elements of type double. Since the default distribution is used (no ContainerRatio object is specified in the constructor), 33 elements are stored in each of the first two contexts and 34 in the last. Each element is initialized to 5.0.

The STL algorithm accumulate is declared as:

template <class Iterator, class T>
T accumulate(Iterator first, Iterator last, T initial);

The algorithm sums the elements in the iteration space [first, last), using initial as the initial value.

The first call to accumulate is a non-collective operation and results in the invocation of the STL version of accumulate in the calling context. It need not be invoked in every context, but in this example it is; each invocation is completely independent of the other contexts. In contexts 0 and 1, the value of sum1 is 165.0 (5.0 * 33). In context 2, the value is 170.0 (5.0 * 34) because that context holds one extra element.

The second accumulate call is differentiated from the first by the use of parallel iterators. It is a collective operation and must be called in all contexts. The local elements are summed in each context and then a global sum is computed using a reduction operation. In all contexts, sum2 is 500.0.
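
This pattern (a per-context local sum combined by a global reduction) is familiar from message-passing codes. As an analogy only -- the PSTL is built on the HPC++ runtime, not MPI -- the same computation written directly against MPI might look like this:

#include <mpi.h>

// Each context sums its local elements; MPI_Allreduce then combines
// the per-context sums so that every context receives the same
// global total (500.0 in the example above).
double collective_sum(const double* local, int n) {
   double local_sum = 0.0;
   for (int i = 0; i < n; ++i)
      local_sum += local[i];

   double global_sum = 0.0;
   MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                 MPI_SUM, MPI_COMM_WORLD);
   return global_sum;
}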

Because it is invoked with local iterators, the first call to par_accumulate is a non-collective operation. Unlike accumulate, however, par_accumulate with local iterators uses single-context parallelism, if available, to compute the local sum. Lightweight threads within the context accumulate partial sums of the local elements, and these partial sums are then combined to produce the total for the context. The value of sum3 in each context is therefore the same as the value of sum1; the two calls differ only in the processing involved. Finally, the second call to par_accumulate is equivalent to the second call to accumulate, since the behavior of the two algorithms is the same when they are called with parallel iterators.
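
The partial-sum strategy can be sketched with standard C++ threads; this is only an approximation of the idea, since the PSTL uses its own lightweight threads rather than std::thread:

#include <numeric>
#include <thread>
#include <vector>

// Each thread sums one chunk of the local elements into its own slot
// of partial; the chunk totals are then combined into the local total.
// Assumes nthreads >= 1.
double threaded_local_sum(const std::vector<double>& v, int nthreads) {
   std::vector<double> partial(nthreads, 0.0);
   std::vector<std::thread> workers;
   std::size_t chunk = v.size() / nthreads;

   for (int t = 0; t < nthreads; ++t) {
      std::size_t lo = t * chunk;
      std::size_t hi = (t == nthreads - 1) ? v.size() : lo + chunk;
      workers.push_back(std::thread([&v, &partial, lo, hi, t] {
         partial[t] = std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
      }));
   }
   for (std::size_t t = 0; t < workers.size(); ++t)
      workers[t].join();

   return std::accumulate(partial.begin(), partial.end(), 0.0);
}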

The third group of algorithms consists of special par_ algorithms; those defined thus far include par_apply, par_scan, and par_reduce. These are collective operations with parallel semantics.

As is the case for all PSTL algorithms, the main arguments to these algorithms are iterators which access container elements. When invoked with local iterators, the algorithms are non-collective and use single-context parallelism. When invoked with parallel iterators, the operations are collective and have parallel semantics across contexts. The function object may change only the value of the elements in the space defined by the first two iterators in the argument list. Changes made to elements accessed via any other parallel iterator argument are lost since the function is applied to a temporary local copy of those elements.
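
For instance, a hypothetical function object like the following would not behave as a swap when passed to one of these algorithms: the write to m persists, but the write to n lands in a temporary local copy of the second space's element and is discarded:

// hypothetical example only
class broken_swap {
public:
   void operator()(int& m, int& n) {
      int t = m;
      m = n;
      n = t;   // lost: n is bound to a temporary local copy
   }
};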

 
Figure 1: Example of par_apply usage 

For example, the declaration of par_apply for a binary function object is:

template <class ForwardIterator1,
          class ForwardIterator2,
          class BinaryOperation> 
void par_apply(ForwardIterator1 begin1,
               ForwardIterator1 end1,
               ForwardIterator2 begin2,
               BinaryOperation binop);
    
Here, the first iteration space is defined by [begin1, end1). If this iteration space contains k elements, then the second iteration space consists of [begin2, begin2 + k). In applying the binary function, each element in the first iteration space is paired with the corresponding element in the second iteration space.

Consider the following example.

class add_to {
public:
   void operator()(int& m, int n) { m += n; }
};

// create vectors
distributed_vector<int> x(15);
distributed_vector<int> y(10);

// initialize element values
do_initialization(x.parbegin(),x.parend());
do_more_initialization(y.parbegin(),y.parend());

// add the first 10 elements of y to the
//  corresponding elements in x
par_apply(x.parbegin(),x.parbegin()+10,y.parbegin(),
          add_to());
    

Figure 1 shows the two vectors, with the elements from the iteration spaces shaded in gray. The processing within par_apply in each context is as follows:

  1. The iteration space [x.parbegin(),x.parbegin()+10) is restricted to its local elements. As shown, in context 0, seven elements are local and in context 1, three elements are local.
  2. The portion of [y.parbegin(),y.parbegin()+10) which corresponds to these local elements is identified (the index arithmetic is sketched after this list). In context 0, this begins at y.parbegin(), while in context 1 it begins at y.parbegin()+7.
  3. The operator() method of add_to is applied to each pair of elements in the iteration spaces identified in steps 1 and 2. Paired elements are connected by arrows in the figure. Note that all of the elements in the first iteration space are local, while in the second space elements may be local or remote. In the case of remote elements, a remote fetch is done to get a local copy for the operation. Step 3 is done in parallel if single-context parallelism is supported.
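
The index arithmetic behind steps 1 and 2 can be checked with a small self-contained sketch (the variable names are hypothetical; real PSTL internals differ). For context 1 in the figure, which owns x[7..14], it recovers the three local elements and the y.parbegin()+7 starting point described above:

#include <algorithm>
#include <cstddef>
#include <iostream>

int main() {
   std::size_t first = 0, last = 10;   // [x.parbegin(), x.parbegin()+10)
   std::size_t lo = 7, hi = 15;        // context 1 owns x[7..14]

   // step 1: restrict the iteration space to the local elements
   std::size_t local_first = std::max(first, lo);   // 7
   std::size_t local_last  = std::min(last, hi);    // 10

   // step 2: the matching slice of the second space starts at the
   // same offset from its beginning: y.parbegin()+7 here
   std::size_t y_offset = local_first - first;      // 7

   std::cout << "local elements: x[" << local_first << ".."
             << local_last - 1 << "], paired starting at y.parbegin()+"
             << y_offset << "\n";
   return 0;
}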

