Next: C++ and HPF Up: Programming with Collections. Previous: Example: Matrix-Matrix Multiply.

Example: Batcher's Bitonic Merge Sort. Ver 1.0+

Another classic example is the Batcher Bitonic Merge Sorting algorithm. A bitonic sequence is a string of values that can be seen as the concatenation of an increasing sequence followed by a decreasing sequence. The Botonic merge operates by dividing a bitonic sequence of length k into two sequences of length k/2. The item at position i is compared to the item at position i+k/2 and the smaller is kept at position i and the large is kept at position i+k/2. The results is two bitonic sequences, each of length k/2, with the property that every element of the left hand sequence is less than every element of the right hand sequence. (See Knuth for a proof.) We then recursively merge the two smaller sequences and we will eventually see the entire sequence sorted. But to make this work you need a bitonic sequence to start with. The way you make this happen is to start with sorted sequences of lenght r with one increasing and the other decreasing. When concatenated, they form a bitonic sequence of length 2r. Doing this with r=1, r=2, r=4 and so on you can build up larger bitonic sequences.

The code below works with collections of element where each element is a sub-sequence of smaller objects. The subsequences will be called Pvector objects, which as shown below, contain a vector of simple objects with an integer key.


class E{
  public:
   int key;
};

class Pvector{
 public:
   E part[BLKSIZE];
   int sortdir;
   void Qsort(int dir);
};
The Pvector class contains a member function Qsort which is used to sort the local vector of object in the direction indicated by an integer flag. There is also a field sordir which is used to record the last direction the list was sorted. Qsort is implemented with the standard C library quicksort. (See the TestSuite directory for the complete source code.)

The collection of Pvector objects is based on the DistributedArray collection library as snown below.


#include "kernel.h"
#include "distarray.h"

Collection SortedList: DistributedArray{
  public:
   SortedList(Distribution *D, Align *A);
   void merge(int distance);
   void sort();
  MethodOfElement:
   E tmp[BLKSIZE];
   virtual E part[BLKSIZE];
   int savedir;
   virtual void Qsort(int dir);
   int cond(int a, int b, int v1, int v2);
   void localSort(int flg);
   void grabFrom(int dist);
   void localMerge(int dist, int flg);
   void localCheck();
   // needed for dist. array.  not used here.
   double Dot(ElementType &arg1){ return 0.0; }
   ElementType &operator +=(ElementType &rhs){ return *this; }
};
Within the TEClass thread functions we have the two main components of the sorting algorithm sort() and merge() which accomplish the bitonic merge and bitonic sort. Sort is the toplevel function. Its main job is to set up a sequence of bitonic merges until the entire list is merged into one sorted list. This function begins by calling an element function localSort() which calls Qsort on each element. Local sort is designed so that if it is called with its parameter equal to 1, then it does the initial sort so that even indexed element are sorted in increasing order and odd numbers elements are sorted in decreasing order. This means that we have established bitonic sequences consisting of pairs of elements consisting of one even element and the following odd numbered element.

void SortedList::localSort(int flg){
 if(flg)
  Qsort(Ident % 2, 0, BLKSIZE-1);
 else
  Qsort(!savedir, 0, BLKSIZE-1);
}
The sort function is then very simple.

void SortedList::sort(){
   int i, p;
   this->localSort(1);
   for(i = 1; i < dim1size; i = 2*i)  merge(i);
   Barrier();
};
The merge(i) function is as described above. It is designed to merge bitonic sequences of length 2*i performing the exchange operation of distance i and then recursively merging the resulting subsequences.

void SortedList::merge(int dist){
   int flg;
   flg = 1;
   while(dist > 0){
        this->grabFrom(dist);
        this->localMerge(dist, flg);
        dist = dist/2;
        flg = 0;
       };
   this->localSort(0);
}
There are two parallel element functions invocations called in merge(). The first of this is a function grabFrom(dist) which, for each element, extracts a copy of the element a distance dist away into a local buffer tmp. Note that this copy of the data to a second array disqualifies this implementation from being a classical ``in place'' sort. We will return to this issue later. This is the function that does all the inter-element communication. Note that depending on which stage one is in the algorithm the sign of the offset alternates. That is, in the first stage even elements access the element to the right and odd elements access the element to the left. In the second stage, elements 0 and 1 access elements 2 and 3 respectively and visa versa.

void SortedList::grabFrom(int dist){
   int v,i, offset;
   E *T;
   void * addr1, *addr2;
   v = (Ident/dist) % 2;
   if(v ) offset = -dist;
   else   offset = dist;
   T = &((*ThisCollection)(Ident+offset)->part[0]);
   for(i = 0; i < BLKSIZE; i++) tmp[i] = T[i];
}
The localMerge() algorithm selects the values from the tmp array to keep in the local array. The most complex part of the Bitonic merge algorithm is keeping track of where one should generate an ascending sequence and where one must generate a descending sequence. During the initial ascending merge operations, the order must reverse at each step. While, for each recursive ``sub-merge'' the sordirection must be the same as on the previous step. The local field savedir records the previous direction of the merge and the parameter flg in the local merge routine below is used to distinguish the initial merge for a give value of dist.

void SortedList::localMerge(int dist, int flg){
   int v, v2, i;
   if(flg) savedir = v2 = (Ident/(2*dist)) % 2;
   else v2 = savedir;
   v = (Ident/dist) % 2;
   for(i = 0; i < BLKSIZE; i++){
       if(v == v2) && (part[i].key > tmp[i].key)) part[i] = tmp[i];
       if(v != v2) && (part[i].key <= tmp[i].key)) part[i] = tmp[i];
       }
};
The main program for this example is given below.

void Processor_Main()
{                    
  Processors P;      
  Distribution T(SIZE,&P,BLOCK);
  Align A(SIZE,"[ALIGN(V[i],T[i])]");
  SortedList<Pvector> X(&T,&A);      
  int i, n, error;                   
  double temps;                
  n = SIZE*BLKSIZE;
  X.sort();
}
This program will sort items in about 2.5 seconds on a 256 node CM-5 or Paragon. The standard unix C library qsort on a sparc 10 will complete the task in 25 seconds. Running the direct qsort on a single node of the CM-5 requires 104 seconds. Conseqently we see a speed-up of only about 40. This is cleary not the best that can be done. Part of the problem is the complexity of the Batcher Bitonic Sort ( vs. .



Next: C++ and HPF Up: Programming with Collections. Previous: Example: Matrix-Matrix Multiply.


beckman@cica.indiana.edu
Mon Nov 21 09:49:54 EST 1994