Another classic example is the Batcher Bitonic Merge Sorting algorithm.
A bitonic sequence is a string of values that can be seen as the concatenation
of an increasing sequence followed by a decreasing sequence.
The Botonic merge operates by dividing a bitonic sequence of length *k*
into two sequences of length *k/2*. The item at position *i* is compared
to the item at position *i+k/2* and the smaller is kept at position *i*
and the large is kept at position *i+k/2*. The results is two bitonic
sequences, each of length *k/2*, with the property that every element
of the left hand sequence is less than every element of the right hand
sequence. (See Knuth for a proof.) We then recursively merge the two
smaller sequences and we will eventually see the entire sequence sorted.
But to make this work you need a bitonic sequence to start with.
The way you make this happen is to start with sorted sequences of lenght
*r* with one increasing and the other decreasing. When concatenated,
they form a bitonic sequence of length *2r*. Doing this with *r=1*,
*r=2*, *r=4* and so on you can build up larger bitonic sequences.

The code below works with collections of element where each element
is a sub-sequence of smaller objects. The subsequences will be
called *Pvector* objects, which as shown below, contain a vector
of simple objects with an integer key.

The Pvector class contains a member function Qsort which is used to sort the local vector of object in the direction indicated by an integer flag. There is also a field sordir which is used to record the last direction the list was sorted. Qsort is implemented with the standard C library quicksort. (See the TestSuite directory for the complete source code.)class E{ public: int key; }; class Pvector{ public: E part[BLKSIZE]; int sortdir; void Qsort(int dir); };

The collection of Pvector objects is based on the DistributedArray collection library as snown below.

Within the TEClass thread functions we have the two main components of the sorting algorithm sort() and merge() which accomplish the bitonic merge and bitonic sort. Sort is the toplevel function. Its main job is to set up a sequence of bitonic merges until the entire list is merged into one sorted list. This function begins by calling an element function localSort() which calls Qsort on each element. Local sort is designed so that if it is called with its parameter equal to 1, then it does the initial sort so that even indexed element are sorted in increasing order and odd numbers elements are sorted in decreasing order. This means that we have established bitonic sequences consisting of pairs of elements consisting of one even element and the following odd numbered element.#include "kernel.h" #include "distarray.h" Collection SortedList: DistributedArray{ public: SortedList(Distribution *D, Align *A); void merge(int distance); void sort(); MethodOfElement: E tmp[BLKSIZE]; virtual E part[BLKSIZE]; int savedir; virtual void Qsort(int dir); int cond(int a, int b, int v1, int v2); void localSort(int flg); void grabFrom(int dist); void localMerge(int dist, int flg); void localCheck(); // needed for dist. array. not used here. double Dot(ElementType &arg1){ return 0.0; } ElementType &operator +=(ElementType &rhs){ return *this; } };

The sort function is then very simple.void SortedList::localSort(int flg){ if(flg) Qsort(Ident % 2, 0, BLKSIZE-1); else Qsort(!savedir, 0, BLKSIZE-1); }

The merge(i) function is as described above. It is designed to merge bitonic sequences of lengthvoid SortedList::sort(){ int i, p; this->localSort(1); for(i = 1; i < dim1size; i = 2*i) merge(i); Barrier(); };

There are two parallel element functions invocations called in merge(). The first of this is a function grabFrom(dist) which, for each element, extracts a copy of the element a distance dist away into a local buffer tmp. Note that this copy of the data to a second array disqualifies this implementation from being a classical ``in place'' sort. We will return to this issue later. This is the function that does all the inter-element communication. Note that depending on which stage one is in the algorithm the sign of the offset alternates. That is, in the first stage even elements access the element to the right and odd elements access the element to the left. In the second stage, elements 0 and 1 access elements 2 and 3 respectively and visa versa.void SortedList::merge(int dist){ int flg; flg = 1; while(dist > 0){ this->grabFrom(dist); this->localMerge(dist, flg); dist = dist/2; flg = 0; }; this->localSort(0); }

The localMerge() algorithm selects the values from the tmp array to keep in the local array. The most complex part of the Bitonic merge algorithm is keeping track of where one should generate an ascending sequence and where one must generate a descending sequence. During the initial ascending merge operations, the order must reverse at each step. While, for each recursive ``sub-merge'' the sordirection must be the same as on the previous step. The local field savedir records the previous direction of the merge and the parameter flg in the local merge routine below is used to distinguish the initial merge for a give value of dist.void SortedList::grabFrom(int dist){ int v,i, offset; E *T; void * addr1, *addr2; v = (Ident/dist) % 2; if(v ) offset = -dist; else offset = dist; T = &((*ThisCollection)(Ident+offset)->part[0]); for(i = 0; i < BLKSIZE; i++) tmp[i] = T[i]; }

The main program for this example is given below.void SortedList::localMerge(int dist, int flg){ int v, v2, i; if(flg) savedir = v2 = (Ident/(2*dist)) % 2; else v2 = savedir; v = (Ident/dist) % 2; for(i = 0; i < BLKSIZE; i++){ if(v == v2) && (part[i].key > tmp[i].key)) part[i] = tmp[i]; if(v != v2) && (part[i].key <= tmp[i].key)) part[i] = tmp[i]; } };

This program will sort items in about 2.5 seconds on a 256 node CM-5 or Paragon. The standard unix C library qsort on a sparc 10 will complete the task in 25 seconds. Running the direct qsort on a single node of the CM-5 requires 104 seconds. Conseqently we see a speed-up of only about 40. This is cleary not the best that can be done. Part of the problem is the complexity of the Batcher Bitonic Sort ( vs. .void Processor_Main() { Processors P; Distribution T(SIZE,&P,BLOCK); Align A(SIZE,"[ALIGN(V[i],T[i])]"); SortedList<Pvector> X(&T,&A); int i, n, error; double temps; n = SIZE*BLKSIZE; X.sort(); }

Mon Nov 21 09:49:54 EST 1994