next up previous
Next: Tree Traversals Up: Experiments Previous: Matrix Multiplication

Random Synchronization

  Consider the following imperfectly nested loop, in which a static loop-carried flow dependence tex2html_wrap_inline2446 with distance 8 holds (i.e. for tex2html_wrap_inline2450 , statement instance tex2html_wrap_inline2452 depends on statement instance tex2html_wrap_inline2454 ):

tabular678

Straightforward parallelization of the i-loop is only valid if the loop is converted into a DOACROSS-loop and one synchronization variable is used to enforce all data dependences, i.e. SVARS = 1. Parallelization of the i-loop proceeds as explained earlier, where the class method run_x() that is added to class SingleDependence is shown below:

tabular687

However, another way to obtain parallelism in the loop shown above is to first apply loop-distribution [29, 35, 36, 37] to the i-loop, which is valid because all data dependences are lexically forward. Thereafter, the second resulting i-loop can be converted into a DOALL-loop. For small values of N, parallelization of the other i-loop or the j-loop is not likely to be useful:

tabular697

In figure 15, we show the execution time of the DOACROSS- and the DOALL-loop for varying values of N. For the former loop, we present the execution time for block-scheduling and cyclic-scheduling in case the wait()/notifyAll()-implementation of random synchronization described in section 2.2.4 is used. Here we see that the overhead of random synchronization is substantial, and that the loop becomes effectively serialized if block-scheduling is used. In this case, it is more efficient to use loop distribution to enable operations on array b to be executed in a DOALL-like manner.

   figure1460
Figure 15: Random Synchronization

Unfortunately, data dependences cannot always be dealt with so easily. For example, in the example shown above, there is a data dependence cycle, caused by the static flow dependences tex2html_wrap_inline2446 and tex2html_wrap_inline2458 having distance 8 and 9 respectively:

tabular713

In this case, the only way to exploit parallelism in the outermost loop is to execute the i-loop in a DOACROSS-like manner using two synchronization variables, i.e. SVARS = 2. The run_y() method that can be used for this purpose is shown below:

tabular720

   figure1472
Figure 16: Random Synchronization

In figure 16, we show the execution time of the serial and parallel version of the previous i-loop using a wait()/notifyAll()- and a busy-waiting-implementation of random synchronization. Here we see that the latter is clearly superior. Starting two additional threads (viz. t=6) for the former implementation in an attempt to exploit available processors time while threads are waiting fails due to the overhead involved.


next up previous
Next: Tree Traversals Up: Experiments Previous: Matrix Multiplication

ajcbik@extreme.indiana.edu