To demonstrate the usefulness of the different scheduling policies, we have conducted some experiments with loop parallelization in the following class Matmat:
The most straightforward way to parallelize this implementation on a shared-address-space architecture is to convert the outermost i-loop into a parallel loop. In figure 11, we show the execution time of the original serial loop and the parallel loop for a varying number of threads and the three different scheduling policies discussed in this paper. Because work is spread evenly over the iterations, the scheduling policies have similar performance.
Figure 11: Matrix Multiplication
Now, suppose that the array a is used to store a lower triangular matrix, so that the the whole loop can be expressed as follows:
Figure 12: Triangular Matrix Multiplication
Because in this loop the amount of work is not spread evenly over the iterations, block scheduling suffers from some load imbalancing, as can be seen in figure 12, This load imbalancing, however, can be alleviated by allocating a few additional loop-workers.
Now, suppose that only a few rows of the matrix stored c have to be computed. As illustrated below, this can be accomplished by means of a boolean array filter:
Figure 13: Filtered Matrix Multiplication (even rows)
If, for example, only the even rows of c have to be computed, a severe load imbalancing may result using cyclic scheduling, as illustrated in figure 13. If, as another example, only the first M / 2 elements are set, guided self-scheduling suffers from a similar load imbalancing, as can be seen in figure 14. These experiments indicate that, in general, no decisive statement about which scheduling policy is the best can be made.
Figure 14: Filtered Matrix Multiplication (first M/2 rows)