Consider, for example, the following class SimpleLoop1 in which an instance method instanceMethod() is used to assign the inverse of the first N elements in a floating-point array b to corresponding elements in another floating-point array a:
Obviously, because no data dependences are carried by loop L1, the loop can be executed in a DOALL-like manner. Moreover, because the parameter N is only used as upper bound, the compiler does not have to deal with any variables that are declared outside the loop, but that are refered to within the loop-body. Exploiting implicit loop parallelism proceeds as follows.
First, the following instance method run_1() is constructed and added to the class SimpleLoop1 (see section 2.3.1):
Thereafter, the restructuring compiler constructs the following class LoopWorker_1 and adds this class to the transformed Java program (see section 2.3.2):
Finally, loop L1 in instanceMethod() is replaced by the following construct, where NUM=4 and SVARS=0. Because the same amount of work is done in each iteration, block-scheduling has been selected (see section 2.3.3, where adding `implements Schedules' to the transformed class enables the compiler to use the symbolic representation of a scheduling policy):
Figure 7: Single Loop in Instance Method
Now, suppose that a similar loop that operates on class variables appears in a class method classMethod() of a class SimpleLoop2:
In this case, a class method run_2() similar to the one shown above (but with a qualifier static) is added to the class SimpleLoop2, and the whole loop is replaced by a construct that is similar to the construct shown above (but without an argument this in the constructor invocation of LoopWorker_2). The loop-worker for the parallel loop L2 merely consists of a run()-method:
Figure 8: Single Loop in Class Method
In figures 7 and 8, we show the
serial execution time
and the parallel execution time
of the
methods instanceMethod() and classMethod()
for varying values of N. The execution time
of the parallel versions run with the
true parallel execution of threads
disabled is also shown.
These experiments indicate that on a uni-processor,
the overhead introduced by the loop parallelization
method presented in this paper is very small.
In figure 9, we show the speedup
of the true parallel versions. Here we see
that on the four processor IBM RS/6000 G30,
parallelization of this particular single loop
becomes useful if N exceeds 20,000.
An efficiency ranging from
up
to over
is obtained once N exceeds 220,000.
Figure 9: Speedup of Single Loops