If all iterations of a loop are independent, i.e. if no data dependence is carried by the loop, then this loop can be converted into a DOALL-loop. Even if data dependences are carried by a loop, however, some parallelism may result from executing the loop as a DOACROSS-loop, where a partial execution of some (parts) of the iterations is enforced to satisfy the loop-carried data dependences.
Although there are several methods to enforce synchronization in a DOACROSS-loop (see e.g. [7, 8, 20][21, 22, 36]), in this paper we focus on random synchronization [35, p75-83,][37, p289-295,], where synchronization variables are implemented as bit-arrays that provide one bit for each iteration. Synchronization is enforced using non-blocking post-statements to set particular bits of synchronization variables and wait-statements to block until certain bits of these synchronization variables become set.
On shared-address space architectures, parallel loops can be executed using fork/join-like parallelism to start a number of threads that will execute the different iterations of the loop in parallel [36, 385-387,]. The way in which iterations are assigned to the threads is dependent on the scheduling policy [29, ch4,][35, p73-74,][36, p387-392,][37, 296-298,]. In a pre-scheduling policy, iterations are assigned statically to threads, for instance, in a block-wise or cyclic fashion (viz. figure 2). In a self-scheduling policy, threads enter a critical section to obtain a next chunk of iterations dynamically. Here, there is a clear trade-off between using small chunks to reduce the potential of load imbalancing, and using large chunks to limit synchronization overhead. A good comprise is to vary the chunk size dynamically. In guided self-scheduling, for example, 1/t of the remaining iterations are assigned to each next thread, where t denotes the total number of threads that are used to execute the parallel loop.
Figure 2: Pre-Scheduling Policies