Below we show the pure Java implementation of the DAXPY operation (y := y + alpha * x), and the Java implementation that uses a primitive from Level 1 BLAS:
pure Java:

    for (int i = 0; i < N; i++)
      y[i] += alpha * x[i];

Java + Level 1 BLAS:

    Blas.daxpy(N, alpha, x, 0, 1, y, 0, 1);
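For completeness, a minimal sketch of how the Blas wrapper class used above might be declared is given below; the method signatures are inferred from the calls shown in this section, whereas the binding through native methods and the name of the native library are assumptions:

    // Sketch of a Java wrapper for Level 1 BLAS; the native library name is assumed.
    public class Blas {
      static { System.loadLibrary("blas"); }          // load the native BLAS binding

      // daxpy: y[oy + i*incy] += alpha * x[ox + i*incx], for i = 0, ..., n-1
      public static native void daxpy(int n, double alpha,
                                      double[] x, int ox, int incx,
                                      double[] y, int oy, int incy);

      // ddot: returns the dot product of n elements of x and y
      public static native double ddot(int n,
                                       double[] x, int ox, int incx,
                                       double[] y, int oy, int incy);
    }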
The inner-product Java implementation of matrix times vector and a Java implementation that uses a call to the Level 1 BLAS routine DDOT (the dot product of two vectors) are shown below:
pure Java:

    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        b[i] += a[i][j] * x[j];

Java + Level 1 BLAS:

    for (int i = 0; i < N; i++)
      b[i] = Blas.ddot(N, a[i], 0, 1, x, 0, 1);
Finally, the product of two matrices can be computed using either the pure Java fragment shown below, or a similar Java fragment that uses a call to DAXPY:
pure Java:

    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
          c[i][j] += a[i][k] * b[k][j];

Java + Level 1 BLAS:

    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        Blas.daxpy(N, a[i][k], b[k], 0, 1, c[i], 0, 1);
The performance of the fragments for matrix times vector and matrix times matrix with Level 1 BLAS can be further improved by parallelizing the outermost i-loop. In earlier work [1], we have shown how loop parallelization can be expressed in Java by means of multi-threading. In this manner the transformed Java program remains portable (i.e. the parallelized version still runs on uni-processors with only a slight overhead), while the programming effort of parallelization is reduced substantially compared with exploiting parallelism in a native language. The versions with a parallel outermost loop are labeled `parallel' in the subsequent figures. The C versions of the three mathematical operations are similar to the pure Java implementations.
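As an illustration, the outermost i-loop of the matrix times matrix fragment could be parallelized with Java threads roughly as sketched below; the block distribution and the number of threads are illustrative assumptions and do not reproduce the exact scheme of [1]:

    // Illustrative worker thread: computes rows lo..hi-1 of c = a * b using daxpy.
    class MatMulWorker extends Thread {
      double[][] a, b, c; int lo, hi, n;
      MatMulWorker(double[][] a, double[][] b, double[][] c, int lo, int hi, int n) {
        this.a = a; this.b = b; this.c = c; this.lo = lo; this.hi = hi; this.n = n;
      }
      public void run() {
        for (int i = lo; i < hi; i++)
          for (int k = 0; k < n; k++)
            Blas.daxpy(n, a[i][k], b[k], 0, 1, c[i], 0, 1);
      }
    }

    // Launch P workers over a block distribution of the i-loop and wait for all
    // (the enclosing method must handle or declare InterruptedException).
    int P = 4;                                  // assumed number of worker threads
    Thread[] workers = new Thread[P];
    for (int t = 0; t < P; t++) {
      workers[t] = new MatMulWorker(a, b, c, t * N / P, (t + 1) * N / P, N);
      workers[t].start();
    }
    for (int t = 0; t < P; t++)
      workers[t].join();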
In figures 2-4 and
figures 5-7 we show the
execution times for varying values of N on the IBM
using the AIX4.2 JDK1.0.2B (with JITC)
and the AIX4.2 JDK1.1beta (without JITC), respectively.
Here we see that with JITC, providing only Level 1 BLAS primitives already suffices to obtain performance that is close to the performance of native C code. Moreover, because IBM's implementation of the JVM supports the actual parallel execution of threads, the parallel versions with native Level 1 BLAS even outperform serial C code. Without JITC, however, these operations still suffer from much overhead, and here it would probably be desirable to provide primitives from Level 2 BLAS as well.
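To illustrate, the matrix times vector fragment could then be mapped onto a single call to the Level 2 routine DGEMV (b := alpha * A * x + beta * b); the Java signature below is a hypothetical sketch of such a primitive, not one provided in this paper:

    // Hypothetical Level 2 BLAS wrapper (assumed signature for a row-wise Java matrix):
    // public static native void dgemv(int n, double alpha, double[][] a,
    //                                 double[] x, int incx,
    //                                 double beta, double[] b, int incy);

    Blas.dgemv(N, 1.0, a, x, 1, 0.0, b, 1);   // replaces both loops of the matvec fragment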
In figures 8-10 the execution times on the Sun using the Solaris 2.5 JDK1.0.2dp (with JITC) are shown. In all cases the performance is now even slightly better than that of straightforward native C code. Obviously, Sun's implementation of the JVM does not support the actual parallel execution of threads.
In figures 11-13 the execution times on the SGI using the IRIX6.2 JDK1.0.2 (without JITC) are shown. Again, without JITC, the performance of the matrix times matrix operation is substantially lower than that of compiled C code. Obviously, on this uni-processor, no speedup can be expected from loop parallelization.