
Some Mathematical Operations

In the first series of experiments, we compare the performance of a pure Java implementation, a Java implementation that uses native Level 1 BLAS (and possibly parallelism by means of Java multi-threading), and a compiled C implementation of a $\Theta(N)$ DAXPY operation, a $\Theta(N^2)$ matrix times vector operation, and a $\Theta(N^3)$ matrix times matrix operation.

Below we show the pure Java implementation of the DAXPY operation ($\vec{y} \leftarrow \alpha \vec{x} + \vec{y}$), and the Java implementation that uses a primitive from Level 1 BLAS:

pure Java:

  for (int i = 0; i < N; i++)
    y[i] += alpha * x[i];

Java + Level 1 BLAS:

  Blas.daxpy(N, alpha, x, 0, 1, y, 0, 1);
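
The Blas class used in these fragments is a thin Java wrapper around the native Level 1 BLAS routines, where each call presumably passes the vector length, a scalar, and for each array an offset and an increment. As a rough sketch of such a wrapper (the exact method signatures and the library name are assumptions for illustration, not necessarily the interface used in our experiments), the class could be declared along the following lines:

  public final class Blas {
    static {
      // Load the shared library that contains the native BLAS stubs
      // (the library name "nblas" is hypothetical).
      System.loadLibrary("nblas");
    }
    // y[yoff + i*incy] += alpha * x[xoff + i*incx], for i = 0, ..., n-1
    public static native void daxpy(int n, double alpha,
                                    double[] x, int xoff, int incx,
                                    double[] y, int yoff, int incy);
    // returns the inner product of x and y
    public static native double ddot(int n,
                                     double[] x, int xoff, int incx,
                                     double[] y, int yoff, int incy);
  }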

The pure Java inner-product implementation of matrix times vector and a Java implementation that uses a call to the Level 1 BLAS DDOT ($d \leftarrow \vec{x}\,^T \vec{y}$) are shown below:

pure Java:

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      b[i] += a[i][j] * x[j];

Java + Level 1 BLAS:

  for (int i = 0; i < N; i++)
    b[i] = Blas.ddot(N, a[i], 0, 1, x, 0, 1);

Finally, the product of two matrices can be computed using either the pure Java fragment shown below, or a similar Java fragment that uses a call to DAXPY:

pure Java:

  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        c[i][j] += a[i][k] * b[k][j];

Java + Level 1 BLAS:

  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      Blas.daxpy(N, a[i][k], b[k], 0, 1, c[i], 0, 1);

The performance of the fragments for matrix times vector and matrix times matrix with Level 1 BLAS may be further improved by parallelizing the outermost i-loop. In earlier work [1], we have shown how loop parallelization can be expressed in Java by means of multi-threading; a sketch of such a parallelized loop is shown below. In this manner the transformed Java program remains portable (i.e. the parallelized version still runs on uni-processors with only a slight overhead), while the programming effort of the parallelization is reduced substantially compared with exploiting parallelism in a native language. The versions with a parallel outermost loop are labeled `parallel' in the subsequent figures. The C versions of the three mathematical operations are similar to the pure Java implementations.
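
As a rough illustration only, the outermost i-loop of the matrix times matrix fragment with Level 1 BLAS could be parallelized over P Java threads as sketched below; the class names, the block-wise distribution of iterations, and the parameter P are assumptions for illustration and not necessarily the scheme used in [1]:

  // Each worker computes a contiguous block of rows of c = a * b,
  // reusing the Level 1 BLAS call for the innermost loop.
  class MatMulWorker extends Thread {
    double[][] a, b, c;
    int lo, hi, N;
    MatMulWorker(double[][] a, double[][] b, double[][] c,
                 int lo, int hi, int N) {
      this.a = a; this.b = b; this.c = c;
      this.lo = lo; this.hi = hi; this.N = N;
    }
    public void run() {
      for (int i = lo; i < hi; i++)
        for (int k = 0; k < N; k++)
          Blas.daxpy(N, a[i][k], b[k], 0, 1, c[i], 0, 1);
    }
  }

  class ParallelMatMul {
    // Distribute the i-loop block-wise over P threads and wait for all of them.
    static void matmul(double[][] a, double[][] b, double[][] c, int N, int P)
        throws InterruptedException {
      MatMulWorker[] workers = new MatMulWorker[P];
      for (int t = 0; t < P; t++) {
        workers[t] = new MatMulWorker(a, b, c, (t * N) / P, ((t + 1) * N) / P, N);
        workers[t].start();
      }
      for (int t = 0; t < P; t++)
        workers[t].join();
    }
  }

On a uni-processor the same program simply runs all threads one after another, which is why the parallel versions remain portable.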

In figures 2-4 and figures 5-7 we show the execution times for varying values of N on the IBM using the AIX4.2 JDK1.0.2B (with JITC) and the AIX4.2 JDK1.1beta (without JITC), respectively. Here we see that with JITC, providing native Level 1 BLAS primitives alone suffices to obtain performance that is close to the performance of native C code. Moreover, because IBM's implementation of the JVM supports the actual parallel execution of threads, the parallel versions with native Level 1 BLAS even outperform serial C code. Without JITC, however, the $\Theta(N^3)$ operation still suffers from much overhead, and here it would probably be desirable to provide primitives from Level 2 BLAS as well.

In figures 8-10 the execution times on the Sun using the Solaris 2.5 JDK1.0.2dp (with JITC) are shown. Now, in all cases the performance is even slightly better than the performance of straightforward native C code. Because the parallel versions yield no additional speedup, Sun's implementation of the JVM evidently does not support the actual parallel execution of threads.

In figures 11-13 the execution times on the SGI using the IRIX6.2 JDK1.0.2 (without JITC) are shown. Again, without JITC, the performance of the $\Theta(N^3)$ matrix times matrix operation is substantially less than the performance of compiled C code. Obviously, on this uni-processor, no speedup can be expected from loop parallelization.

