Below we show the pure Java implementation of the DAXPY operation, and a Java implementation that uses a primitive from Level 1 BLAS:

pure Java:

```java
for (int i = 0; i < N; i++)
  y[i] += alpha * x[i];
```

Java + Level 1 BLAS:

```java
Blas.daxpy(N, alpha, x, 0, 1, y, 0, 1);
```

The inner-product Java implementation of matrix times vector and a Java implementation that uses a call to the Level 1 BLAS routine DDOT are shown below:

pure Java:

```java
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    b[i] += a[i][j] * x[j];
```

Java + Level 1 BLAS:

```java
for (int i = 0; i < N; i++)
  b[i] = Blas.ddot(N, a[i], 0, 1, x, 0, 1);
```

Finally, the product of two matrices can be computed using either the pure Java fragment shown below, or a similar Java fragment that uses a call to DAXPY:

pure Java:

```java
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      c[i][j] += a[i][k] * b[k][j];
```

Java + Level 1 BLAS:

```java
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    Blas.daxpy(N, a[i][k], b[k], 0, 1, c[i], 0, 1);
```

The performance of the fragments for matrix times vector and matrix times matrix
with Level 1 BLAS may be further improved by parallelizing
the outermost `i`-loop. In earlier work [1], we have
shown how loop parallelization can be expressed in Java by means
of multi-threading. In this manner the transformed
Java program remains portable (i.e. the parallelized version
still runs on uni-processors with only a slight overhead),
while the programming effort of parallelization is substantially smaller
than that of exploiting parallelism in a native language.
The versions with a parallel outermost loop are labeled `parallel'
in the subsequent figures.
The C versions of the three mathematical operations
are similar to the pure Java implementations.
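As a minimal sketch of the multi-threaded approach described above, the outermost `i`-loop of matrix times vector can be distributed over Java threads as follows; the class name, the block partitioning, and the thread count are illustrative assumptions, not the exact code from [1]:

```java
// Sketch: parallelizing the outermost i-loop of matrix times vector
// by giving each Java thread a disjoint block of rows. On a
// uni-processor the threads simply run interleaved, so the program
// stays portable; on a JVM with true parallel threads, the blocks
// execute concurrently.
public class ParMatVec {
    public static void matvec(final double[][] a, final double[] x,
                              final double[] b, int numThreads)
            throws InterruptedException {
        final int n = a.length;
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int lo = t * n / numThreads;        // first row of this block
            final int hi = (t + 1) * n / numThreads;  // one past the last row
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++) {
                    double sum = 0.0;
                    for (int j = 0; j < n; j++)
                        sum += a[i][j] * x[j];
                    b[i] = sum;  // row blocks are disjoint: no synchronization needed
                }
            });
            workers[t].start();
        }
        for (Thread w : workers)
            w.join();  // wait for all row blocks to finish
    }
}
```

Because each thread writes only its own rows of `b`, no locking is required, which keeps the overhead on a uni-processor slight.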

In figures 2-4 and
figures 5-7 we show the
execution times for varying values of *N* on the IBM
using the AIX4.2 JDK1.0.2B (with JITC)
and the AIX4.2 JDK1.1beta (without JITC), respectively.
Here we see that with JITC, providing only Level 1 BLAS primitives
already suffices to obtain performance that is close to the performance
of native C code. Moreover, because IBM's implementation
of the JVM supports the actual parallel execution
of threads, the parallel versions with native Level 1 BLAS
even outperform serial C code. Without JITC, however, the
operations still suffer from substantial overhead,
and here it would probably be desirable to provide
primitives from Level 2 BLAS as well.
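To illustrate what such a Level 2 BLAS primitive would encapsulate, here is a minimal pure-Java sketch of a DGEMV-style routine (computing y = alpha*A*x + beta*y); the class name and the simplified signature are assumptions for illustration, not the actual wrapper interface:

```java
public class Dgemv {
    // Simplified DGEMV-style primitive: y = alpha * A * x + beta * y
    // for a square n-by-n matrix. Offering this as a single (native)
    // call replaces the whole doubly nested loop, so the per-call
    // overhead is amortized over O(N^2) work instead of O(N) work
    // as with N separate Level 1 BLAS calls.
    public static void dgemv(int n, double alpha, double[][] a,
                             double[] x, double beta, double[] y) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += a[i][j] * x[j];
            y[i] = alpha * sum + beta * y[i];
        }
    }
}
```

With such a primitive, the matrix-times-vector fragment above reduces to one call per operation rather than N calls to DDOT.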

In figures 8-10 the execution times on the Sun using the Solaris 2.5 JDK1.0.2dp (with JITC) are shown. Now the performance is in all cases even slightly better than the performance of straightforward native C code. Evidently, however, Sun's implementation of the JVM does not support the actual parallel execution of threads.

In figures 11-13 the execution times on the SGI using the IRIX6.2 JDK1.0.2 (without JITC) are shown. Again, without JITC, the performance of the matrix times matrix operation is substantially less than the performance of compiled C code. Obviously, on this uni-processor no speedup can be expected from loop parallelization.