In designing a distributed event system, it is desirable to have a programming-language-independent solution, since the Grid contains heterogeneous applications written in various languages. This means that the representation of the data on the wire also needs to maximize portability. It is widely accepted that XML is an ideal choice for platform- and language-independent representation of data and, by extension, SOAP is preferred for invoking remote procedure calls (RPCs). The Web services community has also adopted this approach.
However, JMS does not specify how events are represented on the wire; its primary intention is to provide a standard messaging API. As a result, interoperability between different JMS providers is not ensured.
In spite of these features, interoperability is lacking because of the use of a custom binary protocol, which other messaging systems would have to adopt in order to be compatible with it.
Nonetheless, the GMA is interested in events for performance analysis and problem monitoring, and its requirements differ from those of application-level events.
SoapRMI Events is our previous version of an application-level messaging system for distributed environments. SOAP is used as the format for specifying the events and invoking the RPCs. An event is uniquely identified by its namespace and type. A source field indicates the originating point of the event. A timestamp is also present, along with a message field for including additional information. A listener can provide a handback field to the publisher, which sets this field when returning events to that listener.
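To make the event fields concrete, the following is a minimal sketch of such an event record in Python. It is an illustration only: the class name `Event`, the method names, and the flat XML layout are assumptions, and the output is a plain XML fragment rather than the actual SOAP encoding used by SoapRMI Events.

```python
import time
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record holding the fields described in the text."""
    namespace: str
    type: str
    source: str
    timestamp: float = field(default_factory=time.time)
    message: str = ""
    handback: Optional[str] = None  # supplied by the listener, echoed back

    def key(self):
        # An event is identified by its namespace and type.
        return (self.namespace, self.type)

    def to_xml(self) -> str:
        # Assumed flat layout; a real SOAP message would wrap this in an
        # Envelope/Body and use XML namespaces.
        root = ET.Element("event")
        for name in ("namespace", "type", "source",
                     "timestamp", "message", "handback"):
            value = getattr(self, name)
            if value is not None:
                ET.SubElement(root, name).text = str(value)
        return ET.tostring(root, encoding="unicode")
```

Leaving `handback` unset omits the element entirely, which mirrors the fact that the field is optional and only meaningful once a listener has supplied it.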
An event publisher and an event listener, which produce and consume events respectively, are defined. An event publisher can generate an event and publish it simply by writing a preformatted SOAP string to the socket layer. This gives it a light footprint and makes it useful for embedded systems and resource monitors, in addition to regular Grid applications.
One of the problems with the publisher is the handling of exceptions when an event cannot be sent because of a network glitch or because the listener is temporarily unreachable. An exception is thrown by the publisher and the application has to handle it. This puts an additional burden on the application programmer to handle failure cases. In addition, the event send is a blocking call, to enable synchronous failure notification, and this overhead becomes significant when many events are sent. We have overcome these limitations by using a non-blocking send in a separate thread of execution, coupled with a logging mechanism that allows failed sends to be retried. We use an intermediary known as a publisher agent for this purpose.
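The idea of a non-blocking send backed by a retry log can be sketched as follows. This is a simplified illustration, not the actual publisher agent: the class name `PublisherAgent`, its parameters, and the use of an in-memory queue as the log are assumptions (a real agent would more likely log events to persistent storage), and `send` stands for whatever blocking transport call the application already has.

```python
import queue
import threading
import time


class PublisherAgent:
    """Hypothetical agent: queues events and sends them on a worker thread."""

    def __init__(self, send, max_attempts=3, retry_delay=0.01):
        self._send = send                  # blocking send, may raise OSError
        self._max_attempts = max_attempts
        self._retry_delay = retry_delay
        self._pending = queue.Queue()      # in-memory stand-in for the log
        self.failed_log = []               # events that exhausted all retries
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def publish(self, event):
        # Returns immediately; the application never sees a send exception.
        self._pending.put(event)

    def _run(self):
        while True:
            event = self._pending.get()
            try:
                if event is None:          # shutdown marker
                    return
                self._deliver(event)
            finally:
                self._pending.task_done()

    def _deliver(self, event):
        for attempt in range(1, self._max_attempts + 1):
            try:
                self._send(event)
                return
            except OSError:
                if attempt < self._max_attempts:
                    time.sleep(self._retry_delay)
        self.failed_log.append(event)      # keep for a later retry pass

    def flush(self):
        # Block until every queued event has been sent or logged as failed.
        self._pending.join()

    def close(self):
        self._pending.put(None)
        self._worker.join()
```

The application calls `publish` and continues; failures are absorbed by the agent, which either succeeds on a later attempt or records the event in `failed_log` instead of raising.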
Retrieving events in SoapRMI Events is based on leasing. An event listener subscribes to a publisher with its remote reference, the lease duration, and the type of events it is interested in. When the publisher has an event of a particular type to publish, it uses the remote references of the listeners registered for that type to send the event to each of them. The leasing approach is used to reclaim resources in case a listener fails to unsubscribe. An automatic lease renewal system, which keeps the lease alive for as long as the application exists, is also available.
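The lease bookkeeping described above might look roughly like the sketch below. The names `LeasingPublisher`, `subscribe`, `renew`, and `listeners_for` are hypothetical, and the injectable `clock` parameter is an assumption made so that expiry can be exercised without waiting.

```python
import time


class LeasingPublisher:
    """Hypothetical lease table keyed by (listener reference, event type)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}  # (listener_ref, event_type) -> expiry time

    def subscribe(self, listener_ref, event_type, duration):
        self._leases[(listener_ref, event_type)] = self._clock() + duration

    def renew(self, listener_ref, event_type, duration):
        # In this sketch, renewing simply restarts the lease from now.
        self.subscribe(listener_ref, event_type, duration)

    def unsubscribe(self, listener_ref, event_type):
        self._leases.pop((listener_ref, event_type), None)

    def listeners_for(self, event_type):
        # Reclaim expired leases first, so that listeners which never
        # unsubscribed do not hold resources indefinitely.
        now = self._clock()
        for key in [k for k, expiry in self._leases.items() if expiry <= now]:
            del self._leases[key]
        return sorted(ref for (ref, etype) in self._leases
                      if etype == event_type)
```

An automatic renewal facility would amount to calling `renew` periodically, on the listener's behalf, before each lease expires.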
This approach is elegant and works well on a small LAN but does not scale effectively to a large Internet environment. The problems are due to two factors: (1) not all clients have static or globally accessible IP addresses, and (2) network and server outages are fairly common in a large network.
Every listener's remote reference is a URL that includes the IP address of the host on which it is running. These are typically dynamic IP addresses and, in combination with firewalls, are inaccessible from outside the local network. A listener with such a remote reference therefore cannot have events pushed to it when it subscribes to a publisher outside the local network. This is overcome by allowing an alternative way of retrieving events, ``pulling'' them from the publisher; i.e., the listener initiates the retrieval by contacting the publisher, instead of the publisher having to use the remote reference of the listener.
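The pull style of retrieval can be illustrated with a small sketch in which the publisher buffers events per listener and hands them over only when asked. The names `PullablePublisher`, `pull`, and `max_events`, and the use of an opaque listener ID in place of a remote reference, are assumptions made for illustration.

```python
from collections import defaultdict, deque


class PullablePublisher:
    """Hypothetical publisher that buffers events until listeners pull them."""

    def __init__(self):
        self._subscriptions = defaultdict(set)   # event type -> listener IDs
        self._mailboxes = defaultdict(deque)     # listener ID -> queued events

    def subscribe(self, listener_id, event_type):
        # No remote reference is needed: the publisher never calls back.
        self._subscriptions[event_type].add(listener_id)

    def publish(self, event_type, payload):
        for listener_id in self._subscriptions[event_type]:
            self._mailboxes[listener_id].append((event_type, payload))

    def pull(self, listener_id, max_events=10):
        # Invoked by the listener, so the connection is always outbound
        # from the listener's side of the firewall.
        box = self._mailboxes[listener_id]
        batch = []
        while box and len(batch) < max_events:
            batch.append(box.popleft())
        return batch
```

Because every connection originates at the listener, this works for hosts with dynamic or private addresses, at the cost of polling latency and of buffering on the publisher side.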
The latter problem relates to robustness in the event of a network failure. If a listener's lease expires during a network outage, the publisher stops retaining events for that listener, since the lease has not been renewed. When the resubscription finally gets through, the listener has lost the events that took place between the connection being lost and the resubscription occurring. One way to manage this is for the publisher to save the events in a persistent store so that they are retained. In this case, the listener can use the timestamp of the event to select the ones to retrieve. However, it is still possible to retrieve duplicate events or to miss some of them, owing to the nature of the timestamp. This problem is surmounted by providing a mechanism for unlimited, or persistent, subscriptions. The listener is granted the persistent lease by an agent, while the agent takes care of maintaining a valid lease, of limited duration, with the channel. In addition, each event retrieval is assigned a permanent ID that encapsulates the state information, so that using that ID always returns the same set of events from the channel, along with the ID to be used for the next set of events. Thus, even if a lease is lost, event retrieval can resume from where it left off.
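One way to obtain retrieval IDs with this replay property is to let the ID encode an offset into an append-only event log and to fix the range it covers the first time it is served. The sketch below assumes exactly that; the class `Channel`, the `r<offset>` ID format, and the `max_events` parameter are hypothetical and are not taken from the actual system.

```python
class Channel:
    """Hypothetical channel with replayable retrieval IDs."""

    def __init__(self):
        self._log = []     # append-only event store
        self._ranges = {}  # retrieval ID -> (start, end), fixed once served

    def append(self, event):
        self._log.append(event)

    def first_id(self):
        return "r0"

    def retrieve(self, retrieval_id, max_events=100):
        """Return (events, next_id); repeating an ID repeats the result."""
        start = int(retrieval_id[1:])
        if retrieval_id not in self._ranges:
            end = min(len(self._log), start + max_events)
            if end == start:
                # Nothing new yet: do not freeze an empty range, so the
                # same ID can be retried once more events arrive.
                return [], retrieval_id
            self._ranges[retrieval_id] = (start, end)
        start, end = self._ranges[retrieval_id]
        return self._log[start:end], f"r{end}"
```

A listener that loses its connection after requesting a batch can reissue the same ID and receive the identical batch, then continue with the returned next ID, so that no events are duplicated or skipped, in contrast to timestamp-based selection.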