Appendix: Using the Hasthi Framework to Manage the LEAD Project

This document is an appendix to the paper “Employing Automated Management to Administer a Large-Scale E-Science”, submitted to the Escience 2008 conference. It captures information that did not fit into the paper but is nevertheless of interest.

Outline

1.       LEAD System architecture figure

2.       LEAD services state and Dependencies

3.       LEAD failure Analysis

4.       Instrumentations supported by LEAD Services

5.       Management Rules

6.       Performance Evaluations

LEAD System architecture figure

The figure above presents the architecture of the LEAD system. The following points walk through the list of steps.

 

Service State, and Dependencies

We categorized the services into four categories based on the state each service holds.

1.       Stateless – the service does not retain any state across requests. That is, if the service completes a request, is restarted, and then receives another request from the same client, and the client perceives no difference from operation without the restart, we call the service stateless.

2.       Persistent-State – when the service receives a request, it writes all the data to the database. Once a request is completed, its data is guaranteed to be persistent, so a crash does not lose any data belonging to a completed request. The service may fail while processing a request; in that case it is the client's responsibility to retry the message, and the service detects and filters duplicate messages (that is, requests are idempotent).

3.       Persistent-safe – if the service crashes, it might lose data belonging to current sessions, but after the restart new sessions are processed successfully and all data related to completed sessions is preserved.

4.       Non-Safe – in a crash the service may lose data, its internal state may become inconsistent, and it might be unusable after the restart.
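The duplicate filtering that the Persistent-State category relies on can be illustrated with a small sketch. It is not taken from the LEAD code base: the class name, the method names, and the use of a message ID as the duplicate key are assumptions, and an in-memory map stands in for the service database.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: duplicate filtering in a Persistent-State service.
// The class and method names are illustrative, not part of LEAD or Hasthi.
class DuplicateFilteringService {
    // Stands in for the service database: message ID -> stored request data.
    private final Map<String, String> completedRequests = new HashMap<String, String>();

    // Stores the request and returns true, or returns false without storing
    // anything if a request with the same message ID was already completed.
    // The payload must be non-null.
    boolean handle(String messageId, String payload) {
        return completedRequests.putIfAbsent(messageId, payload) == null;
    }

    int completedCount() {
        return completedRequests.size();
    }

    public static void main(String[] args) {
        DuplicateFilteringService service = new DuplicateFilteringService();
        System.out.println(service.handle("msg-1", "forecast-input")); // first delivery
        System.out.println(service.handle("msg-1", "forecast-input")); // client retry
        System.out.println(service.completedCount());
    }
}
```

Because a retried message is recognized by its ID, a client that cannot tell whether the service crashed before or after completing a request can simply resend it.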

The following table presents an outline of the LEAD services.

 

Table 1: Service State and Dependencies

| Service | State | Dependencies |
| --- | --- | --- |
| Data Base (MySQL) | Yes. The MySQL server recovers from most inconsistencies after a restart; however, it may be left in an inconsistent state [2] | |
| Data Transfer (Grid Ftp) | Stateless | |
| Job Submission (Gram) | Stateless | |
| Auditing Service | Persistent-State | MySQL? |
| Message Broker | Persistent-State | MySQL |
| Data Catalog | Persistent-State | MySQL |
| Data Movement and Naming Service | Persistent-safe | MySQL, GridFtp, Broker |
| Dynamic Service Creator | Stateless | Xregistry |
| Fault Tolerant Recovery (FTR) | | MySQL, GridFtp, Gram, Broker |
| Service Factory | Stateless | XRegistry |
| GeoGUI | Stateless | |
| Host Scheduler Service | Stateless | MySQL |
| Message Box Service | Persistent-State | MySQL |
| Metadata Agent | Persistent-safe – data might be lost from the queue if the system crashes; however, new workflows will work | Messenger, Mylead server |
| Metadata Publisher | Persistent-safe – data might be lost from the queue if the system crashes; however, new workflows will work | |
| Metadata Server | Persistent-State | MySQL |
| Portal | Stateless | MyLead*, GPEL |
| Query Mediator | Stateless | Data Catalog |
| Workflow Monitor | Stateless | MySQL |
| Service Registry | Persistent-State | MySQL |
| Workflow Engine | Persistent-State | MySQL, Gfac |

 

LEAD Failure Analysis

The LEAD workflow success rate has been around 50-60% over the period studied.

 

The following is an analysis of 5101 errors that occurred in the LEAD project over a period of one year and seven months. The following figure shows their distribution by plotting the frequency of each error type, with the X axis sorted by frequency. The distribution has a very long tail, which suggests that addressing the last 30 error types could reduce the occurrence of errors by almost 90%.
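The long-tail argument amounts to computing what share of all errors is covered by the most frequent error types. The sketch below shows that calculation; the per-type counts are invented for illustration and are not the LEAD data, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: share of all errors covered by the k most frequent
// error types. The counts used in main are made up, not LEAD measurements.
class ErrorCoverage {
    static double topKShare(int[] countsPerType, int k) {
        int[] sorted = countsPerType.clone();
        Arrays.sort(sorted); // ascending, so the most frequent types are last
        long total = 0;
        long top = 0;
        for (int i = 0; i < sorted.length; i++) {
            total += sorted[i];
            if (i >= sorted.length - k) {
                top += sorted[i]; // one of the k most frequent types
            }
        }
        return total == 0 ? 0.0 : (double) top / total;
    }

    public static void main(String[] args) {
        // Ten invented error types with a long-tailed frequency distribution.
        int[] counts = {500, 300, 100, 50, 25, 15, 5, 3, 1, 1};
        System.out.println(topKShare(counts, 3));
    }
}
```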

 

   

The following figure presents a categorization of all the errors.

 

The following data shows how the categorization changes over time. From January to April 2008 the workflow success rate was at its best, while errors due to failures of other services (sustainability) were clearly increasing. Over the next three months, software bugs and configuration errors increased because LEAD moved to a new version of its software stack.

 

 

A list of all the errors that occurred in the LEAD project can be found here. The following figure categorizes the errors based on the taxonomy defined by Avizienis et al. [1], and the table presents the errors that belong to each category.

 

[Figure: fault taxonomy]

Development/internal/software

Runtime exceptions (Null Pointer errors, Class cast exceptions)

Service Aging – Memory Leaks, Connection Leaks

Op/internal/deployment

System Configuration Errors  - Outdated configurations, omissions

Deployment errors – outdated clients/ wrong WSDL and type lib versions, mix up dev/production versions, wrong security settings (infamous unknown CA), .globus directory, incompatible versions , class cast/class not found

Workflow Configuration Errors  - missing Host/Appdesc, pointing to wrong executables

Service developer faults – faulty application service configurations

Op/external/services

Unavailable Services - Job submission / File Transfers

Overloaded Services – Refuse Connections, Refuse authorization due to load

Slow services, Connection timeouts

Race conditions, lost jobs

Wrong outputs – destination file is corrupted

Service Aging

Application failures – the underlying command-line application does not generate the final output

Op/internal/infrastructure

Hardware problems – Faulty host, Overloaded hosts
Network – Unknown Host, No Route to Host, Socket Exceptions

Operation/internal/maintenance

Proxy Errors – proxy expiration, security configuration changes

Maintenance Issue – hard disk full

Announced down times

Operation/internal/capacity-sustainability

Service Crash – Any Service

Overload Services – GPEL, Portal

Operator errors – delete data bases, kill services by mistake

Capacity problems – Out of memory

Service response times out

Operation/external/end-user

User faults – wrong inputs

 

Supported Instrumentations

Each type of instrumentation supports the properties listed below. Services may also define their own properties and expose their values via the management agent interface.

| Instrumentation | Resources | Supported management properties |
| --- | --- | --- |
| Host Agent (Java agent that monitors a host) | All the hosts | Operational Status; Uptime Hours; Memory Usage (as a percentage); Swap Usage; Process Count; Thread Count; Load Average over the last 5 minutes |
| In-Memory Agent (Java agent integrated with a Java service; currently XSUL is supported) | Any service that has the in-memory agent integrated | Operational Status; Last Request Received Time; Number Of Pending Requests; Number Of Successful Requests; Number Of Failed Requests; Last Response Time; Service Thread Count; Current Service File Descriptors / Max Service File Descriptors; Max Response Time; Host; Type; Current Service Memory Usage / Max Service Memory Usage |
| Polling Monitoring Agent (polls a given URL and monitors the service running at that URL) | | Operational Status; Number of XML requests; Start Time |
| Process Monitor (monitors the service process if the host where the service runs is accessible) | MySQL | Operational Status; Process memory; Process CPU Usage |
| JMX (interfaces with an existing JMX instrumentation and exposes it as a WSDM resource) | Tomcat | Current Service Memory Usage / Max Service Memory Usage; Service Thread Count; Current Service Open File Descriptors / Max Service Open File Descriptors |
| Script-based monitoring (repeatedly executes a script and parses the data from stdout) | | Whatever properties the script supports |
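The parsing step of script-based monitoring could look roughly like the sketch below. The output format is an assumption: the appendix does not say what the scripts print, so the sketch supposes one `name=value` pair per line, and the class name and property names are hypothetical. Running the script itself is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turn the stdout of a monitoring script into
// management properties, assuming one "name=value" pair per line.
class ScriptOutputParser {
    static Map<String, String> parse(String stdout) {
        Map<String, String> properties = new LinkedHashMap<String, String>();
        for (String line : stdout.split("\\r?\\n")) {
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // blank line, or a line without a property name
            }
            String name = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            properties.put(name, value);
        }
        return properties;
    }

    public static void main(String[] args) {
        // Invented script output; a real agent would read it from the process.
        String stdout = "DiskUsage=81\nchecking queue...\nOpenFiles = 12\n";
        System.out.println(parse(stdout));
    }
}
```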

 

Management Rules

Rule 1: Recover Host Failures

If a host has failed, the following rule migrates all of its services to a different host. The two lines in the when clause select a host that has crashed and all the services running on that host. The rule then migrates each service to an alternative host based on the resource profile of each service.

rule "RecoverFailedHost"
when
    // Select a crashed host and every service on it except MySQL.
    host : Host(state == "CrashedState");
    sl : ArrayList() from collect(ManagedService(host == host.name, type not matches ".*MySQL.*"));
then
    final List depList = MngActionUtils.findAllDependancies(system, sl);
    final ActionCenter finalSystem = system;
    // Migrate each service in the resulting list.
    for (int i = 0; i < depList.size(); i++) {
        final ManagedService failedService = (ManagedService) depList.get(i);

        ActionCallback callback = new ActionCallback() {
            public void actionSucessful(ManagementAction action) {
                MngActionUtils.setResourceState(action.getActionContext(), failedService, "RepairedState");
            }
            public void actionFailed(ManagementAction action, Throwable e) {
                MngActionUtils.setResourceState(action.getActionContext(), failedService, "UnRepairableState");
                try {
                    // Migration failed: ask the administrator to repair the resource by hand.
                    finalSystem.invoke(new UserInteractionAction("hperera@indiana.edu",
                        "Resource " + failedService.getName() + " has failed, and repair has failed. Please fix it and respond",
                        "response", new OnetimeEventCallback() {
                            public void eventOccuered(HasthiEvent event) {
                                MngActionUtils.setResourceState(finalSystem.getActionContext(), failedService, "RepairedState");
                            }
                        }));
                } catch (Exception e1) {
                    e1.printStackTrace();
                }
            }
        };
        system.invoke(new MigrateServiceAction(failedService), callback);
    }
end

 

Rule 2: Resurrect Workflows after system recovery

The first rule logs the time whenever the system reaches a non-healthy state, and the second rule resurrects the workflows when the system returns to a healthy state. Note that the two rules communicate through name-value pairs shared via the system object.

rule "LogSystemNonHealthyTime"
salience 8
when
    // At least one service is in the crashed state.
    sl : ArrayList(size > 0) from collect(ManagedService(state == "CrashedState", category == "Service"));
then
    // Record the failure time for the other rules to read.
    system.put("SystemFailed", new Long(System.currentTimeMillis()));
end

rule "ResurrectWorkflowsAfterRecovery"
salience 5
when
    // No crashed services remain, and a failure was recorded earlier.
    sl : ArrayList(size == 0) from collect(ManagedService(state == "CrashedState", category == "Service"));
    eval(system.get("SystemFailed") != null);
    leadproxy : ManagedService();
then
    system.invoke(new ResurrectWorkflowAction(leadproxy, (Long) system.get("SystemFailed")));
    // Clear the shared values so the next failure starts afresh.
    system.remove("SystemFailed");
    system.remove("MailSent");
end

Rule 3: Resurrect MySQL database

This case is also covered by service recovery.

 

Rule 4: Recover Service Failures

If a service has crashed but its host is up, the service can be recovered by restarting it in the same location. The following rule detects this condition and restarts the service.

rule "RestartFailedServices"
when
    // A crashed service whose host is still up.
    service : ManagedService(state == "CrashedState");
    host : Host(state != "CrashedState", service.host == name);
then
    final ManagedService failedService = service;
    final ActionCenter finalSystem = system;
    system.invoke(new RestartAction(service), new ActionCallback() {
        public void actionSucessful(ManagementAction action) {
            MngActionUtils.setResourceState(action.getActionContext(), failedService, "RepairedState");
        }
        public void actionFailed(ManagementAction action, Throwable e) {
            MngActionUtils.setResourceState(action.getActionContext(), failedService, "UnRepairableState");
            try {
                // Restart failed: hand the problem over to a user.
                finalSystem.invoke(new UserInteractionAction(finalSystem, failedService, action, e));
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
    });
end

Rule 5: Handle Overloaded Hosts

Whether a host is in the saturated state is decided by the local management rules; the following rule migrates services away from an overloaded host.

rule "HandleOverloadedHosts"
salience 10
when
    // A saturated host and the services running on it.
    host : Host(state == "SaturatedState");
    sl : ArrayList() from collect(ManagedService(host == host.name));
then
    ArrayList migrateList = findList2Migarate(sl);
    //migrate each service
end

 

Rule 5: Handle Overloaded Application Services

If an application service is overloaded, the following rule unregisters it from the registry, using a custom management action to perform the unregistration.

rule "HandleOverloadedAppServices"
when
    appService : ManagedService(state == "SaturatedState", category == "TransientService");
    registry : ManagedService(type matches ".*Xregistry.*")
then
    system.invoke(new UnregisterServiceAction(registry, appService));
end

 

Rule 6: Handle Deployment Errors

When an error occurs in the LEAD system, an event is sent to the management system. The following event callback listens for such events and asks the owner of the resource to fix the problem. More complex error handling can be built into the management system by writing similar rules.

rule "Init"
when
then
    //other init
    system.getEventCorrelator().registerCallback("WorkflowFault:App .* Not Found", new PersistantEventCallback() {
        public void eventOccuered(HasthiEvent event) {
            // Ask the owner of the resource to fix the problem.
            system.invoke(new UserInteractionAction(getOwner(event), event.getErrorMsg(), "response",
                new OnetimeEventCallback() {
                    public void eventOccuered(HasthiEvent event) {
                        //Post processing
                    }
                }));
        }
    });
end
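The registration above pairs a regular expression with a callback. How the Hasthi event correlator matches events is not described in this appendix, so the following stand-alone sketch is only one plausible reading: it assumes the whole event text must match the pattern, and every class, interface, and event string in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a pattern-keyed event callback registry.
// It is not the Hasthi event correlator; full-text matching is an assumption.
class EventCallbackRegistry {
    interface EventCallback {
        void eventOccurred(String eventText);
    }

    private static final class Registration {
        final Pattern pattern;
        final EventCallback callback;

        Registration(Pattern pattern, EventCallback callback) {
            this.pattern = pattern;
            this.callback = callback;
        }
    }

    private final List<Registration> registrations = new ArrayList<Registration>();

    void registerCallback(String regex, EventCallback callback) {
        registrations.add(new Registration(Pattern.compile(regex), callback));
    }

    // Invokes every callback whose pattern matches the whole event text
    // and returns how many callbacks were invoked.
    int dispatch(String eventText) {
        int invoked = 0;
        for (Registration r : registrations) {
            if (r.pattern.matcher(eventText).matches()) {
                r.callback.eventOccurred(eventText);
                invoked++;
            }
        }
        return invoked;
    }

    public static void main(String[] args) {
        EventCallbackRegistry registry = new EventCallbackRegistry();
        final List<String> notified = new ArrayList<String>();
        registry.registerCallback("WorkflowFault:App .* Not Found", new EventCallback() {
            public void eventOccurred(String eventText) {
                notified.add(eventText); // stands in for contacting the owner
            }
        });
        registry.dispatch("WorkflowFault:App ExampleApp Not Found"); // matches
        registry.dispatch("WorkflowFault:Connection timed out");     // ignored
        System.out.println(notified);
    }
}
```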

 

Rule 7: Notify Downtimes

If the system has been down for more than 10 minutes, the following rule sends an email to the mailing list to notify users of the downtime.

rule "NotifyDowntimes"
salience 9
when
    // No mail has been sent for this outage yet,
    eval(system.get("MailSent") == null);
    // and the failure was recorded more than 10 minutes ago.
    eval(system.get("SystemFailed") != null
        && (((Long) system.get("SystemFailed")).longValue() < System.currentTimeMillis() - 10*60*1000));
then
    system.invoke(new SendEmailAction("hperera@indiana.edu", "LEAD Down Time", "message"));
    system.put("MailSent", "true");
end
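The rules "LogSystemNonHealthyTime", "ResurrectWorkflowsAfterRecovery", and "NotifyDowntimes" cooperate through the shared values "SystemFailed" and "MailSent". The plain-Java sketch below traces that interplay outside the rule engine. It is a simplification under stated assumptions: the class is hypothetical, time is passed in explicitly, only the first failure time of an outage is recorded, and sending mail is reduced to a counter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the shared-flag logic behind the downtime rules.
// Not Hasthi code: a map stands in for the system object, and a counter
// stands in for SendEmailAction.
class DowntimeNotifier {
    static final long THRESHOLD_MS = 10 * 60 * 1000L; // 10 minutes

    private final Map<String, Object> shared = new HashMap<String, Object>();
    private int mailsSent = 0;

    // One evaluation cycle, given the number of crashed services and the time.
    void evaluate(int crashedServices, long nowMs) {
        if (crashedServices > 0 && !shared.containsKey("SystemFailed")) {
            // Assumption: remember only the first failure time of an outage.
            shared.put("SystemFailed", Long.valueOf(nowMs));
        }
        if (crashedServices == 0 && shared.containsKey("SystemFailed")) {
            // Recovered: workflows would be resurrected here, then flags cleared.
            shared.remove("SystemFailed");
            shared.remove("MailSent");
            return;
        }
        Long failedAt = (Long) shared.get("SystemFailed");
        if (failedAt != null && shared.get("MailSent") == null
                && failedAt.longValue() < nowMs - THRESHOLD_MS) {
            mailsSent++;                    // notify once per outage
            shared.put("MailSent", "true");
        }
    }

    int mailsSent() {
        return mailsSent;
    }

    public static void main(String[] args) {
        DowntimeNotifier notifier = new DowntimeNotifier();
        long minute = 60_000L;
        notifier.evaluate(1, 0);            // outage starts
        notifier.evaluate(1, 11 * minute);  // still down after 11 minutes
        notifier.evaluate(1, 12 * minute);  // no second mail for the same outage
        System.out.println(notifier.mailsSent());
    }
}
```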

 

Rule 8: Unknown Error Condition

If a service is determined to be faulty, this rule shuts the service down.

rule "HandleUnknownErrors"
when
    // A faulty service whose host is still up.
    service : ManagedService(state == "FaultyState");
    host : Host(state != "CrashedState", service.host == name);
then
    final ManagedService failedService = service;
    final ActionCenter finalSystem = system;
    system.invoke(new ShutDownAction(service), new ActionCallback() {
        public void actionSucessful(ManagementAction action) {
            MngActionUtils.setResourceState(action.getActionContext(), failedService, "RepairedState");
        }
        public void actionFailed(ManagementAction action, Throwable e) {
            MngActionUtils.setResourceState(action.getActionContext(), failedService, "UnRepairableState");
            try {
                // Shutdown failed: hand the problem over to a user.
                finalSystem.invoke(new UserInteractionAction(finalSystem, failedService, action, e));
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
    });
end

Handling Maintenance Tasks

Apart from reactive error handling, proactive maintenance tasks are required to achieve high availability, and we have identified the following tasks. We have implemented tasks 3, 5, and 7 in the initial version.

1.       Subscribe to system notices (e.g. about down times or network failures), and notify relevant components (e.g. FTR) to exclude the affected compute hosts/resources.

2.       When third-party services are down, formal error reporting and follow-up (e.g. with the TG help desk) is required. For example, when a GridFTP/Gram service is deemed faulty, compose a mail with all the details and show it to the administrators, who can 1) send the mail to the developer/gateway debug mailing list, or 2) file a TG bug report by following a single link. In addition, the system will let administrators browse and follow up on those requests.

3.       Safely shut down, kill, or restart services that have known problems once they have been running for too long.

4.       Perform actions at scheduled times (e.g. at 12 am shut down a service, do an svn update, rebuild it, and restart it). This would be useful for performing and keeping track of tasks such as swapping the development and production stacks.

5.       Monitor disk usage and send warnings, and clean up temporary directories.

6.       Move to new versions of tools, automating possible rollbacks as much as possible.

7.       If the system is not healthy for more than a given time (e.g. 15 minutes), send a mail to the lead-all mailing list notifying its members of the down time and the projected recovery time, and report back when the system is healthy again.
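As an illustration of the disk usage monitoring task in the list above, the check could be as simple as the sketch below. It does not reproduce the Hasthi implementation; the class name, the 90% warning threshold, and the choice of directory are assumptions. `File.getTotalSpace()` and `File.getUsableSpace()` are standard Java calls.

```java
import java.io.File;

// Hypothetical sketch of a disk usage check for a maintenance task.
// The 90% threshold used in main is an arbitrary example value.
class DiskUsageCheck {
    // Percentage of the partition in use, rounded down.
    static int usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0; // unknown or empty partition
        }
        return (int) (((totalBytes - usableBytes) * 100) / totalBytes);
    }

    static boolean shouldWarn(long totalBytes, long usableBytes, int warnAtPercent) {
        return usedPercent(totalBytes, usableBytes) >= warnAtPercent;
    }

    public static void main(String[] args) {
        // Check the partition holding the given directory (default: current one).
        File dir = new File(args.length > 0 ? args[0] : ".");
        long total = dir.getTotalSpace();
        long usable = dir.getUsableSpace();
        System.out.println("used " + usedPercent(total, usable) + "%, warn="
                + shouldWarn(total, usable, 90));
    }
}
```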

 

Evaluations

We have conducted two tests to evaluate Hasthi in the LEAD environment. The first test is geared to measure the scalability of Hasthi, and the second experiment measures the overhead induced by different management actions in the LEAD environment.

Experiment 1

To measure how Hasthi fares with an increasing number of resources, we conducted the following experiment. As the test bed, we modeled a large-scale deployment of the LEAD workflow system. The test system consists of many replicated units of the workflow system, and each service in the workflow system is modeled by a real web service. However, instead of doing real processing, each service randomly simulates a workload (e.g. the number of requests it received, whether it is overloaded, etc.) and generates the associated management events. For example, each service randomly simulates receiving a request, the success or failure of the request, and the time taken to process it, and one in every 100 services randomly fails every hour. We could run 200 workload-simulating services on one host without them affecting each other's performance, and each host was given one replicated unit of the workflow system. That is, each unit has a workflow engine and persistent services, and the other services are modeled as transient services. We changed the number of resources by changing the number of units (e.g. 5000 resources = 25 hosts with 200 services each).

To manage this system we deployed Hasthi; the management framework is not aware of the workload simulation done in the services. We deployed each manager on a separate node. The management system used the rules mentioned in the first use case and reacted to system changes accordingly. The system was set up with 30 s as both the resource and the management heartbeat interval, and each test system was managed for one hour, so 60 minutes / 30 s = 120 readings were collected. The tests were conducted on the Odin cluster (128 nodes, dual AMD 2.0 GHz processors, 4 GB memory, 1 GB Ethernet, Red Hat Linux, Sun Java 1.5).
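The sizing figures quoted for this experiment follow from simple arithmetic, reproduced in the sketch below. The class and method names are hypothetical; the inputs (200 services per host, a 30 s heartbeat, a one-hour run, up to 20000 resources) are the ones stated in the text.

```java
// Hypothetical helper reproducing the experiment's sizing arithmetic.
class ExperimentSizing {
    // Hosts needed for a number of simulated resources, rounding up.
    static int hostsNeeded(int resources, int servicesPerHost) {
        return (resources + servicesPerHost - 1) / servicesPerHost;
    }

    // Readings collected when one reading is taken per heartbeat interval.
    static long readings(long durationSeconds, long heartbeatSeconds) {
        return durationSeconds / heartbeatSeconds;
    }

    public static void main(String[] args) {
        System.out.println(hostsNeeded(5000, 200));  // 25 hosts
        System.out.println(hostsNeeded(20000, 200)); // 100 hosts
        System.out.println(readings(60 * 60, 30));   // 120 readings
    }
}
```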

This experiment compares the coordinator's behavior with up to 20000 resources and with 5 and 10 managers; the coordinator loop latency (the global control loop) was measured while managing the systems. The following graph depicts the results.

Experiment 2

To measure the overhead induced by different management actions, the LEAD system was set up, each of the following management actions was performed 500 times, and the overhead to complete each action was measured. The results are given below. For the test, a node with dual 2.0 GHz Opteron CPUs and 16 GB RAM was used to host the service.

| Action | Median Overhead (ms) |
| --- | --- |
| Create Service | 5142 |
| Shutdown Service | 60 |
| Configure Service | 130 |

The Create Service action includes an explicitly added wait of 5000 ms to verify successful service initiation, which accounts for most of its high overhead.
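Subtracting the fixed wait from the reported median leaves 5142 - 5000 = 142 ms for the Create Service action itself. The sketch below shows how such a median and net figure can be computed from raw timings; the three sample values are invented so that their median equals the reported 5142 ms, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: median overhead and overhead net of a fixed delay.
// The sample timings in main are invented, not measured values.
class OverheadStats {
    static double median(long[] samplesMs) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    static double netOfFixedDelay(double medianMs, long fixedDelayMs) {
        return medianMs - fixedDelayMs;
    }

    public static void main(String[] args) {
        long[] createServiceMs = {5150, 5130, 5142}; // invented samples
        double m = median(createServiceMs);
        System.out.println(m + " ms median, " + netOfFixedDelay(m, 5000) + " ms net");
    }
}
```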

References  

1.     A. Avizienis, J.-C. Laprie, and B. Randell, "Dependability and its threats: a taxonomy," in Building the Information Society, 18th IFIP World Computer Congress, 2004.