Appendix
:Using Hasthi framework to Manage LEAD project
This
document acts as a appendix to the paper, “Employing Automated Management to
Administer a Large-Scale E-Science”, submitted to the Escience 2008 conference,
and captures the information that does not fit in to the paper, nevertheless
interesting.
1. Lead System architecture figure
2. LEAD
services state and Dependencies
4. Instrumentations supported by LEAD Services

Above
figure presents the architecture of the LEAD system. Following points walk
though the list steps.
We categorized services under four categories based on the state each service hold.
1. Stateless – the service does not remember any state across two requests. That is to say that if the service complete a request, and was restarted, and after restart is completed the service received another request from the same client, and if the client does not perceive a deference from operation without restart, we call that a service stateless.
2. Persistent-State – when a service received a request, it writes all the data to the database, and if the request is completed, the data is guaranteed to be persistent, and service does not lose any data on a complete request due to crash. A service may failed while processing requests, and it will be responsibility of the client to retry the message in that case, where the service will detect and filter the duplicate messages (idempotent).
3. Persistent-safe – If the service crashed, it might lose data in current sessions, but after the restart new session will be processed successfully, and also all the data related to completed sessions will be preserved.
4. Non-Safe – In a crash service may lose data, and might be unusable after the restart, and internal state may be inconsistent.
Following table present a outline of LEAD services.
Table 1: Service State and Dependancies
LEAD workflow success rate has been around 50-60% over the time period.

Following is analysis of 5101 errors occurred in LEAD project over a time period of one year and seven months. Following figure show their distribution by plotting their frequency against different error type, with X axis sorted by the frequency. The distribution has a very long tail, and it suggest that addressing last 30 errors could reduce the occurrence of errors by almost 90%.
Following figure presents a categorization of all the errors.

Following data presents how categorizations change over time. We can see that 2008 January to April, it has best workflow success rate, and errors due to other service failures (sustainability) are clearly increasing. On next three months, software bugs and configuration errors has increased because LEAD moved to a new version of software stack.

List of all the errors happened in LEAD project can be found here. Following figure categorize the errors based on the taxonomy define by Avizienis et al [1], and the table presents different errors belong to each category.

|
Development/internal/software |
Runtime exceptions (Null Pointer errors, Class cast exceptions) Service Aging – Memory Leaks, Connection Leaks |
|
Op/internal/deployment |
System Configuration Errors - Outdated configurations, omissions Deployment errors – outdated clients/ wrong WSDL and type lib versions, mix up dev/production versions, wrong security settings (infamous unknown CA), .globus directory, incompatible versions , class cast/class not found Workflow Configuration Errors - missing Host/Appdesc, pointing to wrong executables Service developer faults – faulty application service configurations |
|
Op/external/services |
Unavailable Services - Job submission / File Transfers Overloaded Services – Refuse Connections, Refuse authorization due to load Slow services, Connection timeouts Race conditions, lost jobs Wrong outputs – destination file is corrupted Service Aging Application failures – underline command line application does not generate final output |
|
Op/ internal/infrastructure |
Hardware problems – Faulty host,
Overloaded hosts |
|
Operation/ internal /maintenance |
Proxy Errors – proxy expiration, security configuration changes Maintenance Issue – hard disk full announced down times |
|
Operation/ internal /capacity-sustainability |
Service Crash – Any Service Overload Services – GPEL, Portal Operator errors – delete data bases, kill services by mistake Capacity problems – Out of memory Service response times out |
|
Operation/ external /end-user |
User faults – wrong inputs |
Different types of instrumentations supports following properties, and services may define their own properties and expose their values via management agent interface.
|
Instrumentation |
resources |
Supported management properties |
|
Host Agent (Java agent that monitors a host) |
All the hosts |
1. Operational Status 2. Uptime Hours 3. Memory Usage (As a percentage) 4. Swap Usage 5. Process Count 6. Thread Count 7. Load Average in last 5 minutes. |
|
In Memory Agent (Java agent integrated with a java service. Currently XSUL is supported.) |
Any Service that has in memory agent integrated |
1. Operational Status 2. Last Request Received Time 3. Number Of Pending Requests 4. Number Of Successful Requests 5. Number Of Failed Requests 6. Last Response Time 7. Service Thread Count 8. Current Service File Descriptions/ Max Service File Descriptions 9. Max Response Time 10. Host 11. Type 12. Current Service Memory Usage/ Max Service Memory Usage |
|
Polling Monitoring Agent (Polls a given URL and monitor the service running in the URL) |
|
1. Operational Status 2. Number of XML requests 3. Start Time |
|
Process Monitor (If host where the service running is accessible, the process monitor can monitor the process) |
MySQL |
1. Operational Status 2. Process memory 3. Process CPU Usage |
|
JMX (Interface a existing JMX instrumentation and expose it as a WSDM resource) |
Tomcat |
1. Current Service Memory Usage/ Max Service Memory Usage 2. Service Thread Count 3. Current Service open File Descriptions/ Max Service open File Descriptions |
|
Script based monitoring (repeatedly execute a script, and parse the data from the stdout) |
|
What ever property supported by the script |
If a Host has failed, the following rule migrates all the services to a different Host. The two lines in the when clause select a host that has crashed, and selects all the services running in that host. Then the rule migrate each service in to an alternative host based on resource profile of each service.
rule "RecoverFailedHost"
when
host:Host(state == "CrashedState");
sl : ArrayList() from collect(ManagedService(host == host.name,
type not matches
".*MySQL.*"));
then
final List depList = MngActionUtils.findAllDependancies(system,sl);
final ActionCenter finalSystem = system;
for(int i = 0;i<depList.size();i++){
ManagedService service =
(ManagedService) depList.get(i);
final ManagedService failedService
= service;
ActionCallback callback = new ActionCallback() {
public void actionSucessful(ManagementAction
action) {
MngActionUtils.setResourceState(action.getActionContext(),failedService,"RepairedState");
}
public void
actionFailed(ManagementAction action,Throwable e) {
MngActionUtils.setResourceState(action.getActionContext(),failedService,"UnRepairableState");
try{
finalSystem.invoke(new
UserInteractionAction("hperera@indiana.edu",
"Resource
"+failedService.getName()+" has falied, and repair has failed. Please
Fix it and respond",
"response", new
OnetimeEventCallback() {
public void eventOccuered(HasthiEvent event){
MngActionUtils.setResourceState(finalSystem.getActionContext(),failedService,"RepairedState");
}
}));
}catch(Exception
e1){e1.printStackTrace();}
}
};
system.invoke(new
MigrateServiceAction(failedService),callback);
}
end
The first rules logs when ever they system reaches a non-healthy state and the second rule resurrect the workflows when the system returned to healthy state. It is interesting to note that these two rules use named value pairs shared using the system object to communicate between two rules.
rule
"LogSystemNonHealthyTime"
salience 8
when
sl
: ArrayList(size > 0) from collect(ManagedService(state ==
"CrashedState", category == "Service"));
then
system.put("SystemFailed",new
Long(System.currentTimeMillis()));
end
rule
"ResurrectWorkflowsAfterRecovery"
salience 5
when
sl : ArrayList(size == 0) from collect(ManagedService(state ==
"CrashedState", category == "Service"));
eval(system.get("SystemFailed") != null);
leadproxy:ManagedService();
then
system.invoke(new ResurrectWorkflowAction(leadproxy,(Long)system.get("SystemFailed")));
system.remove("SystemFailed");
system.remove("MailSent");
end
This
is also covered from service recovery.
If
service has crashed, but the residing host is up, the service can be recovered
by restarting the service in the same location. The rules detect the condition
and restart the services.
rule
"RestartFailedServices"
when
service:ManagedService(state == "CrashedState");
host:Host(state
!= "CrashedState", service.host == name);
then
final
ManagedService failedService = service;
final
ActionCenter finalSystem = system;
system.invoke(new
RestartAction(service),new ActionCallback() {
public void
actionSucessful(ManagementAction action) {
MngActionUtils.setResourceState(action.getActionContext(),failedService,"RepairedState");
}
public void
actionFailed(ManagementAction action,Throwable e) {
MngActionUtils.setResourceState(action.getActionContext(),failedService,"UnRepairableState");
try{
finalSystem.invoke(new
UserInteractionAction(finalSystem,failedService,action,e));
}catch(Exception e1){e1.printStackTrace();}
}
});
end
A hsot saturated state is decided by the local management rules, and the following rule migrate services if a host is overloaded.
rule
"HandleOverloadedHosts"
salience
10
when
host:Host(name == service.host,
state == "SaturatedState");
sl
: ArrayList() from collect(ManagedService(host == host.name));
then
ArrayList migrateList =
findList2Migarate(sl);
//migrate each service
end
If a application service is overloaded, following rule unregisters the service from the registry, and this rule uses a custom management action to perform unregistering the service.
rule "HandleOverloadedAppServices"
when
appService: ManagedService (state
== "SaturatedState", category == "TransientService");
registry: ManagedService(type
matches ".*Xregistry.*")
then
system.invoke(new
UnregisterServiceAction(registry,appService));
end
When
a error happened in the LEAD system, an event is sent to the management system,
and following event callback listen to those events and ask the owner of the
resource to fix the problem. Similarly more complex error handling can be built
in the management system by writing similar rules.
rule "Init"
when
then
//other init
system.getEventCorrelator().registerCallback("WorkflowFault:App .* Not
Found",new PersistantEventCallback() {
public void eventOccuered(HasthiEvent event){
system.invoke(new
UserInteractionAction(getOwner(event),event.getErrorMsg(),"response",
new OnetimeEventCallback() {
public void
eventOccuered(HasthiEvent event){
//Post processing
}
})
);
}
});
end
If the system is down for a more than 10 minutes, following rule generate a email to the mailing list notifying the users about the downtime.
rule "NotifyDowntimes"
salience 9
when
eval(system.get("MailSent") ==
null);
eval(system.get("SystemFailed") != null
&&
(((Long)system.get("SystemFailed")).longValue() <
System.currentTimeMillis() - 10*60*1000) );
then
system.invoke(new SendEmailAction("hperera@indiana.edu",
"LEAD Down Time", "message"));
system.put("MailSent","true");
end
If a
service is decided to faulty, this rule shut down the service.
rule "HandleUnknownErrors"
when
service:ManagedService(state == "FaultyState");
host:Host(state
!= "CrashedState", service.host == name);
then
final ManagedService failedService = service;
final
ActionCenter finalSystem = system;
system.invoke(new
ShutDownAction(service),new ActionCallback() {
public void
actionSucessful(ManagementAction action) {
MngActionUtils.setResourceState(action.getActionContext(),failedService,"RepairedState");
}
public void
actionFailed(ManagementAction action,Throwable e) {
MngActionUtils.setResourceState(action.getActionContext(),failedService,"UnRepairableState");
try{
finalSystem.invoke(new
UserInteractionAction(finalSystem,failedService,action,e));
}catch(Exception
e1){e1.printStackTrace();}
}
});
end
Apart reactive error handling, proactive maintenance tasks are required for high availability to achieve high availability, and we have identified following tasks. We have implemented 3, 5, and 7 in the initial version.
1. Subscribe to system notices (e.g. about down times, network failures), and notify relevant components (e.g. FTR) to exclude the compute hosts/resources based on system notices.
2. When some third party services are down, formal error reporting and follow up (e.g. TG Help desk) is required. For an example, when a GridFTP/ Gram is deemed faulty, compose a mail with all the details and show it to administrators, and he can 1) send a mail to developer/gateway debug mailing list 2) file a TG bug report by following a one link. In addition, the system will support administrators to browse and follow up those requests.
3. Safely shutdown, kill or restart services that are old and having known problems when they are too old.
4. Perform actions in scheduled times (e.g. at 12 am shutdown a service, do a svn update, rebuild it and restart it). This would be useful for performing and keeping track of tasks like swapping development and production stacks.
5. Monitor disk usages and send warnings, and clean up temporary directories.
6. Moving to new versions of tools, and automate possible rollback as much as possible.
7. If the system is not healthy for more then give time (e.g. 15 minutes) send a mail to lead-all mailing list notifying them about the down time and projected recovery time, and report back when the system is healthy.
We have conducted two tests to evaluate Hasthi in the LEAD environment. The first test is geared to measure the scalability of Hasthi, and the second experiment measures the overhead induced by different management actions in the LEAD environment.
To measure how Hasthi fares with increasing number of resource, we have conducted the following experiment. As the test bed for our experiments, we have modeled a high scale deployment of the LEAD workflow system. The test system consists of many replicated units of the workflow system and each service in the workflow system is modeled by a real web service. However, instead of doing real processing, each service randomly simulate a workload (e.g. number of requests it received, is it overloaded ect.), and generates associated management events. For an example, each service randomly simulates receiving a request, request success or failure, time took to process the request, and one in every 100 services randomly fails in every hour. We could run 200 services that simulate the workload in a one host without affecting performance of each other, and each host was given a replicated unit of the workflow system. That is each unit has a workflow engine, persistent services, and the others are modeled as transient services. We have changed the number of resource by changing the number of units (e.g. 5000 resources = 25 hosts with 200 each).
To manage this system, we deployed Hasthi, and as far as the management framework is concerned, it is not aware of workload simulations done in the services. We have deployed each manager in a separate node. The management system used rules mentioned in the first use case, and reacts to the system changes accordingly. The system was setup with 30s as both resource and management heartbeat interval, and each test system was managed for one hour, and 60 minutes/30s = 120 readings were collected. The tests were conducted using Odin cluster (128 nodes, Dual AMD 2.0GHz processor, 4GB Memory, 1 GB Ethernet, Red Hat Linux, sun java 1.5.).
This experiment compares the coordinator's behavior up to 20000 resources with 5, 10 manager s and the coordinator loop latency ( Global control loop) were measured while managing the systems. Following Graph depicts the results.
To measure the overhead induced by different management actions, LEAD system was setup, and each of the following management actions are performed 500 times, and overhead to complete actions were measured. Results are given below. The test a node with Dual 2.0GHZ Opterons CPU 16GB RAM was used to host the service.
| Action | Median Overhead (ms) |
| Create Service | 5142 |
| Shutdown Service | 60 |
| Configure Service | 130 |
The Create Service action has an explicitly added delay of 5000 ms wait time to verify successful service initiation and the overhead is high but explained.
1. Dependability and its threats: a taxonomy,
Avizienis, A. and Laprie, J.C. and Randell, B., Building the information
society, 18th IFIP World Computer Congress, 2004