Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources [Record2006] [TR-618].
As the earth sciences improve their ability to sense the world around them, there is a growing need for data-driven applications controlled by data-centric workflows composed of grid and web services. Karma focuses on provenance collection for these workflows, which is necessary to validate a workflow and to determine the quality of the data products it generates. The challenge we address is to record uniform and usable provenance metadata that meets domain needs while minimizing the modification burden on service authors and the performance overhead on the workflow engine and the services. The framework, based on a loosely coupled publish-subscribe architecture for propagating provenance activities, satisfies the needs of detailed provenance collection, and a performance evaluation of a prototype finds minimal performance overhead [ICWS2006, JWSR2008].
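The publish-subscribe idea described above can be illustrated with a minimal in-process sketch. It is not Karma's API: the `Activity`, `Broker`, and `ProvenanceStore` classes, the topic name, the activity kinds, and the service name are all hypothetical, and a real deployment would use a network notification broker rather than a Python object. The point is only that a service publishes activities without knowing who consumes them, so a provenance recorder and a progress monitor can both subscribe without any change to the service.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Activity:
    """A hypothetical provenance activity; field names are illustrative."""
    topic: str        # e.g. one topic per workflow run
    kind: str         # e.g. "ServiceInvoked", "DataProduced"
    source: str       # the service that published the activity
    detail: str = ""  # free-form payload, e.g. a data product ID


class Broker:
    """Toy in-process stand-in for a notification broker (assumption)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Activity], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Activity], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, activity: Activity) -> None:
        # The publisher does not know who, if anyone, is listening.
        for handler in self._subscribers.get(activity.topic, []):
            handler(activity)


class ProvenanceStore:
    """Collects every activity it is handed, in arrival order."""

    def __init__(self) -> None:
        self.records: List[Activity] = []

    def record(self, activity: Activity) -> None:
        self.records.append(activity)


broker = Broker()
store = ProvenanceStore()
progress: List[str] = []

# Two independent subscribers on the same topic: a recorder and a monitor.
broker.subscribe("workflow-42", store.record)
broker.subscribe("workflow-42", lambda a: progress.append(f"{a.source}: {a.kind}"))

# A hypothetical service publishes activities as it runs.
broker.publish(Activity("workflow-42", "ServiceInvoked", "regrid-service"))
broker.publish(Activity("workflow-42", "DataProduced", "regrid-service", "data-001"))
```

In this sketch the service code only calls `publish`, which is the sense in which the modification burden on service authors stays small.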
Empirical results have established the scalability of Karma for collecting and querying provenance over hundreds of thousands of services and from numerous clients, as is typical in large collaboratory scientific projects like LEAD. Provenance collection time for Karma shows a linear trend with a low slope as the number of service invocations increases and as the number of concurrent clients reaches 36. Recording time with Karma increases linearly with the number of data products involved. However, it remains comparatively low for 95% of the use cases for LEAD that involve fewer than 250 data products per workflow. The time to query Karma for workflow trace, process provenance, and data provenance grows at most linearly as the number of results retrieved and the number of clients increase. Our future experiments include evaluating the performance of Karma for real workflow runs and obtaining usable results for them by suppressing the I/O variations we encountered, possibly by using local storage instead of network file systems [IPAW2006].
Provenance activities about workflow runs and service invocations help monitor workflow progress in real time [XBaya]. Workflows in LEAD are composed using the XBaya Composer GUI, which can display the progress of a workflow based on the provenance notifications it listens to.
Searching over provenance is less intuitive, and the types of queries supported depend on the application requirements. Karma supports data-oriented and process-oriented views of the provenance, and the ability to query them recursively over space (through different levels of the workflow) and time (forward and backward in the dataflow). The query capabilities of the Karma system [CCPE2007] were explored as part of the First Provenance Challenge, where we answered 8 of the 9 queries provided. The Second Provenance Challenge is currently underway in an effort to understand the interoperability between provenance systems, and our system is among the entries.
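Recursive querying forward and backward in the dataflow can be pictured as a transitive closure over derivation edges. The sketch below is an illustration under assumptions, not Karma's query interface or storage schema: the `ProvenanceGraph` class, its method names, and the process and data product names are hypothetical, and the graph is held in memory rather than in a database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class ProvenanceGraph:
    """Hypothetical in-memory record of which data derived from which."""

    def __init__(self) -> None:
        self.derived_from: Dict[str, Set[str]] = defaultdict(set)  # output -> inputs
        self.derives: Dict[str, Set[str]] = defaultdict(set)       # input -> outputs
        self.produced_by: Dict[str, str] = {}                      # output -> process

    def record_invocation(self, process: str, inputs: Iterable[str],
                          outputs: Iterable[str]) -> None:
        inputs = list(inputs)
        for out in outputs:
            self.produced_by[out] = process
            for inp in inputs:
                self.derived_from[out].add(inp)
                self.derives[inp].add(out)

    @staticmethod
    def _closure(edges: Dict[str, Set[str]], start: str) -> Set[str]:
        # Iterative depth-first walk; `seen` also guards against cycles.
        seen: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def ancestors(self, product: str) -> Set[str]:
        """Backward in the dataflow: everything `product` was derived from."""
        return self._closure(self.derived_from, product)

    def descendants(self, product: str) -> Set[str]:
        """Forward in the dataflow: everything derived from `product`."""
        return self._closure(self.derives, product)

    def processes_behind(self, product: str) -> Set[str]:
        """Process-oriented view: every process in the derivation history."""
        products = self.ancestors(product) | {product}
        return {self.produced_by[p] for p in products if p in self.produced_by}


graph = ProvenanceGraph()
graph.record_invocation("decode", ["raw"], ["decoded"])
graph.record_invocation("forecast", ["decoded", "config"], ["forecast_out"])
graph.record_invocation("render", ["forecast_out"], ["image"])

history = graph.ancestors("image")      # data-oriented, backward query
affected = graph.descendants("raw")     # data-oriented, forward query
steps = graph.processes_behind("image")  # process-oriented query
```

The backward query answers "what did this product come from", while the forward query answers "what would be affected if this input turned out to be faulty".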
Provenance is an important quality metric because the derivation process has significant implications for the data's quality, and errors introduced by faulty data tend to grow as they propagate to the data derived from them. This is especially so in workflows, which execute several processes and generate intermediate and final data products whose quality depends on the preceding workflow steps. Karma is the provenance framework that provides one of the quality metrics in the data quality estimation project for LEAD. The process that generated the data, its configuration parameters, and the input data form the provenance metadata, and the metric is a function over them. We address the non-intuitive nature of specifying constraints on provenance attributes by using machine-learning techniques to aggregate them into a quality score [SciFlow2006].
The current version of Karma (v2.1) supports provenance activities published from services, workflows, and nested workflows. Karma supports synchronous submission of provenance activities through a web-services API, or a scalable asynchronous mode using WS-Eventing notifications. Provenance clients can use the Notifier library to generate provenance activities. Synchronous recording is suggested for recording provenance on the scale of hundreds of workflows, or when setting up a notification broker is to be avoided. The WS-Messenger notification broker is the suggested WS-Eventing implementation. The Karma service requires a MySQL server, v5.0 or later, with a database assigned to it (preferably named 'karma2').
News: The Karma system will be participating in the Second Provenance Challenge. The goal of this challenge will be to compare the interoperability between provenance systems. A workshop to compare the results is scheduled as part of HPDC at Monterey Bay, CA on 26 June, 2007.
Author: Yogesh L. Simmhan
Last modified: $Id: index.html,v 1.4 2007/06/19 11:52:18 ysimmhan Exp $