Karma Provenance Framework

Karma: Sanskrit word meaning "deed or act", more broadly describing the principle of cause and effect;
Total effect of a person's actions during the successive phases of existence that determines the person's destiny.

Introduction

Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources [Record2005] [TR-618].

As the earth sciences improve their ability to sense the world around them, there is a growing need for data-driven applications controlled by data-centric workflows composed of grid and web services. Karma focuses on provenance collection for these workflows, which is necessary to validate a workflow and to determine the quality of the data products it generates. The challenge we address is to record uniform and usable provenance metadata that meets domain needs while minimizing the modification burden on service authors and the performance overhead on the workflow engine and the services. The framework, based on a loosely coupled publish-subscribe architecture for propagating provenance activities, satisfies the needs of detailed provenance collection, and a performance evaluation of a prototype finds minimal performance overhead [ICWS2006, JWSR2008].
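The publish-subscribe idea can be illustrated with a minimal in-process sketch. It is not Karma's implementation or API: the `Activity` fields, the `Broker` and `ProvenanceStore` classes, and the topic name `wf-1` are all hypothetical names chosen for illustration. The point it shows is that a service only publishes activities, while the provenance store and any monitoring client subscribe independently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Activity:
    """A hypothetical provenance activity (not Karma's actual schema)."""
    kind: str          # e.g. "ServiceInvoked", "DataProduced"
    workflow_id: str
    service_id: str
    detail: str = ""


class Broker:
    """Toy in-process stand-in for a notification broker."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Activity], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Activity], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, activity: Activity) -> None:
        # The publisher does not know who, if anyone, is listening.
        for handler in self._subscribers[topic]:
            handler(activity)


class ProvenanceStore:
    """Subscriber that records every activity it receives."""

    def __init__(self) -> None:
        self.activities: List[Activity] = []

    def record(self, activity: Activity) -> None:
        self.activities.append(activity)


broker = Broker()
store = ProvenanceStore()
progress: List[str] = []   # stands in for a monitoring client

broker.subscribe("wf-1", store.record)
broker.subscribe("wf-1", lambda a: progress.append(a.kind))

# A service publishes activities without referring to either subscriber.
broker.publish("wf-1", Activity("ServiceInvoked", "wf-1", "svc-A"))
broker.publish("wf-1", Activity("DataProduced", "wf-1", "svc-A", "out.nc"))
```

Because the service code only calls `publish`, adding or removing a subscriber requires no change to the service, which is the loose coupling the paragraph above describes.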

Performance

Empirical results have established the scalability of Karma for collecting and querying provenance over hundreds of thousands of services and from numerous clients, as is typical in large collaboratory scientific projects like LEAD. Provenance collection time grows linearly, with a low slope, as the number of service invocations increases and as the number of concurrent clients rises to 36. Recording time also increases linearly with the number of data products involved, but it remains comparatively low for the 95% of LEAD use cases that involve fewer than 250 data products per workflow. Query time for workflow trace, process provenance, and data provenance grows at most linearly as the number of results retrieved and the number of clients increase. Our future experiments include evaluating the performance of Karma for real workflow runs and obtaining usable results for them by suppressing the I/O variations we encountered, possibly by using local storage instead of network file systems [IPAW2006].

Application

Provenance activities about workflow runs and service invocations help monitor workflow progress in real time [XBaya]. Workflows in LEAD are composed using the XBaya Composer GUI, which can display the progress of a workflow based on the provenance notifications it listens to.

Searching over provenance is less intuitive, and the types of queries supported depend on the application requirements. Karma supports data-oriented and process-oriented views of the provenance, and the ability to query them recursively over space (through different levels of the workflow) and time (forward and backward in the dataflow). The query capabilities of the Karma system [CCPE2007] were explored as part of the First Provenance Challenge, where we answered 8 of the 9 queries provided. The Second Provenance Challenge is currently underway in an effort to understand the interoperability between provenance systems, and our system is among the entries.
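A recursive query over time can be sketched as a walk over a small derivation graph. The sketch below is an illustration only, not Karma's query API: the `PROCESSES` table, the process and file names, and the `ancestors` and `descendants` functions are hypothetical, and the graph is assumed to be acyclic.

```python
from typing import Dict, List, Set

# Hypothetical provenance records: each process lists its inputs and outputs.
PROCESSES: Dict[str, Dict[str, List[str]]] = {
    "ingest":   {"inputs": ["raw.dat"],              "outputs": ["clean.dat"]},
    "forecast": {"inputs": ["clean.dat", "cfg.xml"], "outputs": ["model.nc"]},
    "render":   {"inputs": ["model.nc"],             "outputs": ["plot.png"]},
}


def ancestors(product: str) -> Set[str]:
    """Backward query: every data product that `product` was derived from."""
    found: Set[str] = set()
    for proc in PROCESSES.values():
        if product in proc["outputs"]:
            for parent in proc["inputs"]:
                if parent not in found:
                    found.add(parent)
                    found |= ancestors(parent)   # recurse one step further back
    return found


def descendants(product: str) -> Set[str]:
    """Forward query: every data product derived from `product`."""
    found: Set[str] = set()
    for proc in PROCESSES.values():
        if product in proc["inputs"]:
            for child in proc["outputs"]:
                if child not in found:
                    found.add(child)
                    found |= descendants(child)  # recurse one step further forward
    return found
```

Here `ancestors("plot.png")` walks backward through `render`, `forecast`, and `ingest`, while `descendants("clean.dat")` walks forward to `model.nc` and `plot.png`.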

Provenance is an important quality metric since the derivation process has significant implications on the data's quality, and errors introduced by faulty data tend to inflate as they propagate to data derived from them. This is especially so in workflows, which execute several processes, generating intermediate and final data products whose quality depends on preceding workflow steps. Karma is used as the provenance framework that provides one of the quality metrics in the data quality estimation project for LEAD. The process that generated the data, its configuration parameters, and the input data form the provenance metadata, and the metric is a function over them. We address the non-intuitive nature of specifying constraints on provenance attributes by using machine-learning techniques to aggregate them into a quality score [SciFlow2006].

Current Status

[2007-06-19]

The current version of Karma (v2.1) supports provenance activities published from Services, Workflows and nested Workflows. Karma supports synchronous submission of provenance as activities using a web-services API, or a scalable asynchronous mode using WS-Eventing notifications. Provenance clients can use the Notifier library to generate provenance activities. Synchronous recording is suggested for recording provenance on the scale of hundreds of workflows, or if setting up a notification broker is to be avoided. The WS-Messenger notification broker is the suggested WS-Eventing implementation to use. The Karma service requires a MySQL v5.0 or later server with a database assigned to it (preferably named 'karma2').
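The difference between the two submission modes can be shown with a small in-process analogy. This sketch does not use the Notifier library, WS-Eventing, or WS-Messenger: the functions `submit_sync`, `submit_async`, and `drain`, and the in-memory `recorded` list, are hypothetical stand-ins. In the synchronous mode the caller waits for the record to be stored; in the asynchronous mode the caller hands the activity to a queue and a separate consumer stores it later.

```python
import queue
import threading
from typing import List

recorded: List[str] = []      # stands in for the provenance database
lock = threading.Lock()
pending = queue.Queue()       # stands in for the notification broker


def record(activity: str) -> None:
    """Pretend to persist one activity."""
    with lock:
        recorded.append(activity)


def submit_sync(activity: str) -> None:
    """Synchronous mode: the caller waits until the activity is recorded."""
    record(activity)


def submit_async(activity: str) -> None:
    """Asynchronous mode: enqueue the activity and return immediately."""
    pending.put(activity)


def drain() -> None:
    """Background consumer that records queued activities in arrival order."""
    while True:
        activity = pending.get()
        if activity is None:  # sentinel: stop the worker
            break
        record(activity)


worker = threading.Thread(target=drain)
worker.start()

submit_sync("wf-1: ServiceInvoked")
for i in range(3):
    submit_async(f"wf-2: DataProduced #{i}")

pending.put(None)   # tell the worker to finish
worker.join()       # wait so that all queued activities are recorded
```

The asynchronous path adds a component to set up and operate, which matches the suggestion above to prefer synchronous recording when a broker is to be avoided.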

News: The Karma system will be participating in the Second Provenance Challenge. The goal of this challenge will be to compare the interoperability between provenance systems. A workshop to compare the results is scheduled as part of HPDC at Monterey Bay, CA on 26 June, 2007.

Downloads

Karma v0.3 (2006-02-18)
The Karma distribution files required to run the provenance service and GUI [source] [binary]
Notifier v0.3.2 (2006-02-18)
The Notifier (aka Workflow Tracking) library required to publish provenance activities [source] [binary]
WS-Messenger
WS-Eventing based notification broker used to asynchronously publish provenance activities

Documentation

Publications

[TR-618] A Survey of Data Provenance Techniques
Simmhan, Y.L.; Plale, B. & Gannon, D.; Technical report 618, Computer Science Department, Indiana University, 2005.
[Record2005] A Survey of Data Provenance in e-Science
Simmhan, Y.; Plale, B. & Gannon, D.; SIGMOD Record, Vol. 34, No. 3, pp. 31-36, 2005.
[ICWS2006] A Framework for Collecting Provenance in Data-Centric Scientific Workflows
Simmhan, Y.; Plale, B. & Gannon, D.; ICWS Conference, 2006.
[IPAW2006] Performance Evaluation of the Karma Provenance Framework for Scientific Workflows [Slides]
Simmhan, Y.; Plale, B.; Gannon, D. & Marru, S.; IPAW Workshop and LNCS, 2006.
[SciFlow2006] Towards a Quality Model for Effective Data Selection in Collaboratories [Slides]
Simmhan, Y.L.; Plale, B. & Gannon, D.; IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow) in conjunction with ICDE, 2006.
[CCPE2007] Query capabilities of the Karma provenance framework
Simmhan, Y.; Plale, B. & Gannon, D.; Concurrency and Computation: Practice and Experience, 2007 (In Press).
[JWSR2008] Karma2: Provenance Management for Data Driven Workflows
Simmhan, Y.; Plale, B. & Gannon, D.; International Journal of Web Services Research, 2008 (To Appear).

Contact

Yogesh L. Simmhan [EMail] [WWW]
Doctoral Candidate & Research Assistant
Extreme! Computing Lab & Distributed Data Everywhere Lab
Computer Science Department, Indiana University

Author: Yogesh L. Simmhan
Last modified: $Id: index.html,v 1.4 2007/06/19 11:52:18 ysimmhan Exp $