Mbirn: Workflow Update 3-20-2006
A Workflow Portlet for Medical Image Processing to enable the Application of Informatics Approaches through the Biomedical Informatics Research Network
People
Shawn N. Murphy, Michael E. Mendis, Jeffrey S. Grethe, Jason Novotny, Brendan Faherty, Burak Ozyurt, Anders Dale, Anthony Kolasny, Tim Brown, Mike Miller, Michael Pan, Arthur Toga, Randy Gollub, David Kennedy, Bruce Rosen
Introduction
The ability to send data through a succession of software programs is critical for the successful analysis of complex images. Over the years, groups have developed “data pipelines”, many of which are simple TCL scripts, but some of which are entire applications built to handle these processes. Although the pipelines are effective in their various local environments, they tend to fail when a calculation requires a high degree of collaboration. Being local, they also are not easily transferable to other institutions where calculations are being tested for reproducibility or extended for further experimentation. Finally, the pipelines are not available to the clinical researcher as the domain space in which collaboration takes place expands to genomics and epidemiology.
Nonetheless, the current state-of-the-art for image processing exists in these data pipeline applications. Perhaps the most sophisticated is the LONI pipeline from the University of California at Los Angeles. Others in use include the Kepler pipeline from the University of California at Berkeley and at San Diego, as well as the jBPM workflow engine from JBoss.
A system was envisioned that could consume the existing pipeline applications and achieve the following goals: 1) allow software produced by BIRN to be made available to people inside and outside of the BIRN group, 2) allow a consistent computing platform of BIRN software to be maintained, with special attention to metadata and data provenance, and 3) allow study metadata to be tightly organized across groups to allow for collaboration and comparison of results.
Methods
We recognized the potential of a portal-based solution, in which a web site accepts uploaded images and hosts the computing machinery so that image processing can occur on the site itself. The processed images are then returned to the users, along with the numerical results of any calculations upon the images, such as the volume of certain structures in the image.
To understand what would be required from the portal-based solution, we surveyed the needs across all of the sites, which included two groups at Harvard University, one at Johns Hopkins University, one at Washington University, one at the University of California at San Diego, one at the University of California at Irvine, and one at the University of California at Los Angeles. The requirements that emerged were as follows: 1) The system must be able to incorporate the pipeline tools currently available for image processing, including LONI, Kepler, and jBPM. 2) The system must allow human review of intermediate results of calculations; the most common use case supporting this requirement is the review of images to ensure that calculations have not made a gross error and converged to irrelevant local minima. 3) The system must allow human handoffs. Several projects exist within the BIRN where various groups participate in various portions of a calculation, so a process by which one group automatically indicates which calculations are ready for the next group is necessary. 4) The system must allow data provenance to be managed so that calculations can be reproduced accurately. 5) The system must be available both for direct human interaction through a set of web pages and to software processes through a set of services, such that other computerized systems can call and interact with the system directly. 6) The system must provide a clear plan for representing the results of calculations, along with the ability to access those results by direct viewing or through software processes. 7) The system must provide the security, scalability, and reliability expected of a multi-user system.
Current pipeline tools are used to work with this data at the various local sites. It was imperative that the portal did not require the functionality of these tools to be reinvented, because this would not represent an efficient use of BIRN resources. For example, the MGH Freesurfer calculation consists of over 40 steps, and we did not wish to redo the workflow in a new portal-based tool. Current pipeline applications that the Portal incorporates are:
1) Kepler @ [1]
Kepler is a visual modeling tool written in Java, built on the Ptolemy-II platform (http://ptolemy.eecs.berkeley.edu/), whose development began in 1997 at UC Berkeley. Several recent SDM efforts have extended Ptolemy-II to allow for the drag-and-drop creation of scientific workflows from libraries of actors. A Ptolemy actor is often a wrapper around a call to a web service or grid service. Ptolemy leverages an XML meta-language called the Modeling Markup Language (MoML) to produce a workflow document describing the relationships of the entities, properties, and ports in a workflow. The process of creating a workflow with the Ptolemy software is centered on creating Java classes that extend a built-in Actor class.
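As a concrete illustration, a MoML document wires actors together through named relations. The sketch below is a minimal, hypothetical composite with a single connection; the actor class names follow the usual ptolemy.actor.lib conventions, but the workflow itself is invented for illustration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical minimal MoML workflow: one source actor linked to one sink -->
<entity name="ExampleWorkflow" class="ptolemy.actor.TypedCompositeActor">
  <entity name="Source" class="ptolemy.actor.lib.Ramp">
    <property name="init" value="0"/>
  </entity>
  <entity name="Sink" class="ptolemy.actor.lib.Discard"/>
  <relation name="r1" class="ptolemy.actor.TypedIORelation"/>
  <link port="Source.output" relation="r1"/>
  <link port="Sink.input" relation="r1"/>
</entity>
```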
2) LONI pipeline @ [2]
The LONI Pipeline is a visual environment for constructing complex scientific analyses of data. It is written in Java and utilizes an OWL-based XML representation of the workflow. The environment also takes advantage of supercomputing environments by automatically parallelizing data-independent programs in a given analysis whenever possible.
3) jBPM @ [3]
The primary focus of JBoss jBPM development has been the BPM (business process management) core engine. Besides further development of the engine, the JBoss roadmap for jBPM focuses on three areas: a) native BPEL support, b) a visual designer to model workflows, and c) enhanced process management capabilities. jBPM can stand alone in a Java VM, inside any Java application, inside any J2EE application server, or as part of an enterprise service bus.
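For illustration, a jPDL-style process definition of the kind jBPM executes might express an automated compute step followed by a human review; all node, task, and transition names below are hypothetical.

```xml
<!-- Hypothetical jPDL-style process: automated compute step, then a human review task -->
<process-definition name="image-analysis-review">
  <start-state name="start">
    <transition to="compute"/>
  </start-state>
  <node name="compute">
    <transition to="review"/>
  </node>
  <task-node name="review">
    <task name="check-intermediate-images"/>
    <transition name="approve" to="end"/>
    <transition name="redo" to="compute"/>
  </task-node>
  <end-state name="end"/>
</process-definition>
```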
The further development and customization of these tools will allow existing script-based analysis pipelines to move into an era of improved metadata management and the ability to take advantage of web services. Meticulous metadata management is essential when pooling data from various locations, especially in science, where there is no overseeing body to establish standards. The ability to use web services provides a way to perform distributed computing and, in a grander scheme, a way to allow rapid deployment of new computational algorithms, by enabling ownership and maintenance of a web service by those who are actually developing the specific computational algorithm.
The goal of the Portal was not necessarily to produce new software, but rather to link together and support existing BIRN software such that it could be more effectively utilized by various groups of collaborating users. To achieve this goal, we architected the system as shown in the diagram below. Because the BIRN is dedicated to open-source solutions, all parts of the infrastructure, including the Kepler pipeline engine, are freely available to the public as open-source projects. In the diagram, the parts built by the authors of this paper are shown in orange, while pre-existing software that was integrated into the solution is shown in yellow.
The system relies on uploads and downloads to and from an open-source file management system named the Storage Resource Broker (SRB, available at [4]). The SRB provides a way to access data sets and resources based on their attributes and/or logical names rather than their names or physical locations and allows file security to be managed on a network shared resource in conjunction with the Grid Account Management Architecture (GAMA, available at [5]). The GAMA system is used for authorization and authentication and consists of two components, a backend security service that provides secure management of credentials, and a front-end set of portlets and clients that provide tight integration into web/grid portals.
The main system software is divided between a Web Server and an Execution Server to comply with the general architecture of the BIRN portal. The Execution Server has access to an open-source Condor grid ([6]). We chose jBPM as the principal engine for scheduling and executing other applications because it is a reliable, open-source workflow engine that is particularly geared towards making human handoffs in a workflow. Its out-of-the-box functionality includes a set of services that allows breakpoints to be defined in a workflow, at which the workflow will enter a “wait” state until human intervention occurs. This provides the opportunity for handoffs between groups to occur and for intermediate calculations to be checked. Additional required software includes the open-source web portal software GridSphere ([7]) and the open-source Apache Tomcat project ([8]).
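The breakpoint/wait behavior can be sketched in plain Java. The following is a minimal illustration of the pattern described above, not the jBPM API; all class, method, and step names are invented, and steps prefixed "review:" stand in for human-review breakpoints.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class WorkflowSketch {
    enum State { RUNNING, WAITING, DONE }

    static class Workflow {
        private final Deque<String> steps;
        private State state = State.RUNNING;

        Workflow(Deque<String> steps) {
            this.steps = steps;
        }

        // Run automated steps until a human-review breakpoint or the end of the workflow.
        void runToBreakpoint() {
            while (!steps.isEmpty()) {
                String step = steps.poll();
                if (step.startsWith("review:")) {
                    state = State.WAITING;   // breakpoint: wait for human sign-off
                    return;
                }
                // ...an automated pipeline step would execute here...
            }
            state = State.DONE;
        }

        // Called once a human has reviewed and approved the intermediate results.
        void signal() {
            if (state == State.WAITING) {
                state = State.RUNNING;
                runToBreakpoint();
            }
        }

        State state() {
            return state;
        }
    }
}
```

The workflow advances automatically until it consumes a review step, parks in the WAITING state for a handoff, and resumes only when signaled, which mirrors the jBPM wait-state behavior the portal relies on.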
We combined the above pieces with a custom-designed workflow portlet that drives web access to the infrastructure, a J2EE ([9]) based interface to some of the deeply embedded functionality of jBPM, custom interfaces to the Kepler and LONI pipeline engines, and a versatile database that tracks workflows and stores results. All of the pieces of this project will be available as open source through the BIRN website ([10]). An important design principle is the definition of calculation “zones” that use consistent versions of the Java Virtual Machine, the pipeline engine (jBPM, Kepler, or LONI), and all the associated programs that will be used in the calculation. To this end, users may not upload new programs and must restrict themselves to software available in a predefined calculation zone. These zones are defined by BIRN administrators. It is possible to upload new workflows within a zone, although the process is complex and currently intended only for administrators.
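A calculation zone essentially pins a consistent set of versions. A descriptor for one might look as follows; the format, element names, and version numbers are entirely hypothetical and invented for illustration, not an actual BIRN artifact.

```xml
<!-- Hypothetical calculation-zone descriptor; all names and versions are illustrative -->
<calculation-zone name="example-freesurfer-zone">
  <jvm version="1.5"/>
  <pipeline-engine type="jBPM"/>
  <program name="Freesurfer" version="3.0"/>
</calculation-zone>
```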
The functionality works as follows; a request is illustrated in the following figures. Workflows are stored as objects and are invoked when an instance is started from the “Request” User Interface (UI) form. In the figure, the Request form starts a pre-defined LONI workflow that is overseen by jBPM. Data is retrieved from the SRB. As the workflow starts, runs, and finishes, updates are made to the custom database (DB), from which they are displayed in the “Status” and “Check on Status” UIs. Upon finishing, the “Final Report” UI shows confirmation of the run, and the resulting image data is then downloaded from the SRB. Numerical results can be downloaded from the custom DB. The “Check on Status and Continue” UI allows the intermediate states of the workflow to be checked and acted upon.
Discussion
Besides collaboration, a well-functioning public portal enables not only the initial calculation of experimental results, but also their recalculation for verification and the exploration of their parameter space. Each initial parameter used in a calculation has a certain influence on the results: if the parameter is changed slightly, the results may stay the same, or they may change by a certain amount. The amount of change in the results per change in the initial parameter may be graphed across the parameter space, and this graph will show where care must be taken with the initial guess of that parameter.
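The influence of an initial parameter can be estimated numerically. As a sketch, the following computes a central-difference approximation of the change in a result per unit change in one parameter; the function and values are illustrative stand-ins for a real calculation such as a structure volume.

```java
import java.util.function.DoubleUnaryOperator;

class Sensitivity {
    // Central-difference estimate of the change in a calculated result per
    // unit change in one initial parameter, holding all other parameters fixed.
    // calculation: the result as a function of the parameter of interest
    // p: the initial parameter value; h: a small perturbation step
    static double sensitivity(DoubleUnaryOperator calculation, double p, double h) {
        return (calculation.applyAsDouble(p + h) - calculation.applyAsDouble(p - h)) / (2 * h);
    }
}
```

Evaluating this sensitivity over a grid of parameter values yields exactly the kind of graph described above: regions where the sensitivity is large are regions where the initial guess must be chosen with care.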
Disadvantages of the currently architected portal exist, some of which have potential solutions, others of which are inherent to the architecture. Because of the careful ontology mapping and application provenance tracking requirements, more time must be spent setting up a calculation. This discourages quick, ad-hoc calculations from being performed; if one is in the initial stages of using a new application to perform calculations, the portal will be cumbersome. The de-identification of data prior to its use in calculations is also cumbersome in the initial phases of a project. We are currently working on optimizing this process, and it appears a software solution should help alleviate the problem. Finally, the architecture requires that hardware be available to perform the public calculations. In some ways, that is the entire purpose of the portal: to allow centralized hardware resources to be effectively utilized. Grid-enabling the architecture is part of the design, although this may not allow local resources to be utilized.
Setting up the BIRN analysis portal allows general use of BIRN resources and enables effective collaborations between sites. It allows greater exploration of recalculated experiments and the ability to routinely explore complex parameter spaces. The BIRN analysis portal is built as a completely open-source solution and is based upon existing workflow expression standards and architecture. The requirements of the BIRN analysis portal are common to those of other large projects, which should lead to code and design reuse.
Action Items
- Further work on Ontology management, Research subject management, and data provenance.
- Work towards complete support of ADNI calculations in October of 2006.
- Work towards support of a project using two web services provided by UCSD (Anders Dale) and JHU (Mike Miller).
- Streamline and possibly redefine BIRN-DUP so that images can be de-identified without user interaction.
- Formalize web services so they can be invoked from the infrastructure of XNAT and HID.