BIRN AHM 2005: Workflows Overview and discussion items


Workflows and Analyses Working Group BIRN All Hands Meeting Oct 18-19, 2005, San Diego


Background

Software tools for the complex processing of human imaging data are currently being developed in the Biomedical Informatics Research Network (BIRN). Workflow tools will allow a smoother analysis pipeline for the processing of images and accompanying non-image data. They will also allow for normalization of the data across sites so that data can be pooled in a unified manner.

Several tools are available (see below) that can be used to work with these data. The development and customization of these tools will allow existing script-based analysis pipelines to move into an era of improved metadata management and the ability to take advantage of web services. Meticulous metadata management is essential when pooling data from various locations; without a central overseeing body, software tools that achieve this end are of great importance. The ability to use web services provides a way to perform distributed computing and, on a larger scale, a way to allow rapid deployment of new computational algorithms. This is achieved by letting those who are actually developing a specific computational algorithm own and maintain the corresponding web service.

A workflows & data analyses working group has been formed to help coordinate these efforts. The group started with a focus on the requirements of the Morphometry BIRN, but has broadened to obtain feedback from the other testbeds. A first meeting with representatives from all BIRN testbeds was held in July 2005 in San Diego. Summary notes from this meeting can be found on mBIRN’s wiki pages (see References). At the upcoming BIRN AHM 2005, the Workflow & Analyses group will meet to report on progress and follow up on the discussions from the July meeting.


AHM 2005 goals for the Workflow & Analyses Working Group

The tentative agenda for these two half-day working meetings is on the wiki pages (see References). In general terms, the group will: a) get updated on the current state and development plans from three sites (MGH, UCLA and UCSD); b) review the relevance of the applications that should be driving the workflow developments; c) agree on an agenda for the workflow track that allows the exchange of ideas regarding analysis workflow tools; and d) agree on when the next meeting of the group should happen. A summary report from the group will be presented to the whole BIRN group in the afternoon of Wed Oct 19, outlining the milestones/dates/names agreed upon by the group.


Outline of current workflow strategies that will be discussed

Several tools will be presented in detail in their current state of development, followed by discussion regarding the use of these tools and future directions for each. The discussions will focus on the following tools:

1) Kepler (Murphy, Pieper, Kolasny) http://www.kepler-project.org

Kepler is a visual modeling tool written in Java, built on the Ptolemy II platform (http://ptolemy.eecs.berkeley.edu/), which was begun in 1997 at UC Berkeley. Several recent SDM efforts have extended Ptolemy II to allow for the drag-and-drop creation of scientific workflows from libraries of actors. A Ptolemy actor is often a wrapper around a call to a web service or grid service. Ptolemy leverages an XML-based meta-language called the Modeling Markup Language (MoML) to produce a workflow document describing the relationships of the entities, properties, and ports in a workflow. Presently, Ptolemy actor libraries exist for the domains of bioinformatics and ecology at NCSU and SDSC.

The process of creating a workflow with the Ptolemy software is centered on writing Java classes that extend a built-in Actor class. Usually, an actor corresponds to a MoML entity and acts as a wrapper around a web service stub or a local program. A concrete workflow consists of a series of these actors interacting in ways deemed legal by the abstract workflow document.
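As a rough illustration of this actor pattern, here is a minimal sketch of a custom Ptolemy II actor in Java. The class name, port names, and the trivial scaling behavior are invented for the example; in a BIRN workflow the fire() body would instead call a web service stub or launch a local image-processing program.

```java
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.data.DoubleToken;
import ptolemy.data.type.BaseType;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

/** Hypothetical actor that doubles each incoming value; a real BIRN actor
 *  would wrap a web service stub or a local program instead. */
public class ScaleActor extends TypedAtomicActor {

    public TypedIOPort input;
    public TypedIOPort output;

    public ScaleActor(CompositeEntity container, String name)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        // Declare the ports that appear in the MoML entity description.
        input = new TypedIOPort(this, "input", true, false);
        output = new TypedIOPort(this, "output", false, true);
        input.setTypeEquals(BaseType.DOUBLE);
        output.setTypeEquals(BaseType.DOUBLE);
    }

    /** Called by the director each time the actor is scheduled to run. */
    public void fire() throws IllegalActionException {
        super.fire();
        if (input.hasToken(0)) {
            double value = ((DoubleToken) input.get(0)).doubleValue();
            output.send(0, new DoubleToken(value * 2.0));
        }
    }
}
```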

2) LONI pipeline (Toga, Pan) http://www.loni.ucla.edu/twiki/bin/view/Pipeline/

The LONI Pipeline is a simple graphical environment for constructing complex scientific analyses of data. It provides a visually intuitive interface to data analysis while also allowing diverse programs to interact seamlessly. The Pipeline allows researchers to share their methods of analysis with each other easily and provides a simple platform for distributing new programs, as well as program updates, to the desired community. The environment also takes advantage of supercomputing environments by automatically parallelizing data-independent programs in a given analysis whenever possible.

A modular architecture, with task-specific components that provide quick and easy customization, is currently under development. This architecture has enabled dramatic improvements in the upcoming v3 release, including improved security, grid computing integration, automated process and resource management, advanced debugging, and fault tolerance.

3) jBPM (Dale, Ozyurt) http://jbpm.org/

JBoss jBPM is a flexible, extensible workflow management system. Business processes, expressed in a simple and powerful language and packaged in process archives, serve as input for the JBoss jBPM runtime server. JBoss jBPM bridges the gap between managers and developers by giving them a common language: the JBoss jBPM Process definition language (jPdl). This gives software project managers much more control over their software development efforts. After loading the process archive, users or systems perform single steps of the process; JBoss jBPM maintains the state and logs, and performs all automated actions. JBoss jBPM combines easy development of workflow applications with excellent enterprise application integration (EAI) capabilities, and includes a web application and a scheduler. It can be used in an environment as simple as an Ant task and scale up to a clustered J2EE application. For ease of evaluation, there is a download of a preconfigured JBoss application server.
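The load-then-step pattern described above can be illustrated with a minimal sketch against the jBPM 3 Java API. The process and node names below are invented, and the example drives the process from plain Java rather than from the runtime server; it only shows how a jPdl definition is parsed, how an instance is started, and how jBPM keeps track of where the execution is waiting.

```java
import org.jbpm.graph.def.ProcessDefinition;
import org.jbpm.graph.exe.ProcessInstance;

public class JbpmSketch {
    public static void main(String[] args) {
        // Hypothetical jPdl definition; process and node names are made up.
        ProcessDefinition definition = ProcessDefinition.parseXmlString(
            "<process-definition name='analysis'>" +
            "  <start-state name='start'>" +
            "    <transition to='segment'/>" +
            "  </start-state>" +
            "  <state name='segment'>" +
            "    <transition to='end'/>" +
            "  </state>" +
            "  <end-state name='end'/>" +
            "</process-definition>");

        // jBPM maintains the state of each execution as a process instance.
        ProcessInstance instance = new ProcessInstance(definition);
        instance.signal(); // leave the start state; now waiting in 'segment'
        System.out.println("Waiting in: "
            + instance.getRootToken().getNode().getName());

        instance.signal(); // a user or system completes the step
        System.out.println("Ended: " + instance.hasEnded());
    }
}
```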

4) Imagine (Gerig, UNC) http://www.ia.unc.edu/dev/download/imagine/index.htm

Motivated by the needs of large clinical neuroimaging projects for fully automatic batch processing of complex sequences and networks of programs, and by the availability of a large library of ITK tools, the UNC Neuroimaging Laboratory has developed a pipeline processing tool that includes a large set of commonly used 3D image processing tools (multimodal tissue segmentation, bias correction, linear and nonlinear mutual information registration, high-dimensional fluid registration, reformatting, mathematical morphology methods, etc.). Imagine is used to design image processing pipelines and to run them on large sets of data as batch jobs, supporting multiprocessor machines. Imagine is composed of: a) a pipeline creator to generate the pipeline, b) an image viewer to look at the 3D images, and c) a wizard for easily running the pipeline. The tool can be used without any programming experience, is easy to learn, and has been tested in large clinical studies with hundreds of 3D images (tutorial at http://www.ia.unc.edu/dev/tutorials/Imagine/index.htm).

Imagine will be followed by Imagine 2, a cross-platform package that combines pipeline processing with a visual programming environment, similar in spirit to AVS. Imagine 2, developed by Matthieu Jomier, is a user-friendly interface that generates pipelines by connecting modules together and does not require advanced programming skills. It integrates ITK for image processing and VTK for visualization, and also enables new modules or command-line executables to be easily added. A full graphical user interface can be generated to run the created pipeline easily, and source code describing the pipeline can be written automatically in C/C++ with Doxygen documentation. The tool will integrate most ITK filters, especially registration modules and spatial objects.


Next steps proposed for discussion are:

Having agreed on scientifically relevant applications as drivers of the technological workflow developments, the discussion will include:

1) Arrive at a consensus for a set of workflow languages that can at least be interconverted using a specific transformation sequence. This should be possible by adhering to the support of a specific set of workflow patterns. Guaranteeing the existence of appropriate modules and data types is addressed in (2) below. The workflow engines used in mBIRN should be able to support these patterns; just as importantly, workflows should not be expected to support patterns outside of this set. The patterns supported by most scientific workflows appear to be Sequence (execute activities in sequence), Parallel Split (execute activities in parallel), Synchronization (synchronize two parallel threads of execution), Exclusive Choice (choose one execution path from many alternatives), Simple Merge (merge two alternative execution paths), Arbitrary Cycles (execute a workflow graph without any structural restriction on loops), Cancel Case (cancel the process), and Sub-workflow (place a workflow within a larger workflow). Clearly the business workflow engines (jBPM) are capable of many more patterns than these, but if they were artificially restricted to this set they could also potentially be interconverted with the scientific workflows (Kepler, LONI). Complex activities requiring patterns outside of these specifications could possibly be supported if specific modules existed for the engines to support them. A sketch of how this common pattern set could be written down in a tool-neutral form follows below.
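As a concrete illustration, the sketch below writes the common pattern set down as a small Java vocabulary for a hypothetical, tool-neutral interchange representation. None of the names come from an existing BIRN tool; they only show how a workflow restricted to these patterns could be described once and then translated into Kepler/MoML, the LONI Pipeline format, or jPdl by per-engine converters.

```java
import java.util.List;

/** Hypothetical, tool-neutral description of a workflow that uses only the
 *  agreed-upon pattern set; a converter per engine (Kepler, LONI Pipeline,
 *  jBPM) would translate to and from this representation. */
public interface WorkflowNode {

    /** The control-flow patterns the interchange format commits to. */
    enum Pattern {
        SEQUENCE,          // execute activities one after another
        PARALLEL_SPLIT,    // execute branches concurrently
        SYNCHRONIZATION,   // join parallel branches back together
        EXCLUSIVE_CHOICE,  // pick one branch among alternatives
        SIMPLE_MERGE,      // merge alternative branches
        ARBITRARY_CYCLES,  // unstructured loops in the workflow graph
        CANCEL_CASE,       // abort the whole process
        SUB_WORKFLOW       // embed another workflow as a single node
    }

    Pattern pattern();

    /** The activities or nested nodes this node coordinates. */
    List<WorkflowNode> children();
}
```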

2) Set up a repository of programming modules, web services, and data types to be used in workflows. Scientific workflows achieve most of their functionality through the software programs that they tie together. These programs are highly specialized and rapidly changing. If scientific workflow engines are expected to operate with these programs, the programs must be available. Using web services to perform these specialized functions removes the challenge of delivering the programs to the workflow engine's environment, but the locations of the web services still need to be published; this can be solved by putting a web service discovery system into place. Finally, the data types used by the programs can be complex and specific down to sub-versions of format types. A repository of data type definitions (commonly known as “standards”) and of conversion programs that allow one data type to be converted to another will be necessary to achieve workflow engine convertibility (see the sketch below).
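A minimal sketch of what such a conversion repository might look like is given below, assuming a hypothetical in-memory Java registry keyed by data-type names. The registry API and the format names used in the comments are invented for the example; only the idea of publishing named data types and looking up a converter between two of them comes from the text, and a real repository would also record versions, provenance, and the web service endpoints that perform each conversion.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry of data-type definitions and converters between them. */
public class DataTypeRegistry {

    /** A converter from one registered format to another. */
    public interface Converter {
        File convert(File input) throws Exception;
    }

    private final Map<String, Converter> converters = new HashMap<String, Converter>();

    private String key(String fromType, String toType) {
        return fromType + "->" + toType;
    }

    /** Publish a converter between two named (and versioned) data types,
     *  e.g. register("MINC-2", "NIfTI-1", mincToNifti). */
    public void register(String fromType, String toType, Converter converter) {
        converters.put(key(fromType, toType), converter);
    }

    /** Look up the converter a workflow engine needs before wiring two modules. */
    public Converter lookup(String fromType, String toType) {
        Converter c = converters.get(key(fromType, toType));
        if (c == null) {
            throw new IllegalArgumentException(
                "No converter registered for " + fromType + " -> " + toType);
        }
        return c;
    }
}
```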

3) Set up web services to provide enumerated values to be used as choices in the workflows. Workflow definition languages provide an organization of slots in a document, similar to providing a standard request form to fill out. They do not provide an interchangeable vocabulary for filling out the form with values. For example, the programming modules and data types referred to in (2) will need to have names that are entered in the workflow diagram. Either these names will need to be presented prospectively to users while they are creating their workflows, or a retrospective mapping will need to be made between names to keep them organized and pointing to the same module. This service will also allow mappings to be expressed between physical entities that have different names at different sites or in different workflows but refer to the same object.

4) Security infrastructure for workflow services. A key requirement for the workflow services to access compute and data resources is the integration of these services with the core BIRN infrastructure. This core security infrastructure is based on the Grid Security Infrastructure (GSI), which provides the Public Key Infrastructure (PKI) components that allow testbed researchers to launch long-running jobs on any available computing resources on the BIRN grid. Specifically, there are three aspects of security to be considered. First, the workflow services and systems need to be able to use GSI for authentication and authorization. Second, there must be a catalog of trusted web services and a method for a client to verify that it is indeed connecting to a trusted web service. Third, the web service will need to verify that the client is trusted and allowed to use the resources that it provides.


References:

UCSD Workflow Development Retreat (July 17-18, 2005, San Diego): http://www.na-mic.org/Wiki/index.php/Mbirn:_UCSD_Workflow_Development_Retreat_%28July_17-_18%2C_2005%29

Workflow & Analyses Working Group Agenda for AHM 2005: http://www.na-mic.org/Wiki/index.php/2005_AHM_Planning:_Workflows/Analyses_Working_Group

General Workflow Information on mBIRN wiki pages http://www.na-mic.org/Wiki/index.php/Mbirn:_Computational_Informatics, http://www.na-mic.org/Wiki/index.php/Workflow_Updates