SDIWG:Ongoing Discussion


Meeting notes for NIH Roadmap National Centers for Biomedical Computing (NCBC) Centers and the Software and Data Integration Working Group (SDIWG).

Links to the NCBC Working Groups and recent discussions with PIs in 2011

Charter of SDI: The RFA states the goal of creating “the networked national effort to build the computational infrastructure for biomedical computing for the nation”. Here is a link to the wording in the original RFA-RM-04-002 [1]. In furthering this, the goals of the SDIWG, in concert with the Project Team and Centers staff, are: 1. To advance the domain sciences and promote software interoperability and data exchange. 2. To capture the collective knowledge of software engineering practices among the Centers and publish this knowledge widely.

Summary: The three efforts proposed under SDI are:

1. All Software Engineering Meeting. This was proposed by Bill Lorensen of Kikinis’ Center. The best time for this meeting is in concert with Digital Biology 2005, which BISTIC will run later this year (or early next year). The impetus for this meeting stemmed from the realization that the ITK/VTK software development process is highly mature and that the other Centers could have a lot to share. It is not expected that the software engineering meeting will endorse a single monolithic set of practices; e.g., clinical researchers may need to develop along different lines than the ITK/VTK process. This is potentially a one-of-a-kind meeting where software engineering and data management efforts that span the Centers are compared, lessons learned are captured and published, and the Centers can help each other develop toward mature processes. Preliminary dialogue will be conducted electronically, hosted on the NAMIC wiki site http://na-mic.org/Wiki/index.php/Main_Page.

2. Yearly interoperability demonstrations. This will require some continuous effort on the part of the SDIWG and the Centers, leading up to a ‘live’ yearly demonstration, the first at a major conference in the spring of ’06. The goal is to promote practices that will enhance software interoperability and data exchange, and to demonstrate that publicly. The plan is for the Centers to develop and run a (computer) application that exercises some components of each Center’s software and illustrates how they can be made to work together (i.e., interoperate) to solve a common, meaningful problem. The plan is for the consortium (involved program staff plus Centers staff) to decide on the goals and parameters of the demonstration in the coming months. It is expected that if the consortium cannot decide on a single application that covers all Centers, then we will work on a cluster of applications that are meaningful to subgroups of Centers. Initially, intra-operability demonstrations may be appropriate. The demonstration topic(s) and the conference will be decided a year in advance. The demonstration may involve formal presentations. The scale and scope of the demonstration should not require innovation just to support it. Coordination of the demonstrations will be supported by the NIH.

3. Foster interactions and collaborations with investigators and Centers that are relevant to the NCBC Program but funded from other sources.

These efforts involve time commitments from the Centers staff (the interoperability demonstrations will be continuous throughout the life of the Centers, while the All Software Engineering Meeting is potentially only one-of-a-kind) and will need to be coordinated with other activities that are being proposed from the Program (little P and big P) side.

Discussion Points: General

General comment: there are two types of ‘big P’ activity: (i) opportunistic synergy between two or more teams (e.g., imaging as phenotype), and (ii) changing the culture of biomedical computing through, e.g., interaction with standards bodies, leadership in developing processes for biomedical software engineering, developing community databases, etc. The first type (i) is clearly useful. The second type (ii) is more difficult. So far the PIs have self-organized successfully more around the first type (i).

Questions raised include:

• How do we capture and publish the successful, sometimes unique, software engineering approaches of the Centers? For example, what is a good granularity for software development repositories?

• Does anyone use ‘requirements, design, implementation, testing,…’ to develop software? How does Extreme Programming fit into the software engineering practices that are needed for the Centers?

• How will NCBC disseminate software, tools, and data? Will the NCBC expose CVS, or expose Java files, or whatever? Should the NCBC develop a common portal?

• How are applications built (early vs. late binding), and how does that relate to the domain science?

• How do the database, data modeling, and curation approaches of each Center differ? Should the teams adopt leadership roles in standards, data modeling, etc.? Interactions with standards bodies can slow down the progress of an NCBC Center, especially in the early stages.

• How are the 7 Centers going to connect with related efforts that are funded under other programs, e.g., the modeling and simulation community, the image post-processing community, data acquisition and databases?

• Communications: videoconferencing, teleconferencing, Wiki. Some investigators are not convinced that large-scale videoconferencing is all that useful.

• IP: Centers may need real Program help here.

Discussion Points: Compatibility/Interoperability Issues

At the SDIWG teleconference on 20050218, Peter Covitz talked about the interoperability issues being faced in the caBIG effort. At the outset there is a need to define what is meant by interoperability: there are multiple categories and degrees of stringency, illustrated by a 2D matrix with ‘Category’ on one axis and ‘Stringency’ on the other. Here is the link to the caBIG Compatibility document.

Draft caBIG Compatibility Guidelines Revision 2: caBIG participants have provided extensive feedback on the caBIG Compatibility Guidelines. Revision 2 of the document incorporates that feedback, and is now available in draft form for additional public comment. Please send any comments you may have to ncicb@pop.nci.nih.gov. The final form of Revision 2 will be posted no later than June 2005, and at that time will become the prevailing operational guideline for the caBIG program. You can go to the caBIG Compatibility web page for the latest version of the guidelines.


Discussion Points: Interaction Matrix--Current list of possible interactions among the Centers, based on overlapping computer science and domain science areas

• Kikinis-Altman: Extending the ITK/VTK approach for software development to SimTK. Links with imaging and modeling.

• Kikinis-Toga: Shared interest in neuroinformatics, mostly dealing with the LONI pipeline.

• Kikinis-Kohane: Using imaging as a characterization of phenotype (lung: COPD or asthma; DiGeorge syndrome). Databases are also common ground through Partners (Glasser), and there is a data-sharing connection with BIRN.

• Altman-Toga: Database theory. Modeling.

• Altman-Kohane: Standards and databases. Genotype-phenotype studies.

• Toga-Kohane: Gene map and Huntington’s. Image as phenotype.

Debrief on December 8 Software and Data Integration discussion with Altman Team

Wednesday 20050112 Attendees on phone: Russ Altman, Jeanette Schmidt, Scott Delp, Michael Sherman (chief software architect), David Paik (exec team SimBios), Peter Highnam, Peter Good, and Peter Lyster.

Please note that this material has been edited to make it easier to read and should not be considered a transcript.

Preliminary concerns: resources associated with the networking effort. The focus should be on things in the critical path, and on making sure that this is publicized. There should be limited extra drag.

Delp: There is a trade-off between little p and big P, and that may involve losing student slots to big P.

Lyster: Asks how to move beyond the low-hanging fruit and develop a national profile and impact. Altman: With success we need to disseminate the knowledge gained through written peer-reviewed reports, panel discussions, and workshops, similar to PharmGKB—that effort was similar to the NCBCs in that it involved separate awards that were not initially formed to be a network, and so it took a number of years to form the network. Thus we need time. Then hit the road with conference panels and white papers. The second big P (see discussion points above) will take time to mature. The low-hanging fruit (first big P) is on a short timescale.

Schmidt: A plan for full integration of ITK/VTK/SimTK would not be simple, but we will certainly do it in at least an incremental fashion, beginning with applications that use all these packages.

Altman: A software engineering summit on data and best practices in the fall is a good idea. The interoperability demo needs to demonstrate examples early on, like ITK/VTK/SimTK. It should follow the software engineering summit -- keep in mind that SimTK is in an early development state compared with VTK and ITK. Fall ’05 is overly optimistic. Start that discussion soon, and then allow a 12-month lead time.

ITK may have a commercial tie-in, whereas SimTK is scientific and will develop over a long timescale. How to gather the researchers in related domain fields around the country: the competition is in the DBPs, e.g., neuromuscular biomechanics. In other areas there are fewer researchers in the community who overlap with the Stanford Center. For RNA modeling there are strong ties to such efforts as Amber, CHARMM, X-Plor, and GROMACS, so there are a lot of community connections. However, for myosin modeling at larger spatial scales there are fewer efforts. So there are some low-risk and some high-risk areas.

Sherman: The Stanford Center is aware of the ITK/VTK vs. late-binding philosophy. We will have a thin framework in which a wide variety of software modules can be obtained from our repositories and assembled into novel combinations. We expect this to include a large number of simple, purpose-built applications. Some of these could be constructed for web delivery, but that is not our primary focus at the moment. Development of good user interfaces is a big challenge, and we hope to support a great deal of lightweight experimentation, allowing for some Darwinian evolution towards increased utility. Their primary mission is to deal with models. They don’t think early and late binding are mutually exclusive approaches; indeed, ITK’s C++ can be used to develop nice web applications.
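
Sherman’s “thin framework” idea, where modules obtained from a repository are assembled into novel combinations, can be sketched roughly as a registry plus a composer. The following Python sketch is purely speculative illustration, not SimTK’s actual design; all names here are hypothetical:

```python
# Illustrative sketch of a "thin framework": modules register themselves,
# and applications are assembled by chaining registered modules at run time.
# All names are hypothetical; this is not SimTK's actual API.

REGISTRY = {}

def register(name):
    """Decorator that adds a processing module to the framework registry."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("smooth")
def smooth(data):
    # Placeholder for a real smoothing module.
    return [0.5 * (a + b) for a, b in zip(data, data[1:] + data[-1:])]

@register("threshold")
def threshold(data, cutoff=0.5):
    return [x for x in data if x >= cutoff]

def assemble(*stage_names):
    """Build a novel, purpose-built application by composing registered modules."""
    stages = [REGISTRY[n] for n in stage_names]
    def app(data):
        for stage in stages:
            data = stage(data)
        return data
    return app

if __name__ == "__main__":
    pipeline = assemble("smooth", "threshold")
    print(pipeline([0.2, 0.9, 0.4, 0.8]))
```

The design choice this illustrates is that the framework itself stays thin: it only stores and chains modules, so lightweight experimentation with new combinations costs one line per application.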

SimTK development will maintain a journal (the SimTK Engineering Journal) with design documents and technical discussions for use by our development team. The primary feature is to have archival material that we can reference from within our code, although it also provides a nice way to share some of what we're doing with interested parties, locally and elsewhere. In the past, I have had good luck getting programmers interested in making contributions to a journal like this, some more than others. The bar will be set very low to encourage submissions, and a forward-referencing facility will be provided so that readers are led to errata, updates, and replacements. We are working on getting this published to SimTK.org, and as soon as it is there I will put links to it from the na-mic Wiki; this may be useful for SDI purposes.

Schmidt: They can certainly explore exposing the CVS repository. [Sherm technical note: our current intention is to use the CVS successor, Subversion.] Regarding database curation, there will be substantial differences between the Centers, and these can be dealt with later in the effort. With the current mix of basic biomedical computing and preclinical efforts, not every Center will have the same policy for curation. The database issues for the Stanford modeling effort are mostly on the back end, e.g., they will create data from simulations. They will need informatics to data-mine it later, i.e., they will not have large-scale problems with data management any time soon. Regarding the ITK/VTK approach, Stanford can check in models without putting core software at risk. They will talk with ITK/VTK (Lorensen/Kikinis) regarding key issues such as the optimal frequency of builds. Keep in mind that it is not a requirement to build everything at once. It will be important to define interfaces.

Highnam: Build assembly is one piece of the puzzle, but a separate important issue for software development is discipline with regression testing. It is important to develop an objective way to accept or reject submissions. Regarding interoperability, the purpose is not to corral everyone into using the same standards, but to make interoperability easier. Use an occasion such as RSNA/SC/SFN/AMIA to show off wares.
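
Highnam’s “objective way to reject or accept submissions” amounts to gating every submission on a regression suite: the submission is accepted only if every previously validated result is reproduced. A minimal sketch in Python, with hypothetical test cases and tolerance, of what such a gate might look like (not any Center’s actual process):

```python
# Minimal sketch of an objective accept/reject gate for code submissions:
# a submission is accepted only if the full regression suite passes.
# The baseline cases and tolerance below are hypothetical placeholders.

def run_regression_suite(candidate_fn):
    """Each case pairs an input with the previously validated output."""
    baseline = [
        ((0.0,), 1.0),
        ((1.0,), 2.718281828459045),
    ]
    failures = []
    for args, expected in baseline:
        got = candidate_fn(*args)
        if abs(got - expected) > 1e-9:   # objective, pre-agreed tolerance
            failures.append((args, expected, got))
    return failures

def accept_submission(candidate_fn):
    failures = run_regression_suite(candidate_fn)
    if failures:
        print("REJECTED:", failures)
        return False
    print("ACCEPTED: all regression tests pass")
    return True

if __name__ == "__main__":
    import math
    accept_submission(math.exp)   # a candidate implementation under review
```

The point of the sketch is that acceptance is mechanical and reproducible: no judgment call is needed once the baseline and tolerance are agreed upon.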

Altman: A candidate for an interoperability demonstration is simulation with visualization enabled by ITK/VTK. Either neuromuscular simulation or RNA simulation could be visualized by ITK/VTK.

Delp: Can we use funds for the Evaluation effort to support an interoperability demonstration?

Altman: What are the expectations of collaborating projects? Ultimately, the Stanford Center may receive requests from 20-50 groups. There is a concern about gridlock. There is a cost for each initial interaction—time spent on the phone. There have been numerous contacts from other large and small groups around the country. Good: Perhaps we will have to come up with guidelines across Centers: a two-stage interaction, superficial followed by detailed after funding. The Project team is trying to provide a matchmaking service.

Altman: Although the working group is titled “Software and Data Integration”, a metric of our own success should be the added ‘expertise and experience’ when this is done—this is beyond just interoperability.

Debrief on December 8 Software and Data Integration discussion with Toga Team

Friday 20050107 CCB Attendees on phone: Arthur Toga, Ivo Dinov, Tony Chan, John Haller, Greg Farber, and Peter Lyster

Toga: Taking advantage of the pairwise interactions of mutual interest, aka the low-hanging fruit, is good and can be leveraged quickly. More involved efforts can happen later. Kikinis’ and Toga’s Centers are well under way with discussions of mutual interactions. He suggests not changing what they’re doing, and seeking good demonstration projects. The interaction matrix should be developed with specifics. Individuals can be matrixed. Then come up with things that are not yet under way.

Haller: Asks if there are examples of interoperability demonstrations. Kohane’s gene map of Huntington’s should make a good connection with the Toga Center’s atlas effort. Kikinis’ Center has obvious possibilities—this week, Jags came on board and will interface between Toga’s and Kikinis’ Centers. Jags will also bring a commercial-sector perspective to the academic model.

Toga: Suggests there is a need for two types of meetings: (i) leadership (as on Dec 8); (ii) technical staff meeting without leadership. For example, he has sent people to Harvard. Sociology is important. Video conferences should not be too frequent. Toga is interested in how to develop spontaneous relationships.

Toga: Because they have a hire (Jags), Toga’s team can absorb some of the cost of an interoperability test. The mechanics of an interoperability test: is it seamless, should it use software and databases, how much real-time user mediation should there be? It needs to be more than just cosmetic. Perhaps we should have a joint meeting this spring, or else we won’t all be on the same page, like kids whispering different messages to each other on the school bus. Farber says that one meeting together is better. Arthur agrees. The first priority is to decide on the parameters of interoperability.

Toga: It would be good to coordinate early regarding software standards and data models. It would be good to look at the seven Centers as a unified whole.

Haller: The lead science officer (LSO) has a job to facilitate. Toga thinks an informal visit is a good way to start, with periodic conference calls—not every week. There is an internal CCB web site. Dinov says they have special interest groups that meet every 3 to 4 weeks and hold focused discussions on techniques, tools, distribution, planning, evals, education, etc. Toga says they will soon have improved videoconference capability.

Toga: Regarding the ITK/VTK/SimTK development process, the Toga Center’s core development is relatively centralized, and their move to open source comes later in the cycle. This is not expected to change the ultimate outcome (software quality). Some parts of development will use Kikinis’ Center’s expertise. Once the software is stable it will be placed on an open platform. The proof will be: if it can’t be used by others then it’s not useful. Java is the front end (runs everywhere), while the compute core(s) are C++, so it is easily adopted. All CCB Java efforts take place in CVS. Ivo: Jags is starting on a project to use Java interfaces to interact with Java tools and precompiled C++ libraries. The obvious thing to do is to work on the interface between ITK and Toga’s tools.

Toga: His team doesn’t need to be a leader in standards. The main issue is ontology development, where it is important to develop different structures. Toga’s Center deals with anatomical structures—Ivo is the leader here, and the DBPs relate to other efforts in the country. A key issue will be the development/discovery of biomarkers. Scientific developments should be taken into account as much as computational tool developments.

Toga: There are several leading groups that are starting to adopt his computational atlas and database infrastructure. This could feed into the All Software Engineering meeting at the right time.

How will NCBC disseminate software, tools, and data? Toga: A de-centralized model: each NIH/NCBC Center should create a software (SW) download page that simply has direct links to the individual Center’s SW downloading sites. So, the NCBC site only contains one-paragraph descriptions of all SW available at all NCBC Centers. All technical specifications and instructions will be provided at the specific Center’s pages.

Toga: Pros: The most up-to-date software is made immediately available; Centers retain distribution rights/responsibilities (e.g., support); the least maintenance is required.

Will the NCBC expose CVS, or expose Java files, or whatever? Toga: No. But the Centers probably will, or will at least have a downloadable link to the current SW source code.

Should the NCBC develop a common portal? Toga: Sure, but having all NCBC pages freely available without a portal (using access/cookie counts) seems better. Why should users log in? Is there a real need (e.g., stats reports, evals, etc.)?

How are applications built (early vs. late binding), and how does that relate to the domain science? Toga: Continuous, integral binding of applications across Centers will be the most efficient; attempting a complete early binding will slow down development. Late binding won’t work, as foundations may be vastly different and require major redesigns to integrate.
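
The early/late-binding distinction recurs throughout these notes. As a concrete illustration only (the module names are hypothetical, and this is Python rather than the Centers’ Java/C++ stacks): early binding fixes the dependency when the application is written or built, while late binding resolves it at run time against an agreed-upon interface:

```python
# Early vs. late binding, illustrated in Python. Module names are hypothetical.

# Early binding: the dependency is fixed when the application is written/built.
# import center_a_tools                      # fails at startup if absent
# result = center_a_tools.segment(volume)

# Late binding: the component is chosen and loaded at run time, so the
# application only needs an agreed-upon interface, not the package itself.
import importlib

def load_component(module_name, function_name="segment"):
    """Resolve a processing component by name at run time."""
    module = importlib.import_module(module_name)   # deferred to run time
    return getattr(module, function_name)

if __name__ == "__main__":
    # The same application can bind to whichever Center's tool is installed.
    for candidate in ("center_a_tools", "center_b_tools"):
        try:
            segment = load_component(candidate)
            print("bound to", candidate)
            break
        except ImportError:
            print(candidate, "not available; trying next")
```

Toga’s caveat applies here: late binding only works if the shared interface (`segment` above) actually means the same thing in each Center’s foundation; otherwise the integration requires the major redesigns he warns about.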

Toga: CCB/NAMIC efforts are well synched, as we hold joint meetings, exchange developer visits, and integrate tools as we go. It is not clear whether a similar integration of all 7-8 NCBCs is possible, however.

Toga: Recommendation: Perhaps we can recommend that each NCBC Center should have close tool integration with at least one other NCBC, possibly more. This will ensure we have a connected network – all Centers are interconnected and do form a basis for a National Biomedical Computing Network.

Toga: Joint yearly meetings of high-level Center principals may help spot and advance inter-Center relations and tools integration. More frequent meetings are not recommended.

--Dinov 17:05, 18 Mar 2005 (EST)

Debrief on December 8 Software and Data Integration discussion with Kohane team

Wednesday 20041229 Attendees on phone: Isaac Kohane, Susanne Churchill, Henry Chueh, Valentina Di Francesco, and Peter Lyster

Kohane: A number of issues we could discuss in this conversation: 1. standardization of tools and conformance procedures, 2. standardization of data, 3. interoperability demonstrations, 4. common repositories or inventories.

Kohane: Regarding interoperability, he believes that one of the central points of contention is going to be: at what level of interoperability? Does it mean depositing (data and software) in the same repository, or that data and tasks can be passed around “seamlessly” between applications written by different NCBCs? Or is it somewhere in between? The first option is the easiest but does not test the real underlying question of synergy in our efforts across the NCBCs. The other option is going to require much more coordination, agreement on the overall task, and a much higher level of interoperability. It will also require more resources.

Lyster: What is the software development process (requirements, design, implementation, testing, validation,…)? Is there a software repository? Do they intend to give open access to, say, the CVS repository, or disseminate the software ‘bundled’? What role does Kohane’s team see the ITK/VTK/SimTK approach playing in their own Center?

Kohane: A good repository is one that is routinely run through a conformance suite. Of the four NCBCs, Kohane’s is the most oriented toward human disease and genes; as such they deal with microarray datasets, standard classification by supervised and unsupervised learning, translation from one vocabulary to another (GO, SwissProt,…), natural language processing, and extraction of coded items. All this has to be folded into ‘interoperability’. The Kohane approach is more along the lines of process control, and might use Web services. Chueh has developed a Hive (or catacomb) approach (see below). This is flexible, and the Kohane team could readily embed ITK tools in their pipeline or workflow, e.g., for CT scans of the bronchial tree. Huntington’s is not standard imaging, but there is a good chance of overlap with the gene-expression side of the Toga Center’s atlas work. In relation to I2B2, the ITK approach may fit the beginnings of the interoperability question, whereas the goal for I2B2 is to answer head-on what it takes to integrate disparate tools into a generic workflow environment through choreographed web services.

Lyster: Does Hive expose the source code (‘open to developers and users’)? Kohane: Dissemination means sharing the open source, exposing the cell locally, and including it in the workflow. Chueh: the Kohane Center may develop cells using the ITK approach. Kohane: Any potential I2B2TK (toolkit) has the same generic aspects as ITK (test data sets, dashboard, etc.), but the Hive approach applies to a different level of the workflow. The Hive could be populated by different widgets. There is a great need to emphasize interoperability between the Kohane Center’s widgets and other widgets. This can be a candidate for an interoperability demonstration. And again, that is a level of interoperability that goes beyond what is currently available in ITK, and Kohane would like to see a joint effort across all TK-activities to reach this next level of integration. But because that is a vast task, focusing on a single application/demonstration testbed will be an important requirement for success.

Churchill: We are going to know the new Centers by summer, and hopefully one will have a clinical flavor. Kohane’s Center expects to lead by example, and will be happy to deal with open or closed institutions. A key aspect of software that is intended for clinical research deals with semantics. Thus an ‘early bind’, or monolithic, approach to the software application build process may not be as appropriate—the Hive will promote late binding. In general, a loosely coupled interconnect is useful for clinical-type situations. Although a lot of the software development process of the Hive is traditional, the requirement for a loosely coupled build may be more critical for clinical research than in other areas of scientific computing. In this setting it is difficult to have a stricter or open(?) protocol. The Hive espouses a lightweight interconnect model, like the web. They develop software using Java/J2EE with an open source development process.
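
The “lightweight interconnect model, like the web” can be illustrated with a toy web-service cell. The Hive itself is a Java/J2EE effort; the Python XML-RPC sketch below (hypothetical cell and method names) only shows the loose-coupling style, where a widget is invoked by URL and method signature rather than linked in at build time:

```python
# Minimal sketch of a loosely coupled "cell" exposed as a web service.
# The real Hive is Java/J2EE; this Python XML-RPC example only illustrates
# the lightweight-interconnect idea. All names are hypothetical.
from xmlrpc.server import SimpleXMLRPCServer

def annotate_terms(terms):
    """Stand-in for a real cell, e.g., vocabulary translation (GO, SwissProt, ...)."""
    return {t: t.upper() for t in terms}   # placeholder transformation

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(annotate_terms, "annotate_terms")
print("vocabulary cell listening on http://localhost:8000/")
server.serve_forever()
```

A consuming widget elsewhere would need only the URL and method signature, e.g. `xmlrpc.client.ServerProxy("http://localhost:8000/").annotate_terms(["brca1"])`; neither side compiles against the other’s code, which is the late-binding property the Hive favors for clinical settings.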

Lyster: Does the Kohane team use the ‘traditional’ software engineering (SE) approach of requirements, design, prototype, implementation, testing,…? Kohane: They have replaced classical SE with bottom-up development, followed by top-down DBPs, to see what investigators are looking for. This is a continuous, i.e., real-time, process. Clinical genomic research is breaking new ground. First iterations use whatever it takes to get the job done. Then the goal is to step back and capture what works about the process. Their approach to software engineering is consciously “naïve”. The goal is to capture what developers and investigators really want. Chueh: this is not unlike Extreme Programming.
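
Chueh’s comparison to Extreme Programming can be made concrete: one of XP’s core habits is test-first development, where the test exists before, and drives, the implementation. A minimal, hypothetical Python illustration (not the Center’s actual code):

```python
# Sketch of the test-first habit associated with Extreme Programming:
# the tests below were (notionally) written first, and the implementation
# exists only to satisfy them. Example function is hypothetical.
import unittest

def normalize(values):
    """Scale values so they sum to 1; written to satisfy the tests below."""
    total = sum(values)
    return [v / total for v in values]

class TestNormalize(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([2, 3, 5])), 1.0)

    def test_preserves_ratios(self):
        self.assertEqual(normalize([1, 1]), [0.5, 0.5])

if __name__ == "__main__":
    unittest.main()
```

This fits the “consciously naïve” approach described above: the tests encode what investigators actually want, and the implementation is free to stay as simple as whatever passes them.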

Lyster: Given their demands of semantic complexity, should they connect with W3C? Kohane: They are interested and have made a connection—they need to stay lightweight. The semantic web is about data, and Kohane has heterogeneous data, but his team needs to stay flexible, because the semantics shift as the data moves, even if someone tries to forcefully define ontologies.

Lyster: The data and informatics environment at BWH and associated Partners institutions is not unlike the new Clinical Research Information System (CRIS) at the NIH Clinical Center—should the Kohane Center connect with CRIS? Also, should they connect with caBIG and BIRN? How do we capture the knowledge that we gain in making those decisions?

Kohane: Regarding NIH-CRIS<->Partners, Clem McDonald is on standards bodies and could be prevailed upon. The Shared Pathology Informatics Network (SPIN) is a consortium funded by NCI and is a good resource. They used lightweight protocols successfully, and are making good connections with caBIG. There is a balance between the need for centralized database management systems and distributed systems. But how much of big P (ii) can the NCBCs do? It’s probably not productive to start off with committees or standards—you will end up spending a lot of time solving things that don’t need to be solved, rather than getting on with the scientific work. Kohane’s philosophy is that they don’t expect people to adopt their approach unless the software and standards they develop are useful--hence the need to expose as much as possible. Kohane: It’s best to form a good Center first and later on make connections (software and data federation) with research informatics systems such as CRIS.

Miscellaneous issues around the All Hands Software Engineering meeting and the interoperability demonstration: Kohane’s and Toga’s Centers may develop a strong interoperability demonstration around brain disease and gene expression. Question: What timeline for the interoperability demonstration is Program hoping for? Answer: The plans and execution of the interoperability demonstration(s), and the timeline, need to be decided by Program and the Centers (i.e., the ‘consortium’). At first the suggestion was for a Fall ’05 demonstration, but this does not allow enough lead time for preparation. The Collaborating solicitation (due out this spring) may be too slow to contribute to this (at least in the early phases). There is a need for a quicker turnaround for something like this—a supplement could help prime the interoperability demonstration.

Debrief on December 8 Software and Data Integration discussion with Kikinis team

Friday 20041217 Attendees on phone: Steve Pieper, Karen Skinner, and Peter Lyster. With additional later comments from Ron Kikinis.

Pieper: Kikinis’ and Altman’s Centers are already meeting to discuss a joint ITK/VTK/SimTK software engineering approach. This may be a powerful combination. There is effective use of C++ hierarchies. Kikinis’ original U54 application proposed an interoperability effort with the Toga Center’s LONI pipeline. LONI differs from SimTK/ITK in that it is a pipeline, but it will build components in a similar way. Kikinis’ Center has extensive plans for outreach and training. Their short-term collaboration with the Kohane team deals more with domain science and algorithms (image as phenotype) than with software development. Sean Murphy of Kohane’s Center is involved with the Glasser clinical database (Murphy is head of BWH Core 4 of I2B2) and is an investigator with the BIRN initiative; this is a natural place for data integration with I2B2. Two elements relate to interoperability with other Centers: (i) the image processing pipeline in LONI and (ii) networking and data sharing in BIRN.

Lyster: Does the Kikinis team use the traditional ‘requirements, design, implementation, testing,…’ approach to develop software? Kikinis: NA-MIC employs extreme programming techniques (mostly). http://www.na-mic.org/Wiki/images/3/39/SoftwareProcess.pdf

ITK is about taking established, validated software and migrating it to an open environment that has a professional repository with a development and build process. BIRN is about infrastructure for data sharing and analysis. The Kikinis team is developing algorithms and applications software built on ITK and BIRN. The Kohane Center may have a connection to Kikinis via BIRN; Stanford has a natural connection to the ITK algorithms and software engineering practices. Toga fits both.

Lyster: Regarding the development of SimTK, do you think this will naturally lead to one large software repository with a ‘dashboard’?

Pieper: Perhaps the approach will be similar to ITK/VTK, which are maintained in separate repositories yet promote effective build and interoperability (they use C++). Perhaps SimTK could use lessons learned from ITK/VTK to enable interoperability, a la ITK/VTK/3D Slicer. This has been discussed at recent workshop meetings between Center members. VTK/ITK are large by any standards. The two are separate, and this is a good testbed for assessing the level of granularity that is needed for repositories.

Lyster: We would like to capture this knowledge about process and methods and disseminate it.

Pieper: The process at Stanford will be a good test case—a pilot case. 3D Slicer will go through an architectural rework to adapt to the needs and practices that develop in the NCBC framework.

Lyster: How big is the community that is not included? Pieper: ITK is newer than VTK—broadly speaking, the group who use VTK but not ITK is large, but among ITK users I'd guess more than half use VTK. There is a textbook about ITK. What about the lone wolves who don’t want to be ‘in’? There may be technical reasons for not joining, e.g., compromises in ITK may make certain optimizations more difficult. Can we capture that? What legitimate reasons will keep people out? People use proprietary tools that may not be compatible with NCBC—a case in point involves the use of Matlab. What can they do to enable interoperability? Kikinis’ Center is reluctant to promote proprietary software in their build because that could mess up (among other things) the ability to debug. Forming an API to handle proprietary software is good in principle, but difficult to implement. The beauty of open source (OS) is that it gives you a complete do-it-yourself installation. VTK has a software build tree. Every piece of the process, every line of code, can be debugged, and this makes for a powerful environment. If you use an API to firewall proprietary code, then it is difficult to tell on which side of the firewall a bug occurred. Try to build in such a way that problems do not propagate. NAMIC has assembled a completely non-proprietary set of functionality.

Lyster: Can they impact data modeling and data standards? Pieper: As part of BIRN they have spent a lot of time on this. A lot of BIRN has been constructed especially to deal with database issues and data hierarchy.

Skinner: Is the Kikinis Center involved with NIfTI? Pieper: They plan to use the NIfTI-1 format. NIfTI has done a good job. They are now working on NIfTI-2 (fBIRN); the file format is a basis for standardization. If people use NIfTI, it can be built upon. Skinner: Can Centers identify new software that needs to be made compatible? Pieper: The Kikinis team is seriously working on using it as a NAMIC file format for sharing data. Skinner: What are the sociological or technical issues? Pieper: No one wants to support many file I/O routines for file formats that have essentially identical capabilities, but no file format supports all of the desirable attributes.
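
For context on why a common format lowers the I/O burden Pieper describes: NIfTI-1 keeps all metadata in a fixed 348-byte header, so any tool can parse it with a few lines of code. A minimal Python sketch, assuming a little-endian, single-file .nii (a robust reader would detect byte order by checking that sizeof_hdr unpacks to 348):

```python
# Minimal NIfTI-1 header reader (single-file .nii, little-endian assumed).
# Offsets follow the published nifti1.h layout.
import struct

def read_nifti1_header(path):
    with open(path, "rb") as f:
        hdr = f.read(348)
    return {
        "sizeof_hdr": struct.unpack_from("<i", hdr, 0)[0],   # must be 348
        "dim":        struct.unpack_from("<8h", hdr, 40),    # dim[0] = ndim
        "datatype":   struct.unpack_from("<h", hdr, 70)[0],  # voxel type code
        "bitpix":     struct.unpack_from("<h", hdr, 72)[0],  # bits per voxel
        "pixdim":     struct.unpack_from("<8f", hdr, 76),    # grid spacings
        "vox_offset": struct.unpack_from("<f", hdr, 108)[0], # start of image data
        "magic":      hdr[344:348].rstrip(b"\0"),            # b"n+1" or b"ni1"
    }

if __name__ == "__main__":
    import sys
    print(read_nifti1_header(sys.argv[1]))
```

The simplicity of this reader is the sociological point: a format that every group can parse in twenty lines is far easier to adopt than one requiring a dedicated library per tool.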

Pieper: IRB approval is a lot of work. For example, investigators in the BIRN initiative had to deal with this issue. Lyster: How do efforts like this compare with the NCI Lung Image Database Consortium (LIDC)? Pieper: An LIDC-type effort is a bigger project; huge numbers of subjects are a bigger problem. BIRN de-identification, particularly of facial features, is an important research effort.

Kikinis: Part of the IRB issue was created by HHS/HIPAA. HIPAA is not tailored for scientific research. NIH could/should provide leadership in improving this situation.

Lyster: How can we plan and execute interoperability demonstrations? Pieper: For starters, the Kikinis Center is naturally in the mode for communication—they use a Wiki site http://na-mic.org/Wiki/index.php/Main_Page. A Wiki is like capturing emails in a useful form. They are using it to flesh out communication issues. We can look in on this site as well. All-hands meetings may help force wrap-up. Deadlines may be used to create focus. DBPs should be used to promote useful, tangible software—not just toy efforts. Perhaps discuss this at the NAMIC all-hands meeting in February.

Kikinis: Programmatic requirements beyond things that happen spontaneously have to be backed up by resource allocation. There is also the danger of ‘demo disease’ that occurs in some of the supercomputer centers. A related issue is where the academics should stop and the for-profits should take over. How do we deal with a legitimate effort that requires product-level quality but does not have sufficient short-term commercial potential?

Skinner: What is the most useful thing we (Program) can do for the NCBC Centers? Pieper: The opportunity to do the project is already a huge help. Kikinis’ concern: if we put effort into inter-NCBC interactions, it could detract from our own effort. One job of Program is to do PR, to make sure the NCBC gets credit for discovery, and this means the PIs need to inform Program of progress and discovery. Dissemination will capture success.

Lyster: What is a feasible timetable for the All Software Engineering meeting? Pieper: There are a lot of meetings in the spring and summer. Perhaps wait until fall to include the new Centers. Lyster: Should we do it together with an interoperability demonstration? Pieper: Pick examples that would push discussion at the Software Engineering meeting. Centers could describe practices that they would offer as models for others to emulate.

Summary of Slides presented at December 8 Software and Data Integration session

Mission statement for the software and data integration working group under the NCBC

Building the computational infrastructure for biomedical computing for the nation:

• (1) Improvements and efficiencies in software development and maintenance that may be achieved through promotion of shared software engineering practices, interoperable software, reuse of software components at various levels of granularity, and the use of common software repositories where appropriate for software development and distribution.

• (2) Explore opportunities for networking data relating to data acquisition, data and metadata interchange standards and formats, data models and ontologies, knowledge bases, grid computing, data distribution, security and confidentiality, IP, and the interfaces between algorithms, models, and data. Effective use of Web services.

• (3) Address areas of overlap and interaction with other NIH and government-funded efforts.

• (4) Address substantial issues related to physical-world modeling—this involves how models are defined, manipulated, visualized, and interchanged.

Related Issues:

How similar/different are the Cores among the NCBC Centers?

• Software engineering methods
• Shared software repositories
• Software dissemination methods
• Effective use of web services
• Shared data models

Down the Road:

• Generation of an inventory of relevant areas where SDI is important for the NCBC
• Make connections with similar efforts in large-scale computational science and their lessons learned
• Analyze barriers to forming the national infrastructure
• Plan leveraged demonstration projects
• Plan for team infrastructure to continue development and implementation of activities among NCBC Centers and with other NIH and government-funded efforts

Activities for the Coming Year:

• Working group interaction with NCBC Centers to explore integration
• Road trips
• Involvement in upcoming meetings: ISMB, AMIA, SC05…
• Portal

Appendix A: Notes from Peter Lyster on Dec 8 SDI session

Investigators have pairwise (although not all-to-all) overlap that can be leveraged in (i) software engineering approaches and (ii) domain science. E.g., Toga-Kikinis neuroinformatics; Kikinis-Kohane imaging phenotype; Altman-Kikinis Software engineering approaches; Altman-Kohane: database phenotype analysis; Toga-Kohane database theory; Toga-Altman: database theory.

Kohane: the effort will be sterile unless it happens in the context of a domain problem. Use use-cases that involve representative issues.

Altman: radio silence is good, but there is a need to declare areas of interest.
• Level 1 issues involve areas that are critical to the success of the combined group (big P) and need to be addressed, e.g., mesh methods. These should be dealt with in a bottom-up manner.
• Level 2 issues involve long-run efforts, e.g., building an ontology. Should that be top-down or bottom-up?

Toga: There is a need to connect with other efforts at NIH and other agencies. NIH program staff should make the inventory. A history of documentation is needed, and lower-level staff need to be involved. We don’t need an ‘all-hands’ meeting to deal with SDI.

Mike Marron: Should we convene a ‘bazaar’ to exchange ideas, technology, and software? Perhaps involve other agencies. People were positive about that.

Lorensen (and Altman?): Convene an All Software Engineering Meeting (software development practices, data modeling, …)

Mike Marron: We need to develop big P which helps avoid problematic competitiveness among NCBC Centers.

Sri Kumar says we need branding. Discussion about the need for proper branding or logos. Mike Marron suggests using a common set of glyphs (icons) for presentations. Maybe one of the PIs (Toga?) could develop a global logo. [Since December 8 the NIH Roadmap management has asked that this not be done.]

Appendix B: Notes from Shira Katseff on Dec 8 SDI Session

Report of pairwise interactions: There was a good deal of excitement about new initiatives, the R01 program, and swapping out DBPs over time. Many Centers have preexisting connections and face common challenges (especially in software engineering). They also face common administrative challenges. One concern was that POs and LSOs have different levels of involvement between projects.

Overall, they found the opportunity to meet alone to be informative and would like to have a forum to speak without the funding agency present on occasion.

Arthur Toga requested adding a “management team” session before the end of the day, including John Whitmarsh, all POs, all LSOs, all SOs, and all team members present.

Management plan: Each Center is managed by a Program Officer and a Lead Science Officer (the latter one arm removed from funding decisions; sits on the advisory board of the Center). The PO and LSO will be from different Centers. Science Officers are chosen by the Lead Science Officer.

Since this year's funds began to be used later in the year, there is some extra funding. Carryover for FY06 is likely to be approved, but all carryovers must be requested and approved.

Committees:
- Executive committee (Eric Jakobsson)
- Assessment and Evaluation Working Group (Chuck Friedman)
- Software and Data Integration Working Group (Peter Lyster)

All committees report to RIWG, which makes funding recommendations to the RICC. The RICC makes final funding decisions.

Software and Data Integration Working Group (Peter Lyster). Up front: Arthur Toga, Russ Altman, Isaac Kohane, William Lorensen (GE Research)

Mission statement for the software and data integration working group under the NCBC: Building the computational infrastructure for biomedical computing for the nation.

Introductory Presentation (Peter Lyster): 1. Software development (shared practices, interoperability, reuse, common software repositories, based on open source methods)

2. Data (networking, metadata information, web services, overlap with NIH activities and other government-funded efforts). Common issues related to physical-world modeling include: data models, knowledge bases, grid computing, data distribution, security and confidentiality, IP, and the interfaces between algorithms, models, and data.

3. Related issues (software engineering methods, shared software repositories, tools for handling overlap). Open source will lead to an improved overall product.

4. Down the road: Will need an inventory of, and protocol for, overlap in domain areas and software engineering methods. Identify barriers to national infrastructure and decide how to turn individual efforts into a larger national vision.

5. Activities in the coming year: working group interaction with NCBC Centers to explore integration; road trip(s); portal (site-internal search engine); involvement in upcoming meetings (ISMB, AMIA, SC05, RSNA, SFN,…)

Response presentation (William Lorensen, GE Research): We need an open source model:
- community with a common vision
- pool of talented and motivated developers
- mix of academic and industrial participation
- organized, lightweight approach to software development
- leadership structure
- business model (share credit, not just money)

Advocated extreme programming approach and portable, open source tools.

Principles of extreme programming: the community owns the code; continual release and integration. All developers agree to keep the software defect-free.

Isaac Kohane pointed out that these goals can only be successful with shared semantics and standards of interoperability for a specific context. This will ensure that uniform standards are used.

Russ Altman suggested two possibilities:
- create a high-level ontology with an ontology language tool. This method is tedious but precise.
- outline key data structures. This method can lead to vague definitions.
For issues core to the existence of the Centers, it is useful to adopt the first possibility of high-level formalism. These issues need to be discussed immediately. For issues that are not mission-critical, definitions can be more flexible and verified on a case-by-case basis.

To begin practical collaboration among Centers, Chuck Friedman suggested that they think of something that all four Centers need and agree to build it once, in a way that is compatible with the needs of all Centers. However, Center PIs are too high-level to make effective suggestions; an all-hands meeting with software engineers would be better. Eric Jakobsson suggested a virtual meeting.

Issues: Some young faculty have philosophical issues with working in an open source environment. The challenge will be to balance IP protection with a cooperative technological environment. One suggestion is gaining credit through publications. ACTION: An NCBC Program logo is a good way to get started. Will post on website.

Software and Data Integration Working Group

Peter Lyster (NIGMS, Chair) Stephan Bour (NIAID) Carol Bean (NCRR) Arthur Castle (NIDDK) German Cavelier (NIMH) Larry Clarke (NCI) Elaine Collier (NCRR) Peter Covitz (NCI) Jennifer Couch (NCI) Valentina Di Francesco (NIAID) Peter Good (NHGRI) John Haller (NIBIB) Donald Harrington (NIBIB) Peter Highnam (NCRR) Michael Huerta (NIMH) Jennie Larkin (NHLBI) Yuan Liu (NINDS) Michael Marron (NCRR) Richard Morris (NIAID) Bret Peterson (NCRR) Karen Skinner (NIDA) Michael Twery (NHLBI) Terry Yoo (NLM)