2005 AHM Planning: Databases/Integration Working Group

From NAMIC Wiki

Organizers

  • Morphometry BIRN: Jorge Jovicich
  • Function BIRN: David Keator, Stephen Wong
  • Mouse BIRN: Maryann Martone, Bill Bug
  • BIRN CC: Jeffrey Grethe



Agenda

  • Tuesday, October 18
    • 2:00pm - 3:15pm
    • Data interchange and identification
      • Methods for the interchange of data (XML, data hierarchies, standards and conventions)
      • Unique identification of data sets and partial data sets for publishing (e.g. LSID)
    • 3:30pm - 4:45pm
    • Workflows (Joint Meeting with Workflow Working Group)
      • Utilization of workflow tools for data upload and download
      • Integration of workflows into database environment for launching of analyses and capturing of results and data provenance
      • Requirements gathering
  • Wednesday, October 19
    • 8:30am - noon
    • Utilization of Ontologies in Databases (Joint Meeting with Ontology Working Group)
      • Utilization of ontologies in databases
      • Requirements for databases to incorporate semantic information
        • What semantic information needs to be stored
        • Automated and standardized methods to retrieve this information
      • Tool requirements
    • Data Integration (Interaction with Ontology Working Group)
      • Utilization of the BIRN data integration environment
      • Strategies for data integration
      • Requirements for the data integration environment
    • Afternoon
    • Summary Report
      • Summary report from the group will be presented to the whole BIRN group, outlining the milestones/dates/names agreed by the group.



Background Information for the Database/Integration Working Group

Background

The neuroscience research community deals not only with large distributed databases, but also with highly heterogeneous sets of data. A query may need to span several relational databases, ontology references, spatial atlases, and collections of information extracted from image files. To that end, the BIRN-CC has deployed a data source mediator that enables researchers to submit these multi-source queries and to navigate freely between distributed databases. This data integration architecture for BIRN builds upon our work in knowledge-guided mediation for integration across heterogeneous data sources (Martone et al., 2004; Ludäscher et al., 2000; Gupta et al., 2001). In this approach, the integration environment uses additional knowledge captured in the form of ontologies, spatial atlases and thesauri to provide the necessary bridges between heterogeneous data.

Unlike a data warehouse, which copies (and periodically updates) all local data to a central repository and integrates local schemas through the repository’s central schema, this mediator approach creates the illusion of a single integrated database while maintaining the original set of distributed databases. This is achieved via so-called integrated (or virtual) views, in a sense the “recipes” describing how the local source databases can be combined to form the (virtual) integrated database. It is the task of the data integration system to accept queries against the virtual views and create query plans against the actual sources whose answers, after some post-processing, are equivalent to what a data warehouse would have produced. The main advantages of this integrated system over a data warehouse are:

  • User queries automatically retrieve the latest data from the source databases;
  • There is no need to host, maintain, and keep updated the central repository;
  • A distributed and integrated system is flexible: a new source can join the system simply by registering its schema, and integrated views can be added as needed; and
  • Sites maintain the autonomy and ownership of their data, because the mediator is just another client to the source databases.
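
The mediator idea described above can be illustrated with a toy sketch. This is not the BIRN mediator: the `Mediator` class, the two in-memory SQLite "sites", their table and column names, and the sample rows are all hypothetical and chosen only for illustration. Each source registers a query that maps its local schema onto a shared virtual view, and the mediator queries the sources live rather than copying their data.

```python
import sqlite3


def make_source(create_sql, insert_sql, rows):
    """Create an in-memory database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


class Mediator:
    """Answers queries against a virtual view by querying each source live."""

    def __init__(self):
        self.sources = {}  # source name -> (connection, view SQL)

    def register(self, name, conn, view_sql):
        # The view SQL is the "recipe" that maps a local schema onto the
        # shared columns of the virtual view (subject, volume in mm^3).
        self.sources[name] = (conn, view_sql)

    def volumes_for_region(self, region):
        rows = []
        for name, (conn, view_sql) in self.sources.items():
            for subject, volume_mm3 in conn.execute(view_sql, (region,)):
                rows.append((name, subject, volume_mm3))
        return sorted(rows)


# Two hypothetical sites with different local schemas and units.
site_a = make_source(
    "CREATE TABLE scans (subject_id TEXT, region TEXT, volume_mm3 REAL)",
    "INSERT INTO scans VALUES (?, ?, ?)",
    [("a1", "cerebellum", 1200.0), ("a2", "neostriatum", 800.0)],
)
site_b = make_source(
    "CREATE TABLE measurements (subj TEXT, structure TEXT, vol_cm3 REAL)",
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("b1", "cerebellum", 1.5)],
)

mediator = Mediator()
mediator.register(
    "site_a", site_a,
    "SELECT subject_id, volume_mm3 FROM scans WHERE region = ?",
)
mediator.register(
    "site_b", site_b,
    "SELECT subj, vol_cm3 * 1000.0 FROM measurements WHERE structure = ?",
)

print(mediator.volumes_for_region("cerebellum"))
```

Because nothing is copied, a row added at a site after registration shows up in the next mediated query, and a new site joins by calling `register` with its own view query. A real mediator would additionally plan and optimize queries across sources, which this sketch does not attempt.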

Goals for the Database/Integration Working Group

The Database/Integration Working Group will focus on three key areas of database use:

  1) Data Interchange and Identification (Session 1)
  2) Data driven workflows (Session 2)
  3) Ontologies and Data Integration (Sessions 3 & 4)

The following sections provide background information for each of the sessions.

1) Data Interchange and Identification (Tuesday 2:15pm)

This discussion will center on the requirements and infrastructure needed to promote data interchange within BIRN (and between applications) as well as the exchange of data externally (i.e. publicly available data). Discussions will focus on two key concepts: the use of XML and LSIDs.

BIRN XML Information

BIRN is already making use of XML. A key BIRN XML specification is XCEDE (XML-Based Clinical Experiment Data Exchange Schema; http://www.nbirn.net/Resources/Users/Applications/xcede/). The XCEDE schema provides an extensive metadata hierarchy for describing and documenting research and clinical studies. The schema organizes information into five general hierarchical levels:

  • a complete project
  • studies within a project
  • subjects involved in the studies
  • visits for each of the subjects
  • the full description of the subject's participation during each visit

Each of these sub-schemas is composed of information relevant to that aspect of an experiment and can be stored in separate XML files or spliced into one large file, allowing the XML data to be stored in a hierarchical directory structure along with the primary data. Each sub-schema also allows for the storage of data provenance information, providing a traceable record of processing and/or changes to the underlying data. Additionally, the sub-schemas contain support for derived statistical data in the form of human imaging activation maps and simple statistical value lists.
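
As a rough illustration of the "one large file or separate files" choice, the sketch below builds a nested XML document with one element per hierarchical level and then splits it into per-level pieces. The element names (`project`, `study`, `subject`, `visit`, `participation`) and the `id` attribute are placeholders invented for this example; they are not taken from the XCEDE schema, which defines its own elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names for the five levels (NOT the real XCEDE names).
LEVELS = ["project", "study", "subject", "visit", "participation"]


def build_nested(ids):
    """Build one document with a single element at each of the five levels."""
    if len(ids) != len(LEVELS):
        raise ValueError("expected one identifier per level")
    root = ET.Element(LEVELS[0], {"id": ids[0]})
    node = root
    for tag, ident in zip(LEVELS[1:], ids[1:]):
        node = ET.SubElement(node, tag, {"id": ident})
    return root


def split_levels(root):
    """Return one child-free copy of each level, suitable for separate files."""
    parts = []
    node = root
    while node is not None:
        parts.append(ET.Element(node.tag, dict(node.attrib)))
        children = list(node)
        node = children[0] if children else None
    return parts


doc = build_nested(["proj1", "study1", "subj1", "visit1", "part1"])
for part in split_levels(doc):
    # Each piece could be written next to the primary data at its own
    # level of a hierarchical directory structure.
    print(ET.tostring(part, encoding="unicode"))
```

The sketch follows only the first child at each level, so it handles a single project/study/subject/visit chain; a real export would need to walk every branch.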

XCEDE was originally designed in the context of neuroimaging studies and complements the Biomedical Informatics Research Network (BIRN) Human Imaging Database, an extensible database and intuitive web-based user interface for the management, discovery, retrieval, and analysis of clinical and brain imaging data. This close coupling allows for an interchangeable source-sink relationship between the database and the XML files, which can be used for the import/export of data to/from the database, the standardized transport and interchange of experimental data, the local storage of experimental information within data collections, and human and machine readable description of the actual data. However, the XCEDE schema is highly generalizable and discussions can examine the utility of extending XCEDE for use in other applications.

For more information on XCEDE, you can view this document: https://portal.nbirn.net/BIRN/cgi-bin/DataGrid/srbView.cgi/birnschema.pdf?object=/home/BIRN/AHM/2005/Database_Integration/birnschema.pdf

What is an LSID

Life Science Identifiers (LSIDs) are the standard adopted by the Object Management Group (OMG) for the identification of life science data objects. They are a little like the DOIs (http://www.doi.org/) used by many publishers. They provide a standard mechanism for retrieving data and metadata across different life science databases, containing diverse information and information types.

Each LSID refers to exactly one unchanging data object. Unlike the familiar URLs of the World-Wide-Web, LSIDs are location independent. This means that if two copies of an object obtained from different places carry the same LSID, a program or a user can be certain that they contain exactly the same data.

In order to retrieve the data referenced by an LSID, a 'resolver' is needed. At a minimum this software system usually comprises two parts that communicate over a network. The first part is server software operated by any party that wishes to make data available and that has assigned LSID names to this data. This party is also known as the LSID issuing authority. The second part is software that usually executes on a client and communicates over a network, using an agreed protocol, with the LSID authority server in order to retrieve the data or metadata associated with a particular LSID instance.

LSID metadata is normally represented in an RDF serialization. A key benefit of using LSID as a naming convention is the clear separation of data and metadata. This also gives the implementer of an LSID authority the task of determining what is data and what is metadata. The first thing to realize is that while almost every LSID will have associated metadata, many LSIDs may not be associated with data.

An LSID conforms to the URN standards defined by the IETF. Every LSID consists of up to five parts:

  • LSID Designator - the Network Identifier (NID), which signifies that the identifier is an LSID
  • Authority Identifier - an internet domain, i.e. the root DNS name, belonging to the organisation that assigned the LSID
  • Namespace Identifier - the name of the resource that contains the data
  • Object Identifier - the unique ID in the specified namespace of a data item (e.g. this could be the unique DICOM identifier or SRB URI)
  • Revision Identifier - an optional parameter to identify different versions of the same data item

Each part is separated by a colon to make LSIDs easy to parse. Here is an example that references a PubMed article: urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434
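
Because the parts are colon-separated, a parser can be quite small. The following is a minimal sketch rather than a complete implementation of the LSID specification: the `LSID` tuple and `parse_lsid` function are names made up for this example, and the code assumes that no individual part contains a colon (an object identifier such as a URI would need escaping first).

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts, checking the urn:lsid designator."""
    parts = text.split(":")
    # "urn" and "lsid" together form the designator, followed by three
    # required parts and one optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    urn, nid, authority, namespace, object_id = parts[:5]
    if urn.lower() != "urn" or nid.lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not (authority and namespace and object_id):
        raise ValueError(f"empty authority, namespace or object: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434"))
```

For the PubMed example above, the authority is `ncbi.nlm.nih.gov`, the namespace is `pubmed`, the object identifier is `12571434`, and there is no revision.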

MyGrid use of LSID: LSIDs are used throughout myGrid for the identification of data objects from external sources as well as internally created data. Using a standard mechanism for identification allows for more efficient and cohesive exchanges between myGrid components.

The LSID components that can be deployed in myGrid are as follows:

  • LSID Assigning Service - New data objects created by Taverna or the MIR can be given unique LSIDs using this service.
  • LSID Authority - This service stores the locations of other services, internal or external, that can provide data for a particular LSID-assigned data object.
  • LSID Data Resolver - This service retrieves data for a particular object from the place where the actual data is stored. In myGrid, this is the MIR.
  • LSID Metadata Resolver - This service retrieves metadata associated with a particular data object. Metadata could be stored in several places. In myGrid, metadata is stored in the MIR and in KAVE.

Some sites that could be of interest:

  • OMG LSID RFP - http://www.omg.org/lsr/
  • IBM LSID Best Practices - http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/
  • http://lsid.sourceforge.net/
  • Sample LSID resolver - http://lsid.biopathways.org/resolver/
  • UW LSID Authority - http://lsid.limnology.wisc.edu/

2) Data driven workflows (Tuesday 3:45pm)

This discussion will center around the interoperability between data services (i.e. databases and data stores like SRB) and workflow services (i.e. Kepler, LONI Pipeline, FIPS). The discussion builds on the earlier session (XML) and will add some discussion on the utilization of web services for interoperability.

Workflow Working Group Notes: http://www.na-mic.org/Wiki/index.php/2005_AHM_Planning:_Workflows/Analyses_Working_Group

Background material from previous workflow meeting: http://www.na-mic.org/Wiki/index.php/Mbirn:_UCSD_Workflow_Development_Retreat_(July_17-_18,_2005)

3) Ontologies and Data Integration (Wednesday 8:30am)

The BIRN Data Mediation architecture employs ontologies as the means for linking related or identical concepts in different databases. To aid in this process, concepts in each of the source databases should be mapped to one or more of the shared knowledge sources available to BIRN. This session will be held jointly with the Ontology Working Group (http://www.na-mic.org/Wiki/index.php/2005_AHM_Planning:_Ontologies_Working_Group).

BIRN Knowledge Sources (BONFIRE http://imhotep.ucsd.edu:7873/knowme/bonfire.html): The following sources are or will be available to BIRN:

  1) UMLS (including SNOMED)
  2) Gene Ontology
  3) Brain Info/Neural Names
  4) Others as determined by the Ontology Task Force

In addition to providing the above listed sources, BONFIRE allows BIRN users to accommodate concepts not present in the available pre-defined source ontologies. The concepts from the disease maps created around human neurological diseases will be incorporated into BONFIRE.

Versions: To ensure that everyone in BIRN is using the same version of the source ontologies, the BIRN CC will maintain a current version of each that can be accessed through the BIRN portal. Tools will be provided for browsing and querying the Knowledge Sources. A prototype version of one of these tools, the Know Me tool (Knowledge Map Explorer), is currently available and is being updated for use by BIRN.

Database Mapping: Each relevant concept in a database record should have an ontology ID. In the simplest case, there exists in one of the ontologies a term that exactly matches the concept in the database. As an example, let us take a look at the Cell Centered Database (CCDB). The CCDB has a lot of descriptive information associated with a given record. Not every concept has to be equated with an ontology ID. For example, the CCDB has examples of filled Purkinje neurons. For data set ALXP4, the following ontology IDs are supplied:

CCDB data set ALXP4

  1. species = rat (UMLS: C003493)
  2. brain
  3. region = cerebellum
  4. subregion = vermis (C0598118)
  5. cell type = Purkinje neuron (UMLS ID C0034143)


In this case, terms were supplied for rat, vermis and Purkinje neuron but not for cerebellum. It was not necessary for cerebellum to be mapped, because the knowledge that a Purkinje neuron is a cell type in the cerebellum is already in the UMLS. Similarly, the relationship “Brain has a cerebellum” is already specified. The fact that the Purkinje neuron ALXP4 is in the vermis region of the cerebellum cannot be inferred from any existing relationship and must be specified.

However, in many cases, the source database may have a concept that does not exist in any of the knowledge bases. When this occurs, the source concept must be “lifted up” to one of the base ontologies. This process is illustrated in the following example.

CCDB data set Osaka3

  1. species = rat (UMLS: C003493)
  2. region = neostriatum (UMLS: C0162512)
  3. cell type = medium spiny cell (No Concept Available)
  4. structure = spiny dendrite (No Concept Available)
  5. segmented object = dendritic spine (UMLS: C0872341)
  6. segmented object = dendritic shaft (No Concept Available)

For this data set, no ontology IDs exist for medium spiny cell, spiny dendrite or dendritic shaft. These entities have to be defined in terms of the UMLS or some other ontology:

medium spiny cell (BONFIRE: BID006)

  • medium spiny cell “is a” neuron (UMLS: C0027882)
  • medium spiny cell “has location” neostriatum (UMLS: C0162512)
  • medium spiny cell “is a” neuron AND “has property” dendritic spine (UMLS: C0872341)

spiny dendrite (BONFIRE: BID007)

  • spiny dendrite “is a” dendrite (UMLS: C0011305)
  • spiny dendrite “contains” dendritic spine (UMLS: C0872341)

dendritic shaft (BONFIRE: BID008)

  • dendritic shaft “is part of” spiny dendrite (BONFIRE: BID007)

Whenever possible, users should employ the relationship terms provided within the UMLS or other source ontologies provided by BIRN. Once new terms are defined, they will become part of the BIRN Ontology (BONFIRE), which can then be used to define additional terms.
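
One way to picture "lifting up" is as a small graph: each new BONFIRE concept is defined by relationships to UMLS concepts or to other BONFIRE concepts, so every new concept should ultimately rest on at least one base-ontology term. The sketch below encodes the Osaka3 definitions given above and follows BONFIRE-to-BONFIRE links until it reaches base-ontology IDs. The `DEFINITIONS` layout and the `base_anchors` function are illustrative assumptions and do not describe how BONFIRE itself stores or resolves concepts.

```python
# Definitions taken from the Osaka3 example: each new BONFIRE concept is
# described by (relationship, target ontology, target ID) triples.
DEFINITIONS = {
    "BID006": [  # medium spiny cell
        ("is_a", "UMLS", "C0027882"),
        ("has_location", "UMLS", "C0162512"),
        ("has_property", "UMLS", "C0872341"),
    ],
    "BID007": [  # spiny dendrite
        ("is_a", "UMLS", "C0011305"),
        ("contains", "UMLS", "C0872341"),
    ],
    "BID008": [  # dendritic shaft
        ("is_part_of", "BONFIRE", "BID007"),
    ],
}


def base_anchors(concept_id, definitions, seen=None):
    """Return the base-ontology terms a BONFIRE concept ultimately rests on."""
    seen = set() if seen is None else seen
    if concept_id in seen:  # guard against circular definitions
        return set()
    seen.add(concept_id)
    anchors = set()
    for _relationship, ontology, target in definitions.get(concept_id, []):
        if ontology == "BONFIRE":
            # Defined in terms of another new concept: keep following.
            anchors |= base_anchors(target, definitions, seen)
        else:
            anchors.add((ontology, target))
    return anchors


for bid in sorted(DEFINITIONS):
    print(bid, sorted(base_anchors(bid, DEFINITIONS)))
```

Here dendritic shaft (BID008) has no direct UMLS mapping, but it is anchored through spiny dendrite (BID007) to the UMLS concepts for dendrite and dendritic spine. A concept whose anchor set came back empty would indicate a definition that never reaches a base ontology.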

Database Requirements:

The method used to map the database onto the BIRN Shared Knowledge Sources is up to the local database creator. In the CCDB, we have created two tables:

Table 1: Direct mapping onto an ontology in the CCDB

CCDB Concepts

Name (CCDB_ID) | Relation | Source Name | Source Ontology | Source ID | Constraint | Linked by
Alxp4: Purkinje Neuron | is_a | Purkinje Cell | UMLS | C000076 | | Maryann Martone
Alxp4: Purkinje Neuron | located_in | Vermis | UMLS | C000076 | | Maryann Martone
Osaka3: medium spiny cell | is_a | Medium Spiny Cell | BONFIRE | BID006 | | Maryann Martone
Osaka3: medium spiny cell | located_in | Neostriatum | BONFIRE | BID006 | | Maryann Martone

Table 2-4: New Concept Entries in BONFIRE

BONFIRE Concepts

BONFIRE ID | Concept Name | Submitter | Submit Date | Constraints
006 | Medium Spiny Neuron | Maryann Martone | 16122002 |
007 | Spiny Dendrite | Maryann Martone | 16122002 |
008 | Dendritic Shaft | Maryann Martone | 16122002 |

BONFIRE Relations

BONFIRE ID | BONFIRE Concept | Relationship | Target Concept | Target Ontology | Target Ontology ID | Submitter | Submit Date | Constraints
006 | Medium Spiny Cell | is_a | Neuron | UMLS | C0027882 | Maryann Martone | 1612200 |
006 | Medium Spiny Cell | has_location | Neostriatum | UMLS | C0162512 | Maryann Martone | 16122002 |
006 | Medium Spiny Cell | has_property | Dendritic Spine | UMLS | C0872341 | Maryann Martone | 16122002 | Property of Neuron
007 | Spiny Dendrite | is_a | Dendrite | UMLS | C0011305 | Maryann Martone | 16122002 |
007 | Spiny Dendrite | contains | Dendritic Spine | UMLS | C0872341 | Maryann Martone | 16122002 | Dendritic Spine > 1
008 | Dendritic Shaft | is_part_of | Spiny Dendrite | BONFIRE | BID007 | Maryann Martone | 16122002 |

Example of BONFIRE Allowed Relations (adapted from UMLS)

Relationship | Source Ontology
is_a | UMLS
contains | UMLS
is_part_of | UMLS
has_location | UMLS
has_property | UMLS
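
Since the mapping method is left to each local database, the following is only one possible way to hold the BONFIRE concept and relation tables in a relational store. It uses SQLite for convenience; the table names, the snake_case column names (derived from the table headings above), and the helper functions are assumptions made for this sketch and do not describe the actual CCDB or BONFIRE schema.

```python
import sqlite3

# Relationship names from the example list of allowed relations.
ALLOWED_RELATIONS = {"is_a", "contains", "is_part_of", "has_location", "has_property"}

SCHEMA = """
CREATE TABLE bonfire_concepts (
    bonfire_id      TEXT PRIMARY KEY,
    concept_name    TEXT NOT NULL,
    submitter       TEXT,
    submit_date     TEXT,
    constraint_text TEXT
);
CREATE TABLE bonfire_relations (
    bonfire_id         TEXT NOT NULL REFERENCES bonfire_concepts(bonfire_id),
    relationship       TEXT NOT NULL,
    target_concept     TEXT NOT NULL,
    target_ontology    TEXT NOT NULL,
    target_ontology_id TEXT NOT NULL,
    submitter          TEXT,
    submit_date        TEXT,
    constraint_text    TEXT
);
"""


def open_store():
    """Create an in-memory store with the two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_concept(conn, bonfire_id, name, submitter, submit_date):
    conn.execute(
        "INSERT INTO bonfire_concepts VALUES (?, ?, ?, ?, NULL)",
        (bonfire_id, name, submitter, submit_date),
    )


def add_relation(conn, bonfire_id, relationship, target_concept,
                 target_ontology, target_ontology_id,
                 submitter, submit_date, constraint_text=None):
    # Reject relationship names outside the allowed list before inserting.
    if relationship not in ALLOWED_RELATIONS:
        raise ValueError(f"relationship not allowed: {relationship}")
    conn.execute(
        "INSERT INTO bonfire_relations VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (bonfire_id, relationship, target_concept, target_ontology,
         target_ontology_id, submitter, submit_date, constraint_text),
    )


conn = open_store()
add_concept(conn, "007", "Spiny Dendrite", "Maryann Martone", "16122002")
add_relation(conn, "007", "is_a", "Dendrite", "UMLS", "C0011305",
             "Maryann Martone", "16122002")
add_relation(conn, "007", "contains", "Dendritic Spine", "UMLS", "C0872341",
             "Maryann Martone", "16122002", "Dendritic Spine > 1")
```

Checking the relationship name in `add_relation` keeps new entries within the allowed vocabulary. Note that SQLite does not enforce the `REFERENCES` clause unless foreign keys are switched on for the connection, so here it serves mainly as documentation.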



Possible Participants at AHM

  • Morphometry BIRN: David Kennedy, Dan Marcus, Heidi Schmidt, Timothy Brown, Burak Ozyurt (Ontology Interaction adds Christine Fennema-Notestine)
  • Function BIRN: David Keator, Stephen Wong, Bryon Mueller, Dingying Wei, Syam Gadde, Burak Ozyurt, Karen Pease/Hans Johnson/Bill Klawitter, Jeremy Bockholt, Mark Anderson, Katie Hayes, Risha (UCLA), Tiffany Elliott
  • Mouse BIRN: Edriss Merchant, Sally Gewalt, Karen Crawford, Joy Sargis, Bill Bug (also in ontologies)
  • BIRN CC: Amarnath Gupta, Vadim Astakhov, David Little, Ed Ross, Aylin Yilmaz, Jeffrey Grethe