PrelimResultsSOWG1
Science Ontologies (etc.)
Kohane, Ashburner, Musen, States, Lussier, Murphy, Bean, Smith and a cast of 30
This is a very preliminary discussion held after several telephone conference calls. The focus was pragmatic: what ontologies and related artifacts were the NCBCs going to use, and what recommendations could we make to others working in the same application area?
Introductory comments and motivation
The overarching goal of this exercise among the National Centers for Biomedical Computing (NCBCs) is to arrive at a crude but simple set of recommendations that can be followed by investigators within the various NCBCs who are building databases, annotation engines, or catalogs relevant to the broad swath of enterprise represented by biomedical research. In doing so, we recognize that we have run roughshod over several important points regarding the definition of what constitutes an ontology, a terminology, or a nomenclature. Although these are worthy subjects, we have started from a point of pragmatism: we provide broad guidance, not least by adopting our own recommendations within our own centers. Additionally, as it stands now, the categorization we have provided serves only as a point of initial discussion rather than a final set of recommendations from the NCBC Science Ontology Working Group. To help move this discussion forward, some additional detail about the categories we have chosen for classifying these ontologies and artifacts may be illuminating.
- Class 1. Ontologies designated as Class 1 are viewed as having broad utility today, often proven by extended use by investigators throughout the world. Their licenses permit unencumbered use for a broad variety of purposes internationally. They are by and large computationally and structurally sound and readily usable for computation for the particular purposes for which they were designed. This last qualification is important because several of these ontologies are terminologies that fit poorly when developers attempt to use them for purposes for which they were not designed. Finally, the intent is that if you asked an NCBC investigator building or curating an artifact in a domain covered by one of these Class 1 ontologies whether they would use that ontology, the expected answer would be yes.
- Class 2. Class 2 terminologies and artifacts are imperfect in one or more parts of their realization but are nonetheless widely used. NCBC investigators would find themselves using them, more or less grudgingly, even though they might wish for improved implementations or other desiderata. Problems identified include cycles introduced within or across hierarchies, unclear or time-limited licenses for national or international use, difficulty of maintenance, and limited adoption.
- Class 3. Class 3 ontologies appear to be making significant progress in an area of great utility to biomedical research but remain insufficiently mature or complete to recommend for use by a party that may not be expert in either the development or the assessment of terminologies or ontologies. Class 3 is included in part to point to future areas of success: efforts that could be bolstered by present and improved funding, and where the NCBC community recognizes an as yet unmet need.
- Class 4. Class 4 would include those artifacts which should not be used or recommended. Fortunately, this remains a fully abstract class, as no instances populate it at this time.
- Class 5. Class 5 includes those terminologies and ontologies that have been developed or are being developed but where there is no clear leading standard, so it would not be possible to provide an uncontroversial recommendation for use. It is anticipated that future “harmonization” efforts will deplete the membership of this class.
- Class 6. Class 6 includes those ontologies that serve as bookkeeping efforts, mapping concepts and relationships between two or more ontologies. This thesaurus-like function is best exemplified by the UMLS Metathesaurus.
- Class 7. Finally, Class 7 represents the many domain areas relevant to biomedical research that have not yet been categorized by the NCBC Science Ontology Working Group. Members of this class only represent a to-do list of the substantial areas that have yet to be addressed.
The most important observation that most observers make when they see this initial categorization is surprise at just how many terminologies or ontologies made it into Class 1. This result alone should be reason for optimism about our ability to make progress in this important area, and about the payoff of substantial international investments in inventing systematic means to describe the elements of biomedical research.
Open Source Categories of Ontologies and Related Artifacts
- Among the items missing here are
- Clear annotation of licenses (even if free)
- Editorial riffs on uses of these ontologies & terminologies
Class Descriptions
- I – All NCBCs endorse
- II – All NCBCs will use under protest (or more often, with a wish for some additions/corrections)
- III – promising but under construction
- IV – will not use
- V – no clear standard
- VI – mapping from one ontology to another
- VII – to be determined
Class I:
- Gene Names and Symbols
- Human – HGNC
- Mouse – JAXMGI
- Drosophila – FlyBase
- C. elegans – WormBase
- Yeast – SGD
- Zebrafish – ZFIN
- Protein names
- UNIPROT
- Primary molecular sequence data
- IUPAC single letter
- Genomic sequence coordinates
- NCBI genome builds
- Protein Sequences
- UNIPROT
- Attributes of Gene Products
- Gene Ontology
- Taxonomy
- NCBI taxonomy minus viruses
- Human Anatomy
- FMA (note that FMA's license is now open, unlike what is on their website)
- Clinical laboratory observations
- LOINC
- Sequence Features
- SO
- Protein Sequence Domains
- InterPro
- RNA Sequence domains
- RNAFam
- Macromolecular Structure Identifiers
- RCSB
- Medications used in US
- RxNorm
Class II
- Literature headings for literature indexing
- MeSH
- Virus taxonomy
- ICTVdb
- Disease Classification
- ICD9-CM
- ICD10
- Disease Nomenclature
- SNOMED CT (longish discussion about international efforts for a "free" SNOMED)
- Medical Procedures
- CPT4
Class III
- Chemical Nomenclature
- CHEBI
- Mammalian Phenotype
- Mammalian Phenotype Ontology
- Cell types
- Cell Ontology
- Functional Genomics Investigation Ontology
- FuGO
- Post Translational Modification
- no standardization
- RNA Structure Domains
- RNA Ontology
The Rest
- Class V
- Protein Structure Classification
- CATH/SCOP
- Class VI
- UMLS
- Class VII
- Neurobiology
- Imaging
Next Steps
- Review and vet list with the larger NCBC group
Comments
Bill Bug
Kudos to all!
I think this is an excellent foundation to work from. I have just a few comments. Please pardon the redundancy, if these are issues already being addressed as follow-up to your Tcons or meetings.
Though it may sound like I'm asking for a lot, I actually think most of what I suggest below for the narrow domain of the resources you are looking to endorse/classify could actually be effected with very little effort. It's mostly a matter of compiling information most of which is readily available on the web, having a bit of discussion of the relevant issues, then just editing a few Wiki pages.
Knowledge Resource Classification
I believe it would be of considerable value to group these resources according to whether they are Ontologies, Classification Schemes, Lexical Resources, Data Repositories, or complex knowledgebases (e.g., PubChem, Reactome, OMIM, KEGG LIGAND, etc.). To simply state "Ontologies and related artifacts" certainly covers this broad group but does not provide a sufficiently precise description of what is being listed - e.g., "related artifacts" could be anything. In being more specific, you provide a considerable service to those seeking to use these resources. Each of those different types of knowledge resource has different "best practices" for construction and use. One might also break out Taxonomies and Dictionaries, as these bring with them - or at least should, when a resource labelled as such actually is a taxonomy or a dictionary - specific attributes and preferred uses. Since there is considerable confusion on this issue amongst biologists using knowledge resources for KE/KR/KD, it would be very helpful to add this level of classification, along with a brief statement of:
- how these resources differ in their construction and use
- how they are similar
- where their content & use overlaps
I don't mean to imply this is an exhaustive list of resource types, or that I can provide a comprehensive listing of what is what. I also don't believe the process of producing such a classification would be deterministic and without debate, though I do believe for most of what you list above, the task is quite tractable. I just strongly believe the NCBCs could provide extremely valuable leadership on this issue that would be of great practical value and ultimately save us all considerable time and $$$.
To give an example of how the distinct resource types can differ in their implementation, one can look to the preferred manner of formally expressing such resources for the purpose of data exchange and shared use across the field. For instance, a Data Resource such as the NCBI Genome builds might be best expressed according to an XML-based Schema or DTD (e.g., das2sources.dtd). Most of the ontologies listed here would be best expressed in OWL. Lexical resources such as NCBI Taxonomy could be most effectively expressed via SKOS or one of the proposed formalisms mixing SKOS with OWL. These different formalisms bring clear technical implications with them and, despite some overlap, very different recommended tools (e.g., Java or Perl processing libraries) and services are implicated.
Here's an example of how you might implement such a classification, using the following letters to specify type: Ontology [O], Classification [C], Lexical Resource [L], Data Repository [D], Complex Knowledgebase [K]. Please note, as I state above, reasonable, knowledgeable people can differ over the classification I provide below, but I believe most of the disagreements could be resolved via debate:
- Class I
- Gene Names and Symbols
- Human – HGNC [L]
- Mouse – JAXMGI [L]
- Drosophila – FlyBase [L]
- C. elegans – WormBase [L]
- Yeast – SGD [L]
- Zebrafish – ZFIN [L]
- Protein names
- UNIPROT [L]
- Primary molecular sequence data
- IUPAC single letter [D]
- Genomic sequence coordinates
- NCBI genome builds [D]
- Protein Sequences
- UNIPROT [D]
- Attributes of Gene Products
- Gene Ontology [O]
- Organismal Taxonomy
- NCBI taxonomy minus viruses [L]
- Human Anatomy
- FMA [O]
- Clinical laboratory observations
- LOINC [L]
- Sequence Features
- SO [O]
- Protein Sequence Domains
- InterPro [C]
- RNA Sequence domains
- RNAFam [C]
- Macromolecular Structure Identifiers
- RCSB [L]
- Medications used in US
- RxNorm [L]
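The type-tagged list above can also be captured as a small data structure, which makes it easy to regroup the same resources by type rather than by domain. The following is only a minimal sketch: the names `ResourceType`, `CLASS_I`, and `group_by_type` are made up for illustration, and the type letters are simply copied from the list above, so they are just as open to debate as the list itself.

```python
from collections import defaultdict
from enum import Enum


class ResourceType(Enum):
    """Type letters proposed in the comment above."""
    ONTOLOGY = "O"
    CLASSIFICATION = "C"
    LEXICAL = "L"
    DATA_REPOSITORY = "D"
    KNOWLEDGEBASE = "K"


# (domain, resource, type letter), copied from the Class I list above.
CLASS_I = [
    ("Gene Names and Symbols", "HGNC", "L"),
    ("Gene Names and Symbols", "JAXMGI", "L"),
    ("Gene Names and Symbols", "FlyBase", "L"),
    ("Gene Names and Symbols", "WormBase", "L"),
    ("Gene Names and Symbols", "SGD", "L"),
    ("Gene Names and Symbols", "ZFIN", "L"),
    ("Protein names", "UNIPROT", "L"),
    ("Primary molecular sequence data", "IUPAC single letter", "D"),
    ("Genomic sequence coordinates", "NCBI genome builds", "D"),
    ("Protein Sequences", "UNIPROT", "D"),
    ("Attributes of Gene Products", "Gene Ontology", "O"),
    ("Organismal Taxonomy", "NCBI taxonomy", "L"),
    ("Human Anatomy", "FMA", "O"),
    ("Clinical laboratory observations", "LOINC", "L"),
    ("Sequence Features", "SO", "O"),
    ("Protein Sequence Domains", "InterPro", "C"),
    ("RNA Sequence domains", "RNAFam", "C"),
    ("Macromolecular Structure Identifiers", "RCSB", "L"),
    ("Medications used in US", "RxNorm", "L"),
]


def group_by_type(entries):
    """Regroup (domain, resource, letter) triples by resource type."""
    grouped = defaultdict(list)
    for domain, resource, letter in entries:
        # ResourceType(letter) raises ValueError for an unknown letter.
        grouped[ResourceType(letter)].append((domain, resource))
    return dict(grouped)


if __name__ == "__main__":
    for rtype, items in group_by_type(CLASS_I).items():
        print(rtype.name, [resource for _, resource in items])
```

Note that UNIPROT appears twice with different letters, once as a source of protein names and once as a sequence repository, which is why the sketch keys entries on the domain as well as the resource.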
Mappings amongst knowledge resources
Seeing as there are extensive inter-resource mappings available (GO alone has 21 mappings) and as these are critical resources in implementing large-scale, field-wide, semantically based data integration and query capabilities, I think it would be worth breaking these out as a separate category - perhaps even a matrix that lists all resources above as both columns and rows and indicates where mappings exist between them.
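As a rough illustration of the matrix idea, here is a minimal sketch. The function name `mapping_matrix` is made up, the mapping pairs in the usage section are placeholders rather than a verified inventory of existing mappings, and the sketch assumes a mapping can be treated as symmetric, which may not hold for directional mappings.

```python
def mapping_matrix(resources, mappings):
    """Build a resource-by-resource matrix of known mappings.

    resources: list of resource names (used as both rows and columns).
    mappings:  iterable of (a, b) pairs; treated as symmetric here,
               which is an assumption of this sketch.
    Returns a dict of dicts where matrix[a][b] is True if a mapping is recorded.
    """
    known = set(resources)
    matrix = {a: {b: False for b in resources} for a in resources}
    for a, b in mappings:
        if a not in known or b not in known:
            raise KeyError(f"unknown resource in mapping: {a!r}, {b!r}")
        matrix[a][b] = True
        matrix[b][a] = True
    return matrix


if __name__ == "__main__":
    resources = ["Gene Ontology", "InterPro", "UNIPROT", "FMA"]
    # Placeholder pairs for illustration only.
    pairs = [("Gene Ontology", "InterPro"), ("Gene Ontology", "UNIPROT")]
    matrix = mapping_matrix(resources, pairs)
    for row in resources:
        cells = " ".join("x" if matrix[row][col] else "." for col in resources)
        print(f"{row:<14} {cells}")
```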
Service-based access to knowledge resources
In this day and age, many of these resources are available via online web-service-like interfaces. There is a range of service types, from plain HTTP APIs and HTTP APIs serving up XML to full WS-I/WSDL web services, BioMOBY/MOBY-S, etc. Some resources are still not available algorithmically via such a dynamic, network-based mechanism.
I believe again it would be very useful for this group to provide a "yellow pages"-style listing of which resources can be accessed via which mechanisms. Nothing very involved - just a simple listing. Over time, such a listing would be of great assistance, should funding ever become available to produce an NCBC-wide knowledge resource mediation service - a one-stop-shop for those seeking integrated, dynamic, algorithmic access to the universe of resources you cite here.
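A "yellow pages" listing of this kind needs very little structure. The sketch below is one possible shape, not a proposal for a real registry: the resource names (`ExampleResourceA` and so on) are hypothetical placeholders, and the mechanism categories are taken from the service types named above.

```python
from enum import Enum


class Access(Enum):
    """Service types named in the paragraph above."""
    PLAIN_HTTP = "plain HTTP API"
    HTTP_XML = "HTTP API serving XML"
    WSDL = "WS-I/WSDL web service"
    MOBY = "BioMOBY/MOBY-S"


# Hypothetical entries; a real listing would be compiled by hand per resource.
LISTING = {
    "ExampleResourceA": {Access.PLAIN_HTTP, Access.HTTP_XML},
    "ExampleResourceB": {Access.WSDL},
    "ExampleResourceC": set(),  # no network-based access yet
}


def resources_by_mechanism(listing):
    """Invert the listing: for each mechanism, the resources offering it."""
    index = {mechanism: [] for mechanism in Access}
    for name in sorted(listing):
        for mechanism in listing[name]:
            index[mechanism].append(name)
    return index


def without_service_access(listing):
    """Resources with no recorded network-based access mechanism."""
    return sorted(name for name, mechanisms in listing.items() if not mechanisms)


if __name__ == "__main__":
    for mechanism, names in resources_by_mechanism(LISTING).items():
        print(f"{mechanism.value}: {', '.join(names) or '(none)'}")
    print("No service access:", without_service_access(LISTING))
```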
Small Bio-molecules (ChEBI)
I strongly agree with Karen Skinner's request that these entities be covered. They can be particularly valuable by providing critical "bridging" across domains - e.g., think of the myriad roles and interactions NO (nitric oxide) has across a range of biomedical domains. This can prove critical when seeking "hidden" statistically correlated links across large collections of annotated/knowledge-mapped data.
ChEBI is definitely the closest out there to a true ontology of small bio-molecules and does an admirable job avoiding the multiple inheritance one typically finds in standard biochemical classification schemes, though it still appears to lack a foundation in an upper ontology. It also appears to have a preponderance of plural-form labels (e.g., 'cerebroside' is listed as 'cerebrosides' (id: CHEBI:23079, is_a: CHEBI:17761)). The core branches - molecular structure, biological role, and application - do look to be a very practical means for classifying the relevant domains.
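The plural-label observation could be checked mechanically. The sketch below assumes the ontology is available as OBO-style text with `[Term]` stanzas carrying `id`, `name`, and `is_a` tags; the function names are made up, the trailing-"s" test is a deliberately naive heuristic that will produce false positives, and the second term in the sample is a hypothetical placeholder.

```python
def parse_terms(obo_text):
    """Parse [Term] stanzas of OBO-style text into dicts with id, name, is_a."""
    terms = []
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"id": None, "name": None, "is_a": []}
            terms.append(current)
        elif line.startswith("["):
            current = None  # some other stanza type; ignore its tags
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ", 1)[0].strip()  # drop trailing comment
            if tag == "is_a":
                current["is_a"].append(value)
            elif tag in ("id", "name"):
                current[tag] = value
    return terms


def plural_looking(terms):
    """Naive heuristic: flag labels ending in a single 's'."""
    flagged = []
    for term in terms:
        name = term["name"] or ""
        if name.endswith("s") and not name.endswith("ss"):
            flagged.append(term["id"])
    return flagged


# First stanza uses the identifiers quoted above; the second is hypothetical.
SAMPLE = """\
[Term]
id: CHEBI:23079
name: cerebrosides
is_a: CHEBI:17761

[Term]
id: EXAMPLE:0000001
name: example compound
"""

if __name__ == "__main__":
    print(plural_looking(parse_terms(SAMPLE)))
```

A real check would need a proper singular/plural test, since many legitimate singular labels end in "s".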
I think it's important to also note resources such as KEGG LIGAND, PubChem, CAS Registry, etc., as they include a wealth of additional knowledge that can be extremely valuable for more complex KR/KD applications. The ChEBI file in Sourceforge also includes an extensive collection of links to these and other small molecule knowledge resources.