Commentary on States 8
Comment on Chemical Entities
David, have you thought any more about standard IDs/terminologies for non-protein biological chemical entities such as dopamine, glucose or cerebroside, as well as drugs? - Karen Skinner
Comment on Top-Down Approaches to Ontology Development
from Barry Smith
David,
Of the items on your list:
NCBI taxonomy
Human - HGNC
Mouse - JAX/MGI
Drosophila - flybase
C. elegans - wormbase
Saccharomyces cerevisiae - SGDB
IUPAC single letter
NCBI genome builds
NCBI RefSeq
MeSH headings
Gene Ontology
I would like to comment on just the last two, and also on the three 'top-down' ontologies you mention in your more recent posting:
UMLS
NCI Thesaurus
BIRNlex
My presumption, given the mission of this Working Group, is that we are considering ontologies and related artifacts from the perspective of how well they can support the biomedical computing needs of the future.
First, MeSH does not live up to current standards of formal rigor (thus, and most blatantly, it contains classificatory cycles), and as far as I know it has no plans to adjust its policies in this regard. Hence, if it is to be recommended by this WG, then I would suggest that this should be as a tool for retrieving PubMed information (which is of course how MeSH conceives itself). At the same time we should recommend that it adjusts its policies to ensure evolution towards the kind of robust system which could support genuine formal reasoning with PubMed data in the future. A similar recommendation could be made also regarding the UMLS.
Both the Gene Ontology Consortium and the NCI Thesaurus (NCIT) have made the decision to adjust their policies in this respect within the framework of the OBO Foundry. The NCI has announced funding to support the radical restructuring of the NCIT that will be needed in order to remove the large number of structural defects in its present version ([1]).
NCIT was compiled in large part by adjoining (not always mutually compatible) bits from the UMLS which seemed relevant to its goals. Something similar can be said about BIRNLex, which -- as Bill Bug [William.Bug@DrexelMed.edu] and other members of the BIRN community have now agreed -- also needs to be replaced by a more rigorous BIRN ontology within the OBO Foundry framework.
In your comment you say "top down solutions will inevitably be incomplete and lag behind real usage." What you say does indeed apply to the UMLS and NCIT as currently conceived. But the OBO Foundry represents a top down solution of a new type. It is not a mere compilation but is rather designed to serve as a prospective standard, in terms of which both existing ontologies can be reformed in the direction of mutual interoperability, and new ontologies can be constructed ab initio. Three new ontologies are already being built in this way -- the Protein Ontology, the RNA Ontology, and the Functional Genomics Investigation Ontology -- and the OBO Foundry has also inspired the MIFoundry project, which aims to serve as an analogous prospective standard in the world of Minimal Information Checklists for high-throughput experiments.
Barry Smith
Response by David States
There are two possibilities - 'standards lead' and 'standards trail'. In areas like hardware, large scale implementation is delayed until a standard is agreed upon, and then everyone rolls out their solution. Language, on the other hand, is intrinsically a bottom up endeavor where the standards attempt to capture usage in the community. In this sense, ontologies, whether top down or bottom up, trail. I agree that a framework such as you describe in the OBO Foundry would be useful in organizing ontology development efforts. But using gene names as an example, the biomedical community seems to want to go through the hundred flowers bloom phase before settling on a standard. We may eventually train biologists to use standard definitions in referring to molecular function, cellular anatomy or experimental design, but Alfonso Valencia points out that at the moment only about 20% of authors use HGNC gene names. In the model organism literature, most authors do use standard gene names so there is hope, but progress is measured in decades.
While exclusion of cycles may be desirable, any graphical representation of synonym sets will be cyclic, and even 'proper' ontologies are DAGs not trees. Is insulin a protein, a hormone or a drug? Obviously the answer is 'yes'. Name collisions are also inevitable (PCR => polymerase chain reaction, phosphocreatine, premature contraction). There will continue to be a role for resources like MeSH; even if they are not 'proper' ontologies they do capture a great deal of real language use. Maybe such resources deserve their own name. Rather than calling them 'bad ontologies' call them useful 'structured dictionaries'.
David States
Response to this response by Barry Smith
DS: "In areas like hardware, large scale implementation is delayed until a standard is agreed upon, and then everyone rolls out their solution. Language, on the other hand, is intrinsically a bottom up endeavor where the standards attempt to capture usage in the community."
BS: What you say applies not only to hardware, of course, but also to language-like artifacts (such as programming languages, operating systems). There, too, good prospective standards can bring untold benefits. And now it seems to me that we can choose whether we treat a given language-like artifact (e.g. GO + GO Annotations, HGNC gene names) as a trailing or as a leading standard. And surely the very purpose of this WG, which is to prepare the way for awarding the NCBC imprimatur to certain chosen artifacts, forces this latter choice: trailing standards do not need official blessing.
DS: "In this sense, ontologies, whether top down or bottom up, trail."
BS: The GO methodology and architecture is often referred to as a de facto standard. It has now been imitated often enough that it can by now be more precisely described as a de facto leading standard. The OBO Foundry initiative is seeking to formalize this role, and already in the first 3 months of its existence three entirely new ontologies are being built on its terms, in each case by influential consortia who have the authority to impose their use on large parts of the relevant communities.
DS: "using gene names as an example, the biomedical community seems to want to go through the hundred flowers bloom phase before settling on a standard. We may eventually train biologists to use standard definitions in referring to molecular function, cellular anatomy or experimental design, but Alfonso Valencia points out that at the moment only about 20% of authors use HGNC gene names. In the model organism literature, most authors do use standard gene names so there is hope, but progress is measured in decades."
BS: Again -- surely this is precisely where the NCBC imprimatur can be useful: to speed up this progress. Moreover, we have to remember that it is not only biologists who need care and proper feeding. Computers are involved in all of this, and computers like their informational food to come in tidy bits.
DS: "While exclusion of cycles may be desirable, any graphical representation of synonym sets will be cyclic"
BS: I think we should support the idea (1) that controlled vocabularies and nomenclatures should be employed to the maximum possible degree, so that (2) the importance of synonyms (which are in any case never exact) should diminish.
DS: "and even 'proper' ontologies are DAGs not trees. Is insulin a protein, a hormone or a drug? Obviously the answer is 'yes'."
BS: There are now formal ('normalization') methods to deal with departures from the tree-form structure in ways which (a) reduce the number of errors such departures cause, and (b) enhance reasoning. We are also working out the ways to apply these methods to the OBO Foundry ontologies. There are also ways of dealing with 'insulin' in coherent fashion (we do not want computers e.g. inferring that every occurrence of 'insulin' in a text is also a reference to some drug).
DS: "There will continue to be a role for resources like MeSH; even if they are not 'proper' ontologies they do capture a great deal of real language use. Maybe such resources deserve their own name. Rather than calling them 'bad ontologies' call them useful 'structured dictionaries'."
BS: Exactly. But then we can also add that they could be even more useful if they were better structured, taking account of the new understanding of what is possible in ontology.
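Illustrative note: the exchange above turns on the difference between a classificatory cycle (the defect attributed to MeSH) and a term with several parents (the insulin example, which makes a hierarchy a DAG rather than a tree). The following is a minimal sketch of how the two cases might be told apart by a program. It assumes a toy is_a hierarchy stored as a Python dict mapping each term to its parents; the terms, the edges, and the function names find_cycle and multi_parent_terms are hypothetical and are not taken from MeSH, GO, or any OBO Foundry tool.

```python
from collections import defaultdict


def find_cycle(is_a):
    """Return one cycle as a list of terms (first == last), or None if acyclic.

    is_a maps each term to a list of its parent terms (hypothetical toy data).
    Uses a depth-first search with three colours: unvisited, in progress, done.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every term starts WHITE (0)
    path = []                 # terms on the current DFS path

    def visit(term):
        color[term] = GREY
        path.append(term)
        for parent in is_a.get(term, ()):
            if color[parent] == GREY:
                # parent is already on the current path: a cycle
                return path[path.index(parent):] + [parent]
            if color[parent] == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        color[term] = BLACK
        return None

    for term in list(is_a):
        if color[term] == WHITE:
            found = visit(term)
            if found:
                return found
    return None


def multi_parent_terms(is_a):
    """Terms with more than one parent: legal in a DAG, impossible in a tree."""
    return sorted(t for t, parents in is_a.items() if len(parents) > 1)


# Toy hierarchy mirroring the insulin example: several parents, no cycle.
dag = {
    "insulin": ["protein", "hormone", "drug"],
    "protein": ["molecule"],
    "hormone": ["molecule"],
    "drug": ["molecule"],
    "molecule": [],
}

# Toy hierarchy with a classificatory cycle: a is_a b is_a c is_a a.
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(find_cycle(dag))          # no cycle in the multi-parent hierarchy
print(multi_parent_terms(dag))  # the terms that make it a DAG, not a tree
print(find_cycle(cyclic))       # one cycle found in the second hierarchy
```

On this reading, multiple parentage and cycles are separate properties: the first hierarchy is not a tree but is still acyclic, while the second cannot support the kind of formal reasoning discussed above until the cycle is removed.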