Tailoring the NCI Thesaurus for semantic interoperability
Representing cancer knowledge is a daunting task: there are hundreds of cancer subtypes, and different stakeholders often use inconsistent naming conventions. The NCI Thesaurus (NCIt) is a widely used cancer reference taxonomy that covers over 100,000 terms, developed by the National Cancer Institute (NCI) as a standalone ontology since 2003. While the NCIt is an OWL 2 ontology with full anatomical axioms that are linked to disease concepts, it pre-dated and evolved independently of the Open Biomedical Ontologies (OBO) and the best practices established in that community. As a result, the NCIt lacks the interoperability needed for integrated analysis of data annotated with these ontological resources, especially data annotated with OBO ontologies oriented toward basic research. Further, although NCI provides APIs that leverage the NCIt's semantics, most use of the NCIt is for simple data tagging. Recently, NCI partnered with members of the Monarch Initiative to enhance the ontology for interoperability with OBO ontologies. This work was performed in several steps.
Making ontologies more interoperable: Analyzing the discrepancies
A number of projects “cross-reference” NCIt classes in their terminological artifacts. Ontologies often cross-reference classes in other ontologies when the terms are equivalent, in order to associate one data item with another. For example, NCIT:C4872 ‘breast carcinoma’ is cross-referenced by DOID:3459 ‘breast carcinoma’. Cross-references to NCIt terms, however, are often a) not vetted by NCIt; b) not kept current; c) not logically consistent; and d) not coordinated across communities. This leads to difficulties in data integration and analytics. To support data integration and cross-dataset analyses, we used the kBOOM algorithm to analyze selected non-NCI sources that cross-reference NCIt. kBOOM merges different disease terminologies: it takes various inputs, including cross-references, uses them to inform Bayesian probabilistic inferences about the relationship between mapped terms, and generates logical axioms between terms to produce a merged ontology. The estimated probability that each mapping is accurate is available on the GitHub repository. For the low-probability mappings, we evaluated the inferred equivalence, subclass, and superclass relationships between NCIt and other terms, and identified potentially incorrect relationships. The final report summarizes the findings, and an example issue is described here. This work facilitates best practices for linking to NCIt for accurate data annotation, validation, mapping, and integration. Improved mappings across disease sources have been implemented in Monarch’s Disease Ontology (MONDO).
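The triage step described above, picking out low-probability mappings for manual review, can be sketched as follows. This is a minimal illustration and not kBOOM itself: the `Mapping` class, the `needs_review` helper, the 0.8 threshold, the "none" relation, and all probability values are assumptions made for the example, and the `EX:` identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    """A candidate mapping with a probability for each possible relation."""
    source: str
    target: str
    probs: dict  # relation name -> probability (illustrative values only)

    def best(self):
        """Return the most probable relation and its probability."""
        relation = max(self.probs, key=self.probs.get)
        return relation, self.probs[relation]


def needs_review(mappings, threshold=0.8):
    """Return mappings whose most probable relation falls below the threshold."""
    return [m for m in mappings if m.best()[1] < threshold]


# Illustrative data: the probabilities are invented for this sketch.
mappings = [
    Mapping("NCIT:C4872", "DOID:3459",
            {"equivalent": 0.95, "subclass": 0.02, "superclass": 0.02, "none": 0.01}),
    # Hypothetical identifiers standing in for an uncertain mapping.
    Mapping("EX:0001", "EX:0002",
            {"equivalent": 0.40, "subclass": 0.35, "superclass": 0.15, "none": 0.10}),
]

for m in needs_review(mappings):
    relation, p = m.best()
    print(f"{m.source} -> {m.target}: best guess '{relation}' at p={p:.2f}")
```

Only the second mapping is flagged here, because no single relation reaches the assumed threshold; a curator would then decide which relationship, if any, actually holds.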
Evaluating Molecular Subtyping Modeling
With the emergence of new experimental and analytic technologies, diseases are increasingly characterized and classified by their molecular features, and the NCIt includes a wealth of expert-curated knowledge about the molecular features of cancers. We performed a thorough evaluation of molecular subtype information in the NCIt, aiming to identify where improved modeling and infrastructure could bring real benefits for its curation and maintenance, its ability to interoperate with and leverage external resources, and its utility for bioinformatics and big-data applications (the report is available here). Our recommendations targeted four key areas:
(1) Identification of additional external ontologies and identifier systems that the NCIt could leverage in mapping its terms to equivalent concepts, to improve interoperability across community resources.
(2) Refactoring the “Molecular Abnormality” hierarchy to follow more principled axes of classification, and better align with models implemented in established biomedical ontologies and knowledge bases.
(3) Improving “Biomarker” representation to address issues related to data richness, hierarchical classification, and modeling consistency and precision.
(4) Harmonizing logical design patterns used to describe molecular subtypes of cancer.
Addressing these issues will improve the consistency, connectedness, and query-ability of the NCIt, and enhance its utility for biomedical applications aimed at molecular characterization and subtyping of cancer.
OBO-izing the NCIt
There are currently a number of experimental versions of NCIt in .obo format (a simpler subset of the Web Ontology Language, OWL; not to be confused with OBO the organization). These versions may not be consistent, none is official, and none was created in collaboration with the NCI to ensure consistency, currency, and quality. We created an automated transformation pipeline that produces an edition of NCIt more closely aligned with OBO Library conventions:
(1) The ontology is available in both OWL and OBO formats via OBO persistent URLs (PURLs); dated versions can always be retrieved via their OWL version IRI.
(2) The ontology uses standard OBO annotation properties for definitions and synonyms.
(3) The subclass hierarchy is pre-reasoned, for ease of use with a wide range of tools.
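As a rough illustration of the PURL convention in item (1), the sketch below builds OBO-style PURLs from an ontology ID or a term CURIE. It assumes the NCIt OBO Edition follows the standard OBO Library pattern (`http://purl.obolibrary.org/obo/` followed by `ncit.owl`, `ncit.obo`, or `NCIT_<local id>`); the function names are hypothetical helpers, not part of any NCIt tooling.

```python
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def ontology_purl(ontology_id, fmt="owl"):
    """Build the PURL for an ontology file, e.g. ncit.owl or ncit.obo."""
    if fmt not in ("owl", "obo"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{OBO_PURL_BASE}{ontology_id.lower()}.{fmt}"


def curie_to_iri(curie):
    """Expand a CURIE such as 'NCIT:C4872' into an OBO-style term IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a valid CURIE: {curie!r}")
    return f"{OBO_PURL_BASE}{prefix}_{local_id}"


print(ontology_purl("ncit", "obo"))
print(curie_to_iri("NCIT:C4872"))
```

Keeping the expansion in one place makes it easy to move between the compact identifiers used in annotations and the full IRIs used in OWL files and queries.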
For deeper OBO integration, the build pipeline also creates a version of NCIt called “NCIt Plus”. This edition replaces specific concept hierarchies in NCIt with those provided by domain-specific ontologies in the OBO Library. For example, NCIt anatomical terms and logical relationships are replaced by corresponding concepts from Uberon. Currently, Uberon and the Cell Ontology are included in NCIt Plus; in the future we intend to extend this integration to other terminologies, including the Gene Ontology and NCBI taxon. The build pipeline also makes available some smaller subsets of NCIt, such as a subset of NCIt Plus focused on the Neoplasm Core, as well as a slim based on terms used in OncoTree. More information about available downloads can be found on the wiki.
We also created a public SPARQL endpoint which can be used to query both the NCIt OBO Edition and the “Plus” version. The database also includes a range of precomputed inferences which facilitate querying of property relationships without an OWL reasoner. The project wiki provides a library of sample queries.
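To give a sense of what a query against such an endpoint might look like, here is a minimal sketch that builds a SPARQL query for the direct subclasses of an NCIt class and encodes it as a GET request URL in the style of the SPARQL 1.1 Protocol. The endpoint URL is a placeholder (the real address is not given here), the helper names are hypothetical, and the term IRI assumes the standard OBO PURL pattern; the sample queries on the project wiki remain the authoritative reference.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the actual endpoint address from the project wiki.
ENDPOINT = "https://example.org/sparql"


def subclass_query(class_iri, limit=10):
    """Build a SPARQL query listing direct subclasses of a class, with labels."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?sub ?label WHERE {\n"
        f"  ?sub rdfs:subClassOf <{class_iri}> .\n"
        "  OPTIONAL { ?sub rdfs:label ?label }\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(endpoint, query):
    """Encode the query as a GET request URL (query passed in the 'query' parameter)."""
    return f"{endpoint}?{urlencode({'query': query})}"


query = subclass_query("http://purl.obolibrary.org/obo/NCIT_C4872")
print(query)
print(request_url(ENDPOINT, query))
```

The sketch stops short of sending the request, so it runs without network access; the resulting URL could be fetched with any HTTP client, typically with an `Accept` header requesting SPARQL JSON results.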
The new NCIt OBO Edition will be generated concurrently with every NCIt release and is available for direct download via the OBO Library PURLs. In addition, the OBO Edition is available for browsing at Ontobee and the EBI Ontology Lookup Service. The NCI welcomes requests and feedback from the community at the GitHub tracker (https://github.com/NCI-Thesaurus/thesaurus-obo-edition/issues).
About the authors:
Jim Balhoff is a Senior Research Scientist at the Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, North Carolina.
Matthew Brush is an ontologist for the Monarch Initiative at Oregon Health & Science University, in Portland, Oregon.
Sherri de Coronado is the Lead and Program Manager for the Semantic Infrastructure Section, Center for Biomedical Informatics and Information Technology, National Cancer Institute.
Gilberto Fragoso is a Biomedical Informatics Program Manager for the Semantic Infrastructure group, Center for Biomedical Informatics and Information Technology, National Cancer Institute.
Melissa Haendel co-leads the Monarch Initiative at Oregon Health & Science University, in Portland, Oregon.
Chris Mungall co-leads the Monarch Initiative at Lawrence Berkeley National Laboratory in Berkeley, California.
Nicole Vasilevsky is the Lead Biocurator for the Monarch Initiative at Oregon Health & Science University, in Portland, Oregon.
Larry Wright, Retired, National Cancer Institute.
Image credit: https://commons.wikimedia.org/wiki/File:Cancer-cell.jpg