Introducing monarchr: The Monarch Knowledge Graph in R

Monarch Initiative
Feb 4, 2025


Do you love biomedical knowledge graphs and the R programming language? Have you been itching to explore associations between genes, phenotypes, diseases, and other entities using a tidy-inspired interface? Would you love to compare these relationships across species within a broad ecosystem of data science libraries? Well, we have good news: the monarchr package is now broadly available, bringing the expansive Monarch knowledge graph to an R terminal near you!

Cystic Fibrosis subgraph of the Monarch KG

Background and Motivation

The Monarch Knowledge Graph (KG) hosts detailed information about millions of biological entities, including genes, diseases, phenotypes, molecular functions, drugs, and more, across both human and other species. Even better, these are connected by tens of millions of curated relationships, including subclass information (e.g., for organizing classes of rare diseases), gene orthology, causal gene-to-disease relationships, disease-phenotype associations, and so on, enabling many scientific uses. The Monarch Initiative provides access to this information in a variety of ways, including a website, REST API, Python package, and even an AI chatbot. Now, the monarchr package enables powerful and flexible access to the Monarch KG using the popular R programming language.

The scale and complexity of very large KGs present practical challenges. We can easily find all genes connected to a single disease, but how efficiently can we find all genes associated with a whole class of diseases and their phenotypes? What if we wanted to compare those across multiple disease classes or organisms organized by molecular function?

In other domains, the R community has addressed complex data manipulation needs with the tidyverse ecosystem. Tidy tabular data follows traditional database normalization principles, but the expanded tidyverse has grown to cover models, lists, tensors, and even omics data. These packages typically define both data structures and the ‘verbs’ we can apply to them: tables may be joined, lists mapped, tensors reshaped, and genomic ranges intersected. While not unique to R, these intuitive combinations of data and methods provide powerful tools for analytics.

Graphs are no stranger to tidying in R. The tidygraph package provides flexible views of nodes and edges, along with a suite of graph-specific filters, joins, maps, and morphs. Built on igraph, it supports a variety of algorithms and is compatible with third-party libraries for visualization, clustering, link prediction, graph neural networks, and more. Our goal with monarchr was to develop an interface to the full Monarch KG in R, leveraging these existing tools to provide a framework for tidy knowledge graph querying and manipulation.

Data and Operations

The Monarch KG is large, and in many cases, only a subset of its contents is relevant. For example, researchers studying lysosomal storage disorders will likely only want information about these diseases and their close neighborhoods. Even a broad clustering analysis of all human phenotypes may only require nodes provided by the Human Phenotype Ontology (HPO) and specific connections. The entry point of monarchr is thus a flexible method to fetch nodes from the larger KG into a local tidygraph. This can be accomplished from a list of specific node IDs, as in fetch_nodes(query_ids = c("MONDO:0009061", "HGNC:1884")), or in bulk with R logical syntax. For example, fetch_nodes(in_taxon_label == "Homo sapiens" & "biolink:Gene" %in_list% category) retrieves all human genes. The package also incorporates Monarch-specific features, such as monarch_search("Cystic fibrosis"), which uses the free-text search API to fetch nodes matching a query string.
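As a rough sketch, the three entry points described above might be written out as follows. This assumes monarchr is installed and the cloud-hosted KG is reachable. It also assumes that fetch_nodes() takes an engine object as its first argument and that monarch_engine() provides one for the hosted KG; neither detail is stated in the text above, so treat them as assumptions to check against the package documentation.

```r
library(monarchr)

# Assumption: fetch_nodes() expects an engine as its first argument, and
# monarch_engine() returns one pointing at the cloud-hosted Monarch KG.
e <- monarch_engine()

# Fetch specific nodes by ID.
g_ids <- fetch_nodes(e, query_ids = c("MONDO:0009061", "HGNC:1884"))

# Fetch in bulk with a logical expression: all human genes.
# This is a large result and may take a while to retrieve.
g_genes <- fetch_nodes(
  e,
  in_taxon_label == "Homo sapiens" & "biolink:Gene" %in_list% category
)

# Free-text search for nodes matching a query string.
g_cf <- monarch_search("Cystic fibrosis")
```

Each call is expected to return a tidygraph object, so the results can be passed straight into further pipeline steps.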

After fetching a set of nodes (e.g. with monarch_search()), the next natural operation is to fetch additional connected nodes from the surrounding neighborhood and add them to the retrieved set, which we implement with expand(). Here’s a complete example in code:
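The original embedded snippet is not reproduced here, so the following is a minimal sketch of a search followed by two expansions. The argument names categories, predicates, and limit are assumptions about the expand() signature and should be checked against the package documentation.

```r
library(monarchr)

# Search for up to five nodes matching "Cystic fibrosis", then grow the graph.
# Assumption: expand() accepts `categories`, `predicates`, and `limit`.
g <- monarch_search("Cystic fibrosis", limit = 5) |>
  expand(categories = "biolink:Gene", limit = 10) |>
  expand(predicates = "biolink:has_phenotype", limit = 10)

# Printing a tidygraph object shows its node and edge tables.
g
```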

In this example, we first fetch five nodes matching a search for “Cystic fibrosis”; from these we expand to include ten connected gene nodes, and from this set of diseases + genes we fetch ten nodes connected via a biolink:has_phenotype relationship. The use of limit helps with exploration but does not provide a complete picture; the same query without limits results in a graph with 532 nodes and 1,372 edges. With this more complete picture, we can be sure that the non-human variants of cystic fibrosis are not connected to genes or phenotypes in the KG:
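One way to check this kind of claim is to count connections per disease node in the unrestricted result. The sketch below is illustrative only: it assumes the search step keeps its limit of five while the expansions are unrestricted, and that node tables carry id, name, and pcategory columns, which may differ from the actual schema.

```r
library(monarchr)
library(tidygraph)
library(dplyr)

# Same query as before, but with no limits on the expansion steps.
g_full <- monarch_search("Cystic fibrosis", limit = 5) |>
  expand(categories = "biolink:Gene") |>
  expand(predicates = "biolink:has_phenotype")

# Count edges touching each disease node; a degree of 0 means the node
# has no gene or phenotype connections in this subgraph.
# Assumption: `pcategory` holds each node's primary Biolink category.
disease_degree <- g_full |>
  activate(nodes) |>
  mutate(degree = centrality_degree(mode = "all")) |>
  filter(pcategory == "biolink:Disease") |>
  as_tibble() |>
  select(id, name, degree)

disease_degree
```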

Using familiar R syntax, these data are easily processed as node and edge tables. We might, for example, filter this result to include only phenotypes connected to more than one gene, then expand again to find other genes or diseases connected to this subset of phenotypes.
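The filtering step described above can be sketched on a small toy graph, so it runs without network access. The node IDs, the pcategory column name, and the edge layout are illustrative assumptions modeled on what monarchr is expected to return, not real KG content.

```r
library(tidygraph)
library(dplyr)

# Toy data: two genes and two phenotypes (hypothetical IDs).
nodes <- tibble(
  id = c("GENE:A", "GENE:B", "HP:TOY1", "HP:TOY2"),
  pcategory = c("biolink:Gene", "biolink:Gene",
                "biolink:PhenotypicFeature", "biolink:PhenotypicFeature")
)
# Both genes point to HP:TOY1; only GENE:A points to HP:TOY2.
edges <- tibble(
  from = c(1, 2, 1),
  to = c(3, 3, 4),
  predicate = "biolink:has_phenotype"
)
g <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)

# Work with the node and edge tables directly.
node_tbl <- g |> activate(nodes) |> as_tibble() |> mutate(idx = row_number())
edge_tbl <- g |> activate(edges) |> as_tibble()

# Phenotypes that are the target of edges from more than one gene.
shared <- edge_tbl |>
  distinct(from, to) |>
  inner_join(node_tbl, by = c("from" = "idx")) |>
  filter(pcategory == "biolink:Gene") |>
  count(to, name = "n_genes") |>
  filter(n_genes > 1) |>
  inner_join(node_tbl, by = c("to" = "idx"))

# Keep all genes plus only the shared phenotypes.
g_shared <- g |>
  activate(nodes) |>
  filter(pcategory == "biolink:Gene" | id %in% shared$id)
```

On a real monarchr result, the surviving phenotype nodes could then be passed back to expand() to pull in other genes or diseases connected to them.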

This expansion operation is flexible, supporting combinations of criteria for directionality, edge predicates, and node categories. Some predicates, especially biolink:subclass_of, are transitive, so expand() also supports transitive expansion. In the following example we start with Niemann-Pick disease, fetch all of its subtypes, sub-subtypes, and so on by transitively following biolink:subclass_of in an inward direction, and lastly expand to include all associated genes:
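The original embedded snippet is not reproduced here; a minimal sketch of the transitive query might look like the following. The argument names direction and transitive, and the value "in", are assumptions about the expand() signature based on the description above.

```r
library(monarchr)

# Start from the top search hit for Niemann-Pick disease.
npd <- monarch_search("Niemann-Pick disease", limit = 1)

# Assumption: `direction = "in"` follows edges pointing toward the current
# nodes, and `transitive = TRUE` repeats the step until no new nodes appear.
g_npd <- npd |>
  expand(predicates = "biolink:subclass_of",
         direction = "in",
         transitive = TRUE) |>
  expand(categories = "biolink:Gene")

g_npd
```

Because subtypes point to their parent via biolink:subclass_of, following the predicate inward collects the descendants of the starting disease rather than its ancestors.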

While these queries are small, larger queries could retrieve tens of thousands of nodes and edges. To process these efficiently when accessing the cloud-hosted Monarch KG, queries are paged (broken down into steps) and retrieved iteratively. While this does increase the runtime for large queries, it helps protect the centralized data resource and balance resource use.
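To illustrate the general idea of paging, here is a self-contained sketch of an iterative fetch loop. It is not monarchr's actual implementation: fetch_page() is a hypothetical stand-in for a remote endpoint, and the page size and node IDs are made up.

```r
# Hypothetical stand-in for a remote endpoint that returns one page of
# results, starting after `skip` items and holding at most `page_size` items.
fetch_page <- function(skip, page_size,
                       all_ids = sprintf("NODE:%03d", 1:25)) {
  if (skip >= length(all_ids)) return(character(0))
  all_ids[(skip + 1):min(skip + page_size, length(all_ids))]
}

# Request pages one at a time until the server has nothing left.
fetch_all <- function(page_size = 10) {
  results <- list()
  skip <- 0
  repeat {
    page <- fetch_page(skip, page_size)
    if (length(page) == 0) break
    results[[length(results) + 1]] <- page
    if (length(page) < page_size) break  # a short page means we are done
    skip <- skip + page_size
  }
  unlist(results)
}

ids <- fetch_all(page_size = 10)
length(ids)
```

Each request stays small and bounded, which is the property that protects a shared server at the cost of more round trips.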

R users will recognize the ‘pipe’ operator |> here. Like many R packages, monarchr functions primarily take and return a consistent data type for use with pipelines, in this case, a tidygraph object. Together with tidygraph features for joining, filtering, and other manipulations, iterative fetching and expansion can answer even complex questions. For a more detailed example, see the vignette classifying Alzheimer’s disease phenotypes as mental, behavioral, or both.

Standards and More

The monarchr package is designed to work seamlessly with the cloud-hosted Monarch KG via API calls, but alternative “engines” allow working with any knowledge graph in TSV-based KGX format, such as those at kghub.org. These may be provided as files (or URLs to files), or hosted via Neo4j, as monarchr provides generic file_engine() and neo4j_engine() access. Local graphs can also be saved in TSV KGX format, allowing users to create reusable knowledge graphs as subsets of larger ones.
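As a hedged sketch of how the alternative engines might be used: the file path and Neo4j URL below are hypothetical placeholders, and the save_kgx() function name and the engine-first form of fetch_nodes() are assumptions to verify against the package documentation.

```r
library(monarchr)

# Hypothetical path: a KGX TSV archive, given as a local file or a URL.
fe <- file_engine("path/to/my_kg.tar.gz")

# Assumption: engines are passed as the first argument to fetch_nodes().
g <- fetch_nodes(fe, query_ids = c("MONDO:0009061"))

# Hypothetical connection details for a Neo4j-hosted KG.
# ne <- neo4j_engine(url = "http://localhost:7474")

# Assumption: save_kgx() writes a local graph out in TSV KGX format so it
# can be reloaded later through file_engine().
save_kgx(g, "cf_subgraph.tar.gz")
```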

To start using monarchr, see the installation instructions on the package homepage and more details on its functionality in the Getting Started vignette. For those wishing to contribute, the GitHub repository is at https://github.com/monarch-initiative/monarchr.
