SPIRES: building structured knowledge bases from unstructured text using Large Language Models

Monarch Initiative
2 min read · Apr 7, 2023
Overview of the SPIRES approach. Rather than relying on GPT alone, which is prone to hallucination, SPIRES (part of the OntoGPT package) leverages knowledge in existing public repositories and ontologies. SPIRES takes text as input and produces structured knowledge that conforms to a knowledge schema specified in advance by a domain modeler.

Large Language Models (LLMs) such as GPT-3+ can help with tasks ranging from generating sophisticated software to writing love sonnets. We wondered whether we could harness the power of LLMs to facilitate a difficult hands-on task that we often encounter: assisting in the curation of a structured knowledge base that can respond to complex queries with results that conform to a user-specified schema. (Spoiler: YES! But with caveats.) Our new arXiv preprint by J. Harry Caufield et al. describes the process.

Development of knowledge bases and ontologies is a time-consuming and labor-intensive task that currently requires specialized expertise. The human knowledge that should ideally be integrated in these structured knowledge bases exists mostly in unstructured forms such as natural language texts. Machine learning (ML) and natural language processing (NLP) can help, but they typically rely heavily on extensive training data and are not designed to populate arbitrary knowledge schemas of the kind found in biomedical and translational science, such as the Biolink Model.

Our approach, Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), uses LLMs (such as GPT-3+) to perform zero-shot learning (ZSL), which requires no additional training data. SPIRES provides general-purpose query answering that returns information in a specified schema. SPIRES can be used in almost any domain; our preprint explores examples such as extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical-to-disease causation graphs.
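To make the idea of schema-driven, zero-shot extraction more concrete, here is a minimal Python sketch. It is not the OntoGPT implementation: the `Field` and `Schema` classes, the prompt wording, the `field: value` response format, and the `fake_llm` stand-in (which returns canned text instead of calling a real model) are all assumptions made for illustration. The sketch builds a prompt from the schema, parses the completion back into fields, and recurses whenever a field is itself a structured object.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Field:
    """One slot of a hypothetical schema."""
    name: str
    description: str
    multivalued: bool = False
    nested: Optional["Schema"] = None  # set when each value is itself structured


@dataclass
class Schema:
    name: str
    fields: List[Field]


def build_prompt(schema: Schema, text: str) -> str:
    """Turn the schema into a fill-in-the-blanks prompt (wording is assumed)."""
    lines = []
    for f in schema.fields:
        hint = f.description + ("; semicolon-separated list" if f.multivalued else "")
        lines.append(f"{f.name}: <{hint}>")
    return (
        "From the text below, extract the following fields in this format:\n\n"
        + "\n".join(lines)
        + "\n\nText:\n" + text + "\n\n===\n"
    )


def parse_response(schema: Schema, response: str) -> Dict[str, str]:
    """Keep only 'name: value' lines whose name is a field of the schema."""
    known = {f.name for f in schema.fields}
    raw: Dict[str, str] = {}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in known:
            raw[key.strip()] = value.strip()
    return raw


def extract(schema: Schema, text: str, llm: Callable[[str], str]) -> dict:
    """Prompt, parse, then recurse into any field that has a nested schema."""
    raw = parse_response(schema, llm(build_prompt(schema, text)))
    result: dict = {}
    for f in schema.fields:
        if f.name not in raw:
            continue
        if f.multivalued:
            values = [v.strip() for v in raw[f.name].split(";") if v.strip()]
        else:
            values = [raw[f.name]]
        if f.nested is not None:
            values = [extract(f.nested, v, llm) for v in values]
        result[f.name] = values if f.multivalued else values[0]
    return result


# Canned completions standing in for a real model (illustration only).
CANNED = {
    "2 cups flour": "food: flour\namount: 2 cups",
    "1 tsp salt": "food: salt\namount: 1 tsp",
}


def fake_llm(prompt: str) -> str:
    text = prompt.split("Text:\n", 1)[1].split("\n\n===", 1)[0]
    if text in CANNED:
        return CANNED[text]
    return "label: Simple bread\ningredients: 2 cups flour; 1 tsp salt"


ingredient = Schema("Ingredient", [
    Field("food", "the food item"),
    Field("amount", "the quantity and unit"),
])
recipe = Schema("Recipe", [
    Field("label", "the name of the recipe"),
    Field("ingredients", "the ingredients", multivalued=True, nested=ingredient),
])

result = extract(recipe, "Simple bread: mix 2 cups flour and 1 tsp salt.", fake_llm)
print(result)
```

In a real pipeline, `fake_llm` would be replaced by a call to an actual model, and the extracted strings would still need to be matched against ontology terms, a step this sketch leaves out.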

We evaluated SPIRES on a Chemical-Disease relation extraction task and found that its performance was comparable to that of pre-trained specialized pipelines. One of the strengths of SPIRES is its generality and high expressivity, which allow for more complex data models, including multi-step food recipes, biological pathways, and multi-step drug mechanisms. We demonstrate an end-to-end pipeline that takes recipe URLs from popular websites as input and produces a first pass at a hierarchical ontology of recipes, organized according to dietary preferences, making use of Open Bio Ontologies such as the Food Ontology.
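This post does not say which metrics were used; see the preprint for the actual evaluation. As a general illustration, relation extraction is commonly scored by comparing predicted (subject, predicate, object) triples against a gold-standard set using precision, recall, and F1. The sketch below shows that calculation in Python. The identifiers `CHEM:1`, `DIS:1`, and so on are made-up placeholders, not real data or results.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (chemical, predicate, disease)


def score(predicted: Set[Triple], gold: Set[Triple]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over sets of triples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder triples for illustration only.
gold = {
    ("CHEM:1", "induces", "DIS:1"),
    ("CHEM:2", "induces", "DIS:2"),
    ("CHEM:4", "induces", "DIS:4"),
    ("CHEM:5", "induces", "DIS:5"),
}
predicted = {
    ("CHEM:1", "induces", "DIS:1"),
    ("CHEM:2", "induces", "DIS:2"),
    ("CHEM:3", "induces", "DIS:9"),
}

metrics = score(predicted, gold)
print(metrics)
```

Exact matching on identifiers only works when both sides are grounded to the same vocabulary, which is one reason mapping extracted text to ontology terms matters for this kind of comparison.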

SPIRES is available as part of the open-source OntoGPT package: https://github.com/monarch-initiative/ontogpt. Currently, to use SPIRES you need a subscription to the OpenAI API, but we hope it will soon be possible to run it without this dependency.

Reference: Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, Kim H, Moxon SAT, Reese JT, Haendel MA, Robinson PN, Mungall CJ. Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning. arXiv [cs.AI]. 2023. http://arxiv.org/abs/2304.02711

Monarch Initiative

Semantically curating genotype-phenotype knowledge. Visit us at https://monarchinitiative.org/ #OpenScience #Collaborative #Data