MELODI Presto: A fast and agile tool to explore
semantic triples derived from biomedical
literature
Benjamin Elsworth1,*
and Tom R Gaunt1
1MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Oakfield House,
Oakfield Grove, Bristol, BS8 2BN, United Kingdom.
*To whom correspondence should be addressed.
Abstract
The field of literature-based discovery is growing in step with the volume of literature being produced.
From modern natural language processing algorithms to high quality entity tagging, the methods and
their impact are developing rapidly. One annotation object that arises from these approaches, the subject-predicate-object
triple, is proving to be very useful in representing knowledge. We have implemented
efficient search methods and an application programming interface (API), to create fast and
convenient functions to utilize triples extracted from the biomedical literature by SemMedDB. By refining
these data we have identified a set of triples that focus on the mechanistic aspects of the literature,
and provide simple methods to explore both enriched triples from single queries, and overlapping triples
across two query lists.
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
The biomedical literature contains a wealth of knowledge on disease
mechanisms that link risk factors to disease outcomes. Literature reviews
are an important part of developing new mechanistic hypotheses, but are
time-consuming when exploring many risk factors or disease outcomes in
parallel. We propose utilizing the entire biomedical literature corpus to
generate a metric of evidence based on the number of times a particular
statement has been documented. This approach can be used to produce an
overview of the main topics and terms for any given biomedical query,
e.g. a risk factor, or identify overlapping elements between two queries,
e.g. a risk factor (exposure) and disease outcome.
Previously we created MELODI, a web application that derives overlapping
enriched literature elements connecting a risk factor and a disease
(Elsworth et al., 2018), thus identifying potential intermediate mechanisms.
MELODI utilises semantic triples (‘subject-predicate-object’) derived
from the titles and abstracts of nearly 30 million biomedical articles
using SemRep (Rindflesch and Fiszman, 2003) and provided by
SemMedDB (Kilicoglu et al., 2012). Whilst effective, the scale and complexity
of data utilized by MELODI, its implementation of a graph
database and web application, and its focus on single risk factor/outcome
combinations limit its application to many queries in parallel.
To address these challenges, we developed MELODI Presto, a quicker and
more agile tool to identify overlapping elements between any two query
lists using millions of semantic triples from the scientific literature
(https://melodi-presto.mrcieu.ac.uk/).
2 Features
The main features and innovations in MELODI Presto are described below:
Filter by UMLS semantic type: SemMedDB triples were filtered to include
only those matching particular ‘term types’. These types are defined
by the UMLS semantic type abbreviations
(https://metamap.nlm.nih.gov/SemanticTypesAndGroups.shtml). We focus
on terms most relevant to mechanistic inference. Supplementary table
1 (Supplementary material) lists selected terms. For a triple to be included
both the subject and object semantic types needed to be in this list.
© The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the
original work is properly cited.
Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa726/5893950 by guest on 07 October 2020
Filter by predicate type: To improve the interpretability of the data we
removed the more ambiguous predicates. Supplementary table 2 (Supplementary
material) lists excluded predicates. We retained predicates that
implied direction or causality, excluding predicates such as
PROCESS_OF, PART_OF and ISA. Combined, these two filtering criteria
reduced the number of PREDICATION triples from 103,284,300 to
8,295,443.
Improve search performance: MELODI Presto implements a simpler
approach than MELODI which does not require a graph database architecture.
We therefore selected Elasticsearch
(https://www.elastic.co/elasticsearch) for performance based on our previous
experience applying it to large biomedical datasets (Elsworth et al.,
2020).
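The two filters described above can be illustrated with a minimal sketch. The excluded predicates are the three named in the text; the semantic-type abbreviations are only an assumed sample (the full lists are in Supplementary tables 1 and 2), and the `Triple` structure and example rows are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical in-memory view of one SemMedDB predication."""
    subject: str
    subject_type: str   # UMLS semantic type abbreviation of the subject
    predicate: str
    obj: str
    obj_type: str       # UMLS semantic type abbreviation of the object


# Assumed sample of allowed semantic types (not the authors' full list).
ALLOWED_TYPES = {"dsyn", "gngm", "aapp"}
# Predicates named in the text as excluded; the full list is in Supplementary table 2.
EXCLUDED_PREDICATES = {"PROCESS_OF", "PART_OF", "ISA"}


def keep(triple: Triple) -> bool:
    """Keep a triple only if both term types are allowed and the predicate is not excluded."""
    return (
        triple.subject_type in ALLOWED_TYPES
        and triple.obj_type in ALLOWED_TYPES
        and triple.predicate not in EXCLUDED_PREDICATES
    )


# Illustrative rows only; the second and third are invented for the example.
triples = [
    Triple("Sleep Apnea, Obstructive", "dsyn", "PREDISPOSES", "Hypertensive disease", "dsyn"),
    Triple("Hypertensive disease", "dsyn", "PROCESS_OF", "Patients", "podg"),
    Triple("Example gene", "gngm", "ISA", "Genes", "gngm"),
]

kept = [t for t in triples if keep(t)]
for t in kept:
    print(f"{t.subject} : {t.predicate} : {t.obj}")
```

Only the first row passes both filters: the second fails on predicate and object type, the third on predicate alone.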
3 Implementation
Data sources and pre-processing
MELODI Presto incorporates semantic triples from the SemMedDB database
(Kilicoglu et al., 2012), which is built by running the SemRep semantic
knowledge representation tool (Rindflesch and Fiszman, 2003) on
the MEDLINE database.
The SemMedDB resource (version semmedVER42_R, 2020) is updated
periodically and provides data downloads in SQL format. For MELODI
Presto we extracted the PREDICATION, SENTENCE and CITATION
tables. For enrichment analysis, frequency counts of triples were pre-calculated
using Elasticsearch aggregation calls and added to a separate index.
This will be updated with new SemMedDB releases.
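As a rough illustration of the pre-calculated frequency counts, the sketch below shows an assumed shape for an Elasticsearch terms-aggregation request body, followed by a plain-Python stand-in that computes the same kind of counts from a small list. The field name `triple_id`, the bucket size and the example triples are hypothetical; this is not the authors' indexing code.

```python
from collections import Counter

# Assumed request body for a terms aggregation; "triple_id" is a hypothetical
# keyword field identifying each subject-predicate-object combination.
aggregation_body = {
    "size": 0,  # return no documents, only aggregation buckets
    "aggs": {
        "triple_counts": {
            "terms": {"field": "triple_id", "size": 10000}
        }
    },
}

# Plain-Python equivalent on a toy list of (subject, predicate, object) triples.
toy_triples = [
    ("A", "CAUSES", "B"),
    ("A", "CAUSES", "B"),
    ("B", "AFFECTS", "C"),
]

global_counts = Counter(toy_triples)        # occurrences of each distinct triple
global_total = sum(global_counts.values())  # total number of triples

for triple, count in global_counts.most_common():
    print(triple, count)
```

Storing such counts once means the enrichment step only needs a lookup for each triple rather than a fresh scan of the full data set.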
Functions
MELODI Presto provides three functions:
1) Enrich: The enrichment method follows the same principle as
MELODI, using a standard 2x2 Fisher’s exact test. For example, if a query
‘Sleep duration’ returned a set of triples "Sleep Apnea, Obstructive :
PREDISPOSES : Hypertensive disease" then we can count the number of
this specific triple returned by the query (localCount), the total number of
triples returned by the query (localTotal), the number of this specific triple
in the database (globalCount), and the total number of triples in the database
(globalTotal). These values are then provided as a 2x2 contingency
table using a two-sided alternative hypothesis, producing a prior odds ratio
and P-value (a=localCount, b=localTotal-localCount, c=globalCount,
d=globalTotal-globalCount).
2) Overlap: This is an extension of the Enrich function described above.
By providing two lists of search terms, e.g. multiple risk factors and multiple
diseases, all terms are first tested for enrichment, then overlapping
enriched elements are identified. An overlap is taken to be cases where the
object of a triple from the set of ‘x’ queries overlaps with a subject from
the set of ‘y’ queries (Figure 1).
3) Sentence: This function enables the user to check the source literature
for SemMedDB triples. To enable rapid evaluation of the literature underpinning
any potential mechanism, MELODI Presto provides a function
that takes a PubMed ID and returns triples from the refined PREDICATION
data, and the sentences from which they were derived.
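The Enrich calculation can be sketched as follows, using the mapping of counts to the 2x2 table given above (a=localCount, b=localTotal-localCount, c=globalCount, d=globalTotal-globalCount). This is a self-contained, standard-library illustration of a two-sided Fisher's exact test, not the authors' code; the function names and the example counts are assumptions, and in practice a statistics library would normally be used.

```python
from math import exp, lgamma


def _log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int):
    """Return (odds ratio, two-sided P-value) for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x: int) -> float:
        # Hypergeometric log-probability of seeing x in the top-left cell.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    log_observed = log_p(a)
    p_value = 0.0
    # Sum the probabilities of all tables no more likely than the observed one.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_p(x)
        if lp <= log_observed + 1e-7:  # small tolerance for floating-point ties
            p_value += exp(lp)

    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


def enrich(local_count: int, local_total: int, global_count: int, global_total: int):
    """Build the 2x2 table as described in the text and test it."""
    return fisher_exact_two_sided(
        local_count,
        local_total - local_count,
        global_count,
        global_total - global_count,
    )


# Invented counts for one triple; only the database total comes from the text.
odds, p = enrich(local_count=12, local_total=500,
                 global_count=300, global_total=8_295_443)
print(f"odds ratio = {odds:.1f}, P = {p:.3g}")
```

Working in log space with `lgamma` avoids overflow when the database total runs to millions of triples.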
Performance
The first time a query is run, MELODI Presto creates a local copy of the
enrichment data, as illustrated in Figure 1. For this reason, a query that has not
been run previously may take longer to complete (generally <30 seconds, depending
on the number of articles returned by the query). However, if
an existing variable, or list of variables, is queried, the Overlap function
generally runs in a few seconds.
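The run-once-then-reuse behaviour described here is a standard compute-or-load cache. The sketch below is one assumed way to implement it with JSON files keyed by a hash of the query text; the function names, file layout and the stand-in enrichment function are hypothetical and do not describe how MELODI Presto stores its results.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_enrichment(query: str, cache_dir: Path,
                      compute: Callable[[str], list]) -> list:
    """Return stored enrichment results for a query, computing them only once."""
    # Normalise the query so trivial differences in case or spacing share a key.
    key = hashlib.sha1(query.strip().lower().encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))  # fast path
    result = compute(query)                                   # slow path, first run
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def fake_enrich(query: str) -> list:
    """Stand-in for the real enrichment step; records each invocation."""
    calls.append(query)
    return [{"triple": "example : PREDICATE : example", "pval": 0.01}]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_enrichment("sleep duration", Path(tmp), fake_enrich)
    second = cached_enrichment("Sleep duration", Path(tmp), fake_enrich)

print(len(calls))  # the enrichment function ran once; the second call hit the cache
```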
Access and source code
MELODI Presto is available via an application programming interface
(API), using either the Swagger interface or an appropriate client library
(e.g. Python requests); Python examples can be found in the Jupyter
notebooks linked below. Each function returns JSON objects which
can be easily incorporated into standard workflows. We also provide a
web application to enable some of the functionality to be explored.
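A request to the API might be assembled as in the sketch below. It uses only the Python standard library rather than requests, and the API root, endpoint path and payload key are assumptions made for illustration; the Swagger documentation (https://melodi-presto.mrcieu.ac.uk/docs/) and the project notebooks are the authoritative reference for the actual routes and parameters.

```python
import json
from urllib import request

# Assumed API root; check the Swagger documentation for the real one.
API_ROOT = "https://melodi-presto.mrcieu.ac.uk/api"


def build_request(function: str, payload: dict) -> request.Request:
    """Prepare a JSON POST request for one API function (path is hypothetical)."""
    return request.Request(
        url=f"{API_ROOT}/{function}/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(function: str, payload: dict, timeout: float = 60.0):
    """Send the request and decode the JSON response."""
    with request.urlopen(build_request(function, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Build (but do not send) an example request; the "query" key is an assumption.
req = build_request("enrich", {"query": "sleep duration"})
print(req.get_method(), req.full_url)
```

Calling `call("enrich", {"query": "sleep duration"})` would send the request; it is left out here so that the example runs without network access.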
All code used to process the raw data and create the Elasticsearch indexes,
API and web application is publicly available
(https://github.com/MRCIEU/MELODI-Presto) with Jupyter notebooks
providing a demonstration of basic API usage, use cases and details of
specific methods and performance.
Fig. 1. Data flow of the MELODI Presto overlap function. Two lists of queries (Q1 and
Q2) are first checked for previous enrichment analysis. For queries which have not been
previously analysed the text of each missing term is run as a PubMed query and returned
IDs are matched to the MELODI Presto database for enrichment. The results of previous
analysis are loaded from a local store. Overlapping elements between each pair of enriched
triple sets (one from each query list, Q1 and Q2) are then identified and returned.
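The final overlap step in this data flow, matching the object of a triple from one query set to the subject of a triple from the other, can be sketched as a simple join. The `Triple` structure and function name are hypothetical, and the second example triple is invented purely to show a match.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def find_overlaps(x_triples, y_triples):
    """Pair each x triple with every y triple whose subject equals its object."""
    by_subject = defaultdict(list)
    for y in y_triples:
        by_subject[y.subject].append(y)  # index y triples by subject for fast lookup
    return [(x, y) for x in x_triples for y in by_subject.get(x.obj, [])]


# Enriched triples for the two query lists (illustrative values only).
x_enriched = [
    Triple("Sleep Apnea, Obstructive", "PREDISPOSES", "Hypertensive disease"),
    Triple("Example exposure", "AFFECTS", "Unrelated term"),
]
y_enriched = [
    Triple("Hypertensive disease", "CAUSES", "Example outcome"),
]

overlaps = find_overlaps(x_enriched, y_enriched)
for x, y in overlaps:
    print(f"{x.subject} -> {x.obj} -> {y.obj}")
```

Indexing one side by subject keeps the join close to linear in the number of triples, rather than comparing every pair.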
URLs
MELODI Presto - https://melodi-presto.mrcieu.ac.uk/
MELODI Presto Web - https://melodi-presto.mrcieu.ac.uk/app/
MELODI Presto API - https://melodi-presto.mrcieu.ac.uk/docs/
MELODI Presto GitHub - https://github.com/MRCIEU/MELODI-Presto
MELODI Presto Notebooks - https://github.com/MRCIEU/MELODI-Presto/blob/master/notebooks/
4 Conclusion
MELODI Presto provides a fast and efficient method to systematically profile
semantic triples derived from the literature. This can be used to explore
the enriched literature data for a given search term and identify potential
intermediate disease mechanisms between lists of terms. Using a refined
literature data set and a high-performance search architecture it is possible
to query millions of articles in seconds.
The agility of the method and its construction mean that, as and when alternative
or improved sets of triples are produced, we can add them
(Koroleva et al., 2020; Sybrandt et al., 2020). There is also scope to
expand this approach to include preprints, full text and any other source
of data.
Funding
This work has been supported by the UK Medical Research Council Integrative
Epidemiology Unit (MC_UU_00011/4), the Cancer Research UK Integrative Cancer
Epidemiology Programme (C18281/A19169) and the University of Bristol.
TRG holds a fellowship from the Alan Turing Institute.

Conflict of Interest: TRG receives funding from GlaxoSmithKline and Biogen for
unrelated research.
References
Elsworth B, Dawe K, Vincent EE, Langdon R, Lynch BM, Martin RM, Relton C,
Higgins JPT, Gaunt TR. 2018. MELODI: Mining Enriched Literature Objects to
Derive Intermediates. Int J Epidemiol 47:369–379. doi:10.1093/ije/dyx251
Elsworth B, Lyon M, Alexander T, Liu Y, Matthews P, Hallet J, Palmer T, Haberland
V, Davey Smith G, Zheng J, Haycock PC, Gaunt TR, Hemani G. 2020. The IEU
OpenGWAS data infrastructure. https://gwas.mrcieu.ac.uk
Kilicoglu H, Shin D, Fiszman M, Rosemblat G, Rindflesch TC. 2012. SemMedDB:
a PubMed-scale repository of biomedical semantic predications. Bioinformatics
28:3158–3160. doi:10.1093/bioinformatics/bts591
Koroleva A, Anisimova M, Gil M. 2020. Towards creating a new triple store for
literature-based discovery. Presented at the 24th Pacific-Asia Conference on
Knowledge Discovery and Data Mining. Singapore.
Rindflesch TC, Fiszman M. 2003. The interaction of domain knowledge and linguistic
structure in natural language processing: interpreting hypernymic propositions
in biomedical text. J Biomed Inform 36:462–477.
doi:10.1016/j.jbi.2003.11.003
Sybrandt J, Tyagin I, Shtutman M, Safro I. 2020. AGATHA: Automatic Graph-mining
And Transformer based Hypothesis generation Approach. arXiv:2002.05635
[cs, stat].