# Lecture 02: Hands-On Examples of Doing Biomedical AI/ML

### Outline
1. [Sandbox for Playing Around with AI/ML](#s1)
2. [Representing Genes, Proteins and Their Interactions](#s2)
 - 2.1 [Working with Gene and Protein Databases](#s2.1)
 - 2.2 [Representing Interactions (Between Genes, Proteins, etc.) Using an Ontology](#s2.2)
3. [From Certainty Factors to Graphical Models](#s3)
4. [Actual ML - Two Takes on Diabetes Prediction](#s4)
 - 4.1 [Two Paradigms of Machine Learning](#s4.1)
 - 4.2 [Diabetes Prediction Using Classical Machine Learning](#s4.2)
 - 4.3 [Diabetes Prediction Using Deep Learning](#s4.3)
5. [Spotting Genes and Other Beasts in Text](#s5)
6. [Checking Out a Melanoma Classification Pipeline](#s6)

---
<a class="anchor" name="s1"></a>
### 1. Sandbox for Playing Around with AI/ML

- Most coding relevant to the course (and indeed AI/ML in general) can be done within a [Python](https://en.wikipedia.org/wiki/Python_(programming_language)) [notebook](https://en.wikipedia.org/wiki/Notebook_interface) environment
 - An interactive programming environment
 - Encapsulating code, comments, descriptive text and other media (images, interactive plots, etc.) in one living, executable "document" (such as this one)
- One convenient way to work with Python notebooks is [Google Colaboratory](https://colab.research.google.com/)
 - Cloud-based, easily handling dependencies and computational infrastructure regardless of your own machine's configuration
 - Available to MU students via their IS/MU accounts (you may need to enable Google Suite [here](https://is.muni.cz/auth/extservices/))
 - The notebooks can be uploaded to your IS/MU Google Drive and opened from there by either the standard double-clicking, or right-clicking on the corresponding notebook file and choosing _"Open with" $\rightarrow$ "Google Colaboratory"_ (depending on your browser/OS configuration)
 - A public Google Colaboratory instance can also be used with a private Google account, though
- Many other environments are totally fine as well, depending on your preferences
- Some examples:
 - [Jupyter](https://jupyter.org/) (available on all Linux FI MU machines after adding the Python 3 module with `module add python3`)
 - [PyCharm](https://www.jetbrains.com/pycharm/) (a powerful and comprehensive Python IDE)

---
<a class="anchor" name="s2"></a>
### 2. Representing Genes, Proteins and Their Interactions

<a class="anchor" name="s2.1"></a>
#### 2.1 Working with Gene and Protein Databases

- Installing [BioPython](https://en.wikipedia.org/wiki/Biopython), a collection of tools for computational biology and bioinformatics

In [None]:
!pip install biopython

- Defining a function for retrieving gene annotations from the [NCBI](https://www.ncbi.nlm.nih.gov)'s [Gene](https://www.ncbi.nlm.nih.gov/gene) database

In [20]:
import sys

# importing the Entrez API (a general access interface to many services and DBs)
from Bio import Entrez

# we must tell NCBI (the Entrez maintainers at NLM/NIH) who we are
Entrez.email = "novacek@fi.muni.cz"

def retrieve_annotation(id_list):

    """Annotates Entrez Gene IDs using Bio.Entrez, in particular epost (to
    submit the data to NCBI) and esummary to retrieve the information.
    Returns a list of dictionaries with the annotations."""

    request = Entrez.epost("gene", id=",".join(id_list))
    try:
        result = Entrez.read(request)
    except RuntimeError as e:
        print("An error occurred while retrieving the annotations.")
        print("The error returned was %s" % e)
        sys.exit(-1)

    webEnv = result["WebEnv"]
    queryKey = result["QueryKey"]
    data = Entrez.esummary(db="gene", webenv=webEnv, query_key=queryKey)
    annotations = Entrez.read(data)

    print("Retrieved %d annotations for %d genes" % (len(annotations), 
                                                     len(id_list)))

    return annotations

- Getting gene annotations for the BRCA1 gene

In [None]:
id_list = ['672'] # the ID of BRCA1 in the Gene DB

dct = retrieve_annotation(id_list)

- Using the gene annotations to generate a mapping from all of BRCA1's human-readable names to its canonical name (useful for many downstream text mining and/or knowledge integration tasks)

In [None]:
name_mappings, ambiguities = {}, 0
for record in dct['DocumentSummarySet']['DocumentSummary']:
  # printing out the "canonical" gene name
  print('Record items for gene name:', record['Name'])
  
  # printing out the known aliases for the canonical name,
  # incremental updates of a mapping from the alternative names 
  # to the canonical one
  for key, value in record.items():
    print(' ', key, '->', value)
  for alias in record['OtherAliases'].split(','):
    new_key = alias.strip()
    if new_key not in name_mappings:
      name_mappings[new_key] = record['Name']
    else:
      ambiguities += 1
  for alias in record['OtherDesignations'].split('|'):
    new_key = alias.strip()
    if new_key not in name_mappings:
      name_mappings[new_key] = record['Name']
    else:
      ambiguities += 1
  name_mappings[record['NomenclatureName']] = record['Name']
  name_mappings[record['NomenclatureSymbol']] = record['Name']
print('Name mappings:\n'+'\n'.join(['  '+x+' -> '+y for x,y in 
                                     name_mappings.items()]))
print('  (total number of mappings: %d)' % (len(name_mappings),))
print('Number of ambiguities:', ambiguities)
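
A mapping like this is typically used to normalise gene mentions found in text or in other databases. The following is a minimal sketch of such a lookup, assuming a small hand-written alias table (`sample_mappings` and both helper functions are hypothetical, introduced for illustration only; the real table is the `name_mappings` dictionary built from the NCBI data):

```python
# Hypothetical, hand-written alias table used for illustration only;
# in the notebook, the real table is built from the NCBI Gene record.
sample_mappings = {
    "BRCC1": "BRCA1",
    "RNF53": "BRCA1",
    "breast cancer type 1 susceptibility protein": "BRCA1",
}


def build_lookup(mappings):
    """Index the alias -> canonical name mapping by a trimmed, lower-cased key."""
    return {alias.strip().lower(): name for alias, name in mappings.items()}


def canonical_name(mention, lookup):
    """Return the canonical gene name for a mention, or None if it is unknown."""
    return lookup.get(mention.strip().lower())


lookup = build_lookup(sample_mappings)
print(canonical_name("  rnf53 ", lookup))  # a known alias, despite case and spaces
print(canonical_name("TP53", lookup))      # not in the table
```

Lower-casing the keys makes the lookup tolerant to capitalisation, at the price of possibly merging aliases that differ only in case.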

- Retrieving a representation of the protein product of BRCA1 using the [UniProt](http://www.uniprot.org) protein database

In [None]:
# necessary imports
from Bio import SeqIO
import urllib.request

# getting a handle from the UniProt database, based on the UniProt BRCA1 ID
handle = urllib.request.urlopen("http://www.uniprot.org/uniprot/P38398.xml")
record = SeqIO.read(handle, "uniprot-xml")

# initialising a list of alternative names
alternatives = [record.name]

# printing out the various name-related annotations
print(record.name)
if 'alternativeName_fullName' in record.annotations:
  print(record.annotations['alternativeName_fullName'])
  alternatives += record.annotations['alternativeName_fullName']
if 'recommendedName_shortName' in record.annotations:
  print(record.annotations['recommendedName_shortName'])
  alternatives += record.annotations['recommendedName_shortName']
if 'alternativeName_shortName' in record.annotations:
  print(record.annotations['alternativeName_shortName'])
  alternatives += record.annotations['alternativeName_shortName']
if 'gene_name_primary' in record.annotations:
  print(record.annotations['gene_name_primary'])
  alternatives += [record.annotations['gene_name_primary']]

# updating alias records for BRCA1
for alias in alternatives:
  new_key = alias.strip()
  if new_key not in name_mappings:
    name_mappings[new_key] = 'BRCA1'
  else:
    ambiguities += 1
print('Name mappings:\n'+'\n'.join(['  '+x+' -> '+y for x,y in 
                                     name_mappings.items()]))
print('  (total number of mappings: %d)' % (len(name_mappings),))
print('Number of ambiguities:', ambiguities)

<a class="anchor" name="s2.2"></a>
#### 2.2 Representing Interactions (Between Genes, Proteins, etc.) Using an Ontology
- Installing the [owlready2](https://owlready2.readthedocs.io/en/latest/) package for working with ontologies in Python

In [None]:
!pip install owlready2

# importing everything from the package
from owlready2 import *

- Loading the [Interaction Ontology](https://bioportal.bioontology.org/ontologies/INO)

In [10]:
# an interaction ontology
ONTO_URL = 'https://www.fi.muni.cz/~novacek/courses/iv121/data/ino_merged.owl'
ONTO_NAMESPACE = \
  'https://www.fi.muni.cz/~novacek/courses/iv121/data/ino_merged.owl'

# setting the namespace
onto_ns = get_namespace(ONTO_NAMESPACE)

# loading the ontology
onto = get_ontology(ONTO_URL).load()

- Listing the ontology contents

In [None]:
print('List of classes and their sub-classes:\n')

# iterating over all classes in the ontology
for onto_class in onto.classes():
    print(onto_class, '(%s)' % onto_class.iri)
    
    # listing the sub-classes of the current class
    for sub_class in onto.search(subclass_of=onto_class):
        print('  sub-class:', sub_class)

print('\n'+'-'*80+'\n')
print('List of individuals in the loaded ontology:\n')

# iterating over the ontology individuals
for onto_individual in onto.individuals():
    print(onto_individual, '(%s)' % onto_individual.iri)

print('\n'+'-'*80+'\n')
print('List of object properties in the loaded ontology:\n')

# iterating over the object properties in the ontology
for onto_property in onto.object_properties():
    print(onto_property, '(%s)' % onto_property.iri)

- Running a reasoner ([HermiT](http://www.hermit-reasoner.com/)) on the ontology

In [None]:
with onto:
    sync_reasoner()

Examples of new knowledge materialised by the classification process:
- the class [negative regulation of transcription by binding to promoter](https://www.ebi.ac.uk/ols/ontologies/ino/terms?iri=http://purl.obolibrary.org/obo/INO_0000074) inferred to be a sub-class of [negative regulation of gene transcription](https://www.ebi.ac.uk/ols/ontologies/ino/terms?iri=http://purl.obolibrary.org/obo/INO_0000042)
- the property [causally upstream of, positive effect](https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0002304) inferred to be a sub-property of [causally upstream of](https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http://purl.obolibrary.org/obo/RO_0002411)
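
To give a rough idea of what "materialising" inferred sub-class relations means, here is a minimal sketch that computes the indirect superclasses of a class by following asserted sub-class links. It is only an assumption-laden toy: the class names are hypothetical (not taken from INO), and a description logic reasoner such as HermiT does far more than this transitive closure.

```python
# Toy, hand-made hierarchy (child -> asserted direct parents).
# The class names are hypothetical and are NOT taken from the INO ontology.
asserted = {
    "promoter binding": ["DNA binding"],
    "DNA binding": ["binding"],
    "binding": ["interaction"],
}


def all_superclasses(cls, hierarchy):
    """Collect every direct and indirect superclass of cls."""
    found, stack = set(), list(hierarchy.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(hierarchy.get(parent, []))
    return found


superclasses = all_superclasses("promoter binding", asserted)
# relations that were not asserted directly, i.e., the "new" knowledge
inferred_only = superclasses - set(asserted["promoter binding"])
print(sorted(inferred_only))
```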

---
<a class="anchor" name="s3"></a>
### 3. From Certainty Factors to Graphical Models

- Despite never reaching mass adoption, expert systems have been hugely influential
- For instance, the uncertainty handling (certainty factors) proposed in the MYCIN system inspired [graphical models](https://en.wikipedia.org/wiki/Graphical_model) (a popular class of probabilistic knowledge representation and reasoning models)
- An example model's structure:

<img src="https://www.fi.muni.cz/~novacek/courses/pv287/img/cancer-bn.png" alt="architecture" width="400px" title="Original image provided at https://www.bnlearn.com/bnrepository (license unknown)."/>
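
As a brief aside on the certainty factors mentioned above: the commonly cited MYCIN-style rule for combining two certainty factors (each in the range from -1 to 1) that bear on the same hypothesis can be sketched as follows. The function name and the sample values are illustrative assumptions, not part of any library used in this notebook.

```python
def combine_cf(cf1, cf2):
    """Combine two certainty factors in [-1, 1] supporting the same hypothesis."""
    if cf1 >= 0 and cf2 >= 0:
        # two pieces of confirming evidence reinforce each other
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        # two pieces of disconfirming evidence reinforce each other
        return cf1 + cf2 * (1 + cf1)
    # conflicting evidence; undefined when one factor is 1 and the other -1
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))


print(combine_cf(0.6, 0.4))    # both confirming
print(combine_cf(-0.6, -0.4))  # both disconfirming
print(combine_cf(0.6, -0.4))   # conflicting
```

Unlike the probabilities in the graphical model discussed next, certainty factors are combined by an ad hoc rule rather than by the laws of probability.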

- Representing the model's structure using the [pgmpy](https://github.com/pgmpy/pgmpy) Python module

In [None]:
# installing the module 
# (the following code is heavily based on the module's documentation)
!pip install pgmpy

In [15]:
# importing a specific class for the model representation
from pgmpy.models import BayesianNetwork

# representing the cancer model's structure
cancer_model = BayesianNetwork(
    [
        ("Pollution", "Cancer"),
        ("Smoker", "Cancer"),
        ("Cancer", "Xray"),
        ("Cancer", "Dyspnoea"),
    ]
)

- Extending the model with specific conditional probability distributions (CPDs) associated with the nodes

In [16]:
# defining the CPDs
from pgmpy.factors.discrete import TabularCPD

cpd_poll = TabularCPD(variable="Pollution", 
                      variable_card=2, 
                      values=[[0.9], [0.1]],
                      state_names={'Pollution': ['Low', 'High']})
cpd_smoke = TabularCPD(variable="Smoker", 
                       variable_card=2, 
                       values=[[0.3], [0.7]],
                       state_names={'Smoker': ['True', 'False']})
cpd_cancer = TabularCPD(
    variable="Cancer",
    variable_card=2,
    values=[[0.03, 0.05, 0.001, 0.02], [0.97, 0.95, 0.999, 0.98]],
    evidence=["Smoker", "Pollution"],
    evidence_card=[2, 2],
    state_names={
        'Cancer' : ['True', 'False'],
        'Smoker': ['True', 'False'],
        'Pollution': ['Low', 'High']
    }
)
cpd_xray = TabularCPD(
    variable="Xray",
    variable_card=2,
    values=[[0.9, 0.2], [0.1, 0.8]],
    evidence=["Cancer"],
    evidence_card=[2],
    state_names={
        'Xray' : ['Positive', 'Negative'],
        'Cancer' : ['True', 'False']
    }
)
cpd_dysp = TabularCPD(
    variable="Dyspnoea",
    variable_card=2,
    values=[[0.65, 0.3], [0.35, 0.7]],
    evidence=["Cancer"],
    evidence_card=[2],
    state_names={
        'Dyspnoea' : ['True', 'False'],
        'Cancer' : ['True', 'False']
    }
)

In [None]:
# associating the parameters with the model structure.
cancer_model.add_cpds(cpd_poll, cpd_smoke, cpd_cancer, cpd_xray, cpd_dysp)

# checking if the cpds are valid for the model.
cancer_model.check_model()

- Running basic operations on the model

In [None]:
# check for d-separation between variables
print(cancer_model.is_dconnected("Pollution", "Smoker"))
print(cancer_model.is_dconnected("Pollution", "Smoker", observed=["Cancer"]))

In [None]:
# get all d-connected nodes (to 'Pollution')
cancer_model.active_trail_nodes("Pollution")

In [None]:
# list local independencies for a node
cancer_model.local_independencies("Xray")

In [None]:
# get all independence conditions implied by the model
cancer_model.get_independencies()

- Using the model for inference

In [None]:
# importing a specific inference technique class
from pgmpy.inference import VariableElimination

# initialising the imported inference engine
infer = VariableElimination(cancer_model)

# various cancer probabilities depending on pollution and smoking
print('Cancer probability given low pollution:')
print(infer.query(['Cancer'], evidence={'Pollution': 'Low'}))
print('\nCancer probability given high pollution:')
print(infer.query(['Cancer'], evidence={'Pollution': 'High'}))
print('\nCancer probability for smokers:')
print(infer.query(['Cancer'], evidence={'Smoker': 'True'}))
print('\nCancer probability for non-smokers:')
print(infer.query(['Cancer'], evidence={'Smoker': 'False'}))
print('\nCancer probability for non-smokers living in polluted areas:')
print(infer.query(['Cancer'], evidence={'Pollution': 'High',
                                        'Smoker': 'False'}))
print('\nCancer probability for smokers living in unpolluted areas:')
print(infer.query(['Cancer'], evidence={'Pollution': 'Low',
                                        'Smoker': 'True'}))
print('\nCancer probability for smokers living in polluted areas:')
print(infer.query(['Cancer'], evidence={'Pollution': 'High',
                                        'Smoker': 'True'}))
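
For a network this small, some of the queries can be checked by hand. The sketch below recomputes a few probabilities with plain Python, using the CPD values from the model definition above (the variable names are ad hoc). It relies on the structure of the model: `Pollution` and `Smoker` are independent root nodes, and `Xray` depends only on `Cancer`.

```python
# CPD values copied from the model definition
p_pollution = {"Low": 0.9, "High": 0.1}
p_smoker = {"True": 0.3, "False": 0.7}
# P(Cancer = True | Smoker, Pollution)
p_cancer = {
    ("True", "Low"): 0.03, ("True", "High"): 0.05,
    ("False", "Low"): 0.001, ("False", "High"): 0.02,
}
# P(Xray = Positive | Cancer)
p_xray_pos = {"True": 0.9, "False": 0.2}

# P(Cancer = True | Smoker = True): marginalising out Pollution
p_c_given_smoker = sum(p_pollution[p] * p_cancer[("True", p)]
                       for p in p_pollution)

# P(Cancer = True): marginalising out both parents
p_c = sum(p_smoker[s] * p_pollution[p] * p_cancer[(s, p)]
          for s in p_smoker for p in p_pollution)

# P(Cancer = True | Xray = Positive) via Bayes' rule
numerator = p_xray_pos["True"] * p_c
p_c_given_xray = numerator / (numerator + p_xray_pos["False"] * (1 - p_c))

print(round(p_c_given_smoker, 4))  # approximately 0.032
print(round(p_c, 5))               # approximately 0.01163
print(round(p_c_given_xray, 4))    # approximately 0.0503
```

The first value should agree with the "smokers" query above; variable elimination performs the same kind of summation, only organised so that it scales to larger networks.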

---
<a class="anchor" name="s4"></a>
### 4. Actual ML - Two Takes on Diabetes Prediction

Trying to predict high risk of diabetes onset using two computational paradigms: classical and deep machine learning.


<a class="anchor" name="s4.1"></a>
#### 4.1 Two Paradigms of Machine Learning

__Classical machine learning__
- Machine learning (ML) typically builds a model encoding actionable knowledge from input (training) data in an autonomous way.
- ML comes in different flavours:
 - [Supervised](https://en.wikipedia.org/wiki/Supervised_learning) - learning a function that maps features (covariates) of training examples to corresponding labels (provided in the training data, missing in the "real world" and/or testing data). The labels can be identifiers of a class (or classes) the examples belong to (in a [classification](https://en.wikipedia.org/wiki/Statistical_classification) ML problem), or scalar values (in a [regression](https://en.wikipedia.org/wiki/Regression_analysis) ML problem).
 - [Unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning) - learning patterns from unlabelled data, using for instance [self-organising neural networks](https://en.wikipedia.org/wiki/Unsupervised_learning#Specific_Networks), [cluster analysis](https://en.wikipedia.org/wiki/Cluster_analysis) or [outlier detection](https://en.wikipedia.org/wiki/Outlier#Definitions_and_detection).
 - [Reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning) - agents taking actions in an environment, maximising a reward function.
- A convenient representation of input data is a matrix mapping the example identifiers to vectors of their specific feature values, and (for supervised learning) also a vector mapping the example identifiers to their labels.
- An example of a feature matrix:

<img src="https://www.fi.muni.cz/~novacek/courses/pv287/img/ml-data-features.png" alt="architecture" width="800px" title="An example from the diabetes prediction dataset (c.f. the code below)."/>

- An example of a label vector:

<img src="https://www.fi.muni.cz/~novacek/courses/pv287/img/ml-data-labels.png" alt="architecture" width="60px" title="An example from the diabetes prediction dataset (c.f. the code below)."/>

- An example of a general machine learning pipeline:

<img src="https://www.fi.muni.cz/~novacek/courses/pv287/img/ml-pipeline.png" alt="architecture" width="800px" title="Adapted from https://www.linkedin.com/pulse/4-stages-machine-learning-ml-modeling-cycle-maurice-chang/ (license unknown)."/>

- A number of libraries implementing classical machine learning algorithms exist, one of the most comprehensive being [scikit-learn](https://scikit-learn.org/stable/index.html).

__Deep learning__
- Deep learning consists of designing, training and validating machine learning models based on various [neural architectures](https://en.wikipedia.org/wiki/Types_of_artificial_neural_networks) that typically involve multiple (hidden) layers consisting of many neural computing units (a simple example of one unit is the [perceptron](https://en.wikipedia.org/wiki/Perceptron)).
- Good at [representation](https://en.wikipedia.org/wiki/Feature_learning), [self-supervised](https://en.wikipedia.org/wiki/Self-supervised_learning) and other very practical learning tasks.
- An example of a deep learning architecture:

<img src="https://www.fi.muni.cz/~novacek/courses/pv287/img/stacked-representation.png" alt="architecture" width="550px" title="Original image source: Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. ”Deep learning.” MIT press, 2016. (Chap. 1) License: Probably OK to use for academic purposes; for any other use, contact the publisher (MIT Press)."/>

- A number of libraries that integrate seamlessly with parallel computational architectures are available for developing deep learning models. Some of the popular examples are:
 - [PyTorch](https://pytorch.org/) - a state-of-the-art deep learning framework with relatively easy-to-use Python (and C++) abstraction layers, building on the earlier Torch library (a general-purpose ML library with a C core).
 - [TensorFlow](https://www.tensorflow.org/) - a general-purpose, highly optimised library for multilinear algebra and statistical learning.
 - [Keras](https://keras.io/) - formerly a separate project, now an abstraction layer for user-friendly development of deep learning models integrated with TensorFlow.
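
The single computing unit mentioned above (the perceptron) is simple enough to sketch in a few lines of plain Python. The weights and bias below are hypothetical values, chosen by hand so that the unit behaves like a logical AND; in practice they would be learned from data.

```python
def perceptron(inputs, weights, bias):
    """A single unit: a weighted sum of the inputs followed by a step function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0


# hypothetical, hand-picked parameters implementing a logical AND
and_weights, and_bias = [1.0, 1.0], -1.5

outputs = [perceptron([a, b], and_weights, and_bias)
           for a in (0, 1) for b in (0, 1)]
print(outputs)  # inputs in the order (0,0), (0,1), (1,0), (1,1)
```

Deep models stack many such units into layers and replace the step function with differentiable activations (such as the sigmoid or ReLU), which makes gradient-based training possible.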

<a class="anchor" name="s4.2"></a>
#### 4.2 Diabetes Prediction Using Classical Machine Learning
- Loading the data using [pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)

In [None]:
# importing the library for handy data processing
import pandas as pd

# loading the data, in the CSV format, from the web
DATA_URL = 'https://www.fi.muni.cz/~novacek/courses/pv287/data/diabetes.csv'
dataframe = pd.read_csv(DATA_URL)

# checking the first few rows of the CSV
dataframe.head()

- Creating the _features_ and _labels_ data structures

In [None]:
# the features are the data minus the label vector
# - this contains the remaining features present in the data
df_features = dataframe.drop('Outcome',axis=1).values
print('The first few feature records:')
dataframe.drop('Outcome',axis=1).head()

In [None]:
# getting just the Outcome column as the vector of labels
# - note that the column contains 0, 1 values that correspond to negative 
#   (no diabetes developed) and positive (diabetes developed) example labels,
#   respectively
df_labels = dataframe.Outcome.values.astype(float)
print('The first few label records:')
print('\n'.join([str(x)+' : '+str(y) for x,y in zip(range(5),df_labels[:5])]))

- Splitting the data into train and test sets using the corresponding [function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) of the _scikit-learn_ library

In [17]:
# importing a convenience data splitting function from scikit-learn

from sklearn.model_selection import train_test_split

# computing a random 80-20 split (80% training data, 20% of remaining
# "unseen" data for testing the model trained on the 80%)

x_train, x_test, y_train, y_test = train_test_split(df_features, df_labels,
                                                    test_size=0.2,
                                                    random_state=42)

- Training a model using logistic regression

In [None]:
# importing the model
from sklearn.linear_model import LogisticRegression

# initialising the model with a specific solver
logreg = LogisticRegression(solver='liblinear')

# fitting (i.e., training) the model
logreg.fit(x_train, y_train)
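
What the fitted model computes at prediction time is fairly simple: a weighted sum of the features passed through the sigmoid function and compared with a threshold. The sketch below illustrates this with two features only; the weights, bias and feature values are hypothetical, whereas the real ones are stored in `logreg.coef_` and `logreg.intercept_` after fitting.

```python
import math


def sigmoid(z):
    """Squash a real number into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one example."""
    z = sum(x * w for x, w in zip(features, weights)) + bias
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a 0/1 label."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# hypothetical parameters for two made-up features
weights, bias = [0.03, 0.08], -6.0

print(predict_proba([150, 35], weights, bias))
print(predict_label([150, 35], weights, bias))
print(predict_label([90, 22], weights, bias))
```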

- Using the model to predict the labels of the test examples

In [21]:
y_pred = logreg.predict(x_test)

- Evaluating the model - visualising a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

In [None]:
# importing stuff needed to compute and visualise a confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

# importing stuff needed to enhance the visualisation of the confusion matrix
import seaborn as sns

# getting the confusion matrix from our trained model
ConfusionMatrixDisplay.from_estimator(
    logreg, x_test, y_test, xticks_rotation="vertical"
)

- Evaluating the model - computing various comprehensive scores

In [None]:
# importing selected scoring functions from scikit-learn

from sklearn.metrics import f1_score, precision_score, recall_score

# computing the precision, recall and F1 scores from the predictions
score_p = precision_score(y_test, y_pred, average='macro')
score_r = recall_score(y_test, y_pred, average='macro')
score_f = f1_score(y_test, y_pred, average='macro')
# using the default scoring method of the model to compute its accuracy
score_a = logreg.score(x_test, y_test)

# printing out the scores
print('Various scores of the logistic regression classifier on the test set')
# the number of correct predictions (true positives and true negatives) 
# divided by the number of all predictions
print('  - accuracy :', score_a)
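# precision: for each class, the number of examples correctly assigned to it
# divided by the number of all examples assigned to it (macro-averaged here,
# i.e., the unweighted mean over the negative and positive classes)
print('  - precision:', score_p)
# recall: for each class, the number of examples correctly assigned to it
# divided by the number of all examples that really belong to it
# (macro-averaged in the same way)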
print('  - recall   :', score_r)
# aggregation of the precision and recall values
print('  - F1       :', score_f)
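
Since the scores above are computed with `average='macro'`, each of them is the unweighted mean of the per-class values. The following sketch shows the arithmetic on hypothetical confusion matrix counts (the numbers are made up and do not come from the diabetes data):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# hypothetical counts, taking "high risk" as the positive class
tp, fp, fn, tn = 30, 10, 20, 90

pos = precision_recall_f1(tp, fp, fn)
# for the negative class the roles swap: its hits are the true negatives,
# its false alarms are the false negatives, its misses the false positives
neg = precision_recall_f1(tn, fn, fp)

macro = [(p + n) / 2 for p, n in zip(pos, neg)]
accuracy = (tp + tn) / (tp + fp + fn + tn)

print('positive class (P, R, F1):', pos)
print('negative class (P, R, F1):', neg)
print('macro average  (P, R, F1):', macro)
print('accuracy:', accuracy)
```

With imbalanced classes, the macro average can differ noticeably from the scores of the positive class alone, which is worth keeping in mind when the positive class is the clinically interesting one.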

<a class="anchor" name="s4.3"></a>
#### 4.3 Diabetes Prediction Using Deep Learning

- Creating a [Keras](https://keras.io/) model (we reuse the training/testing data prepared before)

In [25]:
# adapted from:
#   - https://www.kaggle.com/code/atulnet/pima-diabetes-keras-implementation

# importing the basics from Keras 
from keras.models import Sequential
from keras.layers import Dense

# a model for simple sequential stacking of layers
model = Sequential()

# 1st layer: 100 fully connected nodes, matched to an input vector of size 8
model.add(Dense(100, input_dim=8, activation='sigmoid'))
# 2nd layer: the same number of nodes, a different activation function
model.add(Dense(100, activation='relu'))
# output layer: dim=1, sigmoid activation again
model.add(Dense(1, activation='sigmoid' ))

# compiling the model with the binary cross-entropy loss (predicting 0/1)
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
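
To get a feeling for the size of this model, the number of trainable parameters can be worked out by hand: each fully connected layer has one weight per input-unit pair plus one bias per unit. A small sketch of the arithmetic (the helper function is ad hoc; calling `model.summary()` in Keras reports the same kind of counts):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a dense layer."""
    return n_inputs * n_units + n_units


# input size followed by the sizes of the three layers defined above
layer_sizes = [8, 100, 100, 1]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [900, 10100, 101]
print(total)      # 11101
```

This is far more parameters than the nine used by logistic regression on the same eight features, which is one reason why a dataset of this size can be hard for such a network to exploit without further care.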

- Training the created model (using the testing data for validation in each epoch)

In [None]:
# simply calling the fit function, training on the training data and validating
# on the test data after each epoch
model.fit(x_train,y_train,epochs=30,\
          validation_data=(x_test, y_test))

- Interpreting the results
 - Not too great:
   - The loss is barely being optimised towards the end
   - The validation accuracy is worse than the classical ML baseline (not much better than 0.74 in most runs as opposed to nearly 0.76)
 - The reasons:
   - More or less default settings of the model with no [hyper-parameter optimisation](https://en.wikipedia.org/wiki/Hyperparameter_optimization)
   - More importantly, though, there's no preprocessing of the rather noisy and skewed input data (see for instance [this](https://towardsdatascience.com/pima-indian-diabetes-prediction-7573698bd5fe) or [this](https://www.kaggle.com/code/atulnet/pima-diabetes-keras-implementation) blog post, describing examples of exploratory analysis and various input data transformations that may help in this case)
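
One common preprocessing step of the kind hinted at above is standardisation, i.e., rescaling each feature column to zero mean and unit variance. The sketch below does this for a single column in plain Python; the values are hypothetical, and whether the step actually improves the results here would need to be verified experimentally.

```python
from statistics import mean, pstdev


def standardise(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        # a constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# hypothetical values of one feature column
glucose = [85.0, 148.0, 183.0, 89.0, 137.0]
scaled = standardise(glucose)

print([round(v, 3) for v in scaled])
```

In scikit-learn, `StandardScaler` performs the same transformation for all columns at once. The scaling parameters should be estimated on the training data only and then applied to the test data, so that no information leaks from the test set.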

---
<a class="anchor" name="s5"></a>
### 5. Spotting Genes and Other Beasts in Text

- Searching [PubMed](https://pubmed.ncbi.nlm.nih.gov/) using Biopython (once again, via [Entrez](https://en.wikipedia.org/wiki/Entrez))

In [27]:
# defining the PubMed query expressing our interest
pubmed_query = 'honjo[au] AND "induced expression" AND PD-1'

# a search handle with our query and some key meta-data
search_handle = Entrez.esearch(db='pubmed',
                               sort='relevance',
                               retmax='20',
                               retmode='xml',
                               term=pubmed_query)

# reading the search results into a dict-like object
search_results = Entrez.read(search_handle)

- Printing out the search results and fetching the records

In [None]:
print('Search result PubMed IDs:', '\n'+'\n'.join(search_results['IdList']))

# actually accessing the search results via the found ID(s)
fetch_handle = Entrez.efetch(db='pubmed',
                             retmode='xml',
                             id=search_results['IdList'])

# reading the results into another dict-like object
fetch_results = Entrez.read(fetch_handle)

- Extracting the article title and abstract from the first search result

In [None]:
paper = fetch_results['PubmedArticle'][0]

print('Number of articles in total:', len(fetch_results['PubmedArticle']))

title = paper['MedlineCitation']['Article']['ArticleTitle']
abstract = \
  '\n'.join(paper['MedlineCitation']['Article']['Abstract']['AbstractText'])

print('Title:', title)
print('Abstract:', abstract)
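
Not every PubMed record has an abstract, in which case the direct key access above raises a `KeyError`. A more defensive variant might look as follows. This is a sketch that assumes the parsed records behave like nested dictionaries; the two mock records are hypothetical and only mimic the relevant part of the structure.

```python
def extract_title_and_abstract(paper):
    """Return (title, abstract); the abstract is empty if the record has none."""
    article = paper.get('MedlineCitation', {}).get('Article', {})
    title = str(article.get('ArticleTitle', ''))
    parts = article.get('Abstract', {}).get('AbstractText', [])
    abstract = '\n'.join(str(part) for part in parts)
    return title, abstract


# hypothetical mock records mimicking the structure used above
mock_with_abstract = {'MedlineCitation': {'Article': {
    'ArticleTitle': 'A title',
    'Abstract': {'AbstractText': ['Part one.', 'Part two.']}}}}
mock_without_abstract = {'MedlineCitation': {'Article': {
    'ArticleTitle': 'Only a title'}}}

print(extract_title_and_abstract(mock_with_abstract))
print(extract_title_and_abstract(mock_without_abstract))
```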

- Pretty-printing the whole paper record (just out of curiosity)

In [None]:
import json
print(json.dumps(paper, indent=2))

- Installing [scispaCy](https://allenai.github.io/scispacy/) and one of its pre-trained biomedical NLP models

In [None]:
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz

- Loading the pre-trained model using [spaCy](https://spacy.io/)

In [None]:
import spacy
nlp = spacy.load("en_ner_bionlp13cg_md")

- Using the model to annotate biomedical entities in the downloaded paper

In [None]:
# separating the title from the abstract so that their boundary words don't merge
text = title + ' ' + abstract

doc = nlp(text)

print('Extracted biomedical entities:')
print('\n'.join(set(['  '+str(x) for x in doc.ents])))

- Visualising the entities in the sample text

In [None]:
from spacy import displacy

displacy_image = displacy.render(doc, jupyter = True, style = 'ent')

---
<a class="anchor" name="s6"></a>
### 6. Checking Out a Melanoma Classification Pipeline
- An example [project](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/175412) that won a [Kaggle](https://www.kaggle.com) [challenge](https://www.kaggle.com/competitions/siim-isic-melanoma-classification/overview) in this domain

<img src="https://www.fi.muni.cz/~novacek/courses/pv287/img/melanoma.png" alt="architecture" width="800px" title="Original source: the challenge project authors (license unknown)."/>

- Based on an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning) of deep learning probabilistic classifiers, integrating image and tabular training (meta)data
- The code for such a comprehensive, fine-tuned machine learning pipeline is a bit too bulky to be run in an interactive manner
- The details can be checked out (and played with) at the following places, though:
 - The [code](https://github.com/haqishen/SIIM-ISIC-Melanoma-Classification-1st-Place-Solution) on [GitHub](https://github.com/)
 - A textual [description](https://arxiv.org/pdf/2010.05351.pdf) on [arXiv](https://arxiv.org/)
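
The basic idea behind an ensemble of probabilistic classifiers can be illustrated by averaging the probabilities predicted by several models for the same example. Note that this is only a generic sketch with made-up numbers; it is not the winning team's actual ensembling procedure, which is described in the linked paper and code.

```python
def ensemble_probability(probabilities, weights=None):
    """(Weighted) average of the probabilities predicted by several models."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight


# hypothetical melanoma probabilities predicted by three models for one image
model_outputs = [0.82, 0.64, 0.91]

plain = ensemble_probability(model_outputs)
weighted = ensemble_probability(model_outputs, weights=[2.0, 1.0, 1.0])

print(round(plain, 4))     # simple average
print(round(weighted, 4))  # the first model counts twice
```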