# PA164 - Lab 6: Knowledge extraction

__Outline:__
1. Back to Shakespeare
2. Towards extracting taxonomies of Shakespeare's worlds
3. Towards relation extraction
4. (Optional) Putting it all together in a knowledge graph

---

## 1. Back to Shakespeare

"will"

### Downloading and cleaning Shakespeare's works

In [40]:
import urllib.request # import library for opening URLs, etc.

# open a link to sample text

sample_text_link = "https://www.gutenberg.org/files/100/100-0.txt"
f = urllib.request.urlopen(sample_text_link)

# decoding the content of the link (just convert the binary string to text -
# it is already in a relatively clean plain text format)

sample_text = f.read().decode("utf-8")

# cutting the metadata in the beginning

cleaner_text = sample_text.split(' Contents')[1]

# cutting the appendix after the main story

cleaner_text = cleaner_text.split('*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***')[0]

# deleting the '\r' characters

cleaner_text = cleaner_text.replace('\r','')

### Getting the separate texts of Shakespeare's works

In [None]:
# getting the list of titles of Shakespeare's work from the table of contents

# to split at the TOC from the bottom
splitter_bot = """THE SONNETS

 1"""

# to split at the TOC from the top
splitter_top = """VENUS AND ADONIS






"""

# list of titles from the TOC
raw_split = cleaner_text.split(splitter_bot)[0].split('\n\n')[1].split('\n ')
titles = [x.strip() for x in raw_split if len(x.strip())]

# the rest of the text after TOC
body = cleaner_text.split(splitter_top)[-1]

# printing out the list of works

print(len(titles), "Shakespeare's works:", titles)

# populating a mapping from works' titles to their texts - the KEY VARIABLE!

works = {}

for i in range(len(titles)):
    # base text - from the current title till the end of the all-in-one file
    text_down = titles[i] + '\n\n' + body.split(titles[i])[-1].strip()
    if i == len(titles) - 1: # the last text in the all-in-one file
        works[titles[i]] = text_down
    else: # other texts, enclosed between consecutive titles
        works[titles[i]] = text_down.split(titles[i+1])[0]

# printing out opening and ending samples of three selected works

print('*********** SONNETS opening sample:')
print(works['THE SONNETS'][:1000])
print('\n\n*********** SONNETS ending sample:')
print(works['THE SONNETS'][-1000:])
print('\n--------------------------------------------\n')
print('*********** AS YOU LIKE IT opening sample:')
print(works['AS YOU LIKE IT'][:1000])
print('\n\n*********** AS YOU LIKE IT ending sample:')
print(works['AS YOU LIKE IT'][-1000:])
print('\n--------------------------------------------\n')
print('*********** VENUS AND ADONIS opening sample:')
print(works['VENUS AND ADONIS'][:1000])
print('\n\n*********** VENUS AND ADONIS ending sample:')
print(works['VENUS AND ADONIS'][-1000:])
print('\n--------------------------------------------\n')

### Getting two corpora of Shakespeare's plays - one for comedies and one for tragedies

In [None]:
# the list of Shakespeare's comedies
comedy_titles = [
 'ALL’S WELL THAT ENDS WELL',
 'AS YOU LIKE IT',
 'THE COMEDY OF ERRORS',
 'LOVE’S LABOUR’S LOST',
 'MEASURE FOR MEASURE',
 'THE MERCHANT OF VENICE',
 'THE MERRY WIVES OF WINDSOR',
 'A MIDSUMMER NIGHT’S DREAM',
 'MUCH ADO ABOUT NOTHING',
 'PERICLES, PRINCE OF TYRE',
 'THE TAMING OF THE SHREW',
 'THE TEMPEST',
 'TWELFTH NIGHT; OR, WHAT YOU WILL',
 'THE TWO GENTLEMEN OF VERONA',
 'THE TWO NOBLE KINSMEN',
 'THE WINTER’S TALE',
 'CYMBELINE'
]

# the list of Shakespeare's tragedies
tragedy_titles = [
 'THE TRAGEDY OF ANTONY AND CLEOPATRA',
 'THE TRAGEDY OF CORIOLANUS',
 'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK',
 'THE TRAGEDY OF JULIUS CAESAR',
 'THE TRAGEDY OF KING LEAR',
 'THE TRAGEDY OF MACBETH',
 'THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE',
 'THE TRAGEDY OF ROMEO AND JULIET',
 'THE TRAGEDY OF TITUS ANDRONICUS',
 'TROILUS AND CRESSIDA',
 'THE LIFE OF TIMON OF ATHENS'
]

# the two corresponding corpora

comedies = '\n\n'.join([works[x] for x in comedy_titles])
tragedies = '\n\n'.join([works[x] for x in tragedy_titles])

print('The size of the comedy corpus (in simple tokens) :',
 len(comedies.split()))
print('The size of the tragedy corpus (in simple tokens):',
 len(tragedies.split()))

---
## 2. Towards extracting taxonomies of Shakespeare's worlds

The backbone of any knowledge representation is a taxonomy of concepts that encodes the hierarchy of entities along the generality/specificity axis (for example, mammals are a more general concept than felines, which in turn are a more general concept than cats).
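
As a toy illustration of such a hierarchy, the sketch below stores child-to-parent links in a dictionary and walks up from a concept to its more general ancestors. The concepts and the `ancestors` helper are made up for this example and are not part of the lab code.

```python
# child -> parent links of a tiny hand-made taxonomy (illustrative only)
parent_of = {
    'cat': 'feline',
    'lion': 'feline',
    'feline': 'mammal',
    'dog': 'canine',
    'canine': 'mammal',
}

def ancestors(concept, parent_of):
    """Return the chain of increasingly general concepts above `concept`."""
    chain = []
    while concept in parent_of:
        concept = parent_of[concept]
        chain.append(concept)
    return chain

print(ancestors('cat', parent_of))  # ['feline', 'mammal']
```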

Your task in this exercise is the following:
- Represent words occurring in Shakespeare's comedies and tragedies as vectors (using one of the word embedding methods you experimented with before).
 - Optionally, you may generate vector representations of bigrams as well.
- Use the vector representations to compute a [hierarchical clustering](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering) of the entities - this is the desired taxonomy structure, albeit without labels of particular non-leaf nodes.
- Plot the [dendrogram](https://en.wikipedia.org/wiki/Dendrogram) of your clusters (you may want to limit the depth to a few most general levels only).
- Optionally, try to assign labels to your clusters, which would let you perform some more in-depth cluster analysis and compare the conceptual structure of the comedy and tragedy worlds of the Bard. Techniques you may experiment with include:
 - Picking the term represented by a vector that is closest to the cluster centroid as the label.
 - Looking up the terms present in the cluster in [WordNet](https://wordnet.princeton.edu/) and picking the most common synset as the label (for instance, via the [NLTK API](https://www.nltk.org/howto/wordnet.html)).

In [None]:
# TODO - your code comes here



### A possible rudimentary solution

- Tokenising the texts using _nltk_

In [None]:
# need to tokenize the text into a list of sentences that are themselves
# lists of individual words

import nltk
nltk.download('punkt')

sentences_comedies = [nltk.word_tokenize(sentence) for sentence in
 nltk.sent_tokenize(comedies)]
sentences_tragedies = [nltk.word_tokenize(sentence) for sentence in
 nltk.sent_tokenize(tragedies)]

print('The size of the comedy corpus (in sentences) :',
 len(sentences_comedies))
print('The size of the tragedy corpus (in sentences):',
 len(sentences_tragedies))

- Generating word embeddings using _gensim_ and _word2vec_, taking also common bigrams into account

In [None]:
# training a word2vec model separately on each corpus, reflecting also
# common bigrams

from gensim.models.word2vec import Word2Vec
from gensim.models import phrases

# getting the bigram models first
print('Training the comedy bigram detection model...')
bigrams_comedies = phrases.Phrases(sentences_comedies)
print('Training the tragedy bigram detection model...')
bigrams_tragedies = phrases.Phrases(sentences_tragedies)

# training the embedding models on sentences run through the bigram detection
print('Training the comedy embedding model...')
model_comedies = Word2Vec(bigrams_comedies[sentences_comedies],min_count=2,
 vector_size=200,window=5,sg=1)
print('Training the tragedy embedding model...')
model_tragedies = Word2Vec(bigrams_tragedies[sentences_tragedies],min_count=2,
 vector_size=200,window=5,sg=1)

- Mapping terms to their vectors

In [None]:
# generating maps from the comedy and tragedy terms to their vectors

term2vec_comedies = dict([(word,model_comedies.wv[word]) for word in
 model_comedies.wv.index_to_key])
term2vec_tragedies = dict([(word,model_tragedies.wv[word]) for word in
 model_tragedies.wv.index_to_key])

print('Number of comedy terms/vectors :', len(term2vec_comedies))
print('Number of tragedy terms/vectors:', len(term2vec_tragedies))

- Creating feature matrices from the mappings

In [None]:
import numpy as np

# creating lists of integer index-word pairs from the word-vector mappings
i2w_list_comedies = [(i,w) for i, w in enumerate(term2vec_comedies)]
i2w_list_tragedies = [(i,w) for i, w in enumerate(term2vec_tragedies)]

# mappings between the integer indices and words
i2w_dict_comedies = dict(i2w_list_comedies)
i2w_dict_tragedies = dict(i2w_list_tragedies)

# features matrices as numpy objects
X_comedies = np.array([term2vec_comedies[w] for _,w in i2w_list_comedies])
X_tragedies = np.array([term2vec_tragedies[w] for _,w in i2w_list_tragedies])

print('The shape of the comedy feature matrix :', X_comedies.shape)
print('The shape of the tragedy feature matrix:', X_tragedies.shape)

- Getting hierarchies of the terms in the corpora using [agglomerative clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering)

In [None]:
from matplotlib import pyplot as plt

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram
    # NOTE: taken from the Scikit-learn documentation, licensed by the tool
    # developers under the BSD license

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1 # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

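# NOTE: these assignments reuse (and overwrite) the names of the word2vec
# models trained above, so the term-to-vector maps must be computed beforehand
model_comedies = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model_tragedies = AgglomerativeClustering(distance_threshold=0, n_clusters=None)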

print('Fitting the comedy clustering model...')
clustering_comedies = model_comedies.fit(X_comedies)
print('Fitting the tragedy clustering model...')
clustering_tragedies = model_tragedies.fit(X_tragedies)

- Plotting the comedy dendrogram

In [None]:
plt.title("Hierarchical Clustering Dendrogram of the Comedy Corpus")
# plot the top five levels of the dendrogram
plot_dendrogram(model_comedies, truncate_mode="level", p=5)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

- Plotting the tragedy dendrogram

In [None]:
plt.title("Hierarchical Clustering Dendrogram of the Tragedy Corpus")
# plot the top five levels of the dendrogram
plot_dendrogram(model_tragedies, truncate_mode="level", p=5)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
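
- (Optional) Labelling clusters by the term closest to the centroid

The solution above stops at the unlabelled dendrograms. The following is a minimal sketch of the centroid-based labelling idea from the task description, not a definitive implementation. The function name `label_clusters_by_centroid` and the toy data are assumptions made for this example; it expects a feature matrix, a list of words aligned with its rows, and one flat cluster id per row.

```python
import numpy as np

def label_clusters_by_centroid(X, words, cluster_ids):
    """Map each cluster id to the word whose vector is closest to the centroid."""
    X = np.asarray(X, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    labels = {}
    for cid in np.unique(cluster_ids):
        # rows of X that belong to the current cluster
        member_idx = np.flatnonzero(cluster_ids == cid)
        centroid = X[member_idx].mean(axis=0)
        # Euclidean distance of every cluster member to the centroid
        dists = np.linalg.norm(X[member_idx] - centroid, axis=1)
        labels[int(cid)] = words[member_idx[np.argmin(dists)]]
    return labels

# toy data (hand-made, illustrative only)
toy_X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
         [10.0, 10.0], [10.0, 12.0], [10.0, 11.0]]
toy_words = ['king', 'queen', 'prince', 'sword', 'dagger', 'blade']
toy_clusters = [0, 0, 0, 1, 1, 1]

toy_labels = label_clusters_by_centroid(toy_X, toy_words, toy_clusters)
print(toy_labels)
```

On the lab data, the cluster ids could come, for instance, from `AgglomerativeClustering(n_clusters=20).fit_predict(X_comedies)` (the value 20 is an arbitrary choice), with the word list built from `i2w_list_comedies`.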

---
## 3. Towards relation extraction

Once the taxonomy is in place, one may also want to extract the "horizontal" relations between entities occurring in the input text. These are often represented as triples (or triplets), either in the `(subject, predicate, object)` or in the `(head, relation_type, tail)` form (both correspond to a typed, directed edge from the entity in the first element to the entity in the third element).

Your task in this exercise is as follows:
- Do some research, trying to find a pre-trained model for extracting relations from text.
 - There are several models available via [spaCy](https://spacy.io/) or [HuggingFace](https://huggingface.co/) that might do the trick.
 - Alternatively, you can run a NER model or tool (such as NLTK) and do the relation extraction on top of the extracted named entities on your own (even simple co-occurrence analysis of entities that frequently appear within the same context can be quite useful).
- Apply your model or tool of choice on the comedy and tragedy corpora.
- Explore the results.
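
The simple co-occurrence analysis mentioned above could look roughly like the sketch below. It assumes that the entity names have already been obtained (for instance, from a NER step) and that the text is available as tokenized sentences. The function name `cooccurrence_triples`, the relation name `co_occurs_with`, and the toy sentences are hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_triples(sentences, entities, min_count=2):
    """Count entity pairs that share a sentence and return them as triples
    extended with the co-occurrence count."""
    entity_set = set(entities)
    pair_counts = Counter()
    for tokens in sentences:
        # entities present in this sentence, sorted to get a canonical pair order
        present = sorted(entity_set.intersection(tokens))
        pair_counts.update(combinations(present, 2))
    return [(a, 'co_occurs_with', b, n)
            for (a, b), n in pair_counts.most_common() if n >= min_count]

# toy input (hand-made, illustrative only)
toy_sentences = [
    ['Romeo', 'loves', 'Juliet'],
    ['Juliet', 'waits', 'for', 'Romeo'],
    ['Tybalt', 'fights', 'Romeo'],
]
toy_entities = ['Romeo', 'Juliet', 'Tybalt']

toy_triples = cooccurrence_triples(toy_sentences, toy_entities)
print(toy_triples)  # [('Juliet', 'co_occurs_with', 'Romeo', 2)]
```

Matching on single tokens is a simplification; multi-word entity names would need a phrase-level match instead.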


In [None]:
# TODO - your code comes here

### A possible rudimentary solution

- Defining a relation extraction function

In [None]:
# NOTE: the following code is based on the Hugging Face examples of the Rebel
# model usage (cf. https://huggingface.co/Babelscape/rebel-large)

from transformers import pipeline

# Creating the pipeline
triplet_extractor = pipeline('text2text-generation',
 model='Babelscape/rebel-large',
 tokenizer='Babelscape/rebel-large')

# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    # dropping the sequence-level special tokens, keeping the REBEL markers
    for token in text.replace("<s>",
                              "").replace("<pad>",
                                          "").replace("</s>",
                                                      "").split():
        if token == "<triplet>":
            # a new head entity starts; flush the triplet collected so far
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(),
                                 'type': relation.strip(),
                                 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            # a new tail entity starts; flush the triplet collected so far
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(),
                                 'type': relation.strip(),
                                 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            # the relation type starts
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(),
                         'type': relation.strip(),
                         'tail': object_.strip()})
    return triplets

- Tokenizing the comedy and tragedy texts

In [59]:
# We need to use the tokenizer manually since we need special tokens.
extracted_text_comedies_one_chunk = triplet_extractor.tokenizer.batch_decode(
 [triplet_extractor(comedies[:1024],
 return_tensors=True,
 return_text=False)[0]["generated_token_ids"]])
extracted_text_tragedies_one_chunk = triplet_extractor.tokenizer.batch_decode(
 [triplet_extractor(tragedies[:1024],
 return_tensors=True,
 return_text=False)[0]["generated_token_ids"]])

In [69]:
def extract_text_batched(corpus, batch_limit=1024, verbose=False):
    # getting chunks shorter than batch_limit characters
    chunks, chunk = [], ''
    for sentence in nltk.sent_tokenize(corpus):
        # sent_tokenize yields plain strings, so they are concatenated directly
        if len(chunk + ' ' + sentence) < batch_limit:
            chunk += ' ' + sentence
        else:
            if len(chunk):
                chunks.append(chunk)
            chunk = sentence
    if len(chunk):
        chunks.append(chunk)

    # extracting the relations from each chunk
    extracted_text = ''
    for i, chunk in enumerate(chunks):
        if verbose:
            print(f' ... processing batch {i+1} out of {len(chunks)}')
        extracted_text += ' ' + triplet_extractor.tokenizer.batch_decode(
            [triplet_extractor(chunk,
                               return_tensors=True,
                               return_text=False)[0]["generated_token_ids"]])[0]

    # returning the concatenated results of the batched extraction steps
    return extracted_text


- Extracting sample triples from the tokenized texts (one chunk only)

In [None]:
print('Extracting sample comedy triplets (one chunk only):')
triplets_sample_comedies = \
 extract_triplets(extracted_text_comedies_one_chunk[0])
print(' - sample:', triplets_sample_comedies)

print('Extracting sample tragedy triplets (one chunk only):')
triplets_sample_tragedies = \
 extract_triplets(extracted_text_tragedies_one_chunk[0])
print(' - sample:', triplets_sample_tragedies)

- Extracting sample triples from the tokenized texts (all chunks)

In [None]:
print('Extracting relations from the whole comedy corpus in batches...')
extracted_text_comedies = extract_text_batched(comedies,verbose=True)
print('Extracting relations from the whole tragedy corpus in batches...')
extracted_text_tragedies = extract_text_batched(tragedies,verbose=True)

print('Extracting sample comedy triplets (all chunks):')
triplets_sample_comedies = extract_triplets(extracted_text_comedies)
print(' - sample (up to 100 triples):', triplets_sample_comedies[:100])

print('Extracting sample tragedy triplets (all chunks):')
triplets_sample_tragedies = extract_triplets(extracted_text_tragedies)
print(' - sample (up to 100 triples):', triplets_sample_tragedies[:100])
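
- (Optional) Exploring the extracted triplets

One simple way to explore the results is to count which relation types and head entities occur most often. The sketch below assumes triplets in the dictionary format returned by `extract_triplets` (keys `head`, `type`, `tail`). The helper name `summarise_triplets` and the toy triplets are made up for illustration and are not actual model output.

```python
from collections import Counter

def summarise_triplets(triplets, top_n=5):
    """Return the most frequent relation types and head entities."""
    type_counts = Counter(t['type'] for t in triplets)
    head_counts = Counter(t['head'] for t in triplets)
    return type_counts.most_common(top_n), head_counts.most_common(top_n)

# toy triplets (hand-made, NOT produced by the model)
toy_triplets = [
    {'head': 'Hamlet', 'type': 'father', 'tail': 'King Hamlet'},
    {'head': 'Hamlet', 'type': 'mother', 'tail': 'Gertrude'},
    {'head': 'Ophelia', 'type': 'father', 'tail': 'Polonius'},
]

top_types, top_heads = summarise_triplets(toy_triplets)
print('Most frequent relation types:', top_types)
print('Most frequent head entities :', top_heads)
```

Running the same helper on `triplets_sample_comedies` and `triplets_sample_tragedies` gives a quick side-by-side view of the two corpora.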

---
## 4. (Optional) Putting it all together in a knowledge graph

Once you have the taxonomy and a set of horizontal relations, you can represent the results as a knowledge graph. This may be as simple as a CSV file with three columns corresponding to the triple elements.

The taxonomical relations can be represented as special triples, for instance as follows:
- For every two concepts A and B such that A is a parent of B in the hierarchical clustering tree, add a `(B, is_a, A)` triple to the knowledge graph.
- For every two concepts C and D such that C and D are siblings (i.e., they have a common parent in the hierarchical clustering tree), add the `(C, similar_to, D), (D, similar_to, C)` triples to the knowledge graph.

If you managed to get labels for your taxonomy, it should be rather straightforward to create lists of triples extracted from the two Shakespearean corpora and store them as two CSVs corresponding to the knowledge graph representations.
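
The conversion of a clustering tree into `is_a` and `similar_to` triples can be sketched as follows. The sketch assumes a merge list in the format of scikit-learn's `children_` attribute, where row `i` holds the two children of the internal node with id `n_leaves + i`. Internal nodes get placeholder names such as `cluster_3` here, since no labels are assumed. The function name `taxonomy_triples` and the toy tree are hypothetical.

```python
import csv
import io

def taxonomy_triples(children, leaf_names):
    """Turn a merge list (scikit-learn `children_` format) into triples."""
    n_leaves = len(leaf_names)

    def name(node_id):
        # leaves keep their term, internal nodes get a placeholder name
        if node_id < n_leaves:
            return leaf_names[node_id]
        return f'cluster_{node_id}'

    triples = []
    for row, (left, right) in enumerate(children):
        parent = name(n_leaves + row)
        triples.append((name(left), 'is_a', parent))
        triples.append((name(right), 'is_a', parent))
        triples.append((name(left), 'similar_to', name(right)))
        triples.append((name(right), 'similar_to', name(left)))
    return triples

# toy tree over three terms: leaves 0 and 1 merge into node 3,
# then leaf 2 and node 3 merge into node 4 (illustrative only)
toy_leaf_names = ['king', 'queen', 'sword']
toy_children = [(0, 1), (2, 3)]
toy_triples = taxonomy_triples(toy_children, toy_leaf_names)

# writing the triples as three-column CSV (in memory for this example)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['head', 'relation', 'tail'])
writer.writerows(toy_triples)
csv_text = buffer.getvalue()
print(csv_text)
```

To store the graph on disk, the in-memory buffer can be swapped for a file opened with `open(path, 'w', newline='')`. The triples from the relation extraction step can be appended to the same file in the same three-column layout.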