{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "collapsed_sections": [ "TIPJ8mo0NYXi", "xFA5GHlhmU5Q" ] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "Z5anb3UHd6pv" }, "source": [ "# PA164 - Lab 6: Knowledge extraction\n", "\n", "__Outline:__\n", "1. Back to Shakespeare\n", "2. Towards extracting taxonomies of Shakespeare's worlds\n", "3. Towards relation extraction\n", "4. (Optional) Putting it all together in a knowledge graph" ] }, { "cell_type": "markdown", "metadata": { "id": "M68rkmcHJlLR" }, "source": [ "---\n", "\n", "## 1. Back to Shakespeare\n", "\n", "\"will\"" ] }, { "cell_type": "markdown", "metadata": { "id": "HMvoNNRFJlLy" }, "source": [ "### Downloading and cleaning the Shakespeare's works" ] }, { "cell_type": "code", "metadata": { "id": "PtmDjwO-JlLz" }, "source": [ "import urllib.request # import library for opening URLs, etc.\n", "\n", "# open a link to sample text\n", "\n", "sample_text_link = \"https://www.gutenberg.org/files/100/100-0.txt\"\n", "f = urllib.request.urlopen(sample_text_link)\n", "\n", "# decoding the content of the link (just convert the binary string to text -\n", "# it is already in a relatively clean plain text format)\n", "\n", "sample_text = f.read().decode(\"utf-8\")\n", "\n", "# cutting the metadata in the beginning\n", "\n", "cleaner_text = sample_text.split(' Contents')[1]\n", "\n", "# cutting the appendix after the main story\n", "\n", "cleaner_text = cleaner_text.split('*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***')[0]\n", "\n", "# deleting the '\\r' characters\n", "\n", "cleaner_text = cleaner_text.replace('\\r','')" ], "execution_count": 40, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "LFFrvDzP_2DF" }, "source": [ "### Getting the separate texts of the Shakespeare's works" ] }, { "cell_type": "code", "metadata": { "id": "PpBW0Oo-iSxS" }, "source": [ "# getting the list of titles of Shakespeare's work from the table of contents\n", "\n", "# to split at the TOC from the bottom\n", "splitter_bot = \"\"\"THE SONNETS\n", "\n", " 1\"\"\"\n", "\n", "# to split at the TOC from the top\n", "splitter_top = \"\"\"VENUS AND ADONIS\n", "\n", "\n", "\n", "\n", "\n", "\n", "\"\"\"\n", "\n", "# list of titles from the TOC\n", "raw_split = cleaner_text.split(splitter_bot)[0].split('\\n\\n')[1].split('\\n ')\n", "titles = [x.strip() for x in raw_split if len(x.strip())]\n", "\n", "# the rest of the text after TOC\n", "body = cleaner_text.split(splitter_top)[-1]\n", "\n", "# printing out the list of works\n", "\n", "print(len(titles), \"Shakespeare's works:\", titles)\n", "\n", "# populating a mapping from works' titles to their texts - the KEY VARIABLE!\n", "\n", "works = {}\n", "\n", "for i in range(len(titles)):\n", " # base text - from the current title till the end of the all-in-one file\n", " text_down = titles[i] + '\\n\\n' + body.split(titles[i])[-1].strip()\n", " if i == len(titles) - 1: # the last text in the all-in-one file\n", " works[titles[i]] = text_down\n", " else: # other texts, enclosed between consecutive titles\n", " works[titles[i]] = text_down.split(titles[i+1])[0]\n", "\n", "# printing out opening and ending samples of three selected works\n", "\n", "print('*********** SONNETS opening sample:')\n", "print(works['THE SONNETS'][:1000])\n", "print('\\n\\n*********** SONNETS ending sample:')\n", "print(works['THE SONNETS'][-1000:])\n", 
"print('\\n--------------------------------------------\\n')\n", "print('*********** AS YOU LIKE IT opening sample:')\n", "print(works['AS YOU LIKE IT'][:1000])\n", "print('\\n\\n*********** AS YOU LIKE IT ending sample:')\n", "print(works['AS YOU LIKE IT'][-1000:])\n", "print('\\n--------------------------------------------\\n')\n", "print('*********** VENUS AND ADONIS opening sample:')\n", "print(works['VENUS AND ADONIS'][:1000])\n", "print('\\n\\n*********** VENUS AND ADONIS ending sample:')\n", "print(works['VENUS AND ADONIS'][-1000:])\n", "print('\\n--------------------------------------------\\n')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Getting two corpora of Shakespeare's plays - one for comedies and one for tragedies" ], "metadata": { "id": "9k67nyL_JuvH" } }, { "cell_type": "code", "source": [ "# the list of Shakespeare's comedies\n", "comedy_titles = [\n", " 'ALL’S WELL THAT ENDS WELL',\n", " 'AS YOU LIKE IT',\n", " 'THE COMEDY OF ERRORS',\n", " 'LOVE’S LABOUR’S LOST',\n", " 'MEASURE FOR MEASURE',\n", " 'THE MERCHANT OF VENICE',\n", " 'THE MERRY WIVES OF WINDSOR',\n", " 'A MIDSUMMER NIGHT’S DREAM',\n", " 'MUCH ADO ABOUT NOTHING',\n", " 'PERICLES, PRINCE OF TYRE',\n", " 'THE TAMING OF THE SHREW',\n", " 'THE TEMPEST',\n", " 'TWELFTH NIGHT; OR, WHAT YOU WILL',\n", " 'THE TWO GENTLEMEN OF VERONA',\n", " 'THE TWO NOBLE KINSMEN',\n", " 'THE WINTER’S TALE',\n", " 'CYMBELINE'\n", "]\n", "\n", "# the list of Shakespeare's tragedies\n", "tragedy_titles = [\n", " 'THE TRAGEDY OF ANTONY AND CLEOPATRA',\n", " 'THE TRAGEDY OF CORIOLANUS',\n", " 'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK',\n", " 'THE TRAGEDY OF JULIUS CAESAR',\n", " 'THE TRAGEDY OF KING LEAR',\n", " 'THE TRAGEDY OF MACBETH',\n", " 'THE TRAGEDY OF OTHELLO, THE MOOR OF VENICE',\n", " 'THE TRAGEDY OF ROMEO AND JULIET',\n", " 'THE TRAGEDY OF TITUS ANDRONICUS',\n", " 'TROILUS AND CRESSIDA',\n", " 'THE LIFE OF TIMON OF ATHENS'\n", "]\n", "\n", "# the two corresponding corpora\n", "\n", "comedies = '\\n\\n'.join([works[x] for x in comedy_titles])\n", "tragedies = '\\n\\n'.join([works[x] for x in tragedy_titles])\n", "\n", "print('The size of the comedy corpus (in simple tokens) :',\n", " len(comedies.split()))\n", "print('The size of the tragedy corpus (in simple tokens):',\n", " len(tragedies.split()))" ], "metadata": { "id": "MlyBe0SGJ93f" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "---\n", "## 2. 
Towards extracting taxonomies of Shakespeare's worlds\n",
"\n",
"The backbone of any knowledge representation is a taxonomy of concepts that encodes the hierarchy of entities along the generality/specificity axis (for example, mammals are a more general concept than felines, which in turn are more general than cats, etc.).\n",
"\n",
"Your task in this exercise is the following:\n",
"- Represent words occurring in Shakespeare's comedies and tragedies as vectors (using one of the word embedding methods you experimented with before).\n",
"  - Optionally, you may generate vector representations of bigrams as well.\n",
"- Use the vector representations to compute a [hierarchical clustering](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering) of the entities - this is the desired taxonomy structure, albeit without labels for the non-leaf nodes.\n",
"- Plot the [dendrogram](https://en.wikipedia.org/wiki/Dendrogram) of your clusters (you may want to limit the depth to a few of the most general levels only).\n",
"- Optionally, try to assign labels to your clusters, which would let you perform some more in-depth cluster analysis and compare the conceptual structure of the comedy and tragedy worlds of the Bard. Techniques you may experiment with are, for instance, these (a rudimentary centroid-based labelling sketch is included after the dendrogram plots below):\n",
"  - Picking the term whose vector is closest to the cluster centroid as the label.\n",
"  - Looking up the terms present in the cluster in [WordNet](https://wordnet.princeton.edu/) and picking the most common synset as the label (for instance, via the [NLTK API](https://www.nltk.org/howto/wordnet.html))."
], "metadata": { "id": "4dtCwt2qQ51A" } },
{ "cell_type": "code", "source": [ "# TODO - your code comes here" ], "metadata": { "id": "YmM2k4eO4ynt" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "TIPJ8mo0NYXi" }, "source": [ "\n", "\n", "### A possible rudimentary solution" ] },
{ "cell_type": "markdown", "source": [ "- Tokenising the texts using _nltk_" ], "metadata": { "id": "aS2nU4u_2WWD" } },
{ "cell_type": "code", "source": [
"# we need to tokenize the texts into lists of sentences that are themselves\n",
"# lists of individual words\n",
"\n",
"import nltk\n",
"nltk.download('punkt')\n",
"\n",
"sentences_comedies = [nltk.word_tokenize(sentence) for sentence in\n",
"                      nltk.sent_tokenize(comedies)]\n",
"sentences_tragedies = [nltk.word_tokenize(sentence) for sentence in\n",
"                       nltk.sent_tokenize(tragedies)]\n",
"\n",
"print('The size of the comedy corpus (in sentences) :',\n",
"      len(sentences_comedies))\n",
"print('The size of the tragedy corpus (in sentences):',\n",
"      len(sentences_tragedies))"
], "metadata": { "id": "2Jg9o10K9QnO" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "- Generating word embeddings using _gensim_ and _word2vec_, also taking common bigrams into account" ], "metadata": { "id": "-0KO59rYToT0" } },
{ "cell_type": "code", "metadata": { "id": "Yfyw-W1yNe2b" }, "source": [
"# training a word2vec model separately on each corpus, also reflecting\n",
"# common bigrams\n",
"\n",
"from gensim.models.word2vec import Word2Vec\n",
"from gensim.models import phrases\n",
"\n",
"# getting the bigram models first\n",
"print('Training the comedy bigram detection model...')\n",
"bigrams_comedies = phrases.Phrases(sentences_comedies)\n",
"print('Training the tragedy bigram detection model...')\n",
"bigrams_tragedies = phrases.Phrases(sentences_tragedies)\n",
"\n",
"# training the embedding models on sentences run through the bigram detection\n",
"print('Training the comedy embedding model...')\n",
"model_comedies = Word2Vec(bigrams_comedies[sentences_comedies],min_count=2,\n",
"                          vector_size=200,window=5,sg=1)\n",
"print('Training the tragedy embedding model...')\n",
"model_tragedies = Word2Vec(bigrams_tragedies[sentences_tragedies],min_count=2,\n",
"                           vector_size=200,window=5,sg=1)"
], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "- Mapping terms to their vectors" ], "metadata": { "id": "nwNWo4qgUlbD" } },
{ "cell_type": "code", "source": [
"# generating maps from the comedy and tragedy terms to their vectors\n",
"\n",
"term2vec_comedies = dict([(word,model_comedies.wv[word]) for word in\n",
"                          model_comedies.wv.index_to_key])\n",
"term2vec_tragedies = dict([(word,model_tragedies.wv[word]) for word in\n",
"                           model_tragedies.wv.index_to_key])\n",
"\n",
"print('Number of comedy terms/vectors :', len(term2vec_comedies))\n",
"print('Number of tragedy terms/vectors:', len(term2vec_tragedies))"
], "metadata": { "id": "ima01QMK8GPX" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "- Creating feature matrices from the mappings" ], "metadata": { "id": "82JewB7EgH7x" } },
{ "cell_type": "code", "source": [
"import numpy as np\n",
"\n",
"# creating lists of integer index-word pairs from the word-vector mappings\n",
"i2w_list_comedies = [(i,w) for i, w in enumerate(term2vec_comedies)]\n",
"i2w_list_tragedies = [(i,w) for i, w in enumerate(term2vec_tragedies)]\n",
"\n",
"# mappings between the integer indices and words\n",
"i2w_dict_comedies = dict(i2w_list_comedies)\n",
"i2w_dict_tragedies = dict(i2w_list_tragedies)\n",
"\n",
"# feature matrices as numpy objects\n",
"X_comedies = np.array([term2vec_comedies[w] for _,w in i2w_list_comedies])\n",
"X_tragedies = np.array([term2vec_tragedies[w] for _,w in i2w_list_tragedies])\n",
"\n",
"print('The shape of the comedy feature matrix :', X_comedies.shape)\n",
"print('The shape of the tragedy feature matrix:', X_tragedies.shape)"
], "metadata": { "id": "W7xi41S6gL7x" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "- Getting hierarchies of the terms in the corpora using [agglomerative clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering)" ], "metadata": { "id": "T7ZvjdpVx_lx" } },
{ "cell_type": "code", "source": [
"from matplotlib import pyplot as plt\n",
"\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"from scipy.cluster.hierarchy import dendrogram\n",
"\n",
"def plot_dendrogram(model, **kwargs):\n",
"    # Create linkage matrix and then plot the dendrogram\n",
"    # NOTE: taken from the Scikit-learn documentation, licensed by the tool\n",
"    # developers under the BSD license\n",
"\n",
"    # create the counts of samples under each node\n",
"    counts = np.zeros(model.children_.shape[0])\n",
"    n_samples = len(model.labels_)\n",
"    for i, merge in enumerate(model.children_):\n",
"        current_count = 0\n",
"        for child_idx in merge:\n",
"            if child_idx < n_samples:\n",
"                current_count += 1  # leaf node\n",
"            else:\n",
"                current_count += counts[child_idx - n_samples]\n",
"        counts[i] = current_count\n",
"\n",
"    linkage_matrix = np.column_stack(\n",
"        [model.children_, model.distances_, counts]\n",
"    ).astype(float)\n",
"\n",
"    # Plot the corresponding dendrogram\n",
"    dendrogram(linkage_matrix, **kwargs)\n",
"\n",
"model_comedies = AgglomerativeClustering(distance_threshold=0, n_clusters=None)\n",
"model_tragedies = AgglomerativeClustering(distance_threshold=0, n_clusters=None)\n",
"\n",
"print('Fitting the comedy clustering model...')\n",
"clustering_comedies = model_comedies.fit(X_comedies)\n",
"print('Fitting the tragedy clustering model...')\n",
"clustering_tragedies = model_tragedies.fit(X_tragedies)"
], "metadata": { "id": "DNDPR4bXyPjU" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "- Plotting the comedy dendrogram" ], "metadata": { "id": "UHS15bCz0CsG" } },
{ "cell_type": "code", "source": [
"plt.title(\"Hierarchical Clustering Dendrogram of the Comedy Corpus\")\n",
"# plot the top five levels of the dendrogram\n",
"plot_dendrogram(model_comedies, truncate_mode=\"level\", p=5)\n",
"plt.xlabel(\"Number of points in node (or index of point if no parenthesis).\")\n",
"plt.show()"
], "metadata": { "id": "beInd-gH0GVb" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "- Plotting the tragedy dendrogram" ], "metadata": { "id": "jttavf1y065J" } },
{ "cell_type": "code", "source": [
"plt.title(\"Hierarchical Clustering Dendrogram of the Tragedy Corpus\")\n",
"# plot the top five levels of the dendrogram\n",
"plot_dendrogram(model_tragedies, truncate_mode=\"level\", p=5)\n",
"plt.xlabel(\"Number of points in node (or index of point if no parenthesis).\")\n",
"plt.show()"
], "metadata": { "id": "asxZRIV9065L" }, "execution_count": null, "outputs": [] },
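{ "cell_type": "markdown", "source": [ "- (Added sketch) Assigning rudimentary labels to a flat cut of the hierarchy by picking, for each cluster, the member term whose vector lies closest to the cluster centroid. This only illustrates the optional labelling step: it assumes the `X_*` matrices and `i2w_dict_*` mappings computed above, and the number of flat clusters (20) is an arbitrary choice, not part of the lab assignment." ], "metadata": {} },
{ "cell_type": "code", "source": [
"# a minimal labelling sketch (not part of the original solution) - for each\n",
"# of a fixed number of flat clusters, the member term closest to the cluster\n",
"# centroid (in Euclidean distance) is used as the cluster label; the choice\n",
"# of 20 clusters is an arbitrary illustrative assumption\n",
"\n",
"import numpy as np\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"\n",
"def label_clusters(X, i2w_dict, n_clusters=20):\n",
"    flat = AgglomerativeClustering(n_clusters=n_clusters).fit(X)\n",
"    labels = {}\n",
"    for c in range(n_clusters):\n",
"        members = np.where(flat.labels_ == c)[0]\n",
"        centroid = X[members].mean(axis=0)\n",
"        closest = members[np.argmin(np.linalg.norm(X[members] - centroid,\n",
"                                                   axis=1))]\n",
"        labels[c] = (i2w_dict[int(closest)], len(members))\n",
"    return labels\n",
"\n",
"print('Comedy cluster labels (label, cluster size):')\n",
"print(label_clusters(X_comedies, i2w_dict_comedies))\n",
"print('Tragedy cluster labels (label, cluster size):')\n",
"print(label_clusters(X_tragedies, i2w_dict_tragedies))"
], "metadata": {}, "execution_count": null, "outputs": [] },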
{ "cell_type": "markdown", "source": [
"---\n",
"## 3. Towards relation extraction\n",
"\n",
"Once the taxonomy is sorted, one may also want to extract the \"horizontal\" relations between entities occurring in the input text. These are often represented as triples (or triplets), either in the `(subject, predicate, object)` or `(head, relation_type, tail)` form (both correspond to a typed oriented edge between the entities occurring as the first and third element, respectively).\n",
"\n",
"Your task in this exercise is as follows:\n",
"- Do some research, trying to find a pre-trained model for extracting relations from text.\n",
"  - There are several models available via [spaCy](https://spacy.io/) or [HuggingFace](https://huggingface.co/) that might do the trick.\n",
"  - Alternatively, you can run an NER model or tool (such as NLTK) and do the relation extraction on top of the extracted named entities on your own (even simple co-occurrence analysis of entities that frequently appear within the same context can be quite useful).\n",
"- Apply your model or tool of choice to the comedy and tragedy corpora.\n",
"- Explore the results.\n"
], "metadata": { "id": "rWxfWlRjeccy" } },
{ "cell_type": "code", "source": [ "# TODO - your code comes here" ], "metadata": { "id": "5lyFTA6J41Gl" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "### A possible rudimentary solution" ], "metadata": { "id": "xFA5GHlhmU5Q" } },
{ "cell_type": "markdown", "source": [ "- Defining a relation extraction function" ], "metadata": { "id": "UZI-zhdWerAz" } },
{ "cell_type": "code", "source": [
"# NOTE: the following code is based on the Hugging Face examples of the REBEL\n",
"# model usage (cf. https://huggingface.co/Babelscape/rebel-large)\n",
"\n",
"from transformers import pipeline\n",
"\n",
"# Creating the pipeline\n",
"triplet_extractor = pipeline('text2text-generation',\n",
"                             model='Babelscape/rebel-large',\n",
"                             tokenizer='Babelscape/rebel-large')\n",
"\n",
"# Function to parse the generated text and extract the triplets; the decoded\n",
"# text linearises each triplet as '<triplet> head <subj> tail <obj> relation'\n",
"def extract_triplets(text):\n",
"    triplets = []\n",
"    subject, relation, object_ = '', '', ''\n",
"    text = text.strip()\n",
"    current = 'x'\n",
"    for token in text.replace(\"<s>\", \"\").replace(\"<pad>\", \"\").replace(\"</s>\", \"\").split():\n",
"        if token == \"<triplet>\":\n",
"            current = 't'\n",
"            if relation != '':\n",
"                triplets.append({'head': subject.strip(),\n",
"                                 'type': relation.strip(),\n",
"                                 'tail': object_.strip()})\n",
"                relation = ''\n",
"            subject = ''\n",
"        elif token == \"<subj>\":\n",
"            current = 's'\n",
"            if relation != '':\n",
"                triplets.append({'head': subject.strip(),\n",
"                                 'type': relation.strip(),\n",
"                                 'tail': object_.strip()})\n",
"            object_ = ''\n",
"        elif token == \"<obj>\":\n",
"            current = 'o'\n",
"            relation = ''\n",
"        else:\n",
"            if current == 't':\n",
"                subject += ' ' + token\n",
"            elif current == 's':\n",
"                object_ += ' ' + token\n",
"            elif current == 'o':\n",
"                relation += ' ' + token\n",
"    if subject != '' and relation != '' and object_ != '':\n",
"        triplets.append({'head': subject.strip(),\n",
"                         'type': relation.strip(),\n",
"                         'tail': object_.strip()})\n",
"    return triplets"
], "metadata": { "id": "EnB2qOwEfcHm" }, "execution_count": null, "outputs": [] },
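{ "cell_type": "markdown", "source": [ "- (Added sketch) A quick sanity check of the parsing function on a hand-written example of REBEL's linearised output format; the entities and the relation in the example are made up purely for illustration." ], "metadata": {} },
{ "cell_type": "code", "source": [
"# a made-up example of the linearised format decoded from REBEL, used only\n",
"# to illustrate what extract_triplets() returns\n",
"example = \"<s><triplet> Hamlet <subj> Denmark <obj> country of citizenship</s>\"\n",
"print(extract_triplets(example))\n",
"# expected output:\n",
"# [{'head': 'Hamlet', 'type': 'country of citizenship', 'tail': 'Denmark'}]"
], "metadata": {}, "execution_count": null, "outputs": [] },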
{ "cell_type": "markdown", "source": [ "- Tokenizing the comedy and tragedy texts" ], "metadata": { "id": "JvunblVoh5t3" } },
{ "cell_type": "code", "source": [
"# We need to use the tokenizer manually since we need special tokens.\n",
"extracted_text_comedies_one_chunk = triplet_extractor.tokenizer.batch_decode(\n",
"    [triplet_extractor(comedies[:1024],\n",
"                       return_tensors=True,\n",
"                       return_text=False)[0][\"generated_token_ids\"]])\n",
"extracted_text_tragedies_one_chunk = triplet_extractor.tokenizer.batch_decode(\n",
"    [triplet_extractor(tragedies[:1024],\n",
"                       return_tensors=True,\n",
"                       return_text=False)[0][\"generated_token_ids\"]])"
], "metadata": { "id": "rdIrcW2ag0Eb" }, "execution_count": 59, "outputs": [] },
{ "cell_type": "code", "source": [
"def extract_text_batched(corpus, batch_limit=1024, verbose=False):\n",
"    # getting chunks shorter than batch_limit characters\n",
"    chunks, chunk = [], ''\n",
"    for sentence in nltk.sent_tokenize(corpus):\n",
"        if len(chunk + ' ' + sentence) < batch_limit:\n",
"            chunk += ' ' + sentence\n",
"        else:\n",
"            chunks.append(chunk)\n",
"            chunk = sentence\n",
"    if len(chunk):\n",
"        chunks.append(chunk)\n",
"\n",
"    # extracting the relations from each chunk\n",
"    extracted_text = ''\n",
"    for i, chunk in enumerate(chunks):\n",
"        if verbose:\n",
"            print(f' ... processing batch {i+1} out of {len(chunks)}')\n",
"        extracted_text += ' ' + triplet_extractor.tokenizer.batch_decode(\n",
"            [triplet_extractor(chunk,\n",
"                               return_tensors=True,\n",
"                               return_text=False)[0][\"generated_token_ids\"]])[0]\n",
"\n",
"    # returning the concatenated results of the batched extraction steps\n",
"    return extracted_text\n"
], "metadata": { "id": "LS0ovNtUPLxk" }, "execution_count": 69, "outputs": [] },
{ "cell_type": "markdown", "source": [ "- Extracting sample triples from the tokenized texts (one chunk only)" ], "metadata": { "id": "uA112vw5h_Sl" } },
{ "cell_type": "code", "source": [
"print('Extracting sample comedy triplets (one chunk only):')\n",
"triplets_sample_comedies = \\\n",
"    extract_triplets(extracted_text_comedies_one_chunk[0])\n",
"print(' - sample:', triplets_sample_comedies)\n",
"\n",
"print('Extracting sample tragedy triplets (one chunk only):')\n",
"triplets_sample_tragedies = \\\n",
"    extract_triplets(extracted_text_tragedies_one_chunk[0])\n",
"print(' - sample:', triplets_sample_tragedies)"
], "metadata": { "id": "nqS-4UrAiCuz" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [ "- Extracting sample triples from the tokenized texts (all chunks)" ], "metadata": { "id": "4icpGeWEW2fy" } },
{ "cell_type": "code", "source": [
"print('Extracting relations from the whole comedy corpus in batches...')\n",
"extracted_text_comedies = extract_text_batched(comedies,verbose=True)\n",
"print('Extracting relations from the whole tragedy corpus in batches...')\n",
"extracted_text_tragedies = extract_text_batched(tragedies,verbose=True)\n",
"\n",
"print('Extracting sample comedy triplets (all chunks):')\n",
"triplets_sample_comedies = extract_triplets(extracted_text_comedies)\n",
"print(' - sample (up to 100 triples):', triplets_sample_comedies[:100])\n",
"\n",
"print('Extracting sample tragedy triplets (all chunks):')\n",
"triplets_sample_tragedies = extract_triplets(extracted_text_tragedies)\n",
"print(' - sample (up to 100 triples):', triplets_sample_tragedies[:100])"
], "metadata": { "id": "UV1P6mWDW8Mx" }, "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "source": [
"---\n",
"## 4. (Optional) Putting it all together in a knowledge graph\n",
"\n",
"Once you have the taxonomy and a set of horizontal relations, you can represent the results as a knowledge graph. This may be as simple as a CSV file with three columns corresponding to the triple elements.\n",
"\n",
"The taxonomical relations can be represented as special triples, for instance as follows:\n",
"- For every two concepts A and B such that A is a parent of B in the hierarchical clustering tree, add the `(B, is_a, A)` triple to the knowledge graph.\n",
"- For every two concepts C and D such that C and D are siblings (i.e., they have a common parent in the hierarchical clustering tree), add the `(C, similar_to, D), (D, similar_to, C)` triples to the knowledge graph.\n",
"\n",
"If you managed to get labels for your taxonomy, it should be rather straightforward to create lists of triples extracted from the two Shakespearean corpora and store them as two CSVs corresponding to the knowledge graph representations. A minimal, rough sketch along these lines is included below."
], "metadata": { "id": "YKbKZ8lXenz7" } }
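,
{ "cell_type": "markdown", "source": [ "### A possible rudimentary sketch\n", "\n", "The following cell is only a rough, added illustration of the optional step above. It assumes the variables computed earlier in this notebook (`clustering_comedies`, `i2w_dict_comedies`, `triplets_sample_comedies`), labels the internal taxonomy nodes with synthetic `cluster_<i>` names (since no proper labelling was done here), and uses arbitrary CSV file names. Sibling `similar_to` triples are omitted for brevity but could be added analogously." ], "metadata": {} },
{ "cell_type": "code", "source": [
"import csv\n",
"\n",
"# turning the hierarchical clustering tree into is_a triples; leaves are the\n",
"# terms themselves, internal nodes get synthetic cluster_<i> labels since no\n",
"# proper cluster labelling was performed above\n",
"def taxonomy_triples(clustering, i2w_dict):\n",
"    n_samples = len(clustering.labels_)\n",
"    triples = []\n",
"    for i, (left, right) in enumerate(clustering.children_):\n",
"        parent = f'cluster_{i}'\n",
"        for child in (left, right):\n",
"            child_name = (i2w_dict[int(child)] if child < n_samples\n",
"                          else f'cluster_{child - n_samples}')\n",
"            triples.append((child_name, 'is_a', parent))\n",
"    return triples\n",
"\n",
"# writing a list of (subject, predicate, object) triples to a CSV file\n",
"def write_triples(path, triples):\n",
"    with open(path, 'w', newline='') as f:\n",
"        csv.writer(f).writerows(triples)\n",
"\n",
"# the file names below are arbitrary illustrative choices\n",
"write_triples('kg_comedies_taxonomy.csv',\n",
"              taxonomy_triples(clustering_comedies, i2w_dict_comedies))\n",
"write_triples('kg_comedies_relations.csv',\n",
"              [(t['head'], t['type'], t['tail'])\n",
"               for t in triplets_sample_comedies])\n",
"print('Comedy knowledge graph CSVs written.')"
], "metadata": {}, "execution_count": null, "outputs": [] }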
] }