{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TF-IDF for keyword extraction\n", "This notebook shows how TF-IDF (https://cs.wikipedia.org/wiki/Tf-idf) scores keywords. The technique works for any language we can tokenize (split into words), and it is somewhat better suited to languages with less rich inflection. For richly inflected languages, TF-IDF can be computed on lemmas, but lemmatization itself can be difficult and may introduce errors into the computation.\n", "\n", "We will use the scikit-learn package (https://scikit-learn.org/stable/) for machine learning and pandas (https://pandas.pydata.org/) for data analysis." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: sklearn in /home/popelucha/.local/lib/python3.5/site-packages (0.0)\n", "Requirement already satisfied: scikit-learn in /home/popelucha/.local/lib/python3.5/site-packages (from sklearn) (0.20.3)\n", "Requirement already satisfied: numpy>=1.8.2 in /usr/local/lib/python3.5/dist-packages (from scikit-learn->sklearn) (1.16.2)\n", "Requirement already satisfied: scipy>=0.13.3 in /home/popelucha/.local/lib/python3.5/site-packages (from scikit-learn->sklearn) (1.2.1)\n", "\u001b[33mYou are using pip version 19.0.3, however version 19.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n", "Requirement already satisfied: pandas in /home/popelucha/.local/lib/python3.5/site-packages (0.24.2)\n", "Requirement already satisfied: python-dateutil>=2.5.0 in /usr/local/lib/python3.5/dist-packages (from pandas) (2.8.0)\n", "Requirement already satisfied: numpy>=1.12.0 in /usr/local/lib/python3.5/dist-packages (from pandas) (1.16.2)\n", "Requirement already satisfied: pytz>=2011k in /home/popelucha/.local/lib/python3.5/site-packages (from pandas) (2019.1)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.5/dist-packages 
(from python-dateutil>=2.5.0->pandas) (1.12.0)\n", "\u001b[33mYou are using pip version 19.0.3, however version 19.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" ] } ], "source": [ "!pip install sklearn --user\n", "!pip install pandas --user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the dataset we chose Paradise Lost by John Milton (https://cs.wikipedia.org/wiki/Ztracen%C3%BD_r%C3%A1j).\n", "\n", "In the first example we compare four texts, one file per character (Adam, Eve, God, Satan), by their keywords.\n", "\n", "## TF-IDF on little data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "import pandas as pd\n", "import os\n", "\n", "# One input file per character, stored in the sp/ directory\n", "characters = ['Adam', 'Eve', 'God', 'Satan']\n", "#for root, subdirectories, characters in os.walk('sp/'):\n", "# print(list(characters))\n", "files = ['sp/' + character for character in characters]\n", "contents = [open(file, encoding='utf-8', errors='ignore').read() \n", " for file in files]\n", "\n", "# sublinear_tf=True replaces the raw term count tf with 1 + log(tf)\n", "vectorizer = TfidfVectorizer(sublinear_tf=True)\n", "tfidf_matrix = vectorizer.fit_transform(contents)\n", "# Note: get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0 and removed in 1.2\n", "feature_names = vectorizer.get_feature_names()\n", "dense = tfidf_matrix.todense()\n", "denselist = dense.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TfidfVectorizer has built one vector per text. A word that does not occur in a text has the value 0; otherwise it has its computed TF-IDF value." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table>\n",
 "<thead>\n",
 "<tr><th></th><th>abandon</th><th>abhor</th><th>abide</th><th>abject</th><th>abjure</th><th>able</th><th>abode</th><th>abolish</th><th>abominable</th><th>abortive</th><th>...</th><th>yoke</th><th>yon</th><th>yonder</th><th>you</th><th>younger</th><th>your</th><th>yours</th><th>youth</th><th>zodiac</th><th>zone</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>Adam</th><td>0.000000</td><td>0.012393</td><td>0.000000</td><td>0.000000</td><td>0.015719</td><td>0.016988</td><td>0.000000</td><td>0.015719</td><td>0.000000</td><td>0.012393</td><td>...</td><td>0.012393</td><td>0.020983</td><td>0.019575</td><td>0.00000</td><td>0.015719</td><td>0.008203</td><td>0.000000</td><td>0.015719</td><td>0.015719</td><td>0.015719</td></tr>\n",
 "<tr><th>Eve</th><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>0.000000</td><td>0.000000</td><td>0.014044</td><td>0.00000</td><td>0.000000</td><td>0.014044</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
 "<tr><th>God</th><td>0.000000</td><td>0.000000</td><td>0.021124</td><td>0.000000</td><td>0.000000</td><td>0.017101</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>...</td><td>0.000000</td><td>0.000000</td><td>0.013982</td><td>0.00000</td><td>0.000000</td><td>0.023673</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
 "<tr><th>Satan</th><td>0.018467</td><td>0.014560</td><td>0.014560</td><td>0.031268</td><td>0.000000</td><td>0.011787</td><td>0.018467</td><td>0.000000</td><td>0.018467</td><td>0.014560</td><td>...</td><td>0.024652</td><td>0.014560</td><td>0.009637</td><td>0.06099</td><td>0.000000</td><td>0.042088</td><td>0.038756</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
 "</tbody>\n",
 "</table>\n",
 "<p>4 rows × 4091 columns</p>\n",
 "</div>" ], "text/plain": [ " abandon abhor abide abject abjure able abode \\\n", "Adam 0.000000 0.012393 0.000000 0.000000 0.015719 0.016988 0.000000 \n", "Eve 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "God 0.000000 0.000000 0.021124 0.000000 0.000000 0.017101 0.000000 \n", "Satan 0.018467 0.014560 0.014560 0.031268 0.000000 0.011787 0.018467 \n", "\n", " abolish abominable abortive ... yoke yon yonder \\\n", "Adam 0.015719 0.000000 0.012393 ... 0.012393 0.020983 0.019575 \n", "Eve 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.014044 \n", "God 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.013982 \n", "Satan 0.000000 0.018467 0.014560 ... 0.024652 0.014560 0.009637 \n", "\n", " you younger your yours youth zodiac zone \n", "Adam 0.00000 0.015719 0.008203 0.000000 0.015719 0.015719 0.015719 \n", "Eve 0.00000 0.000000 0.014044 0.000000 0.000000 0.000000 0.000000 \n", "God 0.00000 0.000000 0.023673 0.000000 0.000000 0.000000 0.000000 \n", "Satan 0.06099 0.000000 0.042088 0.038756 0.000000 0.000000 0.000000 \n", "\n", "[4 rows x 4091 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(denselist, columns=feature_names, index=characters)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we list the words with the highest TF-IDF, we get nothing interesting: the top of the ranking is dominated by function words. The only exception is hell, when the results are sorted by God. Try different settings (more than 10 results, sorting by a different column)."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Adam Eve God Satan\n", "and 0.057672 0.081629 0.086032 0.063905\n", "to 0.055622 0.080806 0.079828 0.063113\n", "the 0.053973 0.075934 0.076582 0.059465\n", "of 0.053495 0.077397 0.074895 0.059893\n", "hell 0.000000 0.000000 0.071776 0.061426\n", "in 0.049627 0.066868 0.070753 0.055280\n", "all 0.045729 0.061812 0.068678 0.049564\n", "his 0.047336 0.056118 0.065903 0.049866\n", "my 0.046457 0.061812 0.065204 0.045657\n", "thee 0.043509 0.068117 0.065204 0.042088\n" ] } ], "source": [ "s = [pd.Series(df.loc['Adam']),\n", " pd.Series(df.loc['Eve']),\n", " pd.Series(df.loc['God']),\n", " pd.Series(df.loc['Satan'])]\n", "print(pd.concat(s, axis=1, sort=False).sort_values(by='God', ascending=False)[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method fails because there is so little data (only four texts). We can compensate somewhat by removing stop words. scikit-learn ships only an English stop list, and even so this approach is not particularly recommended. Nevertheless, when data is scarce, a stop list can help a lot.\n", "\n", "### Using a stop list" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Adam Eve God Satan\n", "hell 0.000000 0.000000 0.084177 0.069080\n", "thee 0.048275 0.080265 0.076469 0.047332\n", "powers 0.000000 0.000000 0.076288 0.057062\n", "shall 0.042209 0.065277 0.076043 0.047332\n", "thou 0.049098 0.076748 0.076043 0.053660\n", "son 0.023363 0.000000 0.075663 0.034591\n", "redeem 0.000000 0.000000 0.074982 0.000000\n", "thy 0.047234 0.079546 0.074695 0.052565\n", "heaven 0.039430 0.056232 0.067810 0.054844\n", "man 0.041461 0.034730 0.067082 0.038636\n" ] } ], "source": [ "vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')\n", "tfidf_matrix = vectorizer.fit_transform(contents)\n", "feature_names = vectorizer.get_feature_names()\n", "dense = tfidf_matrix.todense()\n", "denselist = dense.tolist()\n", "df = pd.DataFrame(denselist, columns=feature_names, index=characters)\n", "s = [pd.Series(df.loc['Adam']),\n", " pd.Series(df.loc['Eve']),\n", " pd.Series(df.loc['God']),\n", " pd.Series(df.loc['Satan'])]\n", "print(pd.concat(s, axis=1, sort=False).sort_values(by='God', ascending=False)[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the stop list, words such as and, to, the are not counted at all. However, words such as thou and thee remain: they are common in Milton's English but absent from present-day stop lists. The only reasonable solution is to use more data, i.e. all the texts of Paradise Lost.\n", "\n", "## TF-IDF on more data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Death', 'Zophiel', 'God', 'Nisroc', 'Beelzebub', 'Abdiel', 'Raphael', 'Ithuriel', 'Adam_and_Eve', 'Noah', 'Chaos', 'Adam', 'narr', 'Michael', 'Sin', 'SatanSerpent', 'Zephon', 'Gabriel', 'Satan', 'Eve', 'good_angels', 'the_Son', 'Voice_of_the_Apocalypse', 'Moloch', 'Belial', 'Uriel', 'Mammon']\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "import pandas as pd\n", "import os\n", "\n", "#characters = ['Adam', 'Eve', 'God', 'Satan']\n", "# os.walk yields (dirpath, dirnames, filenames); after the loop, characters holds the file names of the last directory visited\n", "for root, subdirectories, characters in os.walk('sp/'):\n", "    print(list(characters))\n", "files = ['sp/' + character for character in characters]\n", "contents = [open(file, encoding='utf-8', errors='ignore').read() \n", " for file in files]\n", "\n", "vectorizer = TfidfVectorizer(sublinear_tf=True)\n", "tfidf_matrix = vectorizer.fit_transform(contents)\n", "feature_names = vectorizer.get_feature_names()\n", "dense = tfidf_matrix.todense()\n", "denselist = dense.tolist()\n", "df = pd.DataFrame(denselist, columns=feature_names, index=characters)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Adam Eve God Satan\n", "redeem 0.000000 0.000000 0.065231 0.000000\n", "my 0.042892 0.056493 0.061486 0.042870\n", "themselves 0.000000 0.000000 0.059956 0.024956\n", "son 0.018825 0.000000 0.058914 0.027968\n", "man 0.037453 0.029244 0.058556 0.035020\n", "friend 0.000000 0.000000 0.057367 0.000000\n", "heir 0.000000 0.000000 0.057367 0.000000\n", "powers 0.000000 0.000000 0.055579 0.043167\n", "thee 0.035744 0.055396 0.054711 0.035164\n", "free 0.031021 0.033409 0.054004 0.034560\n" ] } ], "source": [ "s = [pd.Series(df.loc['Adam']),\n", " pd.Series(df.loc['Eve']),\n", " pd.Series(df.loc['God']),\n", " 
pd.Series(df.loc['Satan'])]\n", "print(pd.concat(s, axis=1, sort=False).sort_values(by='God', ascending=False)[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last result makes the most sense, even without a stop list. That is because words such as and, to, thou, thee, which occur in almost all texts, get a low weight (IDF is small because DF is large). You can also list the keywords for each text separately (try it with the previous computations as well):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "she 0.043716\n", "eve 0.042968\n", "my 0.042892\n", "me 0.041819\n", "bone 0.040974\n", "love 0.039381\n", "nature 0.038875\n", "death 0.038535\n", "why 0.038443\n", "much 0.038343\n", "Name: Adam, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(df.loc['Adam']).sort_values(ascending=False)[:10]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "adam 0.064612\n", "forbids 0.063266\n", "love 0.060787\n", "me 0.059527\n", "death 0.058278\n", "my 0.056493\n", "early 0.055850\n", "fruit 0.055478\n", "thee 0.055396\n", "tree 0.055166\n", "Name: Eve, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(df.loc['Eve']).sort_values(ascending=False)[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }