# TF-IDF for keyword extraction
This notebook shows how TF-IDF (https://cs.wikipedia.org/wiki/Tf-idf) is used to compute keywords. The technique works for any language we can tokenize (split into words), though it is somewhat better suited to languages with less rich inflection. For richly inflected languages, TF-IDF can be computed on lemmas, but lemmatization itself can be difficult and may introduce errors into the computation.

We will use the packages scikit-learn (https://scikit-learn.org/stable/) for machine learning and pandas (https://pandas.pydata.org/) for data analysis.

In [1]:
!pip install scikit-learn --user
!pip install pandas --user



As our dataset we chose John Milton's Paradise Lost (https://cs.wikipedia.org/wiki/Ztracen%C3%BD_r%C3%A1j).

In the first example we compare four books (one file per character: Adam, Eve, God, Satan) in terms of their keywords.

## TF-IDF on little data

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import os

characters = ['Adam', 'Eve', 'God', 'Satan']
# Each file in sp/ is named after its character.
files = ['sp/' + character for character in characters]

# Read each file, closing the handle as soon as its text is loaded.
contents = []
for file in files:
    with open(file, encoding='utf-8', errors='ignore') as f:
        contents.append(f.read())

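# sublinear_tf=True replaces each raw term count tf with 1 + ln(tf).
vectorizer = TfidfVectorizer(sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(contents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = vectorizer.get_feature_names_out()
# Convert the sparse matrix to a list of rows for the DataFrame.
dense = tfidf_matrix.toarray()
denselist = dense.tolist()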

TfidfVectorizer has produced one vector per book. As the output below shows, a word that does not occur in a book gets the value 0; otherwise it gets its computed TF-IDF score.

In [3]:
df = pd.DataFrame(denselist, columns=feature_names, index=characters)
df

,abandon,abhor,abide,abject,abjure,able,abode,abolish,abominable,abortive,...,yoke,yon,yonder,you,younger,your,yours,youth,zodiac,zone
Adam,0.0,0.012393,0.0,0.0,0.015719,0.016988,0.0,0.015719,0.0,0.012393,...,0.012393,0.020983,0.019575,0.0,0.015719,0.008203,0.0,0.015719,0.015719,0.015719
Eve,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.014044,0.0,0.0,0.014044,0.0,0.0,0.0,0.0
God,0.0,0.0,0.021124,0.0,0.0,0.017101,0.0,0.0,0.0,0.0,...,0.0,0.0,0.013982,0.0,0.0,0.023673,0.0,0.0,0.0,0.0
Satan,0.018467,0.01456,0.01456,0.031268,0.0,0.011787,0.018467,0.0,0.018467,0.01456,...,0.024652,0.01456,0.009637,0.06099,0.0,0.042088,0.038756,0.0,0.0,0.0
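
To make the numbers in such a matrix less opaque, here is a minimal sketch that recomputes TF-IDF by hand on a toy corpus and compares it with `TfidfVectorizer(sublinear_tf=True)`. It assumes scikit-learn's default settings (smoothed IDF, L2 row normalization). The toy sentences and the `toy_`/`manual_` names are illustrative only and are not part of the notebook's data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ['hell and heaven and hell', 'heaven and earth', 'earth and man']

# Raw term counts: one row per document, one column per term.
toy_counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = toy_counts.shape[0]

# Sublinear term frequency: 1 + ln(tf) where tf > 0, otherwise 0.
present = toy_counts > 0
manual_tf = np.zeros_like(toy_counts)
manual_tf[present] = 1.0 + np.log(toy_counts[present])

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
doc_freq = present.sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# Multiply, then scale every row to unit Euclidean length.
manual_tfidf = manual_tf * manual_idf
manual_tfidf /= np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

reference = TfidfVectorizer(sublinear_tf=True).fit_transform(toy_docs).toarray()
matches = np.allclose(manual_tfidf, reference)
print(matches)
```

A term that is absent from a document has a count of 0, so its entry stays 0 through every step, which is why the table above is mostly zeros.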


Listing the words with the highest TF-IDF yields nothing interesting. The only exception is hell, when the results are sorted by God. Try different settings (more than 10 results, sorting by other columns).

In [4]:
s = [pd.Series(df.loc['Adam']),
     pd.Series(df.loc['Eve']),
     pd.Series(df.loc['God']),
     pd.Series(df.loc['Satan'])]
print(pd.concat(s, axis=1, sort=False).sort_values(by='God', ascending=False)[:10])

          Adam       Eve       God     Satan
and   0.057672  0.081629  0.086032  0.063905
to    0.055622  0.080806  0.079828  0.063113
the   0.053973  0.075934  0.076582  0.059465
of    0.053495  0.077397  0.074895  0.059893
hell  0.000000  0.000000  0.071776  0.061426
in    0.049627  0.066868  0.070753  0.055280
all   0.045729  0.061812  0.068678  0.049564
his   0.047336  0.056118  0.065903  0.049866
my    0.046457  0.061812  0.065204  0.045657
thee  0.043509  0.068117  0.065204  0.042088
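
To try other sort columns and result counts without editing the cell each time, the ranking can be wrapped in a small helper. This is only a sketch: `top_terms` is a hypothetical name, and the toy DataFrame stands in for the notebook's real `df`.

```python
import pandas as pd

def top_terms(matrix, sort_by, n=10, rows=None):
    """Return the n terms with the highest score in row `sort_by`.

    matrix : DataFrame with one row per book and one column per term
    rows   : optional list of row labels to show side by side (default: all)
    """
    rows = list(matrix.index) if rows is None else rows
    return matrix.loc[rows].T.sort_values(by=sort_by, ascending=False).head(n)

# Toy stand-in for the real TF-IDF matrix.
toy_matrix = pd.DataFrame(
    [[0.1, 0.0, 0.5],
     [0.2, 0.7, 0.0]],
    index=['Adam', 'God'],
    columns=['and', 'hell', 'she'],
)
toy_result = top_terms(toy_matrix, sort_by='God', n=2)
print(toy_result)
```

With the notebook's data, a call such as `top_terms(df, sort_by='Satan', n=20, rows=['Adam', 'Eve', 'God', 'Satan'])` would correspond to the cell above with a different sort column and length.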


The method fails here because there is so little data (only four books). We can partly compensate by removing stop words. scikit-learn ships only an English stop list, and even then this approach is not strongly recommended. Still, when data is scarce, a stop list can help considerably.

### Using a stop list

In [5]:
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(contents)
feature_names = vectorizer.get_feature_names_out()
dense = tfidf_matrix.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names, index=characters)
s = [pd.Series(df.loc['Adam']),
     pd.Series(df.loc['Eve']),
     pd.Series(df.loc['God']),
     pd.Series(df.loc['Satan'])]
print(pd.concat(s, axis=1, sort=False).sort_values(by='God', ascending=False)[:10])

            Adam       Eve       God     Satan
hell    0.000000  0.000000  0.084177  0.069080
thee    0.048275  0.080265  0.076469  0.047332
powers  0.000000  0.000000  0.076288  0.057062
shall   0.042209  0.065277  0.076043  0.047332
thou    0.049098  0.076748  0.076043  0.053660
son     0.023363  0.000000  0.075663  0.034591
redeem  0.000000  0.000000  0.074982  0.000000
thy     0.047234  0.079546  0.074695  0.052565
heaven  0.039430  0.056232  0.067810  0.054844
man     0.041461  0.034730  0.067082  0.038636


With the stop list, words such as and, to, the are not counted at all. However, words such as thou and thee remain: they are common in Milton's English but absent from present-day stop lists. The only sensible solution is to use more data, i.e. all the books of Paradise Lost.

## TF-IDF on more data

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import os

# Use every file in sp/ (one per character), not only the four chosen earlier.
characters = [name for name in os.listdir('sp/')
              if os.path.isfile(os.path.join('sp/', name))]
print(characters)
files = ['sp/' + character for character in characters]
contents = []
for file in files:
    with open(file, encoding='utf-8', errors='ignore') as f:
        contents.append(f.read())

vectorizer = TfidfVectorizer(sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(contents)
feature_names = vectorizer.get_feature_names_out()
dense = tfidf_matrix.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names, index=characters)

['Death', 'Zophiel', 'God', 'Nisroc', 'Beelzebub', 'Abdiel', 'Raphael', 'Ithuriel', 'Adam_and_Eve', 'Noah', 'Chaos', 'Adam', 'narr', 'Michael', 'Sin', 'SatanSerpent', 'Zephon', 'Gabriel', 'Satan', 'Eve', 'good_angels', 'the_Son', 'Voice_of_the_Apocalypse', 'Moloch', 'Belial', 'Uriel', 'Mammon']


In [7]:
s = [pd.Series(df.loc['Adam']),
     pd.Series(df.loc['Eve']),
     pd.Series(df.loc['God']),
     pd.Series(df.loc['Satan'])]
print(pd.concat(s, axis=1, sort=False).sort_values(by='God', ascending=False)[:10])

                Adam       Eve       God     Satan
redeem      0.000000  0.000000  0.065231  0.000000
my          0.042892  0.056493  0.061486  0.042870
themselves  0.000000  0.000000  0.059956  0.024956
son         0.018825  0.000000  0.058914  0.027968
man         0.037453  0.029244  0.058556  0.035020
friend      0.000000  0.000000  0.057367  0.000000
heir        0.000000  0.000000  0.057367  0.000000
powers      0.000000  0.000000  0.055579  0.043167
thee        0.035744  0.055396  0.054711  0.035164
free        0.031021  0.033409  0.054004  0.034560


The last result makes the most sense, even without a stop list. This is because words such as and, to, thou, thee, which occur in almost all the books, get a low weight (IDF is small because DF is large). You can also list the keywords for each book separately (try it on the earlier computations as well):

In [10]:
pd.Series(df.loc['Adam']).sort_values(ascending=False)[:10]

she       0.043716
eve       0.042968
my        0.042892
me        0.041819
bone      0.040974
love      0.039381
nature    0.038875
death     0.038535
why       0.038443
much      0.038343
Name: Adam, dtype: float64

In [11]:
pd.Series(df.loc['Eve']).sort_values(ascending=False)[:10]

adam       0.064612
forbids    0.063266
love       0.060787
me         0.059527
death      0.058278
my         0.056493
early      0.055850
fruit      0.055478
thee       0.055396
tree       0.055166
Name: Eve, dtype: float64