# Similarity Search with Gensim

## Jupyter Notebook tutorial

[Jupyter Notebook](https://jupyter.org/) allows you to document your code using [Markdown](https://en.wikipedia.org/wiki/Markdown). Double-click this cell and try editing the markup. When you are happy with your changes, press *Ctrl+Enter* to render the markup. To update the Jupyter Notebook project file, press *Ctrl+S* or select “File” and then “Save and Checkpoint” from the horizontal menu.

To insert your own Markdown and code cells, select “Insert” and then “Insert Cell Below” from the horizontal menu. You can set the type of a cell by selecting “Cell” and then “Cell Type” from the horizontal menu, or by using the drop-down list below the menu. To execute a code cell, press *Ctrl+Enter*. Try it with the cell below!

In [1]:
print("Hello, World!")

Hello, World!


## Set up logging
Logging allows us to see messages from the packages we are using.

In [2]:
import logging
logging.basicConfig(format="%(message)s", level=logging.INFO)

In [3]:
logger = logging.getLogger(__name__)
logger.info("This is what a message from a package looks like.")

This is what a message from a package looks like.


## Import Python modules

In [4]:
import sys, os
import re
import json
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import chunkize
from gensim import corpora, models, similarities
from smart_open import open as smart_open



Notice that all module imports and definitions are global. For instance, you can declare a variable in one cell and access it in another cell:

In [5]:
a = "Hello"
b = "World"

In [6]:
print("%s, %s!" % (a, b))

Hello, World!


To reset the state of the Python interpreter, select “Kernel” and then “Restart” from the horizontal menu.

## Preprocessing documents

### Loading
The input documents are stored in a *tab-separated values (TSV)* file named `wiki-tabbed.tsv`. Each line holds one document, and every field is a JSON-encoded string. The lines have the following format:

```
title_1[TAB]segment_title[TAB]segment_body[TAB]....[TAB]segment_title[TAB]segment_body[NEWLINE]
title_2[TAB]segment_title[TAB]segment_body[TAB]....[TAB]segment_title[TAB]segment_body[NEWLINE]
....
title_k[TAB]segment_title[TAB]segment_body[TAB]....[TAB]segment_title[TAB]segment_body[NEWLINE]
```

We will use the file to produce Python objects in the following format:

``` python
{
    "title": parse_chunk(chunks[0]),
    "content": " ".join([
        " ".join([parse_chunk(segment_title), parse_chunk(segment_body)])
        for segment_title, segment_body in chunkize(chunks[1:], 2)
    ])
}
```

where `chunks = line.strip().split("\t")` and `line` is a line in the text file.

In [7]:
input_filepath = "wiki-tabbed.tsv"

def parse_chunk(chunk):
    """Decode a JSON-encoded field and collapse runs of whitespace into single spaces"""
    segment = json.loads(chunk)
    segment = re.sub(r"\s+", " ", segment).strip()
    return segment

def parse_input_line(line):
    chunks = line.strip().split("\t")
    title = parse_chunk(chunks[0])
    segments = []
    for (segment_title, segment_body) in chunkize(chunks[1:], 2):
        segments.append(" ".join([
            parse_chunk(segment_title),
            parse_chunk(segment_body)
        ]))
    return title, " ".join(segments)

def yield_documents(input_filepath):
    """Iterate over input TSV file and yield parsed documents one-by-one"""
    with smart_open(input_filepath, "rb") as f:
        for line in f:
            title, text = parse_input_line(line.decode("utf-8"))
            yield {
                "title": title,
                "content": text,
            }

Below is a preview of the produced Python objects:

In [8]:
for _, doc in zip(range(10), yield_documents(input_filepath)):
    print("%s: %s …" % (doc["title"], doc["content"][:40]))
print("⋮")

1904 in baseball: Introduction  Champions *American League …
1932 U.S. National Championships – Men's Singles: Introduction First-seeded Ellsworth Vine …
1936 Wimbledon Championships – Men's Singles: Introduction Fred Perry (GBR) defeated G …
1938 Wimbledon Championships – Men's Singles: Introduction Don Budge (USA) defeated Bu …
1995–96 United States network television schedule (Saturday morning): Introduction This was the United States  …
1999 in home video: Introduction The following events occurr …
2001 Laurence Olivier Awards: Introduction The '''2001 Laurence Olivie …
2007 in home video: Introduction '''2007 in home video''' wa …
2007 MOJO Awards: Introduction The 2007 MOJO Honours List  …
2010 South Sydney Rabbitohs season: Introduction The '''2010 South Sydney Ra …
⋮
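
To make the input format concrete, the sketch below parses one hand-written line. The sample line and the helper `pair_up` are made up for illustration; `pair_up` stands in for `chunkize(chunks[1:], 2)` so that the example needs only the standard library. It assumes, as `parse_chunk` above does, that every field is a JSON-encoded string.

```python
import json
import re

def parse_chunk(chunk):
    """Decode one JSON-encoded field and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", json.loads(chunk)).strip()

def pair_up(items):
    """Group a flat list into consecutive (segment_title, segment_body) pairs."""
    return zip(items[0::2], items[1::2])

# A made-up line: a title followed by two (segment_title, segment_body) pairs.
fields = ["Rabbit", "Introduction", "Rabbits are  small\nmammals.", "Diet", "They eat grass."]
sample_line = "\t".join(json.dumps(field) for field in fields) + "\n"

chunks = sample_line.strip().split("\t")
sample_doc = {
    "title": parse_chunk(chunks[0]),
    "content": " ".join(
        " ".join([parse_chunk(segment_title), parse_chunk(segment_body)])
        for segment_title, segment_body in pair_up(chunks[1:])
    ),
}
print(sample_doc)
```

JSON encoding is what keeps tabs and newlines inside a segment from breaking the one-document-per-line layout: they are stored as escape sequences and only restored by `json.loads`.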


### Tokenization

Several approaches to tokenization are available. Two basic ones are shown below:

In [9]:
test_text = "Hello World! How is it going?! Nonexistentword, 21"

print(
    "Simple preprocess:\n\n    %s\n" %
    ", ".join(gensim.utils.simple_preprocess(
        test_text,
        deacc=True,
        min_len=2,
        max_len=15,
    ))
)
print(
    "Simple preprocess using English stopwords:\n\n    %s" %
    ", ".join([
        token for token in gensim.utils.simple_preprocess(
            test_text,
            deacc=True,
            min_len=2,
            max_len=15,
        )
        if token not in STOPWORDS
    ])
)

Simple preprocess:

    hello, world, how, is, it, going, nonexistentword

Simple preprocess using English stopwords:

    hello, world, going, nonexistentword
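
For intuition, here is a rough standard-library approximation of what the preprocessing above does: lowercase the text, strip accents, keep alphabetic tokens within the length bounds, and optionally filter stop words. The function `sketch_preprocess` and the tiny `SAMPLE_STOPWORDS` set are hypothetical names introduced for this sketch; this is not gensim's implementation, and edge cases may differ.

```python
import re
import unicodedata

# A tiny illustrative stop list; gensim's STOPWORDS is much larger.
SAMPLE_STOPWORDS = frozenset({"how", "is", "it", "the", "are"})

def sketch_preprocess(text, min_len=2, max_len=15):
    """Lowercase, strip accents, and keep alphabetic tokens within the length bounds."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Combining marks (category "Mn") carry the accents after NFD decomposition.
    without_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    tokens = re.findall(r"[^\W\d_]+", without_accents)
    return [token for token in tokens if min_len <= len(token) <= max_len]

tokens = sketch_preprocess("Hello World! How is it going?! Nonexistentword, 21")
filtered = [token for token in tokens if token not in SAMPLE_STOPWORDS]
print(tokens)
print(filtered)
```

On the test sentence, the sketch yields the same two token lists as the outputs above: punctuation and the number `21` disappear, and the stop words `how`, `is`, and `it` are removed in the second list.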


We will now read the same input file to produce Python objects that contain tokens instead of raw text:

In [10]:
def yield_tokenized_docs(input_filepath):
    """Iterate over input TSV file and yield processed token lists for every document
    
    For every document (line in TSV file) yields:
    {
        "title": "article title",
        'tokens': ["list", "of", "tokens", "in", "the", "article"]
    }
      
    English stop words are filtered out from the token list.
    """
    for doc in yield_documents(input_filepath):
        yield {
            "title": doc["title"],
            "tokens": [
                token
                for token
                in gensim.utils.simple_preprocess(
                    doc["title"] + doc["content"],
                    deacc=True,
                    min_len=2,
                    max_len=15,
                )
                if token not in STOPWORDS
            ]
        }

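Below is a preview of the produced Python objects. Because the title and the content are concatenated without a separator, the last word of the title is fused with the first word of the content (for example, `baseballIntroduction`); such fused tokens exceed `max_len` here and are dropped: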

In [11]:
for _, doc in zip(range(10), yield_tokenized_docs(input_filepath)):
    print("%s:%s …" % (doc["title"], ", ".join(doc["tokens"][:4])))
print("⋮")

1904 in baseball:champions, american, league, boston …
1932 U.S. National Championships – Men's Singles:national, championships, men, seeded …
1936 Wimbledon Championships – Men's Singles:wimbledon, championships, men, fred …
1938 Wimbledon Championships – Men's Singles:wimbledon, championships, men, budge …
1995–96 United States network television schedule (Saturday morning):united, states, network, television …
1999 in home video:home, following, events, occurred …
2001 Laurence Olivier Awards:laurence, olivier, laurence, olivier …
2007 in home video:home, home, video, characterized …
2007 MOJO Awards:mojo, mojo, honours, list …
2010 South Sydney Rabbitohs season:south, sydney, rabbitohs, south …
⋮


## Indexing documents

### Building a dictionary

We will use the tokenized documents to build a dictionary and store it in the file named `wiki-tabbed.dict` in the following format:

```
num_docs
id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
....
id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
```

In [12]:
def yield_tokens(input_filepath):
    """Iterate over input TSV file and yield processed token lists for every document
    
    For every document (line in TSV file) yields:
    ["list", "of", "tokens", "in", "the", "article"]
      
    English stop words are filtered out from the token list.
    """
    for doc in yield_tokenized_docs(input_filepath):
        yield doc["tokens"]

In [13]:
dict_filepath = "wiki-tabbed.dict"

dictionary = corpora.Dictionary(yield_tokens(input_filepath))
dictionary.save_as_text(dict_filepath)

adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(37668 unique tokens: ['abe', 'addie', 'akers', 'al', 'alleghenys']...) from 250 documents (total 247438 corpus positions)
saving dictionary mapping to wiki-tabbed.dict


Below is a preview of the contents of the stored dictionary file:

In [14]:
with open(dict_filepath, "rt") as f:
    for line in f.readlines()[:30]:
        print(line[:-1])
print("⋮")

250
37290	aa	1
35333	aaa	2
5811	aacharya	1
26572	aad	1
20337	aadhavan	1
31328	aadi	1
5812	aadmi	1
22035	aaero	1
29313	aag	1
31329	aagadu	1
27858	aah	1
29314	aaiye	1
25918	aaliyah	2
27989	aalon	1
20338	aalu	1
18184	aamar	1
20339	aambala	1
3175	aami	1
29315	aansoo	1
29316	aaram	1
36814	aardman	1
1908	aardvark	3
30523	aarohanam	1
10711	aaron	9
29710	aarons	1
22036	aaru	1
20340	aaruyire	1
20341	aasai	1
20342	aasal	1
⋮
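
The idea behind the dictionary can be sketched in a few lines of plain Python: every distinct token gets an integer id, and we count in how many documents each token occurs. The function `build_vocabulary` and the sample documents are hypothetical; the order in which ids are handed out is an illustrative choice and is not guaranteed to match gensim's.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Return (num_docs, token2id, doc_freq) for an iterable of token lists."""
    token2id = {}
    doc_freq = Counter()
    num_docs = 0
    for tokens in tokenized_docs:
        num_docs += 1
        unique_tokens = sorted(set(tokens))
        for token in unique_tokens:
            # Illustrative choice: an unseen token gets the next free id.
            token2id.setdefault(token, len(token2id))
        doc_freq.update(unique_tokens)  # one count per document, not per occurrence
    return num_docs, token2id, doc_freq

sample_docs = [["rabbit", "giant", "rabbit"], ["rabbit", "pig"]]
num_docs, token2id, doc_freq = build_vocabulary(sample_docs)

# Lay the result out like the text file: document count, then id, word, frequency.
lines = [str(num_docs)] + [
    "%d\t%s\t%d" % (token2id[word], word, doc_freq[word])
    for word in sorted(token2id)
]
print("\n".join(lines))
```

As in the stored file, the entries are listed alphabetically by word, which is why the ids in the first column appear out of order.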


### Building a corpus

We will now use the tokenized documents and the dictionary to build a corpus and store it in the file named `wiki-tabbed.mm` in the Matrix Market format.

In [15]:
corpus_filepath = "wiki-tabbed.mm"

corpus = [dictionary.doc2bow(token_list) for token_list in yield_tokens(input_filepath)]
corpora.MmCorpus.serialize(corpus_filepath, corpus)

storing corpus in Matrix Market format to wiki-tabbed.mm
saving sparse matrix to wiki-tabbed.mm
PROGRESS: saving document #0
saved 250x37668 matrix, density=1.259% (118555/9417000)
saving MmCorpus index to wiki-tabbed.mm.index


Below is a preview of the produced corpus:

In [16]:
for _, doc in zip(range(10), corpus):
    print("%s …" % ", ".join(str(term) for term in doc[:10]))
print("⋮")

(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 2), (6, 1), (7, 1), (8, 1), (9, 4) …
(26, 1), (61, 1), (76, 2), (141, 1), (146, 1), (149, 1), (156, 1), (198, 2), (257, 2), (305, 1) …
(12, 1), (26, 1), (61, 1), (76, 2), (141, 1), (149, 3), (190, 1), (253, 1), (305, 1), (373, 1) …
(26, 2), (61, 2), (76, 2), (141, 1), (305, 1), (381, 1), (384, 1), (398, 2), (401, 1), (402, 1) …
(44, 2), (60, 2), (61, 2), (113, 2), (194, 2), (261, 10), (264, 1), (322, 1), (328, 1), (391, 4) …
(9, 8), (12, 1), (16, 10), (18, 1), (25, 11), (26, 1), (61, 3), (68, 1), (73, 1), (75, 1) …
(43, 2), (61, 1), (110, 1), (146, 1), (154, 5), (184, 1), (195, 1), (197, 1), (230, 2), (257, 28) …
(9, 4), (11, 1), (12, 3), (16, 9), (25, 8), (29, 1), (40, 2), (43, 1), (44, 2), (49, 1) …
(43, 2), (197, 1), (199, 1), (202, 1), (215, 1), (247, 1), (271, 1), (305, 1), (337, 2), (379, 1) …
(74, 1), (76, 8), (138, 1), (156, 4), (179, 3), (198, 1), (210, 1), (223, 25), (227, 4), (257, 3) …
⋮
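
Each line of the preview is a bag-of-words vector: a sorted list of `(token_id, count)` pairs. A minimal sketch of that conversion, using a hypothetical `sketch_doc2bow` helper and a made-up three-word vocabulary, could look as follows. It assumes that tokens missing from the vocabulary are simply ignored.

```python
from collections import Counter

def sketch_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from token2id."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

sample_token2id = {"giant": 0, "rabbit": 1, "pig": 2}
bow = sketch_doc2bow(["rabbit", "giant", "rabbit", "unicorn"], sample_token2id)
print(bow)  # [(0, 1), (1, 2)]
```

Word order is lost in this representation, and only the non-zero counts are stored, which is what makes the corpus a sparse matrix.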


#### Applying the TF-IDF transformation
The above corpus contains raw term frequencies. To take the rarity of terms into account, we will multiply them by the inverse document frequencies:

In [17]:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

collecting document frequencies
PROGRESS: processing document #0
calculating IDF weights for 250 documents and 37667 features (118555 matrix non-zeros)


Below is a preview of the transformed corpus:

In [18]:
for _, doc in zip(range(10), corpus_tfidf):
    print("%s …" % ", ".join(str(term) for term in doc[:4]))
print("⋮")

(0, 0.023596182123613318), (1, 0.02912298065150573), (2, 0.033303842340781896), (3, 0.014940170700745938) …
(26, 0.05172170917148428), (61, 0.004884230592229086), (76, 0.10344341834296857), (141, 0.02906244118159573) …
(12, 0.05321942907370798), (26, 0.05761356119551912), (61, 0.005440615219915433), (76, 0.11522712239103824) …
(26, 0.12186543068089575), (61, 0.011508105091681037), (76, 0.12186543068089575), (141, 0.03423810782712893) …
(44, 0.06988466131833326), (60, 0.09717401059690085), (61, 0.006171739462649299), (113, 0.04259531203976567) …
(9, 0.00610023503385314), (12, 0.002136327689000559), (16, 0.012544293217431167), (18, 0.0012842286442065526) …
(43, 0.01840421468567951), (61, 0.0012306200522273192), (110, 0.017048196453743784), (146, 0.009462849835089581) …
(9, 0.005454665081894212), (11, 0.004647484097734765), (12, 0.011461478435510045), (16, 0.02019018153543175) …
(43, 0.06325890628156312), (197, 0.02881908429682542), (199, 0.036096023710590404), (202, 0.023465052357321443)
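
The weighting can be sketched in plain Python. The version below assumes that each count is multiplied by log2(N / df), where N is the number of documents and df the document frequency of the term, and that each document vector is then scaled to unit Euclidean length. We believe this matches the default settings of `TfidfModel`, but treat it as an assumption and check the gensim documentation before relying on it. The helper `sketch_tfidf` and the sample numbers are made up.

```python
import math

def sketch_tfidf(bow, doc_freq, num_docs):
    """Multiply counts by log2(num_docs / df), drop zero weights, scale to unit length."""
    weighted = []
    for term_id, count in bow:
        weight = count * math.log2(num_docs / doc_freq[term_id])
        if weight > 0.0:
            weighted.append((term_id, weight))
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    return [(term_id, weight / norm) for term_id, weight in weighted]

# Term 0 occurs in 1 of 4 documents, term 1 in 2, and term 2 in all 4.
sample_doc_freq = {0: 1, 1: 2, 2: 4}
sample_bow = [(0, 1), (1, 2), (2, 3)]
weights = sketch_tfidf(sample_bow, sample_doc_freq, num_docs=4)
print(weights)
```

Term 2 occurs in every document, so its inverse document frequency is zero and it drops out, no matter how often it appears in the document.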

#### Computing a low-rank approximation
To tackle synonymy, we will use *latent semantic analysis (LSA)*, also known as latent semantic indexing (LSI), to reduce the rank of our corpus, viewed as a sparse term-document matrix, by keeping only the four most significant eigenvectors. In practice, we would keep a few hundred eigenvectors. We will store the LSI model in the file named `wiki-tabbed.model.lsi`.
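
In matrix terms, and as a sketch in the usual textbook notation (the symbols below are not taken from the code): if $A$ is the TF-IDF-weighted term-document matrix with $m$ terms and $n$ documents, the truncated singular value decomposition keeps only the $k$ largest singular values,

$$A \approx A_k = U_k \Sigma_k V_k^\top, \qquad U_k \in \mathbb{R}^{m \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k \in \mathbb{R}^{n \times k},$$

with $k = 4$ in this tutorial. The columns of $U_k$ are the topics, and a document vector $d \in \mathbb{R}^m$ is mapped into the $k$-dimensional topic space by

$$\hat{d} = U_k^\top d.$$

Some formulations additionally multiply by $\Sigma_k^{-1}$; which convention a given library follows is worth checking in its documentation.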

In [19]:
lsi_model_filepath = 'wiki-tabbed.model.lsi'

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=4)
lsi.save(lsi_model_filepath)

using serial LSI version on this node
updating model with new documents
preparing a new chunk of documents
using 100 extra samples and 2 power iterations
1st phase: constructing (37668, 104) action matrix
orthonormalizing (37668, 104) action matrix
2nd phase: running dense svd on (104, 250) matrix
computing the final decomposition
keeping 4 factors (discarding 89.820% of energy spectrum)
processed documents up to #250
topic #0(2.149): 0.304*"bugs" + 0.203*"rabbit" + 0.138*"rabbits" + 0.129*"film" + 0.089*"daffy" + 0.087*"hare" + 0.083*"cartoons" + 0.083*"elmer" + 0.082*"series" + 0.082*"cartoon"
topic #1(1.773): -0.565*"bugs" + -0.162*"daffy" + -0.150*"marvin" + -0.148*"elmer" + -0.133*"hare" + -0.118*"looney" + -0.108*"cartoon" + -0.107*"tunes" + 0.101*"lucas" + -0.100*"cartoons"
topic #2(1.676): 0.477*"rabbit" + 0.465*"rabbits" + 0.190*"flemish" + 0.133*"giant" + 0.130*"breeds" + 0.108*"pigs" + 0.107*"breed" + 0.095*"dirty" + -0.092*"lucas" + 0.088*"palomino"
topic #3(1.529): -0.748*

Below are the four most significant eigenvectors, each shown with its ten highest-weighted terms:

In [20]:
%%capture
lsi.print_topics()

topic #0(2.149): 0.304*"bugs" + 0.203*"rabbit" + 0.138*"rabbits" + 0.129*"film" + 0.089*"daffy" + 0.087*"hare" + 0.083*"cartoons" + 0.083*"elmer" + 0.082*"series" + 0.082*"cartoon"
topic #1(1.773): -0.565*"bugs" + -0.162*"daffy" + -0.150*"marvin" + -0.148*"elmer" + -0.133*"hare" + -0.118*"looney" + -0.108*"cartoon" + -0.107*"tunes" + 0.101*"lucas" + -0.100*"cartoons"
topic #2(1.676): 0.477*"rabbit" + 0.465*"rabbits" + 0.190*"flemish" + 0.133*"giant" + 0.130*"breeds" + 0.108*"pigs" + 0.107*"breed" + 0.095*"dirty" + -0.092*"lucas" + 0.088*"palomino"
topic #3(1.529): -0.748*"lucas" + -0.307*"hampshire" + -0.195*"surrey" + -0.130*"match" + -0.127*"gentlemen" + -0.123*"class" + -0.101*"matches" + -0.096*"gayvn" + -0.086*"charles" + -0.085*"scored"


We will now use the LSI model to reduce the dimensionality of each document in our corpus:

In [21]:
corpus_lsi = lsi[corpus_tfidf]

Below is a preview of the transformed corpus:

In [22]:
for _, doc in zip(range(10), corpus_lsi):
    print("%s …" % ", ".join(str(topic) for topic in doc))
print("⋮")

(0, 0.20122455464418448), (1, 0.16584014445272607), (2, -0.13987366063146495), (3, 0.06014460695310348) …
(0, 0.05518082967244984), (1, 0.04095958093575281), (2, -0.03359174734204516), (3, -0.041494521845983234) …
(0, 0.053136039668064614), (1, 0.04443766655294992), (2, -0.03973746367375546), (3, -0.05514651550621913) …
(0, 0.04530633894161795), (1, 0.035826158385849405), (2, -0.03772194189179351), (3, -0.05808979690410896) …
(0, 0.13077581407610134), (1, 0.014702914261107973), (2, -0.050080614878756674), (3, -0.0014690083157765392) …
(0, 0.18110015819997258), (1, 0.009573050837211702), (2, -0.07920221191448526), (3, 0.025066607865057555) …
(0, 0.11239321512153737), (1, 0.0797958883368783), (2, -0.04299012096135939), (3, 0.01894507243297536) …
(0, 0.26219502499664754), (1, 0.050157414416113176), (2, -0.11226207009887028), (3, 0.02118290047111367) …
(0, 0.12351095243928138), (1, 0.08765190313728988), (2, -0.04590565090052868), (3, 0.02330877532747441) …
(0, 0.05024350023800776), (1, 0.0

## Similarity Search

### Building an index

Using the above corpus, we will build an index for our similarity queries and store it in the file named `wiki-tabbed.index`.

In [23]:
index_filepath = "wiki-tabbed.index"
index = similarities.MatrixSimilarity(corpus_lsi)
index.save(index_filepath)

scanning corpus to determine the number of features (consider setting `num_features` explicitly)
creating matrix with 250 documents and 4 features
saving MatrixSimilarity object under wiki-tabbed.index, separately None
saved wiki-tabbed.index


### Submitting queries

First, we will choose a query document:

In [24]:
query_doc = "Rabbits are the best pets."

print(query_doc)

Rabbits are the best pets.


Next, we will vectorize the query document:

In [25]:
query_vec_bow = dictionary.doc2bow(query_doc.lower().split())
query_vec_tfidf = tfidf[query_vec_bow]
query_vec_lsi = lsi[query_vec_tfidf]

print("%s" % ", ".join(str(topic) for topic in query_vec_lsi))

(0, 0.14857679111102948), (1, 0.07574671487640808), (2, 0.4310759718122766), (3, -0.03194901546203065)
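
Note that the query above is split on whitespace, whereas the corpus was tokenized with `simple_preprocess`. The standard-library sketch below shows the difference on our query; the regular expression is only a stand-in for alphabetic tokenization and is not gensim's implementation.

```python
import re

query_doc = "Rabbits are the best pets."

# Splitting on whitespace keeps punctuation attached to the neighbouring word.
whitespace_tokens = query_doc.lower().split()
# Keeping only runs of letters separates the word from the trailing period.
alphabetic_tokens = re.findall(r"[^\W\d_]+", query_doc.lower())

print(whitespace_tokens)  # ['rabbits', 'are', 'the', 'best', 'pets.']
print(alphabetic_tokens)  # ['rabbits', 'are', 'the', 'best', 'pets']
```

The dictionary contains only tokens without punctuation, so `pets.` cannot match any entry and does not contribute to the query vector. Tokenizing queries the same way as the corpus avoids losing such words.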


We will now compute the cosine similarity between the query document vector and every document in our corpus:

In [26]:
sims = index[query_vec_lsi]

Below are the similarities for the first twenty documents in our corpus:

In [27]:
print(
    "\n".join([
        " %.6f\t%s" % (document_similarity, document["title"])
        for document_similarity, document
        in zip(sims, yield_documents(input_filepath))
    ][:20])
)
print("⋮")

 -0.141234	1904 in baseball
 -0.045977	1932 U.S. National Championships – Men's Singles
 -0.091345	1936 Wimbledon Championships – Men's Singles
 -0.118703	1938 Wimbledon Championships – Men's Singles
 -0.015292	1995–96 United States network television schedule (Saturday morning)
 -0.079103	1999 in home video
 0.053446	2001 Laurence Olivier Awards
 -0.046940	2007 in home video
 0.060133	2007 MOJO Awards
 0.051320	2010 South Sydney Rabbitohs season
 0.157468	2015–16 South Dakota State Jackrabbits men's basketball team
 0.063630	31st AVN Awards
 0.465357	3D modeling
 0.288960	Action Woman
 0.014020	Akira Watase
 -0.014162	Alexander Conti
 0.945664	Alpaca
 -0.029543	Anna Massey
 0.087325	Arvind Gaur
 -0.054713	A Star Is Bored
⋮
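
The similarities can be checked by hand. The sketch below computes the cosine similarity between the LSI vector of the query and the LSI vector of the first document, both copied from the outputs above. The helper `cosine_similarity` is written here for illustration and uses only the standard library; the index itself may compute the value differently, for example in lower floating-point precision.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# LSI vectors copied from the outputs above.
query_lsi = [0.14857679111102948, 0.07574671487640808, 0.4310759718122766, -0.03194901546203065]
first_doc_lsi = [0.20122455464418448, 0.16584014445272607, -0.13987366063146495, 0.06014460695310348]

similarity = cosine_similarity(query_lsi, first_doc_lsi)
print("%.4f" % similarity)
```

The result agrees, up to rounding, with the similarity reported above for “1904 in baseball”. The query loads heavily on topic #2 (rabbits), whereas the first document has a negative weight there, which explains the negative similarity.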


Below are the ten most similar documents, followed by ten of the least similar documents:

In [28]:
print(
    "\n".join([
        " %.6f\t%s" % (document_similarity, document["title"])
        for document_similarity, document
        in sorted(zip(sims, yield_documents(input_filepath)), reverse=True)
    ][:10])
)
print("⋮")
print(
    "\n".join([
        "%.6f\t%s" % (document_similarity, document["title"])
        for document_similarity, document
        in sorted(zip(sims, yield_documents(input_filepath)), reverse=False)
    ][10:0:-1])
)

 0.997099	British Giant rabbit
 0.996052	Flemish Giant rabbit
 0.995826	Svendborg Rabbits
 0.994238	Domestic rabbit
 0.993525	Harlequin rabbit
 0.992173	Rabbit
 0.988328	Palomino rabbit
 0.978473	Pig
 0.978121	Guinea pig
 0.977824	Dirty Little Rabbits (album)
⋮
-0.127274	Celebrity doll
-0.128262	Manfred Zapatka
-0.134380	Victor Yerrid
-0.138270	Eclipse (greyhounds)
-0.141234	1904 in baseball
-0.152070	List of download-only PlayStation 4 games
-0.177394	List of Disney Channel series
-0.257799	Scottish Greyhound Derby
-0.282571	Select Stakes (greyhounds)
-0.304105	Enticement (1925 film)
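
Two details of the ranking code above are easy to miss. In a list sorted in ascending order, the slice `[10:0:-1]` covers the indices 10 down to 1 and therefore skips index 0, the single least similar document. Also, sorting `(similarity, document)` tuples falls back to comparing the documents themselves whenever two similarities are equal, which fails for dictionaries. A possible alternative, sketched here with a hypothetical helper and made-up scores, ranks by the similarity alone:

```python
import heapq

def most_and_least_similar(similarities, titles, k=10):
    """Return the k highest-scoring and the k lowest-scoring (similarity, title) pairs."""
    scored = list(zip(similarities, titles))

    def by_score(pair):
        return pair[0]  # compare on the similarity only, never on the title

    top = heapq.nlargest(k, scored, key=by_score)      # highest similarity first
    bottom = heapq.nsmallest(k, scored, key=by_score)  # lowest similarity first
    return top, bottom

sample_sims = [0.9, -0.3, 0.1, 0.5, -0.1]
sample_titles = ["A", "B", "C", "D", "E"]
top, bottom = most_and_least_similar(sample_sims, sample_titles, k=2)
print(top)     # [(0.9, 'A'), (0.5, 'D')]
print(bottom)  # [(-0.3, 'B'), (-0.1, 'E')]
```

For a corpus of this size a full sort is just as fast; `heapq` mainly helps when only a few results are needed from a large collection.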
