# Similarity Search with Gensim

## Jupyter Notebook tutorial

[Jupyter Notebook](https://jupyter.org/) lets you document your code using [Markdown](https://en.wikipedia.org/wiki/Markdown). Double-click this cell and try editing the markup. When you are happy with your changes, press *Ctrl+Enter* to render it. To save the notebook file, press *Ctrl+S* or select “File” and then “Save and Checkpoint” from the horizontal menu.

To insert your own Markdown and code cells, select “Insert” and then “Insert Cell Below” from the horizontal menu. To set the type of a cell, select “Cell” and then “Cell Type” from the horizontal menu, or use the drop-down list below the menu. You can execute a code cell by pressing *Ctrl+Enter*. Try it with the cell below!

In [None]:
print("Hello, World!")

## Setting up logging

In [None]:
import logging
logging.basicConfig(format='%(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

## Importing Python modules

In [None]:
import sys, os
import re
import json
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import chunkize
from gensim import corpora, models, similarities
from smart_open import smart_open

Notice that all module imports and definitions are global: you can declare a variable in one cell and access it in another. To reset the state of the Python interpreter, select “Kernel” and then “Restart” from the horizontal menu.

## Preprocessing documents

### Loading
The input documents are stored in a *tab-separated values (TSV)* file named `wiki-tabbed.tsv`. Each line describes one document, and each field is a JSON-encoded string. The file has the following format:

```
title_1[TAB]segment_title[TAB]segment_body[TAB]....[TAB]segment_title[TAB]segment_body[NEWLINE]
title_2[TAB]segment_title[TAB]segment_body[TAB]....[TAB]segment_title[TAB]segment_body[NEWLINE]
....
title_k[TAB]segment_title[TAB]segment_body[TAB]....[TAB]segment_title[TAB]segment_body[NEWLINE]
```

We will use the file to produce Python objects in the following format:

``` python
{
    'title': parse_chunk(chunks[0]),
    'content': ' '.join([
        ' '.join([parse_chunk(segment_title), parse_chunk(segment_body)])
        for segment_title, segment_body in chunkize(chunks[1:], 2)
    ])
}
```

where `chunks = line.strip().split('\t')` and `line` is a line in the text file.
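
To make the format concrete, the following is a minimal sketch that parses one made-up line using only the standard library. The sample line, the helper `toy_decode_field`, and all `toy_` names are hypothetical and exist only for illustration; the pairing is done with slicing here instead of `chunkize`.

```python
import json
import re

# Hypothetical sample line: tab-separated fields, each one a JSON-encoded string.
toy_line = '\t'.join(json.dumps(field) for field in [
    'Rabbit',
    'Introduction', 'Rabbits are small\nmammals.',
    'Diet', 'Rabbits eat   grass.',
]) + '\n'

def toy_decode_field(field):
    """Decode one JSON-encoded field and collapse runs of whitespace."""
    return re.sub(r'\s+', ' ', json.loads(field)).strip()

toy_fields = toy_line.strip().split('\t')
# The first field is the title; the rest alternate between segment titles and bodies.
toy_pairs = zip(toy_fields[1::2], toy_fields[2::2])
toy_document = {
    'title': toy_decode_field(toy_fields[0]),
    'content': ' '.join(
        ' '.join([toy_decode_field(segment_title), toy_decode_field(segment_body)])
        for segment_title, segment_body in toy_pairs),
}
print(toy_document)
```

Because the fields are JSON-encoded, a newline inside a segment body is stored as an escape sequence and cannot be mistaken for the end of a line.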

In [None]:
input_filepath = 'wiki-tabbed.tsv'

def parse_chunk(chunk):
    segment = json.loads(chunk)
    # Collapse every run of whitespace (including newlines) into a single space.
    segment = re.sub(r"\s+", " ", segment).strip()
    return segment

def parse_input_line(line):
    chunks = line.strip().split(u'\t')
    title = parse_chunk(chunks[0])
    segments = []
    for (segment_title, segment_body) in chunkize(chunks[1:], 2):
        segments.append(u' '.join([parse_chunk(segment_title), parse_chunk(segment_body)]))
    return title, u' '.join(segments)

def yield_documents(input_filepath):
    """Iterate over input TSV file and yield parsed documents one-by-one"""
    with smart_open(input_filepath, 'rb') as f:
        for line in f:
            title, text = parse_input_line(line.decode('utf-8'))
            yield {
                'title': title,
                'content': text,
            }

Below is a preview of the produced Python objects:

In [None]:
for _, doc in zip(range(10), yield_documents(input_filepath)):
    logger.info("%s: %s" % (doc['title'], doc['content'][:40] + u'...'))
logger.info(u'...')

### Tokenization

Several approaches to tokenization are available. A few basic ones are shown below:

In [None]:
test_text = 'Hello World! How is it going?! Nonexistentword, 21'

logger.info("Simple preprocess:\n%s\n" %
    gensim.utils.simple_preprocess(test_text, deacc=True, min_len=2, max_len=15))
logger.info("Simple preprocess without English stopwords:\n%s\n" % [
    token for token in gensim.utils.simple_preprocess(test_text, deacc=True, min_len=2, max_len=15)
    if token not in STOPWORDS])
logger.info("Lemmatization:\n%s\n" %
    gensim.utils.lemmatize(test_text, min_length=2, max_length=15))
logger.info("Lemmatization without English stopwords:\n%s" %
    gensim.utils.lemmatize(test_text, stopwords=STOPWORDS, min_length=2, max_length=15))
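
As a rough illustration of what simple preprocessing involves, here is a standard-library sketch that lowercases the text, strips accents, keeps alphabetic tokens within a length range, and filters stop words. It approximates the idea and is not gensim's implementation; `toy_rough_tokenize` and the tiny `TOY_STOPWORDS` set are hypothetical names introduced here.

```python
import re
import unicodedata

# Tiny illustrative stop list; a real stop list is much larger.
TOY_STOPWORDS = frozenset(['how', 'is', 'it', 'the', 'a'])

def toy_rough_tokenize(text, min_len=2, max_len=15):
    """Lowercase, strip accents, and keep alphabetic tokens of an allowed length."""
    decomposed = unicodedata.normalize('NFD', text)
    # Drop combining marks, so that an accented letter becomes its base letter.
    without_accents = ''.join(
        ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    candidates = re.findall(r'[^\W\d_]+', without_accents.lower())
    return [token for token in candidates if min_len <= len(token) <= max_len]

toy_tokens = toy_rough_tokenize('Hello World! How is it going?! Nonexistentword, 21')
toy_content_tokens = [token for token in toy_tokens if token not in TOY_STOPWORDS]
print(toy_tokens)
print(toy_content_tokens)
```

The number `21` is dropped because only alphabetic tokens are kept, and the stop words disappear in the second list.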

We will now process the input file into Python objects that contain tokens rather than raw text:

In [None]:
def yield_tokenized_docs(input_filepath):
    """Iterate over input TSV file and yield processed token lists for every document
    
    For every document (line in TSV file) yields:
    { 'title': u'article title',
      'tokens': [u'list', u'of', u'tokens', u'of', u'the', u'article'] }
      
    English stop words are filtered out from the token list.
    """
    for doc in yield_documents(input_filepath):
        yield {
            'title': doc['title'],
            'tokens': [token
                       for token
                       in gensim.utils.simple_preprocess(
                           doc['title'] + u' ' + doc['content'],  # the space keeps the title and the content from fusing
                           deacc=True, min_len=2, max_len=15)
                       if token not in STOPWORDS]
        }

Below is a preview of the produced Python objects:

In [None]:
for _, doc in zip(range(10), yield_tokenized_docs(input_filepath)):
    logger.info("%s: %s ..." % (doc['title'], ", ".join(doc['tokens'][:5])))
logger.info('...')

## Indexing documents

### Building a dictionary

We will use the tokenized documents to build a dictionary and store it in the file named `wiki-tabbed.dict` in the following format:

```
num_docs
id_1[TAB]word_1[TAB]document_frequency_1[NEWLINE]
id_2[TAB]word_2[TAB]document_frequency_2[NEWLINE]
....
id_k[TAB]word_k[TAB]document_frequency_k[NEWLINE]
```

In [None]:
def yield_tokens(input_filepath):
    """Iterate over input TSV file and yield processed token lists for every document
    
    For every document (line in TSV file) yields:
    [u'list', u'of', u'tokens', u'of', u'the', u'article']
      
    English stop words are filtered out from the token list.
    """
    for doc in yield_tokenized_docs(input_filepath):
        yield doc['tokens']

In [None]:
dict_filepath = 'wiki-tabbed.dict'

dictionary = corpora.Dictionary(yield_tokens(input_filepath))
dictionary.save_as_text(dict_filepath)
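
The idea behind the dictionary can be sketched with the standard library: assign each distinct token an integer id and count the number of documents in which it occurs. This is a simplified illustration under made-up data, not gensim's implementation, and the order in which gensim assigns ids may differ; all `toy_` names are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized documents.
toy_docs = [
    ['rabbit', 'pet', 'rabbit'],
    ['pet', 'food'],
    ['rabbit', 'food', 'garden'],
]

toy_token2id = {}
toy_doc_freq = Counter()
for toy_doc_tokens in toy_docs:
    for token in toy_doc_tokens:
        # Give every previously unseen token the next free integer id.
        toy_token2id.setdefault(token, len(toy_token2id))
    # Count each token at most once per document.
    toy_doc_freq.update(set(toy_doc_tokens))

# Mimic the text layout described above: the number of documents,
# followed by one id[TAB]word[TAB]document_frequency line per word.
toy_lines = [str(len(toy_docs))] + [
    '%d\t%s\t%d' % (toy_token2id[token], token, toy_doc_freq[token])
    for token in sorted(toy_token2id)
]
print('\n'.join(toy_lines))
```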

Below is a preview of the contents of the stored dictionary file:

In [None]:
with open(dict_filepath, 'rt') as f:  # close the file once the preview is read
    logger.info("%s..." % u"".join(f.readlines()[:30]))

### Building a corpus

Now that we have a dictionary that assigns each token a unique id, we can convert our documents to the *bag-of-words (BOW)* format:

In [None]:
new_doc = "Rabbit is a favorite pet"
new_vec = dictionary.doc2bow(new_doc.lower().split())
logger.info(new_vec)

Because we removed stop words such as ‘is’ and ‘a’ when preprocessing our input documents, the dictionary contains only the words ‘rabbit’, ‘favorite’, and ‘pet’ from this sentence. Tokens that are not in the dictionary are ignored:

In [None]:
for w in new_doc.lower().split():
    logger.info("%s: %s" % (w, dictionary.token2id[w] if w in dictionary.token2id else '<not in dictionary>'))
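
The conversion to the bag-of-words format can be sketched as follows: count the occurrences of each known token, skip unknown tokens, and return sorted pairs of token id and count. The ids in `toy_ids` are made up and `toy_to_bow` is a hypothetical helper, so the numbers will not match the ids in the real dictionary.

```python
from collections import Counter

# Hypothetical token ids; the real ids come from the dictionary built above.
toy_ids = {'rabbit': 0, 'favorite': 1, 'pet': 2}

def toy_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens without an id."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

toy_bow = toy_to_bow('rabbit is a favorite pet'.split(), toy_ids)
toy_bow_repeated = toy_to_bow('rabbit rabbit unknown pet'.split(), toy_ids)
print(toy_bow)
print(toy_bow_repeated)
```

The word order is lost in this representation; only the counts remain.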

We will now use the tokenized documents and the dictionary to build a corpus and store it in the file named `wiki-tabbed.mm` in the Matrix Market format.

In [None]:
corpus_filepath = 'wiki-tabbed.mm'

corpus = [dictionary.doc2bow(token_list) for token_list in yield_tokens(input_filepath)]
corpora.MmCorpus.serialize(corpus_filepath, corpus)

Below is a preview of the produced corpus:

In [None]:
for _, doc in zip(range(10), corpus):
    logger.info("%s ..." % u', '.join(str(term) for term in doc[:10]))
logger.info(u'...')

#### Applying the TF-IDF transformation
The above corpus stores raw term frequencies. To take the rarity of terms into account, we will multiply these frequencies by the inverse document frequencies:

In [None]:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
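
The weighting can be sketched in plain Python. The sketch assumes one common variant of TF-IDF, which as far as we know matches the defaults of `TfidfModel`: the raw term frequency is multiplied by log2(N / df), and each document vector is then scaled to unit length. The corpus and all `toy_` names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical bag-of-words corpus: lists of (term_id, term_frequency) pairs.
toy_bow_corpus = [
    [(0, 2), (1, 1)],
    [(1, 1), (2, 1)],
    [(0, 1), (2, 1), (3, 1)],
]
toy_num_docs = len(toy_bow_corpus)
# Document frequency: the number of documents in which each term id occurs.
toy_df = Counter(term_id for doc in toy_bow_corpus for term_id, _ in doc)

def toy_tfidf(doc):
    """Weight term frequencies by log2(N / df) and scale the vector to unit length."""
    weighted = [(term_id, tf * math.log2(toy_num_docs / toy_df[term_id]))
                for term_id, tf in doc]
    norm = math.sqrt(sum(weight ** 2 for _, weight in weighted))
    if norm == 0.0:
        return []  # every term occurs in every document, so nothing is informative
    return [(term_id, weight / norm) for term_id, weight in weighted]

toy_tfidf_corpus = [toy_tfidf(doc) for doc in toy_bow_corpus]
for doc in toy_tfidf_corpus:
    print(', '.join('(%d, %.3f)' % pair for pair in doc))
```

In the last document, term 3 occurs in no other document and therefore receives the largest weight.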

Below is a preview of the transformed corpus:

In [None]:
for _, doc in zip(range(10), corpus_tfidf):
    logger.info("%s ..." % u', '.join(str(term) for term in doc[:4]))
logger.info(u'...')

#### Computing a low-rank approximation
To tackle synonymy, we will use *latent semantic analysis (LSA)*, also known as latent semantic indexing (LSI), to reduce the rank of our corpus, viewed as a sparse term-document matrix, by keeping only the four most significant eigenvectors. In practice, we would keep a few hundred eigenvectors. We will store the LSI model in the file named `wiki-tabbed.model.lsi`.

In [None]:
lsi_model_filepath = 'wiki-tabbed.model.lsi'

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=4)
lsi.save(lsi_model_filepath)
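
In matrix terms, the rank reduction can be written as follows. This is the textbook formulation of LSA rather than a description of gensim's internals. The symbols $X$, $U$, $\Sigma$, $V$, $m$, $n$, $k$, and $d$ are notation introduced here, and the projection assumes that the result is not rescaled by the singular values.

$$
X = U \Sigma V^\top, \qquad
X_k = U_k \Sigma_k V_k^\top, \qquad
\hat{d} = U_k^\top d
$$

Here $X \in \mathbb{R}^{m \times n}$ is the TF-IDF term-document matrix with $m$ terms and $n$ documents, and $X = U \Sigma V^\top$ is its singular value decomposition. $U_k$ and $V_k$ keep the first $k$ columns of $U$ and $V$, and $\Sigma_k$ keeps the $k$ largest singular values, with $k = 4$ in this tutorial. The columns of $U_k$ are eigenvectors of $X X^\top$, and $X_k$ is the best rank-$k$ approximation of $X$ in the Frobenius norm. A document vector $d \in \mathbb{R}^m$ is mapped to the $k$-dimensional vector $\hat{d}$.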

Below are the four most significant eigenvectors:

In [None]:
logger.debug(lsi.print_topics())

We will now use the LSI model to reduce the dimensionality of each document in our corpus:

In [None]:
corpus_lsi = lsi[corpus_tfidf]

Below is a preview of the transformed corpus:

In [None]:
for _, doc in zip(range(10), corpus_lsi):
    logger.info("%s" % ", ".join(str(topic) for topic in doc))
logger.info(u'...')

## Similarity search

### Building an index

Using the above corpus, we will build an index for our similarity queries and store it in the file named `wiki-tabbed.index`.

In [None]:
index_filepath = 'wiki-tabbed.index'
index = similarities.MatrixSimilarity(corpus_lsi)
index.save(index_filepath)

### Submitting queries

First, we will choose a query document:

In [None]:
query_doc = u"Rabbits are the best pets."

logger.info(query_doc)

Next, we will vectorize the query document:

In [None]:
query_vec_bow = dictionary.doc2bow(query_doc.lower().split())
query_vec_tfidf = tfidf[query_vec_bow]
query_vec_lsi = lsi[query_vec_tfidf]

logger.info(query_vec_lsi)

We will now compute the cosine similarity between the query document vector and every document in our corpus:

In [None]:
sims = index[query_vec_lsi]
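
Cosine similarity itself is simple to compute, as the following sketch shows. The two-dimensional vectors, the titles, and all `toy_` names are made up for illustration; the real vectors have one component per LSI topic.

```python
import math

def toy_cosine(u, v):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical document vectors in a two-dimensional topic space.
toy_titles = ['Rabbit', 'Hare', 'Carburetor']
toy_doc_vectors = [[0.9, 0.1], [0.6, 0.4], [-0.2, 0.9]]
toy_query_vector = [1.0, 0.0]

toy_sims = [toy_cosine(toy_query_vector, vector) for vector in toy_doc_vectors]
# Rank by similarity only, from the most to the least similar document.
toy_ranking = sorted(zip(toy_sims, toy_titles), key=lambda pair: pair[0], reverse=True)
for similarity, title in toy_ranking:
    print('%.6f\t%s' % (similarity, title))
```

The similarity ranges from -1 to 1 and depends only on the direction of the vectors, not on their length.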

Below are the similarities for the first twenty documents in our corpus:

In [None]:
logger.info("\n".join([
    "%.6f\t%s" % (document_similarity, document["title"])
    for document_similarity, document
    in zip(sims, yield_documents(input_filepath))][:20] + [u'...']))

Below are the ten most similar documents and the ten least similar documents:

In [None]:
# Sort by similarity only, so that ties never fall back to comparing the documents.
ranked = sorted(zip(sims, yield_documents(input_filepath)),
                key=lambda pair: pair[0], reverse=True)

logger.info("\n".join(
    " %.6f\t%s" % (document_similarity, document["title"])
    for document_similarity, document in ranked[:10]))
logger.info(u" ...")
logger.info("\n".join(
    "%.6f\t%s" % (document_similarity, document["title"])
    for document_similarity, document in ranked[-10:]))