Inverted Indexing
for Information retrieval
Adam Hadraba
Table of contents
• Motivation
• Inverted index
• Vector space model
• Constructing scalable inverted index
Motivation
• Sequential scan is not efficient
• Looking for structure that provides background for:
• Processing large number of queries over massive
amount of data each second
• Data set changes
• Different kinds of queries
• Ranking results
• “Fast” response & dealing w/ hardware limitations
Inverted index
• Most common indexing method used in IR systems
• Way to avoid linearly scanning the texts
• Index in advace
• Widely used in search engines
• Normally, documents – lists of words
• Inverted index — for each word lists of documents
Creating Inverted index
1. Collect documents to be indexed
2. Tokenize the text
3. Preprocessing
4. Indexing
Size
• Dictionary:
• Heap‘s law: V = O(nβ), 0.4 < β < 0.6
• TREC – 2: 1GB text, 5MB dictionary
• Postings
• Worst case – one per ocurrence of a word in a text: O(n)
• Inverted index are big – typically 10-100% the size of
collection of documents
► Most of the time compression is needed
Search subsystem
Index
database
parse query
terms
query tokens
operations*
ranking*
relevant
document set
ranked
document set
retrieved
document set
* optional
operation
Preprocessing*
Index
database
Search subsystem
parse query
terms
query tokens
operations*
ranking*
relevant
document set
ranked
document set
retrieved
document set
* optional
operation
Preprocessing*
Parsed query:
Laboratoř AND DISA
Index
database
Search subsystem
parse query
terms
query tokens
operations*
ranking*
relevant
document set
ranked
document set
retrieved
document set
* optional
operation
Preprocessing*
Preprocessed query:
lab AND disa
Index
database
Search subsystem
parse query
terms
query tokens
operations*
ranking*
relevant
document set
ranked
document set
retrieved
document set
* optional
operation
Preprocessing*
Relevant document set:
1, 2, 5, 10…
2, 4, 5, 7…
Index
database
Search subsystem
parse query
terms
query tokens
operations*
ranking*
relevant
document set
ranked
document set
retrieved
document set
* optional
operation
Preprocessing*
Operation intersection
Documents containing “lab”:
1 | 2 | 5 | 10 …
Documents containing “disa”:
2 | 4 | 5 | 7 …
Relevant document set:
1, 2, 5, 10…
2, 4, 5, 7…
Index
database
Search subsystem
parse query
terms
query tokens
operations*
ranking*
relevant
document set
ranked
document set
retrieved
document set
* optional
operation
Preprocessing*
Retrived document set:
2, 5
Not simple as that
• Support for different queries:
• Boolean, proximity, phrase, wildcard…
• More complex postings
• Additional indexes
► increase in complexity
• Retrieved document set could be huge
• We need to rank them relevantly
Vector space model
Vector space model
• Idea: A user‘s query can be viewed as a short document
• Documents and queries are represented as vectors in term
space (both in the same space)
► We are able to measure proximity — rank retrieved
documents
Vector space model cont.
• Two documents are similar, if they contain some of the same terms.
• We can take into account / weighting:
• Lenght of documents
• Number of terms in common
• Unusual or common words
• How many times each term appears
• Documents are represented as “bag of words”
• Words are terms with no order
► Thus the document
John is quicker than Mary.
Is indistinguishable from
Mary is quicker than John.
Vector space model cont.
• Term vector space
• n-dimensional space
• n – number of different terms/tokens used to index a set of documents
• Vector
• Document i, di, represented by a vector. Its magnitude in dimension j is
wij, where:
wij > 0 if term j occurs in document i
wij = 0 otherwise
• wij is the weight of term j in document i.
Documents in 3-dimensional
term vector space
Assumption: Documents that are “close together” in space are closer
in meaning
t1
t2
d1
t13
t12
t11
t3
Measuring similarity
• Eg. Cosine angle between the docs d1 and d2 determines
doc similarity
cos () =
cos () = 1 – documents exactly the same;
cos () = 0 – totally different.
d1.d2
|d1| |d2|
Constructing inverted index
Hardware dependencies
• Memory is faster than disc
• Seek time, transfer from disc
► As much data as possible in memory
• Better with SSD
• Disc to memory handled by system bus, not processor
• Reading compressed data and uncompressing
usually faster than
reading uncompressed data
Blocked Sort-Based Indexing (BSBI)
• Memory is insufficient, we need to use disc
• Map term to termID
1. Divide documents collection into blocks
• Each block fits into main memory
2. For each block
• Sort the termID-docID pairs
• Store intermediate sorted result on disc
3. Merge all intermediate results into the final result
• Maintaining small read and write buffers
• Assumption: dictionary fits into main memory, termID available
online for each document
Blocked Sort-Based Indexing (BSBI)
Blocked Sort-Based Indexing (BSBI)
• Problems:
• Dictionary must fit into memory
• We need dictionary to map a term to termID
• term-docID postings instead of termID-docID
• But intermediate files would become very large.
• Scalable, but slow.
Single Pass In-Memory Indexing (SPIMI)
• Dictionary won‘t fit into memory
1. Dictionary for each block
2. Add a posting directly to its posting list
• No sorting
• No storage of termID-docID pairs
• Posting list doubles allocated space each time it‘s full
► Complete inverted index for each block
3. Merge into one big index
• Compression makes SPIMI more efficient
• Postings
• Dictionary terms
► Processing larger blocks
Distributed indexing
• For web-scale indexing
• Distributed computer cluster
• Individual machines are fault-prone
• Maintain a master machine directing the indexing job
• Break up indexing into set of (parallel) tasks
• Master machine assigns tasks
Distributed indexing cont.
• Two sets of tasks
• Parsers
• Inventers
• Braking the input documents into splits (corresponding to
blocks in BSBI/SPIMI)
Distributed indexing
• Parsers
• Master assign split to an idle parser machine
• Parser reads a document and emits term-doc pairs
• Parser writes pairs into j partitions
• Each partition is for a range terms‘ first letter
• (e.g. a-f, g-p, q-z) — here j=3
• Inverter
• Collects all term-doc pairs from one term-partition
• Sorts and writes to postings lists
Distributed indexing – data flow
splits
Parser
Parser
Parser
Master
a-f g-p q-z
a-f g-p q-z
a-f g-p q-z
Inverter
Inverter
Inverter
Postings
a-f
g-p
q-z
assign assign
Map
phase
Segment files Reduce
phase
Dynamic indexing
• Untill now, we assumed that collections are static
• New documents need to be iserted
• Documents are deleted and modified
► Postings upades for terms already in dictionary
► New terms added to dictionary
Dynamic indexing
• “Big” main index
• New documents go into „small“ auxiliary index
• Search across both, merge results
• Deletions
• Invalidation bit-vector
• Periodically, re-index into one main index
Dynamic indexing
• Problems:
• Poor performance during merge
• If we have separate files for each postings list, merging is
efficient (simple append)
• Lots of files — not efficient for O/S
• In reality: somewhere in between
• Split large postings lists
• Collect postings list of lenght 1 in one file etc.
We covered
• Structure of inverted index
• Ranking – Vector Space Model
• Constructing of inverted indexes
Thank You.