Knowledge of the basic principles of data processing is assumed.
Course Enrolment Limitations
The course is also offered to students of fields other than those it is directly associated with.
Fields of study the course is directly associated with
There are 47 fields of study the course is directly associated with.
The objective of the course is to explain the problems of information retrieval in large collections of unstructured data, such as text documents or multimedia objects. The main emphasis is on describing the basic principles of distributed algorithms for processing large volumes of data, e.g., locality-sensitive hashing, MapReduce, or PageRank. Algorithms for processing stream data are introduced as well. Students also acquire practical experience by applying the presented algorithms to specific tasks.
After completing the course, students are able to:
- Describe the algorithmic differences between processing offline data collections and online data streams;
- Understand the basic principles of distributed algorithms for processing large volumes of data;
- Evaluate the results of algorithms by several metrics;
- Apply the presented algorithms, such as K-Means, locality-sensitive hashing, MapReduce, or PageRank, to specific tasks.
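As a rough illustration of the MapReduce principle named above, the following sketch counts words in a small collection on a single machine. It is not course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this example, and a real framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce sequentially (hypothetical helper)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Illustrative input only
counts = word_count(["big data big", "data streams"])
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, both phases could in principle be distributed without changing the result.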
Introduction – What is searching, Things useful to know
Support for Distributed Processing – Distributed file system, MapReduce, Algorithms using MapReduce, Cost model and performance evaluation
Retrieval Operators and Result Evaluations – Common similarity search operators, Retrieval metrics
Clustering – K-means algorithms, Clustering in non-Euclidean spaces, Clustering for streams and parallelism
Finding Frequent Item Sets – Handling large datasets in main memory, Counting frequent items in a stream
Finding Similar Items – Applications of near-neighbor search, Shingling of documents, Similarity-preserving summaries of sets, Locality-sensitive hashing
Searching in Data Streams – The stream data model, Filtering streams
Link Analysis – PageRank, Topic-sensitive PageRank, Link spam
Search Applications – Advertising on the web, Recommendation systems (collaborative filtering), Mining social-network graphs
Seznam.cz – A Search Engine in Practice
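For orientation on the retrieval metrics listed in the syllabus above, precision and recall of a single query result can be sketched as follows. The function name `precision_recall` and the document identifiers are assumptions made for this example; the course may use other metrics or definitions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a result set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative data: 4 documents returned, 3 documents relevant, 2 in common
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Precision measures how much of the answer is relevant, while recall measures how much of the relevant material was found, so the two are usually reported together.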
P, Deepak and Prasad M. DESHPANDE. Operators for similarity search : semantics, techniques and usage scenarios. Cham: Springer, 2015. xi, 115. ISBN 9783319212562.
LESKOVEC, Jurij, Anand RAJARAMAN and Jeffrey D. ULLMAN. Mining of massive datasets. 2nd ed. Cambridge: Cambridge University Press, 2014. xi, 467. ISBN 9781107077232.
BAEZA-YATES, R. and Berthier de Araújo Neto RIBEIRO. Modern information retrieval : the concepts and technology behind search. 2nd ed. Harlow: Pearson, 2011. xxx, 913. ISBN 9780321416919.
Lectures with slides in English. The approach combines theory, algorithms and practical examples.
The final exam consists of a written part and an oral part. The student is asked several questions to verify the knowledge obtained during the course lectures.