👷 Introduction to Information Retrieval

Latent semantic representations: Introduction to LLMs, matrix decompositions, LSI, and distributed word representations (3. 4. 2024)

Lecture

We are approaching a revolution in the representation of meaning: large language models (LLMs) enable disruptive changes in information retrieval (IR) for the general public.

Distributed Word Representations for Information Retrieval
Stanford course CS 276 / LING 286 slides

Readings

Matrix decompositions and latent semantic indexing
Chapter 18 from the Introduction to Information Retrieval book by Manning et al. (2008)
Gensim
A Wikipedia article about Gensim, a free, open-source library for topic modeling and distributed word representations, developed and co-maintained by the Faculty of Informatics, Masaryk University.
Modeling Science
Slides for a 2018 lecture by Blei, the author of LDA, on topic modeling with Latent Dirichlet Allocation (LDA)
DFR-browser
Interactive visualization of a topic model for the JSTOR digital library
GloVe: Global Vectors for Word Representation
A 2014 paper by Pennington et al.
Neural Word Embedding as Implicit Matrix Factorization
A 2014 paper by Levy and Goldberg
Evaluation of Extended Word Embeddings
Papers about word embeddings by the Math Information Retrieval (MIR) research group at Masaryk University

Seminar

Matrix decomposition, latent semantic indexing, and distributed word representations
Exercise solution for seminars in the seventh week. Exercise 18/2 continues exercises 6/1, 6/2, and 6/3 from the fourth week.
Matrix decompositions and latent semantic indexing
Google Colaboratory code for seminars in the seventh week
Distributed word representations
Google Colaboratory code for seminars in the seventh week
Scoring, term weighting, and the vector space model
Notes from seminar 03 in week 4

Finding similar documents with word2vec and soft cosine measure
A tutorial for similarity search using distributed word representations
Finding similar documents with word2vec and word mover's distance
A tutorial for similarity search using distributed word representations
Whiteboard images
Images of the whiteboard from the 2021 and 2022 runs of PV211