word2vec

Seems magical. “Neural computation, just like in the brain!”
How does this actually work?

How does word2vec work?

word2vec implements several different algorithms:
- Two training methods: Negative Sampling and Hierarchical Softmax.
- Two context representations: Continuous Bag of Words (CBOW) and Skip-grams.
We will focus on skip-grams with negative sampling; the intuitions apply to the other models as well.

How does word2vec work?

- Represent each word as a d-dimensional vector.
- Represent each context as a d-dimensional vector.
- Initialize all vectors to random weights.
- Arrange the vectors in two matrices, W and C.

How does word2vec work?

While there is more text:
- Extract a word window:
    A springer is [ a  cow or heifer close to calving ] .
                    c1 c2  c3 w      c4    c5 c6
  where w is the focus word vector (a row in W) and the ci are the context word vectors (rows in C).
- Try setting the vector values such that
    σ(w · c1) + σ(w · c2) + σ(w · c3) + σ(w · c4) + σ(w · c5) + σ(w · c6)
  is high.
- Create a corrupt example by replacing the focus word with a random word w′:
    [ a  cow or comet close to calving ]
      c1 c2  c3 w′    c4    c5 c6
  and try setting the vector values such that
    σ(w′ · c1) + σ(w′ · c2) + σ(w′ · c3) + σ(w′ · c4) + σ(w′ · c5) + σ(w′ · c6)
  is low.
(A code sketch of this update appears at the end of this section.)

How does word2vec work?

The training procedure results in:
- w · c is high for good word-context pairs;
- w · c is low for bad word-context pairs;
- w · c is neither high nor low for ok-ish word-context pairs.
As a result:
- Words that share many contexts get close to each other.
- Contexts that share many words get close to each other.
At the end, word2vec throws away C and returns W.

Reinterpretation

Imagine we didn’t throw away C. Consider the product WCᵀ.
The result is a matrix M in which:
- Each row corresponds to a word.
- Each column corresponds to a context.
- Each cell corresponds to w · c, an association measure between a word and a context.

Reinterpretation

Does this remind you of something? It is very similar to SVD over a distributional (word-context) representation.
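To make the window-level update above concrete, here is a minimal NumPy sketch. Everything in it is an illustrative assumption rather than the reference word2vec implementation: the names (W, C, window_update), the vocabulary size, the dimension, the learning rate, and the use of a single corrupt focus word per window.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10_000, 100          # assumed vocabulary size and vector dimension

# W holds focus-word vectors, C holds context vectors; both start random.
W = rng.normal(scale=0.1, size=(vocab_size, d))
C = rng.normal(scale=0.1, size=(vocab_size, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def window_update(w_id, corrupt_w_id, context_ids, lr=0.025):
    """One stochastic update for a single window:
    push sigma(w . c_i) up for the real focus word w,
    and sigma(w' . c_i) down for the randomly chosen corrupt word w'."""
    for focus_id, label in ((w_id, 1.0), (corrupt_w_id, 0.0)):
        w = W[focus_id]
        grad_w = np.zeros(d)
        for c_id in context_ids:
            c = C[c_id]
            g = lr * (label - sigmoid(w @ c))   # logistic-loss gradient scale
            grad_w += g * c
            C[c_id] += g * w                    # move the context vector (row of C)
        W[focus_id] += grad_w                   # move the focus vector (row of W)

# "a cow or heifer close to calving": six context ids around one focus id,
# plus one random word standing in for "comet" (all ids here are made up).
window_update(w_id=42, corrupt_w_id=int(rng.integers(vocab_size)),
              context_ids=[7, 19, 3, 256, 11, 1024])
```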
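The reinterpretation is then a single matrix product; a small illustration, continuing with the same assumed W and C from the sketch above:

```python
# M[i, j] = W[i] . C[j]: the word-context association score for word i and
# context j -- the same kind of word-context matrix that the count-based
# distributional approach builds explicitly and then factorizes with SVD.
M = W @ C.T                      # shape: (vocab_size, vocab_size)
print(M.shape, float(M[42, 7]))  # score for the focus/context ids used above
```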