Word Embeddings: What, How and Whither
Yoav Goldberg, Bar-Ilan University

Understanding word2vec

word2vec seems magical. "Neural computation, just like in the brain!"
How does it actually work?

How does word2vec work?
word2vec implements several different algorithms:
• Two training methods: Negative Sampling and Hierarchical Softmax
• Two context representations: Continuous Bag of Words (CBOW) and Skip-grams
We'll focus on skip-grams with negative sampling; the intuitions apply to the other models as well.

How does word2vec work?
• Represent each word as a d-dimensional vector.
• Represent each context as a d-dimensional vector.
• Initialize all vectors to random weights.
• Arrange the vectors in two matrices, W and C.

How does word2vec work?
While there is more text:
• Extract a word window:
  A springer is [ a cow or heifer close to calving ] .
  Here heifer is the focus word w, and the surrounding words a, cow, or, close, to, calving are the contexts c1..c6.
  w is the focus word vector (a row in W); the ci are the context word vectors (rows in C).
• Try setting the vector values such that
  σ(w·c1) + σ(w·c2) + σ(w·c3) + σ(w·c4) + σ(w·c5) + σ(w·c6)
  is high.
• Create a corrupt example by replacing the focus word with a random word w':
  [ a cow or comet close to calving ]
  Try setting the vector values such that
  σ(w'·c1) + σ(w'·c2) + σ(w'·c3) + σ(w'·c4) + σ(w'·c5) + σ(w'·c6)
  is low.
(A minimal sketch of this update appears after the Reinterpretation below.)

How does word2vec work?
The training procedure results in:
• w·c for good word-context pairs is high.
• w·c for bad word-context pairs is low.
• w·c for ok-ish word-context pairs is neither high nor low.
As a result:
• Words that share many contexts get close to each other.
• Contexts that share many words get close to each other.
At the end, word2vec throws away C and returns W.

Reinterpretation
Imagine we didn't throw away C, and consider the product M = W·Cᵀ.
The result is a matrix M in which:
• Each row corresponds to a word.
• Each column corresponds to a context.
• Each cell corresponds to w·c, an association measure between a word and a context.
Does this remind you of something? It is very similar to SVD over a distributional representation.
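To make the training step and the reinterpretation concrete, here is a minimal numpy sketch of a skip-gram-with-negative-sampling update and of the W·Cᵀ product. It is an illustration of the idea, not the word2vec reference implementation: the toy vocabulary, dimensionality, and learning rate are assumptions chosen for the example, and a single corrupted window stands in for the k sampled negatives used in practice.

```python
# Minimal SGNS sketch (illustrative, not the reference word2vec code).
# The toy vocabulary, d=25, and lr=0.025 are assumptions for the example.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "cow", "or", "heifer", "comet", "close", "to", "calving"]
idx = {w: i for i, w in enumerate(vocab)}
d = 25
W = rng.normal(scale=0.1, size=(len(vocab), d))  # word vectors, one row per word
C = rng.normal(scale=0.1, size=(len(vocab), d))  # context vectors, one row per context

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(word, contexts, label, lr=0.025):
    """One gradient step pushing sigma(w . c) toward `label` (1 = real, 0 = corrupt)."""
    wi = idx[word]
    w = W[wi].copy()
    grad_w = np.zeros(d)
    for ctx in contexts:
        ci = idx[ctx]
        score = sigmoid(w @ C[ci])
        g = lr * (label - score)      # gradient of the log-likelihood wrt the score
        grad_w += g * C[ci]
        C[ci] += g * w
    W[wi] += grad_w

window = ["a", "cow", "or", "close", "to", "calving"]  # c1 .. c6
sgns_update("heifer", window, label=1)  # real window: push the sigma(w . ci) terms up
sgns_update("comet", window, label=0)   # corrupted window: push the sigma(w' . ci) terms down

# Reinterpretation: if we keep C, the product W C^T is a word-by-context
# matrix M with M[i, j] = w_i . c_j, an association measure between word i
# and context j -- much like the matrix factored by SVD in distributional methods.
M = W @ C.T
```

Repeating these two updates over a large corpus is what drives w·c up for observed pairs and down for sampled corruptions; keeping C and forming M = W·Cᵀ recovers the implicit word-context association matrix the reinterpretation refers to.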
Context matters

What's in a Context?
• Importing ideas from embeddings improves distributional methods.
• Can distributional ideas also improve embeddings?
• Idea: change SGNS's default bag-of-words (BoW) contexts into dependency contexts.

"Dependency-Based Word Embeddings", Levy & Goldberg, ACL 2014

Example: Australian scientist discovers star with telescope
• Target word: discovers
• Bag-of-words (BoW) context: the other words within a fixed window of the target (Australian, scientist, star, with, telescope).
• Syntactic dependency context: the words connected to the target in the parse, together with the relation: scientist/nsubj, star/dobj, telescope/prep_with.
(A small sketch of the two context types follows the takeaways below.)

Embedding Similarity with Different Contexts

Target word: Hogwarts (Harry Potter's school)
• Bag of Words (k=5): Dumbledore, hallows, half-blood, Malfoy, Snape (related to Harry Potter)
• Dependencies: Sunnydale, Collinwood, Calarts, Greendale, Millfield (schools)

Target word: Turing (computer scientist)
• Bag of Words (k=5): nondeterministic, non-deterministic, computability, deterministic, finite-state (related to computability)
• Dependencies: Pauling, Hotelling, Heting, Lessing, Hamming (scientists)

Target word: dancing (dance gerund)
• Bag of Words (k=5): singing, dance, dances, dancers, tap-dancing (related to dance)
• Dependencies: singing, rapping, breakdancing, miming, busking (gerunds)

What is the effect of different context types?
• Thoroughly studied in distributional methods: Lin (1998), Padó and Lapata (2007), and many others.
General conclusion:
• Bag-of-words contexts induce topical similarities.
• Dependency contexts induce functional similarities: words that share the same semantic type (cohyponyms).
• This holds for embeddings as well.

Takeaways:
• Same algorithm, different inputs -- very different kinds of similarity.
• Inputs matter much more than the algorithm.
• Think about your inputs.
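To make the two context types concrete, here is a small Python sketch that extracts both kinds of contexts for the example sentence. The dependency arcs are hard-coded assumptions standing in for a parser; in the paper they come from an automatic parse, prepositions are collapsed into relations such as prep_with, and inverse relations are also included, which this sketch omits.

```python
# Illustrative sketch of BoW vs. dependency contexts for the running example.
# The dependency arcs below are hard-coded stand-ins for a parser's output.
sentence = ["Australian", "scientist", "discovers", "star", "with", "telescope"]
target = "discovers"

def bow_contexts(tokens, target, k=5):
    """Bag-of-words contexts: every other word within a k-word window of the target."""
    i = tokens.index(target)
    return [t for j, t in enumerate(tokens) if j != i and abs(j - i) <= k]

# Dependency arcs attached to the target, with the preposition collapsed
# into the relation (with -> prep_with).
dep_arcs = {
    "discovers": [("scientist", "nsubj"), ("star", "dobj"), ("telescope", "prep_with")],
}

def dep_contexts(word):
    return [f"{w}/{rel}" for w, rel in dep_arcs.get(word, [])]

print(bow_contexts(sentence, target))
# ['Australian', 'scientist', 'star', 'with', 'telescope']  -- contexts that induce topical similarity
print(dep_contexts(target))
# ['scientist/nsubj', 'star/dobj', 'telescope/prep_with']   -- contexts that induce functional similarity
```

Feeding word/context pairs of the second kind into the same SGNS training procedure is essentially the only change the dependency-based embeddings make, which is why the resulting neighbour lists shift from topical to functional similarity.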