Word Embeddings: What, How and Whither
Yoav Goldberg, Bar-Ilan University

Understanding word2vec

word2vec seems magical. "Neural computation, just like in the brain!"
How does it actually work?

How does word2vec work?
word2vec implements several different algorithms:
• Two training methods: Negative Sampling and Hierarchical Softmax
• Two context representations: Continuous Bag of Words (CBOW) and Skip-grams
We'll focus on skip-grams with negative sampling; the intuitions apply to the other models as well.

How does word2vec work?
• Represent each word as a d-dimensional vector.
• Represent each context as a d-dimensional vector.
• Initialize all vectors to random weights.
• Arrange the vectors in two matrices, W and C.

How does word2vec work?
While there is more text:
• Extract a word window:
  A springer is [ a cow or heifer close to calving ] .
  Here heifer is the focus word w, and the surrounding words a, cow, or, close, to, calving are the contexts c1..c6.
  w is the focus word vector (a row in W); the ci are the context word vectors (rows in C).
• Try setting the vector values such that
  σ(w·c1) + σ(w·c2) + σ(w·c3) + σ(w·c4) + σ(w·c5) + σ(w·c6)
  is high.
• Create a corrupt example by replacing the focus word with a random word w':
  [ a cow or comet close to calving ]
  Try setting the vector values such that
  σ(w'·c1) + σ(w'·c2) + σ(w'·c3) + σ(w'·c4) + σ(w'·c5) + σ(w'·c6)
  is low.
(A minimal sketch of this update appears after the Reinterpretation below.)

How does word2vec work?
The training procedure results in:
• w·c for good word-context pairs is high.
• w·c for bad word-context pairs is low.
• w·c for ok-ish word-context pairs is neither high nor low.
As a result:
• Words that share many contexts get close to each other.
• Contexts that share many words get close to each other.
At the end, word2vec throws away C and returns W.

Reinterpretation
Imagine we didn't throw away C, and consider the product M = W·Cᵀ.
The result is a matrix M in which:
• Each row corresponds to a word.
• Each column corresponds to a context.
• Each cell corresponds to w·c, an association measure between a word and a context.
Does this remind you of something? It is very similar to SVD over a distributional representation.
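To make the training step and the reinterpretation concrete, here is a minimal numpy sketch of a skip-gram-with-negative-sampling update and of the W·Cᵀ product. It is an illustration of the idea, not the word2vec reference implementation: the toy vocabulary, dimensionality, and learning rate are assumptions chosen for the example, and a single corrupted window stands in for the k sampled negatives used in practice.

```python
# Minimal SGNS sketch (illustrative, not the reference word2vec code).
# The toy vocabulary, d=25, and lr=0.025 are assumptions for the example.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "cow", "or", "heifer", "comet", "close", "to", "calving"]
idx = {w: i for i, w in enumerate(vocab)}
d = 25
W = rng.normal(scale=0.1, size=(len(vocab), d))  # word vectors, one row per word
C = rng.normal(scale=0.1, size=(len(vocab), d))  # context vectors, one row per context

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(word, contexts, label, lr=0.025):
    """One gradient step pushing sigma(w . c) toward `label` (1 = real, 0 = corrupt)."""
    wi = idx[word]
    w = W[wi].copy()
    grad_w = np.zeros(d)
    for ctx in contexts:
        ci = idx[ctx]
        score = sigmoid(w @ C[ci])
        g = lr * (label - score)      # gradient of the log-likelihood wrt the score
        grad_w += g * C[ci]
        C[ci] += g * w
    W[wi] += grad_w

window = ["a", "cow", "or", "close", "to", "calving"]  # c1 .. c6
sgns_update("heifer", window, label=1)  # real window: push the sigma(w . ci) terms up
sgns_update("comet", window, label=0)   # corrupted window: push the sigma(w' . ci) terms down

# Reinterpretation: if we keep C, the product W C^T is a word-by-context
# matrix M with M[i, j] = w_i . c_j, an association measure between word i
# and context j -- much like the matrix factored by SVD in distributional methods.
M = W @ C.T
```

Repeating these two updates over a large corpus is what drives w·c up for observed pairs and down for sampled corruptions; keeping C and forming M = W·Cᵀ recovers the implicit word-context association matrix the reinterpretation refers to.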
Context matters

What's in a Context?
• Importing ideas from embeddings improves distributional methods.
• Can distributional ideas also improve embeddings?
• Idea: change SGNS's default bag-of-words (BoW) contexts into dependency contexts.

"Dependency-Based Word Embeddings", Levy & Goldberg, ACL 2014

Example: Australian scientist discovers star with telescope
• Target word: discovers
• Bag-of-words (BoW) context: the other words within a fixed window of the target (Australian, scientist, star, with, telescope).
• Syntactic dependency context: the words connected to the target in the parse, together with the relation: scientist/nsubj, star/dobj, telescope/prep_with.
(A small sketch of the two context types follows the takeaways below.)

Embedding Similarity with Different Contexts

Target word: Hogwarts (Harry Potter's school)
• Bag of Words (k=5): Dumbledore, hallows, half-blood, Malfoy, Snape (related to Harry Potter)
• Dependencies: Sunnydale, Collinwood, Calarts, Greendale, Millfield (schools)

Target word: Turing (computer scientist)
• Bag of Words (k=5): nondeterministic, non-deterministic, computability, deterministic, finite-state (related to computability)
• Dependencies: Pauling, Hotelling, Heting, Lessing, Hamming (scientists)

Target word: dancing (dance gerund)
• Bag of Words (k=5): singing, dance, dances, dancers, tap-dancing (related to dance)
• Dependencies: singing, rapping, breakdancing, miming, busking (gerunds)

What is the effect of different context types?
• Thoroughly studied in distributional methods: Lin (1998), Padó and Lapata (2007), and many others.
General conclusion:
• Bag-of-words contexts induce topical similarities.
• Dependency contexts induce functional similarities: words that share the same semantic type (cohyponyms).
• This holds for embeddings as well.

Takeaways:
• Same algorithm, different inputs -- very different kinds of similarity.
• Inputs matter much more than the algorithm.
• Think about your inputs.
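To make the two context types concrete, here is a small Python sketch that extracts both kinds of contexts for the example sentence. The dependency arcs are hard-coded assumptions standing in for a parser; in the paper they come from an automatic parse, prepositions are collapsed into relations such as prep_with, and inverse relations are also included, which this sketch omits.

```python
# Illustrative sketch of BoW vs. dependency contexts for the running example.
# The dependency arcs below are hard-coded stand-ins for a parser's output.
sentence = ["Australian", "scientist", "discovers", "star", "with", "telescope"]
target = "discovers"

def bow_contexts(tokens, target, k=5):
    """Bag-of-words contexts: every other word within a k-word window of the target."""
    i = tokens.index(target)
    return [t for j, t in enumerate(tokens) if j != i and abs(j - i) <= k]

# Dependency arcs attached to the target, with the preposition collapsed
# into the relation (with -> prep_with).
dep_arcs = {
    "discovers": [("scientist", "nsubj"), ("star", "dobj"), ("telescope", "prep_with")],
}

def dep_contexts(word):
    return [f"{w}/{rel}" for w, rel in dep_arcs.get(word, [])]

print(bow_contexts(sentence, target))
# ['Australian', 'scientist', 'star', 'with', 'telescope']  -- contexts that induce topical similarity
print(dep_contexts(target))
# ['scientist/nsubj', 'star/dobj', 'telescope/prep_with']   -- contexts that induce functional similarity
```

Feeding word/context pairs of the second kind into the same SGNS training procedure is essentially the only change the dependency-based embeddings make, which is why the resulting neighbour lists shift from topical to functional similarity.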