word2vec

Seems magical. “Neural computation, just like in the brain!”
How does this actually work?

How does word2vec work?

word2vec implements several different algorithms:
- Two training methods: Negative Sampling and Hierarchical Softmax.
- Two context representations: Continuous Bag of Words (CBOW) and Skip-grams.
We will focus on skip-grams with negative sampling; the intuitions apply to the other models as well.

How does word2vec work?

- Represent each word as a d-dimensional vector.
- Represent each context as a d-dimensional vector.
- Initialize all vectors to random weights.
- Arrange the vectors in two matrices, W and C.

How does word2vec work?

While there is more text:
- Extract a word window:
    A springer is [ a  cow or heifer close to calving ] .
                    c1 c2  c3 w      c4    c5 c6
  where w is the focus word vector (a row in W) and the ci are the context word vectors (rows in C).
- Try setting the vector values such that
    σ(w · c1) + σ(w · c2) + σ(w · c3) + σ(w · c4) + σ(w · c5) + σ(w · c6)
  is high.
- Create a corrupt example by replacing the focus word with a random word w′:
    [ a  cow or comet close to calving ]
      c1 c2  c3 w′    c4    c5 c6
  and try setting the vector values such that
    σ(w′ · c1) + σ(w′ · c2) + σ(w′ · c3) + σ(w′ · c4) + σ(w′ · c5) + σ(w′ · c6)
  is low.
(A code sketch of this update appears at the end of this section.)

How does word2vec work?

The training procedure results in:
- w · c is high for good word-context pairs;
- w · c is low for bad word-context pairs;
- w · c is neither high nor low for ok-ish word-context pairs.
As a result:
- Words that share many contexts get close to each other.
- Contexts that share many words get close to each other.
At the end, word2vec throws away C and returns W.

Reinterpretation

Imagine we didn’t throw away C. Consider the product WCᵀ.
The result is a matrix M in which:
- Each row corresponds to a word.
- Each column corresponds to a context.
- Each cell corresponds to w · c, an association measure between a word and a context.

Reinterpretation

Does this remind you of something? It is very similar to SVD over a distributional (word-context) representation.
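To make the window-level update above concrete, here is a minimal NumPy sketch. Everything in it is an illustrative assumption rather than the reference word2vec implementation: the names (W, C, window_update), the vocabulary size, the dimension, the learning rate, and the use of a single corrupt focus word per window.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10_000, 100          # assumed vocabulary size and vector dimension

# W holds focus-word vectors, C holds context vectors; both start random.
W = rng.normal(scale=0.1, size=(vocab_size, d))
C = rng.normal(scale=0.1, size=(vocab_size, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def window_update(w_id, corrupt_w_id, context_ids, lr=0.025):
    """One stochastic update for a single window:
    push sigma(w . c_i) up for the real focus word w,
    and sigma(w' . c_i) down for the randomly chosen corrupt word w'."""
    for focus_id, label in ((w_id, 1.0), (corrupt_w_id, 0.0)):
        w = W[focus_id]
        grad_w = np.zeros(d)
        for c_id in context_ids:
            c = C[c_id]
            g = lr * (label - sigmoid(w @ c))   # logistic-loss gradient scale
            grad_w += g * c
            C[c_id] += g * w                    # move the context vector (row of C)
        W[focus_id] += grad_w                   # move the focus vector (row of W)

# "a cow or heifer close to calving": six context ids around one focus id,
# plus one random word standing in for "comet" (all ids here are made up).
window_update(w_id=42, corrupt_w_id=int(rng.integers(vocab_size)),
              context_ids=[7, 19, 3, 256, 11, 1024])
```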
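The reinterpretation is then a single matrix product; a small illustration, continuing with the same assumed W and C from the sketch above:

```python
# M[i, j] = W[i] . C[j]: the word-context association score for word i and
# context j -- the same kind of word-context matrix that the count-based
# distributional approach builds explicitly and then factorizes with SVD.
M = W @ C.T                      # shape: (vocab_size, vocab_size)
print(M.shape, float(M[42, 7]))  # score for the focus/context ids used above
```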