Word Embeddings (PA153)
Pavel Rychlý

Continuous space representation
- words are represented by a vector of numbers
- similar words are closer to each other
- more dimensions = more features
- tens to hundreds of dimensions, up to 1000
- continue = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349]

Simple vector learning
- each word has two vectors: a node vector (node_w) and a context vector (ctx_w)
- generate (node, context) pairs from the text, for example from bigrams w1, w2: w1 is the context, w2 is the node
- move ctx_w1 and node_w2 closer together (see the training sketch at the end of this section)

Word2vec
- command-line tool for creating word embeddings
- two models:
  - CBOW = Continuous Bag of Words
  - skip-gram
- many parameters (see the parameter sketch at the end of this section):
  - window size
  - dimension of vectors
  - alpha (learning rate)
  - min-count for words
  - sub-sampling limit

Word2vec
- simple tokenization: tokens are separated by spaces
- lines = paragraphs (the context window never crosses line boundaries)
- negative sampling
- sub-sampling of frequent words
- fast computation on multiple CPUs
- compact but cryptic C code

GloVe
- several (independent) modules
- clean C code
- can save both node and context vectors

FastText
- includes character n-grams
- handles unknown and low-frequency words (see the usage sketch at the end of this section)
- tangled C++ code with many classes
- many pre-trained models
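
Training sketch. A minimal sketch of the pair update from "Simple vector learning" above, assuming a word2vec-style sigmoid objective with one random negative sample per (context, node) pair; the toy corpus, dimensionality and learning rate are illustrative, not from the slides.

    import numpy as np

    # toy corpus; bigram pairs (w1 = context, w2 = node) as described above
    corpus = "the cat sat on the mat the dog sat on the rug".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}

    dim, lr = 8, 0.05
    rng = np.random.default_rng(0)
    node = rng.normal(0.0, 0.1, (len(vocab), dim))  # node vectors (node_w)
    ctx = rng.normal(0.0, 0.1, (len(vocab), dim))   # context vectors (ctx_w)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for epoch in range(100):
        for w1, w2 in zip(corpus, corpus[1:]):
            c, n = idx[w1], idx[w2]
            # positive pair: move ctx_w1 and node_w2 closer together
            g = lr * (1.0 - sigmoid(ctx[c] @ node[n]))
            dc, dn = g * node[n], g * ctx[c]
            # one random negative sample: push ctx_w1 away from an unrelated node
            r = rng.integers(len(vocab))
            gneg = lr * (0.0 - sigmoid(ctx[c] @ node[r]))
            node[r] += gneg * ctx[c]
            ctx[c] += dc + gneg * node[r]
            node[n] += dn

    # similar words end up with similar node vectors (cosine similarity)
    def most_similar(w, k=3):
        v = node[idx[w]]
        sims = node @ v / (np.linalg.norm(node, axis=1) * np.linalg.norm(v))
        return [vocab[i] for i in np.argsort(-sims)[:k]]

    print(most_similar("cat"))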
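
Parameter sketch. The original word2vec is a C command-line tool; as an illustration of the parameters listed above, the sketch below maps them onto the gensim re-implementation (assuming gensim ≥ 4, where the dimension parameter is named vector_size). The file corpus.txt is a placeholder with one space-tokenized line per paragraph.

    from gensim.models import Word2Vec

    # one space-tokenized paragraph per line; the window never crosses lines
    sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]

    model = Word2Vec(
        sentences,
        sg=1,             # 1 = skip-gram, 0 = CBOW
        vector_size=100,  # dimension of vectors
        window=5,         # window size
        alpha=0.025,      # learning rate
        min_count=5,      # min-count for words
        sample=1e-3,      # sub-sampling limit
        negative=5,       # negative sampling
        workers=4,        # fast computation on multiple CPUs
    )

    # query word is illustrative; it must occur in corpus.txt
    print(model.wv.most_similar("continue", topn=5))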
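
FastText usage sketch. FastText also ships as its own command-line tool; the sketch below uses the gensim FastText wrapper only to illustrate the character n-gram idea (min_n/max_n bound the n-gram lengths). The out-of-vocabulary token at the end is made up to show that a vector can still be assembled from its n-grams.

    from gensim.models import FastText

    sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]

    model = FastText(
        sentences,
        vector_size=100,
        window=5,
        min_count=5,
        min_n=3,   # shortest character n-gram
        max_n=6,   # longest character n-gram
    )

    # a vector is built from character n-grams even for an unseen word
    print(model.wv["continuee"])  # hypothetical out-of-vocabulary token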