Precomputed Word Embeddings for 15+ Languages

HERMAN, Ondřej

Basic information

Original name

Precomputed Word Embeddings for 15+ Languages

Authors

HERMAN, Ondřej (203 Czech Republic, guarantor, belonging to the institution)

Edition

Brno, Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), p. 41-46, 6 pp. 2021

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Article in conference proceedings

Field of Study

10200 1.2 Computer and information sciences

Country of publisher

Czech Republic

Confidentiality degree

not subject to state or trade secrecy

Publication form

printed version ("print")

RIV identification code

RIV/00216224:14330/21:00123246

Organization unit

Faculty of Informatics

ISBN

978-80-263-1670-1

ISSN

Keywords in English

Word embeddings; Sketch Engine; Corpora
Changed: 15/5/2024 02:13, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original

Word embeddings serve as a useful resource for many downstream natural language processing tasks. The embeddings map, or embed, the lexicon of a language onto a vector space, in which various operations can be carried out easily using the established machinery of linear algebra. The unbounded nature of language can be problematic, and word embeddings provide a way of compressing words into a manageable dense space. The position of a word in the vector space is given by the contexts the word appears in, or, as the distributional hypothesis postulates, a word is characterized by the company it keeps [2]. As similar words appear in similar contexts, their positions in the embedding vector space will also be close to each other. Because of this, many useful semantic properties of words are preserved in the embedding vector space.
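The closeness of similar words described above is usually measured with cosine similarity. The following sketch illustrates the idea on made-up toy vectors (the values and words are illustrative assumptions, not taken from the precomputed embeddings the paper describes):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings (hypothetical values for illustration only;
# real embeddings typically have hundreds of dimensions)
embeddings = {
    "cat":  np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":  np.array([0.8, 0.2, 0.4, 0.1]),
    "bank": np.array([0.1, 0.9, 0.0, 0.7]),
}

# Words appearing in similar contexts end up close in the vector space,
# so their cosine similarity is high; unrelated words score low.
print(cosine_similarity(embeddings["cat"], embeddings["dog"]))   # high (close to 1)
print(cosine_similarity(embeddings["cat"], embeddings["bank"]))  # low
```

In practice such similarities would be queried from trained models (e.g. via a word-vector library) rather than hand-written arrays, but the linear-algebra machinery is the same.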

Links

LM2018101, research and development project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR