D 2016

European Union Language Resources in Sketch Engine

BAISA, Vít, Jan MICHELFEIT, Marek MEDVEĎ and Miloš JAKUBÍČEK

Basic information

Original name

European Union Language Resources in Sketch Engine

Authors

BAISA, Vít (203 Czech Republic, belonging to the institution), Jan MICHELFEIT (203 Czech Republic, belonging to the institution), Marek MEDVEĎ (703 Slovakia, belonging to the institution) and Miloš JAKUBÍČEK (203 Czech Republic, belonging to the institution)

Edition

Portorož, Slovenia, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), p. 2799-2803, 5 pp. 2016

Publisher

European Language Resources Association (ELRA)

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Slovenia

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

RIV identification code

RIV/00216224:14330/16:00087949

Organization unit

Faculty of Informatics

ISBN

978-2-9517408-9-1

Keywords in English

JRC-Acquis; DCEP; DGT-TM; Europarl; EUR-Lex; Sketch Engine; parallel corpus; word sketch; parallel concordance

Tags

International impact, Reviewed
Changed: 3/1/2017 11:12, RNDr. Marek Medveď, Ph.D.

Abstract

In the original

Several parallel corpora built from European Union language resources are presented here. They were processed by state-of-the-art tools and made available for researchers in the Sketch Engine corpus management system. A completely new resource is introduced: the EUR-Lex corpus, one of the largest parallel corpora available at the moment, containing 840 million English tokens; its largest language pair (English-French) has more than 25 million aligned segments (paragraphs).

Links

GA15-13277S, research and development project
Name: Hyperintensional logic for natural language analysis (Hyperintensionální logika pro analýzu přirozeného jazyka)
Investor: Czech Science Foundation
LM2015071, research and development project
Name: Language research infrastructure in the Czech Republic (Jazyková výzkumná infrastruktura v České republice) (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR
MUNI/A/0945/2015, internal MU code
Name: Large-scale computing systems: models, applications and verification V. (Rozsáhlé výpočetní systémy: modely, aplikace a verifikace V.)
Investor: Masaryk University, Category A
7F14047, research and development project
Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
Investor: Ministry of Education, Youth and Sports of the CR