Detailed Information on Publication Record
2014
HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation
BOJAR, Ondřej, Vojtěch DIATKA, Pavel RYCHLÝ, Pavel STRAŇÁK, Vít SUCHOMEL et al.
Basic information
Original name
HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation
Authors
BOJAR, Ondřej (203 Czech Republic), Vojtěch DIATKA (203 Czech Republic), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution), Pavel STRAŇÁK (203 Czech Republic), Vít SUCHOMEL (203 Czech Republic, guarantor, belonging to the institution), Aleš TAMCHYNA (203 Czech Republic) and Daniel ZEMAN (203 Czech Republic)
Edition
Reykjavik, Iceland, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), p. 3550-3555, 6 pp. 2014
Publisher
European Language Resources Association (ELRA)
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10201 Computer sciences, information science, bioinformatics
Country of publisher
Luxembourg
Confidentiality degree
is not subject to a state or trade secret
Publication form
electronic version available online
References:
RIV identification code
RIV/00216224:14330/14:00076251
Organization unit
Faculty of Informatics
ISBN
978-2-9517408-8-4
UT WoS
000355611005028
Keywords in English
Machine Translation; SpeechToSpeech Translation; Metadata
Tags
International impact, Reviewed
Changed: 1/11/2017 11:02, RNDr. Vít Suchomel, Ph.D.
Abstract
In the original
We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both the corpora are freely available for non-commercial research and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.
Links
LM2010013, research and development project