BOJAR, Ondřej, Vojtěch DIATKA, Pavel RYCHLÝ, Pavel STRAŇÁK, Vít SUCHOMEL, Aleš TAMCHYNA and Daniel ZEMAN. HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland: European Language Resources Association (ELRA). p. 3550-3555. ISBN 978-2-9517408-8-4. 2014.
Basic information
Original name HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation
Authors BOJAR, Ondřej (203 Czech Republic), Vojtěch DIATKA (203 Czech Republic), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution), Pavel STRAŇÁK (203 Czech Republic), Vít SUCHOMEL (203 Czech Republic, guarantor, belonging to the institution), Aleš TAMCHYNA (203 Czech Republic) and Daniel ZEMAN (203 Czech Republic).
Edition Reykjavik, Iceland, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), p. 3550-3555, 6 pp. 2014.
Publisher European Language Resources Association (ELRA)
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Luxembourg
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/14:00076251
Organization unit Faculty of Informatics
ISBN 978-2-9517408-8-4
UT WoS 000355611005028
Keywords in English Machine Translation; SpeechToSpeech Translation; Metadata
Tags firank_B
Tags International impact, Reviewed
Changed by: RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 1/11/2017 11:02.
Abstract
We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.
Links
LM2010013, research and development project
Name: LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Acronym: LINDAT-Clarin)
Investor: Ministry of Education, Youth and Sports of the CR