GRÁC, Marek. Building Annotated Corpora without Experts. In Daniela Majchráková, Radovan Garabík. Natural Language Processing, Multilinguality. Bratislava, Slovakia: Slovak National Corpus, 2011, p. 81-88. ISBN 978-80-263-0049-6.
Basic information
Original name Building Annotated Corpora without Experts
Authors GRÁC, Marek (703 Slovakia, guarantor, belonging to the institution).
Edition Bratislava, Slovakia, Natural Language Processing, Multilinguality, p. 81-88, 8 pp. 2011.
Publisher Slovak National Corpus
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Slovakia
Confidentiality degree is not subject to a state or trade secret
RIV identification code RIV/00216224:14330/11:00049482
Organization unit Faculty of Informatics
ISBN 978-80-263-0049-6
Keywords (in Czech) korpus (corpus), anotování (annotation)
Keywords in English corpus, annotation, crowdsourcing
Tags International impact, Reviewed
Changed by: Mgr. Marek Grác, Ph.D., učo 50728. Changed: 25/11/2011 12:27.
Abstract
In this paper, we present a low-cost approach to building a multi-purpose language resource for Czech, based on currently available results of previous work by various teams. We focus on the first phase, which consists of verifying the validity of automatically discovered syntactic elements in 10 000 sentences by 47 human annotators. Given the number of annotators and the very limited time for training, existing heavyweight techniques for building annotated corpora were not applicable. We decided not to use experts to resolve cases where annotators disagreed. This means that our corpus does not offer ultimate answers, but rather raw data and models for obtaining a "correct" answer tailored to the user's application. Finally, we discuss the results achieved so far and our future plans.
Links
LC536, research and development project. Name: Centrum komputační lingvistiky (Centre for Computational Linguistics)
Investor: Ministry of Education, Youth and Sports of the CR, Centrum komputační lingvistiky
1ET100300419, research and development project. Name: Inteligentní modely, algoritmy, metody a nástroje pro vytváření sémantického webu (Intelligent Models, Algorithms, Methods and Tools for the Semantic Web)
Investor: Academy of Sciences of the Czech Republic, Intelligent Models, Algorithms, Methods and Tools for the Semantic Web (realization)