Genre Annotation of Web Corpora: Scheme and Issues

SUCHOMEL, Vít. Genre Annotation of Web Corpora: Scheme and Issues. In Kohei Arai, Supriya Kapoor, Rahul Bhatia. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1. Vancouver, Canada: Springer Nature Switzerland AG, 2021, p. 738-754. ISBN 978-3-030-63127-7. Available from: https://dx.doi.org/10.1007/978-3-030-63128-4_55.

Basic information
Original name	Genre Annotation of Web Corpora: Scheme and Issues
Authors	SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition	Vancouver, Canada, Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1, p. 738-754, 17 pp. 2021.
Publisher	Springer Nature Switzerland AG

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	60203 Linguistics
Country of publisher	United Kingdom of Great Britain and Northern Ireland
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
WWW	Electronic version of the proceedings
RIV identification code	RIV/00216224:14330/21:00118741
Organization unit	Faculty of Informatics
ISBN	978-3-030-63127-7
ISSN	2194-5357
Doi	http://dx.doi.org/10.1007/978-3-030-63128-4_55
Keywords in English	Corpus annotation; Inter-annotator agreement; Text genre; Web corpora
Tags	best, firank_B
Tags	International impact, Reviewed
Changed by	RNDr. Vít Suchomel, Ph.D., personal ID (učo) 139723. Changed: 10/1/2023 11:49.

Abstract

Unlike the sources of traditional corpora built from printed media in past decades, the sources of web corpora are not well categorised or described, which makes it difficult to control the content of the corpus. This paper presents an attempt to classify genres in a large English web corpus through supervised learning. A set of genres suitable for web corpora users is defined based on a survey of related work. A genre annotation scheme with active learning rounds is introduced. A collection of web pages representing various genres, created for this task, and a scheme for the subsequent human annotation of the data set are described. Measuring the inter-annotator agreement revealed that either the problem may not be well defined, or our expectations concerning the precision and recall of the classifier cannot be met. The project was postponed at that point. Possible solutions to the issue are discussed at the end of the paper.

Links
GA18-23891S, research and development project	Name: Hyperintensional Reasoning over Natural Language Texts
GA18-23891S, research and development project	Investor: Czech Science Foundation
LM2018101, research and development project	Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
LM2018101, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
