D 2021

Genre Annotation of Web Corpora: Scheme and Issues

SUCHOMEL, Vít

Basic information

Original name

Genre Annotation of Web Corpora: Scheme and Issues

Authors

SUCHOMEL, Vít (203 Czech Republic, guarantor, belonging to the institution)

Edition

Vancouver, Canada, Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1, p. 738-754, 17 pp. 2021

Publisher

Springer Nature Switzerland AG

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

60203 Linguistics

Country of publisher

United Kingdom of Great Britain and Northern Ireland

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/21:00118741

Organization unit

Faculty of Informatics

ISBN

978-3-030-63127-7

Keywords in English

Corpus annotation; Inter-annotator agreement; Text genre; Web corpora

Tags

International impact, Reviewed
Changed: 10/1/2023 11:49, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original

Unlike traditional corpora built from printed media in past decades, the sources of web corpora are not well categorised or described, which makes it difficult to control the content of the corpus. This paper presents an attempt to classify genres in a large English web corpus through supervised learning. A set of genres suitable for web corpus users is defined based on a survey of related work. A genre annotation scheme with active learning rounds is introduced. A collection of web pages representing various genres, created for this task, and a scheme for the subsequent human annotation of the data set are described. Measuring the inter-annotator agreement revealed that either the problem may not be well defined, or our expectations concerning the precision and recall of the classifier cannot be met. The project was postponed at that point. Possible solutions to the issue are discussed at the end of the paper.

Links

GA18-23891S, research and development project
Name: Hyperintensional Reasoning over Natural Language Texts
Investor: Czech Science Foundation
LM2018101, research and development project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities (Acronym: LINDAT/CLARIAH-CZ)
Investor: Ministry of Education, Youth and Sports of the CR