Detailed Information on Publication Record

LEWANDOWSKA-TOMASZCZYK, Barbara, Geidre VALUNAITE OLESKEVICIENĖ, Slavko ŽITNIK, Anna BĄCZKOWSKA, Paul WILSON, Marcin TROJSZCZAK, Ana OSTROŠKI ANIĆ, Ivana BRAČ, Lobel FILIPIĆ, Olga DONTCHEVA-NAVRÁTILOVÁ, Agnieszka BOROWIAK, Chaya LIEBESKIND, Kristina DESPOT and Jelena MITROVIĆ. Annotation scheme and evaluation: the case of OFFENSIVE language. Rasprave Instituta za Hrvatski Jezik i Jezikoslovlje. CROATIA: Institute of Croatian Language and Linguistics, 2023, vol. 49, No 1, p. 155-175. ISSN 1331-6745. Available from: https://dx.doi.org/10.31724/rihjj.49.1.8.

Other formats: BibTeX LaTeX RIS

TY  - JOUR
ID  - 2349819
AU  - Lewandowska-Tomaszczyk, Barbara
AU  - Valunaite Oleskevicienė, Geidre
AU  - Žitnik, Slavko
AU  - Bączkowska, Anna
AU  - Wilson, Paul
AU  - Trojszczak, Marcin
AU  - Ostroški Anić, Ana
AU  - Brač, Ivana
AU  - Filipić, Lobel
AU  - Dontcheva-Navrátilová, Olga
AU  - Borowiak, Agnieszka
AU  - Liebeskind, Chaya
AU  - Despot, Kristina
AU  - Mitrović, Jelena
PY  - 2023
TI  - Annotation scheme and evaluation: the case of OFFENSIVE language
JF  - Rasprave Instituta za Hrvatski Jezik i Jezikoslovlje
VL  - 49
IS  - 1
SP  - 155
EP  - 175
PB  - Institute of Croatian Language and Linguistics
SN  - 1331-6745
KW  - annotation
KW  - annotators
KW  - curators
KW  - explicit implicit
KW  - offensive language
KW  - questionnaire
KW  - word embeddings
UR  - https://hrcak.srce.hr/clanak/444602
N2  - The present paper focuses on the presentation and discussion of aspects of OFFENSIVE LANGUAGE linguistic annotation, including creation, annotation practice, curation, and evaluation of an OFFENSIVE LANGUAGE annotation taxonomy scheme, first proposed in Lewandowska-Tomaszczyk et al. (2021). An extended offensive language ontology comprising 17 categories, structured in terms of 4 hierarchical levels, has been shown to represent the encoding of the defined offensive language schema, trained in terms of non-contextual word embeddings – i.e., Word2Vec and FastText, and eventually juxtaposed to the data acquired by using a pairwise training and testing analysis for existing categories in the HateBERT model (Lewandowska-Tomaszczyk et al. submitted). The study reports on the annotation practice in WG 4.1.1. Incivility in media and social media in the context of COST Action CA 18209 European network for Web-centred linguistic data science (Nexus Linguarum) with the INCEpTION tool (https://github.com/inception-project/inception) – a semantic annotation platform offering assistance in annotation. The results partly support the proposed ontology of explicit offence and positive implicitness types to provide more variance among widely recognized types of figurative language (e.g., metaphorical, metonymic, ironic, etc.). The use of the annotation system and the representation of linguistic data have also been evaluated in a series of the annotators’ comments, using a questionnaire method and in an open discussion. The annotation results and the questionnaire showed that for some of the categories, there was low or medium inter-annotator agreement, and it was more challenging for annotators to distinguish between category items than between aspect items, with the category items of offensive, insulting and abusive being the most difficult in this respect. The need for taxonomic simplification measures in this respect has been recognized for further annotation practices.
ER  -

Basic information
Original name	Annotation scheme and evaluation: the case of OFFENSIVE language
Authors	LEWANDOWSKA-TOMASZCZYK, Barbara (100 Bulgaria), Geidre VALUNAITE OLESKEVICIENĖ (440 Lithuania), Slavko ŽITNIK (191 Croatia), Anna BĄCZKOWSKA (616 Poland), Paul WILSON (616 Poland), Marcin TROJSZCZAK (616 Poland), Ana OSTROŠKI ANIĆ (191 Croatia), Ivana BRAČ (191 Croatia), Lobel FILIPIĆ (191 Croatia), Olga DONTCHEVA-NAVRÁTILOVÁ (100 Bulgaria, guarantor, belonging to the institution), Agnieszka BOROWIAK (616 Poland), Chaya LIEBESKIND, Kristina DESPOT (191 Croatia) and Jelena MITROVIĆ (688 Serbia).
Edition	Rasprave Instituta za Hrvatski Jezik i Jezikoslovlje, CROATIA, Institute of Croatian Language and Linguistics, 2023, 1331-6745.

Other information
Original language	English
Type of outcome	Article in a journal
Field of Study	60203 Linguistics
Country of publisher	Croatia
Confidentiality degree	is not subject to a state or trade secret
Impact factor	0.200 in 2022
RIV identification code	RIV/00216224:14410/23:00132528
Organization unit	Faculty of Education
Doi	http://dx.doi.org/10.31724/rihjj.49.1.8
UT WoS	001153374200005
Keywords in English	annotation; annotators; curators; explicit implicit; offensive language; questionnaire; word embeddings
Tags	International impact, Reviewed
Changed by	Mgr. Daniela Marcollová, učo 111148. Changed: 29/7/2024 15:24.

Abstract
The present paper focuses on the presentation and discussion of aspects of OFFENSIVE LANGUAGE linguistic annotation, including creation, annotation practice, curation, and evaluation of an OFFENSIVE LANGUAGE annotation taxonomy scheme, first proposed in Lewandowska-Tomaszczyk et al. (2021). An extended offensive language ontology comprising 17 categories, structured in terms of 4 hierarchical levels, has been shown to represent the encoding of the defined offensive language schema, trained in terms of non-contextual word embeddings – i.e., Word2Vec and FastText, and eventually juxtaposed to the data acquired by using a pairwise training and testing analysis for existing categories in the HateBERT model (Lewandowska-Tomaszczyk et al. submitted). The study reports on the annotation practice in WG 4.1.1. Incivility in media and social media in the context of COST Action CA 18209 European network for Web-centred linguistic data science (Nexus Linguarum) with the INCEpTION tool (https://github.com/inception-project/inception) – a semantic annotation platform offering assistance in annotation. The results partly support the proposed ontology of explicit offence and positive implicitness types to provide more variance among widely recognized types of figurative language (e.g., metaphorical, metonymic, ironic, etc.). The use of the annotation system and the representation of linguistic data have also been evaluated in a series of the annotators’ comments, using a questionnaire method and in an open discussion. The annotation results and the questionnaire showed that for some of the categories, there was low or medium inter-annotator agreement, and it was more challenging for annotators to distinguish between category items than between aspect items, with the category items of offensive, insulting and abusive being the most difficult in this respect. The need for taxonomic simplification measures in this respect has been recognized for further annotation practices.
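
Illustrative note: the abstract reports low or medium inter-annotator agreement for some categories but does not say in this record which agreement statistic was used. The sketch below assumes Cohen's kappa for two annotators, one common choice, and uses made-up labels borrowed from the three category names the abstract calls hardest to distinguish. The function name and the data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: one categorical label per item per annotator.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")

    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each annotator labelled at random with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations, invented for illustration only.
annotator_a = ["offensive", "insulting", "abusive", "offensive", "insulting", "offensive"]
annotator_b = ["offensive", "abusive", "abusive", "insulting", "insulting", "offensive"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.5
```

Here the annotators agree on 4 of 6 items (observed agreement 0.667), chance agreement is 0.333, and kappa is 0.5, a value usually read as moderate agreement. With more than two annotators, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the analogous choice.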
