Corpus Annotation Pipeline for Non-standard Texts

D 2018

PELIKÁNOVÁ, Zuzana and Zuzana NEVĚŘILOVÁ

Basic information

Original title

Corpus Annotation Pipeline for Non-standard Texts

Authors

PELIKÁNOVÁ, Zuzana (203 Czech Republic, guarantor, belonging to the institution) and Zuzana NEVĚŘILOVÁ (203 Czech Republic, belonging to the institution)

Edition

Switzerland, Text, Speech, and Dialogue, 21st International Conference, TSD 2018, pp. 304-312, 9 pp., 2018

Publisher

Springer International Publishing

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Switzerland

Confidentiality

is not subject to a state or trade secret

Publication form

printed version "print"

RIV identification code

RIV/00216224:14330/18:00104585

Organization unit

Faculty of Informatics

ISBN

978-3-030-00794-2

DOI

http://dx.doi.org/10.1007/978-3-030-00794-2_32

UT WoS

000611532300032

Keywords in English

Non-standard language; Interlingual homographs; Corpora annotation

Tags

firank_B

Flags

International impact, Reviewed

Changed: 2 May 2019 06:28, RNDr. Pavel Šmerk, Ph.D.

Abstract

In the original language

According to some estimations (e.g. [9]), web corpora contain over 6% of foreign material (borrowings, language mixing, named entities). Since annotation pipelines are usually built upon standard and correct data, the resulting annotation of web corpora often contains serious errors. We studied in depth annotation errors of the web corpus czTenTen 12 and proposed an extension to the tagger desamb that had been used for czTenTen annotation. First, the subcorpus was made using the most problematic documents from czTenTen. Second, measures were established for the most frequent annotation errors. Third, we established several experiments in which we extended the annotation pipeline so it could annotate foreign material and multi-word expressions. Finally, we compared the new annotations of the subcorpus with the original ones.

Links

MUNI/33/55939/2017, internal MU code

Name: Verifying the success of natural language processing techniques for information extraction from scanned documents (Ověření úspěšnosti technik zpracování přirozeného jazyka pro extrakci informací ze skenovaných dokumentů)

Investor: Masaryk University, Verifying the success of natural language processing techniques for information extraction from scanned documents

Cite

PELIKÁNOVÁ, Zuzana and Zuzana NEVĚŘILOVÁ. Corpus Annotation Pipeline for Non-standard Texts. In P. Sojka, A. Horák, I. Kopeček, K. Pala. Text, Speech, and Dialogue, 21st International Conference, TSD 2018. Switzerland: Springer International Publishing, 2018, pp. 304-312. ISBN 978-3-030-00794-2. Available from: https://dx.doi.org/10.1007/978-3-030-00794-2_32.

@inproceedings{1471077,
   author = {Pelikánová, Zuzana and Nevěřilová, Zuzana},
   address = {Switzerland},
   booktitle = {Text, Speech, and Dialogue, 21st International Conference, TSD 2018},
   doi = {10.1007/978-3-030-00794-2_32},
   editor = {P. Sojka and A. Horák and I. Kopeček and K. Pala},
   keywords = {Non-standard language; Interlingual homographs; Corpora annotation},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Switzerland},
   isbn = {978-3-030-00794-2},
   pages = {304--312},
   publisher = {Springer International Publishing},
   title = {Corpus Annotation Pipeline for Non-standard Texts},
   year = {2018}
}

TY  - CONF
ID  - 1471077
AU  - Pelikánová, Zuzana
AU  - Nevěřilová, Zuzana
PY  - 2018
SP  - 304
EP  - 312
TI  - Corpus Annotation Pipeline for Non-standard Texts
PB  - Springer International Publishing
CY  - Switzerland
SN  - 9783030007942
KW  - Non-standard language
KW  - Interlingual homographs
KW  - Corpora annotation
N2  - According to some estimations (e.g. [9]), web corpora contain over 6% of foreign material (borrowings, language mixing, named entities). Since annotation pipelines are usually built upon standard and correct data, the resulting annotation of web corpora often contains serious errors. We studied in depth annotation errors of the web corpus czTenTen 12 and proposed an extension to the tagger desamb that had been used for czTenTen annotation. First, the subcorpus was made using the most problematic documents from czTenTen. Second, measures were established for the most frequent annotation errors. Third, we established several experiments in which we extended the annotation pipeline so it could annotate foreign material and multi-word expressions. Finally, we compared the new annotations of the subcorpus with the original ones.
ER  -

PELIKÁNOVÁ, Zuzana and Zuzana NEVĚŘILOVÁ. Corpus Annotation Pipeline for Non-standard Texts. In P. Sojka, A. Horák, I. Kopeček, K. Pala. \textit{Text, Speech, and Dialogue, 21st International Conference, TSD 2018}. Switzerland: Springer International Publishing, 2018, pp.~304--312. ISBN~978-3-030-00794-2. Available from: https://dx.doi.org/10.1007/978-3-030-00794-2\_{}32.
