Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora

D 2019

NEVĚŘILOVÁ, Zuzana and Marie STARÁ

Basic information

Original title

Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora

Authors

NEVĚŘILOVÁ, Zuzana (203 Czech Republic, belonging to the institution) and Marie STARÁ (203 Czech Republic, belonging to the institution)

Edition

Brno, Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019, p. 23-32, 10 pp. 2019

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

Links

URL

RIV identification code

RIV/00216224:14330/19:00111625

Organization unit

Faculty of Informatics

ISBN

978-80-263-1517-9

ISSN

UT WoS

000604899800003

Keywords in English

Czech Tagger; Multi-word Expressions; Pretrained Word Embeddings

Tags

International impact, Reviewed

Changed: 16 May 2022 15:20, Mgr. Michal Petr

Abstract

In the original language

We propose a new tagger for the Czech language and particularly for the tagset used for annotation of corpora of the TenTen family. The tagger is based on neural networks with pretrained word embeddings. We selected the newest Czech Web corpus of the TenTen family as training data, but we removed sentences with phenomena that were often annotated incorrectly. We let the tagger learn the annotation of these phenomena on its own. We also experimented with the recognition of multi-word expressions since this information can support the correct tagging. We evaluated the tagger on 6,950 sentences (84,023 tokens) from the cstenten17 corpus and achieved 75.25% accuracy when compared by tags. When compared by attributes, we achieved 91.62% accuracy; the accuracy of POS tag prediction is 96.5%.

Related projects

EF16_013/0001781, R&D project

Name: LINDAT/CLARIN - Research infrastructure for language technologies - extension of the repository and computing capacity

LM2015071, R&D project

Name: Language research infrastructure in the Czech Republic (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, Project LINDAT-Clarin - Establishment and operation of the Czech node of the pan-European infrastructure for research

Cite

NEVĚŘILOVÁ, Zuzana and Marie STARÁ. Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. Brno: Tribun EU, 2019, p. 23-32. ISBN 978-80-263-1517-9.

@inproceedings{1589940,
   author = {Nevěřilová, Zuzana and Stará, Marie},
   address = {Brno},
   booktitle = {Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019},
   editor = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
   keywords = {Czech Tagger; Multi-word Expressions; Pretrained Word Embeddings},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Brno},
   isbn = {978-80-263-1517-9},
   pages = {23-32},
   publisher = {Tribun EU},
   title = {Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora},
   url = {https://nlp.fi.muni.cz/raslan/2019/paper10-neverilova.pdf},
   year = {2019}
}

TY  - JOUR
ID  - 1589940
AU  - Nevěřilová, Zuzana
AU  - Stará, Marie
PY  - 2019
TI  - Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora
PB  - Tribun EU
CY  - Brno
SN  - 9788026315179
KW  - Czech Tagger
KW  - Multi-word Expressions
KW  - Pretrained Word Embeddings
UR  - https://nlp.fi.muni.cz/raslan/2019/paper10-neverilova.pdf
L2  - https://nlp.fi.muni.cz/raslan/2019/paper10-neverilova.pdf
N2  - We propose a new tagger for the Czech language and particularly for the tagset used for annotation of corpora of the TenTen family. The tagger is based on neural networks with pretrained word embeddings. We selected the newest Czech Web corpus of the TenTen family as training data, but we removed sentences with phenomena that were often annotated incorrectly. We let the tagger learn the annotation of these phenomena on its own. We also experimented with the recognition of multi-word expressions since this information can support the correct tagging. We evaluated the tagger on 6,950 sentences (84,023 tokens) from the cstenten17 corpus and achieved 75.25% accuracy when compared by tags. When compared by attributes, we achieved 91.62% accuracy; the accuracy of POS tag prediction is 96.5%.
ER  -

NEVĚŘILOVÁ, Zuzana and Marie STARÁ. Neural Tagger for Czech Language: Capturing Linguistic Phenomena in Web Corpora. In Aleš Horák, Pavel Rychlý, Adam Rambousek. \textit{Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019}. Brno: Tribun EU, 2019, p.~23-32. ISBN~978-80-263-1517-9.
