Utilizing Linguistic Resources: Theory and Practical Experience

NĚMČÍK, Václav. Utilizing Linguistic Resources: Theory and Practical Experience. In Proceedings of Recent Advances in Slavonic Natural Language Processing 2010. Brno: Masarykova Univerzita, 2010, p. 47-51. ISBN 978-80-7399-246-0.

Basic information
Original name	Utilizing Linguistic Resources: Theory and Practical Experience
Name in Czech	Využití lingvistických zdrojů: teorie a praktické zkušenosti
Authors	NĚMČÍK, Václav (203 Czech Republic, guarantor, belonging to the institution).
Edition	Brno, Proceedings of Recent Advances in Slavonic Natural Language Processing 2010, p. 47-51, 5 pp. 2010.
Publisher	Masarykova Univerzita

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/10:00051587
Organization unit	Faculty of Informatics
ISBN	978-80-7399-246-0
Keywords (in Czech)	lingvistické zdroje; korpusy; teorie; praxe
Keywords in English	linguistic resources; corpora; theory; practice
Tags	annotation, corpora, linguistic resources, practice, theory
Changed by	Mgr. Václav Němčík, učo 39616. Changed: 26/7/2021 01:21.

Abstract

The Prague Dependency Treebank (henceforth PDT) is a large collection of texts in Czech. It contains several layers of rich annotation, ranging from morphology to deep syntax. It is unique in its size and theoretical background, especially for Czech, which can be considered a small language in terms of its number of speakers. In this article, we use PDT 2.0 to demonstrate that, within real NLP systems, complex annotation can cut both ways. We present several issues that may pose problems when extracting data from PDT, and from complex structures in general, and hint at possible solutions.

Abstract (translated from Czech)

The Prague Dependency Treebank (henceforth PDT) is a large collection of texts in Czech. It contains rich annotation on several layers, from morphology to deep syntax. It is unique in both its size and its theoretical background, all the more so because it was created for Czech, which is a small language in terms of its number of speakers. In this article, we present PDT 2.0 as an example showing that annotation complexity can bring both advantages and disadvantages. We mention problems that may arise when extracting certain types of data from PDT, and from corpora with a complex annotation structure in general. We outline possible alternative approaches.

Links
LC536, research and development project	Name: Centrum komputační lingvistiky
LC536, research and development project	Investor: Ministry of Education, Youth and Sports of the CR, Centrum komputační lingvistiky
2C06009, research and development project	Name: Prostředky tvorby komplexní báze znalostí pro komunikaci se sémantickým webem v přirozeném jazyce (Acronym: COT-SEWing)
2C06009, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
