SOJKA, Petr. Segmentation from 97% to 100%: Is It Time for Some Linguistics? In Aleš Horák, Pavel Rychlý. Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012. 1st ed. Brno: Tribun EU, 2012, p. 121--131. ISBN 978-80-263-0313-8.
Basic information
Original name Segmentation from 97% to 100%: Is It Time for Some Linguistics?
Name in Czech Segmentace z 97% na 100%: není čas pro trochu lingvistiky?
Authors SOJKA, Petr (203 Czech Republic, guarantor, belonging to the institution).
Edition 1st ed. Brno, Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, p. 121--131, 11 pp. 2012.
Publisher Tribun EU
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form printed version "print"
WWW Slides; full paper in PDF; Workshop web page
RIV identification code RIV/00216224:14330/12:00062085
Organization unit Faculty of Informatics
ISBN 978-80-263-0313-8
Keywords (in Czech) soutěživé vzory;segmentace;dělení slov;NP úplné problémy;generování vzorů;patgen;kontextově závislé vzory;strojové učení;jazykové inženýrství;EuDML
Keywords in English competing patterns;segmentation;hyphenation;NP problems;pattern generation;patgen;context-sensitive patterns;machine learning;natural language engineering;EuDML
Tags International impact, Reviewed
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 23/4/2013 07:21.
Abstract
Many tasks in natural language processing (NLP) require \emph{segmentation} algorithms: segmentation of paragraphs into sentences, segmentation of sentences into words (needed in languages like Chinese or Thai), and segmentation of words into syllables (\emph{hyphenation}) or into morphological parts (e.g.\ getting the word stem for indexing). Many other tasks (e.g.\ tagging) can also be formulated as segmentation problems. We evaluate the methodology of using \emph{competing patterns} for these tasks and determine the complexity of creating space-optimal (minimal) patterns that implement the segmentation task completely (100\,\%). We formally define this task and prove that it is in the class of \emph{non-polynomial} optimization problems. However, finding space-efficient competing patterns for real NLP tasks is feasible and gives efficient, scalable solutions to the segmentation task: segmentation is done in \emph{constant} time with respect to the size of the segmented dictionary. Constant-time access to segmentations makes competing patterns an attractive data structure for many NLP tasks.
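To illustrate how competing patterns can drive segmentation, here is a minimal sketch in the style of Liang's hyphenation patterns. It is not taken from the paper: the function names `parse_pattern` and `segment` and the three patterns in `TOY_PATTERNS` are hypothetical, chosen only for this example. The sketch assumes the usual convention that odd levels permit a break, even levels forbid one, and the highest level at each position wins.

```python
def parse_pattern(pattern):
    """Split a pattern such as '2tt' into letters 'tt' and gap levels [2, 0, 0]."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    levels = [0] * (len(letters) + 1)
    pos = 0  # index of the gap in front of the next letter
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)
        else:
            pos += 1
    return letters, levels


def segment(word, patterns):
    """Segment `word` using competing patterns (illustrative sketch only)."""
    table = dict(parse_pattern(p) for p in patterns)
    max_len = max(len(key) for key in table)
    padded = "." + word + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)  # scores[i] is the gap before padded[i]

    # Try every substring up to the longest pattern; the work depends on the
    # word length and the pattern length, not on how many words were used to
    # build the pattern set.
    for start in range(len(padded)):
        for end in range(start + 1, min(len(padded), start + max_len) + 1):
            levels = table.get(padded[start:end])
            if levels is not None:
                for offset, level in enumerate(levels):
                    # Patterns compete: the highest level at a gap wins.
                    scores[start + offset] = max(scores[start + offset], level)

    pieces, last = [], 0
    for j in range(1, len(word)):
        if scores[j + 1] % 2 == 1:  # odd level: a break is allowed here
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces


# Hypothetical toy patterns, not from any real pattern set:
# '1m' and '1t' allow a break before m and t; '2tt' forbids one before 'tt'.
TOY_PATTERNS = ["1m", "1t", "2tt"]

print("-".join(segment("segmentation", TOY_PATTERNS)))
print("-".join(segment("pattern", TOY_PATTERNS)))
```

In "pattern", the general pattern `1t` proposes a break before each `t`, while the more specific `2tt` overrides it before the first `t`, which is the kind of competition between covering and inhibiting patterns that the abstract refers to. A real pattern set would be generated from a dictionary (for example with patgen) rather than written by hand.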
Links
LA09016, research and development project. Name: Participation of the Czech Republic in the European Research Consortium for Informatics and Mathematics (ERCIM) (Acronym: ERCIM)
Investor: Ministry of Education, Youth and Sports of the CR, Czech Republic membership in the European Research Consortium for Informatics and Mathematics
250503, internal MU code. Name: The European Digital Mathematics Library (Acronym: EuDML)
Investor: European Union