SOJKA, Petr and David ANTOŠ. Context Sensitive Pattern Based Segmentation: A Thai Challenge. In Proceedings of EACL 2003 workshop Computational Linguistics for South Asian Languages -- Expanding Synergies with Europe. Budapest: Association for Computational Linguistics, 2003, p. 65-72. ISBN 1-932432-02-7.
Basic information
Original name Context Sensitive Pattern Based Segmentation: A Thai Challenge
Authors SOJKA, Petr (203 Czech Republic, guarantor) and David ANTOŠ (203 Czech Republic).
Edition Budapest, Proceedings of EACL 2003 workshop Computational Linguistics for South Asian Languages -- Expanding Synergies with Europe, p. 65-72, 8 pp. 2003.
Publisher Association for Computational Linguistics
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Hungary
Confidentiality degree is not subject to a state or trade secret
WWW URL of Proceedings
RIV identification code RIV/00216224:14330/03:00008605
Organization unit Faculty of Informatics
ISBN 1-932432-02-7
Keywords in English segmentation Thai competing patterns
Tags segmentation Thai competing patterns
Tags International impact, Reviewed
Changed by doc. RNDr. Petr Sojka, Ph.D., učo 2378. Changed: 13/2/2007 23:05.
Abstract
Written Thai text is a string of symbols without explicit word boundaries. We describe a method for developing a segmentation tool from a corpus of already segmented text. The methodology is based on the technology of competing patterns. A new Unicode pattern generation program, OPATGEN, is used for the learning phase. We show the feasibility of our methodology by generating patterns for Thai segmentation from the already segmented text of the Thai corpus ORCHID: the segmentation algorithm quickly reaches an F-score of 93%. Finally, we enumerate possible new applications based on the pattern technique and conclude by suggesting a general Pattern Translation Process. The technology is general and can be used for other segmentation tasks, such as phonetic and morphological segmentation, word hyphenation, sentence segmentation, and text topic segmentation, for any language.
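As a rough illustration of how competing patterns can drive segmentation, the sketch below applies Liang-style patterns (the notation familiar from TeX hyphenation) to an unsegmented string. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `parse_pattern` and `segment` are hypothetical, the toy patterns are hand-written rather than generated by OPATGEN from ORCHID, and Latin text stands in for Thai. Digits in a pattern mark inter-character positions; at each position the highest level wins, and an odd winner permits a boundary while an even one forbids it.

```python
def parse_pattern(pat):
    """Split a pattern such as 'a2t' into its letters 'at' and levels [0, 2, 0]."""
    letters = []
    levels = [0]  # levels[k] is the level at the position before letter k
    for ch in pat:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters.append(ch)
            levels.append(0)
    return "".join(letters), levels


def segment(text, patterns):
    """Segment text with competing patterns (hypothetical helper, toy scale)."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + text + "."  # '.' marks the start and end of the input
    scores = [0] * (len(padded) + 1)  # scores[i]: position before padded[i]

    # Try every substring; where a pattern matches, keep the maximum level.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = table.get(padded[start:end])
            if levels is not None:
                for offset, level in enumerate(levels):
                    pos = start + offset
                    scores[pos] = max(scores[pos], level)

    # An odd winning level between text[i-1] and text[i] means "cut here".
    pieces, last = [], 0
    for i in range(1, len(text)):
        if scores[i + 1] % 2 == 1:
            pieces.append(text[last:i])
            last = i
    pieces.append(text[last:])
    return pieces


# Toy patterns (assumed for illustration only):
#   '1t'  proposes a boundary before every 't'      (level 1, allows)
#   'a2t' overrides it between 'a' and 't'          (level 2, forbids)
#   'e1c', 't1s' propose the remaining boundaries   (level 1, allows)
toy_patterns = ["1t", "a2t", "e1c", "t1s"]
print(segment("thecatsat", toy_patterns))
```

Dropping the inhibiting pattern `a2t` from the list lets the general pattern `1t` cut inside "cat" and "sat", which is the sense in which patterns compete: broader patterns cover many cases, and higher-level patterns correct their exceptions.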
Links
MSM 143300003, plan (intention)
Name: Interakce člověka s počítačem, dialogové systémy a asistivní technologie
Investor: Ministry of Education, Youth and Sports of the CR, Human-computer interaction, dialog systems and assistive technologies