Punctuation Detection with Full Syntactic Parsing

JAKUBÍČEK, Miloš and Aleš HORÁK. Punctuation Detection with Full Syntactic Parsing. Research in Computing Science, Special issue: Natural Language Processing and its Applications. Mexico: Instituto Politécnico Nacional, 2010, vol. 46, March 2010, p. 335-343. ISSN 1870-4069.


Basic information
Original name	Punctuation Detection with Full Syntactic Parsing
Name in Czech	Detekce interpunkce pomocí hloubkové syntaktické analýzy
Authors	JAKUBÍČEK, Miloš (203 Czech Republic, guarantor) and Aleš HORÁK (203 Czech Republic).
Edition	Research in Computing Science, Special issue: Natural Language Processing and its Applications, Mexico, Instituto Politécnico Nacional, 2010, 1870-4069.

Other information
Original language	English
Type of outcome	Article in a journal
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
RIV identification code	RIV/00216224:14330/10:00043533
Organization unit	Faculty of Informatics
Keywords (in Czech)	interpunkce; korektor pravopisu; syntaktická analýza; syntaktická struktura
Keywords in English	punctuation; grammar checking; parsing; syntactic analysis
Tags	International impact, Reviewed
Changed by	doc. RNDr. Aleš Horák, Ph.D., učo 1648. Changed: 10/11/2010 11:12.

Abstract

In many languages, including Czech, the correct placement of punctuation characters is governed by complex guidelines. Although those guidelines draw on morphology, syntax and semantics, state-of-the-art systems for punctuation detection and correction are limited to simple rule-based backbones. In this paper we present a syntax-based approach that uses the Czech parser synt. This parser uses an adapted chart parsing technique to build a chart structure for the sentence; synt can then process the chart and provide several kinds of output. The implemented punctuation detection technique uses synt output in the form of automatically and unambiguously extracted optimal syntactic structures from the sentence (noun phrases, verb phrases, clauses, relative clauses or inserted clauses). This feature makes it possible to obtain information about the syntactic structures relevant to expected punctuation placement. We also present experiments demonstrating that the method can cover most syntactic phenomena needed for punctuation detection or correction.

Abstract (in Czech)

Správné užívání interpunkčních znamének podléhá v mnoha jazycích, včetně češtiny, složitým pravidlům. Ačkoliv tato pravidla vycházejí z morfologie, syntaxe i sémantiky, současné aplikace pro detekci a korekci interpunkce se omezují na jednoduché pravidlové systémy. V tomto článku představujeme způsob založený na využití syntaktického analyzátoru (parseru) pro češtinu jménem synt. Tento parser používá při analýze strukturu typu chart, ze které lze dále získat různé druhy výstupů. Implementovaná technika detekce interpunkce využívá výstupu ve formě jednoznačných syntaktických struktur (jmenných a slovesných frází, jednoduchých, mj. vztažných, či obecně vložených vět). Tato funkcionalita umožňuje získání syntaktických struktur relevantních pro vkládání interpunkce. Závěrem jsou demonstrovány experimenty prokazující, že tato technika je použitelná pro pokrytí většiny syntaktických fenoménů souvisejících s detekcí interpunkce.

Links
GAP401/10/0792, research and development project	Name: Temporální aspekty znalostí a informací
GAP401/10/0792, research and development project	Investor: Czech Science Foundation
LC536, research and development project	Name: Centrum komputační lingvistiky
LC536, research and development project	Investor: Ministry of Education, Youth and Sports of the CR, Centrum komputační lingvistiky
2C06009, research and development project	Name: Prostředky tvorby komplexní báze znalostí pro komunikaci se sémantickým webem v přirozeném jazyce (Acronym: COT-SEWing)
2C06009, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
