First-order Frequent Patterns in Text Mining

D 2005

BLAŤÁK, Jan

Basic information

Original name

First-order Frequent Patterns in Text Mining

Name in Czech

Prvořádové časté vzory v dolování v textu

Authors

BLAŤÁK, Jan (203 Czech Republic, guarantor)

Edition

1st ed. Covilha, Portugal, EPIA'05, 12th Portuguese Conference on Artificial Intelligence, p. 344-350, 7 pp. 2005

Publisher

Institute of Electrical and Electronics Engineers, Inc.

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Portugal

Confidentiality degree

is not subject to a state or trade secret

RIV identification code

RIV/00216224:14330/05:00014356

Organization unit

Faculty of Informatics

ISBN

0-7803-9365-1

UT WoS

000245387100063

Keywords in English

machine learning; first-order frequent patterns; text mining; distributed mining

Abstract

In the original

In this paper, a universal framework for mining long first-order frequent patterns in text data is presented. It consists of RAP, an ILP system for mining maximal first-order frequent patterns, and two types of redefined background knowledge. Two methods of using the generated patterns to solve text mining tasks are described: propositionalization and CBA (class-based association). A new variant of the CBA rule-based classifier is proposed. The framework is used to solve three text mining tasks: information extraction from biomedical texts, context-sensitive text correction of English, and morphological disambiguation of Czech. The distributed mining of frequent patterns is described, and its influence on mining in text is discussed. It is shown that frequent patterns used as new features for propositionalization usually provide better results than CBA.

In Czech

V tomto článku představíme nové univerzální rozhraní využívající prvořádové časté vzory pro řešení úloh dolování v textu. Sestává ze systému RAP, což je systém ILP určený pro hledání maximálních častých vzorů, a dvou typů doménové znalosti. Jsou popsány dvě metody využití nalezených vzorů pro dolování v textu: propozicionalizace a CBA. Je představena nová verze CBA klasifikátoru. Použití systému je demonstrováno na třech úlohách z dolování textu: extrakci informace z biologických textů, kontextové kontrole pravopisu a morfologické desambiguaci. Diskutujeme také přínos distribuovaného vyhledávání častých vzorů. Je ukázáno, že časté vzory použité jako nové rysy v propozicionalizaci poskytují lepší výsledky než CBA.

Links

MSM0021622418, plan (intention)

Name: DYNAMICKÁ GEOVIZUALIZACE V KRIZOVÉM MANAGEMENTU

Investor: Ministry of Education, Youth and Sports of the CR, Dynamic Geovisualisation in Crises Management
