Detailed Information on Publication Record

HROZA, Jiří and Jan ŽIŽKA. Mining Relevant Text Documents Using Ranking-Based k-NN Algorithms Trained by Only Positive Examples. In Znalosti 2005, sborník příspěvků. 1st ed. Ostrava: VŠB-Technická univerzita Ostrava, 2005, p. 29-40. ISBN 80-248-0755-6.

Basic information
Original name	Mining Relevant Text Documents Using Ranking-Based k-NN Algorithms Trained by Only Positive Examples
Name in Czech	Dolování relevantních textových dokumentů algoritmem k-NN trénovaným pouze pomocí pozitivních příkladů
Authors	HROZA, Jiří (203 Czech Republic, guarantor) and Jan ŽIŽKA (203 Czech Republic).
Edition	1st ed. Ostrava, Znalosti 2005, sborník příspěvků, p. 29-40, 12 pp. 2005.
Publisher	VŠB-Technická univerzita Ostrava

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
RIV identification code	RIV/00216224:14330/05:00013631
Organization unit	Faculty of Informatics
ISBN	80-248-0755-6
Keywords in English	ranking; text categorization; k-NN
Tags	k-NN, ranking, text categorization
Changed by	RNDr. Jiří Hroza, učo 3800. Changed: 2/3/2005 14:30.

Abstract
The problem of mining relevant information from large numbers of unstructured text documents is often handled with machine learning algorithms trained on both positive and negative examples prepared by an expert in a given domain. However, when only positive examples are available, the task requires algorithms adapted to that situation. A modified k-nearest neighbors algorithm, trained using only positive examples, can classify by ranking unlabeled instances according to their similarity to the training examples. This procedure retrieves a significant part of the unlabeled positive instances with high precision. The main objective is to find a method for mining relevant documents from large volumes (hundreds or thousands) of similar medical text files. Experiments and comparisons on various real data, obtained from several Internet resources and represented as bags of words, gave (under specific conditions) quite acceptable results from the precision-recall point of view.
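
The ranking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not state the similarity measure, the value of k, or the scoring rule, so cosine similarity over raw term counts, whitespace tokenization, and the mean similarity to the k nearest positive examples are all assumptions, and the names `bag_of_words`, `cosine`, and `rank_unlabeled` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as term counts (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_unlabeled(positives, unlabeled, k=2):
    """Rank unlabeled documents by similarity to positive training examples.

    Each unlabeled document is scored by the mean cosine similarity to its
    k most similar positive examples (an assumed scoring rule). No negative
    examples are used. Returns (score, document) pairs, best first.
    """
    positive_vectors = [bag_of_words(p) for p in positives]
    scored = []
    for doc in unlabeled:
        vec = bag_of_words(doc)
        sims = sorted((cosine(vec, p) for p in positive_vectors), reverse=True)
        nearest = sims[:k]
        score = sum(nearest) / len(nearest) if nearest else 0.0
        scored.append((score, doc))
    # Sort by score only, highest first; ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy data invented for illustration only.
positives = ["heart disease treatment", "treatment of heart failure"]
unlabeled = ["football match results", "new heart disease treatment options"]
for score, doc in rank_unlabeled(positives, unlabeled, k=2):
    print(f"{score:.3f}  {doc}")
```

Taking the top of such a ranking and stopping at a chosen cutoff trades recall for precision, which is the precision-recall view the abstract refers to.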

Abstract (in Czech)
Problém dolování relevantních informací z velkého množství nestrukturovaných textů je často řešen pomocí metod strojového učení, které jsou trénovány na pozitivních i negativních příkladech připravených expertem dané oblasti. Avšak pokud jsou k dispozici pouze pozitivní příklady, je třeba tyto algoritmy modifikovat. Metoda k-NN modifikovaná pro učení se pouze z pozitivních příkladů umožňuje klasifikovat neznámé dokumenty formou seřazení na základě jejich podobnosti. Tímto způsobem je možné získat dostatek relevantních dokumentů s velmi vysokou přesností. Hlavním cílem bylo nalézt metodu umožňující dolovat relevantní dokumenty z velkého množství (stovek či tisíců) podobných lékařských textů. Experimenty s reálnými datovými sadami poskytují -- za daných podmínek -- přijatelné výsledky z pohledu závislosti přesnosti na pokrytí.

Links
MSM 143300003, plan (intention)	Name: Interakce člověka s počítačem, dialogové systémy a asistivní technologie
MSM 143300003, plan (intention)	Investor: Ministry of Education, Youth and Sports of the CR, Human-computer interaction, dialog systems and assistive technologies