Filtering Very Similar Text Documents: A Case Study

HROZA, Jiří, Jan ŽIŽKA and Aleš BOUREK. Filtering Very Similar Text Documents: A Case Study. Online. In Computational Linguistics and Intelligent Text Processing. Germany: Springer-Verlag Berlin Heidelberg, 2004. p. 511-520. ISBN 3-540-21006-7. [cited 2024-04-24]

Basic information
Original name	Filtering Very Similar Text Documents: A Case Study
Name in Czech	Filtrace velmi podobných textových dokumentů: Studie případu.
Authors	HROZA, Jiří (203 Czech Republic, guarantor), Jan ŽIŽKA (203 Czech Republic) and Aleš BOUREK (203 Czech Republic)
Edition	Germany, Computational Linguistics and Intelligent Text Processing, p. 511-520, 10 pp. 2004.
Publisher	Springer-Verlag Berlin Heidelberg

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Germany
Confidentiality degree	is not subject to a state or trade secret
RIV identification code	RIV/00216224:14330/04:00009948
Organization unit	Faculty of Informatics
ISBN	3-540-21006-7
UT WoS	000189417900064
Keywords in English	machine learning; text categorization; text filtration; text similarity
Tags	machine learning, text categorization, text filtration, text similarity
Changed by	doc. Ing. Jan Žižka, CSc., učo 2431. Changed: 21/1/2005 18:31.

Abstract

This paper describes problems with the classification and filtration of similar relevant and irrelevant real medical documents from one very specific domain, obtained from Internet resources. Besides being similar, the document sets are often unbalanced: there is a lack of irrelevant documents for training. A definition of similarity is suggested. For the classification, six algorithms are tested from the document-similarity point of view. The best results are provided by a back-propagation neural network and by a support vector machine with a radial basis function kernel.

Abstract (translated from Czech)

The article describes problems with the classification and filtration of similar relevant and irrelevant real text documents from one very specific domain, obtained from Internet sources. Besides being similar, the documents are often unbalanced: there is a lack of irrelevant documents for training. A definition of similarity is proposed. Classification was tested with six algorithms from the point of view of text similarity. The best results were provided by back-propagation neural networks and by support vector machines with radial basis functions.

Links
MSM 143300003, plan (intention)	Name: Interakce člověka s počítačem, dialogové systémy a asistivní technologie
MSM 143300003, plan (intention)	Investor: Ministry of Education, Youth and Sports of the CR, Human-computer interaction, dialog systems and assistive technologies
