Automatic Web Page Classification

D 2008

MATERNA, Jiří

Basic information

Original name

Automatic Web Page Classification

Name in Czech

Automatické určení domény a klíčových slov stránky

Authors

MATERNA, Jiří (203 Czech Republic, guarantor, belonging to the institution)

Edition

Brno, Recent Advances in Slavonic Natural Language Processing, 10 pp. 2008

Publisher

Faculty of Informatics, Masaryk University

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

URL

RIV identification code

RIV/00216224:14330/08:00042213

Organization unit

Faculty of Informatics

ISBN

978-80-210-4741-9

UT WoS

000302212600014

Keywords (in Czech)

automatická klasifikace dokumentů; strojové učení; thesaurus

Keywords in English

automatic classification; machine learning; thesaurus

Abstract

In the original

The aim of this paper is to describe a method for automatically classifying web pages into semantic domains, together with its evaluation. The classification method uses machine learning algorithms and several morphological and semantic text processing tools. In contrast to general text document classification, web document classification often has to deal with short web pages. In this paper we propose two approaches to compensate for this lack of information. The first considers a wider context of a web page: we analyze the web pages referenced from the investigated page. The second is based on clustering terms by their similar grammatical context. This is done using the Sketch Engine, a statistical corpus tool.
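The term clustering idea can be illustrated with a small sketch. It assumes a term-to-cluster mapping already exists; the mapping below, the cluster names, and the function `cluster_features` are all hypothetical and are not taken from the paper or from the Sketch Engine. The point is only that terms sharing a cluster collapse into one feature, so a short page yields denser feature counts.

```python
from collections import Counter

# Hypothetical term -> cluster id mapping. In the paper such clusters come
# from similar grammatical context; these entries are invented for illustration.
CLUSTER_DICT = {
    "car": "c_vehicle",
    "truck": "c_vehicle",
    "bus": "c_vehicle",
    "goal": "c_sport",
    "match": "c_sport",
}


def cluster_features(tokens, cluster_dict=CLUSTER_DICT):
    """Count features after replacing each known term by its cluster id.

    Terms missing from the mapping are kept unchanged.
    """
    return Counter(cluster_dict.get(token, token) for token in tokens)


# A very short page: two different vehicle terms become one shared feature.
short_page = ["truck", "bus", "price"]
features = cluster_features(short_page)
```

The resulting counts could then be passed to any standard classifier; which learner the paper uses is not stated in this record.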

Czech abstract (English translation)

The aim of this work is to design and test an approach that enables automatic classification of web pages into domains and determination of a page's keywords. Page classification is based on machine learning. The main problem, however, is the small size of web pages, which makes the use of machine learning methods difficult. The work proposes two approaches that try to minimize this shortcoming. The first takes a wider context of the web page into account: pages that are located in the same internet domain and linked from the investigated page are analyzed as well. The second method clusters the terms of a document based on their similar grammatical context. For this purpose, a fairly extensive thesaurus is built, and a cluster dictionary is derived from it.
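The wider-context approach, restricted to linked pages in the same internet domain, might be sketched as follows. This is a minimal illustration using only the Python standard library; the names `LinkCollector` and `same_domain_links` and the example URLs are assumptions made here, not part of the described system, and comparing host names is only one possible reading of "same internet domain".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return absolute URLs linked from the page that share its host name."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


# Illustrative input: one internal link and one external link.
html = '<a href="/about.html">About</a> <a href="http://other.example/x">X</a>'
links = same_domain_links("http://site.example/index.html", html)
```

The text of the pages returned this way could be appended to the text of the investigated page before feature extraction.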

Links

LC536, research and development project

Name: Centrum komputační lingvistiky

Investor: Ministry of Education, Youth and Sports of the CR, Centrum komputační lingvistiky
