Identifikace jazyka textu statistickými charakteristikami

BLAHUŠ, Marek. Identifikace jazyka textu statistickými charakteristikami (Novel Investigations for N-Gram-Based Automatic Identification of Written Language). Uherské Hradiště: Gymnázium Uherské Hradiště, 2004, 23 pp.

Basic information
Original name	Identifikace jazyka textu statistickými charakteristikami
Name (in English)	Novel Investigations for N-Gram-Based Automatic Identification of Written Language
Authors	BLAHUŠ, Marek.
Edition	Uherské Hradiště, 23 pp. 2004.
Publisher	Gymnázium Uherské Hradiště

Other information
Type of outcome	Book on a specialized topic
Confidentiality degree	is not subject to a state or trade secret
Organization unit	Faculty of Informatics
Changed by	Mgr. Marek Blahuš, učo 172464. Changed: 21/2/2006 21:45.

Abstract

Language Detector is a computer program designed to identify the language of an unknown text by comparing its statistical characteristics, primarily the frequencies of n-grams (single letters or groups of letters). The text entered by the user is analyzed, and its language is identified by comparing its statistical characteristics with known information about languages. The program supports many of the most widely used languages, and more can easily be added through a training module. The statistical characteristics can be viewed and languages can be compared with one another. Once the language of a text has been identified, a unified user interface allows one of the Internet machine translators to be used to translate the text, e.g. into English. The program's user interface is available in Czech, English, and the international language Esperanto, which makes it accessible to practically anyone. The program's source code is likewise freely available. A point of interest is a table included in the work that shows the similarities found between individual languages; it corresponds to the traditional system of language groups based on their origin and development.

Abstract (in English)

Automatic language identification is an important prerequisite for spell checking, machine translation, and Web content filtering. In this project, an N-gram-based method is proposed for improved language identification. Furthermore, a novel computer program is designed to identify the language of a given machine-readable text. The program processes the given text, searching it for all groups of letters present (of size one to three) in order to create a set of possible outcomes with their related probabilities. Finally, based on a vector distance calculation, the closest language is determined by comparing this set with built-in patterns for known languages. A teaching module for recognizing new languages was also designed as part of the program. When the set of probabilities for a particular language is compared with those of the other languages, the result indicates similarities consistent with the known genetic classification of languages. I have tested the program in various Web applications, such as machine translation and Web content filtering within the scope of the semantic Web. The results show an 85% language identification success rate even for relatively short texts.
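
The approach described in the abstract can be illustrated with a minimal Python sketch. It is not the Language Detector source code: the function names (`ngram_profile`, `distance`, `identify`), the per-size normalization, the restriction of n-grams to single words, and the choice of Euclidean distance are all assumptions made for illustration, since the abstract specifies only "groups of letters (size of one to three)" and "vector distance calculation".

```python
import math
from collections import Counter


def ngram_profile(text, max_n=3):
    """Return relative frequencies of letter n-grams of size 1..max_n.

    Assumptions (not taken from the original work): non-letters act as
    word separators, n-grams never span two words, and frequencies are
    normalized separately for each n-gram size.
    """
    # Lowercase letters are kept; every other character becomes a space.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    counts = Counter()
    for word in cleaned.split():
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    # Total number of n-grams observed for each size n.
    totals = Counter()
    for gram, count in counts.items():
        totals[len(gram)] += count
    return {gram: count / totals[len(gram)] for gram, count in counts.items()}


def distance(profile_a, profile_b):
    """Euclidean distance between two sparse frequency vectors (assumed metric)."""
    grams = set(profile_a) | set(profile_b)
    return math.sqrt(
        sum((profile_a.get(g, 0.0) - profile_b.get(g, 0.0)) ** 2 for g in grams)
    )


def identify(text, profiles):
    """Return the name of the stored language profile closest to the text."""
    target = ngram_profile(text)
    return min(profiles, key=lambda lang: distance(target, profiles[lang]))


if __name__ == "__main__":
    # Toy training samples; a usable profile would need far more text.
    samples = {
        "english": "the quick brown fox jumps over the lazy dog",
        "czech": "příliš žluťoučký kůň úpěl ďábelské ódy",
    }
    # "Teaching" a language here simply means storing its n-gram profile.
    profiles = {lang: ngram_profile(sample) for lang, sample in samples.items()}
    print(identify("the dog jumps over the fox", profiles))
```

Applying `distance` to every pair of stored profiles would give a pairwise similarity table of the kind the abstracts mention, although the measure used in the original work may differ.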
