Identifikace jazyka textu statistickými charakteristikami

BLAHUŠ, Marek. Identifikace jazyka textu statistickými charakteristikami. Uherské Hradiště: Gymnázium Uherské Hradiště, 2004, 23 pp.


Basic information
Original title	Identifikace jazyka textu statistickými charakteristikami
Title in English	Novel Investigations for N-Gram-Based Automatic Identification of Written Language
Authors	BLAHUŠ, Marek.
Edition	Uherské Hradiště, 23 pp., 2004.
Publisher	Gymnázium Uherské Hradiště

Additional information
Result type	Scholarly book
Confidentiality	not subject to state or trade secrecy
WWW	URL
Organizational unit	Faculty of Informatics
Changed by	Mgr. Marek Blahuš, učo 172464. Changed: 21. 2. 2006 21:45.

Annotation

Language Detector is a computer program that identifies the language of an unknown text by comparing its statistical characteristics, primarily the frequencies of n-grams (single letters or groups of letters). The text entered by the user is analyzed, and its language is identified by comparing its statistical characteristics with known information about languages. The program supports many of the most widely used languages, and more can easily be added through a training module. The statistical characteristics can be browsed and languages can be compared with one another. Once the language of a text has been identified, a unified user interface lets the user pass the text to one of several online machine translators, for example to translate it into English. The user interface is available in Czech, English, and the international language Esperanto, which makes it accessible to practically anyone. The program's source code is likewise freely available. A point of interest is a table included in the work that shows the similarities found between individual languages; it corresponds to the traditional system of language groups based on their origin and development.

Annotation in English

Automatic language identification is an important prerequisite for spell checking, machine translation and Web content filtering. This project proposes an N-gram-based method for improved language identification and presents a novel computer program that identifies the language of a given machine-readable text. The program searches the text for all groups of letters present in it (one to three letters long) and builds a set of these groups with their associated probabilities. Using a vector distance calculation, it then determines the closest language by comparing this set with built-in patterns for known languages. A teaching module for recognizing new languages was also designed as part of the program. When the set of probabilities for one language is compared with those of the others, the resulting similarities agree with the known genetic classification of languages. I have tested the program in various Web applications, such as machine translation and Web content filtering within the scope of the semantic Web. The results indicate an 85% identification success rate even for relatively short texts.
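
The procedure described in the annotation (n-gram probabilities for groups of one to three letters, then a vector distance to stored language patterns) can be sketched roughly as follows. This is a minimal illustration, not the code of Language Detector itself: the function names `ngram_profile`, `distance`, `train` and `identify` are hypothetical, and the choice of Euclidean distance, the single normalization across all n-gram sizes, and the restriction of n-grams to within words are assumptions that the record does not confirm.

```python
import math
from collections import Counter


def ngram_profile(text, max_n=3):
    """Return relative frequencies of letter n-grams of size 1..max_n."""
    # Assumption: keep letters only, lowercase them, and treat everything
    # else as a word separator, so n-grams never span word boundaries.
    words = "".join(c.lower() if c.isalpha() else " " for c in text).split()
    counts = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: one normalization over all n-gram sizes together.
    return {gram: k / total for gram, k in counts.items()}


def distance(p, q):
    """Euclidean distance between two sparse frequency vectors (assumed metric)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def train(samples):
    """Build one profile per language from a {language: sample text} mapping."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}


def identify(text, profiles):
    """Return the language whose stored profile is closest to the text's profile."""
    target = ngram_profile(text)
    return min(profiles, key=lambda lang: distance(target, profiles[lang]))


# Toy training data; a usable profile needs far more text per language.
samples = {
    "en": "the quick brown fox jumps over the lazy dog",
    "cs": "příliš žluťoučký kůň úpěl ďábelské ódy",
}
profiles = train(samples)
print(identify("the dog jumps over the fox", profiles))
```

Adding a language in this sketch amounts to adding one more entry to `samples` and rebuilding the profiles, which is one plausible reading of what the teaching module does. Applying `distance` to pairs of stored profiles would give a language-to-language similarity table of the kind the annotation mentions.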
