J 2021

New Czechoslovak Hyphenation Patterns, Word Lists, and Workflow

SOJKA, Petr and Ondřej SOJKA

Basic information

Original name

New Czechoslovak Hyphenation Patterns, Word Lists, and Workflow

Authors

SOJKA, Petr (203 Czech Republic, guarantor, belonging to the institution) and Ondřej SOJKA (203 Czech Republic, belonging to the institution)

Edition

TUGboat: The Communications of the TeX Users Group, San Francisco, USA, TUG, 2021, 0896-3207

Other information

Language

English

Type of outcome

Article in a professional journal

Field of Study

20206 Computer hardware and architecture

Country of publisher

United States of America

Confidentiality degree

is not subject to a state or trade secret

RIV identification code

RIV/00216224:14330/21:00122189

Organization unit

Faculty of Informatics

Keywords (in Czech)

dělení slov; generování vzorů; databáze slov; vícejazyčná sazba; slabičné algoritmy; patgen; soutěživé vzory

Keywords in English

hyphenation; pattern generation; word list database; multilingual typesetting; syllabification algorithms; patgen; competing patterns

Tags

International impact, Reviewed
Changed: 5/9/2023 11:40, doc. RNDr. Petr Sojka, Ph.D.

Abstract

In the original

Space- and time-effective segmentation and hyphenation of natural languages lie at the core of every document preparation system, web browser, or mobile rendering system. We use the unreasonable effectiveness of pattern generation with patgen. Hyphenation patterns can also solve the dictionary problem for closely related languages without compromise. In this article, we show how we applied the marvelous effectiveness of patgen to generate the new Czechoslovak hyphenation patterns, which cover both the Czech and Slovak languages. We show that it is feasible to develop universal, up-to-date, high-coverage, and high-generalization hyphenation patterns when they are generated from semi-automatically prepared word lists drawn from actual language usage. We evaluate the new approach and argue that the new Czechoslovak hyphenation patterns bring significant improvements in coverage and generalization, as well as space savings. We share all the data, word lists, and workflow for reproducibility and usage.
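
The abstract refers to hyphenation patterns without showing how such patterns are applied to a word. The following is a minimal sketch of Liang-style pattern matching, the kind of pattern that patgen produces. It is not the authors' code: the function names and the small English pattern set `TOY_PATTERNS` are assumptions chosen for illustration and are not taken from the article or from the Czechoslovak patterns.

```python
import re

# Hypothetical toy pattern set (English, illustration only).
# Digits are inter-letter weights; "." would mark a word boundary.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0  # number of letters consumed so far
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Insert hyphens where the highest competing weight is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."
    # scores[k] is the weight of the gap just before padded[k].
    scores = [0] * (len(padded) + 1)
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, w in enumerate(weights):
                    # Competing patterns: the highest weight wins.
                    scores[start + offset] = max(scores[start + offset], w)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        # The gap before word[i] is the gap before padded[i + 1].
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("hyphenation", TOY_PATTERNS))
```

Odd weights permit a break and even weights forbid one. Because the highest weight at each position wins, a more specific pattern can override a more general one, which is what the keyword "competing patterns" refers to.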

Links

MUNI/A/1573/2020, internal MU code
Name: Applied research: search, analysis, and visualization of large-scale data, natural language processing, artificial intelligence for the analysis of biomedical images.
Investor: Masaryk University
