New Czechoslovak Hyphenation Patterns, Word Lists, and Workflow*
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
sojka@fi.muni.cz
Ondřej Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
454904@mail.muni.cz
Abstract
Space- and time-effective segmentation and hyphenation of natural languages
stay at the core of every document preparation system, web browser, or mobile
rendering system. We use the unreasonable effectiveness of pattern generation
with patgen. Hyphenation patterns can also solve the dictionary
problem for closely related languages without compromise.
In this article, we show how we applied the marvelous effectiveness of patgen
for the generation of the new Czechoslovak hyphenation patterns that cover both
Czech and Slovak languages. We show that it is feasible to develop universal,
up-to-date, high-coverage and high-generalization hyphenation patterns generated
from semi-automatically prepared word lists based on actual language usage. We
evaluate the new approach and argue that the new Czechoslovak hyphenation
patterns bring significant coverage and generalization improvements, and space
savings. We share all the data, word lists, and workflow for reproducibility and
usage.
"Any respectable word processing package
includes a hyphenation facility. Those based on
an algorithm, also called logic systems, often
break words incorrectly." Major Keary in [11]
1 Introduction
Space- and time-effective segmentation and hyphenation
of natural languages stay at the core of every
document preparation system, be it TeX, a modern
web browser, or a mobile rendering system.
The Unicode Consortium supports 5,000 languages
that are still in use today. Each of these languages
is on the move. A digital typographic system that
supports Unicode and its languages in full should
support hyphenation in the form of algorithms, rules,
or patterns.
However, languages are "moving targets". Vocabulary
does change (e.g., language adopts new words).
Meanings of individual words do change in time (e.g.,
gay in English). The importance of word etymology
and segmentation does change as well. The word roz-um
(understanding), hyphenated in 1956 [7] according to
the prefix roz signaling separation and the stem um signaling
knowledge, is now perceived as a single stem rozum
(intelligence, mind). Thus word hyphenation
algorithms should also adapt from time to
time to match language usage.
* This is a significantly updated and enriched version of
a paper published in the Zpravodaj CSTUG [28].
There are essentially two quite different approaches
to hyphenation:
etymology-based The rule is to cut a word on the
border of a compound word or the boundary of
stem and affix, prefix, or negation. A typical
example is the British hyphenation rules by the
Oxford University Press [1].
phonology-based Hyphenation follows the pronunciation
of syllables and allows for much more
fluent reading. In some languages, syllabification is not followed
only near word borders, where hyphenation is forbidden.
American publishers [6] and users of the Chicago
Manual of Style [2] prefer this pragmatic
approach.
There is a trade-off between the two: one either prefers
visual highlighting of the word's meaning and etymology, as
the British do, or prefers phonology for convenient reading
across the lines.
There is high language diversity, but what is
the same is that the meaning is conveyed by syllables
of the language [15]. There is high diversity
TUGboat, Volume 42 (2021), No. 2 — Proceedings of the 2021 Annual Meeting 1
in the language spelling, but what is the same is
that the mapping from phonology to spelling is almost
lossless. There is high diversity in the language
hyphenation rules, but when phonology-based hyphenation
is preferred, the syllable definition based
on consonant and vowel segments is the same for all
languages, giving a chance to develop one universal
syllable-based segmentation algorithm.
Czech and Slovak are very close languages. Citizens
of Czechoslovakia understood both before the
state split in 1993. The syllabification and pronunciation
rules are the same. We spotted a clear trend
towards phonology-based hyphenation. The differences
in spelling are rule-based. These observations
led us to the idea of common Czechoslovak hyphenation
patterns usable for both languages.
This paper evaluates the feasibility of the development
of universal phonology-based (syllabic)
hyphenation patterns. As a case study, we describe
the development of Czechoslovak hyphenation patterns
from word lists of Czech [20, 29, 21] and Slovak
[23]. We generated new patterns from word lists
captured from the actual language used during the
last decade. Our rigorous evaluation shows the new patterns
to be superior to the current language-specific Czech and Slovak
patterns. We document our reproducible workflow
and all resources in a public repository. We conclude
by outlining further possible hyphenation pattern
developments to meet the demands of today.
"Hyphenation does not lend itself to any set of
unequivocal rules. Indeed, the many exceptions and
disagreements suggest it is all something dreamed
up at an anarchists' convention." Major Keary
in [11]
2 Syllable Segmentation Methods
The core idea is to develop shared hyphenation patterns
for phonology-based languages. If these languages
share pronunciation rules, homographs from
different languages typically do not cause problems,
as they are hyphenated the same [7, 4, 9, 32]. There
are sporadic cases where the seam of a compound
word dictates a hyphenation point contrary to phonology
(roz-um vs. ro-zum). These can be solved by
not allowing the hyphenation of the particular word
around this specific seam.
Marchand et al. [16] showed that data-driven
approaches to syllabification algorithms outperform
rule-based ones, reaching accuracy around 95% per
single language. Bartlett et al. [3] developed a machine
learning approach for automatic syllabification,
motivated by the needs of letter-to-phoneme conversion.
Trogkanis et al. [30] used conditional random
fields for word hyphenation and compared the accuracy
and other metrics with the original technique
of Liang [14]. Their comparison left aside the heuristics for
optimizing the patterns generated by patgen [8], which understated
the performance achievable by Liang's technique. A
recent study on syllabification [13] shows that even in
comparison with the latest "deep" neural approaches,
a fine-tuned patgen beats them both in
accuracy and performance.
Recently, there were attempts to tackle the
word segmentation problem in different languages by
Shao et al. [18]. The algorithm is error-prone, but it
was developed primarily for speech recognition and
language representation tasks. Due to the nonzero
error rate, its applicability to the hyphenation task
is limited. In a typesetting system, the hyphenation
algorithm must cover all exceptions and not tolerate
any errors.
We recently showed that the patgen approach
of pattern generation from word list is unreasonably
effective [25]. One can set the parameters of the
generation process so that the patterns cover 100% of
hyphenation points, and their size remains reasonably
tiny. We compressed the word list with 3,000,000
hyphenated words into 30,000 bytes of the packed
trie data structure for the Czech language. That
means achieving a compression ratio of several orders
of magnitude with 100% coverage and nearly zero
errors [25]. For a similar language such as Slovak,
the pronunciation is very similar, syllable-forming
principles are the same, and compositional rules and
prefixes are pretty close, if not identical.
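To make the pattern mechanism concrete, the following is a minimal sketch of how Liang-style patterns [14] are applied to a word: every pattern that matches a substring contributes its digits to the gaps between letters, the highest digit wins in each gap, and an odd result permits a break. The three patterns below are invented for this illustration and are not taken from the generated Czechoslovak patterns; the left_min and right_min values are likewise assumed.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'oz2u' into its letters and inter-letter levels."""
    letters, levels = [], [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)      # digit sits in the gap before the next letter
        else:
            letters.append(ch)
            levels.append(0)          # open a new gap after this letter
    return "".join(letters), levels


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Insert hyphens where the highest matching level is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."     # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)      # scores[i] is the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = table.get(padded[start:end])
            if levels is None:
                continue
            for offset, level in enumerate(levels):
                scores[start + offset] = max(scores[start + offset], level)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:        # odd level allows a break before word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


# Hypothetical toy patterns: two allow breaks, the level-2 one forbids roz-um.
TOY_PATTERNS = ["o1z", "z1u", "oz2u"]
print(hyphenate("rozum", TOY_PATTERNS))   # ro-zum
```

In this toy setting, the level-2 pattern plays the role of an exception that suppresses the etymological break roz-um in favor of the syllabic ro-zum. A production implementation would store the patterns in a packed trie rather than a dictionary.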
We have decided to verify the approach by developing
hyphenation patterns that will hyphenate both
Czech and Slovak words without errors, with only a
few missed hyphens. A missed hyphen will appear
only in words like oblít, where the meaning of the word
is needed for the decision: o-blít or ob-lít.
The clear trend, at least in the Czech hyphenation
codification books, from Haller [7] via [24] used
so far in TeX and Word, and [4], to the currently maintained
word lists in [9], reflects a gradual move from etymology
to phonology for better syllabic pronunciation
when reading hyphenated words. Context-dependent
hyphenation decisions that resolve such preferences
and meaning ambiguities are needed only
sporadically.
We needed to create lists of correctly hyphenated
Czech and Slovak words to generate these hyphenation
patterns.
3 Data Preparation
For our work, Lexical Computing CZ donated word
lists with frequencies for Czech and Slovak from the
TenTen family of corpora [10, 12]. Corpora were
crawled from the Internet within the last decade.
They contain words used in both languages.
The Czech word list was cleaned up and extended
as we described in [25, 26, 27], using the
Czech morphological analyzer majka. In contrast to
the German database, we opted for a format as simple
as possible, allowing easy word list enrichment and
editing.
For generalization of hyphenation rules by patgen,
we do not need the word list as complete as possible,
so we used only words that appeared more than
ten times. The final word list cs-all-cstenten.wls
contained 606,494 words.
For Slovak, we obtained 1,048,860 Slovak words
with a frequency higher than ten from the 2011 SkTenTen
corpora [10]. We only used words with a frequency
higher than thirty that consisted only of ISO Latin 2
characters, obtaining the file sktenten.wls with 544,609
words.
By joining both language files, we got 967,058
Czech and Slovak words in cssk-all-join.wls, of
which 106,016 were contained in the intersection of
both word lists: cssk-all-intersect.wls.
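The filtering and joining steps described in this section can be sketched as follows. This is only an illustration under stated assumptions: the frequency dictionaries are tiny invented samples, the function names are hypothetical, and the actual scripts in the repository [19] may work differently. The thresholds (more than ten for Czech, more than thirty for Slovak) and the ISO Latin 2 restriction follow the text.

```python
def fits_latin2(word):
    """Return True if every character of the word exists in ISO Latin 2."""
    try:
        word.encode("iso8859_2")
        return True
    except UnicodeEncodeError:
        return False


def frequent_words(freqs, min_count, latin2_only=False):
    """Keep words seen more than min_count times, optionally Latin 2 only."""
    return {
        word
        for word, count in freqs.items()
        if count > min_count and (not latin2_only or fits_latin2(word))
    }


# Invented sample frequencies, not taken from the TenTen corpora.
cs_freqs = {"rozum": 50, "kočka": 12, "xyz": 3}
sk_freqs = {"rozum": 80, "mačka": 40, "ľad": 31, "жук": 100, "zriedkavé": 20}

cs_words = frequent_words(cs_freqs, 10)
sk_words = frequent_words(sk_freqs, 30, latin2_only=True)

joined = cs_words | sk_words        # counterpart of cssk-all-join.wls
intersection = cs_words & sk_words  # counterpart of cssk-all-intersect.wls
print(len(joined), sorted(intersection))
```

Using sets makes the join and the intersection one-line operations and removes duplicates that occur in both languages.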
4 Pattern Development
Figure 1 illustrates the workflow of the Czechoslovak
pattern development. We have used recent, accurate
Czech patterns [25] for the hyphenation of the joint
Czech and Slovak word list. We had to fix incorrect
hyphenation points manually, typically near the prefix
and stem boundary, where the phoneme-based hyphenation
point was one character away from the seam of
the prefix or compound word: neja-traktivnější,
neja-teističtější, neje-kologičtější.
We then also hyphenated the words used in both
languages with the current Slovak patterns. Only
a few word hyphenations needed to be
corrected; we created the file sk-corrections.wlh
that contains the fixed hyphenated words. Finally,
we used them as an input to patgen with a higher
weight during the generation of the final Czechoslovak
hyphenation patterns.
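The priority logic of the weighted union can be sketched as below. Figure 1 gives the weights as join 1x, intersect 2x, and corrections 3x; everything else here is an assumption made for illustration: the function name is hypothetical, the sample hyphenations are toy data, and the rule that the hyphenation from the higher-weight list replaces the lower-weight one is our reading, not a confirmed detail of the workflow. How the weights are encoded in the patgen input file is not shown.

```python
def merge_with_priorities(weighted_lists):
    """Merge hyphenated word lists; a higher weight overrides a lower one."""
    merged = {}
    # Process lists from the lowest to the highest weight so that
    # higher-priority hyphenations overwrite earlier ones.
    for weight, hyphenated_words in sorted(weighted_lists, key=lambda item: item[0]):
        for hyphenated in hyphenated_words:
            plain = hyphenated.replace("-", "")
            merged[plain] = (hyphenated, weight)
    return merged


# Toy data for illustration only.
join = ["ro-zum", "ma-čka", "koč-ka"]
intersect = ["ro-zum"]
corrections = ["mač-ka"]

merged = merge_with_priorities([(1, join), (2, intersect), (3, corrections)])
for plain, (hyphenated, weight) in sorted(merged.items()):
    print(weight, hyphenated)
```

Each word ends up with exactly one hyphenation and the weight of the most trusted list that contains it.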
We did not pursue 100% coverage at all costs
because the source data is noisy, and we do not want
the patterns to learn all the typos and inconsistencies.
We expand on this in the Jupyter notebook [19].
Gentle readers may also find the scripts used there.
5 Evaluation
We evaluated the quality of developed patterns by
two metrics. Coverage of hyphenation points in the
training word list tells how well the patterns
predict the hyphenation points used in training. Generalization
describes how the patterns behave on unseen
data, that is, on words not available in the data used
during patgen training.
We see the coverage and generalization as a
classification task, i.e., how the patterns classify hyphenation
points in the training and testing word
lists, respectively.
5.1 Classification
For evaluation of classification, there are four numbers
in the contingency matrix that compare hyphenation
point prediction by patterns with the ground
truth expressed in the word list: true positives (TP),
true negatives (TN), false positives (FP), and false
negatives (FN). In Tables 1-4, we report:
Good sum or percentage of found hyphenation
points (TP),
Bad sum or percentage of badly suggested hyphenation
points (FP, type 1 error),
Missed sum or percentage of missed hyphenation
points (FN, type 2 error).
Type 1 errors are more severe than type 2 errors
in our hyphenation setup. Nonzero bad
results do not necessarily mean that the patterns performed
poorly. The opposite may hold: the patterns
have found a rule that the ground-truth word list does
not obey. In other words, the inconsistency needs
fixing in the underlying word list rather than by emitting
a pattern for a valid exception. We
inspected bad hyphenation points manually during
the development of the word list.
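The three reported counts can be computed by comparing hyphen positions word by word. The sketch below assumes that the ground truth and the prediction are given as parallel lists of hyphenated words; the function names and the sample words are illustrative and not taken from the repository [19].

```python
def hyphen_positions(hyphenated):
    """Return the set of letter offsets at which a hyphen occurs."""
    positions = set()
    index = 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(index)   # hyphen follows the index-th letter
        else:
            index += 1
    return positions


def score(truth_words, predicted_words):
    """Count good (TP), bad (FP), and missed (FN) hyphenation points."""
    good = bad = missed = 0
    for truth, predicted in zip(truth_words, predicted_words):
        expected = hyphen_positions(truth)
        found = hyphen_positions(predicted)
        good += len(expected & found)
        bad += len(found - expected)
        missed += len(expected - found)
    return good, bad, missed


# Toy ground truth and toy predictions for illustration.
truth = ["ro-zum", "nej-a-trak-tiv-něj-ší"]
predicted = ["roz-um", "ne-ja-trak-tiv-něj-ší"]
print(score(truth, predicted))   # (4, 2, 2)
```

A hyphen placed one letter away from the correct point counts twice, once as bad and once as missed, which is why such near misses deserve manual inspection.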
5.2 Generalization
We used tenfold cross-validation to assess the generalization
properties, leaving one-tenth out of the
training set to evaluate the patterns' effectiveness on
unseen words. We show results in Table 5. The evaluation
metrics slightly differ with different patgen
parameters, with the best results achieved when we
maximize the coverage of the training set.
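The splitting step of tenfold cross-validation can be sketched as follows. Only the partitioning is shown; training patterns with patgen on each training part and scoring the held-out part happen outside this snippet. The function name and the interleaved assignment of words to folds are assumptions of this illustration.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]   # interleaved assignment to folds
    for held_out in range(k):
        test = folds[held_out]
        train = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield train, test


words = [f"word{n}" for n in range(25)]       # placeholder word list
for train, test in k_fold_splits(words):
    assert not set(train) & set(test)         # unseen words stay unseen
print(sum(len(test) for _, test in k_fold_splits(words)))   # 25
```

For a word list sorted alphabetically, the interleaved split keeps morphologically related neighbors in different folds more often than a contiguous split would; shuffling before splitting is another common choice.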
The achieved results show that both evaluation
metrics are close to perfection. We can either opt for
perfect coverage and reach it or push to maximize
generalization qualities and performance on unseen
words. In the first case, we essentially do lossless
compression of the word list's hyphenation points by the
developed patterns. In the second, we miss
less than 1% of valid hyphenation points. Achieving
that for two languages in parallel seems like a good
result. It is feasible to continue merging additional
word lists to develop generic patterns for syllabically
hyphenated languages.
[Workflow diagram, boxes from top to bottom: cssk-all-join.wls (1,319,000 CS+SK words); cssk-all-intersect.wls (139,000 words that are in both CS and SK); patgen (as hyphenator) with cshyphen patterns; patgen (as hyphenator) with skhyph patterns; cssk-all-intersect.wlh (139,000 hyphenated words that are in both CS and SK); cssk-all-intersect.wlh (139,000 words hyphenated by Slovak patterns); cssk-all-join.wlh (1,319,000 CS+SK hyphenated words); diff and fixing badly hyphenated SK words; sk-corrections.wlh (corrected SK words from cssk-all-intersect.wlh); word lists union with added priorities (join 1x, intersect 2x, corrections 3x); cssk-all-weighted.wlh (1,319,000 hyphenated words with weights); cs-sojka-correctopt.par, cs-sojka-sizeopt.par; patgen (as pattern generator); csskhyphen.pat.]
Figure 1: The whole pattern development workflow, shown from the top:
a) Czech and Slovak word lists collection [25] and intersection; b) bootstrapping
hyphenated word lists with syllabic Czech patterns; c) checking and fixing by
deploying rarely used patgen weighting for Slovak words common with Czech ones;
d) generation of final patterns.
The whole workflow and scripts are available in the public repository [19].
Table 1: Statistics from the generation of Czechoslovak hyphenation patterns with custom parameters.
Level  Patterns  Good       Bad      Missed   Lengths  Params
1        830     2,819,833  470,649   35,908  1 3      1 3 12
2      1,590     2,748,581    3,207  107,160  2 4      1 1 5
3      2,766     2,852,334   12,197    3,407  3 6      1 2 4
4      1,285     2,851,931      986    3,810  3 7      1 4 2
Table 2: Statistics from the generation of Czechoslovak hyphenation patterns with correct optimized parameters.
Level  Patterns  Good       Bad      Missed   Lengths  Params
1      2,032     2,800,136  242,962   55,605  1 3      1 5 1
2      2,009     2,791,326   10,343   64,415  1 3      1 5 1
3      3,704     2,855,554   11,970      187  2 6      1 3 1
4      1,206     2,854,794       33      947  2 7      1 3 1
Table 3: Statistics from the generation of Czechoslovak hyphenation patterns with size optimized parameters.
Level  Patterns  Good       Bad      Missed   Lengths  Params
1        419     2,833,402  667,031   22,339  1 3      1 2 20
2      1,506     2,430,120    1,188  425,621  2 4      2 1 8
3      3,579     2,846,112   15,881    9,629  3 5      1 4 7
4      2,401     2,843,657        4   12,084  4 7      3 2 1
Table 4: Comparison of the efficiency of different approaches to hyphenating Czech
and Slovak. Note that the Czechoslovak patterns are comparable in size and quality
to single-language ones: there is only a negligible difference compared to, e.g., purely
Czech patterns.
Word list     Parameters       Good    Bad    Missed  Size   Patterns
Slovak        [5, by hand]     N/A     N/A    N/A     20 kB  2,467
Czech         correctopt [25]  99.76%  2.94%  0.24%   30 kB  5,593
Czech         sizeopt [25]     98.95%  2.80%  1.05%   19 kB  3,816
Slovak        [22, Table 1]    99.94%  0.01%  0.06%   56 kB  2,347
Czechoslovak  sizeopt          99.67%  0.00%  0.33%   40 kB  7,417
Czechoslovak  correctopt       99.99%  0.00%  0.01%   45 kB  8,231
Czechoslovak  custom           99.87%  0.03%  0.13%   32 kB  5,907
Table 5: Results of 10-fold cross-validation with the evaluated parameters show very
good generalization properties (learning on 90%, and testing on the remaining 10%).
Parameters  Good    Bad    Missed
correctopt  99.81%  0.15%  0.04%
custom      99.64%  0.22%  0.14%
sizeopt     99.41%  0.18%  0.40%
We do not know pattern performance for most
of the other available patterns as there are no word
lists to use for the evaluation and comparison.
"Esoteric Nonsense? Hyphenation is neither
anarchy nor the sole province of pedants and
pedagogues... . Used in moderation, it can make
a printed page more visually pleasing. If used
indiscriminately, it can have the opposite effect,
either putting the reader off or causing unnecessary
distraction. If the intended audience is bound
to read the work (a user manual, for example),
poor hyphenation practice may not matter. If the
author wants to attract and hold an audience, then
hyphenation needs just as careful attention as any
other aspect of presentation." Major Keary in [11]
6 Conclusion and summary
We have shown that the development of common
hyphenation patterns for several languages with similar
pronunciations is feasible. Patgen was able to
generalize hyphenation rules for both languages with
a negligible increase in the size of the generated
patterns.
The resulting Czechoslovak patterns hyphenate
Czech and Slovak much better than the former single-language
patterns, with much higher coverage, zero
error rate, and evaluated generalization. The whole
process is reproducible, documented, and available
as a Jupyter demo notebook with source code [19].
Dissemination
Current hyphenation support based on hyphenation
patterns is collected in the hyph-utf8 [17] project.
The project uses standards such as Unicode and
the IETF language tags of BCP 47. BCP 47 defines a Scope
property to identify subtags for language collections.
hyph-utf8 currently contains hyphenation patterns
for 65 different languages with an additional 9 dialect
or transliteration variants.
Our new patterns for "the Czechoslovak language"
were accepted for inclusion in the hyph-utf8
repository [17], and will be supported in the next revisions
of hyph-utf8 and polyglossia in the TeX Live
distribution. The patterns have to be loaded
all at once, in precomputed, compact form, into TeX's memory
from the format file at the start of every document
compilation, which increases start-up time. Only
LuaTeX allows loading patterns at run time, and only
for languages actually used in a document.
As suggested by TeX experts, we prefer the Czech
and Slovak \languages to be internal synonyms,
with patterns loaded only once.
Using the patterns via available libraries in many
programming languages (JavaScript, Perl, Python, C,
and more) is straightforward and makes the patterns'
usage rather versatile. Most typesetting systems and
browsers, including OpenOffice and Chrome, could
hyphenate in narrow columns of mobile devices. Most
of them, if not all, use pattern technology
and practices from the TeX community anyway.
We will support pattern dissemination in TeX
distributions and multilingual support packages. We
will tidy up available language resources with the
community of Czech and Slovak users.
Future work
We are considering the development of language-agnostic patterns
for syllabically hyphenated languages, based on available
data from CELEX [13] with our workflow and
evaluation measures. The WordPiece segmentation algorithm
[31] gives superb results in the NLP domain
for language translation, indicating that information
is conveyed via character n-grams. With universal,
syllable-based patterns, it will be possible to hyphenate
text for most syllabically hyphenated languages
even without knowing the language markup.
Another direction of research attention will be
machine-learned heuristics for setting of patgen generation
parameters, with the objectives of metrics
optimization used in the evaluation. When applied
to the languages with available word lists, it would
lead to pattern improvements for most supported
languages.
Acknowledgement
We are indebted to Don Knuth for questioning the
common properties of Czech and Slovak hyphenation
during our presentation of [25] at TUG 2019, which
has led us in this research direction. We also thank
everyone on whose shoulders we build our work, and
all who commented on our workflow, patterns,
and this paper, and discussed it at TUG 2021.
References
[1] R.E. Allen, ed. The Oxford Spelling Dictionary.
vol. II of The Oxford Library of English Usage.
Oxford University Press, 1990.
[2] Anonymous. The Chicago Manual of Style. University
of Chicago Press, Chicago, 17 edition,
Sept. 2017.
[3] S. Bartlett, G. Kondrak, C. Cherry. Automatic
Syllabification with Structured SVMs for
Letter-to-Phoneme Conversion. In Proceedings
of ACL-08: HLT, pp. 568-576, Columbus, Ohio,
June 2008. ACL. https://www.aclweb.org/anthology/P08-1065
[4] A. Bauer. Dělení slov /slovotvorba v praxi/
(Word hyphenation /practical morphology/).
Nakladatelství Olomouc, Olomouc, 1997.
[5] J. Chlebíková. Ako rozdeliť (slovo) Československo
(How to hyphenate the word Czechoslovakia).
Zpravodaj CSTUG 1(4):10-13, Apr. 1991.
10.5300/1991-4/10
[6] P.B. Gove, M. Webster. Webster's Third New
International Dictionary of the English language
Unabridged. Merriam-Webster Inc., Springfield,
Massachusetts, U.S.A, Jan. 2002.
[7] J. Haller. Jak se dělí slova (How the Words Get
Hyphenated). Státní pedagogické nakladatelství
Praha, 1956.
[8] Y. Haralambous. A Revisited Small Tutorial on
Patgen, 28 Years After. In electronic form, available
from CTAN as info/patgen2.tutorial,
Mar. 2021.
[9] Internetová jazyková příručka (Internet Language
Reference Book). https://prirucka.ujc.cas.cz/?id=135
[10] M. Jakubíček, A. Kilgarriff, et al. The TenTen
Corpus Family. In Proc. of the 7th International
Corpus Linguistics Conference (CL), pp. 125-
127, Lancaster, July 2013.
[11] M. Keary. On hyphenation - anarchy
of pedantry. PC Update, The magazine
of the Melbourne PC User Group,
2005. https://web.archive.org/web/20050310054738/http://www.melbpc.org.au/pcupdate/9100/9112article4.htm
[12] A. Kilgarriff, P. Rychlý, et al. The Sketch Engine.
In Proceedings of the Eleventh EURALEX
International Congress, pp. 105-116, Lorient,
France, 2004.
[13] J. Krantz, M. Dulin, P.D. Palma. Language-Agnostic
Syllabification with Neural Sequence
Labeling. In 18th IEEE International Conference
On Machine Learning And Applications,
ICMLA 2019, Boca Raton, FL, USA, December
16-19, 2019, M.A. Wani, T.M. Khoshgoftaar,
et al., eds., pp. 804-810. IEEE, 2019.
10.1109/ICMLA.2019.00141
[14] F.M. Liang. Word Hy-phen-a-tion by Comput-er.
Ph.D. thesis, Department of Computer
Science, Stanford University, Aug.
1983. https://www.tug.org/docs/liang/liang-thesis.pdf
[15] I. Maddieson. Syllable Structure. In The World
Atlas of Language Structures Online, M.S. Dryer,
M. Haspelmath, eds. Max Planck Institute
for Evolutionary Anthropology, Leipzig, 2013.
https://wals.info/chapter/12
[16] Y. Marchand, C.R. Adsett, R.I. Damper. Automatic
Syllabification in English: A Comparison
of Different Algorithms. Language and Speech
52(1):1-27, 2009. 10.1177/0023830908099881
[17] A. Rosendahl, M. Miklavec. TeX hyphenation
patterns. Accessed 2021-08-15. http://hyphenation.org/tex
[18] Y. Shao, C. Hardmeier, J. Nivre. Universal
Word Segmentation: Implementation and Interpretation.
Transactions of the Association
for Computational Linguistics 6:421-435, 2018.
10.1162/tacl_a_00033
[19] O. Sojka, P. Sojka. cshyphen repository. https://github.com/tensojka/cshyphen
[20] P. Sojka. Notes on Compound Word Hyphenation
in TeX. TUGboat 16(3):290-297,
1995. https://tug.org/TUGboat/tb16-3/tb48soj2.pdf
[21] P. Sojka. Hyphenation on Demand. TUGboat
20(3):241-247, 1999. https://tug.org/TUGboat/tb20-3/tb64sojka.pdf
[22] P. Sojka. Slovenské vzory dělení: čas pro
změnu? In Proceedings of SLT 2004, 4th seminar
on Linux and TeX, pp. 67-72, Znojmo, 2004.
Konvoj. https://fi.muni.cz/usr/sojka/papers/skhyp.pdf
[23] P. Sojka. Slovenské vzory dělení: čas pro
změnu? (Slovak Hyphenation Patterns: A Time
for Change?). CSTUG Bulletin 14(3-4):183-189,
2004. 10.5300/2004-3-4/183
[24] P. Sojka, P. Ševeček. Hyphenation in TeX -
Quo Vadis? In Proceedings of the TeX Users
Group 16th Annual Meeting, St. Petersburg,
1995, M. Goossens, ed., pp. 280-289, Portland,
Oregon, U.S.A., 1995. TeX Users Group.
[25] P. Sojka, O. Sojka. The Unreasonable Effectiveness
of Pattern Generation. TUGboat
40(2):187-193, 2019. https://tug.org/TUGboat/tb40-2/tb125sojka-patgen.pdf
[26] P. Sojka, O. Sojka. The Unreasonable Effectiveness
of Pattern Generation. Zpravodaj CSTUG
29(1-4):73-86, 2019. 10.5300/2019-1-4/73
[27] P. Sojka, O. Sojka. Towards Universal
Hyphenation Patterns. In Proceedings of
Recent Advances in Slavonic Natural Language
Processing, RASLAN 2019, A. Horák,
P. Rychlý, A. Rambousek, eds., pp. 63-68,
Karlova Studánka, Czech Republic, 2019. Tribun
EU. https://is.muni.cz/publication/1585259/?lang=en.
https://nlp.fi.muni.cz/raslan/2019/paper13-sojka.pdf
[28] P. Sojka, O. Sojka. Towards New Czechoslovak
Hyphenation Patterns. Zpravodaj CSTUG
30(3-4):118-126, 2020. https://cstug.cz/bulletin/pdf/2020-3-4.pdf#page=16
[29] P. Sojka, P. Ševeček. Hyphenation in
TeX - Quo Vadis? TUGboat 16(3):280-289,
1995. https://tug.org/TUGboat/tb16-3/tb48soj1.pdf
[30] N. Trogkanis, C. Elkan. Conditional Random
Fields for Word Hyphenation. In Proceedings of
the 48th Annual Meeting of the ACL, pp. 366-374,
Uppsala, Sweden, July 2010. ACL. https://www.aclweb.org/anthology/P10-1038
[31] Y. Wu, M. Schuster, et al. Google's Neural
Machine Translation System: Bridging the
Gap between Human and Machine Translation,
2016. https://paperswithcode.com/method/wordpiece
[32] Ľ. Štúr Institute of Linguistics of the Slovak
Academy of Sciences (SAS), ed. Pravidlá slovenského
pravopisu (Rules of Slovak Grammar).
Veda, publisher of SAS, Bratislava, third, updated
printing edition, 2000. https://www.juls.savba.sk/ediela/psp2000/psp.pdf