Finding a space- and time-effective, or even perfect, solution to the dictionary problem is an important practical and research problem whose solution may lead to a breakthrough in computation. The competing pattern technology from TeX is a special case: for a given dictionary, the word segmentation is stored in competing patterns that nevertheless generalize very well. Recently, the unreasonable effectiveness of pattern generation has been shown: hyphenation patterns can solve the dictionary problem jointly for several languages without compromise.
In this article, we study the effectiveness of patgen for supervised machine learning of Czechoslovak hyphenation patterns. We show machine learning techniques for developing competing patterns that are close to perfect. We evaluate the new approach by the improvements and space savings gained during the development and fine-tuning of the Czechoslovak hyphenation patterns.
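To illustrate how competing patterns store a word segmentation, the following is a minimal Python sketch of applying TeX-style patterns to a word. It is not the authors' code and does not use patgen or the Czechoslovak patterns. The function names (`parse_pattern`, `hyphenate`), the `left_min`/`right_min` defaults, and the tiny pattern list (the well-known "hyphenation" example from the TeX literature) are assumptions chosen for illustration only.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'he2n' into its letters and inter-letter levels.

    levels[i] is the level of the gap before letter i; the last entry is the
    gap after the final letter. Gaps without a digit get level 0.
    """
    letters = []
    levels = [0]
    for ch in pattern:
        if ch.isdigit():
            levels[-1] = int(ch)
        else:
            letters.append(ch)
            levels.append(0)
    return "".join(letters), levels


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Apply competing patterns to a word and return it with hyphens inserted.

    Every matching pattern votes on the gaps it covers; the highest level
    wins at each gap, and an odd winning level permits a break there.
    """
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."          # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)           # scores[i]: gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = table.get(padded[start:end])
            if levels:
                for offset, level in enumerate(levels):
                    position = start + offset
                    scores[position] = max(scores[position], level)
    pieces = []
    last = 0
    # The gap before word[k] is the gap before padded[k + 1].
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


# Illustrative pattern subset (an assumption, not the Czechoslovak patterns).
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

if __name__ == "__main__":
    print(hyphenate("hyphenation", PATTERNS))
```

In this sketch, odd levels propose a break and higher even levels veto it, which is the sense in which the patterns compete. For example, `1na` proposes a break before "na", but `he2n` overrides it with level 2.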
Links
LM2023062, R&D project
Name: Digital Research Infrastructure for Language Technologies, Arts and Humanities
Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities
SOJKA, Ondřej and Petr SOJKA. Towards Perfection of Machine Learning of Competing Patterns: The Use Case of Czechoslovak Patterns Development. In Recent Advances in Slavonic Natural Language Processing (RASLAN 2023). Recent Advances in Slavonic. Brno: Tribun EU, 2023, pp. 113-120. ISBN 978-80-263-1793-7.
@inproceedings{2345338,
  author = {Sojka, Ondřej and Sojka, Petr},
  address = {Brno},
  booktitle = {Recent Advances in Slavonic Natural Language Processing (RASLAN 2023)},
  edition = {Recent Advances in Slavonic},
  keywords = {dictionary problem; effectiveness; hyphenation patterns; patgen; syllabification; Czech; Slovak; Czechoslovak patterns; machine learning},
  howpublished = {printed version "print"},
  language = {eng},
  location = {Brno},
  isbn = {978-80-263-1793-7},
  pages = {113-120},
  publisher = {Tribun EU},
  title = {Towards Perfection of Machine Learning of Competing Patterns: The Use Case of Czechoslovak Patterns Development},
  url = {https://www.fi.muni.cz/usr/sojka/papers/sojka-sojka-raslan-2023.pdf},
  year = {2023}
}
TY  - JOUR
ID  - 2345338
AU  - Sojka, Ondřej
AU  - Sojka, Petr
PY  - 2023
TI  - Towards Perfection of Machine Learning of Competing Patterns: The Use Case of Czechoslovak Patterns Development
PB  - Tribun EU
CY  - Brno
SN  - 9788026317937
KW  - dictionary problem
KW  - effectiveness
KW  - hyphenation patterns
KW  - patgen
KW  - syllabification
KW  - Czech
KW  - Slovak
KW  - Czechoslovak patterns
KW  - machine learning
UR  - https://www.fi.muni.cz/usr/sojka/papers/sojka-sojka-raslan-2023.pdf
N2  - Finding a space- and time-effective, or even perfect, solution to the dictionary problem is an important practical and research problem whose solution may lead to a breakthrough in computation. The competing pattern technology from TeX is a special case: for a given dictionary, the word segmentation is stored in competing patterns that nevertheless generalize very well. Recently, the unreasonable effectiveness of pattern generation has been shown: hyphenation patterns can solve the dictionary problem jointly for several languages without compromise. In this article, we study the effectiveness of patgen for supervised machine learning of Czechoslovak hyphenation patterns. We show machine learning techniques for developing competing patterns that are close to perfect. We evaluate the new approach by the improvements and space savings gained during the development and fine-tuning of the Czechoslovak hyphenation patterns.
ER  -
SOJKA, Ondřej and Petr SOJKA. Towards Perfection of Machine Learning of Competing Patterns: The Use Case of Czechoslovak Patterns Development. In \textit{Recent Advances in Slavonic Natural Language Processing (RASLAN 2023)}. Recent Advances in Slavonic. Brno: Tribun EU, 2023, pp.~113--120. ISBN~978-80-263-1793-7.