KOVÁŘ, Vojtěch, Monika MOČIARIKOVÁ and Pavel RYCHLÝ. Finding Definitions in Large Corpora with Sketch Engine. In Nicoletta Calzolari (Conference Chair) et al. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016, p. 391-394. ISBN 978-2-9517408-9-1.
Basic information
Original name Finding Definitions in Large Corpora with Sketch Engine
Authors KOVÁŘ, Vojtěch (203 Czech Republic, guarantor, belonging to the institution), Monika MOČIARIKOVÁ (703 Slovakia, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution).
Edition Portorož, Slovenia, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), p. 391-394, 4 pp. 2016.
Publisher European Language Resources Association (ELRA)
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher France
Confidentiality degree is not subject to a state or trade secret
Publication form storage medium (CD, DVD, flash disk)
RIV identification code RIV/00216224:14330/16:00088334
Organization unit Faculty of Informatics
ISBN 978-2-9517408-9-1
Keywords in English Sketch Engine; definition; definitions; CQL; corpora
Tags firank_B
Changed by doc. Mgr. Pavel Rychlý, Ph.D., učo 3692. Changed: 20/12/2016 13:55.
Abstract
The paper describes automatic definition finding implemented within the leading corpus query and management tool, Sketch Engine. The implementation exploits complex pattern-matching queries in the corpus query language (CQL) and the indexing mechanism of word sketches for finding and storing definition candidates throughout the corpus. The approach is evaluated for Czech and English corpora, showing that the results are usable in practice: precision of the tool ranges between 30 and 75 percent (depending on the major corpus text types), and we were able to extract nearly 2 million definition candidates from an English corpus with 1.4 billion words. The feature is embedded into the interface as a concordance filter, so that users can search for definitions of any query to the corpus, including very specific multi-word queries. The results also indicate that ordinary texts (unlike explanatory texts) contain a rather low number of definitions, which is perhaps the most important problem with automatic definition finding in general.
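As a rough illustration of the idea summarised in the abstract (pattern-based extraction of definition candidates, filtering by a user query, and precision as the evaluation measure), a minimal Python sketch follows. It is not the CQL implementation from the paper: the regular expression, the function names `find_definition_candidates` and `precision`, and the single "X is a Y" pattern are all assumptions made for this example.

```python
import re

# Hypothetical surface pattern: "<term> is/are a/an/the <rest>".
# Illustrative only; the paper uses CQL queries over an annotated corpus,
# not a regular expression over raw sentences.
DEFINITION_PATTERN = re.compile(
    r"^(?P<term>[A-Z][\w-]*(?: [\w-]+){0,3}) (?:is|are) (?:a|an|the) (?P<definiens>.+)$"
)


def find_definition_candidates(sentences, query=None):
    """Return (term, sentence) pairs for sentences matching the pattern.

    If query is given, keep only candidates whose term contains it
    (case-insensitive), loosely mimicking a filter on search results.
    """
    candidates = []
    for sentence in sentences:
        match = DEFINITION_PATTERN.match(sentence.strip().rstrip("."))
        if match is None:
            continue
        term = match.group("term")
        if query is None or query.lower() in term.lower():
            candidates.append((term, sentence))
    return candidates


def precision(candidates, is_true_definition):
    """Share of candidates judged to be real definitions (0.0 if none)."""
    if not candidates:
        return 0.0
    correct = sum(1 for candidate in candidates if is_true_definition(candidate))
    return correct / len(candidates)
```

A sentence such as "The weather is the worst today." matches this pattern without being a definition. False positives of this kind are why candidates have to be judged by hand before precision can be reported.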
Links
GA15-13277S, research and development project. Name: Hyperintensionální logika pro analýzu přirozeného jazyka (Hyperintensional logic for natural language analysis)
Investor: Czech Science Foundation
7F14047, research and development project. Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
Investor: Ministry of Education, Youth and Sports of the CR