Longest-commonest Match

KILGARRIFF, Adam, Vít BAISA, Miloš JAKUBÍČEK and Pavel RYCHLÝ. Longest-commonest Match. Online. In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana: Trojina, Institute for Applied Slovene Studies, 2015, p. 397-404. ISBN 978-961-93594-3-3.

Basic information
Original name	Longest-commonest Match
Authors	KILGARRIFF, Adam (826 United Kingdom of Great Britain and Northern Ireland), Vít BAISA (203 Czech Republic, guarantor, belonging to the institution), Miloš JAKUBÍČEK (203 Czech Republic, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution).
Edition	Ljubljana, Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. p. 397-404, 8 pp. 2015.
Publisher	Trojina, Institute for Applied Slovene Studies

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Slovenia
Confidentiality degree	is not subject to a state or trade secret
Publication form	electronic version available online
RIV identification code	RIV/00216224:14330/15:00080952
Organization unit	Faculty of Informatics
ISBN	978-961-93594-3-3
Keywords in English	multiword expression; collocation; word sketch; Sketch Engine
Tags	International impact, Reviewed
Changed by	Mgr. et Mgr. Vít Baisa, Ph.D., učo 139654. Changed: 6/1/2016 11:35.

Abstract

Finding two-word collocations is a well-studied task within natural language processing. The result of this task for a given headword is usually a list of collocations sorted by a salience score. In the corpus manager Sketch Engine, these pairs are extracted from data using word sketch grammar relation rules and the logDice statistic, resulting in a sorted list of triples. The longest-commonest match is a straightforward extension of these two-word collocations into multiword expressions. The resulting expressions are also useful for representing the most common realisation of the collocational pair and for facilitating the interpretation of the raw triple, because it is sometimes unclear what texts such a triple comes from. We present the algorithm behind the longest-commonest match together with a simple evaluation. The longest-commonest match is already implemented in Sketch Engine.
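
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and not the Sketch Engine code. The logDice formula follows its published definition, 14 + log2(2 f_xy / (f_x + f_y)). The function `longest_commonest`, its `min_share` threshold, its `max_pad` window and the input format (token list plus the positions of the two collocating words) are all assumptions made for illustration: the sketch simply returns the longest word string containing both words that occurs in at least a given share of the hits, preferring the commonest string of that length.

```python
import math
from collections import Counter


def log_dice(f_xy, f_x, f_y):
    """logDice salience score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def longest_commonest(hits, min_share=0.25, max_pad=3):
    """Illustrative sketch only; parameter names and defaults are assumptions.

    hits: list of (tokens, i, j), where tokens is a list of word forms and
    i, j are the positions of the two collocating words in that list.
    Returns the longest string containing both words that appears in at
    least min_share of the hits, or None if no string qualifies.
    """
    counts = Counter()
    for tokens, i, j in hits:
        lo, hi = min(i, j), max(i, j)
        spans = set()  # count each distinct string once per hit
        for start in range(max(0, lo - max_pad), lo + 1):
            for end in range(hi, min(len(tokens) - 1, hi + max_pad) + 1):
                spans.add(tuple(tokens[start:end + 1]))
        counts.update(spans)

    threshold = min_share * len(hits)
    candidates = [(len(s), c, s) for s, c in counts.items() if c >= threshold]
    if not candidates:
        return None
    # Longest first; among equally long strings, the commonest one.
    _, _, best = max(candidates)
    return " ".join(best)


# Toy concordance for the pair "bull" / "horns" (invented data).
hits = [
    ("she took the bull by the horns yesterday".split(), 3, 6),
    ("he took the bull by the horns".split(), 3, 6),
    ("they took the bull by the horns again".split(), 3, 6),
    ("a bull with long horns".split(), 1, 4),
]
print(longest_commonest(hits, min_share=0.5))
```

On this toy data with `min_share=0.5`, the sketch prints `took the bull by the horns`, the longest string shared by at least half of the hits. Enumerating every span is quadratic in the window size, which is acceptable for a small `max_pad` but would need a more careful strategy on a full corpus.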

Links
GA15-13277S, research and development project	Name: Hyperintensional logic for natural language analysis
GA15-13277S, research and development project	Investor: Czech Science Foundation
LM2010013, research and development project	Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)
LM2010013, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
7F14047, research and development project	Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)
7F14047, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
