Longest-commonest Match

D 2015

KILGARRIFF, Adam; Vít BAISA; Miloš JAKUBÍČEK and Pavel RYCHLÝ

Basic information

Original title

Longest-commonest Match

Authors

KILGARRIFF, Adam (826 United Kingdom of Great Britain and Northern Ireland); Vít BAISA (203 Czech Republic, guarantor, belonging to the institution); Miloš JAKUBÍČEK (203 Czech Republic, belonging to the institution) and Pavel RYCHLÝ (203 Czech Republic, belonging to the institution)

Edition

Ljubljana, Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. pp. 397-404, 8 pp. 2015

Publisher

Trojina, Institute for Applied Slovene Studies

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Slovenia

Confidentiality

is not subject to a state or trade secret

Publication form

electronic version available online

References

URL

RIV identification code

RIV/00216224:14330/15:00080952

Organization unit

Faculty of Informatics

ISBN

978-961-93594-3-3

Keywords in English

multiword expression; collocation; word sketch; Sketch Engine

Tags

International impact, Reviewed

Changed: 6 January 2016 11:35, Mgr. et Mgr. Vít Baisa, Ph.D.

Abstract

In the original language

Finding two-word collocations is a well-studied task within natural language processing. The result of this task for a given headword is usually a list of collocations sorted by a salience score. In corpus manager Sketch Engine, these pairs are extracted from data using a word sketch grammar relation rules and log-dice statistics resulting in a sorted list of triples. The longest–commonest match is a straightforward extension of these two-word collocations into multiword expressions. The resulting expressions are also very useful for representing the most common realisation of the collocational pair and to facilitate the interpretation of the raw triplet because sometimes, for such a triple, it is not clear from what texts it comes. We present here an algorithm behind the longest–commonest match together with a simple evaluation. The longest–commonest match is already implemented in Sketch Engine.
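The abstract mentions that collocation candidates are ranked by log-dice statistics. As a rough illustration only, the sketch below uses the commonly cited logDice definition, 14 + log2(2·f_xy / (f_x + f_y)); this formula is not stated in this record and is assumed here. The `Triple` fields, the function names and all counts are hypothetical and are not taken from the paper or from Sketch Engine.

```python
import math
from typing import List, NamedTuple


class Triple(NamedTuple):
    """Hypothetical record for one collocation candidate (field names are assumptions)."""
    headword: str
    relation: str
    collocate: str
    f_xy: int  # co-occurrence count of headword and collocate in the relation
    f_x: int   # count of the headword
    f_y: int   # count of the collocate


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Commonly cited logDice score (assumed definition): 14 + log2(2*f_xy / (f_x + f_y))."""
    if f_xy <= 0 or f_x + f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_by_salience(triples: List[Triple]) -> List[Triple]:
    """Sort candidates from most to least salient."""
    return sorted(triples, key=lambda t: log_dice(t.f_xy, t.f_x, t.f_y), reverse=True)


# Invented counts, for illustration only.
candidates = [
    Triple("tea", "modifier", "green", f_xy=50, f_x=1000, f_y=4000),
    Triple("tea", "modifier", "herbal", f_xy=40, f_x=1000, f_y=200),
]
ranked = rank_by_salience(candidates)
```

With these invented counts, the rarer but more exclusive collocate ("herbal") ranks above the more frequent one ("green"), which is the behaviour a salience score is meant to capture.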

Links

GA15-13277S, research and development project

Name: Hyperintensional logic for natural language analysis

Investor: Czech Science Foundation, Hyperintensional logic for natural language analysis

LM2010013, research and development project

Name: LINDAT-CLARIN: Institute for analysis, processing and distribution of linguistic data (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT-Clarin project - Building and operation of the Czech node of the pan-European infrastructure for research

7F14047, research and development project

Name: Harvesting big text data for under-resourced languages (Acronym: HaBiT)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, Harvesting big text data for under-resourced languages

Cite

KILGARRIFF, Adam; Vít BAISA; Miloš JAKUBÍČEK and Pavel RYCHLÝ. Longest-commonest Match. Online. In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana: Trojina, Institute for Applied Slovene Studies, 2015, pp. 397-404. ISBN 978-961-93594-3-3.

@inproceedings{1308616,
   author = {Kilgarriff, Adam and Baisa, Vít and Jakubíček, Miloš and Rychlý, Pavel},
   address = {Ljubljana},
   booktitle = {Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom.},
   editor = {Kosem, I. and Jakubíček, M. and Kallas, J. and Krek, S.},
   keywords = {multiword expression; collocation; word sketch; Sketch Engine},
   howpublished = {electronic version available online},
   language = {eng},
   location = {Ljubljana},
   isbn = {978-961-93594-3-3},
   pages = {397-404},
   publisher = {Trojina, Institute for Applied Slovene Studies},
   title = {Longest-commonest Match},
   url = {https://elex.link/elex2015/proceedings/eLex_2015_26_Kilgarriff+etal.pdf},
   year = {2015}
}

TY  - CONF
ID  - 1308616
AU  - Kilgarriff, Adam
AU  - Baisa, Vít
AU  - Jakubíček, Miloš
AU  - Rychlý, Pavel
PY  - 2015
TI  - Longest-commonest Match
PB  - Trojina, Institute for Applied Slovene Studies
CY  - Ljubljana
SN  - 9789619359433
KW  - multiword expression
KW  - collocation
KW  - word sketch
KW  - Sketch Engine
UR  - https://elex.link/elex2015/proceedings/eLex_2015_26_Kilgarriff+etal.pdf
L2  - https://elex.link/elex2015/proceedings/eLex_2015_26_Kilgarriff+etal.pdf
N2  - Finding two-word collocations is a well-studied task within natural language processing. The result of this task for a given headword is usually a list of collocations sorted by a salience score. In corpus manager Sketch Engine, these pairs are extracted from data using a word sketch grammar relation rules and log-dice statistics resulting in a sorted list of triples. The longest–commonest match is a straightforward extension of these two-word collocations into multiword expressions. The resulting expressions are also very useful for representing the most common realisation of the collocational pair and to facilitate the interpretation of the raw triplet because sometimes, for such a triple, it is not clear from what texts it comes. We present here an algorithm behind the longest–commonest match together with a simple evaluation. The longest–commonest match is already implemented in Sketch Engine.
ER  -

KILGARRIFF, Adam; Vít BAISA; Miloš JAKUBÍČEK and Pavel RYCHLÝ. Longest-commonest Match. Online. In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. \textit{Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom}. Ljubljana: Trojina, Institute for Applied Slovene Studies, 2015, pp.~397-404. ISBN~978-961-93594-3-3.
