Finding Terms in Corpora for Many Languages with the Sketch Engine

D 2014

KILGARRIFF, Adam, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL

Basic information

Original title

Finding Terms in Corpora for Many Languages with the Sketch Engine

Authors

KILGARRIFF, Adam (826 Great Britain and Northern Ireland), Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution), Vojtěch KOVÁŘ (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, belonging to the institution)

Edition

Gothenburg, Sweden, Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 53-56, 4 pp. 2014

Publisher

The Association for Computational Linguistics

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality

is not subject to a state or trade secret

Publication form

electronic version available "online"

Links

Full text of the result

RIV identification code

RIV/00216224:14330/14:00075387

Organization unit

Faculty of Informatics

ISBN

978-1-937284-75-6

Keywords in English

terminology; terms; corpora; sketch engine

Tags

best

Flags

International impact, Reviewed

Changed: 29 Oct 2014 09:19, RNDr. Vít Suchomel, Ph.D.

Abstract

In the original language

Term candidates for a domain, in a language, can be found by
• taking a corpus for the domain, and a reference corpus for the language
• identifying the grammatical shape of a term in the language
• tokenising, lemmatising and POS-tagging both corpora
• identifying (and counting) the items in each corpus which match the grammatical shape
• for each item in the domain corpus, comparing its frequency with its frequency in the reference corpus.

Then, the items with the highest frequency in the domain corpus in comparison to the reference corpus will be the top term candidates. None of the steps above are unusual or innovative for NLP (see, e.g., Aker et al., 2013; Gojun et al., 2012). However, it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non-programmers to find the terms in a domain. This is what we have done in the Sketch Engine (Kilgarriff et al., 2004), and will demonstrate. In this abstract we describe how we addressed each of the stages above.
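
The steps listed in the abstract can be illustrated with a small Python sketch. It is not the Sketch Engine implementation: the function names, the ADJ* NOUN+ "grammatical shape", the three-token length limit and the add-one smoothed ratio of per-million frequencies are all assumptions made here for illustration, since the abstract only says that frequencies in the two corpora are compared. The input is assumed to be already tokenised, lemmatised and POS-tagged.

```python
from collections import Counter

# Assumed maximum length (in tokens) of a term candidate.
MAX_LEN = 3


def matches_shape(tags):
    """Return True if the POS tag sequence has the assumed shape ADJ* NOUN+."""
    i = 0
    while i < len(tags) and tags[i] == "ADJ":
        i += 1
    rest = tags[i:]
    return len(rest) >= 1 and all(tag == "NOUN" for tag in rest)


def count_candidates(tagged):
    """Count lemma n-grams (n <= MAX_LEN) whose tags match the shape.

    `tagged` is a list of (lemma, pos) pairs.
    """
    counts = Counter()
    for start in range(len(tagged)):
        for n in range(1, MAX_LEN + 1):
            window = tagged[start:start + n]
            if len(window) < n:
                break  # ran past the end of the corpus
            if matches_shape([pos for _, pos in window]):
                counts[" ".join(lemma for lemma, _ in window)] += 1
    return counts


def rank_terms(domain, reference, smoothing=1.0):
    """Rank domain items by smoothed ratio of per-million frequencies."""
    dom_counts = count_candidates(domain)
    ref_counts = count_candidates(reference)
    dom_size = max(len(domain), 1)
    ref_size = max(len(reference), 1)
    scores = {}
    for item, freq in dom_counts.items():
        dom_fpm = freq * 1_000_000 / dom_size
        ref_fpm = ref_counts[item] * 1_000_000 / ref_size
        # The smoothing constant (an assumption) avoids division by zero
        # for items that never occur in the reference corpus.
        scores[item] = (dom_fpm + smoothing) / (ref_fpm + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpora, invented for this example.
domain = [("neural", "ADJ"), ("network", "NOUN"), ("learn", "VERB"),
          ("neural", "ADJ"), ("network", "NOUN")]
reference = [("network", "NOUN"), ("be", "VERB"),
             ("big", "ADJ"), ("house", "NOUN")]

ranked = rank_terms(domain, reference)
```

In this toy run, an item that is frequent in the domain corpus but absent from the reference corpus ranks above one that is common in both, which is the behaviour the abstract describes for top term candidates.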

Related projects

LM2010013, research and development project

Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)

Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure

MUNI/A/0765/2013, internal MU code

Name: Involvement of Faculty of Informatics students in the international scientific community (Acronym: SKOMU)

Investor: Masaryk University, Involvement of Faculty of Informatics students in the international scientific community, until 2020_Category A - Specific research - Student research projects

Cite

KILGARRIFF, Adam, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL. Finding Terms in Corpora for Many Languages with the Sketch Engine. Online. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden: The Association for Computational Linguistics, 2014, pp. 53-56. ISBN 978-1-937284-75-6.

@inproceedings{1181590,
   author = {Kilgarriff, Adam and Jakubíček, Miloš and Kovář, Vojtěch and Rychlý, Pavel and Suchomel, Vít},
   address = {Gothenburg, Sweden},
   booktitle = {Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics},
   keywords = {terminology; terms; corpora; sketch engine},
   howpublished = {electronic version available "online"},
   language = {eng},
   location = {Gothenburg, Sweden},
   isbn = {978-1-937284-75-6},
   pages = {53-56},
   publisher = {The Association for Computational Linguistics},
   title = {Finding Terms in Corpora for Many Languages with the Sketch Engine},
   url = {http://aclweb.org/anthology/E/E14/E14-2014.pdf},
   year = {2014}
}

TY  - CONF
ID  - 1181590
AU  - Kilgarriff, Adam
AU  - Jakubíček, Miloš
AU  - Kovář, Vojtěch
AU  - Rychlý, Pavel
AU  - Suchomel, Vít
PY  - 2014
TI  - Finding Terms in Corpora for Many Languages with the Sketch Engine
PB  - The Association for Computational Linguistics
CY  - Gothenburg, Sweden
SN  - 9781937284756
KW  - terminology
KW  - terms
KW  - corpora
KW  - sketch engine
UR  - http://aclweb.org/anthology/E/E14/E14-2014.pdf
L2  - http://aclweb.org/anthology/E/E14/E14-2014.pdf
N2  - Term candidates for a domain, in a language, can be found by • taking a corpus for the domain, and a reference corpus for the language • identifying the grammatical shape of a term in the language • tokenising, lemmatising and POS-tagging both corpora • identifying (and counting) the items in each corpus which match the grammatical shape • for each item in the domain corpus, comparing its frequency with its frequency in the reference corpus. Then, the items with the highest frequency in the domain corpus in comparison to the reference corpus will be the top term candidates. None of the steps above are unusual or innovative for NLP (see, e.g., Aker et al., 2013; Gojun et al., 2012). However, it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non-programmers to find the terms in a domain. This is what we have done in the Sketch Engine (Kilgarriff et al., 2004), and will demonstrate. In this abstract we describe how we addressed each of the stages above.
ER  -

KILGARRIFF, Adam, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL. Finding Terms in Corpora for Many Languages with the Sketch Engine. Online. In \textit{Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics}. Gothenburg, Sweden: The Association for Computational Linguistics, 2014, pp.~53-56. ISBN~978-1-937284-75-6.
