Finding Terms in Corpora for Many Languages with the Sketch Engine

KILGARRIFF, Adam, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL. Finding Terms in Corpora for Many Languages with the Sketch Engine. Online. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden: The Association for Computational Linguistics, 2014, pp. 53-56. ISBN 978-1-937284-75-6.


Basic information
Original title	Finding Terms in Corpora for Many Languages with the Sketch Engine
Authors	KILGARRIFF, Adam (826 United Kingdom of Great Britain and Northern Ireland), Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution), Vojtěch KOVÁŘ (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, belonging to the institution).
Edition	Gothenburg, Sweden, Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 53-56, 4 pp. 2014.
Publisher	The Association for Computational Linguistics

Other information
Original language	English
Type of outcome	Proceedings paper
Field of study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality	not subject to a state or trade secret
Publication form	electronic version available online
WWW	Full text of the result
RIV code	RIV/00216224:14330/14:00075387
Organization unit	Faculty of Informatics
ISBN	978-1-937284-75-6
Keywords in English	terminology; terms; corpora; sketch engine
Tags	best
Flags	International impact, Reviewed
Changed by	Changed by: RNDr. Vít Suchomel, Ph.D., učo 139723. Changed: 29. 10. 2014 09:19.

Abstract

Term candidates for a domain, in a language, can be found by
• taking a corpus for the domain, and a reference corpus for the language
• identifying the grammatical shape of a term in the language
• tokenising, lemmatising and POS-tagging both corpora
• identifying (and counting) the items in each corpus which match the grammatical shape
• for each item in the domain corpus, comparing its frequency with its frequency in the reference corpus.

Then, the items with the highest frequency in the domain corpus in comparison to the reference corpus will be the top term candidates.

None of the steps above are unusual or innovative for NLP (see, e.g., Aker et al., 2013; Gojun et al., 2012). However, it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non-programmers to find the terms in a domain. This is what we have done in the Sketch Engine (Kilgarriff et al., 2004), and will demonstrate. In this abstract we describe how we addressed each of the stages above.
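
The final comparison step can be illustrated with a minimal sketch. The abstract does not say which statistic is used to compare the two frequencies, so the smoothed ratio of per-million frequencies below is only an assumed, illustrative choice. The function names, the `smoothing` parameter and the toy counts are all hypothetical, and the counts stand in for items that have already been tokenised, lemmatised, POS-tagged and matched against the grammatical shape of a term.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def rank_term_candidates(domain_counts, domain_size, ref_counts, ref_size,
                         smoothing=1.0):
    """Rank items of the domain corpus by relative frequency.

    The score is an assumed smoothed ratio of normalised frequencies:
    (domain_fpm + smoothing) / (reference_fpm + smoothing).
    Smoothing avoids division by zero for items absent from the
    reference corpus.
    """
    scores = {}
    for item, count in domain_counts.items():
        domain_fpm = per_million(count, domain_size)
        ref_fpm = per_million(ref_counts.get(item, 0), ref_size)
        scores[item] = (domain_fpm + smoothing) / (ref_fpm + smoothing)
    # Highest score first: these are the top term candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy counts (hypothetical) for items matching the term grammar.
domain_counts = Counter({"term candidate": 40,
                         "reference corpus": 25,
                         "last year": 10})
ref_counts = Counter({"term candidate": 2,
                      "reference corpus": 5,
                      "last year": 900})

ranked = rank_term_candidates(domain_counts, 100_000, ref_counts, 1_000_000)
for item, score in ranked:
    print(f"{item}\t{score:.2f}")
```

In this toy data, "last year" is frequent in the reference corpus and so scores low, while the two domain-specific phrases rise to the top. A larger `smoothing` value would favour items that are frequent in the domain corpus over rare ones.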

Links
LM2010013, research and development project	Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)
LM2010013, research and development project	Investor: Ministry of Education, Youth and Sports of the Czech Republic, LINDAT-Clarin project - Establishment and operation of the Czech node of the pan-European research infrastructure
MUNI/A/0765/2013, MU internal code	Name: Involvement of Faculty of Informatics students in the international scientific community (Acronym: SKOMU)
MUNI/A/0765/2013, MU internal code	Investor: Masaryk University, Involvement of Faculty of Informatics students in the international scientific community, until 2020, Category A - Specific research - Student research projects
