Finding Terms in Corpora for Many Languages with the Sketch Engine

KILGARRIFF, Adam, Miloš JAKUBÍČEK, Vojtěch KOVÁŘ, Pavel RYCHLÝ and Vít SUCHOMEL. Finding Terms in Corpora for Many Languages with the Sketch Engine. Online. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden: The Association for Computational Linguistics, 2014, p. 53-56. ISBN 978-1-937284-75-6.


Basic information
Original name	Finding Terms in Corpora for Many Languages with the Sketch Engine
Authors	KILGARRIFF, Adam (826 United Kingdom of Great Britain and Northern Ireland), Miloš JAKUBÍČEK (203 Czech Republic, guarantor, belonging to the institution), Vojtěch KOVÁŘ (203 Czech Republic, belonging to the institution), Pavel RYCHLÝ (203 Czech Republic, belonging to the institution) and Vít SUCHOMEL (203 Czech Republic, belonging to the institution).
Edition	Gothenburg, Sweden, Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, p. 53-56, 4 pp. 2014.
Publisher	The Association for Computational Linguistics

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
Publication form	electronic version available online
WWW	Full text of the result
RIV identification code	RIV/00216224:14330/14:00075387
Organization unit	Faculty of Informatics
ISBN	978-1-937284-75-6
Keywords in English	terminology; terms; corpora; sketch engine
Tags	best
Tags	International impact, Reviewed
Changed by	RNDr. Vít Suchomel, Ph.D., UČO 139723. Changed: 29/10/2014 09:19.

Abstract

Term candidates for a domain, in a language, can be found by
• taking a corpus for the domain, and a reference corpus for the language
• identifying the grammatical shape of a term in the language
• tokenising, lemmatising and POS-tagging both corpora
• identifying (and counting) the items in each corpus which match the grammatical shape
• for each item in the domain corpus, comparing its frequency with its frequency in the reference corpus.

Then, the items with the highest frequency in the domain corpus in comparison to the reference corpus will be the top term candidates. None of the steps above are unusual or innovative for NLP (see, e.g., Aker et al., 2013; Gojun et al., 2012). However, it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non-programmers to find the terms in a domain. This is what we have done in the Sketch Engine (Kilgarriff et al., 2004), and will demonstrate. In this abstract we describe how we addressed each of the stages above.
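The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the Sketch Engine implementation: it assumes both corpora are already tokenised, lemmatised and POS-tagged as non-empty lists of (lemma, tag) pairs, uses a hypothetical one-letter tagset (A for adjective, N for noun) and a hypothetical term shape of "adjectives or nouns followed by a head noun", and compares frequencies with a smoothed ratio of per-million frequencies. The names `TERM_SHAPE`, `MAX_LEN`, `count_candidates`, `rank_terms` and the `smoothing` parameter are all introduced here for illustration only.

```python
import re
from collections import Counter

# Hypothetical term shape over one-letter tags: any run of adjectives (A)
# or nouns (N) ending in a head noun. Real term grammars are language-specific.
TERM_SHAPE = re.compile(r"[AN]*N")
MAX_LEN = 3  # assumed upper bound on candidate length, in tokens


def count_candidates(tagged_tokens):
    """Count lemma sequences whose tag sequence matches TERM_SHAPE.

    Returns the counts and the corpus size in tokens.
    """
    lemmas = [lemma for lemma, _ in tagged_tokens]
    # Map every tag outside the toy tagset to "x" so one character = one token.
    tags = "".join(tag if tag in ("A", "N") else "x" for _, tag in tagged_tokens)
    counts = Counter()
    for start in range(len(tags)):
        for end in range(start + 1, min(start + MAX_LEN, len(tags)) + 1):
            if TERM_SHAPE.fullmatch(tags[start:end]):
                counts[" ".join(lemmas[start:end])] += 1
    return counts, len(lemmas)


def rank_terms(domain, reference, smoothing=1.0):
    """Rank domain-corpus candidates by smoothed relative frequency ratio."""
    dom_counts, dom_size = count_candidates(domain)
    ref_counts, ref_size = count_candidates(reference)
    scores = {}
    for item, freq in dom_counts.items():
        # Normalise to frequency per million tokens so corpus sizes may differ.
        dom_fpm = freq * 1_000_000 / dom_size
        ref_fpm = ref_counts.get(item, 0) * 1_000_000 / ref_size
        # The smoothing constant avoids division by zero for unseen items;
        # its value is an assumption of this sketch.
        scores[item] = (dom_fpm + smoothing) / (ref_fpm + smoothing)
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank_terms(domain, reference)` returns the candidate strings ordered from most to least domain-specific. Candidates that never occur in the reference corpus receive the highest scores, which is why some form of smoothing is needed for the comparison to be defined at all.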

Links
LM2010013, research and development project	Name: LINDAT-CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data (Acronym: LINDAT-Clarin)
LM2010013, research and development project	Investor: Ministry of Education, Youth and Sports of the CR
MUNI/A/0765/2013, internal MU code	Name: Involvement of Students of the Faculty of Informatics in the International Scientific Community (Acronym: SKOMU)
MUNI/A/0765/2013, internal MU code	Investor: Masaryk University, Category A
