Rap Corpus

RapCor is a small specialized corpus of French. It is developed at the Department of Romance Languages and Literatures of the Faculty of Arts, Masaryk University in Brno, under the supervision of Alena Podhorná-Polická (associate professor).
It is a corpus of spoken French in rap songs designed for the purposes of socio-lexical research. The specific nature of rap texts allows a wider understanding of substandard French, particularly the dynamics of the development of generational and ethno-socio-geographically conditioned word-formation and neology in relation to lexicography. This corpus can also serve those interested in modern poetry or sociolinguistics (especially in relation to multi-ethnic suburbs).
Current status
- last update
- ready for corpus
- number of scans in storage
- total elaborated
Introduction: What is a corpus?
The word corpus refers to a set of examined texts. As computing capacity has grown, a corpus has increasingly come to mean an electronic corpus, i.e. a collection of texts (or transcripts of sound recordings) stored and processed by computer for linguistic research. Because searching and evaluating results is now easy, it is possible to obtain much more reliable information and statistics than in the days of card files.
Electronic language corpora began to emerge with the development of computer technology in the last decades of the 20th century. Today, smaller and larger corpora exist for most of the world’s major languages; the largest aim to describe an entire national language and span several hundred million word forms. For Czech, for example, the Czech National Corpus Institute at the Faculty of Arts of Charles University in Prague builds the Czech National Corpus (Český Národní Korpus, ČNK), which consists of several subcorpora of written and spoken texts (see www.korpus.cz). For French, the largest corpus, Frantext, is developed at the University of Nancy and consists mostly of literary texts. There are also several smaller corpora, including corpora of spoken French such as ESLO and Clapi.
The RapCor corpus
RapCor was established in 2009 as part of the Czech Science Foundation postdoctoral project "Expressivity in Youth Slang on the Background of the Quest for Individual and Group Identity" (GP405/09/P307). The collection and primary editing of the source material are carried out in cooperation with students of French, who take the lyrics of selected French rap songs either from fan transcripts available on the Internet or (currently the preferred source) directly from the original lyrics on the CD covers, where these are provided.
The lyrics are then checked against the sound recordings and corrected where they differ, so that the rapped text is transcribed accurately. Using the TreeTagger program, the lyrics are then automatically segmented into individual words, which are lemmatized (converted to their base form, i.e. the lemma) and assigned grammatical category tags. Because neologisms and substandard expressions are so frequent, the output must be completed manually, and the automatically assigned grammatical categories are checked as well. Substandard terms are marked both lexicographically, according to the usage label attributed to them in the reference dictionary (words of the spoken language or vulgarisms), and according to their word-formation. Le Petit Robert Électronique serves as the reference dictionary from which the labels of substandard expressions are taken. It also serves as a differential dictionary for identifying neologisms and lexemes that it omits. Le Petit Larousse électronique is used as a differential dictionary for proper nouns.
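The automatic step followed by manual completion can be illustrated with a minimal Python sketch. It is not the project's actual tooling. It assumes TreeTagger's usual tab-separated output (word form, tag, lemma, with `<unknown>` as the lemma of words missing from its lexicon) and flags the tokens an annotator would need to look at. The sample output is hand-written for illustration, and `reference_lemmas` is a hypothetical stand-in for a headword list taken from the reference dictionary.

```python
from typing import NamedTuple

# TreeTagger prints this lemma for words that are missing from its lexicon.
UNKNOWN_LEMMA = "<unknown>"


class Token(NamedTuple):
    form: str   # word form as it appears in the lyrics
    pos: str    # grammatical category tag
    lemma: str  # base form, or UNKNOWN_LEMMA


def parse_treetagger(output: str) -> list[Token]:
    """Parse tab-separated tagger output: one 'form<TAB>tag<TAB>lemma' per line."""
    tokens = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a token row
        tokens.append(Token(*parts))
    return tokens


def needs_manual_review(tokens, reference_lemmas):
    """Return tokens whose lemma is unknown or absent from the reference list."""
    return [
        t for t in tokens
        if t.lemma == UNKNOWN_LEMMA or t.lemma.lower() not in reference_lemmas
    ]


# Hand-written sample in the assumed format (not real RapCor data).
sample = "Je\tPRO:PER\tje\nkiffe\tVER:pres\t<unknown>\nle\tDET:ART\tle\nrap\tNOM\trap\n"

# Hypothetical headword list standing in for the reference dictionary.
reference_lemmas = {"je", "le", "rap"}

tokens = parse_treetagger(sample)
flagged = needs_manual_review(tokens, reference_lemmas)
for t in flagged:
    print(t.form, t.pos, t.lemma)
```

In this sketch a token is flagged either because the tagger could not lemmatize it or because its lemma is missing from the reference list, which loosely mirrors the differential use of the dictionary described above.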
The annotated table of morphosyntactic tags, lemmas and other information about the song and the performer of the song or of its part is finally converted into an HTML file provided with metadata from an associated database, thanks to the technical assistance of Marek Stehlík of the Computer Systems Unit of the Faculty of Informatics of Masaryk University (CVT FI MU). The current and older files of all elaborated texts can then be downloaded from our Storage and imported into the TXM lexicometric program.
The latest version can also be used in the client application Sketch Engine (a corpus manager and text-analysis tool; it is licensed software, available free of charge to FF MU students and, until March 2022, to all academics). Its co-author is Pavel Rychlý from the Department of Machine Learning and Data Processing of the Faculty of Informatics of Masaryk University. The oldest version of the corpus used his earlier products (the corpus manager Manatee and the client application Bonito).