PLIN080 Creating data set

Faculty of Arts
Autumn 2024
Extent and Intensity
0/2/0. 4 credit(s). Type of Completion: z (credit).
Taught partially online.
Teacher(s)
prof. Radek Čech, Ph.D. (lecturer)
Mgr. Helena Medková (lecturer)
Guaranteed by
prof. Radek Čech, Ph.D.
Department of Czech Language – Faculty of Arts
Contact Person: Bc. Silvie Hulewicz, DiS.
Supplier department: Department of Czech Language – Faculty of Arts
Prerequisites (in Czech)
FAKULTA ( FF ) && FORMA ( P )
Course Enrolment Limitations
The course is also open to students from fields other than those it is directly associated with.
The capacity limit for the course is 20 student(s).
Current registration and enrolment status: enrolled: 0/20, only registered: 0/20, only registered with preference (fields directly associated with the programme): 0/20
Course objectives
The course is intended for students of linguistics and computational linguistics with little or no prior knowledge of machine learning who want to gain practical skills for machine learning projects. Students will learn the fundamentals of computational natural language processing, with an emphasis on creating training and test sets for machine learning applications in linguistic research.
Learning outcomes
Students will gain practical experience with collecting data in the corpus manager Sketch Engine, creating training and test data sets, and cleaning, manipulating, and visualizing data in Python with selected libraries (pandas, re, NLTK, scikit-learn, Matplotlib, etc.).
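As a rough illustration of the data-handling skills listed above, the following minimal sketch cleans a handful of sentences, removes duplicates, and splits the result into training and test sets. It is not course material: the sample sentences, the helper names (clean_sentence, deduplicate, split_train_test), and the 80/20 ratio are assumptions made for this example, and only the Python standard library is used. In practice, pandas and scikit-learn's train_test_split cover the same steps.

```python
import random
import re


def clean_sentence(text: str) -> str:
    """Strip leftover markup tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop tags such as <s> or </s>
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(sentences: list[str]) -> list[str]:
    """Drop empty strings and case-insensitive duplicates, keeping first occurrences."""
    seen: set[str] = set()
    unique = []
    for sentence in sentences:
        key = sentence.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


def split_train_test(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of items reproducibly and return (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical raw export; real data would come from a corpus query.
raw = [
    "<s>Dnes  je hezky.</s>",
    "dnes je hezky.",
    "Zítra bude pršet.",
    "  Včera sněžilo. ",
    "<s></s>",
    "Mám rád jazykovědu.",
    "Korpus je velký.",
]

cleaned = deduplicate([clean_sentence(s) for s in raw])
train, test = split_train_test(cleaned, test_ratio=0.2)
print(len(cleaned), len(train), len(test))
```

Fixing the random seed keeps the split reproducible, so the same sentences land in the test set each time the notebook is rerun.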
Syllabus
  • 1. Introduction: Assignment overview, introduction to machine learning methods.
  • 2. Data set types: data sets according to learning tasks, research objectives in linguistics, and data set creation.
  • 3. Data preprocessing: data cleaning, duplicate removal, tokenization, lemmatization, morphological analysis, and syntactic analysis (UDPipe, Majka, and Desamb tools).
  • 4. Data annotation: Annotation scheme, inter-annotator agreement measurement.
  • 5. Linguistic data analysis: Data set statistics and visualization in graphs.
  • 6. Machine learning: Supervised and unsupervised learning, training a language model for a classification task, model evaluation, and cross-validation.
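Item 4 of the syllabus mentions measuring inter-annotator agreement. One common measure for two annotators is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below is an illustration only, not necessarily the measure used in the course: the function name cohen_kappa and the "pos"/"neg" labels are made up for this example. scikit-learn offers an equivalent in sklearn.metrics.cohen_kappa_score.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label independently,
    # based on each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten sentences by two annotators.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on 8 of 10 items, and the chance agreement is 0.5, so kappa comes out at about 0.6.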
Literature
  • GÉRON, Aurélien. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems. Third edition. Beijing: O'Reilly, 2022, xxv, 834 pp. ISBN 9781098125974.
Teaching methods
Seminar, computer practice (Google Colaboratory tool), independent work, consultation.
Assessment methods
To receive credit, students must submit two well-annotated data sets of 1,000 sentences each. Evaluation also takes into account timely homework submission and active class participation.
Language of instruction
Czech
Further comments (probably available only in Czech)
The course is taught annually.
The course is taught: every week.
Teacher's information
The course is structured to alternate between instruction and independent student work.
The course is also listed under the following terms: Autumn 2023.
  • Permalink: https://is.muni.cz/course/phil/autumn2024/PLIN080