Education and Information Technologies
https://doi.org/10.1007/s10639-024-12480-x

Using data clustering to reveal trainees' behavior in cybersecurity education

Karolína Dočkalová Burská1 • Jakub Rudolf Mlynářík1 • Radek Ošlejšek1

Received: 21 April 2023 / Accepted: 9 January 2024
© The Author(s) 2024

Abstract
In cybersecurity education, hands-on training is a common type of exercise that helps raise awareness and competence and improves students' cybersecurity skills. To be able to measure the impact of the design of particular courses, the designers need methods that can reveal hidden patterns in trainee behavior. However, the support of designers in performing such analytic and evaluation tasks is ad hoc and insufficient. With unsupervised machine learning methods, we designed a tool for clustering trainee actions that can exhibit their strategies or help pinpoint flaws in the training design. By using the k-means++ algorithm, we explore clusters of trainees that unveil their specific behavior within the training sessions. The final visualization tool consists of views with scatter plots and radar charts. The former provides a two-dimensional correlation of selected trainee actions and displays their clusters. In contrast, the radar chart displays distinct clusters of trainees based on their more specific strategies or approaches when solving tasks. Through iterative training redesign, the tool can help designers identify improper training parameters and improve the quality of the courses accordingly. To evaluate the tool, we performed a qualitative evaluation of its outcomes with cybersecurity experts. The results confirm the usability of the selected methods in discovering significant trainee behavior. Our insights and recommendations can be beneficial for the design of tools for educators, even beyond cybersecurity.

Keywords Visual analytics • Clustering analysis • Hands-on learning • Visualization

Karolína Dočkalová Burská burska@mail.muni.cz
Jakub Rudolf Mlynářík 445304@mail.muni.cz
Radek Ošlejšek oslejsek@mail.muni.cz

1 Faculty of Informatics, Masaryk University, Botanická 68a, Brno 60200, Czech Republic

Published online: 13 February 2024

1 Introduction

The shortage of cybersecurity workforce poses a critical danger to current companies ((ISC)2, 2022). As cybersecurity skills require higher-order thinking (McMurtrey et al., 2008), the best way to develop and ameliorate these abilities is through practical exercises that help raise awareness and competence and improve students' cybersecurity skills. Regardless of the educational subject, tutors make intensive efforts to create, organize, and continually improve their hands-on courses. In contrast to many learning areas that produce tangible output suitable for checking, analysis, or assessment, e.g., the code written in programming courses, practical cybersecurity training has a strongly process-oriented character. Tasks like "search for a vulnerability on server X" produce only sparse behavioral data that limit tutors' understanding of what trainees were really doing to solve the task. Therefore, we strive to support their endeavor by developing learning analytics tools that help tutors of hands-on cybersecurity exercises learn from conducted training sessions. Moreover, we apply methods of visual analytics to design and deliver easy-to-use analytical applications usable in practice.
1.1 Cybersecurity training background and limitations

Cybersecurity education can take many forms, from table-tops and online quizzes to hands-on drills. Our approach is based on data collected from hands-on training sessions organized in so-called cyber ranges (Knüpfer et al., 2020; Ukwandu et al., 2020; Yamin et al., 2020; Chouliaras et al., 2021). They serve as safe virtual environments that emulate computer networks and enable a data analyst to gather traces of trainees' behavior. The data have the form of event logs that can be further aggregated into relevant higher-level features for clustering. However, even in the area of practical cyber exercises organized in cyber ranges, there are significant differences. Some of them follow a free structure and rules, aiming to mimic real conditions. Typically, so-called cyber-defense exercises (CDX) are intended to train professionals (Eagle, 2013; Dasgupta et al., 2013). These competitions involve many teams, such as blue teams of defenders, red teams of attackers, or white teams responsible for the organization and compliance with rules. The complex scenarios of CDXs and the many involved user roles introduce extremely variable behavior. On the contrary, training of beginners, typically students, often follows puzzle-based gamification principles of the educational content, where puzzles are used as a metaphor for getting students to think about how to frame and solve unstructured problems (Michalewicz & Michalewicz, 2008). In cybersecurity, such exercises are referred to as Capture the Flag (CTF) games (Werther et al., 2011; Davis et al., 2014; Svabensky et al., 2018; Kucek & Leitner, 2020). In what follows, we focus primarily on puzzle-based hands-on exercises organized as time-restricted (usually supervised) training sessions. The formal puzzle-based structure of CTFs enables us to select relevant features for clustering-based analytical methods and to overcome the gap between the raw data and the analytical goals.

1.2 Analytical background and challenges

Data analysis can be conducted in different phases of a training life cycle. Based on the classification provided by Oslejsek et al. (2021), the objectives of this paper address the post-training analysis of the quality of the training exercise (V4) and behavior analysis (V5). We apply clustering methods to behavioral event logs collected during the exercise to find common correlations (often subtle) in the behavior of groups of trainees. Clustering techniques are among the unsupervised machine learning methods used to group data features by their similarity (Madhulatha, 2012). Their potential use in education is to identify typical or exceptional behavior of students, which may not be immediately obvious from individual data records. Group clusters and outliers observed in the data can raise hypotheses about used training strategies or features of training scenarios that analysts can further explore. Behavioral patterns revealed from cybersecurity exercises could be used, for instance, to estimate trainees' cybersecurity skills and the effectiveness of their actions, unveil attack-defense strategies, or identify possible issues in training scenarios. Clustering methods deal with features extracted from raw data.
For example, if we have the Bash commands each trainee used in a Linux server to protect it (i.e., the raw data), we can extract a feature like the number of commands and use it to cluster trainees into groups with respect to their efficiency (the fewer commands used to protect the server, the more effective the trainee was). Multiple features are usually combined to get meaningful behavioral clusters. Even this trivial example demonstrates that features (the number of commands) and analytical goals (analyzing the efficiency of trainees in protecting the server) go hand in hand. The available raw data limit the possible features and, in turn, the possible analytical goals, and vice versa. Matching them up is challenging and requires employing an iterative strategy. Also, the feature extraction process itself can be limiting, as it must be computed automatically. Suppose, for instance, that the raw dataset contains the individual Bash commands used for some cybersecurity task. While counting them is simple and straightforward, what if we would like to define a feature such as the correctness of the command sequence used to protect the server? In this case, assessing the correctness of an arbitrary sequence of commands algorithmically can be very difficult. Therefore, such a feature can be considered too ambiguous and practically unusable for automated clustering. This short discussion demonstrates that the definition of realistic analytical goals backed by available data and automatically retrievable features is challenging. In this paper, we apply an iterative visual-analytics development process to define analytical goals for data from CTF training sessions and to design and evaluate a practically usable analytical application.

1.3 Objectives

We aim to contribute to the state of the art of behavioral analysis in education practice with the following objectives:

• The formulation of analytical goals and a related clustering method for the post-training analysis of hands-on cybersecurity Capture the Flag games.
• The design of an exploratory visual-analytics tool to support tutors in clustering-based behavioral analysis of trainees.
• The evaluation of the practical usability of the clustering method and visualizations.

Different stakeholders can benefit from the research. Mainly:

• Researchers can follow our approach to define additional analytical goals related to hands-on cybersecurity education or to apply the described principles in different educational domains.
• Developers of cybersecurity training platforms can adopt and integrate our tool into their analytical dashboards.
• Tutors of CTF training sessions can use the proposed clustering-based analysis to spot their missteps during tutoring or to explore exceptional or typical behavior of trainees.
• Designers of training content can use the proposed clustering-based analysis to notice an inaccurate or faulty training design.

1.4 Research method

Our approach to the clustering-based behavioral analysis of CTF trainees is based on the conceptual model of the visual analysis process by Sacha et al. (2014), which is characterized by the interaction between data, visualizations, models of the data, and users discovering knowledge, as shown in the right-hand side of Fig. 1.
The idea lies in automatically extracting features from raw data for a suitable clustering algorithm (the model part in the figure) and gaining knowledge about trainees' behavior by interactively adjusting and exploring the data and clustering results via intuitive visualizations. We applied the Nested Model methodology (Munzner, 2009; Meyer et al., 2012) to propose a relevant clustering method and deliver practically usable exploratory visualizations. This methodology guides designers of visual analytics tools through the whole process and enables them to independently validate each of its layers.

Fig. 1 The depiction of the whole data clustering process, including data collection, extraction, and application of the clustering algorithm

The nested model consists of four phases:

Domain problem and data characterization aims to get familiar with the target domain. At this stage, we benefited from close collaboration with domain experts - tutors of hands-on cybersecurity courses, who gave us the necessary insight into their needs and actions. We conducted unstructured interviews and field observations of the hands-on cybersecurity training sessions. Furthermore, we collected data from training sessions that consisted of successive tasks to solve. We focused on the training events described in Section 3.1 in more detail. A qualitative evaluation was then held to validate the fulfillment of the initial needs.

Operation and data type abstraction aims at mapping the input problems onto a more specific description. We identified the main needs of the organizers of the training sessions and transformed them into three analytical goals posed in Section 3.2. Each goal focuses on a different aggregation technique that utilizes data clusters to help identify three types of training data outcomes. After identifying those main areas in the form of requirements, we needed to determine the main measures that could help align the necessary training values in the form of features. We selected six of them, as defined in Section 3.3. A Single Ease Question (SEQ) questionnaire was used to measure the outcomes related to the use of the tool.

Visual encoding and interaction design aims to interconnect visualization elements with interaction strategies. Once we had the necessary characteristics, we had to encode the data into a suitable visual representation. As a result, we designed two types of visualization that deal with clustered data. They are described in Section 4. We then used the System Usability Scale (SUS) to measure the usability of the tool in the given context. The whole evaluation is discussed in detail in Section 5.

Algorithm design aims at carrying out the implementation of the visual encoding. We selected the unsupervised machine learning algorithm k-means++, which clusters data according to the measures (features) from the previous steps. To validate the selected approach, we measured the performance of the algorithm on datasets of different sizes. The results are discussed in Section 5.5.

2 Related work

Clustering is an essential part of data mining.
The generic state-of-the-art overview of traditional and recently proposed clustering methods and their application domains can be found in Ezugwu et al. (2022). Educational Data Mining (EDM) is an emerging discipline that exploits statistical, machine learning, and data mining algorithms over different types of educational data (Romero & Ventura, 2010; Salloum et al., 2020). Dutt et al. (2017) provide a comprehensive overview of EDM techniques. Their educational data clustering process explains important steps in the design of clustering approaches. We adopt these steps within the nested model for designing and validating visualization systems (Munzner, 2009), which we used to develop a practical exploratory tool for cybersecurity education.

A number of approaches aim to identify the effectiveness or pinpoint distinct student strategies in specific types of courses. Specific solutions can be found in the literature addressing, for instance, student performance in generic courses (Durairaj & Vijitha, 2014), classification of students of small online courses by features adopted from business systems (Wang, 2021), revealing patterns of engagement in massive open online courses (Khalil & Ebner, 2017), or understanding how students approached solving a particular programming problem (Yin et al., 2015). In the area of cybersecurity education, which is the primary subject of our research, Svabensky et al. (2022) applied techniques of pattern mining and clustering to analyze the usage of command-line tools in hands-on cybersecurity exercises, aiming to support the automated assessment of students. Our solution focuses on different aspects of EDM - revealing gameplay strategies and possible flaws in the training content. Despite its primary focus on fairness, the recent survey paper by Le Quy et al. (2023) brings a useful classification of EDS tasks that use clustering models. Among them is the category "students' behavior, interaction, engagement, motivation, and emotion," into which our research falls and for which the survey provides a comprehensive overview of specific approaches, including the usage of k-means clustering models.

Since clustering results can be influenced by the algorithms used, multiple studies compare the performance of clustering methods in EDM. DeFreitas and Bernard (2015) compared partition-based (k-means), density-based (DBSCAN), and hierarchical (BIRCH) methods to determine which technique is the most appropriate for performing clustering analysis within Learning Management Systems (LMS), e.g., Moodle. Hooshyar et al. (2020) proposed an automatic comparative approach utilizing multiple internal and external performance measures to compare and accordingly recommend the most suitable clustering method for each LMS dataset. The results of these studies indicate that the performance of clustering algorithms varies depending on the type of data, its size, and the performance measures being used. Moreover, these studies focus on LMS data from long-term courses. Our application is specific in that we do not use data from a general LMS but rather particular data from hands-on CTF games. Therefore, we chose the k-means++ algorithm (Arthur & Vassilvitskii, 2006) - an improved version of k-means partitioning (Lloyd, 1982). This partition-based approach belongs to the top 10 algorithms in data mining (Wu et al., 2008) and is also among the clustering methods most frequently used in EDM and learning analytics (Dutt et al., 2017).
3 Data clustering

In this section, we explore three components of the visual analysis process (Fig. 1) that are required for the successful design of an appropriate visual-analysis tool: Raw data, Model (i.e., features and the clustering algorithm), and the formulation of analytical goals suitable to obtain Insights and knowledge.

3.1 Raw training data

Modern cyber ranges usually collect data in the form of event logs triggered by the trainees in the platform during the training session. For example, the Locust 3302 CTF game (Svabensky et al., 2018) is split into six consecutive levels, each representing a single cybersecurity task - a puzzle from the puzzle-based gamification perspective. The goals of the tasks are to

• scan a computer network,
• search for vulnerabilities on a web server,
• exploit the server,
• crack the SSH password,
• use the SSH password to steal data.

Successful completion of one task is required before proceeding to the next one. Trainees can receive various hints or a complete step-by-step solution. The gathering of raw data is depicted on the left-hand side of Fig. 1, where a trainee interacts with the cyber range and the interactions are stored in the form of event logs in a database. In general, puzzle-based gamification produces two types of events that could be used to track the behavior of individuals and classify them: (a) logs capturing the game state, e.g., when a trainee took a hint or finished the task, and (b) commands used to find the solution of a task, e.g., nmap used for scanning the network. We used the open-source KYPO Cyber Range Platform (KYPO CRP) (Vykopal et al., 2017) to collect data from multiple games and design the clustering methods. The types of events produced by this cyber range are discussed in detail in Macak et al. (2022). They include game events, command histories, and the usage of the Metasploit tool. Considering the analytical goals outlined in Section 3.2, we utilize only a selected subset of game events for the clustering, omitting other data collected by the cyber range. We show that even with such limited data, we can obtain relevant clusters useful for learning analytics. Nevertheless, the unused data can be employed in the future to address other analytical goals, simply by following the same principles as discussed in the remainder of the paper.

The events used for clustering are summarized in Table 1, and the whole dataset is also available among the supplementary materials. The events are related to the higher-level abstraction of puzzle-based gamification. The LevelStarted and LevelCompleted events encode the start and end of each puzzle - a cybersecurity task called a game level. If a task is successfully solved, the trainee finds a flag (hidden text), which is used to proceed to the next level. The CorrectFlagSubmitted and WrongFlagSubmitted events capture trainees' attempts to proceed to the next level. The assessment aspects of exercises are captured by the HintTaken and SolutionDisplayed events, which are penalized. All events are equipped with a timestamp, absolute training time, trainee ID, and level ID to trace the walkthroughs of individual trainees. Besides this, additional records that vary depending on the event's type can be present. Relevant attribute types are summarized in the last column of Table 1 and are used in the analytical workflow either for feature extraction, data clustering, or exploratory visualizations.
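To make the structure of these event logs concrete, the following is a minimal Python sketch of how a single game event could be represented for the analysis; the field names and example values are illustrative assumptions rather than the exact schema used by the KYPO platform.

```python
from dataclasses import dataclass, field

@dataclass
class GameEvent:
    """One behavioral event emitted by the cyber range (illustrative field names)."""
    event_type: str     # e.g., "HintTaken", "WrongFlagSubmitted", "LevelCompleted"
    timestamp: str      # wall-clock time of the event (ISO 8601)
    training_time: int  # seconds elapsed since the start of the training session
    trainee_id: str     # identifier of the trainee
    level_id: int       # game level (puzzle) the event belongs to
    details: dict = field(default_factory=dict)  # event-specific records, see Table 1

# Example: a trainee takes a hint in level 3 and receives a score penalty.
hint_event = GameEvent(
    event_type="HintTaken",
    timestamp="2023-03-14T10:42:07Z",
    training_time=1560,
    trainee_id="trainee-17",
    level_id=3,
    details={"hint_title": "Scanning ports", "penalty_score": 10},
)
```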
Table 1 Training events and their meaning

Event | Description | Event-specific records
LevelStarted | The trainee started a new level. | maximum achievable score
LevelCompleted | The trainee successfully finished the level. | awarded score
WrongFlagSubmitted | The trainee submitted a wrong flag. | provided flag (text), correct flag (text), penalty score
CorrectFlagSubmitted | The trainee submitted a correct flag. | provided flag (text)
HintTaken | The trainee took a hint. | hint title, hint wording, penalty score
SolutionDisplayed | The trainee viewed a complete task solution. | level ID, solution wording, penalty score

3.2 Analytical goals

Based on our long-term collaboration with domain experts on cybersecurity education, we identified three areas where a clustering tool could help improve the impact of current hands-on training programs on trainees (the Insights and knowledge artifact in Fig. 1) and, simultaneously, for which common cyber ranges produce relevant event logs. We formulated them into three analytical goals.

G1: Examine typical gameplay strategies. The gamification process reflects the initial vision of a training designer who transfers ideas into game elements like abstraction, challenges, and rules (Kapp, 2012). However, the gameplay of real trainees often differs from these expectations as the users adapt to real-time conditions (time pressure, assessment rules, etc.) or to their knowledge (e.g., using commands or steps that lead to solving a task in an unexpected way). Traditionally, the divergence of users' behavior from the expected one is analyzed by using so-called conformance checking (van der Aalst, 2016; Weiss et al., 2016; Svabensky et al., 2022), where a model of expected behavior is required, which makes these approaches laborious and prone to errors. On the contrary, clustering could reveal different gameplay strategies without this prior knowledge. The idea lies in the computation of behavioral clusters related to common gameplay strategies, paying attention to bigger clusters since they represent significant "herd behavior." For example, suppose that we are able to cluster trainees by their dependency on using hints. If a significant group of trainees in the training session prefers taking hints while solving tasks, then this behavior can be considered a gameplay strategy indicating that taking hints can be more advantageous than investing one's own effort to find the correct solution. On the contrary, if no such significant group is observable, or there are only several individuals with different patterns of using hints (e.g., in different tasks), then the analyst would conclude that the training session does not evince a significant strategy related to using hints.

G2: Identify flaws in training design. Proposing educational content is a creative process that is always prone to errors. Recurring behavioral patterns in trainees' progression (such as the inability to solve a task even after using hints) may help the training designers find flaws in the training design (e.g., a hint that is useless or confusing). Identifying such situations can lead to more precise task delimitation and thus improve the quality of the training. Clustering can help identify such situations and allow training designers to adapt game parameters (time limits, assessment rules, etc.) or game content (e.g., the wording of hints). Flaws can be observed using two approaches.
First, an analyst could extract specific flaw-relevant features from the raw data and use the clustering to determine whether a significant group of trainees struggled with such a potential flaw. For example, a feature encoding the "frequency of displaying the solution at the very last level" could be used to indicate insufficient time allocated for the training (with respect to trainees' skills). Another flaw-detection approach is based on G1 and expert opinion on gameplay strategies. In this case, a training designer can intentionally search for clusters of gameplay strategies that might indicate some trouble. This approach could be more practical because we usually aim to identify unknown flaws. Regardless of the analytical workflow, more attention should be paid to larger clusters because they represent significant behavior (a shortcoming encountered by multiple trainees).

G3: Identify outliers. While the analytical goals G1 and G2 primarily target majority behavior, finding individuals or small groups with certain characteristics is also important. It is beneficial, for instance, to identify exceptionally skilled people or rare gameplay strategies that are not desired in a certain course or scenario. However, this sparse behavior is often hidden in the volume of data, which makes its identification difficult. Therefore, the analytical tool has to provide a solution for these contradictory requirements: searching for typical as well as rare behavioral patterns. Identification of outliers represents a mixture of the G1 and G2 analytical workflows. Analysts should search for clusters related to gameplay strategies or the identification of flaws, be able to recognize small clusters (or individuals), and then assess their importance either using expert knowledge or by comparing the revealed behavior with "herd behavior" (i.e., bigger clusters). The solution to this issue lies in using proper visualization techniques that can emphasize small clusters and outliers alongside significant groups of trainees.

3.3 Extracted features

Considering the analytical goals G1-G3 and the raw data produced by the cyber range, the following features reflecting trainees' gameplay style were chosen for automated extraction. They can be computed for the whole training, e.g., the total number of hints taken by a trainee during the training session, or for a selected level, i.e., the number of hints taken by a trainee at level X. Which option is used depends on the granularity of the analysis, as discussed in the tool design in Section 4. The defined features can be divided into three categories. First, we chose two main features that can be easily extracted from the logs by counting specific events of individual trainees (a minimal extraction sketch follows the list below):

• F1: The number of submitted wrong flags. A high number can highlight trainees who could not reach the milestone. This may be due to an intentional strategy where the trainee is trying to guess the correct answer (G1), or it may be the result of a flaw, e.g., a confusing task description or hints, where the trainee struggles to find the correct solution while being convinced that the submitted flag is correct (G2).
• F2: The number of taken hints. If a trainee uses more hints to find the solution than others, it can again indicate either intention or trouble. It can be the result of the trainee's deliberate strategy, where he or she wants to go through the game or a level with minimum effort (G1), or the consequence of a wrong game design, e.g., useless hints or a too difficult assignment (G2).
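As announced above, the following is a minimal sketch of how such counting-based features (F1 and F2) could be extracted per trainee; it reuses the hypothetical GameEvent records sketched in Section 3.1 and a placeholder events list, so it is a conceptual outline rather than the aggregation API of our tool.

```python
from collections import Counter
from typing import Iterable, Optional

def count_events(events: Iterable["GameEvent"], event_type: str,
                 level_id: Optional[int] = None) -> Counter:
    """Count events of the given type per trainee, optionally restricted to one level."""
    counts: Counter = Counter()
    for event in events:
        if event.event_type == event_type and (level_id is None or event.level_id == level_id):
            counts[event.trainee_id] += 1
    return counts

# Placeholder; in practice, the GameEvent records of one training session loaded from the event store.
events: list = []

# F1: the number of submitted wrong flags per trainee (whole training)
f1 = count_events(events, "WrongFlagSubmitted")
# F2: the number of taken hints per trainee, restricted to level 3
f2 = count_events(events, "HintTaken", level_id=3)
```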
While the previous features represent general statistics that can be easily obtained from the raw data by counting the corresponding event logs, the following two features have to be computed by analyzing sequences of events in individual trainees' walkthroughs:

• F3: Time spent after using a hint. This feature calculates the time between taking the last hint and providing the correct flag, i.e., solving the task. If the data from the whole game are examined, not just from one level, the value is obtained by averaging the trainee's values across all tasks. In both cases, long times can indicate the uselessness of hints (G2).
• F4: Wrong flags submitted after a hint. Similarly to the previous feature, if a hint is useful, it should not be followed by many wrong flags. Otherwise, it may indicate a flaw in the game design (G2). When examining the whole game, the values from individual levels are averaged.

Since both F3 and F4 strictly indicate design flaws, their absence or presence in joint clusters with F1 and F2 can help analysts assess whether the trainees' behavior is due to errors in the game design (G2) or deliberate strategies (G1). Although the features F1-F4 do not mention G3 explicitly, they are also relevant to identifying outliers. The only difference is in looking for rare vs. obvious behavioral patterns. The last two features introduce unifying indicators of the overall trainees' success or failure:

• F5: The total time played. Differences in the amount of time spent playing a CTF can indicate different skills or interests. An extremely long or short playing time should always attract an analyst's attention because it can indicate a talented or indifferent individual (G3). However, a shorter time does not automatically mean a better trainee. A frustrated trainee, for instance, could use hints or solutions to go through the game as quickly as possible without any effort. Only when this feature is combined with other features can the real reason be observed. On the other hand, the appearance of a long playtime together with a high number of taken hints (F2) is quite obvious because the reason for taking hints is often the lack of time, typically at the end of the training session. If this behavior is exceptional in a group of trainees, then we could interpret it as an intentional "do my best during the game, then use hints when the end approaches" strategy (G1). On the contrary, if this behavior is typical, then it could instead be interpreted as a design flaw (the training scenario is too complex for the allocated time).
• F6: The total score. The score earned by trainees can give a straightforward insight into overall trainee success since it includes points for successful levels and penalties for used hints or skipped levels. While the total time played provides only a very simplified view of the trainees' success, the total score introduces a more precise assessment. Besides the use cases mentioned in F5, the combination of total time and total score brings yet another analytical possibility. Trainees who quickly gain a high score can be considered talented or skilled, and vice versa. This makes their discovery easier.

3.4 Clustering method

As an unsupervised data mining technique, clustering of the provided data does not require pre-trained models and, therefore, does not require human intervention. Many approaches to clustering exist nowadays.
We can divide them into several categories; the most widely used include overlapping, partitional, and hierarchical clustering (Rai & Singh, 2010). At the same time, a number of comparisons of these methods are available for various requirements (e.g., Rodriguez et al., 2019; Gelbard et al., 2007; Fraley & Raftery, 1998). In our case, we aimed at a proof-of-concept technique that would help us determine the fit of the whole clustering approach for the analysis of cybersecurity training data.

Partition-based algorithms are widely used in various fields because of their easy implementation (MacQueen et al., 1967). The most typical partitional method is k-means (Jain, 2010). The k-means algorithm is useful for our use case since it can adapt to sparse matrix data sets and efficiently organize large data sets. It is also suitable for the numerical values that we use because it measures the squared Euclidean distances in the clustered data. However, the number of clusters and the selection of initial centers can significantly impact the clustering results of the k-means algorithm. In our solution, we therefore use the improved k-means++ variant (Arthur & Vassilvitskii, 2006), which provides better results. More specifically, it does not allocate all the cluster centers randomly. Instead, it chooses the first centroid randomly and then selects the remaining centers from the rest of the points with probability proportional to each point's squared distance from its closest existing cluster center. The algorithm requires specifying how to calculate the similarity of features. As the features F1-F6 are numbers, their combination defines points in Euclidean space whose similarity can be measured by the Euclidean distance. The algorithm takes the desired number of clusters k and the points (features) to be classified as input. It divides the data records into k classes, starting from k data points selected as initial cluster centers. It then improves the clustering results by repeatedly recalculating the cluster centers as the average of their members.

Fig. 2 Scatter plot visualization. On the left-hand side, it displays the wrong flags submitted in relation to the time of gameplay, distributed in 5 clusters. The right-hand side (with 4 clusters) shows how many wrong flags the trainees submitted after asking for a hint

4 Visual-analysis tool

To support clustering-based post-training analysis covering goals G1-G3, we designed and implemented an exploratory tool.1 Event logs generated by the KYPO Cyber Range are stored in an Elasticsearch NoSQL database. Features F1-F6 are extracted by transforming and aggregating the raw event logs. The developed API unifies the aggregation services so that new features can be integrated in the future.
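For illustration, the following minimal sketch shows how the extracted features could be clustered with k-means++ using scikit-learn; the synthetic feature matrix, the min-max scaling, and the concrete choice of k are assumptions made for the example and do not reproduce our implementation exactly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Assumed input: one row per trainee, columns F1-F6 in this order:
# [wrong_flags, hints_taken, time_after_hint, wrong_flags_after_hint, time_played, score_total].
# A synthetic 34 x 6 matrix stands in for the aggregated training data here.
rng = np.random.default_rng(0)
features = rng.random((34, 6))

# Normalize every feature to [0, 1], mirroring the relative axes of the views.
X = MinMaxScaler().fit_transform(features)

# Within-cluster sums of squared distances (inertia) for k = 1..10 feed the elbow chart (Section 4.3).
inertias = {k: KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 11)}

# Cluster with the k chosen by the analyst (e.g., suggested by the elbow point).
model = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
labels = model.labels_            # cluster assignment of each trainee (the colors in the views)
centers = model.cluster_centers_  # per-cluster feature profiles (the shapes in the radar chart)
```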
Raw data and the data clusters produced by the k-means++ algorithm are consumed by several complementary visualizations (the Visualizations component in Fig. 1). Their full integration into the open-source cyber range provides an off-the-shelf selection of training sessions and their game levels, making the analysis comfortable and available right after a training session. This section explains the key visualization principles and design decisions using several examples. The analytical tool provides two primary views, both equipped with an interactive estimation of the optimal number of clusters. Both visualizations are discussed in what follows.

4.1 Scatter plots

The scatter plot views provide a detailed comparison of a pair of features, as shown in the two examples in Fig. 2. Points represent individual trainees, while the color denotes the clusters identified by the clustering algorithm from the distribution of points on the chart (their x and y coordinates). Points are semitransparent - a darker shade of the same color indicates multiple trainees with the same feature values. Axes are normalized according to the data samples to provide relative values. Therefore, the value one represents, for instance, the time of the slowest trainee or the maximum number of wrong flags submitted by any trainee.

1 The link to the source code, together with supplementary materials, is available at https://eait.surge.sh/.

The tool predefines plots of two specific pairs of features. They were selected in accordance with the domain experts' preferences, but other pairs can be easily integrated or selected dynamically. Both pairs primarily address possible flaws in the game settings (G2).

Wrong flags vs. time. A scatter plot dealing with the wrong flags submitted (F1) and time played (F5) features shows how straightforward or confusing the training or task was, in general or for individuals. For example, many wrong flags submitted over a long time period (the upper-right quadrant of the graph) can indicate that the task was rather difficult for trainees. Many wrong flags submitted over a short time period (points located near the right half of the x-axis) can indicate trainees who try to guess the correct flag. Points located close to the y-axis indicate trainees who were rather successful in finding the correct solution (in variable time). To infer hypotheses or conclusions about behavioral aspects, the analyst has to consider the distribution of individual points and whole clusters on the chart. In the left-hand scatter plot in Fig. 2, one can see two outliers - the yellow and violet clusters, each with a single trainee. Their positions on the right-hand side of the graph show that they had trouble completing the training. Especially the rightmost violet outlier close to the x-axis indicates suspicious behavior, as the trainee finished the training very quickly while submitting many wrong flags. In general, since many trainees submitted many wrong flags in a relatively short time (the lower-right quadrant of the graph), it might point to some inaccuracy in the task assignment.

Wrong flags vs. time after using a hint. The combination of the wrong flags after using a hint (F4) and time spent after using a hint (F3) features shows how useful a hint was once a trainee used it. The main assumption is that once a trainee reads a hint, the solution should be more straightforward, with only a minority of subsequent wrong flags.
On the contrary, when many trainees still struggle with the solution, the situation can indicate a faulty or insufficiently explained hint. Like the previous scatter plot view, this one also helps point out wrong parameters of the training (G2) or discover possible outliers who submit too many wrong flags even after taking a hint (G3). The right view in Fig. 2 shows an example. As the graph contains only trainees who took a hint when solving a task, fewer points indicate a simpler task, and vice versa. Points close to the y-axis can indicate a possibly confusing hint, where the trainees could finish the task quickly at the cost of repeatedly providing an incorrect flag. On the contrary, points located near the horizontal x-axis can be interpreted as the existence of useful hints that lead to a correct solution without mistakes. The time of finding the solution (the distribution of points along the x-axis) is not that important in this case. Therefore, it might appear that a chart with points predominantly located at the bottom could be considered a well-designed game level. However, this is true only for games that are organized to teach new cybersecurity concepts. Other types of hands-on sessions can produce different expected distributions. For instance, in tests or competitions, multiple wrong flags in a short time would be considered expected behavior due to time pressure. Therefore, the analyst needs to decide what behavior is to be expected.

4.2 Radar charts

The radar chart view depicted in Fig. 3 represents the dominant visualization for the analysis of gameplay strategies. Unlike in scatter plots of two values, multivariate features are captured compactly as two-dimensional shapes that clearly visualize commonalities between samples and help recognize more compound strategies or individual outliers (Chambers et al., 2018). Similarly to the scatter plot views, either the whole game results or the results of selected game levels can be chosen. The colored shapes help to distinguish different strategies visually. The number of shapes corresponds to the number of computed clusters. The number of trainees in each cluster indicates groups of trainees using the same strategy. It helps the analyst assess the cluster's significance (typical vs. exceptional behavior).

Fig. 3 The radar charts view. The main upper chart shows all the computed clusters combined, while the small charts below enable a better examination of individual clusters

In the radar chart in Fig. 3,
six clusters were selected, resulting in two small groups of potential outliers. The yellow cluster reveals two trainees who played for a significantly long time and took hints that were not very helpful. In contrast, the two trainees from the violet cluster played relatively fast but submitted noticeably many wrong flags without trying to take any hints. Nine trainees out of 34 performed very well. They achieved good scores fast and without struggle (the red cluster). Almost one-third of the trainees scored low in the training and needed more hints, but they played for a significantly shorter time than the two outliers from the yellow cluster.

4.3 Elbow function

Both the scatter plot and radar chart views require the analyst to specify the number of clusters. Selecting it ad hoc, iteratively inspecting the obtained results, is not a very efficient workflow. Therefore, we introduced a helper elbow-function visualization aiming to support this crucial analytical decision. Finding an optimal k for k-means clustering is based on computing the sum of squared distances between the points in a cluster and the cluster centroid (Nainggolan et al., 2019). Drawing these values in a line chart allows the analyst to identify an elbow point where the curve bends. This point can be used as an initial number of clusters for the exploratory analysis. In the example in Fig. 4, cluster counts of 4 and 5 are emphasized as candidates for the initial exploration. The elbow graphs are automatically computed for all scatter plot and radar chart views.

Fig. 4 A helper line chart representing the elbow function. It serves for the selection of an initial number of clusters for the analysis. In this case, 4 or 5 can be selected as suitable values

5 Evaluation

We conducted a qualitative user evaluation to receive feedback on the tool's usage and to verify that it provides the information outlined by the initial goals. Additionally, we evaluated the tool's usability and usefulness and gathered valuable remarks for further improvements.

5.1 Participants

The evaluation included nine target users (P1-P9). Due to the necessity of background domain knowledge, they needed experience in designing and organizing cybersecurity training sessions. All the participants were familiar with the concept and design process of cybersecurity training in the Cyber Range and used the platform to conduct or design various types of educational training. Details of the participants are summarized in Table 2.

5.2 Procedure

The user study sessions were held individually, in person for seven participants and online (using MS Teams) for the remaining two. Each session lasted about 60 minutes and had four parts. In the introductory part, the experimenter explained the evaluation procedure, and the participant consented and filled out the demographic questionnaire. In the second, familiarization phase, the experimenter presented the tool, and the participant spent 2-3 minutes familiarizing themselves with it using a demonstration dataset. Next, the respondent performed eight predefined analytical tasks (Table 3) that were formulated to cover all the goals placed on the tool. Because of the relatively small size of the participant group, we decided to focus on inputs beyond simple textual feedback. The tasks do not have strictly correct answers and were purposefully formulated to require a more thorough justification.
This ensures that, apart from the answers themselves, we can gain more insight into the participants' distinct understanding of concepts such as "success" or "good training design". Therefore, the respondents were asked to comment on their actions and their interpretation of the results. The experimenter took notes and recorded the screen and audio with the participant's opinions and thoughts for further qualitative evaluation. In addition, the difficulty of each task was also formally evaluated using an SEQ - Single Ease Question (Sauro & Dumas, 2009) - questionnaire to validate the process of mapping the abstract problem onto a specific visual form. Lastly, for a complete assessment of the tool's usability, we combined SEQ with the SUS - System Usability Scale (Sauro, 2011) - metric. The SUS questionnaire helps us rate the overall design of the visual encoding and exploratory interactions.

Table 2 Demographic summary of the participants

ID | Age | Gender | Position | LE | OE | VE
P1 | 38 | M | Senior lecturer, Researcher | 5 | >20 | 4
P2 | 36 | M | Lecturer, Manager | 5 | >20 | 5
P3 | 29 | M | Senior lecturer, Researcher | 5 | >20 | 5
P4 | 30 | F | Seminar tutor, Researcher | 4 | <10 | 1
P5 | 30 | M | Analyst, Tutor | 4 | <20 | 4
P6 | 41 | M | Forensics Analyst, Lecturer | 5 | >20 | 3
P7 | 34 | M | Data analyst, Lecturer | 4 | <10 | 3
P8 | 25 | F | Training designer, Lecturer | 3 | <10 | 3
P9 | 46 | M | Researcher, Seminar tutor | 3 | <2 | 5

LE - Lecturing experience, OE - Exercise organization experience, VE - Experience with analytical visualizations

Table 3 The tasks used for the evaluation

Task 1 | In the 'Wrong flags per time played' view, identify the most appropriate elbow method number in the helper elbow chart.
Task 2 | In the 'Wrong flags per time played' view (for all levels), do you see any suspicious trainees? Why/why not? If so, what is the trainee ID?
Task 3 | In the 'Wrong flags per time played' view (for level 5), what could the results imply regarding the level design?
Task 4 | In the 'Time spent after using the hint' view (for all levels), what does the point distribution suggest? Does it imply a good or a bad training design?
Task 5 | In the 'Radar chart' (for all levels), are there any clusters that represent distinct strategies but share similar training success?
Task 6 | In the 'Radar chart' (for all levels), are there any possible outliers?
Task 7 | In the 'Radar chart', how variable is the overall success of the trainees across the clusters?
Task 8 | In the dataset of 'Hacking Day Cyber Task Force Delta' (all levels), determine which strategy (which cluster of trainees) was the most successful.

5.3 Datasets

We used a total of three datasets, hereafter referred to as DS1-DS3. They were collected in past hands-on training sessions. All the datasets contain various events that occur during training: submission of a wrong or correct flag, taking a hint, finishing a level, not interacting with the training portal, etc. To avoid information bias, we used DS1 exclusively to introduce the visualizations and their capabilities to the respondents, while DS2 and DS3 were used for the evaluation itself. DS2, on which the majority of the tasks were performed, contains data collected from 34 trainees of a training seminar with an attack-oriented scenario. The training definition contains six training levels in which the participants attempt to scan a server for vulnerabilities and exploit it. The session lasted seven days, during which 1741 events were collected.
DS3 was used to determine the usability of our approach on a small dataset with only seven trainees and 121 collected events. The exercise was a 90-minute-long hacking competition with a scenario similar to the one above but reduced to five levels.

5.4 Usability results

We compared the respective answers from the SUS questionnaires to assess the overall usability of the proposed analytical tool. The obtained score of 75 lies in the interval from 68 to 80.3, which fits the good rating category according to Bangor et al. (2009).

While the SUS evaluation addresses the overall usability, the SEQ scores reflect the difficulty of solving individual tasks. The results summarized in Fig. 5 reveal that solving all tasks was rather easy (more than half of the respondents ranked them from neutral to very easy). The lowest rank was assigned to Task 5 (an average score of 3.6), which can be considered the most difficult. Other tasks, except Task 4, achieved very high average scores with values above five on the 7-point scale. Task 8, with an average score of 6.0, was rated as the simplest when using our tool. Therefore, the SEQ results also confirm the overall good usability of the tool for solving the tasks.

Fig. 5 SEQ score for individual tasks. The color scale rates each task from red (very difficult) to blue (very easy). Numbers inside the color bars show the number of corresponding ratings. The numbers on the right-hand side of each task show average ratings (red = 1, blue = 7)

To assess the participants' insight gained into individual tasks during the data exploration, we observed how consistent their answers, comments, and recorded interactions were with regard to the original analytical goals G1-G3. In what follows, we summarize our observations for the individual goals. Task 1 is specific, as it is related to all the goals. It can be considered an introductory task making respondents aware that they can change the desired number of clusters at any time.

Insight into the examination of gameplay strategies. The aim of the first analytical goal is to examine typical gameplay strategies. Our objective was to confirm that the majority of participants would state similar outcomes regarding the gameplay, the most often recurring behaviors, or the variability of the playing strategies. This goal was covered by the radar chart view, to which Tasks 5, 7, and 8 were related. The radar charts in Fig. 3 illustrate the situation from the evaluation. However, it must be remembered that the evaluation is dynamic, and a particular view depends on the selected parameters, especially the number of desired clusters and whether a specific level or the entire training is examined.

To solve the evaluation tasks, analysts must first clarify the meaning of "success" or "failure" in an exercise. Our participants assessed the success of the computed clusters mainly by their score total and time played values, using other features as complementary. This approach confirms our expectations. Task 7 was directly proposed to get the participants' remarks on the interpretation of success. In general, most participants identified two groups of trainees: a majority that did not perform well and one smaller group that was much more successful. Seven participants described the clusters as very variable, with different results.
P7 identified four groups: normal, good, bad, and unusual. P3 measured the success by the length of the score total axis instead of comparing the clusters, thus ranking the success variability as lower. The rest of the participants correctly compared the clusters on relative scales with each other. The ability to recognize and assess clusters with different degrees of training success implies that the participants chose the right number of desired clusters for the analysis.

In Task 5, the goal was to identify distinct strategies leading to similar success or failure. The evaluation revealed that participants primarily compared the total time played with the number of hints taken, as these two features evinced significant differences among clusters with similar success. Task 5 was rated as the most difficult (SEQ score 3.6). However, it was the first task in the evaluation process to work with the radar chart (before Task 7). All the succeeding tasks related to the radar chart were rated as easier (with SEQ scores ranging from 5.8 to 6.0). Because they were conducted with different datasets, this refutes the explanation that the sudden rise in rating was caused by growing familiarity with the data, which would subjectively make the tasks seem easier. It rather suggests that users find it easier to comprehend the encoded information after becoming more familiar with the overall concept of the radar charts.

Task 8 was rather straightforward, as all the participants pinpointed the same set of characteristics and selected the same groups of trainees. The participants measured success as a correlation between score total and time played. Some participants ignored the high number of wrong flags as a factor that should lower the success rate. Overall, the participants were able to find the connections between the features in individual clusters that are related to distinct strategies and, moreover, to identify and name specific significant gameplay strategies. One of the often-mentioned strategies was that some trainees omitted the hints and tried to pass the levels on their own.

Insight into flaws in training design. The second goal focuses on identifying flaws in training design. It should help determine whether there are any points where it is too hard for the trainees to solve the puzzle or, in contrast, where some trainees get too good results too easily. This goal was covered by Tasks 3 and 4. Both of them relate to the scatter plot views, and their average SEQ scores were 5.1 (Task 3) and 4.6 (Task 4).

Task 3 focuses on the relationship between the number of wrong flags submitted in a certain game level and the time spent playing the level (i.e., the time played feature). The answer to the question depended on how people perceive the quality of a training design with respect to the effort. In general, a 'good level design' was mostly defined by participants as one in which the dots are adjacent to the y-axis (a low number of wrong flags) and the level time is not too high (not too apparent in the current visualization - a time estimate could help, according to one of the participants). The majority identified level 5 (depicted in Fig. 6) as quite easy, as a significant amount of points lies in the left part of the chart, and they considered it a good sign (a straightforward and balanced assignment).

The goal of Task 4 was to identify how the participants dealt with training design specifics related to hints. Data from the evaluation are depicted in the right view of Fig. 2.
Clusters are computed for the entire game (averaged values across all levels), and they evince a significant distribution close to both axes. The interpretation of this distribution by the respondents met our expectations. All the participants agreed on an equivalent response. They decided the design was rather good. They agreed that good design (related to hint usage) is denoted by a high concentration of dots on the bottom side (close to the x-axis), which suggests that once the hint was displayed, there were not many subsequent issues (i.e., the hint was helpful). They noticed that after using a hint, the majority of trainees solved the level without too many wrong answers. Five participants would, however, analyze the hints further because some seem to be less useful for some trainees.