A survey of methods for encrypted traffic classification and analysis
Petr Velan, Milan Čermák, Pavel Čeleda, Martin Drašar

With the widespread use of encrypted data transport, network traffic encryption is becoming a standard nowadays. This presents a challenge for traffic measurement, especially for analysis and anomaly detection methods, which are dependent on the type of network traffic. In this paper, we survey existing approaches for classification and analysis of encrypted traffic. First, we describe the most widespread encryption protocols used throughout the Internet. We show that the initiation of an encrypted connection and the protocol structure give away much information for encrypted traffic classification and analysis. Then, we survey payload and feature-based classification methods for encrypted traffic and categorize them using an established taxonomy. The advantage of some of the described classification methods is the ability to recognize the encrypted application protocol in addition to the encryption protocol. Finally, we make a comprehensive comparison of the surveyed feature-based classification methods and present their weaknesses and strengths. Copyright © 2015 John Wiley & Sons, Ltd.


INTRODUCTION
Network visibility is becoming a necessity in current networks. Security, traffic provisioning, and failure detection are the prime reasons to deploy traffic measurement. Yet, measurement has other uses and new ones are still being discovered. For instance, application performance can be measured using the data from the application layer. Information about certain applications can also be used to detect attacks and intrusions on the application level. In contrast to this, the need for protection of transmitted data and user privacy is rapidly increasing. For this reason, the encryption of transmitted data is increasingly used. The ratio of encrypted traffic has recently been rising steeply as common Internet services become protected [62]. This change poses a challenge to currently used methods for traffic measurement, for which the identification and analysis of network traffic becomes more difficult.
The protocols presented in this section are listed in order of their position in the ISO/OSI reference model [43]. The IPsec protocol suite, which operates on the network layer, is described first, followed by the TLS and SSH protocols on the presentation layer. BitTorrent and Skype represent the application layer and they implement their own protocols for secure data transmission.

Internet Protocol Security
Internet Protocol Security (IPsec) is a framework of open standards for ensuring authentication, encryption and data integrity on the network layer. Due to its location on the network layer, IPsec can protect both the data within the packet and also L3 information (e.g., IP addresses) in each packet [33]. The main advantage of using IPsec is securing the network connection without the necessity of modifying any application on the clients or servers. However, it provides less control and flexibility for protecting specific applications.
IPsec follows the general scheme depicted in Figure 1. The first phase is represented by the Internet Key Exchange Version 2 (IKEv2) protocol [45]. IPsec uses UDP on port 500, through which all messages covering the initial handshake, the authentication and the shared secret establishment run. Two protocols can be used in the second phase of IPsec: Authentication Header (AH) and Encapsulating Security Payload (ESP). In the initial version of IPsec, the ESP protocol providing data confidentiality did not include authentication, so ESP and AH were used together. The current version of ESP contains authentication and AH has become less significant, although it is still used to authenticate portions of packets that ESP cannot manage. The ESP protocol is the main protocol of IPsec. It provides data confidentiality, origin authentication, connectionless integrity, an anti-replay service, and limited traffic flow confidentiality [46]. ESP adds a header and a trailer to each transferred packet, see Figure 2, placed according to the mode used. The ESP and AH protocols can operate in two modes: transport and tunnel. In the tunnel mode, a new IP header is created for each packet with endpoints of the tunnel as the source and destination addresses; the original IP header is used in the transport mode.
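As an illustration of what remains visible in an ESP packet, the following minimal Python sketch reads the two cleartext header fields defined by the ESP specification, the Security Parameters Index and the sequence number, and uses them for a naive duplicate check. The function and class names are hypothetical, and the set-based check is a simplifying assumption; real implementations use a sliding anti-replay window.

```python
import struct
from typing import NamedTuple, Optional, Set, Tuple

ESP_IP_PROTOCOL = 50  # IP protocol number assigned to ESP


class EspHeader(NamedTuple):
    spi: int        # Security Parameters Index, identifies the security association
    sequence: int   # per-association counter used by the anti-replay service


def parse_esp_header(ip_payload: bytes) -> Optional[EspHeader]:
    """Read the 8-byte cleartext ESP header; the rest of the packet is opaque."""
    if len(ip_payload) < 8:
        return None
    spi, sequence = struct.unpack("!II", ip_payload[:8])
    return EspHeader(spi, sequence)


def is_replay(seen: Set[Tuple[int, int]], header: EspHeader) -> bool:
    """Naive duplicate check (assumption: unbounded memory, no sliding window)."""
    key = (header.spi, header.sequence)
    if key in seen:
        return True
    seen.add(key)
    return False
```

A monitoring probe could call `parse_esp_header` on the payload of every IP packet whose protocol field equals `ESP_IP_PROTOCOL` and group packets by SPI to approximate flows.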

Transport Layer Security
Transport Layer Security (TLS) [28] is based on the Secure Sockets Layer version 3 (SSLv3) protocol [34] and provides transport level security directly on top of the TCP protocol. Specifically, it provides confidentiality, data integrity, non-repudiation, replay protection and authentication through digital certificates. TLS is currently one of the most common protocols for securing network communication. It is mainly used for securing HTTP, FTP and SMTP sessions, as well as for Virtual Private Networks or Voice over Internet Protocol (VoIP). The protocol design is layered and consists of different sub-protocols, as well as configurable and replaceable cryptographic algorithms [55].
The main part of TLS is the Record Protocol [28], which acts as an envelope for application data as well as TLS messages. In the case of the application data, the Record Protocol is responsible for dividing the data into optionally compressed fragments. Each fragment added to a record is complemented by a Message Authentication Code (MAC). For more details, see Figure 3. Depending on the selected security algorithms, a fragment and MAC are encrypted together and sent as one TLS record. A packet may contain more than one record to avoid sending multiple short packets. During the first phase of a TLS connection, communication parties are usually authenticated (more often we can see only server authentication) using an X.509 certificate chain [24], as shown in the general scheme in Figure 1. Alternatively, a previous connection can be resumed without authentication. TLS messages exchanged during this phase are unencrypted and do not contain a MAC until the shared keys are established and confirmed. In the second phase, these keys are used directly by the Record Protocol, which is based on the selected algorithms ensuring communication security.

Secure Shell Protocol
In a similar fashion to the TLS protocol, the Secure Shell (SSH) protocol [70] exists as a separate application on top of TCP. This protocol uses a client-server model where the server usually listens on TCP port 22. SSH was originally designed to provide remote login access to replace unsecured Telnet connections. Nowadays, it can be used not only for a remote login and a shell, but also to allow file transfers using the associated SSH File Transfer Protocol (SFTP) [35] and Secure Copy (SCP) [60] protocols, or as a basis for Virtual Private Networks (VPN). The SSH protocol provides user and server authentication, data confidentiality and integrity and, optionally, compression. SSH consists of three protocols, of which the most important is the Transport Layer Protocol, which provides the establishment of the whole connection and its management. It defines the SSH packet structure, which is depicted in Figure 4. The MAC is computed on a plaintext payload together with the packet length, the padding and a continuously incremented sequence number not present in the packet itself. Other SSH protocols are the User Authentication Protocol and the Connection Protocol for multiplexing multiple logical communication channels [66].
Each SSH connection passes through the same phases which were depicted in Figure 1. In the first phase, a TCP connection is established and information about preferred algorithms is exchanged. During authentication, a server sends its public key which must be verified by the client (using a certification authority or manually through a different channel). The shared keys are subsequently established and confirmed. All following packets are then encrypted and authenticated.
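Before the algorithm lists are exchanged, each SSH endpoint sends a cleartext identification string of the form `SSH-protoversion-softwareversion SP comments CR LF`, as defined by the SSH transport layer specification. The Python sketch below shows how a passive monitor might parse it. The function name and the sample banner are illustrative assumptions, and the regular expression is deliberately simple rather than a complete validator.

```python
import re
from typing import Optional, Tuple

# protoversion, softwareversion (no whitespace or minus sign), optional comments
_SSH_ID = re.compile(rb"^SSH-([0-9.]+)-([^\s-]+)(?: ([^\r\n]*))?\r?\n")


def parse_ssh_identification(payload: bytes) -> Optional[Tuple[str, str, str]]:
    """Return (protocol version, software version, comments) or None."""
    match = _SSH_ID.match(payload)
    if match is None:
        return None
    proto, software, comments = match.groups()
    return (
        proto.decode("ascii", "replace"),
        software.decode("ascii", "replace"),
        (comments or b"").decode("ascii", "replace"),
    )


if __name__ == "__main__":
    # Sample banner invented for illustration.
    print(parse_ssh_identification(b"SSH-2.0-OpenSSH_8.9p1 Ubuntu-3\r\n"))
```

Because the string is sent by both client and server, the same function can be applied to the first payload-carrying packet in each direction.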

BitTorrent
BitTorrent [36] is an application protocol based on the principle of peer-to-peer network communication for sharing large amounts of data over the Internet. Originally, the protocol did not ensure any type of network communication security. Once the popularity of this protocol increased, some Internet Service Providers (ISP) started to limit this type of traffic. As a response to this, the Message Stream Encryption (MSE) algorithm [12], also known as Protocol Encryption (PE), was introduced. It serves as an obfuscation algorithm to make BitTorrent traffic identification more difficult. In addition to obfuscation, the mechanism also ensures some level of confidentiality and authentication for communicating peers.
The MSE protocol specification [12] describes MSE as a transparent wrapper for bidirectional data streams over TCP which prevents passive eavesdropping and thus protocol content identification. MSE is also designed to provide limited protection against active man-in-the-middle attacks and port scanning by requiring a weak shared secret to complete the handshake. The major design goal was payload and protocol obfuscation, not peer authentication and data integrity. Thus, it does not offer protection against adversaries which already know the necessary data to establish connections (that is, the IP/port/shared secret/payload protocol).
The first general phase of the MSE protocol follows a TCP three-way handshake and starts with a newly generated Diffie-Hellman (DH) public key exchange (together with random data padding for better obfuscation). The shared key is computed from the DH exchange and combined with hashed information about the requested data, which acts as a pre-shared secret. After the shared key has been successfully confirmed, the packet's payload is completely encrypted by an RC4 stream cipher.

Skype
Skype [63] is a peer-to-peer based VoIP application providing not only video chat and voice calls, but also an instant messaging and file exchange service. As the protocol used is not publicly known, it is not possible to accurately describe its specific details. The main reason for this is network data obfuscation to make the detection of Skype traffic more difficult, which is similar to the BitTorrent protocol.
The Skype protocol operates over both UDP and TCP protocols depending on network characteristics. If the network has no restrictions, the application usually sends data traffic over UDP and the signalling traffic over TCP [1]. If UDP cannot be used (for example a firewall prevents users from using such a protocol), Skype sends both the signalling and the data traffic over TCP.
When TCP is used, the connection is usually established over port 80 or 443, which masks Skype traffic as standard web traffic.
Skype uses the TLS protocol over the TCP port 443 and a proprietary protocol over port 80 [1] for securing and obfuscating generated traffic with each communicating peer. TLS is also used in communication with other Voice over IP solutions, where it is used for protecting Session Initiation Protocol (SIP) messages [64]. Skype uses a proprietary protocol for communication over UDP. To offer a reliable connection, UDP packets contain an unencrypted header with a frame ID number and function mode fields. The encryption in UDP connections is used only for obfuscation and not for confidentiality; therefore, there is no generated shared secret, only a proprietary key expansion algorithm. UDP connections do not follow the general scheme in Figure 1, because encrypted data are directly transferred without an initialization phase.

INFORMATION EXTRACTION FROM ENCRYPTED TRAFFIC
Network monitoring is one of the main pillars of network security. If the appropriate data is collected, it is possible to detect network attacks and trace attackers, detect security policy violations and monitor network application performance. If encrypted traffic is used, the possibility of information extraction is significantly limited. Nevertheless, it is possible to obtain some information from this traffic, primarily from the unencrypted initialization phase, but also from the encrypted transport phase. This section begins with a description of the initialization phase, which is then followed by a description of methods which use traffic feature analysis to gain information from the transport phase.

The Unencrypted Initialization Phase
Almost all network protocols that ensure secure data transfer by means of encryption contain an unencrypted initialization phase, as depicted in Figure 1. Because the data exchanged at this stage are not encrypted, they can be easily extracted and used for monitoring network traffic. Generally, two types of information common to most protocols can be extracted during this phase. The first type covers the connection itself, and its properties exchanged in the initial handshake. The second type covers communicating peers' identifiers which are exchanged in the authentication phase.
During the initial handshake, parameters of a connection are negotiated, such as the cipher suites and the protocol version to be used. This dynamic setting of the connection properties enables backward compatibility for different versions of software or is used to set a different level of security based on established security policies. Some examples of this are data authentication and compression in addition to encryption itself. The list of possible identifiers, with references to algorithm specifications for IPsec, TLS, SSH and other protocols, can be found in the IANA Protocol Registries [40]. All of this information can be used for proper connection characterization and correct parsing of other packets in the rest of the connection.
One interesting use of the information from the initial handshake is client fingerprinting based on the offered cipher suites. A large number of cipher suites exist, and client applications usually do not implement all of them. Therefore, each application announces the cipher suites it supports, in its order of preference, during the initial handshake. This makes it possible to passively distinguish specific operating systems, web browsers and other applications, together with their versions, based only on the cipher suites which they use. Examples of such client fingerprinting, based on the SSL/TLS initial handshake, are the applications from SSL Labs [61] and the p0f module [54].
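The fingerprinting idea can be sketched in a few lines of Python. The example below is a minimal illustration, not the method used by the cited tools: it assumes the ordered list of 16-bit cipher suite identifiers has already been parsed from the handshake, and it hashes that list into a short fingerprint. The database, the labels and the suite lists are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Sequence

# Hypothetical fingerprint database: fingerprint -> client label.
KNOWN_CLIENTS: Dict[str, str] = {}


def cipher_suite_fingerprint(suites: Sequence[int]) -> str:
    """Hash the ordered list of offered cipher suite IDs (order matters)."""
    raw = b"".join(suite.to_bytes(2, "big") for suite in suites)
    return hashlib.sha256(raw).hexdigest()[:16]


def register_client(label: str, suites: Sequence[int]) -> None:
    """Store a fingerprint observed for a known client application."""
    KNOWN_CLIENTS[cipher_suite_fingerprint(suites)] = label


def identify_client(suites: Sequence[int]) -> Optional[str]:
    """Look up a client by the cipher suites it offered, or return None."""
    return KNOWN_CLIENTS.get(cipher_suite_fingerprint(suites))
```

Keeping the order of the list in the hash is a design choice: two clients that support the same suites but prefer them differently produce different fingerprints.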
The authentication phase represents the second source of information which can be easily extracted from secured network traffic. Unique identifiers of one or both communicating peers, optionally supplemented by additional data about them, are exchanged during this phase. For example, in the authentication phase of the SSH protocol, the server sends its public key to the user. The user must validate the key and verify that the server knows the appropriate private key [70]. Since this information is almost unique for each SSH server, it is possible to detect server changes or man-in-the-middle attacks on them by passive network traffic monitoring.
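Passive detection of server key changes can be sketched as follows. This is a minimal Python illustration under stated assumptions: the host key blob has already been extracted from the key exchange by some other component, servers are identified by a hypothetical `"address:port"` string, and the class name is invented for this example.

```python
import hashlib
from typing import Dict


class HostKeyMonitor:
    """Remember the first host key seen per server and flag later changes."""

    def __init__(self) -> None:
        self._fingerprints: Dict[str, str] = {}

    def observe(self, server: str, host_key_blob: bytes) -> str:
        """Return 'new', 'unchanged' or 'changed' for an observed host key."""
        fingerprint = hashlib.sha256(host_key_blob).hexdigest()
        previous = self._fingerprints.get(server)
        if previous is None:
            self._fingerprints[server] = fingerprint
            return "new"
        if previous == fingerprint:
            return "unchanged"
        # A different key for a known server: reinstallation or possible attack.
        return "changed"
```

A `"changed"` result does not by itself prove an attack; it only marks the connection for further inspection.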
A very good example of extracting information about communicating peers is demonstrated by monitoring the authentication phase of the SSL/TLS protocol. The server, and optionally even the client, sends its X.509 certificate [24] to the other party to prove its identity in this phase. This certificate contains a public key signed by the certification authority, which is supplemented with information about the peer and issuer. More detailed content of such certificates is shown in Figure 5. Passive certificate monitoring enables not only the identification of communicating peers but also the ability to check whether these certificates are valid and contain proper security algorithms to fulfil local security policies. SSL/TLS certificate properties were studied by Holz et al. [38] who revealed a great number of invalid certificates and some which were shared between a large number of hosts. Holz et al.'s work was followed by Durumeric et al. [30] who mainly focused on assessing certification authorities. Certificate monitoring can also be used to detect malicious software trying to hide its activities by connecting with its command and control centres using the SSL/TLS protocol [65].

Figure 5. X.509 certificate
Even though information extraction from encrypted traffic is not a computationally intensive process, it can provide valuable information. For example, extracting the Server Name Indication (SNI) [17] can be used by a home router's firewall to filter traffic. The unencrypted initialization phase is often used to recognize encrypted traffic, and the authentication information might be utilized to detect and prevent man-in-the-middle attacks on a network-wide level.
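The SNI extraction mentioned above can be sketched in Python as a walk through the fields of a TLS ClientHello. This is a minimal illustration that assumes the whole ClientHello fits into a single TLS record passed in as `record`; both function names are hypothetical, and `build_client_hello` only produces a synthetic message for demonstration.

```python
import struct
from typing import Optional


def extract_sni(record: bytes) -> Optional[str]:
    """Return the SNI host name from a TLS ClientHello record, or None."""
    try:
        if record[0] != 0x16 or record[5] != 0x01:  # handshake / ClientHello
            return None
        pos = 9                          # 5-byte record + 4-byte handshake header
        pos += 2 + 32                    # client version and random
        pos += 1 + record[pos]           # session ID
        (suites_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + suites_len            # cipher suites
        pos += 1 + record[pos]           # compression methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(pos + ext_total, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:            # server_name extension
                # list length (2), name type (1), name length (2), name
                if record[pos + 2] != 0:  # 0 = host_name
                    return None
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                return record[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None
    return None


def build_client_hello(host: str) -> bytes:
    """Build a minimal synthetic ClientHello that carries only an SNI extension."""
    name = host.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    ext_data = struct.pack("!H", len(entry)) + entry
    extensions = struct.pack("!HH", 0, len(ext_data)) + ext_data
    body = (
        b"\x03\x03" + bytes(32)                 # version and random
        + b"\x00"                               # empty session ID
        + struct.pack("!H", 2) + b"\x00\x2f"    # one cipher suite
        + b"\x01\x00"                           # null compression only
        + struct.pack("!H", len(extensions)) + extensions
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake
```

A filter on a router would compare the returned name against its policy; a ClientHello without the extension simply yields `None`.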

The Encrypted Data Transport Phase
Information about network traffic can also be extracted from encrypted data which is transported between communicating parties. Packets exchanged during the transport phase usually contain only information about the packet itself, such as the length and the authentication field, which are not useful for monitoring network traffic. Nevertheless, two methods do exist to obtain more suitable data.
The first method uses direct traffic decryption, which is possible to perform only if the shared secret of the connection is known. Therefore, decryption can be used in networks on the servers' side where organizations know the private key of the connected server. However, such decryption would be impossible if algorithms which ensure forward secrecy [39] were used.
The second method is based on the extraction of traffic features. An example of such an analysis is presented by Miller et al. [56] who monitor the size of TLS encrypted packet sequences. Based on their data, together with various predictive models, they are able to identify a specific web page and deduce its content even though the traffic is encrypted. A similar approach was also used by Koch and Rodosek [49] for analysing SSH traffic. Another example of using traffic features is the work by Hellemons et al. [37], which focused on intrusion detection in SSH traffic. Encrypted traffic features could also be used for classifying encrypted traffic, which is described in Section 6.

A TAXONOMY FOR TRAFFIC CLASSIFICATION METHODS
The first step in analysing network traffic is identifying the type of traffic measured. Network traffic classification methods are used for this purpose based on the knowledge of the protocol packet structure, communication patterns, or a combination of both. The recognition of the TLS protocol, described in Section 2, may be seen as an example of this. This protocol can be identified based on the knowledge of the packet structure, especially the unencrypted packet parts such as the content type, the version and the length. Similarly, the protocol can be identified by analysing its behaviour, e.g., the number and approximate size of packets sent during the unencrypted initialization phase.
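Structure-based recognition of TLS can be illustrated with a short Python sketch that checks the three unencrypted record header fields named above. The function name is hypothetical, and the version table covering SSLv3 to TLS 1.2 is an assumption made for this example. The length bound is the maximum ciphertext fragment size allowed by TLS 1.2.

```python
import struct
from typing import Optional, Tuple

TLS_CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}
TLS_VERSIONS = {
    (3, 0): "SSLv3",
    (3, 1): "TLS 1.0",
    (3, 2): "TLS 1.1",
    (3, 3): "TLS 1.2",
}
MAX_RECORD_LENGTH = 2 ** 14 + 2048  # upper bound for an encrypted fragment


def parse_tls_record_header(payload: bytes) -> Optional[Tuple[str, str, int]]:
    """Return (content type, version, length) if payload looks like a TLS record."""
    if len(payload) < 5:
        return None
    content_type, major, minor, length = struct.unpack("!BBBH", payload[:5])
    if content_type not in TLS_CONTENT_TYPES:
        return None
    if (major, minor) not in TLS_VERSIONS:
        return None
    if length > MAX_RECORD_LENGTH:
        return None
    return TLS_CONTENT_TYPES[content_type], TLS_VERSIONS[(major, minor)], length
```

Checking all three fields together keeps false positives low, because random payload rarely satisfies the type, version and length constraints at once.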
To present the current state of research on encrypted traffic classification in a comprehensive manner, we use a taxonomy of classification methods. We choose the multilevel taxonomy by Khalife et al. [48], which provides a detailed categorization of traffic classification methods. This taxonomy is uniquely descriptive and allows us to efficiently categorize all our surveyed classification methods. Figure 6 shows an overview of the taxonomy levels. On the topmost level, the authors distinguish between classification input, technique and output. The input determines the traffic's characteristics, which are used for classification (e.g., packets or flows). The technique describes the core of the classification method, which may be, among others, payload inspection, a statistical method or a machine learning method. The output then describes how the traffic objects (packets or flows) map to traffic classes (application types or application protocols). The traffic classes are of a different granularity. Some methods allow identification of application protocols (e.g., HTTP), and some are more fine grained to detect, for example, a Google search or a Facebook chat. The traffic objects may be individual packets, whole flows or hosts. Traffic classes describe the type of classification being performed by a specific algorithm.

PAYLOAD-BASED TRAFFIC CLASSIFICATION TECHNIQUES FOR ENCRYPTED TRAFFIC
Almost every network traffic encryption protocol has a specific packet format that differs from others, as was described in Section 2. Thus, with knowledge of these formats it is possible to distinguish and identify individual protocols by inspecting the packet payload. For this purpose, string or regular expression matching algorithms are used with specific protocol patterns. Some examples of contemporary classification tools which use payload inspection are discussed in more detail in the first part of this section. The second part presents current research papers which focus on comparing these tools in terms of their performance and success rate.

Payload-Based Classification Tools
Most network traffic classification tools address all network protocols and not only the encrypted ones. The following examples represent the most widely used tools for classifying network traffic. Most of these tools are also able to distinguish specific network applications, mainly in unencrypted traffic. In terms of the taxonomy, these tools mostly use the Payload Inspection technique on the Traffic Payload classification input to map Flows to Application Protocols.
PACE [42] is a commercial classification library written in C, which uses pattern matching augmented by heuristics, behavioural and statistical analysis. In addition to standard protocol and application recognition, it is able to identify obfuscated protocols such as BitTorrent or Skype. According to its website, PACE is able to identify thousands of network applications and protocols.
Cisco Network Based Application Recognition (NBAR) [22] is another example of a commercial tool for classifying network traffic. This tool is primarily used on Cisco routers for quality and security purposes. According to its authors, NBAR is also able to recognize stateful protocols and non-TCP and non-UDP IP protocols.
nDPI [27] is an open-source classifier forked from the (currently closed) project OpenDPI, which was in turn derived from PACE. nDPI analyzes at most eight packets from each connection for classifying traffic; however, each packet is examined separately. If the connection contains multiple matches, then the most detailed match is returned. For encrypted traffic recognition, nDPI contains only an SSL decoder that extracts the host name from the server certificate. Using these names, nDPI is able to identify specific network applications.
Libprotoident [2] is an open-source C library for classifying traffic. In contrast to the previous tools, Libprotoident inspects only the first four bytes of a packet payload for each direction. This makes it much faster but reduces its detection accuracy. The classification uses a combined approach of pattern matching, payload size, port numbers and IP matching.
L7-filter [23] is an open-source classifier for Linux which is designed to classify traffic on the application layer. The initial phase of classification is based on non-payload data such as port numbers, IP protocol numbers, the number of transferred bytes, and so forth. Payload data are analysed with regular expressions during the second phase. One disadvantage of L7-filter is that it contains a database with old patterns which was last updated in 2011.
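The two-phase approach can be sketched in Python as follows. This is an illustrative sketch, not L7-filter's actual pattern database or matching engine: the port table and the four regular expressions are assumptions chosen for this example, and the function name is hypothetical.

```python
import re
from typing import Tuple

# Phase 1 data (assumption): well-known destination ports used as hints.
PORT_HINTS = {22: "ssh", 80: "http", 443: "tls"}

# Phase 2 data (assumption): simple patterns for the start of the payload.
PAYLOAD_PATTERNS = [
    ("ssh", re.compile(rb"^SSH-[12]\.[0-9]")),
    ("tls", re.compile(rb"^\x16\x03[\x00-\x03]")),
    ("http", re.compile(rb"^(?:GET|POST|HEAD) \S+ HTTP/1\.[01]")),
    ("bittorrent", re.compile(rb"^\x13BitTorrent protocol")),
]


def classify(dst_port: int, payload: bytes) -> Tuple[str, str]:
    """Return (label, evidence), where evidence is 'payload', 'port' or 'none'."""
    hint = PORT_HINTS.get(dst_port)          # phase 1: non-payload data
    for label, pattern in PAYLOAD_PATTERNS:  # phase 2: regular expressions
        if pattern.match(payload):
            return label, "payload"
    if hint is not None:
        return hint, "port"
    return "unknown", "none"
```

In this sketch a payload match overrides the port hint, so SSH running on port 443 is still reported as SSH; the port is used only when no pattern matches.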

A Comparison of Classification Tools
A comparison of the presented open-source tools was introduced by Finsterbusch et al. [32]. They prepared a data set containing the traffic of 14 different network protocols such as DNS, HTTP, BitTorrent, and SMTP(S) to compare the tools. Using this data set, they measured classification accuracy, memory usage, CPU utilization, and the number of packets required for proper classification. The comparison showed, amongst other results, that nDPI is not able to classify the BitTorrent protocol with more than a 43 % true positive rate, although it detects SSL/TLS with 100 % accuracy. Libprotoident had the highest classification accuracy over the whole analysed traffic and was able to classify DNS, HTTP, SIP and e-mail protocols with 100 % accuracy. Libprotoident also needs the smallest number of packets on average for classifying real-time traffic. Based on the comparison by Finsterbusch et al., we can say that Libprotoident is the most appropriate tool for payload-based traffic classification, although it is more CPU intensive than the other tools.
Another comparison of the tools was carried out by Bujlow et al. [18]. They compared all of the previously presented tools. They prepared a publicly available data set for the comparison which contained encrypted and unencrypted network traffic from 17 application protocols, 25 network applications and 34 web services. Their results show that PACE and Libprotoident are the most accurate tools. Nevertheless, the Libprotoident tool was the only classifier able to identify all the encrypted protocols which were tested. Their results were generally very similar to those in the comparison by Finsterbusch et al.
In general, the payload-based classification tools use regular expression matching algorithms to identify the encrypted traffic. The main difference between the tools is how much data they need to examine and whether they need both directions of the connection for the classification.

FEATURE-BASED TRAFFIC CLASSIFICATION TECHNIQUES FOR ENCRYPTED TRAFFIC
This section surveys feature-based classification methods which specialize in encrypted traffic. These methods do not require any knowledge of the encryption protocol packet structure. Instead, they use specific protocol communication patterns to classify encrypted traffic. These methods are based on the specific protocol differences described in Section 2, such as packet and flow features of unencrypted initialization or the encrypted data transport phase. This approach provides greater generalization and allows these methods to work with new versions and types of encryption protocols without the modification of underlying algorithms. The taxonomy specified in Section 4 is used to describe the individual methods. Apart from the properties defined by the taxonomy, we also provide information about data sets used for evaluating these methods. This is especially important when additional evaluation is to be performed by other groups to verify the results of the authors. While the taxonomy provides a traffic class for classifying the application protocol, this is sometimes too coarse for our purposes. Most classification methods not only identify the encryption protocol, but also the underlying encrypted application protocol. As both cases belong to the application protocol traffic class, we provide further explanation in the description of each classification method.
A slight drawback of flow-based classifiers is that classification is often performed only after the flow has expired. This prevents the possibility of a real-time response, which has led several research groups to research real-time classification using flow features. Their methods are also included with the others in Table IV and differentiated by a column describing whether the classification is done in real-time or not.
Moore et al. [57] identify almost 250 discriminators (flow or packet features) which can be used to classify flow records. The authors do not propose any specific classification method themselves; however, most of the classification algorithms use a subset of these discriminators for identifying traffic. In terms of the taxonomy, the authors provide a list of traffic features which are used to infer the category of the traffic's properties for the classification input.
Most of the feature-based traffic classification methods use statistical or machine learning methods. These methods are comprehensively described in [3]. Port-based classification was used in the past to associate applications with network connections, but the accuracy of this method is decreasing with the increased use of dynamic ports and applications evading firewalls. Despite the decreased accuracy, port numbers are often utilized as one of the packet features. Furthermore, port-based classification is still quite often used to establish a ground truth for traffic classification experiments.
For easier orientation, we present the surveyed papers grouped by the class of their traffic classification algorithm. Most of the papers employ Supervised Machine Learning Methods (Section 6.1), Semi-Supervised Machine Learning Methods (Section 6.2) and Basic Statistical Methods (Section 6.3). The rest combine more than one method and are therefore gathered in a Hybrid Methods category (Section 6.4). For each surveyed paper, we specify the classification technique, classification output, and the data sets used in the description of the classification process itself. Table IV in Section 6.5 provides a summary of the papers with all of the mentioned properties properly categorized.

Supervised Machine Learning Methods
Sun et al. [67] propose a hybrid method for classifying encrypted traffic. First, the SSL/TLS protocol is recognized using a signature matching method. A Naive Bayes machine learning algorithm is then applied to identify the encrypted application protocol. The authors use a combination of public and private data sets to evaluate their method. Background traffic is taken from a public data set, BitTorrent, eDonkey, HTTP, FTP, Thunder and GRE application protocols. The signature based recognition of SSL/TLS protocols was tested on HTTPS, TOR, ICQ and other protocols. The identification of the underlying protocol was tested only for TOR and HTTPS protocols.
Okada et al. [59] analyzed changes in flow features due to encryption. They created a training data set with HTTP, FTP, SSH, and SMTP application protocols encrypted using PPTP and IPsec tunnels. The authors assessed 49 flow features and analyzed which of them are strongly correlated in normal and encrypted traffic. The correlated features were then used to infer functions which transform the features between normal and encrypted traffic. Therefore, standard classifiers can be used to classify the traffic after the transformation. The authors verified their method using several modifications of the Naive Bayesian classifier.
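To make the Naive Bayes approach concrete, the following Python sketch implements a plain Gaussian Naive Bayes classifier over numeric flow features. It is a generic textbook variant, not the implementation evaluated by the cited authors; the class name is hypothetical, and any feature choice (for example mean packet size and mean inter-arrival time) is an assumption of the caller.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


class GaussianNaiveBayes:
    """Per-class Gaussian model for each feature, assuming feature independence."""

    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[Hashable]):
        grouped: Dict[Hashable, List[Sequence[float]]] = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.priors: Dict[Hashable, float] = {}
        self.stats: Dict[Hashable, List[Tuple[float, float]]] = {}
        for label, members in grouped.items():
            self.priors[label] = len(members) / len(rows)
            self.stats[label] = []
            for column in zip(*members):
                mean = sum(column) / len(column)
                # Small constant avoids division by zero for constant features.
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
                self.stats[label].append((mean, var))
        return self

    def predict(self, row: Sequence[float]) -> Hashable:
        best_label, best_score = None, -math.inf
        for label, prior in self.priors.items():
            score = math.log(prior)  # log prior plus sum of log likelihoods
            for value, (mean, var) in zip(row, self.stats[label]):
                score += -0.5 * math.log(2 * math.pi * var)
                score -= (value - mean) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids numerical underflow when many features are multiplied together, which matters once dozens of flow features are used.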
Arndt and Zincir-Heywood [11] also concentrated on the classification of encrypted traffic and compared C4.5, k-means and Multi-Objective Genetic Algorithm (MOGA) to this end.The classification of the SSH protocol was used as an example in their study.The authors focused on the accuracy and robustness of the algorithms.Stability was tested using three different public and private data sets for teaching and evaluating the methods.Multiple different flow export settings were tested as well.Altogether, 46 flow features were used in the evaluation, however, a different subset was used by each algorithm.The C4.5 algorithm provided the best robustness, although the MOGA had a very low false positive rate when used on the same data set as it was trained on.The C4.5 was recommended for forensic analysis by law enforcement since it is applicable on a variety of networks.
Alshammari and Zincir-Heywood have published several papers [4,5,6,7,8,9] on traffic classification using various supervised machine learning methods.They focused on recognising SSH, Skype, and in one case, Gtalk traffic using flow features without port numbers, IP addresses or payloads.A set of 22 flow features was mostly used, although in one case the authors selected the features using genetic programming.Public data sets are used as well as a private data set generated on a test-bed network.The ground truth for public data sets was gained from port numbers, and the private data set includes the payload and was labelled with a commercial packet classification tool.The following algorithms for traffic classification were compared: AdaBoost, RIPPER, Support Vector Machine (SVM), Naive Bayes, C4.5 and Genetic Programming.
Kumano et al. [52] investigated real-time application identification in encrypted traffic. IPsec and PPTP encryption were applied to web, interactive and bulk transfer flows to create an evaluation data set. C4.5 and SVM algorithms were utilized to classify the application on these data sets. The authors then tested the accuracy of the classification using different numbers of packets from the start of the flows. They measured the impact of using fewer packets on the flow features and proposed using the features which show the least change to classify applications at an early stage.
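A minimal sketch of how the stability of a flow feature under early observation might be measured is given below. The three features, the relative-change metric and the packet sizes are assumptions chosen for illustration and are not taken from the surveyed work.

```python
from statistics import mean

# Hypothetical flow features computed from a list of packet sizes.
FEATURES = {
    "mean_size": lambda sizes: mean(sizes),
    "max_size": lambda sizes: max(sizes),
    "total_bytes": lambda sizes: sum(sizes),
}


def relative_change(sizes, n):
    """Relative change of each feature when only the first n packets are seen."""
    return {
        name: abs(f(sizes[:n]) - f(sizes)) / abs(f(sizes))
        for name, f in FEATURES.items()
    }


# Hypothetical packet sizes (bytes) of one bulk-transfer flow.
sizes = [1400, 1380, 1400, 1390, 1400, 1400, 1385, 1400]

change = relative_change(sizes, 3)
# Features ordered from most stable to least stable under early observation.
stable_first = sorted(change, key=change.get)
```

In this toy flow the cumulative byte count changes strongly when only three packets are observed, whereas the mean and maximum packet size barely move, which makes them better candidates for early classification.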

Semi-Supervised Machine Learning Methods
Bernaille and Teixeira [16] used traffic clustering to detect applications encrypted by SSL. Their method has three steps. First, they detected SSL connections using a clustering algorithm (Gaussian Mixture Model) on packet sizes and directions of initial packets of a connection. The first three packets and 35 clusters provide good accuracy in detecting SSL traffic. After the SSL traffic is identified, the first data packets of the connections are identified. The sizes of the data packets are used by a clustering algorithm to detect an underlying application in the third step. However, the packet sizes are modified in the last step to allow for encryption overhead. The evaluation of the proposed method is done on traffic traces from live networks and a manually generated packet trace. The data sets contain HTTP, POP3, FTP, BitTorrent and eDonkey application protocols encrypted using SSL. The authors also show that using a combination of clustering and port numbers to differentiate between applications in clusters provides better results than clustering alone.
Maiolini et al. [53] identify SSH traffic and determine underlying protocols (SCP, SFTP, HTTP) using a k-means algorithm. Only three packet features are used: the direction, the number of bytes and the timestamps of each packet. Their private data sets are created from artificial traffic and contain HTTP, FTP, POP3 and SSH protocols. Control packets such as TCP handshake, retransmitted packets and ACK-only packets are removed from the statistics as they negatively affect the precision. The authors use only the first 3-7 packets to achieve real-time identification.
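Clustering on the first few packets of a flow can be sketched as follows. This is a simplified illustration rather than the surveyed authors' implementation: it uses only packet direction and size (timestamps are omitted), a plain k-means loop with hand-picked initial centroids, and hypothetical packet traces.

```python
import math


def flow_vector(packets, n=4):
    """Signed sizes of the first n packets: positive for client-to-server,
    negative for server-to-client; short flows are zero-padded."""
    vec = [size if direction == "out" else -size for direction, size in packets[:n]]
    return vec + [0] * (n - len(vec))


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids


# Hypothetical flows: two interactive sessions and two bulk transfers.
flows = [
    [("out", 48), ("in", 48), ("out", 64), ("in", 80)],
    [("out", 52), ("in", 48), ("out", 60), ("in", 76)],
    [("out", 48), ("in", 1400), ("in", 1400), ("in", 1400)],
    [("out", 52), ("in", 1380), ("in", 1400), ("in", 1390)],
]
points = [flow_vector(f) for f in flows]
centroids = kmeans(points, [points[0], points[2]])
labels = [nearest(p, centroids) for p in points]
```

Because only the first n packets are needed, a flow can be assigned to a cluster long before it ends, which is what makes this family of methods suitable for near real-time use.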
Bacquet et al. [13,14] use a Multi-Objective Genetic Algorithm to select a flow feature subspace and parameters for a clustering algorithm which detects encrypted traffic. The second work employs a hierarchical k-means algorithm to increase the identification accuracy. The authors evaluate both approaches on a private data set captured at a university campus. SSH is used as a representative of encrypted traffic and the ground truth is gained from the payload of the captured packets. Based on previous works, the authors argue that the feature selection and the number of clusters highly affect the overall accuracy. Therefore, the authors selected four objectives for the genetic algorithm: cluster cohesiveness, cluster separation, the number of clusters and the number of flow features used. The results show (a) that only 14 of a total of 38 flow features were used by the best-performing algorithm and (b) that using a hierarchical k-means algorithm increases the identification performance.
Bar-Yanai et al. [15] combined k-means and k-nearest neighbour clustering algorithms to construct a new, real-time classifier for encrypted traffic. The resulting classification algorithm has the low complexity of the k-means algorithm and the accuracy of the k-nearest neighbour algorithm. They claim the method is fast, accurate and robust with regard to encryption, asymmetric routing and packet ordering. A labelled data set was prepared from generated samples of the traffic and the classification of the data set was done using payloads of the packets. Flows shorter than 15 packets were removed from the data set, since real-time classification for such short flows is not of practical use. If available, the first 100 packets were used for the classification. The authors stress that their method is applicable in a real-time environment and tested their implementation on an ISP link. The application protocols classified are HTTP, SMTP, POP3, Skype, eDonkey, BitTorrent, Encrypted BitTorrent, RTP and ICQ.
Zhang et al. [72] propose an improvement to the k-means clustering algorithm. Using harmonic mean to reduce the impact of random initial clustering centres, the authors are able to increase [...]
Figure 7. A visual representation of transport-layer interactions for various applications [44].
[...] applications such as Twitter, Skype and Dropbox. The Markov chain fingerprints are based on protocol specific distributions of packets in time. Data sets were captured from real traffic, contain only SSL/TLS traffic and were not published. The ground truth was obtained by inspecting domain names of the SSL/TLS traffic. The authors discovered that many protocol implementations differ from the RFC specification, which required them to adjust the fingerprints.
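The general idea of a Markov chain fingerprint can be sketched as follows. This is a minimal illustration and not the surveyed authors' implementation: it assumes a first-order chain over abstract per-session state labels, and the state names, the two application names and the floor probability for unseen transitions are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_fingerprint(sessions):
    """Estimate first-order transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for seq in sessions:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
        for a, nxt in counts.items()
    }


def log_likelihood(fingerprint, seq, floor=1e-6):
    """Log-probability of a sequence; unseen transitions get a small floor value."""
    return sum(
        math.log(fingerprint.get(a, {}).get(b, floor))
        for a, b in zip(seq, seq[1:])
    )


# Hypothetical training sessions for two applications.
training = {
    "app_a": [["hello", "cert", "key", "data", "data"],
              ["hello", "cert", "key", "data"]],
    "app_b": [["hello", "key", "data", "alert"],
              ["hello", "key", "data", "alert"]],
}
fingerprints = {app: fit_fingerprint(s) for app, s in training.items()}

# Attribute an unknown session to the fingerprint that explains it best.
unknown = ["hello", "cert", "key", "data"]
best = max(fingerprints, key=lambda app: log_likelihood(fingerprints[app], unknown))
```

A deviation of an implementation from the expected state sequence shows up here as a transition with floor probability, which is one reason why fingerprints may need adjusting for real traffic.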

Hybrid Methods
Karagiannis et al. [44] focused on host-based classification. Their method uses only information from the network level and therefore is not affected by transport layer encryption (e.g. TLS). The authors classified the behaviour of the hosts on social, functional and application levels, without access to packet payloads or headers. Each level was classified independently and a cross-level classification was performed afterwards. Particular applications were represented using graphlets, see Figure 7, which are representations of the application's behaviour. The authors then used heuristics to refine the classification. The ground truth for captured data sets was established by using a signature-based payload classification. Without using transport layer information, only the following traffic classes were identified: web, p2p, data (FTP, database), network management (DNS, SNMP, NTP), mail (SMTP, POP, IMAP), news (NNTP), chat (IRC, AIM, MSN messenger), streaming and gaming.
Wright et al. [69] worked on a classification of traffic in encrypted tunnels. Multiple flows can be wrapped in a single flow representing the encrypted tunnel. The information from the packet headers was not applicable, therefore the authors used only packet sizes, timing and communication direction. A k-nearest neighbour classifier was used for classification when all TCP connections in a set carried the same application protocol. When TCP connections carried different application protocols, the authors used Hidden Markov Models. The authors also demonstrated that it is possible to determine the number of flows in an encrypted tunnel. The port numbers were used to obtain a ground truth for the captured data set. The authors argue that mislabelled data only decreased the efficiency of their classification algorithm and therefore the real accuracy would be even higher than the reported one. The classifiers were able to detect the following application protocols: HTTP, HTTPS, SMTP, AIM, FTP, SSH, Telnet.
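A k-nearest neighbour classifier over packet size, timing and direction features can be sketched as follows. The three-dimensional feature vector (normalised mean packet size, mean inter-arrival time in seconds, fraction of client-to-server packets), the training values and the labels are hypothetical and serve only to illustrate the technique.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Majority vote among the k training vectors closest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical labelled connections.
train = [
    ((0.05, 0.20, 0.50), "SSH"),
    ((0.06, 0.25, 0.55), "SSH"),
    ((0.04, 0.30, 0.45), "SSH"),
    ((0.80, 0.02, 0.20), "HTTP"),
    ((0.75, 0.03, 0.25), "HTTP"),
    ((0.90, 0.01, 0.15), "HTTP"),
]

interactive_guess = knn_classify(train, (0.07, 0.22, 0.50))
bulk_guess = knn_classify(train, (0.85, 0.02, 0.20))
```

Since none of these features depends on header fields inside the tunnel, the same classifier can in principle be applied to the aggregate encrypted flow.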
Koch and Rodosek [49] proposed a system for detecting interactive attacks using SSH. Packet sizes, IP addresses and packet inter-arrival times were used to create clusters of packets which were likely to match an SSH command and its corresponding response. The SSH protocol was recognized based on the port number, and individual commands were identified from the clusters. Following this, sequences of commands were evaluated and possible malicious sequences were reported. The system allows for the customization of the definitions of malicious sequences using a sub-goal characterization. Each sub-goal maps to a malicious event, such as data gathering or system manipulation. The results from the evaluation of the proposed method show that such identification is possible.
Khakpour and Liu [47] used the entropy of packet payloads for classifying traffic. The authors showed how to compute the entropy of files and how to modify the formula for on-the-fly computation. Several entropy values were computed for each packet. CART and support vector machine (SVM) methods were used for a subsequent classification based on the computed values. Their results demonstrated that SVM methods provide comparable accuracy with fewer false positives. The authors argue that it is necessary to exclude application layer headers such as HTTP response or picture headers. The reason for this is that computing entropy on the headers leads to a bias and misclassification of a packet. Therefore, a cut-off threshold was used to strip application headers from unknown protocols. The traffic was first classified into three categories: text, encrypted or binary. The authors also postulated that the classification can be more fine-grained and they investigated the classification of application protocols. Then, they demonstrated that it is possible to determine an encryption algorithm with a higher accuracy than random guessing, which they found surprising.
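The payload entropy used as a classification input can be sketched with the standard Shannon entropy over byte values. The sketch below is a generic illustration and not the surveyed authors' exact formula; the `header_cutoff` parameter is a hypothetical stand-in for their cut-off threshold, and the sample payloads are invented.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def payload_entropy(payload: bytes, header_cutoff: int = 0) -> float:
    """Entropy of the payload after skipping an assumed application header."""
    return byte_entropy(payload[header_cutoff:])


# Hypothetical payloads: repetitive data, plain text, and uniformly spread bytes
# (the last one stands in for encrypted or well-compressed content).
constant = b"a" * 64
text = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"
uniform = bytes(range(256))

entropies = {
    "constant": byte_entropy(constant),
    "text": byte_entropy(text),
    "uniform": byte_entropy(uniform),
}
```

Text payloads fall well below the 8 bits per byte reached by uniformly distributed bytes, which is the property a classifier can exploit to separate text, binary and encrypted content.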

A Summary of Machine Learning and Statistical Encrypted Traffic Classification
We have provided a summarizing overview of the feature-based traffic classification papers and methods they use in Table IV. Where a method belongs to multiple categories, it is not marked as a hybrid, but all the categories are listed instead. We find this approach more descriptive than using a hybrid category as defined by the taxonomy.
Most surveyed methods use flow or packet header features as an input for the classification techniques. The authors of [6,9] compare the results gained by utilizing packet header features and flow features. They show that using both sets of features can result in faster and more accurate classification algorithms. Nevertheless, using all the available traffic features does not necessarily lead to the best classification performance as demonstrated by the authors of [4].
The column Number of features shows how many flow or packet header features were used in each method. Some methods used different subsets of the features for different algorithms, and this is denoted by the ⊆ mark. Moreover, some of the methods used a feature selection algorithm to select the best combination of features from the entire feature set. These methods are marked in the Feature selection column. For this case, the Number of features column represents the initial number of features.
Most classification algorithms are based on machine learning. The category of supervised machine learning algorithms is represented by Hidden Markov Models, RIPPER, AdaBoost, Support Vector Machines, C4.5 and Naive Bayes. Several works [5,7,8,9] compare these algorithms to establish which is the best for the task of classifying traffic. The C4.5 algorithm performs the best in several cases; however, genetic programming is reported to achieve the best results in [9]. The second most common algorithm category is semi-supervised machine learning, which is dominated by clustering algorithms. The k-means algorithm is the most frequent in this category, and the k-nearest neighbour comes second. The popularity of k-means is due to its variability, which allows it to be fine-tuned for various purposes. It is often combined with genetic algorithms to find the best setting. The authors of [68,47] use the entropy of packet payloads to classify traffic. Using simple statistical properties of the traffic is the third most common classification method. Other methods are rarely used, mainly because they cannot learn from labelled traffic and therefore require too much effort to set up.
The SSH protocol is heavily used as a classification example. The authors of [4,5,6,7,9,11,13,14,72] test their methods for recognising SSH and non-SSH traffic. Maiolini et al. [53] take the classification one step further and identify the type of traffic encapsulated in an SSH connection. The authors of [16,51,67] use SSL/TLS traffic and identify underlying application protocols. Since the SSL/TLS protocol is more general and is used to encrypt various types of traffic, the complexity of identification is higher than for SSH. Another very popular protocol for identification is Skype, which is addressed by the authors of [7,9,50,51].
Some of the methods focus only on identifying encrypted traffic, whereas others try to identify the underlying application protocol. The methods which perform a more thorough analysis to gain information about the application protocol are indicated in the column Encrypted protocol identification.
Because all methods, with the exception of [44], classify whole flows and rely mostly on flow features, they are rarely able to classify traffic in real-time. However, the authors of [15,16,52,53,68] achieved near real-time classification by extracting features of only a fixed number of packets in a flow. They argue that the first packets carry enough information for classification. Using a higher number of packets increases accuracy but delays the decision, so the number of packets can be chosen to strike a balance between accuracy and early identification.
The Data set columns describe whether the data used to evaluate the presented methods was taken from a live network (Real) or generated by a tool (Artificial). We also identify if the data sets were publicly available (Public), were made available by the authors (Published) or kept undisclosed as they contain sensitive information (Private). If more than one data set was used for an evaluation, we simply took the union of the data set descriptions. The Ground truth column indicates how the ground truth was obtained for each data set. Common methods are based on port numbers or signatures, which use the packet payload. When the data sets are generated manually, the ground truth is known in advance.
The classification accuracy reported by authors of the surveyed methods depends heavily on the data sets used. All authors use their own private data sets which are seldom published. Such methods simply cannot be compared without repeating the experiments on a common data set. The authors of [5,7,9,11,67,72] also used publicly available data sets which were either labelled beforehand, using payload when available, or simply labelled using port numbers. A combination of data sets is often used to test the robustness of the methods.
The surveyed methods show that a lot of effort was put into classifying encrypted traffic. We believe that there are several points that should be taken into account in any future research in this field. First, identifying encrypted traffic is not enough. The identification of the underlying protocol is the real challenge. Second, the SSL/TLS protocol should be used as the reference protocol, as it can contain much more complex traffic than the SSH protocol. Finally, the traces used should be labelled and made available to other researchers. Following these points does not limit the scope of future research; however, it simplifies the comparison of the presented approaches and allows others to verify the results more easily.

CONCLUSIONS
In this paper we presented an overview of current approaches for the classification and analysis of encrypted traffic. First, we selected a number of the most widely used encryption protocols and described their packet structure and standard behaviour in a network. Second, we focused on information which is provided by encryption protocols themselves. We found that the initiation phase often provides information about the protocol version, ciphers used, and the identity of at least one communicating party. Such information can be used to monitor and enforce security policies in an organization. We also discovered that the use of information from the unencrypted parts of an encrypted connection for network anomaly detection has been only briefly investigated by researchers. Information about communicating parties can be leveraged to discern the type of encrypted traffic. For example, the list of supported cipher suites provided by a client when establishing a secure connection can help to identify the client. We believe that the use of unencrypted parts from the initiation of an encrypted connection should be explored in more detail.
Before starting the analysis of encrypted network traffic, it is necessary to identify it. Thus, we surveyed approaches to classifying network traffic. These were, first, payload-based methods, which use knowledge of a packet's structure, and second, feature-based methods, which use characteristics specific to the protocol flow. For the payload-based classification, there are several open-source traffic classifiers which can identify encrypted traffic using pattern matching. The initiation of a communication often has a strictly defined structure; therefore, patterns can be constructed for specific protocols. The main difference between the various classifiers is that some of them require traffic from both directions of the communication to correctly classify the flows.
Feature-based traffic classifiers have been intensively researched over the last decade. Many statistical and machine learning methods have been applied to the task of traffic classification. Despite this, there are no conclusive results to show which method has the best properties. The main reason is that the results depend heavily on the data sets used and the configuration of the methods. We have applied the multilevel taxonomy of Khalife et al. [48] and categorized existing methods. Our results show that most of the authors use private data sets, sometimes in combination with public ones. For this reason, the individual results are not directly comparable. Most of the methods use supervised or semi-supervised machine learning algorithms to classify flows and even determine the application protocol of a given flow. Most methods target encryption protocols, such as SSH, SSL/TLS and encrypted BitTorrent, and use similar methods. However, there are also some novel works which apply innovative approaches to refine the classification up to deriving the content of the encrypted connections.
Most authors of feature-based classification methods claim that their approach is privacy sensitive as it does not require the traffic payload. However, privacy issues are much wider. In 2013, the Cyber-security Research Ethics Dialog & Strategy Workshop [19] started a discussion about the influence of cyber-security research on the privacy of Internet users. Researchers need to keep in mind that their research activities have a significant impact on infrastructure security, network neutrality and privacy of end users.
In the past, Internet protocols were not designed with security considerations in mind. The recent interest in privacy has motivated the IETF to reconsider this approach and discuss the privacy aspects of the protocols. Discussions held in [41] revealed that monitoring privacy issues are of great

Figure 1. A general scheme of network security protocols.

Figure 2. The IPsec packet structure in (a) the transport mode and (b) the tunnel mode.

Figure 3. The TLS Record packet format.

Figure 4. The SSH protocol packet format.

Table II. Classification technique level.
Tables I, II and III provide examples for each of the input, technique and output category. The input and technique tables have the most general categories in the left column, some of which are divided into more specific subcategories. The classification output table is divided horizontally into traffic objects and classes. The objects describe what is being classified, in other words, whether it is

Table III. Classification output level.

Table IV. A summary table of cited papers and methods they use to detect encrypted traffic.