Device Classification for Industrial Control Systems Using Predicted Traffic Features

To achieve a secure interconnected Industrial Control System (ICS) architecture, security practitioners depend on accurate identification of network host behavior. However, accurate machine learning based host identification methods depends on the availability of significant quantities of network traffic data, which can be difficult to obtain due to system constraints such as network security, data confidentiality, and physical location. In this work, we propose a network traffic feature prediction method based on a generative model, which achieves high host identification accuracy. Furthermore, we develop a joint training algorithm to improve host identification performance compared to separate training of the generative model and the classifier responsible for host identification.


INTRODUCTION
Modern Industrial Control Systems (ICS) carry out the automation of industrial processes via complex, interconnected, large-scale arrangements of sensors, actuators, and specialized and general-purpose computing devices. As high-visibility ICS cyber-security events over the past two decades have shown, left unsecured, these cyber-physical systems can be vulnerable to misuse and abuse, leading to real-world consequences. A first step toward securing internetworked cyber-physical ICS is to identify, or characterize the network hosts so that an informed security posture assessment can be performed. Conventional Information Technology (IT) tools often use human-specified rules, signatures, and fingerprints to characterize network hosts, and these techniques have been adapted to ICS networks. These conventional characterization techniques begin to break down in ICS environments due to the breadth of device types and specialized communication protocols used in ICS networks; maintenance of databases used by conventional techniques need to be continuously updated to keep up with the introduction of new device types, operating systems and communication protocols. Due to the essential nature of critical infrastructure and ICS for society at large, and the possibility of accurate and high-performance automation of security-related tasks, such as the determination of device characteristics like model and manufacturer, much work has been done recently in the field of Machine Learning (ML) to devise methods of characterizing network hosts, and specifically, ICS devices, so as to avoid much of the downsides of conventional host characterization techniques.
Previous work shows the effectiveness of both network traffic features and ICS protocol features for ICS network device classification (Chakraborty et al., 2021). However, in order to acquire such features, it is necessary to capture network traffic data from the relevant ICS network and then perform network-packet level analysis on that data. This is a non-trivial undertaking, especially in ICS networks which may be physically isolated and difficult to access, resource constrained (e.g., in network bandwidth and storage), or where access is restricted due to confidentiality concerns. In such an environment, we may have access to a limited amount of network traffic data. The objective of this work is to augment the collection of real ICS network data with a computationally inexpensive process, which can generate realistic traffic features as a function of time.
In this work, we frame the above as a time-series forecasting problem and seek to answer two questions: (1) Given a partial history of network traffic, can we accurately forecast that history into the future? and (2) Can we use this network traffic forecast to improve device classification performance? To answer these questions, we develop and evaluate a novel approach comprised of two key elements: (1) a time-series forecasting model based on a sequence-to-sequence (Seq2Seq) deep learning architecture and (2) a joint training procedure for the Seq2Seq model and the device classifier, in which the Seq2Seq model functions as a traffic feature generator and the classifier functions as a data discriminator, implicitly evaluating the quality of the traffic features generated by the Seq2Seq model. We demonstrate that these two pieces are both necessary and sufficient to significantly improve device classification performance (up to a 23% increase in accuracy) in a data-constrained ICS setting.

Main Contribution
We highlight the main contribution of this work, before getting into the technical details.
• The primary contribution of this work is the development of a novel machine learning framework, which jointly trains (a) a network traffic forecasting model and (b) a traffic-feature based device classification model, and which produces significant improvements in device classification performance over a similar approach with independently trained model components. • This proposed framework is shown to be useful in a dataconstrained scenario, which is common for ICS due to network security, as shown in Figure 1.

RELATED WORK
Time series modeling has historically been a key area of academic research, forming an integral part of applications in topics such as climate modeling (Mudelsee, 2019), biological sciences (Stoffer and Ombao, 2012), and medicine (Topol, 2019), as well as commercial decision making in retail (Böse et al., 2017), finance (Andersen et al., 2005), and net-load consumption for customer (Thayer et al., 2020) to name a few. While traditional methods have focused on parametric models informed by domain expertise such as autoregressive (AR) (Box and Jenkins, 1976), exponential smoothing (Winters, 1960;Gardner, 1985), or structural time series models (Harvey, 1990), modern machine learning methods provide a means to learn temporal dynamics in a purely data-driven manner (Ahmed et al., 2010). With the increasing data availability and computing power in recent times, machine learning has become a vital part of the next generation of time series forecasting models. Deep learning in particular has gained popularity in recent times, inspired by notable achievements in image classification (Krizhevsky et al., 2012), natural language processing (Devlin et al., 2018), and reinforcement learning (Silver et al., 2016). By incorporating bespoke architectural assumptions, inductive biases (Baxter, 2000), that reflect the nuances of underlying FIGURE 1 | The accuracy of ICS device classification methods depends on the amount of available data (blue arrow). In data-constrained environments, classification accuracy may suffer. The goal of this work is to develop methods to reduce the slope of the data-accuracy relation such that higher ICS device classification accuracy may be obtained with less data (green arrow).
TABLE 1 | Feature list per protocol, per direction. In the "Symbol" column "a" is either sent or received, representing each direction of flow.
Description Symbol datasets, deep neural networks are able to learn complex data representations (Bengio et al., 2013), which alleviates the need for manual feature engineering and model design.
The availability of open-source backpropagation frameworks (Abadi et al., 2016;Paszke et al., 2019) has also simplified the network training, allowing for the customization for network components and loss functions. Given the diversity of time-series problems across various domains, numerous neural network design choices have emerged. A rich body of literature exists for automated approaches to time series forecasting including automatic parametric model selection (Hyndman and Khandakar, 2008), and traditional machine learning methods such as kernel regression (Nadaraya, 1964) and support vector regression (Smola and Schölkopf, 2004a). In addition, Gaussian processes (Williams and Rasmussen, 1996) have been extensively used for time series prediction with recent extensions including deep Gaussian processes (Damianou and Lawrence, 2013), and parallels in deep learning via neural processes (Garnelo et al., 2018). Furthermore, older models of neural networks have been used historically in time series applications, as seen in Waibel (1989) and Wan (1993).
In the domain of Information Technology communication networks, traffic forecasting and prediction is an active area of research. While the literature uses the terms somewhat interchangeably, strictly speaking, forecasting is the task of using a model to estimate the future value of a variable (such as number of bytes transferred, for example) based on previous, timeordered values of the same variable, while prediction uses a model to estimate the value of an unknown variable based on the values of other known variables. Thus, forecasting is a sub-discipline of prediction that focuses on time. Historical and modern uses of traffic forecasting include trunk demand servicing, planning for network design and capacity upgrade requirements, network control, bandwidth provisioning, and efficient resource utilization. Early work focused primarily on the application of Kalman filters, the Sequential Projection Algorithm (SPA), and Autoregressive Integrated Moving Average (ARIMA) models to predict traffic behavior of telecommunication networks and Internet forerunners (Szelag, 1982;Groschwitz and Polyzos, 1994).
Subsequent work continued to apply ARIMA model variants to modern Internet traffic and mobile wireless networks while improving the prediction time-step resolution (Shu et al., 1999;Liang, 2002;Papagiannaki et al., 2003Papagiannaki et al., , 2005El Hag and Sharif, 2007). Simultaneously, new methods based on Autoregressive Conditional Heteroskedasticity (ARCH) models were introduced (Krithikaivasan et al., 2007;Kim, 2011) to capture the observed second-order non-stationarity of real traffic data.
Artificial Intelligence and Machine Learning methods for predicting communication network traffic gained popularity in the 2000's with the successful application of fuzzy logic and artificial neural network models (Liang, 2002;Guang et al., 2004). A large volume of AI and ML work for predicting traffic appeared in the following decades, and we refer the reader to surveys of ML applications to traffic forecasting and traffic characterization (Mohammed et al., 2019;Xie et al., 2019), as applied to Software Defined Networks in particular, for more detailed discussions.
In this article, we present a classification problem based on predicted time series data. For an ICS dataset, availability Frontiers in Computer Science | www.frontiersin.org of sufficient network data significantly affects the classification accuracy for classifying ICS and non-ICS devices, as shown in Chakraborty et al. (2021). This motivates the requirement of defining a time series prediction task, associated with a classification task using the predicted time series data. Therefore, our proposed method lies in the intersection of regression and prediction tasks.

Paper Structure
The remainder of this article is structured as follows: in Section 3 we describe the ICS dataset used for this work; in Section 4 we formally introduce the problem solved in this work; in Section 5 we introduce the method used for the time series prediction and in Section 6 we develop the model used for the prediction task; in Section 7 we describe the results while using the prediction model for the classification task; and finally, in Section 8, we discuss findings of our current work.

DATA DESCRIPTION
Network flows from a production-level industrial electrical distribution system (Site A) are utilized for this work. First, (TCP and UDP) network flow datasets were generated for this system by processing passively captured network traffic in PCAP form. Based on the source port of the network flows, the flows are categorized into ICS and non-ICS flows.
Network traffic for Site A is attributed to one of six different protocol groups: ARP, TCP/IP, DNP3 (Distributed Network Protocol, version 3), HTTP, TLS, and UDP. Before calculating the traffic features, the packets within each flow are subdivided according to these protocol types. Ten standard traffic features, shown in Table 1, are selected for this flow classification task. These traffic features are then computed for the extracted traffic flows for Site A. Since the captured data from Site A is not large enough to achieve sufficient device classification accuracy, our objective is to forecast (predict) the traffic features listed in Table 1 using the measured flows from Site A, which then can be used to perform the flow classification task with expectantly higher accuracy 1 .
To illustrate the observed characteristics of the Site A traffic features, Figure 2 plots 10, 000 timesteps (with each timestep having a resolution of 10 s, for a total of 100, 000 s) of each 1 We have shown in Chakraborty et al. (2021) that a minimum amount of data is required to achieve a specific ICS device classification accuracy.  Frontiers in Computer Science | www.frontiersin.org feature for sent and received packets, for both ICS and non-ICS devices. Four different colors in Figure 2 indicate the sent and received traffic features associated with ICS and non-ICS traffic flows.

PROBLEM FORMULATION
Consider a dataset consisting of traffic features, D t , is present for Site A with ICS and non-ICS devices. This dataset can be utilized to identify the device types (ICS or non-ICS, and device manufacturer and model number), as demonstrated in Chakraborty et al. (2021). Furthermore, we assume the length of the collected dataset, t, is less than the optimal data length (T) required for obtaining a desired device classification accuracy, as  shown in Section 4 of Chakraborty et al. (2021). Now we formally define the two problem statements.
Problem 1 (Prediction model). Given a traffic feature dataset of time length t, and an optimal data length requirement of T, can we develop a prediction model for forecasting the traffic features from time t to time T?
Problem 2 (Classification using prediction model). Using the prediction model from Problem 1, can we get the desired device classification accuracy for Site A?
FIGURE 6 | Variation of the evaluation metrics with input sequence length. The black vertical line is the selected input sequence length for our proposed architecture.
FIGURE 7 | Variation of the evaluation metrics with the output sequence length (prediction length). The black vertical line is the selected output sequence length for our architecture.

TIME-SERIES PREDICTION METHODOLOGY
One of the primary tasks associated with solving Problem 1 and Problem 2 is obtaining a sufficiently accurate 2 prediction of traffic features for ICS and non-ICS traffic. We propose a sequence-to-sequence model (Seq2Seq) based deep network architecture for the time series prediction task. The Seq2Seq model (Sutskever et al., 2014) is comprised of a Long Short Term Memory (LSTM, Hochreiter and Schmidhuber, 1997) based encoder-decoder.

Seq2Seq Model
We used a sequence-to-sequence model architecture as shown in Figure 3 with context based encoding-decoding as in Luong et al. (2015). Here we repeat some important aspects of the Luong context-based Seq2Seq architecture, as we refer to aspects of this architecture that were tuned for our application in Section 6.
In Equation (1), f is a recurrent unit like a LSTM, h d i−1 represents the hidden state from the previous decoder unit andŷ i−1 ∈ R k×1 is the k-variate prediction of the previous decoder unit passed as input to the current decoder unit.ᾱ i⋆ ∈ R l×1 is the attention weight vector for decoder i over the encoder hidden states (i.e., l = E ) and c i ∈ R h×1 is the attention context vector calculated as a linear combination over all E, and corresponding attention weightsᾱ i⋆ .
2 Prediction accuracy is defined in the literature (Thayer et al., 2020) based on several performance metrics, such as root mean square error, mean bias error etc. The attention energy vector α i⋆ for the i-th decoder unit can be calculated in multiple ways. The attention energies can be calculated purely based on the current hidden vector or as a function of the alignment between the current hidden vector and encoder states, as shown in Equation (2).
The attention energies are then normalized using a softmax transformation, to obtain the attention weightsᾱ i * ∈ R l×1 , which is then used to obtain the attention context vector c i . Next, the attention hidden stateh d i of the i-th decoder is obtained by imposing a nonlinear transfer function (hyperbolic tangent) over an affine transformation (with W h ∈ R h×2h ) of the concatenation of c i and h d i . Finally, g is a function which calculates the i-th predictionŷ i from the corresponding decoder attention hidden stateh d i i. For this case, we consider g to be a linear function. In Figure 3, a schematic of the proposed Seq2Seq architecture is shown.

PREDICTION MODEL DEVELOPMENT
In this section we will utilize the Seq2Seq architecture mentioned in Section 5.1 to solve Problem 1. Before describing the solution results, we briefly describe the data processing step and the evaluation metrics for the time series prediction.

Data Generation
The network flow data contains packet level information between different devices present in the dataset. We have extracted different traffic features (as listed in Table 1) across specific type of device (in both sending and receiving direction). Therefore from the network flow data, we extract time series measurement of different traffic features across individual present devices. We have collected the network flow data for a pre-defined amount of time. In Section 6.5, for brevity we evaluated our prediction model for predicting ICS and Non-ICS flows (average of all the traffic features for the broader class of ICS and Non-ICS devices), however in Section 7, we evaluated the time series prediction performance for individual devices. In this whole section, the experimental setup is as described in Problem 1, with t = 12, 000 and T = 16, 000, unless explicitly mentioned otherwise.

Data Processing
For the prediction task, we extracted the trend lines using "scikitlearn"-SVR (support vector regression) package, from the time series measurements for both ICS and non-ICS flows. As shown in Figure 4, the blue and red solid lines show the total number of transferred packets for ICS and non-ICS flows, respectively. In Figure 4, the black lines show the trend line for both type of flows. For the prediction task, the proposed architecture will be evaluated for the "detrend" line, as shown in the zoomed-box in Figure 4. The "detrend" line is calculated by subtracting the trend line from the traffic feature measurements. Table 2 describes the metrics used to evaluate the performance of the proposed Seq2Seq model architecture.

Seq2Seq Model Exploration
In this section, we will evaluate the effect that different architecture hyperparameters (number of hidden neurons of the LSTM layer, input sequence length, i.e., input data dimension, output sequence length, i.e., output data dimension) have on ICS and non-ICS sequence prediction performance. We used standard k-folding, with k = 100 (160 time points in each fold), for evaluating the performance of different hyperparameter values, using the evaluation metrics introduced in Section 6.3.
As shown in Figure 5, we varied the number of LSTM neurons (dimension h as introduced in Section 5.1) of the Seq2Seq model between 20 and 190, with an increment of 10, and plotted the evaluation metrics (Figure 5, all the evaluation metrics are calculated as an average over all the traffic features). As shown in Figure 5, with an increase in the number of hidden neurons, all the evaluation metrics improve, although there is a parabolic increase in the number of trainable parameters. Therefore, for the sake of the burden of computational complexity, we have FIGURE 9 | Time series plots of the true and predicted values for ICS traffic features. We have plotted the first 1, 000 data points out of predicted 4, 000, for visual clarity.
selected 100 hidden neurons for our Seq2Seq model architecture for future performance evaluation.
We also evaluated the prediction performance of the proposed Seq2Seq model, by varying the input sequence length (number of input data points at each step, i.e., number of encoder units n as introduced in Section 5.1). As shown in Figure 6, the evaluation metrics are plotted while varying the input sequence length between 40 and 380 (as before, all the evaluation metrics are calculated as an average over all the traffic features). Figure 6 shows that MAXE improves while increasing n between 40 and 180, and stays almost unchanged thereafter, while the other three metrics remain unchanged while varying n. This shows that the effect of increasing input sequence length does not affect the prediction performance. Therefore, we have selected 200 encoder units for our Seq2Seq model architecture for future performance evaluation.
Finally, the prediction performance of the proposed Seq2Seq model is evaluated by varying the output sequence length (number of output time step, i.e., number of decoder units q as introduced in Section 5.1). As before, four evaluation metrics were computed while varying q between 30 and 1, 000. All the evaluation metrics degrade while increasing the output prediction sequence length as shown in Figure 7. Note that, for this evaluation, the aforementioned model parameters are kept constant at the selected value. Therefore, we have selected 30 decoder units for our Seq2Seq model architecture for future performance evaluation.
For further clarity, we have put the detailed parameter values of Seq2Seq model, in Table 3.

Performance Comparison of ICS and Non-ics Traffic Features
In this section, we evaluate the prediction performance between ICS and non-ICS traffic features. In Figure 8, we calculated the percentage of prediction error 3 and plotted the histogram variation with different traffic features. The features are in the same order as they appear in Table 1. Therefore, features 1 to 10 are for ICS flows and 11 to 20 are for non-ICS flows. For both ICS and non-ICS traffic flows, the prediction performance is better for maximum and standard deviation values of packet inter-arrival time, compared with the other eight features. It is important to note here that using the recursive feature elimination (RFE, Guyon et al., 2002) algorithm, in the context of device classification, our prediction model performs better for predicting the highest ranked traffic features, such as packet inter-arrival time.
FIGURE 10 | Time series plots of the true and predicted values for non-ICS traffic features. We have plotted the first 1, 000 data points out of predicted 4, 000, for visual clarity.

Time Series Comparison
In this section we plot time series measurements and their prediction for the test dataset. Figures 9, 10 show the overlapped time series plots of the true value and the predicted value for ICS and non-ICS traffic features. The Seq2Seq architecture performs better in predicting all of the traffic features, except for the one associated with packet size. Even for the packet size related features (which vary the most with time), our proposed architecture performs significantly well.

Performance Comparison With Other Methods
In this section, we perform comparisons with other contemporary machine learning models present in the existing literature. Specifically, we compare the proposed Seq2Seq model's performance with Multilayer Perceptron (MLP) (Rosenblatt, 1961), Linear Regression (LR) (Bianco et al., 2009), Linear Support Vector Regression (LSVR) (Smola and  Schölkopf, 2004b), Random Forest (RF) (Hedén, 2016), and Gaussian Mixture model (Srivastav et al., 2013). We evaluated the prediction performance of these methods for our dataset, presented in Figure 11. We observe that the Seq2Seq model outperforms all of the aforementioned machine learning methods in all performance categories.
Furthermore, we compare the performance of the proposed Seq2Seq model with three different loss functions (weighted combination of mean absolute error (MAE), Kullback-Leibler divergence (KL), and Maximum absolute error (Max), and the Dilated ConvNet (Yazdanbakhsh and Dick, 2019). As shown in Figure 12, for different traffic features, the Seq2Seq model with the combination loss function of (10*MAE+KL+Max) performs better than other deep network architectures. FIGURE 13 | Confusion matrix with 12000 data points of measured data. The classification dataset is unbalanced with a class ratio of 17:21:2:1:12:2 (the class ratio has the same order as the confusion matrix).

CLASSIFICATION USING THE SEQ2SEQ PREDICTION MODEL
Device classification requires measured data from Site A of at least time length T for satisfactory classification performance (Chakraborty et al., 2021). For our dataset, using the optimal time length calculation, we have evaluated the required length to be 16, 000 data points for achieving overall classification accuracy more than 90% across all present devices. However, suppose that only 12, 000 data points of measured data from Site A are available, leading to reduced classification performance. We propose to use an additional 4, 000 time steps of predicted data to improve the performance of the classification task.
As proposed in Chakraborty et al. (2021), we used the traffic features as features for classifying various devices present in Site A. Before showing the performance of our prediction model in this context, we define our baseline performance as the device classification performance with 12, 000 measured data points. As seen in Figure 13, the confusion matrix shows the classification performance for six devices, with all available measurement data. We further note that we can achieve an overall classification accuracy of 73.62%, where "Workstation" and "Server" are misclassified most often. Similarly, misclassification is also noted between SEL devices. Now, as previously proposed, we added 4, 000 data points from our prediction model, along with the present 12, 000 measurement data points. Then, these 16, 000 data points of traffic features are used as feature vectors for classifying the same six devices. Similar to Figures 13, 14 shows the updated confusion matrix for the device classification task. Although we have added more data points to our feature set, we do not see any significant classification performance increase when compared to the baseline result. More precisely, FIGURE 14 | Confusion matrix with 12, 000 data points of measured data and 4, 000 data points of predicted data with separate training of Seq2Seq model and the classifier. The classification dataset is unbalanced with a class ratio of 17:21:2:1:12:2 (the class ratio has the same order as the confusion matrix). SEL-735, SEL-751, and SEL-787 classification performance was improved on average over 15%, however the classifier shows a performance reduction of 10% for SEL-2240, with the new dataset.
Next, we propose to train the Seq2Seq model and the classifier with a combined loss function to improve the device classification accuracy.

Combined Training of Seq2Seq Model and Classifier
We propose to improve the device classification performance shown in Figure 14, by a joint training of the Seq2Seq model and the classifier. The Seq2Seq model works as a traffic feature generator, while the classifier responsible for device classification works as a data discriminator, evaluating the quality of the traffic features generated from the Seq2Seq model. For demonstration, we adopt the notation similar to Arora et al. (2017) and Li et al. (2018). Let G = G g , g ∈ U and D = D d , d ∈ V denote the Seq2Seq model based traffic feature generator and device classification based discriminator respectively, where G g and D d are functions parameterized by g and d. U, V ⊆ R p represents the respective parameter spaces of the generator and discriminator. Finally, let G * be the target known distribution to which we would like to fit our Seq2Seq model, which utilizes the measured 12000 data points. We incorporate a combined weighted loss of regression and classification for the combined training of G and D. We found a weight ratio of 1 : 10, for regression to classification, works best for optimum classification performance; this was determined by experimental variation of the weight ratio.
To compare performance with Figure 14, we generated the same amount of predicted data as in Figure 14, using the combined loss function training. The classification performance is shown in Figure 15, which shows an overall FIGURE 15 | Confusion matrix with 12, 000 data points of measured data and 4, 000 data points of predicted data with combined training of the Seq2Seq and the classifier. The classification dataset is unbalanced with a class ratio of 17:21:2:1:12:2 (the class ratio has the same order as the confusion matrix).
increase in classification accuracy of around 15%, across all devices. Particularly, we can see a classification accuracy improvement of over 23% for "Workstation" and "Server" device types. The traffic feature prediction using this combination loss function performs significantly better compared to both the separate training (Figure 14) and the baseline (Figure 13).

CONCLUSION
ICS device classification requires a minimum amount of collected traffic flow data for satisfactory classification performance. Due to various constraints, we do not always have access to the required amount of data and therefore we propose an alternative approach of device classification. Our proposed approach utilizes a Seq2Seq based deep network architecture for traffic feature prediction, related to device classification. Moreover, a classifier responsible for identifying device types is used to evaluate the quality of the predicted data. We further show that a combined training of a prediction model and a classifier improves the device classification performance by as much as 23%, compared to separate training.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because, the data is generated for the sole purpose of internal research effort. Requests to access the datasets should be directed to kelley35@llnl.gov.

AUTHOR CONTRIBUTIONS
IC, BK, and BG: study conception and design and draft manuscript preparation. BK: data collection. IC: analysis and interpretation of results. All authors reviewed the results and approved the final version of the manuscript.