Artificial Intelligence in Optical Communications: From Machine Learning to Deep Learning

Techniques from artificial intelligence have been widely applied in optical communication and networks, evolving from early machine learning (ML) to the recent deep learning (DL). This paper focuses on state-of-the-art DL algorithms and aims to highlight the contributions of DL to optical communications. Considering the characteristics of different DL algorithms and data types, we review multiple DL-enabled solutions to optical communication. First, a convolutional neural network (CNN) is used for image recognition and a recurrent neural network (RNN) is applied for sequential data analysis. A variety of functions can be achieved by the corresponding DL algorithms through processing the different image data and sequential data collected from optical communication. A data-driven channel modeling method is also proposed to replace the conventional block-based modeling method and improve the end-to-end learning performance. Additionally, a generative adversarial network (GAN) is introduced for data augmentation to expand the training dataset from rare experimental data. Finally, deep reinforcement learning (DRL) is applied to perform self-configuration and adaptive allocation for optical networks.


INTRODUCTION
Machine learning (ML) techniques have been developed and applied to optical communication in both the physical layer and the network layer for years (Musumeci et al., 2018; Khan et al., 2019). Various algorithms from the ML community have powered a wide range of functions in optical communication, including digital signal processing (DSP), optical performance monitoring (OPM), signal detection and analysis, proactive fault management, network automation, and optical sensing. Conventional ML systems are limited in their ability to perform feature extraction and complex analysis, and have always relied on considerable domain expertise and feature engineering. In recent years, rapid advances in information technology, together with parallel developments in computation and low-cost computing hardware, have made big data modeling possible. Driven by this growth in the volume of data and improvements in computing power, ML has evolved into deep learning (DL), which addresses complex and large-scale problems with robust, adaptable, and efficient solutions (LeCun et al., 2015), as illustrated in Figure 1.
In general, DL can be understood as a deep neural network (DNN) with multiple non-linear layers made up of a large number of neurons, each of which is mathematically modeled as an activation function. In the DL community, different algorithms with specific structures are suited to different problems and specialize in different data types. Among them, the convolutional neural network (CNN), recurrent neural network (RNN), generative adversarial network (GAN), deep reinforcement learning (DRL), end-to-end learning based on the autoencoder, and their variants have made distinctive contributions to fields such as machine vision, natural language processing, drug discovery, genomics, speech recognition, information retrieval, affective computing, and autonomous driving (Deng, 2014). Meanwhile, in optical communication, the evolution from ML to DL is driving major advances in a wide variety of applications in both the physical and network layers (Fan et al., 2020; Häger and Pfister, 2020; Saif et al., 2020).
This paper reports the progress of AI in optical communication from ML to DL. Unlike other review papers about conventional ML algorithms, the presentation focuses on state-of-the-art DL techniques and aims to highlight the contributions of DL to optical communication for both the physical layer and the network layer. Examining the characteristics of different DL algorithms and data types, we briefly review multiple DL-enabled applications for optical communication. First, as one of the most popular DL algorithms, CNN is introduced for image recognition to process seven kinds of common image data from optical communication and execute various functions. Then RNN is applied for sequential data analysis to process digital signal waveforms, network traffic data, and equipment state parameters. In addition, a data-driven channel modeling technique using DL is proposed as a supplementary solution to conventional block-based modeling, which could also improve end-to-end learning performance. As an emerging technique, GAN is implemented for data augmentation to expand image data and network traffic data. Finally, DRL is considered for various decision-making tasks, including routing, resource allocation, and automatic configuration.

CONVOLUTIONAL NEURAL NETWORK FOR IMAGE DATA
DL is a branch of the ML family that mainly refers to neural networks. The term "neural network" has its origins in attempts to find mathematical representations of information processing in biological systems, which are built from many interconnected neurons. As the basic unit of a neural network, each neuron can be modeled as an activation function that emulates the way information is transferred in a biological system. According to the network topology, neural networks can be categorized into feedforward networks and feedback networks. A convolutional neural network is a specialized type of feedforward network, used primarily for processing image data that can be regarded as a two-dimensional (2D) grid of pixels (LeCun et al., 2015). The operating process of CNN can be described as convolution, pooling, and activation.

Convolution
The kernel convolves with pixel points across the width and height of the input image, computing the dot product between the entries of the kernel and input. The kernel works like a filter that scans the input image to extract the informative features for recognition. The extracted features from the image are displayable and explainable, such as eyes, nose, or mouth in face photos. Convolution takes advantage of sparse interaction, parameter sharing, and equivariant representations to improve the performance of image recognition.
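The sliding dot product described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name conv2d_valid, the toy 4x4 image, and the 2x2 edge-detecting kernel are all hypothetical, there is no padding, and the stride is 1. As in most DL frameworks, the kernel is applied without flipping (strictly a cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the image patch under it.
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += kernel[u][v] * image[i + u][j + v]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to a left-to-right intensity increase.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
# The response is non-zero only in the column where the edge sits.
```

The same four kernel weights are reused at every position, which is the parameter sharing mentioned above, and each output value depends only on a 2x2 patch, which is the sparse interaction.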

Pooling: Down-Sampling Operation
The output of the convolution layer at a certain location is replaced by a summary statistic of the nearby outputs. The typical pooling is to calculate the average or maximum value of a small local region in one feature map to down-sample the dimension of the feature map, thereby greatly reducing the parameter size and creating an invariance to small translations of the input.
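A minimal sketch of max and average pooling follows, assuming non-overlapping square windows. The function name pool2d and the toy feature map are hypothetical and chosen only for illustration.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling; rows/columns that do not fill a window are dropped."""
    if mode == "max":
        reduce = max
    else:
        # Average pooling: mean of the values inside the window.
        reduce = lambda window: sum(window) / len(window)
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + u][j + v]
                      for u in range(size) for v in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 1, 7, 8],
]
max_pooled = pool2d(fmap, size=2, mode="max")  # [[4, 2], [2, 8]]
avg_pooled = pool2d(fmap, size=2, mode="avg")  # [[2.5, 1.0], [1.0, 6.5]]
```

A 4x4 map is reduced to 2x2, and moving a strong response by one pixel inside its window leaves the max-pooled output unchanged, which is the translation invariance referred to above.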

Activation: Non-linear Operation
The representation capacity of the whole network is improved through the non-linear mapping between adjacent layers. Common activation functions include ReLU, Softmax, Softplus, and Sigmoid.
Due to the above factors, CNN is particularly effective at processing image data, with applications including image recognition, object detection, image understanding, and video translation (Gu et al., 2018). It has been statistically established that images often account for a large proportion of various data types. Therefore, CNN is one of the most useful approaches in DL for image processing. In optical communication, most data take the form of a digital signal, while some other kinds of information are presented in the form of images, as summarized and displayed in Figure 2. Compared with the data format of digital vectors, one great advantage of image formats is that digital data of different sizes can be comprehensively and integrally presented in a picture with a fixed pixel size. Image data with a fixed size can therefore contain various information, which is important for ML and DL in keeping their structures stable.
As can be seen from Figure 2, the seven kinds of typical image data in optical communication are linear polarization (LP) mode diagrams, orbital angular momentum (OAM) mode diagrams, eye diagrams, constellation diagrams, optical spectrum diagrams, asynchronous amplitude histogram (AAH) diagrams, and asynchronous delay-tap plot (ADTP) diagrams (Wang et al., 2017a,b; Li et al., 2018). ADTP combines asynchronous sampling with a two-tap delay line, so that each sample point comprises two measurements separated by a fixed time corresponding to the delay length; the samples are plotted as sample pairs, producing a joint map of the power and its evolution over the delay time. By analyzing and processing these image data, CNN can extract a large number of informative features to execute a variety of functions, including but not limited to channel estimation, mode demodulation, optical signal analysis, impairment diagnosis, OPM, DSP, and spectral analysis. For example, CNN is capable of: detecting mode crosstalk and estimating few-mode fiber channels from LP mode diagrams; demodulating multiplexed modes and detecting atmospheric turbulence from OAM mode diagrams; analyzing signal quality and diagnosing system impairments from eye diagrams (for intensity-modulated signals) and constellation diagrams (for complex-modulated signals); monitoring optical signal-to-noise ratio (OSNR) and identifying modulation format with low-cost methods from ADTP and AAH diagrams; and measuring and analyzing spectral characteristics from spectrum diagrams.
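For reference, the four activation functions named above can be written out with their standard definitions. The scalar, pure-Python form below is an illustrative sketch only. The max-subtraction inside softmax is a common numerical-stability device and does not change the result.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, clips negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Smooth approximation of ReLU: log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def softmax(xs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])  # larger scores receive larger probabilities
```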
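To make the ADTP construction concrete, the sketch below bins sample pairs into a 2D count grid that could then be rendered as an image for a CNN. It rests on simplifying assumptions: the samples are taken to be already acquired asynchronously, the tap delay is expressed as an integer number of samples, and the function name delay_tap_histogram and its parameters are hypothetical.

```python
def delay_tap_histogram(samples, delay, bins, lo, hi):
    """Bin pairs (samples[n], samples[n + delay]) into a bins x bins count grid."""
    grid = [[0] * bins for _ in range(bins)]
    width = (hi - lo) / bins

    def to_bin(value):
        # Map a power value to a bin index, clamping values at the edges.
        idx = int((value - lo) / width)
        return min(max(idx, 0), bins - 1)

    for n in range(len(samples) - delay):
        first = samples[n]            # measurement at the first tap
        second = samples[n + delay]   # measurement one delay length later
        grid[to_bin(first)][to_bin(second)] += 1
    return grid


# Toy power samples alternating between a low and a high level.
samples = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9]
grid = delay_tap_histogram(samples, delay=1, bins=2, lo=0.0, hi=1.0)
# grid[i][j] counts pairs whose first tap fell in bin i and second tap in bin j.
```

Collapsing the grid along one axis gives a one-dimensional amplitude histogram of the same samples, which corresponds to the AAH representation.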

RECURRENT NEURAL NETWORK FOR SEQUENTIAL DATA
Unlike CNN, which is designed for image data, RNNs are specifically proposed for sequential data, where temporal correlations exist at a range of different timescales. In contrast to feedforward neural networks, RNNs contain cyclic connections that provide the network with memory, meaning that the outputs are related not only to the current inputs but also to previously available information (Mikolov et al., 2010). Thus, RNNs have achieved great success in sequence modeling and prediction tasks, such as speech recognition, handwriting recognition, language translation, and stock price forecasting. The principle of RNN is illustrated in Figure 3. The input vector is a series of sequential data X = {..., x_{t-1}, x_t, x_{t+1}, ...}, and the neurons in the hidden layer get inputs not only from x_t of the input layer but also from the output h_{t-1} of the hidden layer at the previous time step. After passing through multiple hidden layers, an input sequence x_t is mapped into an output sequence y_t that carries information about previous states.
However, conventional RNNs find it difficult to learn long-term dependencies from sequential data. To overcome this weakness, long short-term memory (LSTM) was designed to learn long-range temporal relationships among sequential data and remember inputs for a long time (Zia and Zahid, 2019). LSTM is one of the most famous RNN variants. Its core idea is the memory cell, which can pass information through time steps, together with structures called gates, which are used to remove or add information to the memory cell, as shown in Figure 3B. The operating process of LSTM can be summarized as forgetting the old state and memorizing the fresh state, such that the useful information in the cell is passed on and the useless information is discarded. Thus, LSTM not only allows information to accumulate over a long period of time but can also forget the old state by setting it to zero and starting to count afresh.
In the era of big data, most data other than images are sequential, such as speech, language, and text. In optical communication, most data are also sequential, such as optical and electrical signals, network traffic data, and equipment state parameters, as summarized and displayed in Figure 3A. For tasks that involve these sequential data, RNNs are well suited to realizing digital signal pre-distortion and post-compensation, inter-symbol interference (ISI) cancellation, network traffic prediction, and equipment failure management.
The optical signals can be regarded as a series of time-domain data, and the mutual influence and the impairments experienced during transmission are embodied in the temporal signal waveforms. Considering the superior performance of RNN for these data, RNN can pre-distort the signal before transmission to counter transmitter imperfections, post-compensate the signal after the receiver to mitigate system impairments, or identify the crosstalk between adjacent symbols to cancel the ISI (Deligiannidis et al., 2020; Zhao et al., 2020).
For network traffic data, the traffic loads fluctuate regularly or irregularly over time according to daily statistics (Lu et al., 2015). Based on historical traffic, RNN can build a prediction model for large-scale network traffic forecasting from the perspective of temporal analysis, which is important for load balancing and network planning (Zheng et al., 2020).
Early warning and proactive protection are becoming increasingly critical for network operators, as a failure of the optical network could result in huge economic loss. The operating conditions of network equipment are reflected in the equipment state parameters, which vary over time. By analyzing a great deal of historical data, RNN can learn the variation trend of state parameters and establish a failure prediction mechanism to prevent risk in advance (Zhang et al., 2020).
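The recurrence can be sketched with a single scalar hidden unit. The update rule used here, h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), is a common textbook (Elman-style) formulation assumed for illustration; it is not specified in the text, and the weights are arbitrary hypothetical values.

```python
import math


def rnn_forward(xs, w_x, w_h, b, h0=0.0):
    """Run a one-unit recurrent layer over the sequence `xs` and return all hidden states."""
    h = h0
    states = []
    for x in xs:
        # The new state mixes the current input with the previous hidden state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# An impulse followed by zeros: later states are non-zero only because of memory.
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
```

With this input, the second and third states remain positive although their inputs are zero, and they shrink at each step, which also hints at why a plain RNN struggles to retain information over long spans.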
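The gating mechanism can be illustrated with one step of a scalar LSTM cell. The gate equations below follow the widely used standard formulation, which is assumed here rather than taken from the text; the parameter layout (a dictionary mapping each gate name to a hypothetical triple of input weight, recurrent weight, and bias) is chosen only for readability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single unit; returns the new (hidden, cell) state."""
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))        # how much of the old cell to keep
    i = sigmoid(pre_activation("input"))         # how much new content to write
    o = sigmoid(pre_activation("output"))        # how much of the cell to expose
    g = math.tanh(pre_activation("candidate"))   # candidate content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hypothetical extreme biases that force the gates almost fully open or closed.
keep = {"forget": (0.0, 0.0, 50.0), "input": (0.0, 0.0, -50.0),
        "output": (0.0, 0.0, 0.0), "candidate": (0.0, 0.0, 0.0)}
reset = dict(keep, forget=(0.0, 0.0, -50.0))

_, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=keep)
_, c_reset = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=reset)
```

With the forget gate open, the cell carries its previous value forward unchanged; with the forget gate closed, the old state is wiped to approximately zero, matching the two behaviors described above.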

END-TO-END LEARNING FOR JOINT OPTIMIZATION WITH DL-BASED CHANNEL MODEL
The conventional model of the optical communication system is constructed in a divide-and-conquer manner and consists of a series of model blocks, including symbol mapping, shaping filter, laser, modulator, fiber channel, amplifier, optical filter, detector, low-pass filter, and digital sampling, as shown in Figure 4A. This block-based optical communication system is strongly dependent on practical channel conditions and is characterized by rigid mathematical models (Agrawal, 2012). However, conventional block-based communication systems have the following deficiencies: (a) they are effective only in tractable and stable scenarios and break down in complex and dynamic scenarios; (b) they require a lot of human expertise; and (c) they have a relatively long computation time owing to the small step sizes and repeated iterative operations they undertake.
In the deep learning community, the autoencoder is another important and popular algorithm. It is an unsupervised learning algorithm for a neural network that sets the target output values equal to the inputs. The autoencoder has been applied in dimensionality reduction, feature reconstruction, and data encryption (Tschannen et al., 2018). A fundamentally new approach interprets the entire communication system as an autoencoder. It was first presented for wireless communication systems before being introduced to optical communication systems (Karanov et al., 2018). This technique is based on the concept of end-to-end learning, which seeks to jointly optimize the transmitter and receiver components in a single process. However, a major drawback hindering practical implementation is that a differentiable channel model is necessary to execute parameter adjustment through backpropagation. Accordingly, a DL-based fiber channel modeling scheme was proposed. In theory, DL can approximate any function to solve both linear and nonlinear problems. According to the characteristics of DL, the model functions can be approximated by mapping independent to dependent variables, corresponding to the input and output data as shown in Figure 4B. DL constructs an approximate model for a black box driven by source data and received data. Furthermore, because the scheme does not rely on expert experience, it can significantly reduce the modeling cost and improve the simulation efficiency. This transmission simulation model in the DT system can not only digitize the physical process but also provide the numerical channel model that is important for adaptive impairment compensation, such as the end-to-end learning method, to ensure highly reliable transmission in optical communication.
Based on the idea of an auxiliary channel, a DL-based channel as shown in Figure 4C was also flexibly embedded into an end-to-end learning model to perform joint optimization more accurately (Karanov et al., 2020).
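The data-driven idea, fitting a differentiable model using only source and received data, can be sketched on a toy problem. Note the substitution: a two-parameter polynomial fitted by gradient descent stands in for the deep network, and black_box_channel is a hypothetical, noise-free stand-in for measured transmission data. Neither is the scheme referred to in the text.

```python
def black_box_channel(x):
    """Hypothetical stand-in for measured data; the fit never looks inside it."""
    return x - 0.2 * x ** 3


def fit_surrogate(inputs, outputs, steps=2000, lr=0.5):
    """Fit y ~ a*x + b*x^3 by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(inputs, outputs):
            err = (a * x + b * x ** 3) - y
            grad_a += 2.0 * err * x / n
            grad_b += 2.0 * err * x ** 3 / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# "Source data" on a grid and the corresponding "received data".
inputs = [k / 10.0 for k in range(-10, 11)]
outputs = [black_box_channel(x) for x in inputs]
a, b = fit_surrogate(inputs, outputs)
# The fitted surrogate a*x + b*x^3 is differentiable everywhere in x.
```

The fitted surrogate recovers the input-output mapping from the data alone, and because it has a closed-form derivative, gradients can be propagated through it, which is the property end-to-end learning requires of a channel model.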
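Why a differentiable channel matters for joint optimization can be seen in a deliberately small sketch. All names and values are hypothetical: the transmitter is a one-parameter cubic pre-distorter, the receiver is a one-parameter gain, and the channel is a fixed cubic surrogate whose derivative is known. Gradients are derived by hand through the chain rule rather than by an autodiff framework.

```python
K = 0.2  # hypothetical compression coefficient of the surrogate channel
SYMBOLS = [-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0]  # normalized four-level amplitudes


def channel(x):
    return x - K * x ** 3


def channel_grad(x):
    # Derivative of the surrogate; without it the transmitter could not be trained.
    return 1.0 - 3.0 * K * x ** 2


def loss_and_grads(theta_t, theta_r):
    """Mean squared symbol error and its gradients w.r.t. both parameters."""
    loss = grad_t = grad_r = 0.0
    for s in SYMBOLS:
        x = s + theta_t * s ** 3          # transmitter: pre-distortion
        y = channel(x)                    # differentiable channel surrogate
        err = theta_r * y - s             # receiver: gain, then error
        loss += err ** 2
        grad_r += 2.0 * err * y
        # Chain rule: the receiver error flows back through the channel derivative.
        grad_t += 2.0 * err * theta_r * channel_grad(x) * s ** 3
    n = len(SYMBOLS)
    return loss / n, grad_t / n, grad_r / n


def train(steps=500, lr=0.1):
    theta_t, theta_r = 0.0, 1.0
    initial_loss, _, _ = loss_and_grads(theta_t, theta_r)
    for _ in range(steps):
        _, grad_t, grad_r = loss_and_grads(theta_t, theta_r)
        theta_t -= lr * grad_t            # transmitter and receiver are
        theta_r -= lr * grad_r            # updated together in one loop
    final_loss, _, _ = loss_and_grads(theta_t, theta_r)
    return theta_t, theta_r, initial_loss, final_loss


theta_t, theta_r, initial_loss, final_loss = train()
```

The transmitter update depends on channel_grad, so a channel that cannot be differentiated would leave the transmitter untrainable by backpropagation. This toy does not drive the error to zero; it only shows the error decreasing when both ends are adapted jointly.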

GENERATIVE ADVERSARIAL NETWORK FOR DATA AUGMENTATION
One of the main prerequisites for DL is an effective and available training dataset, and more adequate data contribute to better generalization of the model. However, in practice, labeled data are valuable and rare. In optical communication, it is difficult to collect both image data and sequential data, particularly experimental data and practical data from network operators or corporations. Beyond sufficient quantity, diversity is also essential to improving the robustness and generalization of DL models. Therefore, a lack of sufficient and diverse training data is one of the major obstacles to applying DL well in optical communication.
GAN was recently introduced as an emerging technique to implement data augmentation. GAN was originally proposed by Ian Goodfellow et al. as a way to generate image data, including handwritten digits, human faces, and animal images (Goodfellow et al., 2014). The idea behind GAN is based on the concept of a zero-sum game, as shown in Figure 5. The framework of GAN consists of two neural network models: a generative model, called the generator, that captures the data distribution and outputs generated samples, and a discriminative model, called the discriminator, that distinguishes whether a sample came from the real dataset or from the generator. During the training procedure, the two models compete with each other. The generator is designed to generate data that are as realistic as possible so that they are difficult to distinguish from real data, while the discriminator, as a binary classifier, aims to identify real and fake data as accurately as possible. The generator and discriminator are optimized alternately until the augmented data are indistinguishable from the actual data.
Inspired by GAN, a number of new image applications have been developed, such as image synthesis, image style transfer, image-to-image translation, and image reconstruction (Gui J. et al., 2020). For optical communication, data types other than images can also be combined with GAN. A network traffic data augmentation technique using GAN was proposed to augment the traffic dataset adaptively for various scenarios. Based on limited experimental traffic data, GAN captured distribution characteristics and then generated massive diverse traffic data, which significantly expanded the training dataset and improved the performance of DL models. Therefore, GAN is not limited to image data and can be applied to arbitrary data types by designing appropriate generators for specific application requirements in optical communication.
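The alternating optimization can be sketched on one-dimensional data. Everything here is a toy assumption: the generator is the linear map G(z) = a*z + b, the discriminator is a logistic unit D(x) = sigmoid(w*x + c), the "real" data are Gaussian samples with hypothetical mean and spread, and the gradients are derived by hand. The generator uses the commonly adopted non-saturating loss. No claim is made about convergence; a linear discriminator can at best tell distributions apart by their location.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples labeled 1, generated samples labeled 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: reward fooling the discriminator."""
    return -math.log(d_fake)


def train_gan(real_mean=4.0, real_std=0.5, steps=300, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters
    for _ in range(steps):
        # Discriminator step: descend the gradient of discriminator_loss.
        x_real = rng.gauss(real_mean, real_std)
        x_fake = a * rng.gauss(0.0, 1.0) + b
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= lr * ((d_real - 1.0) * x_real + d_fake * x_fake)
        c -= lr * ((d_real - 1.0) + d_fake)
        # Generator step: descend the gradient of generator_loss through D.
        z = rng.gauss(0.0, 1.0)
        d_fake = sigmoid(w * (a * z + b) + c)
        a -= lr * (d_fake - 1.0) * w * z
        b -= lr * (d_fake - 1.0) * w
    return (a, b), (w, c)


(a, b), (w, c) = train_gan()
sampler = random.Random(1)
# Augmented samples are drawn by feeding fresh noise through the trained generator.
synthetic = [a * sampler.gauss(0.0, 1.0) + b for _ in range(5)]
```

In a realistic setting both models would be neural networks and the gradients would come from automatic differentiation, but the two-step loop has the same shape.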
FIGURE 5 | Schematic of the generative adversarial network, consisting of two neural networks: a generator and a discriminator. The generator is used to produce the approximated samples from the N-dimensional random noise. The discriminator is used to identify whether a sample is real or fake. These two networks compete with each other and are optimized gradually to realize data augmentation.

DEEP REINFORCEMENT LEARNING FOR NETWORK AUTOMATION
Reinforcement learning (RL) has made great breakthroughs in solving complicated control problems based on environment-aware mechanisms. DL plays an important role in perception, acquiring information from observation of the environment and providing current state information, while RL shows powerful advantages in decision-making, sensing complex system states and learning the best policies through repeated interactions with the environment, as shown in Figure 6. DRL combines the perception of DL and the decision-making of RL to learn a policy that maximizes the cumulative reward for various tasks, such as playing Go, playing competitive video games, and controlling continuous systems in robotics.
The schematic of DRL is displayed in Figure 6. It can be observed that in DRL there are two main elements (agent and environment) and two core steps (observation and action).
The observation provides the current state information of the environment and the action represents the adjustment that the DRL agent makes according to the rewards or punishments from the environment. Therefore, DRL reflects a universal truth that the machine learns from failures in the past and grows after correcting them. Similarly, the agent of DRL learns from rewards and punishments rather than explicit instruction. Through repeated training and learning for a specific purpose, the agent grows powerful gradually to earn more rewards and avoid making mistakes, even exceeding human capacity in many domains.
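The observation-action-reward loop can be sketched with tabular Q-learning. Note the substitution: a lookup table stands in for the deep network that DRL would use to estimate action values. The environment is entirely hypothetical: three link-quality levels that cycle in a fixed order, three transceiver settings, and a reward that pays more for faster settings but punishes a setting the link cannot support.

```python
import random

N_STATES = 3   # hypothetical link-quality levels: 0 = poor, 2 = good
N_ACTIONS = 3  # hypothetical transceiver settings: 0 = most robust, 2 = fastest


def step(state, action):
    """Toy environment: return (reward, next_state) for the chosen action."""
    # A setting more aggressive than the link supports fails and is punished.
    reward = float(action + 1) if action <= state else -1.0
    next_state = (state + 1) % N_STATES  # link quality cycles deterministically
    return reward, next_state


def train(steps=3000, alpha=0.5, gamma=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state = 0
    for _ in range(steps):
        # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: q[state][a])
        reward, next_state = step(state, action)
        # Q-learning update toward reward plus discounted best future value.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    return [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES)]


q_table = train()
policy = greedy_policy(q_table)  # best setting found for each link-quality level
```

The agent is never told which setting suits which link quality; the mapping emerges purely from rewards and punishments, which is the learning principle described above.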
In the context of optical communication, DRL is particularly useful for network control and automation and has thus been applied in the network layer to automate routing, resource allocation, orchestration, and configuration (Chen et al., 2019a,b; Suárez-Varela et al., 2019; Andreoletti et al., 2020; Wang et al., 2021). A DRL-based routing solution was proposed for the optical transport network (OTN) that can better capture the crucial relationships among the lightpaths and paths in OTN topologies (Suárez-Varela et al., 2019). Considering real network topologies and traffic profiles, the routing policy learned by the agent outperformed well-known routing heuristics. Moreover, the elastic optical network (EON), where the spectrum distribution becomes extremely flexible and spectrum resource management confronts big challenges (Yin et al., 2013; Zhu et al., 2013; Gong and Zhu, 2014), requires more automatic and smart control schemes. Accordingly, a DRL-based spectrum assignment scheme was introduced. A DRL-based observer was also proposed to select the duration of each service cycle adaptively, realizing adaptive and high-quality virtual network function services (Li B. et al., 2020). This study obtained superior results, especially under dynamic, flexible, and complex scenarios.
Additionally, we proposed an adaptive optical transceiver configuration technique using DRL for data center optical networks and passive optical networks. Traditional transceivers are suitable only for static scenarios, where the transmission capability is fixed and massive spectrum resources are wasted. Therefore, the flexible optical transceiver is considered a promising candidate for flexible service provisioning, but it faces the challenge of searching for optimum transceiver parameter sets under complex network conditions, including diverse user types, modulation formats, multi-level access distances, quality of transmission, and transmission speed. With the help of DRL, flexible transceivers can be adaptively configured according to network environment states. To improve throughput and spectral efficiency, the agent gradually learns the relationship between the network state and the reward of varied configuration actions.