Realizing asynchronous finite-time robust tracking control of switched flight vehicles by using nonfragile deep reinforcement learning

In this study, a novel nonfragile deep reinforcement learning (DRL) method was proposed to realize the finite-time control of switched unmanned flight vehicles. Control accuracy, robustness, and intelligence were enhanced in the proposed control scheme by combining conventional robust control and DRL characteristics. In the proposed control strategy, the tracking controller consists of a dynamics-based controller and a learning-based controller. The conventional robust control approach for the nominal system was used for realizing a dynamics-based baseline tracking controller. The learning-based controller based on DRL was developed to compensate model uncertainties and enhance transient control accuracy. The multiple Lyapunov function approach and mode-dependent average dwell time approach were combined to analyze the finite-time stability of flight vehicles with asynchronous switching. The linear matrix inequalities technique was used to determine the solutions of dynamics-based controllers. Online optimization was formulated as a Markov decision process. The adaptive deep deterministic policy gradient algorithm was adopted to improve efficiency and convergence. In this algorithm, the actor–critic structure was used and adaptive hyperparameters were introduced. Unlike the conventional DRL algorithm, nonfragile control theory and adaptive reward function were used in the proposed algorithm to achieve excellent stability and training efficiency. We demonstrated the effectiveness of the presented algorithm through comparative simulations.


Introduction
Aerospace technology has developed rapidly since the 20th century (Wang et al., 2021; Giacomin and Hemerly, 2022; Wang and Xu, 2022). To satisfy the requirements of scientific exploration, military attack, transportation, industrial assistance, and other domains (Bao et al., 2021), flight vehicle systems are becoming increasingly complex (Wu et al., 2021; Lee and Kim, 2022). As an effective tool for the analysis of complex nonlinear systems, switched systems exhibit considerable potential for handling fast time variation (Hu et al., 2019), full-envelope flight, structural model mutation (Grigorie et al., 2022), and re-modeling (Yue et al., 2019), among others (Chen et al., 2022; Yang et al., 2022).
Switched systems are composed of a series of discrete or continuous subsystems, and a switching signal controls the switching logic between these subsystems (Zhang et al., 2019). Switched systems exhibit considerable potential in theoretical research and engineering applications (Sun and Lei, 2021), such as modeling (Huang et al., 2020), stability analysis (Yang et al., 2020; Zhang and Zhu, 2020), and control problems (Gong et al., 2020; Xiao et al., 2020). The stability analysis of switched systems typically underpins controller design (Liu et al., 2020). The common Lyapunov function (CLF) method is widely used for stability analysis under arbitrary switching (Jiang et al., 2020). However, ensuring that a CLF is shared by all the subsystems remains challenging. This method is conservative to some degree, which has motivated research on the multiple Lyapunov function (MLF) and average dwell time (ADT) methods. Zhao et al. (2012) first studied the stability of switched systems with ADT switching. In another study, the linear copositive function was extended to the MLF, and the multiple linear copositive Lyapunov function method was used to obtain a sufficient stability criterion for switched systems (Cheng et al., 2017). To obtain tighter bounds on the dwell time, the mode-dependent average dwell time (MDADT) method was proposed; it avoids the common parameters shared by all subsystems and the worst-case treatment required by the ADT method. The results were extended to a general case, and the properties of subsystems were considered. Generally, unstable modes may exist during the switching intervals. Therefore, a piecewise multi-Lyapunov function method was proposed in Zhao et al. (2017) for the stability analysis of unstable modes. To avoid dwelling for a long time in subsystems with poor performance, the MDADT methods typically apply slow switching to stable modes and fast switching to unstable modes. Xu et al. (2019) proposed a time-dependent quadratic Lyapunov function method to solve the stability problem in which all subsystems are unstable. The bounded maximum ADT method was used to obtain the stability conditions of the linear switched system. However, these studies have focused only on infinite-time stability, whereas the performance of the systems in finite time cannot be guaranteed. Unlike conventional Lyapunov stability, finite-time stability (FTS) can achieve superior transient performance in finite time. Wei et al. (2020) proposed a novel MDADT switching signal. The dynamic decomposition technique was used to generate the switching signals, and sufficient conditions for FTS were detailed. For nonlinear switched systems with time delay, the Lyapunov-Razumikhin approach and the Lyapunov-Krasovskii function method were used to investigate FTS problems (Wang et al., 2020). Furthermore, tracking control is widely applied in flight vehicles (Liu et al., 2021). The finite-time tracking control problems studied in Wang et al. (2017) motivate further research on finite-time robust tracking control of switched flight vehicles.
The tracking control problem for uncertain systems has been investigated through three classes of methods (Liu et al., 2019; Chen et al., 2020; Lu et al., 2022): (1) constant parameter control, such as robust control, proportional integral derivative control, and optimal control, in which the worst case is considered for the bounded uncertainties and disturbances; (2) variable parameter control, such as adaptive and observer-based controls, in which the uncertainties and disturbances are compensated in real time; (3) learning-based control, such as reinforcement learning, which compensates for uncertainties without prior knowledge and learns a control law through trial and error. In constant parameter control, the model uncertainties and external disturbances are assumed to be bounded with known boundaries, which results in performance degradation and conservative control laws. The variable parameter control method can be used to mitigate the problem of time-varying uncertainties with unknown boundaries. However, the model uncertainties are assumed to be linearly parameterized with a predefined structure and unknown time-varying parameters. The learning-based control method can be used for addressing system uncertainties with unknown boundaries and unknown structures (Yuan et al., 2017). However, this method cannot ensure stability, and its computational complexity is higher. A novel model-reference adaptive law and a switching logic were developed for uncertain switched systems. Ban et al. (2018) designed an H∞ controller for polytopic uncertain switched systems. Introducing scalar parameters reduced the conservatism of the linear matrix inequality (LMI) conditions and simultaneously ensured robust H∞ performance of the system. The problems of nonfragile control for nonlinear switched systems considering actuator failures and parametric uncertainties were studied in Sakthivel et al. (2018). The Lyapunov-Krasovskii function method and the ADT approach were used to design a nonfragile reliable sampled-data controller. These studies have focused on control in the ideal environment. However, in practice, because of the limited network bandwidth, network delay and packet loss always exist, which cause inevitable asynchronous switching. Thus, the controller switching lags behind the state switching. This phenomenon results in performance degradation and instability. Li and Deng (2018) investigated the pth moment exponential input-to-state stability (ISS) of switched systems with asynchronous switching. The indefinite differentiable Lyapunov function was combined with ADT to establish the ISS conditions of switched systems with Lévy noise. The conclusions of these results (Zhang and Zhu, 2019) were generalized in Li and Deng (2018), and the ISS, stochastic-ISS, and integral-ISS problems for asynchronously switched systems were investigated. Fast ADT switching was introduced to mitigate the increase in the Lyapunov-Krasovskii function when the active subsystem mismatches the controller. However, in most existing results on controller design for flight vehicles, although stability and robustness can be attained, achieving optimal control performance in real time remains challenging.
With improvements in the computing power of hardware, machine learning has been widely applied in many fields, including the control field (Cheng and Zhang, 2018; Guo et al., 2019; Gheisarnejad and Khooban, 2021). Xu et al. (2019) proposed a model-driven deep deterministic policy gradient (DDPG) algorithm for robotic multi-peg-in-hole assembly to avoid the analysis of the contact model. A feedback strategy and a fuzzy reward function were proposed to improve data efficiency and learning efficiency. In Tailor and Izzo (2019), the optimal trajectory for a quadcopter model in two dimensions was investigated. A near-optimal policy was proposed to construct trajectories that satisfy Pontryagin's principle of optimality through supervised learning. With improved aircraft performance, guidance and control systems require rapidity, stability, and robustness. Therefore, deep learning and the exploration of reinforcement learning are an effective solution to this problem, which cannot be solved using conventional control. Cheng et al. (2019) and Gaudet et al. (2020) studied fuel-optimal landing problems based on DRL. The optimal control algorithms were designed considering the uncertainties of the environment and system parameters by using deep neural networks and policy gradient methods to ensure the real-time performance and optimality of the landing mission. The design of the reward function is a critical factor for controller and filter design with DRL. The reward function determines the final performance of the trained networks, but its design has not been treated satisfactorily. This study is motivated to solve this problem. However, the methods proposed in Tailor and Izzo (2019) and Gaudet et al. (2020) could not ensure the robustness and stability of the given system. Considering the advantages and limitations of the model-based and model-free methods, we propose a novel nonfragile DRL method for achieving asynchronous finite-time robust tracking control of switched flight vehicles. This method realizes a compromise among system stability, robustness, and rapidity. An intelligent controller based on nonfragile H∞ control and DRL is proposed to compensate for model uncertainties and realize superior control performance. The FTS and finite-time robustness are realized by nonfragile H∞ control, and the transient performance is optimized by using the adaptive deep deterministic policy gradient (ADDPG) algorithm. Because of the significance of reward function design in the training process, adaptive hyperparameters are introduced to construct a generalized reward function that improves performance and robustness. The contributions of the paper can be summarized as follows: (1) A novel control structure consisting of dynamics-based and learning-based controllers was proposed for the finite-time tracking control of switched flight vehicles. Robust control focuses on the worst case of uncertainties; however, transient performance is not ensured. Learning-based methods, such as DRL, can address uncertainties with unknown boundaries and structures; however, stability is not guaranteed.
Compared with the conventional method, this design structure combines the advantages of the conventional robust control method and pure DRL. The DRL is used to enhance control performance without exploiting the structures or boundaries of the uncertainties, and robustness is guaranteed by using model-based robust control. Thus, a compromise between robustness and dynamic performance was achieved.
(2) The stability and robustness of the closed-loop system were guaranteed by using nonfragile control theory. A restricted DRL algorithm was proposed, in which the boundaries of the scheduling intervals were predefined. The scheduling of parameters can be viewed as a perturbation of the parameters within a given interval. Compared with pure DRL, the proposed method improved training efficiency and ensured stability of the closed-loop system.
(3) Adaptive reward functions were proposed to realize rapid training convergence. The reward functions are crucial for the DRL algorithm. The conventional design of reward functions typically depends on the experience of the researchers, which degrades training efficiency and leads to trial and error. Therefore, in the proposed method, adaptive factors for the reward functions were used to improve training efficiency.
The rest of the paper is organized as follows. In Section 2, the structure of the intelligent switched controllers is presented. In Section 3, the finite-time robust tracking control algorithm using DRL and H∞ control is proposed. A numerical example is provided in Section 4. Finally, Section 5 presents the summary and directions for future studies.

Problem statement
The HiMAT vehicle, an unmanned flight vehicle, was studied. Its nonlinear model is described in Eq. (1), where m_f and v denote the mass and velocity of the flight vehicle, respectively. Here, α, θ, ϕ, and q are the attack angle, flight path angle, pitch angle, and pitch rate, respectively. Furthermore, M_yy and I_y are the pitch moment and the moment of inertia about the pitch axis, respectively, and g denotes the gravitational constant. T, D, and L represent the thrust, drag force, and lift force, which are expressed in terms of the air density ρ and the throttle setting δ_c.
Based on Jacobian linearization, the nonlinear model of the HiMAT vehicle can be converted into linear models, which bridges the complex nonlinear model and the linear models. Therefore, the longitudinal short-period model of the HiMAT vehicle can be modeled as a switched system, in which u(k), composed of δ_e, δ_v, and δ_c (the elevator, elevon, and canard deflections), and y(k) denote the control and output signals. Here, σ(k) ∈ {1, ..., n} is the switching function, which is a piecewise constant function, and n > 1 is the number of subsystems. The characteristics of the subsystems are assumed to depend on the switching signal, which is known in advance. Here, A_i, B_i, C_i, and D_i are system matrices with appropriate dimensions.
In the network environment, packet dropouts should be considered because of the limited network bandwidth. The packet dropouts in the sensor-to-controller channel are assumed to follow the Bernoulli distribution (Cheng et al., 2018). Therefore, the measured output is described in terms of the true output, a stochastic variable θ(k) that follows the Bernoulli distribution and takes values in {0, 1}, and the probability of packet dropouts. The control structure of the switched flight vehicles, which ensures stability and improves transient performance, is displayed in Figure 1.
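The dropout model above can be illustrated with a short simulation. This is a minimal sketch, not the paper's measurement equation: the function name `simulate_dropouts`, the hold-last-value rule for lost samples, the initial held value of 0.0, and the cap on consecutive losses are all assumptions made for illustration.

```python
import random

def simulate_dropouts(y, drop_prob, max_consecutive, seed=0):
    """Return (received flags, measured outputs) for a lossy sensor channel.

    Assumptions (not taken from the paper): a lost sample is replaced by the
    last received one, the held value starts at 0.0, and a run of losses is
    cut off after `max_consecutive` samples.
    """
    rng = random.Random(seed)
    flags, measured = [], []
    last, run = 0.0, 0
    for sample in y:
        # Bernoulli draw: the packet is lost with probability drop_prob,
        # unless the run of losses has already reached its cap.
        lost = rng.random() < drop_prob and run < max_consecutive
        if lost:
            run += 1
        else:
            run = 0
            last = sample
        flags.append(0 if lost else 1)
        measured.append(last)
    return flags, measured
```

With `drop_prob = 1.0` and `max_consecutive = 5`, every sixth sample gets through, which makes the effect of the cap easy to see.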
The controller diagram reveals that the controller is composed of two parts, the dynamics-based controller u_n(k) and the learning-based controller u_c(k), as given in Eq. (5). The objective of tracking control is for the output to follow r_c(k), where r_c(k) denotes the command signal.
We define the integral of the tracking error and propose a feedback controller in which K_{n,1i} and K_{n,2i} are the gain matrices to be determined.
The nominal controller parameters K_{n,1i} and K_{n,2i} can be designed by H∞ control, and the variation interval of the learning-based controller u_c(k) in subsystem i can be regarded as an additional bounded uncertainty of the dynamics-based controller. Thus, the parameters vary within an interval, and the stability of the learning-based controller can be analyzed by using nonfragile control theory. Here, ΔK_{c,i} is defined as the additional compensation used to obtain the actual gain matrices, the lower and upper bounds of ΔK_{c,i} are given, M_i and N_i are known matrices with appropriate dimensions, and F_i are uncertain matrices satisfying the norm bound used in Lemma 1 (F^T F < I). The learning-based controller u_c(k) is trained by the ADDPG algorithm to achieve superior performance in real time. The output of u_c(k) varies in a neighborhood of u_n(k) with given bounds. Therefore, nonfragile control can be used to ensure the stability of u_c(k). As mentioned, ensuring stability, robustness, and optimal performance simultaneously remains difficult. To improve training efficiency, adaptive factors for the reward functions were applied in the DDPG algorithm. Inspired by the achievements of the DDPG algorithm and robust control, the advantages of the model-based method (H∞ control) and the model-free method (DRL) were combined to address the problem.
Remark 2: The compensation of the learning-based controller is considered as an additional gain value on the controller parameters with known bounds, which can be predefined and can be represented by M_i and N_i. The optimal control policy can be realized within the scheduling interval by using the ADDPG algorithm.
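The bounded compensation described in Remark 2 can be sketched as a mapping from a normalized learning output to a gain increment inside a predefined interval. This is an illustrative sketch under stated assumptions: the helper names `scale_to_interval` and `actual_gains`, the element-wise linear scaling, and the numeric values are hypothetical, and the paper's own parameterization through M_i, F_i, and N_i is not reproduced here.

```python
def scale_to_interval(action, lower, upper):
    """Map one normalized output in [-1, 1] linearly onto [lower, upper]."""
    a = max(-1.0, min(1.0, action))   # guard against out-of-range outputs
    return lower + 0.5 * (a + 1.0) * (upper - lower)

def actual_gains(k_nominal, actions, dk_lower, dk_upper):
    """Nominal gain plus the bounded compensation, element by element."""
    return [
        k + scale_to_interval(a, lo, hi)
        for k, a, lo, hi in zip(k_nominal, actions, dk_lower, dk_upper)
    ]

# Hypothetical numbers: two gain entries with symmetric scheduling intervals.
gains = actual_gains([-2.0, 0.5], [1.0, -1.0], [-0.2, -0.1], [0.2, 0.1])
```

Because the mapping saturates at the interval ends, whatever the learning-based part outputs, the actual gain stays inside the range for which the nonfragile analysis is carried out.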
The switching of the controller always lags the switching of the system mode because of packet dropouts. The ith subsystem is assumed to be activated at k_i, and the controller of the ith subsystem is activated at k_i + Δ_i, where Δ_i denotes the length of the unmatched period. The condition in which unmatched and matched periods exist simultaneously is called asynchronous switching. The Lyapunov-like function decreases in matched periods and increases in unmatched periods with bounded rates, where a_i represents the decreasing rate in matched periods and b_i represents the increasing rate in unmatched periods. The increasing coefficients of the Lyapunov-like function at switching instants are set to be ∝_i.
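The interplay of matched and unmatched periods can be made concrete with a small worst-case propagation. This is a sketch under assumed update rules, since the paper's inequalities are not reproduced here: the function name `lyapunov_bound`, the per-step factors (1 + b_i) and (1 - a_i), and the single `jump` factor applied at switching instants are illustrative assumptions.

```python
def lyapunov_bound(v0, segments, jump=1.0):
    """Propagate a worst-case Lyapunov-like bound through switching intervals.

    segments: list of (unmatched_steps, matched_steps, a_i, b_i), one entry
    per activation of a subsystem. Assumed model (illustrative only):
      unmatched step:     V <- (1 + b_i) * V
      matched step:       V <- (1 - a_i) * V
      switching instant:  V <- jump * V
    Returns the bound at the start and after every segment.
    """
    v = v0
    history = [v]
    for idx, (unmatched, matched, a_i, b_i) in enumerate(segments):
        if idx > 0:
            v *= jump                      # increase at the switching instant
        v *= (1.0 + b_i) ** unmatched      # controller lags the mode
        v *= (1.0 - a_i) ** matched        # controller matches the mode
        history.append(v)
    return history
```

The bound only shrinks over a segment when the matched period is long enough to offset the growth accumulated while the controller lags, which is the intuition behind dwell-time conditions.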
For proof, the following assumptions are introduced.
Assumption 1 (Cheng et al., 2017): For a given positive constant N_f, the time-varying exogenous disturbance ω(k) is bounded over the finite horizon, where ω is the upper bound of the external disturbance.
Assumption 2 (Cheng et al., 2017): The maximum number of consecutive data missing is set to be N_1, and the maximum probability of data missing is set to be ρ.
According to the preceding statements, the closed-loop switched system can be described by Eq. (12). Furthermore, the definitions of finite-time stability (FTS), finite-time boundedness (FTB), and finite-time H∞ performance for switched systems are given as follows.
Definition 1 (Wei et al., 2020): For a given positive-definite matrix R_s and positive constants c_1 > 0 and c_2 > 0 with c_1 < c_2, the switched system in Eq. (12) with u(k) ≡ 0 is FTS if Eq. (13) holds.
Definition 2 (Wei et al., 2020): For a given positive-definite matrix R_s and constants c_1 > 0, c_2 > 0, ω, and N_f with c_1 < c_2, the switched system in Eq. (12) is FTB if the corresponding state bound holds for every external disturbance satisfying Assumption 1.
Definition 3 (Wei et al., 2020): For a given positive-definite matrix R_s and constants c_1 > 0, c_2 > 0, ω, and N_f with c_1 < c_2, the system in Eq. (12) exhibits finite-time H∞ performance γ_d if the system is FTB and satisfies Eq. (15).
Thus, the main purpose of controller design is to ensure that the switched system is FTS with prescribed H∞ performance, which is equivalent to designing a robust controller such that the following conditions are satisfied:
1. The switched system in Eq. (12) is FTB.
2. For a given constant γ_d > 0, the system in Eq. (12) satisfies Eq. (15) under the zero-initial condition for all external disturbances satisfying Eq. (11).
Based on the structure of the control diagram, the design process is divided into two steps.
Step 1: The scheduling interval of the control parameters can be regarded as the uncertain compensation of the dynamics-based controller. Considering the controller uncertainties and the asynchronous switching caused by packet dropouts, the finite-time H∞ controllers are derived as the dynamics-based controller according to nonfragile control theory and finite-time robust control theory in terms of LMIs.
Step 2: The variations of controller parameters are assumed to be the action, and the dynamic model of flight vehicles is assumed to be the environment.The DRL algorithm was introduced to derive the learning-based controller to realize optimal control policy, in which the ADDPG algorithm was proposed as the model-free method in the actor-critic framework.

Main results
A dynamics-based controller was proposed to ensure stability and a prescribed performance index. The ADDPG algorithm was developed to optimize performance and to enable the controllers to schedule their parameters adaptively.
then τ_{ai} is called the MDADT and N_{0i} is called the mode-dependent chatter bound.
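The MDADT condition can be checked numerically on a sampled switching sequence. The sketch below assumes the standard form of the definition, in which the number of activations of mode i on any window is at most N_{0i} plus the total active time of mode i divided by τ_{ai}; the function names, the convention that a window starting in the middle of an activation counts that activation, and the example sequences are illustrative assumptions.

```python
def mode_stats(sigma, start, end):
    """Activation count and total active time of each mode on sigma[start:end]."""
    counts, durations = {}, {}
    prev = None
    for k in range(start, end):
        mode = sigma[k]
        durations[mode] = durations.get(mode, 0) + 1
        if mode != prev:                      # a new activation of this mode
            counts[mode] = counts.get(mode, 0) + 1
        prev = mode
    return counts, durations

def satisfies_mdadt(sigma, tau, n0):
    """Check N_i <= n0[i] + T_i / tau[i] on every window of the sequence."""
    for start in range(len(sigma)):
        for end in range(start + 1, len(sigma) + 1):
            counts, durations = mode_stats(sigma, start, end)
            for mode, n in counts.items():
                if n > n0[mode] + durations[mode] / tau[mode]:
                    return False
    return True

slow = [1, 1, 1, 2, 2, 1, 1, 1, 2, 2]
fast = [1, 2, 1, 2, 1, 2, 1, 2]
slow_ok = satisfies_mdadt(slow, tau={1: 3, 2: 2}, n0={1: 2, 2: 2})
fast_ok = satisfies_mdadt(fast, tau={1: 3, 2: 3}, n0={1: 1, 2: 1})
```

The brute-force window scan is quadratic in the sequence length, which is acceptable for short illustrative sequences but not for long simulations.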
Lemma 1 (Cheng et al., 2017): then we can obtain the following: where F satisfies F^T F < I.
Lemma 2 (Aristidou et al., 2014): For a given symmetric block matrix Q with blocks Q_11, Q_12, and Q_22, where Q_11 and Q_22 are invertible matrices, the following three conditions are equivalent, which is called the Schur complement.
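The equivalence stated in Lemma 2 can be checked numerically in the simplest case of scalar blocks. The sketch assumes the negative-definite form of the Schur complement (Q < 0), which is the form commonly used with LMIs; the three conditions of the lemma are not reproduced here, so the sign convention and the function names are assumptions.

```python
import math

def is_negative_definite_2x2(q11, q12, q22):
    """Check Q = [[q11, q12], [q12, q22]] < 0 through its largest eigenvalue."""
    mean = 0.5 * (q11 + q22)
    radius = math.hypot(0.5 * (q11 - q22), q12)
    return mean + radius < 0.0

def schur_conditions(q11, q12, q22):
    """Evaluate the three equivalent conditions for scalar blocks."""
    direct = is_negative_definite_2x2(q11, q12, q22)
    # Q11 < 0 and Q22 - Q12^T Q11^{-1} Q12 < 0
    via_q11 = q11 < 0.0 and (q22 - q12 * q12 / q11) < 0.0
    # Q22 < 0 and Q11 - Q12 Q22^{-1} Q12^T < 0
    via_q22 = q22 < 0.0 and (q11 - q12 * q12 / q22) < 0.0
    return direct, via_q11, via_q22

agree_neg = schur_conditions(-2.0, 1.0, -3.0)   # all three hold
agree_not = schur_conditions(-1.0, 2.0, -1.0)   # all three fail
```

In LMI-based design the same identity is applied to matrix blocks to turn a nonlinear matrix inequality into a linear one in the decision variables.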
Theorem 1: Given the system in Eq. (12) and constant scalars, if the matrix inequalities in Eqs. (21) and (22) hold, then the switched system in Eq. (12) is FTB if the MDADT satisfies Eqs. (23) and (24).
Proof: With the switching instants over the interval [0, k], suppose that the Lyapunov functions in Eq. (25) exist, together with class κ∞ functions that bound them. Combining these with Eqs. (12) and (27), setting S_i = P_i^{-1}, and performing a congruence transformation, we conclude that Eq. (31) is equivalent to Eq. (21) and Eq. (32) is equivalent to Eq. (22). Combining Eqs. (25), (26), (28), and (34) and iterating yields a bound on the Lyapunov function, which is then expressed through the definitions of η_1 and η_2. If Eqs. (23) and (24) hold, the state bound required for FTB follows. Thus, the switched system in Eq. (12) is FTB, which completes the proof.
The sufficient conditions for finite-time boundedness are given in Theorem 1, and the prescribed attenuation performance is discussed in Theorem 2.
Theorem 2: Given the system in Eq. (12) and constant scalars, if the inequalities in Eqs. (40) and (41) hold, then the system with the MDADT satisfying Eq. (42) is FTS with H∞ performance γ_d.
Proof: The Lyapunov functions are determined in Eq. (25), and the zero-initial condition is assumed. Similar to Eq. (33), Eqs. (40) and (41) give Z_ii < 0 and Z_ij < 0, so the system in Eq. (12) is FTB. According to V_{σ(k)}(k) ≥ 0 and the zero-initial condition, Eq. (53) is obtained. Multiplying both sides of Eq. (53) by the corresponding factor, using the definition of the MDADT and Eq. (42), and combining with Eqs. (43) and (45), we obtain the H∞ performance bound. Therefore, the system in Eq. (12) is FTB with the given attenuation index γ_d, which completes the proof.
Based on Theorems 1 and 2, the parameters of the finite-time tracking controller of the switched systems are derived in Theorem 3.

Online scheduling based on the ADDPG algorithm
Based on the finite-time H∞ control, the sufficient conditions to ensure the FTS and prescribed performance are presented. The process of online scheduling can be formulated as a Markov decision process (MDP). Because the control process is a sequence of continuous decisions, the ADDPG algorithm was proposed based on the actor-critic framework to realize superior control performance of switched flight vehicles.
DRL is composed of an agent and the environment with which it interacts. At each time step, the agent observes a state s_k, selects an action a_k, and receives a reward r_k and the next state s_{k+1} by interacting with the environment, where r_k evaluates the performance of the state-action pair at that time instant. In this study, the switched tracking controller is viewed as the agent, whose purpose is to maximize the expected sum of discounted rewards over a series of future steps, where γ_d ∈ [0, 1] denotes the discount factor and K_f denotes the terminal step of reinforcement learning. The value of the reward depends on the action taken and the current state. The action is the variation of the controller parameters, and the state is the observed state of the flight vehicle. The ADDPG algorithm builds on the DDPG algorithm, which combines the advantages of deep Q-learning and the actor-critic framework to obtain the optimal action, updated in continuous action spaces based on policy gradient theory. The ADDPG algorithm consists of two parts: the action value at each step is approximated by the critic network Q(s_k, a_k | ς^Q) with weights ς^Q, and the current control policy is given by the actor network ϖ(s_k | ς^ϖ) with weights ς^ϖ. The weights of the critic network are updated by minimizing a loss function, and the weights of the actor network are updated according to the policy gradient, where L_an is the learning rate of ϖ(s_k | ς^ϖ).
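The quantities named above can be written out in a few lines. This is a minimal sketch of the standard DDPG-style definitions, not the paper's exact equations: the function names and the list-based stand-ins for network outputs are assumptions, and a real implementation would compute the Q-values with neural networks.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**j * r_j over an episode, accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def td_targets(rewards, next_q_values, gamma, terminal):
    """Targets y_k = r_k + gamma * Q'(s_{k+1}, a'_{k+1}); no bootstrap at terminal steps."""
    return [
        r if done else r + gamma * q_next
        for r, q_next, done in zip(rewards, next_q_values, terminal)
    ]

def critic_loss(targets, q_values):
    """Mean squared temporal-difference error over a mini-batch."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)

# Hypothetical mini-batch of two transitions, the second one terminal.
targets = td_targets([1.0, 2.0], [10.0, 10.0], 0.9, [False, True])
loss = critic_loss(targets, [8.0, 2.0])
```

The actor update then follows the gradient of the critic output with respect to the action, which requires automatic differentiation and is therefore left out of this plain-Python sketch.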
To overcome the divergence of Q-learning, two separate target networks were adopted: the actor target network ϖ′ and the critic target network Q′, whose weights are updated with the learning rates L_atn and L_ctn. Moreover, an exploration noise N_a is added to the actor output to realize exploration, which gives the actual control policy. Unlike the conventional DDPG algorithm, adaptive parameters were introduced to achieve superior convergence and robustness. By introducing robustness as a continuous parameter, the reward function enables convenient exploration and adaptive training. The control policy aims to reduce the tracking error with a small control input and an unsaturated actuator; therefore, the reward function depends on the tracking error, the amplitude of the control signal, and the saturation of the actuator. It comprises r_e1(k), the reward for the tracking error; r_e2(k), the reward for the control input; and r_e3(k), the reward for saturation. Here, g_1, g_2, and g_3 denote the weights of r_e1(k), r_e2(k), and r_e3(k) in the reward function. Furthermore, υ_1 and υ_2 are the adaptive shape parameters, which determine the robustness of the reward function, and l_1 > 0 and l_2 > 0 are the parameters that control the size of the quadratic bowl near the origin. Here, δ_p is a predefined constant and ū(k) denotes the upper bound of the actuator. The reward terms r_e1(k) and r_e2(k) are then rewritten with adaptive parameters, and the adaptive updating law of the hyperparameters is defined to improve the transient performance and robustness of the algorithm, where υ_1max and υ_1min denote the maximum and minimum values of υ_1. The definitions of υ_2max, υ_2min, l_1min, and l_2min are similar. The length of each segment is determined by the training episodes.
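A reward with a continuous shape parameter and a quadratic bowl near the origin can be sketched with a general robust loss. The paper's reward equations and updating law are not reproduced here, so everything below is an assumption made for illustration: the loss form, the piecewise-constant schedule that moves the shape parameter from its maximum to its minimum over equal-length segments of training episodes, and all names and default values.

```python
import math

def robust_loss(x, shape, scale):
    """General robust loss; `shape` sets robustness, `scale` the bowl size."""
    z = (x / scale) ** 2
    if shape == 2.0:
        return 0.5 * z                       # plain quadratic loss
    if shape == 0.0:
        return math.log(0.5 * z + 1.0)       # heavy-tailed limit
    b = abs(shape - 2.0)
    return (b / shape) * ((z / b + 1.0) ** (0.5 * shape) - 1.0)

def scheduled_shape(episode, total_episodes, n_segments, v_max, v_min):
    """Piecewise-constant schedule from v_max down to v_min (assumed direction)."""
    if n_segments == 1:
        return v_max
    seg_len = max(1, total_episodes // n_segments)
    seg = min(episode // seg_len, n_segments - 1)
    return v_max + (v_min - v_max) * seg / (n_segments - 1)

def tracking_reward(error, episode, total_episodes, g1=1.0, scale=1.0,
                    n_segments=4, v_max=2.0, v_min=1.0):
    """Negative weighted robust loss of the tracking error."""
    shape = scheduled_shape(episode, total_episodes, n_segments, v_max, v_min)
    return -g1 * robust_loss(error, shape, scale)
```

With a smaller shape parameter, large tracking errors are penalized less steeply than under a quadratic reward, while small errors near the origin are treated almost identically.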
Based on the statement, the pseudocode for the ADDPG algorithm proposed in this paper is presented in Algorithm 1.

Randomly initialize the weights ς^Q and ς^ϖ of the critic network Q(s_k, a_k | ς^Q) and the actor network ϖ(s_k | ς^ϖ).
4. Initialize the weights of the target networks ϖ′ and Q′.
12. Store the transition pair, which consists of s_k, a_k, r_k, and s_{k+1}, in the replay buffer R.
13. Randomly sample a mini-batch of N transition pairs from the replay buffer R.
Update the weights of the critic network Q(s_k, a_k | ς^Q).
16. Update the weights of the actor network ϖ(s_k | ς^ϖ).
17. Update the weights of the target networks.

Remark 3: Although the conventional DDPG algorithm can realize parameter optimization (Xu et al., 2019; Gaudet et al., 2020; Gheisarnejad and Khooban, 2021), guaranteeing data efficiency and system stability is difficult because the algorithm attempts to explore the optimal control policy over all possible actions in the action space. Moreover, because the reward function determines training performance, the proposed adaptive hyperparameters can increase robustness and cover more general cases.
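The replay-buffer and target-network steps of Algorithm 1 (steps 12, 13, and 17) can be sketched as follows. This is an illustrative sketch in plain Python: the class and function names are hypothetical, the target update is written in the usual DDPG soft-update form with a single rate standing in for L_atn or L_ctn, and weights are represented as flat lists rather than network parameters.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s_k, a_k, r_k, s_{k+1}) transitions."""

    def __init__(self, capacity, seed=0):
        self._data = deque(maxlen=capacity)   # oldest transitions are dropped first
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        k = min(batch_size, len(self._data))
        return self._rng.sample(list(self._data), k)

    def __len__(self):
        return len(self._data)

def soft_update(target_weights, online_weights, rate):
    """target <- rate * online + (1 - rate) * target, element by element."""
    return [rate * w + (1.0 - rate) * t
            for t, w in zip(target_weights, online_weights)]
```

A small update rate makes the target networks track the online networks slowly, which is what stabilizes the temporal-difference targets during training.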

Numerical examples
In this study, the HiMAT vehicle is used to validate the proposed method. The three-view drawing and the trim conditions at the operating points can be obtained from Wang et al. (2015). The flight conditions and the model of the longitudinal motion dynamics follow Wang et al. (2015).
Based on the trim conditions within the flight envelope, the longitudinal motion dynamics can be described by switched systems. We set the sampling time T_s = 0.02 and obtain the system matrices A_i and B_i. The switching sequence of the subsystems in the flight envelope is assumed to be 19-18-12-9-8-4-2-1, as shown in Figure 2.
A harmonic wind gust is considered in this paper, as described in Eq. (83). Furthermore, a command filter was used to improve the performance of the intelligent tracking controller, where J(k) denotes the state vector; z(k) represents the output of the filter; ζ_n and ω_n are the damping ratio and bandwidth; and S_a and S_v denote the transfer functions of the amplitude-limiting and rate-limiting filters. The parameters of the switched systems are given as c_1 = 0, c_2 = 1.5, N_f = 25, ω = 5, and R = I. Compared with the conventional ADT method, tighter bounds in the FTS analysis can be obtained. The ADT method can be considered a special case of the MDADT method, and the resulting MDADT bounds are no larger than the ADT bound, as illustrated in Table 1. Therefore, the proposed method yields less conservative results than the ADT method. We set the probability of data missing to ρ = 0.95 and the maximum number of consecutive data missing to N_1 = 5. Moreover, the matrices U_1i, U_2i, S_1i, and S_2i can be solved from Eqs. (62) and (63) in Theorem 3. The dynamics-based controller was then constructed from the resulting parameter matrices.

Figure 2. Switching logic of HiMAT in the flight envelope.
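The command filter mentioned above can be illustrated with a discrete second-order filter that applies amplitude and rate limiting. This is a sketch under assumptions, since the paper's filter equations are not reproduced here: the function name, the semi-implicit Euler discretization, the zero initial state, and the placement of the two limiters are illustrative choices, with ω_n and ζ_n playing the roles of bandwidth and damping ratio.

```python
def command_filter(commands, omega_n, zeta_n, dt, mag_limit, rate_limit):
    """Second-order command filter with amplitude and rate limiting.

    Assumed structure: the raw command is clipped to +/- mag_limit, a
    second-order lag with natural frequency omega_n and damping zeta_n
    tracks it, and the filtered rate is clipped to +/- rate_limit.
    """
    z, z_dot = 0.0, 0.0
    out = []
    for r in commands:
        target = max(-mag_limit, min(mag_limit, r))          # amplitude limiting
        accel = omega_n ** 2 * (target - z) - 2.0 * zeta_n * omega_n * z_dot
        z_dot += dt * accel
        z_dot = max(-rate_limit, min(rate_limit, z_dot))     # rate limiting
        z += dt * z_dot
        out.append(z)
    return out

# Hypothetical step command sampled at T_s = 0.02.
smooth = command_filter([1.0] * 2000, omega_n=5.0, zeta_n=0.8, dt=0.02,
                        mag_limit=10.0, rate_limit=10.0)
```

Feeding the controller a filtered command instead of a raw step avoids demanding a response that the actuators cannot deliver, at the cost of a small tracking delay.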
Moreover, to overcome the problem of operating points with static instability, an angular rate compensator was introduced, where T_f(s) denotes the transfer function of the angular rate compensator, and t_q and k_q are the parameters of the compensator. Next, we present two examples to validate the proposed method.
Example 1: According to the data in Table 1, tighter bounds on the dwell time can be obtained by the proposed method. Moreover, because the characteristics of each subsystem are considered, better transient performance can be achieved by using the MDADT method. The switching of subsystems is displayed in Figure 2. Notably, the parameters of the flight vehicle switch at the switching instants. To compare the two switching logic mechanisms, the simulation results under ADT switching logic and MDADT switching logic are displayed in Figures 3, 4, in which the curves are labeled ADT and MDADT, respectively. Figures 3, 4 show the attack angle curves, which reflect the tracking performance of the switched controllers in the flight envelope under ADT switching logic and MDADT switching logic. The tracking error converges within the given time interval, and the transient performance of the MDADT method is superior. Moreover, Figures 3, 4 provide enlarged views of the simulation curves near the switching times and in the steady state. Switched controllers with MDADT logic achieve better transient performance than those with ADT logic, and the MDADT method yields a smoother response. The switched controllers with MDADT logic obtain excellent transient performance with tighter bounds on the dwell time, which is less conservative than the ADT logic.
Example 2: In this example, the feasibility of the ADDPG algorithm for flight vehicles is validated. The weights of the actor network and critic network are updated such that the learning-based controller adaptively compensates for the model uncertainties and external disturbances in the environment. The supplementary control action is added to the dynamics-based controller, which together constitute the real-time finite-time adaptive tracking control for the flight vehicles. The design parameters of the ADDPG algorithm are defined in Table 2.
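The way the supplementary action is added to the dynamics-based controller can be sketched as follows. The state-feedback form, the elementwise clipping of the learned gain compensation, and the names `composite_control`, `K_nominal`, `delta_K`, and `delta_max` are assumptions for illustration only; the controller structure given in the paper is not reproduced in this text.

```python
import numpy as np


def composite_control(K_nominal, delta_K, delta_max, x):
    """Nominal state feedback plus a bounded, learned gain compensation."""
    # Keep the learned compensation inside the admissible range so that it
    # can be treated as a bounded controller-parameter uncertainty.
    delta_K_bounded = np.clip(delta_K, -delta_max, delta_max)
    u_n = K_nominal @ x          # dynamics-based part
    u_c = delta_K_bounded @ x    # learning-based supplementary part
    return u_n + u_c
```

Bounding the compensation is what allows a nonfragile design to cover every value the learning agent may output.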

Switching logic | Parameter | Result
MDADT | $a_1 = 0.22$, $a_2 = 0.24$, $a_4 = 0.23$, $a_8 = 0.19$, $a_9 = 0.31$, $a_{12} = 0.26$, $a_{18} = 0.27$ |

The input to the critic networks is split into two paths, corresponding to the observation and the action. The number of neurons in the input layer of the observation path is the dimension of the observed states, which is represented by obs. The number of neurons in the input layer of the action path equals the number of controller parameters. The critic networks are updated based on the adaptive moment estimation (Adam) algorithm. The regularization factor is set to $2 \times 10^{-4}$. The input of the actor network is the observed states, and its output is the compensated controller parameters. The activation function of the fully connected layers is ReLU, and the activation function of the output layer is tanh. The weights of the actor network are updated based on the Adam algorithm. The variance of the noise is set to 0.1, and the variance decay rate is $1 \times 10^{-5}$. Because the stability and robustness of the closed-loop system are guaranteed by the switched control theory and robust control theory, the wind gust is considered in the training environment, whereas the perturbations of aerodynamic parameters and the wind gust are introduced in the testing environment. The algorithms are implemented on a desktop computer with an Intel Core i7-10700K CPU @ 3.80 GHz, 16.00 GB of RAM, and the Windows 10 operating system.
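The network layout described above can be sketched with plain NumPy forward passes. The hidden width of 32, the random initialization, the additive merge of the two critic paths, and the multiplicative noise-decay rule in `decayed_variance` are assumptions for illustration; only the ReLU and tanh activations, the two-path critic input, the noise variance of 0.1, and the decay rate of $1 \times 10^{-5}$ come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    # Small random weights and zero biases (illustrative initialization).
    return rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out)


def relu(v):
    return np.maximum(v, 0.0)


class Actor:
    """Observed states -> compensated controller parameters in [-1, 1]."""

    def __init__(self, obs_dim, act_dim, hidden=32):
        self.W1, self.b1 = init_layer(obs_dim, hidden)
        self.W2, self.b2 = init_layer(hidden, act_dim)

    def __call__(self, obs):
        h = relu(self.W1 @ obs + self.b1)        # fully connected + ReLU
        return np.tanh(self.W2 @ h + self.b2)    # tanh output layer


class Critic:
    """Two input paths (observation, action) merged into one Q-value."""

    def __init__(self, obs_dim, act_dim, hidden=32):
        self.Wo, self.bo = init_layer(obs_dim, hidden)   # observation path
        self.Wa, self.ba = init_layer(act_dim, hidden)   # action path
        self.Wq, self.bq = init_layer(hidden, 1)

    def __call__(self, obs, act):
        h = relu(self.Wo @ obs + self.bo + self.Wa @ act + self.ba)
        return (self.Wq @ h + self.bq)[0]


def decayed_variance(variance, decay_rate, steps):
    # Assumed rule: variance shrinks by a factor (1 - decay_rate) per step.
    return variance * (1.0 - decay_rate) ** steps
```

The tanh output keeps the actor's compensation bounded, so it can be scaled to the admissible variation range of the controller parameters.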
The DDPG algorithm was also simulated to verify the advantages of the proposed method in terms of control performance and convergence. The robust controller obtained by the MDADT method was used as the dynamics-based controller. Both the ADDPG and DDPG algorithms were used in the simulation as the learning-based controller to compensate for unexpected uncertainties in the flight environment. The simulation results are displayed in Figures 5-9, in which the MDADT method, MDADT with DDPG method, and MDADT with ADDPG method are labeled as MDADT, DDPG, and ADDPG, respectively. As displayed in Figures 5 and 6, the episode reward of the ADDPG algorithm converged better than that of the DDPG algorithm and required fewer episodes to converge to a neighborhood of the origin. Therefore, the ADDPG algorithm outperformed the conventional DDPG algorithm in terms of control performance and steady-state error. The responses of the attack angle are displayed in Figure 7. Both the DDPG and ADDPG algorithms achieved convergence and efficient performance. However, the transient convergence of the ADDPG algorithm was superior to that of the DDPG algorithm. The tracking errors are displayed in Figure 8. The controllers compensated with the DDPG and ADDPG algorithms exhibit improved steady-state response. However, the steady-state error of the ADDPG algorithm was less than that of the DDPG algorithm.
The reward function of an episode is displayed in Figure 9. The ADDPG algorithm can achieve superior final performance.
The average tracking errors of the methods are presented in Table 3. Online scheduling through DDPG and ADDPG can efficiently reduce the average tracking error, and the adaptive reward function can further improve the tracking performance. The proposed method can overcome the undesirable response caused by asynchronous switching and uncertainties in the flight environment.
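An average tracking error of the kind reported in Table 3 could be computed as in the sketch below. The mean-absolute-error definition and the function name `average_tracking_error` are assumptions for illustration; the exact metric used in the paper is not stated in this text.

```python
import numpy as np


def average_tracking_error(reference, response):
    """Mean absolute difference between the reference and the response."""
    reference = np.asarray(reference, dtype=float)
    response = np.asarray(response, dtype=float)
    return float(np.mean(np.abs(reference - response)))
```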
Moreover, to show the effectiveness of the proposed method in dealing with system uncertainties and disturbances, we give the simulation results of the HiMAT vehicle with disturbances and uncertainties in the aerodynamic parameters, which also illustrate the potential for application in practical environments. The results are shown in Figures 10 and 11, in which we consider aerodynamic parameter perturbations of 10, 15, and 20%. The responses of the attack angle are given in Figure 10, and the tracking errors are given in Figure 11. The average tracking errors in the presence of aerodynamic perturbations are also provided in Table 4. Stability and tracking performance are maintained under uncertainties and disturbances by using the proposed method, which illustrates that the proposed method can ensure control accuracy, stability, and robustness simultaneously.
Remark 4: We draw inspiration from the traditional approach to the sim-to-real transfer issue. First, the nonlinear model is converted to a linear model by Jacobian linearization. The nominal controller can then be designed at the reference points. In most engineering applications, a stability margin is introduced and analyzed to ensure robustness. Similarly, in this paper, we developed finite-time robust control theory to ensure stability and attenuation performance, so that the uncertainties and disturbances in a practical environment can be overcome. However, it is difficult to realize an optimal compromise between robustness and transient performance. The ADDPG algorithm is therefore used to improve control accuracy. Moreover, nonfragile control theory is introduced, which ensures stability and the prescribed attenuation performance on the scheduling intervals.

Remark 5: The problem of finite-time tracking for switched flight vehicles was investigated. The numerical examples demonstrate the advantages of the proposed control method over existing control methods for flight vehicles subject to disturbances and uncertainties, which can be described as follows: (1) Unlike conventional model-based control methods, the proposed method was developed by using DRL, which can improve control performance and overcome the undesirable response caused by uncertainties. (2) In the proposed method, the advantages of model-based and model-free methods are combined. The dynamics-based controller was developed to ensure stability and robustness, and the learning-based controller was proposed to compensate for the uncertainties in the flight environment. (3) The established adaptive generalized reward function can improve convergence and robustness.

Conclusion
The control of switched flight vehicles with asynchronous switching was realized using a novel nonfragile DRL method. The flight vehicles were modeled as a switched system, and the asynchronous switching caused by packet dropouts was considered. The MDADT and MLF methods were used to ensure FTS and a weighted prescribed attenuation index. LMIs were used to determine the solutions of the finite-time tracking controller. To compensate for external disturbances and improve tracking performance, the ADDPG algorithm based on the actor-critic framework was provided to optimize the parameters of the tracking controllers. To improve optimization efficiency and decrease computational complexity, parameter optimization was restricted to a given range. The compensation of the control policy within this range is treated as uncertainty in the controller parameters, and the FTS is ensured by nonfragile control theory. Compared with the conventional DDPG algorithm, adaptive hyperparameters of the reward function were introduced to achieve superior control performance and cover a more general case. The FTS, robustness, and transient performance were ensured simultaneously by the proposed method. In the future, the following four points should be studied: (1) The event-triggered control structure should be considered to reduce the load and improve the robustness of information transmission. (2) Parallel optimization methods should be developed to improve training efficiency. (3) The fitting ability and generalization ability of neural networks should be studied to improve robustness in complex environments. (4) Semi-physical simulations and flight tests of mini drones should be conducted to further demonstrate the engineering feasibility of the proposed method.
Remark 1: The model of the flight vehicle can be given based on switched systems. The variation of states in the envelope can be viewed as switching between subsystems. The tracking controller is composed of two parts: the dynamics-based controller $u_n(k)$, which is developed based on finite-time $H_\infty$ control to ensure stability and a prescribed attenuation index, and the learning-based controller $u_c(k)$, which is based on

FIGURE 1 Structure of the controller.
If positive numbers N i 0 and τ ai , exist such that ), (30) by matrices diag i i i S

Algorithm 1. Parameter optimization based on ADDPG
1. Set the variation range of controller parameters.
2. Design the switched tracking controllers for flight vehicles based on Theorem 3.
5. Initialize the replay buffer, episode = 0.
6. for episode = 1 to M do
7. Randomly initialize exploration noise $N_a$.
8. Randomly initialize the state vector of the agent with $s_1$, then the initial observation can be obtained.
9. for t = 1 to K do
10. environment based on the state $s_k$ and uncertain noise.
11. Receive the adaptive reward $r_k$ and the state of the next time instant $s_{k+1}$.
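The recoverable steps of Algorithm 1 (buffer initialization, the episode and time loops, exploration noise, and collection of reward and next state) can be sketched as below. The scalar plant in `toy_env_step`, the reward `-abs(s_next)`, the Gaussian exploration noise, and every name in the block are hypothetical stand-ins; the actor and critic updates and the steps missing from the listing above are deliberately left out.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.data), min(batch_size, len(self.data)))

    def __len__(self):
        return len(self.data)


def toy_env_step(s, a, rng):
    """Hypothetical scalar plant with uncertain noise (not the HiMAT model)."""
    s_next = 0.9 * s + 0.1 * a + 0.01 * rng.standard_normal()
    reward = -abs(s_next)
    return reward, s_next


def run_episodes(policy, M, K, noise_std, seed=0):
    rng = np.random.default_rng(seed)
    buffer = ReplayBuffer(capacity=10_000)           # step 5
    for episode in range(1, M + 1):                  # step 6
        s = rng.uniform(-1.0, 1.0)                   # step 8: random initial state
        for t in range(1, K + 1):                    # step 9
            # Step 7: exploration noise added to the policy output.
            a = policy(s) + noise_std * rng.standard_normal()
            r, s_next = toy_env_step(s, a, rng)      # steps 10 and 11
            buffer.store(s, a, r, s_next)
            s = s_next
            # Actor and critic updates from buffer.sample(...) would go here.
    return buffer
```

For example, `run_episodes(lambda s: -s, M=3, K=5, noise_std=0.1)` fills the buffer with 15 transitions.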
( ) represents the state of external disturbance with initial value of 0 01 0 .;

FIGURE 2

FIGURE 3 Response of the attack angle.

FIGURE 5 Episode reward of the ADDPG.

FIGURE 6 Episode reward of the DDPG.

FIGURE 7 Response of the attack angle.

FIGURE 9 Response of reward function.

TABLE 2
Parameter settings of the ADDPG.

TABLE 1
Dwell time of various switching logics.
TABLE 3
Average tracking errors.

TABLE 4
Average tracking errors in the presence of aerodynamic perturbations.