Feedback stabilization of probabilistic finite state machines based on deep Q-network

Background As an important mathematical model, the finite state machine (FSM) has been used in many fields, such as manufacturing systems, health care, and so on. This paper analyzes the current state of research on FSMs and points out that traditional methods are often inconvenient for analysis and design, or encounter high computational complexity, when studying FSMs. Method The deep Q-network (DQN) technique, which is a model-free optimization method, is introduced to solve the stabilization problem of probabilistic finite state machines (PFSMs). To better understand the technique, some preliminaries, including the Markov decision process, the ϵ-greedy strategy, and the DQN, are recalled. Results First, a necessary and sufficient stabilizability condition for PFSMs is derived. Next, the feedback stabilization problem of PFSMs is transformed into an optimization problem. Finally, by using the stabilizability condition and the deep Q-network, an algorithm for solving the optimization problem (equivalently, computing a state feedback stabilizer) is provided. Discussion Compared with traditional Q-learning, DQN avoids the limited-capacity problem of Q-tables, so our method can handle high-dimensional complex systems efficiently. The effectiveness of the method is further demonstrated through an illustrative example.


Introduction
The finite state machine (FSM), also known as the finite automaton (Yan et al., 2015b), is an important mathematical model, which has been used in many different fields, such as manufacturing systems (Wang et al., 2017; Piccinini et al., 2018), health care (Shah et al., 2017; Zhang, 2018; Fadhil et al., 2019), and so on. The deterministic finite state machine (DFSM) is known for its deterministic behavior, in which each subsequent state is uniquely determined by its input event and preceding state (Vayadande et al., 2022). However, DFSMs may not be effective in dealing with random behaviors (Ratsaby, 2019), for example, the randomness caused by component failures in sequential circuits (El-Maleh and Al-Qahtani, 2014). To address this challenge, the probabilistic finite state machine (PFSM) was proposed in the study by Vidal et al. (2005), which provides a more flexible framework for systems that exhibit random behaviors. In particular, it gives an effective solution to practical issues such as the reliability assessment of sequential circuits (Li and Tan, 2019). Therefore, the PFSM offers a new perspective for the theoretical research of FSMs.
On the other hand, the stabilization of systems is an important and fundamental research topic, and there have been many excellent research results in various fields, for example, Boolean control networks (Tian et al., 2017; Tian and Hou, 2019), time-delay systems (Tian and Wang, 2020), neural networks (Ding et al., 2019), and so on.
The stabilization research of FSMs is no exception and has also attracted the attention of many scholars. The concepts of stability and stabilization of discrete event systems described by FSMs were given in the study by Özveren et al. (1991), where a polynomial solution for stability detection and a method for constructing stabilizers were presented. Passino et al. (1994) utilized the Lyapunov method to study the stability and stabilization of FSMs. Tarraf et al. (2008) proposed some new concepts, including gain stability, incremental stability, and external stability, and then established a research framework for the robust stability of FSMs. Kobayashi et al. developed a linear state equation representation for modeling DFSMs in the studies by Kobayashi (2006) and Kobayashi and Imura (2007) and derived a necessary and sufficient condition for a DFSM to be stabilizable at a target equilibrium node in the study by Kobayashi et al. (2011).
However, as we know, FSMs are generally non-linear. Moreover, none of the above methods are convenient for analyzing and designing various FSMs. In the last decade, scholars applied the semi-tensor product (STP) of matrices to FSMs and derived many excellent results. First, with the help of the STP, an algebraic form of DFSMs was given in the study by Xu et al. (2013). This algebraic form is a discrete-time bilinear equation, so classic control theory can be used to investigate FSMs. In particular, under the algebraic form, necessary and sufficient conditions for the stabilizability of DFSMs were derived in the study by Xu et al. (2013), and a state feedback controller was obtained by solving a corresponding matrix inequality. Moreover, Yan et al. (2015a) provided a necessary and sufficient condition to check whether a set of states can be stabilized. Han and Chen (2018) considered the set stabilization of DFSMs and provided an optimal design approach for stabilizing controllers. Later, Zhang et al. used the STP method to investigate PFSMs and non-deterministic FSMs. Specifically, a necessary and sufficient condition for stabilization with probability one and a design method for an optimal state feedback controller were provided in the study by Zhang et al. (2020a). Moreover, a systematic procedure was designed to obtain a static output feedback stabilizer for non-deterministic FSMs in the study by Zhang et al. (2020b). Although the STP method is very useful for analyzing discrete event systems, including various FSMs, it suffers from high computational complexity and can only handle small-scale or even micro-scale discrete event systems. To address this problem, this study draws on techniques developed by Acernese et al. (2020) to solve the stabilization problem of high-dimensional PFSMs and provides a reinforcement learning algorithm to compute a state feedback stabilizer for PFSMs. The algorithm is especially advantageous in dealing with high-dimensional systems.
The rest of this study is arranged as follows: Section 2 introduces some preliminary knowledge, including the PFSM, the Markov decision process (MDP), the deep Q-network (DQN), and the ϵ-greedy strategy. In Section 3, a stabilizability condition is derived and an algorithm based on DQN is provided. An illustrative example is employed to show the effectiveness of our results in Section 4, which is followed by a brief conclusion in Section 5.

Methods
For the convenience of statement, some symbol explanations are provided first.

. Probabilistic finite state machine

The state transition function f : X × U → 2^X describes that PFSM (1) may reach different states from one state under the same input event, where X is the state set, U is the input event set, and 2^X is the power set of X.

. Markov decision process and optimization methods
A Markov decision process (MDP) is characterized by a quintuple

(S, A, P, R, γ), (2)

where S is a set of states, A is a set of actions, P is a state transition probability function, R is a reward function, and γ ∈ [0, 1] is a discount factor that determines the trade-off between short-term and long-term gains. MDP (2) may reach state s_{t+1} from state s_t ∈ S under the chosen action a_t ∈ A, and its probability is determined by the function P^{a_t}_{s_t, s_{t+1}} = P(s_{t+1} | s_t, a_t). The expected one-step reward from state s_t via action a_t is

R^{a_t}_{s_t} = E[r_{t+1} | s_t, a_t],

where r_{t+1} = r_{t+1}(s_t, a_t, s_{t+1}) represents the immediate return after adopting action a_t at time t. The objective of MDP (2) is to determine an optimal policy π*. This policy maximizes the expected return E_π[G_t] under policy π, where

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}.

For a given policy π, the value function of a state s_t, denoted by v_π(s_t), is the expected return of MDP (2) when actions are taken according to π from time step t onward:

v_π(s_t) = E_π[G_t | s_t]. (3)

Frontiers in Computational Neuroscience frontiersin.org Tian et al.
The optimal policy is

π* = arg max_{π ∈ Π} v_π(s_t), (4)

where Π is the set of all admissible policies. From (4), it is easy to see that v*(s_t) = v_{π*}(s_t). Since v_π(·) satisfies the Bellman equation, we have

v_π(s_t) = Σ_{a_t} π(a_t | s_t) Σ_{s_{t+1}} P^{a_t}_{s_t, s_{t+1}} [r_{t+1} + γ v_π(s_{t+1})]. (5)

Similarly, the action-value function describes the cumulative return from the state-action pair (s_t, a_t) under policy π:

q_π(s_t, a_t) = E_π[G_t | s_t, a_t]. (6)

By substituting (3) into (6), we can obtain

q_π(s_t, a_t) = Σ_{s_{t+1}} P^{a_t}_{s_t, s_{t+1}} [r_{t+1} + γ v_π(s_{t+1})], (7)

which represents the expected return of action a_t adopted by MDP (2) at state s_t, following policy π thereafter. The action-value function under the optimal policy π* is called the optimal action-value function, i.e., q*(s_t, a_t) := q_{π*}(s_t, a_t), ∀s_t ∈ S, ∀a_t ∈ A. Since v*(s_t) = max_a q*(s_t, a), from (5), we can get

q*(s_t, a_t) = Σ_{s_{t+1}} P^{a_t}_{s_t, s_{t+1}} [r_{t+1} + γ max_a q*(s_{t+1}, a)].

Therefore, if MDP (2) admits an optimal deterministic policy, it can be expressed as

π*(s_t) = arg max_{a ∈ A} q*(s_t, a).

DQN is a technique that combines Q-learning with artificial neural networks (ANNs), providing an effective approach to decision-making problems in dynamic and uncertain environments. It uses ANNs to construct parametric models and estimate action-value functions online. Compared with Q-learning, the main advantages of DQN are as follows: (1) DQN uses ANNs to approximate Q functions, overcoming the limited capacity of Q-tables and enabling the algorithm to handle high-dimensional state spaces. (2) DQN makes full use of empirical knowledge through experience replay.
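To make the Bellman backups above concrete, here is a minimal value-iteration sketch on a toy two-state MDP; the states, transition probabilities, and rewards are illustrative assumptions, not taken from this paper.

```python
# Value iteration on a toy MDP: states {0, 1}, actions {0, 1}.
# State 1 is absorbing with zero reward; from state 0, action 1 reaches
# state 1 with probability 0.9 (reward 1) and stays at 0 otherwise.
# All quantities here are illustrative assumptions.

GAMMA = 0.9

# P[s][a] = list of (next_state, probability, reward) triples.
P = {
    0: {0: [(0, 1.0, 0.0)],
        1: [(1, 0.9, 1.0), (0, 0.1, 0.0)]},
    1: {0: [(1, 1.0, 0.0)],
        1: [(1, 1.0, 0.0)]},
}

q = {s: {a: 0.0 for a in P[s]} for s in P}
for _ in range(200):  # repeated Bellman optimality backups
    for s in P:
        for a in P[s]:
            q[s][a] = sum(p * (r + GAMMA * max(q[s2].values()))
                          for s2, p, r in P[s][a])

v0 = max(q[0].values())            # optimal value v*(0)
policy0 = max(q[0], key=q[0].get)  # greedy action in state 0
```

Since the backup is a γ-contraction, 200 sweeps are more than enough for the fixed point v*(0) = 0.9/(1 − 0.09) ≈ 0.989, and the greedy action in state 0 is the one that moves toward the rewarding state.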
Q-learning updates the value function according to the following temporal difference (TD) formula:

q(s_t, a_t) ← q(s_t, a_t) + α [r_{t+1} + γ max_a q(s_{t+1}, a) − q(s_t, a_t)],

where r_{t+1} + γ max_a q(s_{t+1}, a) − q(s_t, a_t) is the TD error δ, and 0 < α ≤ 1 is a constant that determines how quickly past experiences are forgotten.
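The TD update can be sketched in a few lines of plain Python; the three-state environment, learning rate, and number of updates below are illustrative assumptions.

```python
import random

random.seed(0)
ALPHA, GAMMA = 0.3, 0.9
N_STATES, N_ACTIONS = 3, 2

# Q-table initialised to zero.
q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(s, a):
    """Hypothetical environment: action 1 moves right with prob 0.9,
    action 0 stays; landing in the last state yields reward 1."""
    s2 = min(s + 1, N_STATES - 1) if (a == 1 and random.random() < 0.9) else s
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

for _ in range(3000):
    s = random.randrange(N_STATES - 1)        # sample a non-terminal state
    a = random.randrange(N_ACTIONS)           # pure exploration for the sketch
    s2, r = step(s, a)
    delta = r + GAMMA * max(q[s2]) - q[s][a]  # TD error δ
    q[s][a] += ALPHA * delta                  # temporal difference update
```

After enough samples of every state-action pair, the table ranks "move right" above "stay" in the non-terminal states.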
When dealing with high-dimensional complex systems, the action-value function q(s, a), as described in Equation (7), is approximated by an ANN to reduce computational complexity. This can be achieved by minimizing the following loss function:

L(θ_t) = E[(r_{t+1} + γ max_a q(s_{t+1}, a; θ_t^−) − q(s_t, a_t; θ_t))^2], (8)

where the parameter θ_t^− is a periodic copy of the current network parameter θ_t.
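To see what minimizing this loss involves, the following sketch evaluates it for a linear Q-function q(s, a; θ) = θ·φ(s, a) with a hypothetical one-hot feature map and a single sampled transition, and checks the analytic gradient of the squared TD error against a finite-difference estimate; all names and values are illustrative assumptions.

```python
# Gradient of the DQN loss for a linear Q-function q(s, a; θ) = θ·φ(s, a).
GAMMA = 0.9

def phi(s, a):
    """Hypothetical one-hot feature map over 2 states × 2 actions."""
    f = [0.0] * 4
    f[2 * s + a] = 1.0
    return f

def q(s, a, theta):
    return sum(t * f for t, f in zip(theta, phi(s, a)))

def loss(theta, theta_minus, s, a, r, s2):
    target = r + GAMMA * max(q(s2, b, theta_minus) for b in (0, 1))
    return (target - q(s, a, theta)) ** 2

def grad(theta, theta_minus, s, a, r, s2):
    # ∇θ L = -2 (target - q(s,a;θ)) ∇θ q(s,a;θ); the target uses the frozen
    # copy θ⁻ and therefore contributes no gradient of its own.
    target = r + GAMMA * max(q(s2, b, theta_minus) for b in (0, 1))
    delta = target - q(s, a, theta)
    return [-2.0 * delta * f for f in phi(s, a)]

theta = [0.1, -0.2, 0.3, 0.0]
theta_minus = list(theta)        # periodic copy of the online parameters
transition = (0, 1, 1.0, 1)      # (s, a, r, s') sampled from the buffer

g = grad(theta, theta_minus, *transition)

# Finite-difference check of each gradient component.
eps = 1e-6
g_num = []
for i in range(4):
    tp = list(theta); tp[i] += eps
    tm = list(theta); tm[i] -= eps
    g_num.append((loss(tp, theta_minus, *transition)
                  - loss(tm, theta_minus, *transition)) / (2 * eps))
```

Because the loss is quadratic in θ for a linear Q-function, the central difference matches the analytic gradient almost exactly.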
By differentiating Equation (8), we have

∇_{θ_t} L(θ_t) = −2 E[(r_{t+1} + γ max_a q(s_{t+1}, a; θ_t^−) − q(s_t, a_t; θ_t)) ∇_{θ_t} q(s_t, a_t; θ_t)], (9)

where ∇_{θ_t} q(s_t, a_t; θ_t) represents the gradient of q(s_t, a_t; θ_t) with respect to the parameter θ_t. We choose the gradient descent method as the optimization strategy:

θ_{t+1} = θ_t − (α/2) ∇_{θ_t} L(θ_t). (10)

By substituting Equation (9) into (10), we obtain an update formula for parameter θ_t:

θ_{t+1} = θ_t + α (r_{t+1} + γ max_a q(s_{t+1}, a; θ_t^−) − q(s_t, a_t; θ_t)) ∇_{θ_t} q(s_t, a_t; θ_t).

Finally, the ϵ-greedy strategy is used for action selection. Specifically, an action is chosen randomly with probability ϵ ∈ (0, 1], and the best estimated action is chosen with probability 1 − ϵ. As learning progresses, ϵ gradually decreases, and the policy shifts from exploring the action space to exploiting the learned Q values.
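The ϵ-greedy rule with a decaying ϵ can be sketched as follows; the decay schedule and the Q-values of the single state are illustrative assumptions.

```python
import random

random.seed(1)

def epsilon_greedy(q_row, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

# Illustrative decay from 1.0 down to a floor of 0.05.
eps, eps_min, decay = 1.0, 0.05, 0.995
q_row = [0.2, 0.8, 0.1]   # hypothetical Q-values for one state

counts = [0, 0, 0]
for _ in range(5000):
    counts[epsilon_greedy(q_row, eps)] += 1
    eps = max(eps_min, eps * decay)
```

Early on the three actions are picked roughly uniformly; once ϵ reaches its floor, the greedy action (index 1 here) dominates, which is exactly the explore-then-exploit shift described above.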

Results
We first give a definition.

Definition 1: Assume that X_e is an equilibrium state of PFSM (1). The PFSM is said to be feedback stabilizable to X_e with probability one if, for any initial state X_i ∈ X, there exists a control sequence that drives the PFSM from X_i to X_e in a finite number of steps with probability one.

We define the attraction domain ℜ_k(X_e) of an equilibrium state X_e as the set of states that can reach X_e in k steps.
Next, we give an important result.

Theorem 1: Assume that X_e is an equilibrium state of PFSM (1). The PFSM is feedback stabilizable to X_e with probability one if and only if there exists an integer ρ ≤ n − 1 such that

ℜ_ρ(X_e) = X. (12)

Proof. (Necessity): Assume that PFSM (1) is feedback stabilizable to X_e with probability one. By Definition 1, for any initial state X_i, there exists a control sequence that drives X_i to X_e with probability one, so X_i belongs to some attraction domain ℜ_k(X_e). Due to the fact that the state space is a finite set and the attraction domains are nested, there must be an integer ρ such that ℜ_ρ(X_e) = X holds.
(Sufficiency): Assume that Equation (12) holds. For any initial state X_i ∈ X, we have X_i ∈ ℜ_ρ(X_e). From Equation (11), there exists a positive integer ρ and a control sequence U := U_{l_1}, U_{l_2}, …, U_{l_ρ} such that X_i can be driven to X_e by U in ρ steps with probability one. According to Definition 1, PFSM (1) is feedback stabilizable to X_e with probability one.
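The condition of Theorem 1 can be checked by backward iteration on the attraction domains. The sketch below assumes a hypothetical four-state PFSM described only by the supports of its transition probabilities, and interprets "reach X_e in one step with probability one" as: some input sends every possible successor into the current domain.

```python
# Backward computation of attraction domains ℜ_k(X_e) for a hypothetical
# PFSM with states {0, 1, 2, 3} and inputs {0, 1}.  f[x][u] is the set of
# states reachable from x under input u (the support of the transition
# probabilities; the exact probabilities do not matter for Theorem 1).
f = {
    0: {0: {0}, 1: {1}},
    1: {0: {1, 2}, 1: {2}},
    2: {0: {3}, 1: {2}},
    3: {0: {3}, 1: {3}},       # X_e = 3 is an equilibrium state
}

def stabilizable(f, x_e):
    """Return the smallest ρ with ℜ_ρ(x_e) = X, or None if none exists."""
    states = set(f)
    domain = {x_e}             # ℜ_0(X_e)
    rho = 0
    while domain != states:
        # x joins the domain if some input forces all successors inside it.
        new = domain | {x for x in states
                        if any(f[x][u] <= domain for u in f[x])}
        if new == domain:      # fixed point short of X: not stabilizable
            return None
        domain, rho = new, rho + 1
    return rho                 # each step adds a state, so ρ ≤ n - 1

rho_star = stabilizable(f, 3)  # smallest ρ with ℜ_ρ(X_3) = X
```

For this toy machine the domains grow as {3} → {2, 3} → {1, 2, 3} → X, so ρ = 3 = n − 1, matching the bound in Theorem 1; removing every path out of state 0 makes the iteration stall and the function reports non-stabilizability.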
We cast the feedback control problem of PFSM (1) into a model-free reinforcement learning framework. The main aim is to find a state feedback controller that guarantees the finite-time stabilization of PFSM (1), meaning that all states can be driven to an equilibrium state within finitely many steps. Therefore, PFSM (1) is rewritten as the MDP (X, U, P, R, γ), where P is the state transition probability function.

FIGURE
Evolution of the closed-loop system ( ).

FIGURE
The number of steps required to stabilize PFSM ( ) to X .
The objective of Equation (13) is to find an action U that maximizes the action-value function q* among all possible actions in U. Therefore, for any state X_t and target equilibrium state X_e, the optimal state feedback control law of PFSM (1) is

U*_t = arg max_{U ∈ U} q*(X_t, U).

Based on the above discussion, we are ready to introduce an algorithm to design an optimal feedback controller (see Algorithm 1). It should be noted that, in this algorithm, DQN uses two ANNs. The structure diagram of DQN is shown in Figure 1.
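Algorithm 1 is not reproduced in this excerpt, so the following is only a minimal sketch of the kind of training loop it describes: experience replay, a periodically synchronized target copy, and ϵ-greedy exploration. A Q-table stands in for the two ANNs so the sketch stays dependency-free, and the three-state PFSM, rewards, and hyperparameters are illustrative assumptions, not the paper's example.

```python
import random

random.seed(2)
GAMMA, ALPHA, EPS_MIN = 0.9, 0.3, 0.05
N, M, TARGET_SYNC = 3, 32, 50           # states, mini-batch size, sync period
X_E = 2                                 # target equilibrium state

def step(x, u):
    """Hypothetical PFSM: input 1 advances toward X_E with prob 0.9."""
    x2 = min(x + 1, N - 1) if (u == 1 and random.random() < 0.9) else x
    return x2, (1.0 if x2 == X_E else -0.1)   # reward shaping toward X_E

q = [[0.0, 0.0] for _ in range(N)]      # online "network"
q_target = [row[:] for row in q]        # periodic frozen copy (θ⁻)
buffer, eps, t = [], 1.0, 0

for _ in range(400):                    # training episodes
    x = random.randrange(N)
    for _ in range(20):
        u = (random.randrange(2) if random.random() < eps
             else max((0, 1), key=lambda a: q[x][a]))
        x2, r = step(x, u)
        done = x2 == X_E
        buffer.append((x, u, r, x2, done))
        buffer = buffer[-10_000:]       # bounded replay memory B
        if len(buffer) >= M:            # mini-batch TD updates from replay
            for s, a, rw, s2, d in random.sample(buffer, M):
                target = rw if d else rw + GAMMA * max(q_target[s2])
                q[s][a] += ALPHA * (target - q[s][a])
        t += 1
        if t % TARGET_SYNC == 0:        # θ⁻ ← θ
            q_target = [row[:] for row in q]
        eps = max(EPS_MIN, eps * 0.999)
        x = x2
        if done:
            break

# Read off the greedy state feedback law µ*(X_i) = arg max_U q*(X_i, U).
policy = [max((0, 1), key=lambda a: q[x][a]) for x in range(N)]
```

The learned greedy policy chooses the advancing input in every non-equilibrium state, i.e., it is exactly a state feedback stabilizer for this toy machine; replacing the table with two ANNs recovers the DQN structure described above.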
Remark 1: This algorithm is mainly intended for the stabilization problem of high-dimensional PFSMs. For small- or micro-scale PFSMs, it is unnecessarily complex; in that case, the STP method can be chosen instead. Therefore, Algorithm 1 and the STP method complement each other.
According to the results calculated by Algorithm 1, a state feedback controller can be given. Specifically, the output of Algorithm 1 is an optimal policy. Assume that µ*(X_i, X_e) is the calculated result. Then, we obtain a state feedback controller U_t = µ*(X_t, X_e).
We now use Algorithm 1 to compute a state feedback controller to stabilize PFSM (14) to X_3. The computation is performed on a computer with an Intel i5-11300H processor (2.6 GHz), 16 GB of RAM, and Python 3.7. We adopt the Keras API of TensorFlow to train the DQN model, where the discount factor γ is 0.99, the range for ϵ in the ϵ-greedy policy is from 0.05 to 1.0, and the sizes of the memory buffer B and mini-batch M are 10,000 and 128, respectively.
Through calculation, we obtain a state feedback controller, which is shown in Table 1. Model (14) is a PFSM with 20 states, which is not a simple system. Here, we utilize average rewards to track the performance during training (see Figure 2). It is easy to observe that, as training goes on, the performance increases and tends to be stable. We put the state feedback controller (15), as shown in Table 1, into PFSM (14) and obtain a closed-loop system.

Conclusion
This article studied the state feedback stabilization of PFSMs using the DQN method. The feedback stabilization problem of PFSMs was first transformed into an optimization problem. A DQN was built, whose two key parts, the TD target and the Q function, are approximated through neural networks. Then, based on the DQN and a stabilizability condition derived in this paper, an algorithm was developed. The algorithm can be used to solve the optimization problem mentioned above and hence the feedback stabilization problem of PFSMs. Since DQN avoids the limited-capacity problem of Q-learning, our algorithm can handle high-dimensional complex systems. Finally, an illustrative example was provided to show the effectiveness of our method.
Notation: R denotes the set of all real numbers. Z+ stands for the set of all positive integers. Z+_{a,b} denotes the set {a, a + 1, …, b}, where a, b ∈ Z+, a ≤ b. |A| is the cardinality of set A.
π(a | s) is the probability of MDP (2) selecting action a at state s. arg max_{a∈A} q(s, a) stands for the action with the highest estimated Q value at state s.

FIGURE
Performance of Algorithm 1 in the example.