Bi-level energy management strategy for power-split plug-in hybrid electric vehicles: A reinforcement learning approach for prediction and control

Yang, Xueping; Jiang, Chaoyu; Zhou, Ming; Hu, Hengjie

doi:10.3389/fenrg.2023.1153390

ORIGINAL RESEARCH article

Front. Energy Res., 16 March 2023

Sec. Energy Storage

Volume 11 - 2023 | https://doi.org/10.3389/fenrg.2023.1153390

Xueping Yang

Chaoyu Jiang

Ming Zhou

Hengjie Hu*

Yunnan Vocational College of Mechanical and Electrical Technology, Kunming, China

The energy management strategy plays a key role in improving the fuel economy of plug-in hybrid electric vehicles (PHEVs). In this article, a bi-level energy management strategy with a novel speed prediction method based on reinforcement learning is proposed to optimize the internal energy allocation of PHEVs. First, the powertrain transmission model of a power-split PHEV is analyzed in detail to obtain the energy routing and its crucial characteristics. Second, a Q-learning (QL) algorithm is applied to establish the speed predictor. Third, the double QL algorithm is introduced to train an effective controller offline that realizes the optimal power distribution. Finally, given a reference trajectory of the battery state of charge (SOC), a model predictive control framework solved by the reinforcement learning agent with the novel speed predictor is proposed to build the bi-level energy management strategy. The simulation results show that the proposed method achieves satisfactory fuel economy in different driving scenarios while tracking the corresponding SOC references. Moreover, the computational performance indicates that the proposed method has the potential for online application.

1 Introduction

The contradiction between energy shortage and the booming development of the automotive industry has become increasingly prominent in recent years. Vehicle electrification, which substitutes fossil fuel with cleaner electrical energy, has become a critical development trend in this field (Li et al., 2017). The plug-in hybrid electric vehicle (PHEV), widely regarded as a promising solution among new energy vehicles, balances driving range and energy saving. PHEVs generally contain two power sources: the electricity stored in batteries or super-capacitors (as the primary power source) and fuel (as the secondary power source). Therefore, PHEVs can coordinate the motor and engine according to their respective energy characteristics in complex driving conditions so as to avoid low operational efficiency that may lead to unnecessary energy consumption and emissions (Biswas and Emadi, 2019). However, the effective conversion between the two power sources is a time-varying and non-linear optimization problem, which makes it difficult to design a general energy management strategy (EMS) for PHEVs, and the precise control of the PHEV powertrain has become a focus of current academic research.

Currently, the EMSs of PHEVs can be divided into two types: rule-based EMSs and optimization-based EMSs (Han et al., 2020). Rule-based EMSs are generally built on engineering experience, with a series of preset control rules that realize the energy distribution of the power system. The charge-depleting/charge-sustaining (CD/CS) strategy is the most widely used rule-based EMS (Overington and Rajakaruna, 2015). Taking advantage of the large battery capacity, in the CD mode, the battery serves as the sole power source that drives the vehicle. When the battery's state of charge (SOC) drops to a certain threshold, the strategy switches to the CS mode, and the power for battery charging and vehicle driving is provided by the engine to ensure that the SOC stays near the threshold. However, an obvious downside of this strategy is that the fuel economy worsens as the driving mileage increases (Singh et al., 2021). Rule-based EMSs rely heavily on engineering experience and have difficulty adapting to various operating conditions while ensuring satisfactory fuel economy.

Optimization-based EMSs can be further classified into two categories: global optimization and instantaneous optimization. A global optimal EMS assumes that the global information about the working conditions is known in advance and then allocates the energy optimally among the power sources; it can be solved by common algorithms such as dynamic programming (DP) (Peng et al., 2017), Pontryagin’s minimum principle (PMP) (Chen et al., 2014), and game theory (GT) (Cheng et al., 2020). In Lei et al. (2020), DP is first applied to perform offline global optimization for a PHEV, and by combining the K-means clustering method, a hybrid strategy considering the driving conditions is proposed, which achieves a fuel economy similar to that of DP. In Sun et al. (2021), the characteristics of a bus running on a fixed road section are fully taken into account, and the authors propose a PMP algorithm that can be applied in real time to achieve near-optimal fuel economy. However, global optimization methods are generally too computationally intensive for online applications (Jeong et al., 2014). Thus, they are frequently employed as evaluation benchmarks for other methods or for extracting optimal control rules. Instantaneous optimization methods, such as the equivalent consumption minimization strategy (ECMS) (Zhang et al., 2020a; Chen et al., 2022a), model predictive control (MPC) (Guo et al., 2019; Ruan et al., 2022), and reinforcement learning (RL) (Chen et al., 2018; Zhang et al., 2020b), have become common approaches for online energy management. The MPC method can effectively deal with multivariate constrained problems with strong robustness and stability and has been widely employed in strongly non-linear control problems (He et al., 2021). In Quan et al. (2021), a speed-prediction MPC controller was developed in which an exponential smoothing rate was used to modify a Markov speed predictor. In Zhang et al. (2020c), Markov and back propagation (BP) neural networks were used for speed prediction, and an EMS combining vehicle speed prediction with the adaptive ECMS (AECMS) algorithm was presented, which improved fuel economy by 3.7% compared with the rule-based method. In Zhou et al. (2020a), fuzzy C-means clustering integrated with Markov-based prediction was used to regulate the battery's SOC rate under different conditions. In Guo et al. (2021a), a real-time predictive energy management strategy (PEMS) was proposed and formulated as a model predictive control problem, and numerical simulations showed that the PEMS performs well in fuel consumption minimization and battery aging restriction.

With the rapid development of artificial intelligence technology, RL has attracted much attention for its strong learning ability and real-time capability in tackling high-dimensional complex problems due to its unique learning behavior (Ganesh and Xu, 2022). Chen et al. (2020) proposed a stochastic MPC controller based on Markov’s speed prediction and Q-learning (QL) algorithm, which can achieve fuel economy similar to that of the stochastic DP (SDP) strategy. In Yang et al. (2021), considering the long-term nature of direct reinforcement learning training processes, an indirect learning EMS based on a higher-order Markov chain model was proposed.

Based on the abovementioned literature review, MPC and RL have been widely applied in the energy management of PHEVs. To the authors’ knowledge, EMSs designed with the MPC method generally rely on speed prediction methods such as the Markov chain (Zhou et al., 2020b), neural networks (Chen et al., 2022b), or a combination of the two (Lin et al., 2021; Liu et al., 2021), whereas the RL algorithm is rarely applied in designing the speed predictor. In addition, considering that RL yields more regular predictions and avoids the random error generated by the Markov and neural network methods in predicting vehicle speed, a bi-level EMS based on RL speed prediction is proposed in this study. The RL algorithm is adopted in the upper-layer controller to establish the speed predictor, and the double QL algorithm is used in the lower layer to perform rolling optimization. Numerical simulations are conducted to validate and evaluate the fuel economy of the proposed method, and its computational efficiency and applicability to different reference trajectories are further analyzed. The main contributions of this study are as follows: 1) the speed prediction problem is solved by the RL method and 2) an RL controller combining RL velocity prediction and RL rolling optimization is established, which provides effective support for the online application of machine learning methods on PHEVs.

The remainder of this article is organized as follows: Section 2 constructs the EMS objective function and analyzes the powertrain structure of the PHEV in detail by using a mathematical model. In Section 3, the speed predictor is established by QL, and its prediction accuracy is analyzed. In Section 4, the bi-level EMS framework is built, and the double QL controller is employed to carry out the rolling optimization process. Section 5 verifies the effectiveness, applicability, and practicality of the proposed method. Section 6 provides the conclusion of this study.

2 Modeling of PHEV

A power-split PHEV is taken as the research object in this study, and its prototype model is the Toyota Prius. The powertrain transmission configuration of the PHEV, shown in Figure 1, consists of an engine, a lithium-ion battery pack, two electric motors, a planetary gear power distribution unit, and two electrical energy converters. Specifically, the engine is connected to the planet carrier, motor 1 is connected to the ring gear, and motor 2 is connected to the sun gear. The planetary gear power distribution unit allows the engine, motors, and wheels to operate without interfering with each other so that the driving force of the whole vehicle is distributed reasonably through the power coupling relationship. The detailed vehicle structure parameters are listed in Table 1.

FIGURE 1. Powertrain transmission configuration of the PHEV.

TABLE 1. Main parameters of the power-split PHEV.

In this article, the main objective is to rationalize the energy transfer relationship between the engine and battery such that the total fuel consumption of the vehicle in a driving cycle is minimized. The cost function can be expressed as

J = \min F u e l_{t o t a l} = \min \int_{0}^{T} F u e l_{r a t e} d t, (1)

F u e l_{r a t e} = f (ω_{e n g}, T_{e n g}), (2)

where $F u e l_{t o t a l}$ denotes the total fuel consumption during the whole driving cycle, $F u e l_{r a t e}$ indicates the instantaneous fuel consumption, $T$ represents the total time of the whole driving cycle, $ω_{e n g}$ stands for the engine speed, and $T_{e n g}$ means the engine torque. In order to obtain the instantaneous fuel consumption of the PHEV, the energy demand model and energy flow relationship of the PHEV are analyzed.

In the study, the lateral dynamic effects of the vehicle are ignored, and the complex vehicle model is considered to be a simple quasi-static model. For a given driving condition, according to the longitudinal dynamics model of the vehicle, the power demand of the vehicle can be deduced as

P_{d r i v e} = (F_{f} + F_{w} + F_{i} + F_{j}) v, (3)

where $P_{drive}$ is the power demand of the vehicle; $F_{f}$, $F_{i}$, $F_{w}$, and $F_{j}$ denote the rolling resistance, grade resistance, air resistance, and acceleration resistance, respectively; and $v$ is the vehicle speed. The four resistances can be formulated as

\left\{ \begin{array}{l} F_{f} = m g f \cos α \\ F_{w} = C_{d} A v^{2} / 21.15 \\ F_{i} = m g \sin α \\ F_{j} = δ m a \end{array} \right. , (4)

where $m$ denotes the vehicle mass, $g$ is the gravitational acceleration, $f$ represents the rolling resistance coefficient, $α$ is the road slope angle, $C_{d}$ is the air resistance coefficient, $A$ stands for the frontal area of the vehicle, $δ$ refers to the rotating mass conversion factor of the vehicle, and $a$ is the acceleration.
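
As an illustration of Eqs 3, 4, the following minimal Python sketch evaluates the power demand for a given speed, acceleration, and slope. The function name and all default parameter values are placeholders for illustration and are not the entries of Table 1. The sketch also assumes that the constant 21.15 in the air-resistance term implies a speed in km/h, while the product in Eq. 3 uses the speed in m/s so that the result is in watts.

```python
import math

def power_demand(v_ms, a, alpha=0.0, m=1500.0, f=0.015, cd=0.3,
                 area=2.2, delta=1.05, g=9.81):
    """Power demand in W from Eqs 3-4 (all default parameters are hypothetical)."""
    v_kmh = 3.6 * v_ms                            # assumed unit for the 21.15 constant
    f_roll = m * g * f * math.cos(alpha)          # rolling resistance F_f
    f_air = cd * area * v_kmh ** 2 / 21.15        # air resistance F_w
    f_grade = m * g * math.sin(alpha)             # grade resistance F_i
    f_acc = delta * m * a                         # acceleration resistance F_j
    return (f_roll + f_air + f_grade + f_acc) * v_ms

p_cruise = power_demand(10.0, 0.0)    # steady driving at 10 m/s on a flat road
p_brake = power_demand(10.0, -2.0)    # deceleration gives a negative demand
```

A negative result corresponds to braking, where power is available for recuperation rather than demanded from the powertrain.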

According to the PHEV energy transfer model shown in Figure 1, the required power of the vehicle is provided by the engine and the battery, where the battery drives the two electric motors, and the demand power is expressed as

\left\{ \begin{array}{l} P_{drive} = P_{final} \cdot η_{final} \\ P_{final} = (P_{eng} + P_{ess}) \cdot η_{gear} \\ P_{ess} = (P_{mot1} / η_{mot1} + P_{mot2} / η_{mot2}) + P_{elec} \\ P_{ess} = ((ω_{mot1} \cdot T_{mot1}) / η_{mot1} + (ω_{mot2} \cdot T_{mot2}) / η_{mot2}) + P_{elec} \end{array} \right. , (5)

where $P_{f i n a l}$ , $P_{e n g}$ , $P_{e s s}$ , $P_{m o t 1}$ , $P_{m o t 2}$ , and $P_{e l e c}$ represent the power of the main gearbox, engine, battery, motor 1, motor 2, and electrical accessories, respectively. $η_{f i n a l}$ , $η_{g e a r}$ , $η_{m o t 1}$ , and $η_{m o t 2}$ mean the transmission efficiency of the main reducer, transmission unit, motor 1, and motor 2, respectively. $ω_{m o t 1}$ and $ω_{m o t 2}$ denote the speed of motor 1 and motor 2, respectively. $T_{m o t 1}$ and $T_{m o t 2}$ stand for the torque of motor 1 and motor 2, respectively. Considering that the engine and the two motors work by means of a planetary gear unit, the coupling relationship can be expressed as

\left\{ \begin{array}{l} ω_{eng} = \frac{1}{1 + μ} ω_{mot1} + \frac{μ}{1 + μ} ω_{mot2} \\ T_{eng} = (1 + μ) T_{mot1} = \left(1 + \frac{1}{μ}\right) T_{mot2} \end{array} \right. , (6)

where $μ$ indicates the gear ratio of the planetary gear.
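
The coupling relationship in Eq. 6 can be checked numerically. The sketch below is illustrative only: the function name and the gear ratio value are assumptions, not values from the article. It computes the engine speed from the two motor speeds and the motor torques from the engine torque, and then verifies that the ideal (lossless) planetary gear conserves mechanical power.

```python
def planetary_coupling(w_mot1, w_mot2, t_eng, mu):
    """Eq 6: engine speed from the motor speeds, motor torques from the engine torque."""
    w_eng = w_mot1 / (1.0 + mu) + mu * w_mot2 / (1.0 + mu)
    t_mot1 = t_eng / (1.0 + mu)          # from T_eng = (1 + mu) * T_mot1
    t_mot2 = t_eng / (1.0 + 1.0 / mu)    # from T_eng = (1 + 1/mu) * T_mot2
    return w_eng, t_mot1, t_mot2

# Hypothetical operating point: speeds in rad/s, torque in N*m, mu = 2.6 (assumed)
w_eng, t_mot1, t_mot2 = planetary_coupling(100.0, 200.0, 100.0, 2.6)
power_in = t_eng_w = 100.0 * w_eng
power_out = t_mot1 * 100.0 + t_mot2 * 200.0
```

With these relations, the engine power equals the sum of the powers at the two motor shafts, which is the expected property of an ideal planetary gear set.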

Eqs 2–6 reveal that the total fuel consumption of the vehicle throughout the driving cycle is determined by the engine speed and torque. Since two control degrees of freedom, the speed and the torque, increase the complexity of the control strategy, the engine optimal operating line (OOL), shown in Figure 2, is used to establish a mapping between them and simplify the calculation (Chen et al., 2015). Given an engine power request, the optimal engine speed and, consequently, the optimal engine torque can be obtained. Thus, the fuel consumption rate of the engine at each moment can be determined through the engine fuel consumption rate map, as shown in Figure 3. The corresponding mathematical relationship can be expressed as

ω_{e n g} = h^{*} (P_{e n g}) . (7)

FIGURE 2. Engine optimal operating line.

FIGURE 3. Engine fuel consumption rate MAP.

With the introduction of the engine OOL, it can be found from Eqs 2–7 that the instantaneous fuel consumption of the engine can be determined from the battery power, power demand, and vehicle speed, that is,

f (ω_{e n g}, T_{e n g}) = f (P_{d r i v e}, P_{e s s}, v) . (8)

In this study, a simple equivalent circuit model that includes the internal resistance and open-circuit voltage is applied to characterize the performance of the battery as

\left\{ \begin{array}{l} I_{ess} = \frac{OCV - \sqrt{OCV^{2} - 4 R_{int} P_{ess}}}{2 R_{int}} \\ P_{ess} = OCV \cdot I_{ess} - I_{ess}^{2} R_{int} \\ SOC (t) = SOC_{init} - \frac{1}{C_{ess}} \int_{0}^{t} I_{ess} \, dt \end{array} \right. , (9)

where $I_{e s s}$ is the battery current, $O C V$ denotes the open-circuit voltage, $R_{int}$ represents the internal resistance, $S O C (t)$ indicates the $S O C$ value at time step t, $S O C_{i n i t}$ means the initial $S O C$ value, and $C_{e s s}$ stands for the battery capacity. In this equivalent circuit model, the open-circuit voltage and internal resistance are determined by the instantaneous SOC value, as shown in Figure 4. It can be found that when the SOC decreases, the open-circuit voltage decreases from 220 V to 165 V and internal resistance varies from 0.09 Ω to 0.14 Ω.
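
To make the equivalent circuit model of Eq. 9 concrete, the following sketch solves the power balance for the battery current and advances the SOC by one time step. It is a simplified illustration: the function names are hypothetical, the open-circuit voltage and internal resistance are held constant within a step (in the article they vary with the SOC as in Figure 4), and the capacity value used in the example is an assumed placeholder.

```python
import math

def battery_current(p_ess, ocv, r_int):
    """Smaller root of R*I^2 - OCV*I + P = 0, i.e. the first line of Eq 9."""
    disc = ocv ** 2 - 4.0 * r_int * p_ess
    if disc < 0.0:
        raise ValueError("requested power exceeds what this battery model can deliver")
    return (ocv - math.sqrt(disc)) / (2.0 * r_int)

def soc_step(soc, p_ess, ocv, r_int, capacity_as, dt=1.0):
    """One explicit integration step of the SOC equation (capacity in ampere-seconds)."""
    return soc - battery_current(p_ess, ocv, r_int) * dt / capacity_as

i_dis = battery_current(10000.0, 200.0, 0.1)                    # discharging at 10 kW
soc_next = soc_step(0.6, 10000.0, 200.0, 0.1, 6.5 * 3600.0)     # 6.5 Ah is an assumed capacity
```

A positive battery power discharges the battery and lowers the SOC, whereas a negative power (charging) yields a negative current and raises it.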

FIGURE 4. Variation of $O C V$ and $R_{int}$ with the SOC.

According to the analysis mentioned above, it is found that instantaneous fuel consumption can be obtained when the battery power is determined. Therefore, in this study, battery power is applied as the control variable to obtain fuel consumption. Considering the power limitations and performance requirements of the PHEV, the following constraints also have to be made:

\left\{ \begin{array}{l} P_{drive\_min} < P_{drive} < P_{drive\_max} \\ P_{eng\_min} < P_{eng} < P_{eng\_max} \\ P_{ess\_min} < P_{ess} < P_{ess\_max} \\ P_{mot1\_min} < P_{mot1} < P_{mot1\_max} \\ P_{mot2\_min} < P_{mot2} < P_{mot2\_max} \\ SOC_{\min} < SOC < SOC_{\max} \end{array} \right. , (10)

where the subscripts indicate the minimum and maximum values of the variables, respectively.

Based on the abovementioned energy flow analysis, a bi-level EMS based on RL speed prediction is developed to determine the optimal battery power at each moment, beginning with the speed predictor described in Section 3.

3 Speed prediction based on QL

3.1 QL algorithm

As an important milestone of the RL algorithm, QL has been widely used in many fields due to its characteristics of efficient convergence and easy implementation (Watkins et al., 1992). The main idea of the QL algorithm is to form a value function $Q$ that can be directly iterated and updated by state–action pairs and update the value function $Q$ through the interaction between the agent and environment to obtain the optimal action strategy set under certain conditions. The QL algorithm can be summarized in a simple five-tuple representation $\{S, A, γ, R, π\}$ , where $S$ denotes the state variable, $A$ denotes the action variable, $R$ denotes the reward function, $γ$ denotes the discount factor of the agent in the learning process, and $π$ denotes the optimal action strategy set for the agent to interact with the environment.

In the QL algorithm, the agent is a learner and the decision maker, interacting with different states of the environment at each moment. The agent decides $a_{t}$ according to the current state $s_{t}$ . After receiving the decision, the environment enters the new state $s_{t + 1}$ and gives the corresponding reward $r_{t + 1}$ , and the agent continuously learns and improves its actions on the basis of the reward received until the maximum cumulative reward is obtained. The cumulative expected reward obtained by the agent in the learning process is known as the expectation function, and it can be described as

V = E (\sum_{t = 0}^{T} γ^{t} r_{t}) . (11)

As future actions and states are unpredictable when the agent performs the current action, the state–action pair function $Q$ is introduced to estimate the expected future payoffs that result from the actions taken under some future strategy in the currently known state. This can be expressed as

Q (s, a) = r (s, a) + γ E_{π} [Q (s^{'}, a^{'})], (12)

where $s^{'}$ denotes the next state and $a^{'}$ denotes the action corresponding to the next state. After the learning task is accomplished by the agent, the optimal state–action pair function $Q^{*} (s, a)$ is obtained, and the optimal strategy follows as

π^{*} = \underset{a \in A}{argmax} Q^{*} (s, a) . (13)

During the learning process of the agent, the updated rule of the value function can be expressed as

Q (s, a) \leftarrow Q (s, a) + β (r + γ \max_{a} Q (s^{'}, a) - Q (s, a)), (14)

where $β$ represents the learning rate. A larger learning rate speeds up convergence, but if it is too large, the updates oscillate and the value function may fail to converge.
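
The update rule of Eq. 14 and the greedy choice of Eq. 13 can be written compactly for a tabular value function. The sketch below is a generic illustration rather than the authors' implementation; the table layout `q[state][action]`, the function names, and the default values of $β$ and $γ$ are assumptions.

```python
def q_update(q, s, a, r, s_next, beta=0.1, gamma=0.9):
    """Apply Eq 14 in place to a table q[state][action] and return the new value."""
    td_target = r + gamma * max(q[s_next])      # r + gamma * max_a Q(s', a)
    q[s][a] += beta * (td_target - q[s][a])
    return q[s][a]

def greedy_action(q, s):
    """Eq 13: index of the action with the largest value in state s."""
    return q[s].index(max(q[s]))

# Two states and two actions, with hypothetical initial values
q = [[0.0, 0.0], [1.0, 2.0]]
q_update(q, s=0, a=1, r=10.0, s_next=1, beta=0.5, gamma=0.9)
```

In this example the target is 10 + 0.9 × 2 = 11.8, so the entry for state 0 and action 1 moves halfway from 0 to 11.8.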

In Section 3.2, QL is employed for speed prediction in preparing the groundwork for bi-level energy management later on.

3.2 Speed prediction based on QL

In this study, the QL method is employed for speed prediction in the bi-level energy management framework. In the QL-based speed predictor, the state space, action space, and reward function of the controller system have to be determined first. The driving speed of the vehicle is taken as the state variable, and the vehicle speed is discretized into $m$ intervals, which can be expressed as

S \in \{v_{s p d}^{1}, v_{s p d}^{2}, v_{s p d}^{3}, \dots, v_{s p d}^{m}\}, (15)

where $v_{s p d}$ represents the current speed state. Moreover, the acceleration can be regarded as a random variable due to the strong uncertainty in the actual driving process of the vehicle. Therefore, the acceleration is taken as the control variable, and it ranges from $- 4 m / s^{2}$ to $4 m / s^{2}$ and is discretized into $n$ intervals as

A = \{a_{a c c}^{1}, a_{a c c}^{2}, a_{a c c}^{3}, \dots, a_{a c c}^{n}\} . (16)

The instantaneous reward is based on the absolute difference between the predicted vehicle speed and the actual value, which is defined as

v_{d i f f} (t) = |v_{p r e} (t) - v_{r e a l} (t)|, (17)

where $v_{p r e} (t)$ and $v_{r e a l} (t)$ , respectively, denote the predicted velocity and real velocity. The specific reward value is defined as

r (t) = \left\{ \begin{array}{ll} 100 & 0 \leq v_{diff} \leq 0.25 \\ 75 & 0.25 < v_{diff} \leq 0.5 \\ 50 & 0.5 < v_{diff} \leq 1 \\ 25 & 1 < v_{diff} \leq 1.5 \\ 0 & 1.5 < v_{diff} \leq 2 \\ - 1000 & v_{diff} > 2 \end{array} \right. . (18)
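
The piecewise reward of Eqs 17, 18 maps directly onto a small lookup. The following sketch is one possible way to code it; the function name is hypothetical, and the speed unit is whatever unit the thresholds of Eq. 18 are expressed in, which the text does not state.

```python
def speed_reward(v_pre, v_real):
    """Reward of Eq 18 for one predicted speed sample."""
    v_diff = abs(v_pre - v_real)                 # Eq 17
    # (upper bound of the error band, reward), checked from the tightest band outwards
    bands = ((0.25, 100), (0.5, 75), (1.0, 50), (1.5, 25), (2.0, 0))
    for upper, reward in bands:
        if v_diff <= upper:
            return reward
    return -1000                                 # error larger than 2

rewards = [speed_reward(10.0, v) for v in (10.25, 10.5, 11.0, 11.5, 12.0, 13.0)]
```

The large negative value for errors above 2 dominates the graded positive rewards, so the agent is pushed first to avoid gross mispredictions and only then to refine accuracy.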

After the state space, action space, and reward function of the controller are set, five standard cycles (CLTCP, JC08, WLTC, LA92, and FTP75), as shown in Figure 5, are applied as the training cycles to train the QL speed prediction controller, and the number of iterations is set to 500. As can be seen from Figure 5, the five standard cycles cover a variety of speed segments, such as low speed, medium speed, high speed, and rapid acceleration/deceleration. Here, QL is used to predict future speeds, and the iteration process is tabulated in Table 2. The cumulative reward of the QL speed controller for different prediction time domains is depicted in Figure 6, where the cumulative reward gradually converges to a constant value as the number of iterations increases. Figure 7 shows the difference in cumulative reward between iterations for different prediction time domains; similarly, the difference flattens out and gradually converges to a stable value as the number of iterations increases. These results indicate that the QL speed controller converges and stabilizes within 500 iterations of learning.

FIGURE 5. Training cycle.

TABLE 2. Iterative process of the QL speed predictor.

FIGURE 6. Cumulative reward.

FIGURE 7. Cumulative reward difference for each iteration.

3.3 Contrast analysis

During actual vehicle operation, the acceleration of the vehicle is highly uncertain and can be described by a discrete Markov chain model (Ganesh and Xu, 2022); therefore, the Markov chain model is commonly applied for speed prediction. To validate the proposed speed prediction method, its results are compared with those of the Markov chain model. Note that the proposed method and the Markov chain model are trained with the same training data, and the UDDS, HWFET, NEDC, and WVUSUB cycles are each applied to test the models.

In addition, two different error functions— $E r r$ and RMSE—are calculated to evaluate the speed prediction performance as

\left\{ \begin{array}{l} Err (t) = \sqrt{\sum_{i = 1}^{t_{p}} (v_{t, i}^{pre} - v_{t, i}^{real})^{2} / t_{p}} \\ RMSE = \sum_{t = 1}^{T} Err (t) / T \end{array} \right. , (19)

where $E r r (t)$ denotes the RMSE value of the predicted velocity series and actual velocity series in the predicted time domain at time $t$ , $t_{p}$ represents the predicted time domain, $v_{t, i}^{p r e}$ denotes the predicted speed at the $i ‑ t h$ second after time $t$ , $v_{t, i}^{r e a l}$ represents the actual velocity at the $i ‑ t h$ second after time $t$ , and $T$ indicates the total duration of the working cycle.
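
The two error measures of Eq. 19 can be computed as follows. This is a minimal sketch with hypothetical function names; it assumes that the predicted and actual speeds are supplied as one window of length $t_{p}$ per time step $t$.

```python
import math

def horizon_err(v_pre, v_real):
    """Err(t) of Eq 19: RMSE between predicted and actual speeds over one horizon."""
    t_p = len(v_pre)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(v_pre, v_real)) / t_p)

def cycle_rmse(pred_windows, real_windows):
    """RMSE of Eq 19: mean of Err(t) over all time steps of the cycle."""
    errs = [horizon_err(p, r) for p, r in zip(pred_windows, real_windows)]
    return sum(errs) / len(errs)

# Two time steps with a 4-sample horizon each (made-up numbers)
pred = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
real = [[2.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
overall = cycle_rmse(pred, real)
```

Note that the overall index is an average of per-horizon RMSE values, not a single RMSE over all samples, so long horizons with large late errors weigh on every window in which they appear.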

The comparison results of the two velocity predictors with the prediction length varying within 3 s, 5 s, 10 s, and 15 s that are based on the UDDS cycle are shown in Figure 8, and the statistic errors are given in Table 3.

FIGURE 8. Comparison of speed prediction results: (A) Markov speed method and (B) QL speed method.

TABLE 3. Comparative analysis of speed prediction results.

These comparison results show that the Markov method is a random prediction method that yields different speed prediction values at each step, whereas the QL velocity prediction method is highly regular: it yields the same predicted velocity trajectory at different moments whenever the velocity falls within the same state interval. Therefore, provided that the QL training process converges, the QL speed prediction method avoids the loss of prediction accuracy caused by the randomness of speed prediction.

In addition, Table 3 shows the RMSE indexes of the two methods. Taking the Markov method as the benchmark, the prediction accuracy of the QL method improves relative to the benchmark as the prediction time increases, especially for high-speed conditions such as HWFET. Although the prediction accuracy of the QL method is very low compared with Markov speed prediction at 3 s, it improves with the prediction time, which likewise indicates that the QL velocity prediction method can avoid the loss of accuracy caused by the randomness of the prediction. Considering the influence of the prediction duration on the design of the EMS, the prediction duration of the bi-level EMS in the following sections is set to 10 s.

4 Bi-level energy management strategy

MPC is a rolling optimization control algorithm implemented online that combines predictive information with a rolling optimization mechanism to achieve better control performance when dealing with non-linear models. The bi-level EMS designed in this study is shown in Figure 9. The upper controller is the speed prediction model developed in Section 3.2, with a prediction length of 10 s, and its speed prediction results are input into the lower controller. In the lower controller, a valid and convergent double QL offline controller is trained first, and the SOC trajectory calculated by the double QL offline controller is treated as the reference trajectory of the MPC. Based on the predicted vehicle speed supplied to the prediction model, the first element of the control sequence in the prediction time domain is output after feedback correction. The double QL offline controller is introduced first in Section 4.1.

FIGURE 9. Framework of the bi-level EMS.

4.1 Double QL offline controller

The double QL algorithm is an improved variant of the QL algorithm (Watkins et al., 1992). It differs from the QL algorithm in employing two state–action pair functions to solve for the optimal action, instead of the single function updated by Eq. 14. The update of the value function $Q$ in Eq. 14 depends on $\max Q (s^{'}, a)$, so the QL method uses the same estimate both to select and to evaluate the maximizing action, which tends to overestimate the action value. The double QL avoids this overestimation by constructing two $Q$ functions while retaining the iterative efficiency of the algorithm. To a certain extent, the double QL can therefore be expected to achieve better results than the QL method.

In the double QL algorithm of this study, the power demand $P_{req}$ and battery $SOC$ are set as the state variables and are discretized into $l$ and $p$ intervals, respectively. The battery power output $P_{ess}$ is set as the action variable and discretized into $k$ intervals, and the state variables and action variables of the algorithm are represented as

\{\begin{array}{l} S \in \{(S O C^{1}, P_{r e q}^{1}), (S O C^{1}, P_{r e q}^{2}), (S O C^{1}, P_{r e q}^{3}), \dots, (S O C^{p}, P_{r e q}^{l})\} \\ A = \{P_{e s s}^{1}, P_{e s s}^{2}, P_{e s s}^{3}, \dots, P_{e s s}^{k}\} \end{array} . (20)

The instantaneous reward function $r (t)$ is judged by the engine’s on–off and $S O C$ values, and the specific reward function value is expressed as

r (t) = \left\{ \begin{array}{ll} - Fuel_{rate} \cdot 10^{6} & eng\_on = 1 \cap 0.3 \leq SOC \leq 0.9 \\ (- Fuel_{rate} \cdot 10^{6}) \cdot 10 & eng\_on = 1 \cap (SOC < 0.3 \mid SOC > 0.9) \\ 0.5 / Fuel_{rate\_max} & eng\_on = 0 \cap 0.3 \leq SOC < 0.5 \\ 1 / Fuel_{rate\_max} & eng\_on = 0 \cap 0.5 \leq SOC < 0.7 \\ 2 / Fuel_{rate\_max} & eng\_on = 0 \cap 0.7 \leq SOC < 0.9 \\ - 2000 & eng\_on = 0 \cap SOC < 0.3 \end{array} \right. , (21)

where $eng\_on = 1$ indicates that the engine is turned on, $eng\_on = 0$ indicates that the engine is turned off, and $Fuel_{rate\_max}$ is the maximum value of the engine fuel consumption map shown in Figure 3. Note that the engine on–off mode is governed by a simple power threshold: when the engine power is greater than or equal to the threshold, the engine is turned on, and when it is less than the threshold, the engine is turned off. This can be expressed as

\left\{ \begin{array}{ll} eng\_on = 1 & P_{eng} \geq P_{eng\_on} \\ eng\_on = 0 & P_{eng} < P_{eng\_on} \end{array} \right. , (22)

where $P_{eng\_on}$ is the power threshold at which the engine is turned on. Since the two $Q$ functions $Q_{A}$ and $Q_{B}$ are employed to evaluate the value function, the choice of the optimal action of Eq. 13 can be rewritten as

π_{opt} : a (t) \leftarrow \left\{ \begin{array}{ll} randi (A) & \text{if } ε \leq c \\ argmax (Q_{A} (s (t), :) + Q_{B} (s (t), :)) & \text{if } ε > c \end{array} \right. , (23)

where $c$ represents a random number from 0 to 1, and $ε$ is the greed factor. In the process of updating the two $Q$ functions, there is a random number $b$ ( $b \in [0,1)$ ) that is applied to select which $Q$ function is being updated, which can be represented as

\left\{ \begin{array}{ll} \text{update } Q_{A} & \text{if } b > 0.5 \\ \text{update } Q_{B} & \text{otherwise} \end{array} \right. . (24)

Similarly, the update process for the two Q functions can be rewritten as

\left\{ \begin{array}{l} \bar{a} = {argmax}_{a} Q_{A} (s^{'}, :) \\ Q_{A} (s, a) \leftarrow Q_{A} (s, a) + β (r + γ Q_{B} (s^{'}, \bar{a}) - Q_{A} (s, a)) \\ \bar{b} = {argmax}_{a} Q_{B} (s^{'}, :) \\ Q_{B} (s, a) \leftarrow Q_{B} (s, a) + β (r + γ Q_{A} (s^{'}, \bar{b}) - Q_{B} (s, a)) \end{array} \right. , (25)

where $s^{'}$ is the new state obtained by executing action $a$, and $\bar{a}$ and $\bar{b}$ are the actions with the maximum values of $Q_{A}$ and $Q_{B}$ at state $s^{'}$, respectively. It can be noted from Eq. 25 that each update of one $Q$ function uses a sample value from the other $Q$ function, which can be considered an unbiased estimate for the value function update. Theoretically, this updating method avoids overestimation of the function values (Chen et al., 2015). After the two $Q$ functions have updated each other's values, the optimal strategy can be expressed as

π_{o p t}^{*} = argmax (Q_{A} + Q_{B}) . (26)
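
Eqs 23–25 can be sketched for tabular value functions as shown below. This is an illustrative reading of the equations, not the authors' code: the table layout, function names, and default hyperparameters are assumptions, and the random draws $c$ and $b$ can be passed in explicitly so that the behavior is reproducible. Following Eq. 23 as written, a random action is taken when $ε \leq c$, so under this convention a larger $ε$ means less exploration.

```python
import random

def select_action(q_a, q_b, s, eps, c=None):
    """Eq 23: random action if eps <= c, otherwise greedy on Q_A + Q_B."""
    c = random.random() if c is None else c
    if eps <= c:
        return random.randrange(len(q_a[s]))
    summed = [qa + qb for qa, qb in zip(q_a[s], q_b[s])]
    return summed.index(max(summed))

def double_q_update(q_a, q_b, s, a, r, s_next, beta=0.1, gamma=0.9, b=None):
    """Eqs 24-25: update one table, evaluating its argmax action with the other table."""
    b = random.random() if b is None else b
    upd, other = (q_a, q_b) if b > 0.5 else (q_b, q_a)
    best = upd[s_next].index(max(upd[s_next]))    # argmax on the table being updated
    upd[s][a] += beta * (r + gamma * other[s_next][best] - upd[s][a])

# Two states and two actions with made-up values
q_a = [[0.0, 0.0], [1.0, 3.0]]
q_b = [[0.0, 0.0], [5.0, 2.0]]
double_q_update(q_a, q_b, 0, 0, 1.0, 1, beta=0.5, gamma=1.0, b=0.9)  # updates Q_A
double_q_update(q_a, q_b, 0, 1, 1.0, 1, beta=0.5, gamma=1.0, b=0.1)  # updates Q_B
```

In the first call, $Q_{A}$ selects action 1 in the next state but the value 2 is read from $Q_{B}$ rather than the larger 3 from $Q_{A}$, which is the mechanism that counteracts overestimation.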

The training cycles shown in Figure 5 are also employed to train the double QL controller. The error value of each iteration of the double QL controller is shown in Figure 10. The difference of the value function between iterations gradually decreases and levels off as the number of iterations increases, which indicates that the algorithm converges and that the agent trained by the double QL algorithm is effective.

FIGURE 10. Difference of the value function of each iteration.

In this study, the double QL algorithm is applied as the basis for the bi-level energy management rolling optimization process, and the design process of the bi-level EMS will be analyzed in detail later.

4.2 Controller implementation

In this study, the state transfer equation of the MPC controller can be expressed as

x(t + 1) = f(x(t), u(t), w(t)), (27)

where $x(t)$ denotes the system state variable at time $t$, $u(t)$ denotes the control variable at time $t$, and $w(t)$ denotes the random perturbation variable, such as the predicted speed. In the energy management optimization problem of this study, the state variable of the system is the battery SOC, i.e., $x = SOC$; the system control variable is the battery power, i.e., $u = P_{ess}$; and the system stochastic perturbation is the predicted vehicle speed. The prediction time domain $N_{p}$ of the MPC controller designed in this study is equal to the control time domain $N_{c}$, both of which are 10 s. The optimized indicator function in each prediction time domain can be expressed as

J_{t} = \min \sum_{t}^{t + N_{p}} \left[ f_{fuel}(t) + f_{soc}(t) \right], (28)

where $J_{t}$ is the optimization target in the prediction time domain $[t, t + N_{p}]$, $f_{fuel}(t)$ represents the instantaneous fuel consumption function at each moment, i.e., $f_{fuel}(t) = Fuel_{rate}(t)$, and $f_{soc}(t)$ denotes the cost of deviation of the battery $SOC(t)$ from the reference trajectory $SOC_{ref}(t)$ at time $t$, which is expressed as

f_{soc} = \left\{ \begin{array}{ll} 0, & SOC(t) > SOC_{ref}(t) \\ α (SOC(t) - SOC_{ref}(t))^{2}, & SOC(t) < SOC_{ref}(t) \end{array} \right. , (29)

where $α$ denotes a positive weighting factor. The purpose of setting the cost function for the battery's $SOC$ is to ensure that the actual $SOC$ fluctuates around the $SOC$ reference trajectory.
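As a small illustration of Eqs 28 and 29, the following sketch evaluates the horizon cost for a given candidate sequence. The function names and the numerical value of the weighting factor are hypothetical, and Eq. 29 does not define the case $SOC(t) = SOC_{ref}(t)$, which is treated as zero cost here by assumption.

```python
def soc_cost(soc, soc_ref, alpha):
    """Eq. 29: quadratic penalty applied only when the SOC is below the reference."""
    if soc < soc_ref:
        return alpha * (soc - soc_ref) ** 2
    # SOC above the reference; the equality case is assumed to cost nothing too.
    return 0.0


def horizon_cost(fuel_rates, socs, soc_refs, alpha):
    """Summand of Eq. 28 accumulated over one prediction horizon."""
    return sum(fuel + soc_cost(soc, ref, alpha)
               for fuel, soc, ref in zip(fuel_rates, socs, soc_refs))


# Two-step horizon with an illustrative weighting factor of 1000.
cost = horizon_cost([1.0, 2.0], [0.5, 0.4], [0.4, 0.5], alpha=1000.0)
```

Only the second step contributes a penalty in this example, because the SOC is above the reference in the first step.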

The rolling optimization process of the designed bi-level energy management controller is described as follows.

1) According to the speed and acceleration of future driving conditions, the QL speed predictor is employed to estimate the speed sequence $v_{t+1}, v_{t+2}, \dots, v_{t+N_{p}}$ in the prediction time domain.

2) The power demand sequence $P_{drive,t+1}, P_{drive,t+2}, \dots, P_{drive,t+N_{p}}$ in the predicted time domain is calculated from Eqs 3, 4 and the velocity sequence $v_{t+1}, v_{t+2}, \dots, v_{t+N_{p}}$.

3) The reference SOC trajectory $SOC_{ref}(t, t + N_{p})$ is used for rolling optimization in combination with the double QL controller. The rolling optimization process is shown in Table 4. In the rolling optimization process, the two $Q$ matrices obtained from the double QL controller training are denoted as $Q_{original\_A}$ and $Q_{original\_B}$, and the two $Q$ matrices involved in the rolling optimization are denoted as $Q_{roll\_A}$ and $Q_{roll\_B}$. The state space and action space in the rolling optimization process are consistent with the settings in the double QL controller.

4) After a feedback correction session, the first control variable in the control time domain is output to the PHEV model. It should be noted that, since the double QL controller has already converged during offline training, the bi-level energy management performs a single rolling optimization pass over the predicted time domain at each step, without further repeated iterations.
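The four steps above can be sketched as one receding-horizon step. This is only a structural outline under stated assumptions: `predict_speeds`, `demand_power`, `soc_ref`, and `plan_controls` are hypothetical callables standing in for the QL speed predictor, Eqs 3 and 4, the reference SOC trajectory, and the double QL rolling optimization, respectively, and the feedback correction session is not modeled.

```python
def rolling_step(soc, speed, t, n_p, predict_speeds, demand_power,
                 soc_ref, plan_controls):
    """Run one rolling-optimization step and return the control to apply."""
    # Step 1: predicted speed sequence over the prediction time domain.
    speeds = predict_speeds(speed, n_p)
    # Step 2: power demand sequence derived from the predicted speeds.
    powers = demand_power(speeds)
    # Step 3: optimize against the reference SOC over the same horizon.
    refs = [soc_ref(t + k) for k in range(1, n_p + 1)]
    controls = plan_controls(soc, powers, refs)
    # Step 4: only the first control of the sequence is sent to the vehicle.
    return controls[0]


# Toy stand-ins, purely illustrative and unrelated to the PHEV model.
u0 = rolling_step(
    soc=0.8, speed=10.0, t=0, n_p=10,
    predict_speeds=lambda v, n: [v] * n,
    demand_power=lambda vs: [2.0 * v for v in vs],
    soc_ref=lambda k: 0.9 - 0.001 * k,
    plan_controls=lambda soc, powers, refs: [0.5 * p for p in powers],
)
```

At the next time step the horizon is shifted by one step and the same procedure is repeated with the updated SOC and speed.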

TABLE 4. Rolling optimization process.

The process of MPC rolling optimization is shown in Figure 11. For each step of the rolling optimization, the inputs are the speed prediction sequence, the SOC reference value, and the demand power sequence in the predicted time domain. The two $Q$ matrices $Q_{original\_A}$ and $Q_{original\_B}$ are obtained from these three inputs, which are involved in the rolling optimization process, and the two matrices are employed to perform optimization in the predicted time domain based on Eq. 25.

FIGURE 11. Principle of MPC rolling optimization.

5 Results and discussion

In this section, four driving cycles are applied as the test data set, i.e., the standard cycles WVUSUB and NEDC and two real-world driving cycles (KM1 and KM2). These cycles are combined into three different sets, as shown in Figure 12: the first set, referred to as Cycle 1, is composed of three WVUSUB and three NEDC cycles; the second is composed of five KM1 cycles; and the third is composed of 12 KM2 cycles. There are two reasons for including the two real-world cycles KM1 and KM2: the first is to verify the effectiveness of the proposed strategy under training conditions of different time lengths, and the second is to ensure that the different energy management strategies can reduce the SOC to the lowest threshold. The performance of the proposed method is evaluated from three perspectives. First, the proposed method is compared with the double QL offline controller, QL, CD/CS, and SDP. Second, since the SOC trajectory of double QL is utilized as the reference trajectory in the design of the proposed method, three other SOC reference trajectories are utilized to show the applicability of the proposed method under different SOC reference trajectories. Finally, the computational efficiency of the proposed method is analyzed to verify its practicality.

FIGURE 12. Test cycles: (A) Cycle 1; (B) 5 KM1; (C) 12 KM2.

5.1 Comparison with different methods

Table 5 lists the fuel consumption of the different EMSs under the three cycle sets with SOC correction. The fuel-saving effect of the double QL method is better than that of the QL method because double QL avoids the overestimation of values caused by the single Q matrix in the QL method. Moreover, since the SOC curve of the double QL method is applied as the reference trajectory, the fuel consumption of the proposed method is close to that of the double QL method. Furthermore, compared with the SDP method, the fuel consumption of the proposed method is 2.73%, 3.38%, and 1.57% higher under the three driving cycles, respectively, and thus remains close to that of the SDP method. In addition, the fuel consumption of the proposed method differs by only 0.32% from that of the QL method under the 12-KM2 driving cycle.

TABLE 5. Comparison of the fuel consumption result of different methods.

The SOC curves of the different EMSs are shown in Figure 13. Because a penalty function on the SOC is added in the rolling optimization process, the SOC trajectory of the proposed strategy effectively tracks the reference trajectory and fluctuates around it. To further verify the effectiveness of the proposed method, Figure 14 depicts the engine efficiency of the SDP, double QL, and proposed methods under the different verification cycles, from which it can be seen that these control methods make the engine work in a more efficient region. Moreover, taking 12 KM2 as an example, the engine operating points of the proposed method are the closest to those of the SDP method, and therefore the fuel consumption of the proposed strategy is also the closest to that of the SDP method. In summary, the comparison of fuel consumption across the different methods demonstrates that the proposed method is effective in fuel saving from different perspectives.

FIGURE 13. SOC curves of different methods: (A) Cycle 1; (B) 5 KM1; and (C) 12 KM2.

FIGURE 14. Engine efficiency of different methods under different driving cycles: (A) Cycle 1; (B) 5 KM1; and (C) 12 KM2.

5.2 Tracking effect of different reference trajectories

The rolling optimization process and reference trajectory of the proposed method are based on the double QL offline controller. To further validate the learning effect of the proposed method, three different SOC reference trajectories, namely, SDP, QL, and linear distance, are employed to validate the extension of the proposed method. Among them, the linear distance reference trajectory is given as

SOC_{dis}(t) = SOC_{init} - \frac{Dis_{drive}}{Dis_{all}} (SOC_{init} - SOC_{low}), (30)

where $SOC_{dis}(t)$ denotes the linear distance reference SOC at time step $t$. $SOC_{init}$ indicates the initial SOC value, which is set as 0.9. $SOC_{low}$ means the final SOC value at the end of the driving cycle, which is set as 0.3. $Dis_{all}$ represents the distance of the entire driving cycle. $Dis_{drive}$ stands for the distance that has been traveled.
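Eq. 30 is simple enough to state directly in code. The sketch below uses the values given above for $SOC_{init}$ and $SOC_{low}$ as defaults; the function name and the clamping of the traveled-distance ratio to $[0, 1]$ are assumptions added for robustness and are not part of Eq. 30.

```python
def soc_linear_reference(dist_driven, dist_total, soc_init=0.9, soc_low=0.3):
    """Eq. 30: reference SOC that decreases linearly with traveled distance."""
    if dist_total <= 0:
        raise ValueError("dist_total must be positive")
    # Clamp the ratio so the reference stays between soc_low and soc_init.
    ratio = min(max(dist_driven / dist_total, 0.0), 1.0)
    return soc_init - ratio * (soc_init - soc_low)


# Halfway through a hypothetical 50 km trip the reference is about 0.6.
mid_trip_ref = soc_linear_reference(25.0, 50.0)
```

The reference starts at 0.9, reaches 0.3 when the whole cycle distance has been covered, and requires the total trip distance to be known in advance.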

Figure 15 shows the SOC curves of the proposed method for these three different SOC reference trajectories under the 5-KM1 and 12-KM2 driving cycles. The SOC curves obtained by the proposed method follow the decreasing trend of the references under both driving cycles, and all three references are tracked well. Similarly, according to the enlarged view in Figure 13, the SOC curve of the proposed method fluctuates around the reference trajectory because of the penalty function set for the SOC during the rolling optimization. From this point of view, the proposed method can track the reference trajectory effectively.

FIGURE 15. SOC curves under different reference trajectories: (A) 5 KM1. (B) 12 KM2.

Table 6 shows the fuel consumption results of the proposed method under the three different reference trajectories. Taking the CD/CS fuel consumption as the benchmark, the proposed method yields high fuel economy for all three reference trajectories. In the 5-KM1 driving cycle, the fuel savings of the proposed method under the three reference trajectories are 5.58%, 4.22%, and 3.89%, respectively. In the 12-KM2 driving cycle, the fuel savings under the three reference trajectories are 6.64%, 5.06%, and 4.33%, respectively. Comparing the three reference trajectories, the SDP reference trajectory shows the best fuel-saving performance, and the linear distance reference trajectory shows the worst. This can be attributed to the fact that SDP is a globally suboptimal method and QL is a locally optimal method, so the reference trajectories obtained by these two methods inherit globally suboptimal or locally optimal characteristics. The linear distance reference trajectory has no such optimization characteristics, and its fuel economy is the lowest among these trajectories.

TABLE 6. Fuel consumption results under three different reference trajectories.

5.3 Computational efficiency analysis

In this study, the calculation time for a single step of the proposed method is evaluated on a laptop computer, which is equipped with the Intel Core i7 @2.3 GHz processor and 16 GB RAM. Note that the computational time does not include the training time for the speed predictor and double QL offline controller but only includes the time for speed prediction and computation time for the MPC controller. The calculation time is tabulated in Table 7, and it can be noted that the calculation time of each step ranges from 8.36 to 10.07 ms, which indicates that the proposed method has the potential for online implementation.

TABLE 7. Computational efficiency.

6 Conclusion

This study proposes a bi-level EMS to solve the energy management problem of PHEVs. First, considering the uncertainty of acceleration in the process of driving, the acceleration is taken as the action, and a QL-based speed predictor is constructed by the reinforcement learning algorithm. Second, considering different speed intervals during vehicle driving, the double QL method is utilized to establish an offline controller, and its fuel economy is verified. Then, the QL speed predictor and double QL offline controller are integrated into the MPC, in which the double QL method performs the rolling optimization to construct a bi-level energy management controller. The effectiveness, applicability, and practicality of the proposed method are verified by standard and measured driving cycles. The results show that the proposed method achieves high fuel economy for PHEVs with favorable tracking performance for the different reference trajectories, and its calculation efficiency shows the potential for real-time applications.

Our future work will focus on considering the impact of traffic information on vehicle fuel economy and the study of fuel economy of intelligently connected vehicles with intelligent traffic information. In addition, the proposed method should be further optimized by hardware-in-the-loop and real vehicle experiments.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author contributions

XY: supervision, investigation, discussion, and writing. CJ: methodology. MZ: methodology. HH: writing, discussion, and editing.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, editors, and reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Biswas, A., and Emadi, A. (2019). Energy management systems for electrified powertrains: State-of-the-Art review and future trends. IEEE Trans. Veh. Technol. 68 (7), 6453–6467. doi:10.1109/tvt.2019.2914457

Chen, Z., Gu, H., Shen, S., and Shen, J. (2022). Energy management strategy for power-split plug-in hybrid electric vehicle based on MPC and double Q-learning. Energy 245, 123182. doi:10.1016/j.energy.2022.123182

Chen, Z., Hu, H., Wu, Y., Xiao, R., Shen, J., and Liu, Y. (2018). Energy management for a power-split plug-in hybrid electric vehicle based on reinforcement learning. Appl. Sciences-Basel 8 (12), 2494. doi:10.3390/app8122494

Chen, Z., Hu, H., Wu, Y., Zhang, Y., Li, G., and Liu, Y. (2020). Stochastic model predictive control for energy management of power-split plug-in hybrid electric vehicles based on reinforcement learning. Energy 211, 118931. doi:10.1016/j.energy.2020.118931

Chen, Z., Liu, Y., Zhang, Y., Lei, Z., Chen, Z., and Li, G. (2022). A neural network-based ECMS for optimized energy management of plug-in hybrid electric vehicles. Energy 243, 122727. doi:10.1016/j.energy.2021.122727

Chen, Z., Mi, C. C., Xia, B., and You, C. (2014). Energy management of power-split plug-in hybrid electric vehicles based on simulated annealing and Pontryagin's minimum principle. J. Power Sources 272, 160–168. doi:10.1016/j.jpowsour.2014.08.057

Chen, Z., Xia, B., You, C., and Mi, C. C. (2015). A novel energy management method for series plug-in hybrid electric vehicles. Appl. Energy 145, 172–179. doi:10.1016/j.apenergy.2015.02.004

Cheng, S., Chen, X., Fang, S. n., Wang, X. y., Wu, X. h., et al. (2020). Longitudinal autonomous driving based on game theory for intelligent hybrid electric vehicles with connectivity. Appl. Energy 268, 115030. doi:10.1016/j.apenergy.2020.115030

Ganesh, A. H., and Xu, B. (2022). A review of reinforcement learning based energy management systems for electrified powertrains: Progress, challenge, and potential solution. Renew. Sustain. Energy Rev. 154, 111833. doi:10.1016/j.rser.2021.111833

Guo, J., He, H., Peng, J., and Zhou, N. (2019). A novel MPC-based adaptive energy management strategy in plug-in hybrid electric vehicles. Energy 175, 378–392. doi:10.1016/j.energy.2019.03.083

Guo, N., Zhang, X., Yuan, Z., Guo, L., and Du, G. (2021). Real-time predictive energy management of plug-in hybrid electric vehicles for coordination of fuel economy and battery degradation. Energy 214, 119070. doi:10.1016/j.energy.2020.119070

Guo, N., Zhang, X., Zou, Y., Du, G., Wang, C., and Guo, L. (2021). Predictive energy management of plug-in hybrid electric vehicles by real-time optimization and data-driven calibration. IEEE Trans. Veh. Technol. 71, 5677–5691. doi:10.1109/tvt.2021.3138440

Guo, N., Zhang, X., and Zou, Y. (2022). Real-time predictive control of path following to stabilize autonomous electric vehicles under extreme drive conditions. Automot. Innov. 5, 453–470. doi:10.1007/s42154-022-00202-3

Han, L., Jiao, X., and Zhang, Z. (2020). Recurrent neural network-based adaptive energy management control strategy of plug-in hybrid electric vehicles considering battery aging. Energies 13 (1), 202. doi:10.3390/en13010202

Hasselt, H. V., Guez, A., and Silver, D. (2015). Deep reinforcement learning with double Q-learning. arXiv:1509.06461.

He, H., Wang, Y., Han, R., Han, M., Bai, Y., and Liu, Q. (2021). An improved MPC-based energy management strategy for hybrid vehicles using V2V and V2I communications. Energy 225, 120273. doi:10.1016/j.energy.2021.120273

Jeong, J., Lee, D., Kim, N., Zheng, C., Park, Y.-I., and Cha, S. W. (2014). Development of PMP-based power management strategy for a parallel hybrid electric bus. Int. J. Precis. Eng. Manuf. 15 (2), 345–353. doi:10.1007/s12541-014-0344-7

Lei, Z., Qin, D., Zhao, P., Li, J., Liu, Y., and Chen, Z. (2020). A real-time blended energy management strategy of plug-in hybrid electric vehicles considering driving conditions. J. Clean. Prod. 252, 119735. doi:10.1016/j.jclepro.2019.119735

Li, S. E., Guo, Q., Xin, L., Cheng, B., and Li, K. (2017). Fuel-saving servo-loop control for an adaptive cruise control system of road vehicles with step-gear transmission. IEEE Trans. Veh. Technol. 66 (3), 2033–2043. doi:10.1109/tvt.2016.2574740

Lin, X., Wu, J., and Wei, Y. (2021). An ensemble learning velocity prediction-based energy management strategy for a plug-in hybrid electric vehicle considering driving pattern adaptive reference SOC. Energy 234, 121308. doi:10.1016/j.energy.2021.121308

Liu, Y., Li, J., Gao, J., Lei, Z., Zhang, Y., and Chen, Z. (2021). Prediction of vehicle driving conditions with incorporation of stochastic forecasting and machine learning and a case study in energy management of plug-in hybrid electric vehicles. Mech. Syst. Signal Process. 158, 107765. doi:10.1016/j.ymssp.2021.107765

Overington, S., and Rajakaruna, S. (2015). High-efficiency control of internal combustion engines in blended charge depletion/charge sustenance strategies for plug-in hybrid electric vehicles. IEEE Trans. Veh. Technol. 64 (1), 48–61. doi:10.1109/tvt.2014.2321454

Peng, J., He, H., and Xiong, R. (2017). Rule based energy management strategy for a series-parallel plug-in hybrid electric bus optimized by dynamic programming. Appl. Energy 185, 1633–1643. doi:10.1016/j.apenergy.2015.12.031

Quan, S., Wang, Y.-X., Xiao, X., He, H., and Sun, F. (2021). Real-time energy management for fuel cell electric vehicle using speed prediction-based model predictive control considering performance degradation. Appl. Energy 304, 117845. doi:10.1016/j.apenergy.2021.117845

Ruan, S., Ma, Y., Yang, N., Xiang, C., and Li, X. (2022). Real-time energy-saving control for HEVs in car-following scenario with a double explicit MPC approach. Energy 247, 123265. doi:10.1016/j.energy.2022.123265

Singh, K. V., Bansal, H. O., and Singh, D. (2021). Fuzzy logic and Elman neural network tuned energy management strategies for a power-split HEVs. Energy 225, 120152. doi:10.1016/j.energy.2021.120152

Sun, X., Zhou, Y., Huang, L., and Lian, J. (2021). A real-time PMP energy management strategy for fuel cell hybrid buses based on driving segment feature recognition. Int. J. Hydrogen Energy 46 (80), 39983–40000. doi:10.1016/j.ijhydene.2021.09.204

Tang, X., Chen, J., Pu, H., Liu, T., and Khajepour, A. (2022). Double deep reinforcement learning-based energy management for a parallel hybrid electric vehicle with engine start–stop strategy. IEEE Trans. Transp. Electrification 8 (1), 1376–1388. doi:10.1109/tte.2021.3101470

Watkins, C. J. C. H., and Dayan, P. (1992). Technical note: Q-learning. Mach. Learn. 8, 279–292. doi:10.1023/a:1022676722315

Wu, Y., Zhang, Y., Li, G., Shen, J., Chen, Z., and Liu, Y. (2020). A predictive energy management strategy for multi-mode plug-in hybrid electric vehicles based on multi neural networks. Energy 208, 118366. doi:10.1016/j.energy.2020.118366

Yang, N., Han, L., Xiang, C., Liu, H., and Li, X. (2021). An indirect reinforcement learning based real-time energy management strategy via high-order Markov chain model for a hybrid electric vehicle. Energy 236, 121337. doi:10.1016/j.energy.2021.121337

Zhang, L., Liu, W., and Qi, B. (2020). Energy optimization of multi-mode coupling drive plug-in hybrid electric vehicles based on speed prediction. Energy 206, 118126. doi:10.1016/j.energy.2020.118126

Zhang, W., Wang, J., Liu, Y., Gao, G., Liang, S., and Ma, H. (2020). Reinforcement learning-based intelligent energy management architecture for hybrid construction machinery. Appl. Energy 275, 115401. doi:10.1016/j.apenergy.2020.115401

Zhang, Y., Chu, L., Fu, Z., Xu, N., Guo, C., Zhao, D., et al. (2020). Energy management strategy for plug-in hybrid electric vehicle integrated with vehicle-environment cooperation control. Energy 197, 117192. doi:10.1016/j.energy.2020.117192

Zhou, Y., Li, H., Ravey, A., and Pera, M.-C. (2020). An integrated predictive energy management for light-duty range-extended plug-in fuel cell electric vehicle. J. Power Sources 451, 227780. doi:10.1016/j.jpowsour.2020.227780

Zhou, Y., Ravey, A., and Pera, M.-C. (2020). Multi-objective energy management for fuel cell electric vehicles using online-learning enhanced Markov speed predictor. Energy Convers. Manag. 213, 112821. doi:10.1016/j.enconman.2020.112821

Nomenclature

PHEV plug-in hybrid electric vehicle

EMS energy management strategy

CD/CS charge depleting/charge sustaining

SOC state of charge

DP dynamic programming

PMP Pontryagin’s minimum principle

GT game theory

ECMS equivalent consumption minimization strategy

MPC model predictive control

RL reinforcement learning

BP back propagation

QL Q-learning

SDP stochastic DP

OOL engine optimal operating line

RMSE root-mean-square error

$Fuel_{total}$ total fuel consumption

$Fuel_{rate}$ instantaneous fuel consumption

$T$ total time of the whole driving cycle

$ω_{eng}$ engine speed

$T_{eng}$ engine torque

$P_{drive}$ power demand of the vehicle

$F_{f}$ rolling resistance

$F_{i}$ grade resistance

$F_{w}$ air resistance

$F_{j}$ acceleration resistance

$v$ vehicle speed

$m$ vehicle mass

$g$ gravitational acceleration

$f$ rolling resistance coefficient

$α$ slope of travel

$C_{d}$ air resistance coefficient

$A_{wind}$ windward area of the vehicle

$δ$ mass conversion factor of the vehicle

$P_{final}$ power of the main gearbox

$P_{eng}$ power of the engine

$P_{ess}$ power of the battery

$P_{mot1}$ power of motor 1

$P_{mot2}$ power of motor 2

$P_{elec}$ power of the electrical accessories

$η_{final}$ transmission efficiency of the main reducer

$η_{gear}$ transmission efficiency of the transmission unit

$η_{mot1}$ transmission efficiency of motor 1

$η_{mot2}$ transmission efficiency of motor 2

$ω_{mot1}$ speed of motor 1

$ω_{mot2}$ speed of motor 2

$T_{mot1}$ torque of motor 1

$T_{mot2}$ torque of motor 2

$μ$ gear ratio of the planetary gear

$I_{ess}$ battery current

$OCV$ open-circuit voltage

$R_{int}$ internal resistance

$SOC(t)$ $SOC$ value at time step $t$

$SOC_{init}$ initial $SOC$ value

$C_{ess}$ battery capacity

$S$ state variable in QL

$A$ action variable in QL

$R$ reward function in QL

$a'$ action corresponding to the next state

$β$ learning efficiency

$v_{spd}$ current speed state

$v_{pre}(t)$ predicted velocity

$v_{real}(t)$ real velocity

$v_{diff}(t)$ difference between the predicted vehicle speed and actual value

$Err(t)$ RMSE value of the predicted velocity series and actual velocity series

$t_{p}$ predicted time domain

$v_{t,i}^{pre}$ predicted speed at the $i$-th second after time $t$

$v_{t,i}^{real}$ actual velocity at the $i$-th second after time $t$

$eng\_on = 1$ engine is turned on

$eng\_on = 0$ engine is turned off

$Fuel_{rate\_max}$ the maximum value of engine fuel consumption MAP

$P_{eng\_on}$ threshold value for engine turned on

$c$ a random number from 0 to 1

$ε$ greed factor

$x(t)$ system state variable at time $t$

$u(t)$ control variable at time $t$

$w(t)$ random perturbation variable

$N_{p}$ prediction time domain

$N_{c}$ control time domain

$J_{t}$ optimization target in the prediction time domain $[t, t + N_{p}]$

$SOC_{ref}(t)$ SOC reference trajectory

$f_{fuel}(t)$ instantaneous fuel consumption function at each moment

$f_{soc}(t)$ cost of deviation of the battery’s SOC from the reference trajectory at time $t$

$α$ a positive weighting factor

$SOC_{dis}(t)$ linear distance for the reference SOC at time step $t$

$SOC_{low}$ final SOC value at the end of the driving cycle

Keywords: plug-in hybrid electric vehicle, reinforcement learning, speed prediction, bi-level energy management strategy, model predictive control (MPC)

Citation: Yang X, Jiang C, Zhou M and Hu H (2023) Bi-level energy management strategy for power-split plug-in hybrid electric vehicles: A reinforcement learning approach for prediction and control. Front. Energy Res. 11:1153390. doi: 10.3389/fenrg.2023.1153390

Received: 29 January 2023; Accepted: 27 February 2023;
Published: 16 March 2023.

Edited by:

Jiangwei Shen, Kunming University of Science and Technology, China

Reviewed by:

Ningyuan Guo, Beijing Institute of Technology, China
Zhongwei Deng, University of Electronic Science and Technology of China, China

Copyright © 2023 Yang, Jiang, Zhou and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Hengjie Hu
