Temperature Control of Proton Exchange Membrane Fuel Cell Based on Machine Learning

To improve the working efficiency of the proton exchange membrane fuel cell (PEMFC), we propose a deep-reinforcement-learning-based PID controller for optimal PEMFC stack temperature control. For this purpose, we propose the Improved Twin Delayed Deep Deterministic Policy Gradient (ITD3) algorithm as a tuner of the PID controller, which adjusts the coefficients of the controller in real time. The algorithm accelerates the learning of the agent by continuously changing the soft update parameter during training, which improves training efficiency, reduces training cost, and yields a robust strategy. The effectiveness of the control algorithm is verified in a simulation in which it is compared against a group of existing algorithms.


INTRODUCTION
The proton exchange membrane fuel cell (PEMFC) (Cheng and Yu, 2019a), as a high-efficiency energy conversion device, has a high hydrogen energy utilization rate, and is expected to become a widely used electric power source in the future (Cheng et al., 2018).
The PEMFC converts chemical energy into electrical energy. Because the conversion efficiency of the fuel cell is limited, the remaining energy is dissipated as heat (Cheng et al., 2020). To maintain the temperature required for sustaining the reaction inside the fuel cell, two heat dissipation media are usually used: cooling water and air. Which one is suitable depends on the heat load, which grows with the power generated. If the heat is not dissipated in time, it accumulates in the stack and the temperature becomes excessive, which degrades the working performance of the stack and can even endanger operational safety (Cheng and Yu, 2019b).
Low-power stacks require air-injecting cooling equipment such as cooling fans, while high-power stacks require cooling water circulation systems with larger specific heat capacity (Ai et al., 2013).
However, the inclusion of auxiliary equipment in the thermal management system complicates the water-cooled fuel cell arrangement.
Control methods for fuel cell stack temperature control systems proposed by domestic and foreign scholars in recent years include proportional integral (PI) and state feedback control (Ahn et al., 2020; Liso et al., 2014; Zhiyu et al., 2014; Cheng et al., 2015a), Model Predictive Control (MPC) (Pohjoranta et al., 2015; Chatrattanawet et al., 2017), fuzzy control (Wang et al., 2016; Hu et al., 2010; Cheng et al., 2015a; Ou et al., 2017), and Neural Network Control (NNC) (Li et al., 2006; Li and Li, 2006). However, the inherent nonlinearity of the PEMFC system and the uncertainty of its model parameters greatly limit the effectiveness of these control methods. Since these algorithms cannot adapt easily to the nonlinearity of the PEMFC environment, and in many cases possess an overly complex architecture, the scope for their application in practice is greatly restricted (Sun et al., 2020). For these reasons, the PEMFC requires a model-free algorithm that is guided by simple control principles and can perform parameter tracking without relying on a model of the PEMFC (Yang et al., 2021a; Yang et al., 2021b; Yang et al., 2021c). The Deep Deterministic Policy Gradient (DDPG) algorithm in deep reinforcement learning (Lillicrap et al., 2015) is a model-free method (Yang et al., 2018; Yang et al., 2019a; Yang et al., 2019b). Due to its strong adaptive ability, the DDPG algorithm can adapt to the uncertainty inherent in nonlinear control systems, and it is applied in various control fields (Zhao et al., 2020; Zhang et al., 2021). However, due to its low robustness, DDPG is rarely used in the PEMFC control field.
In recognition of the low robustness of the DDPG algorithm, in this paper we propose an enhancement of the DDPG algorithm which can be used for PEMFC stack temperature control. We combine the enhanced algorithm with a PID controller to obtain a deep-reinforcement-learning-based PID controller that realizes more accurate stack temperature control in the PEMFC environment. For this purpose, an Improved Twin Delayed Deep Deterministic Policy Gradient (ITD3) algorithm operates as a tuner of the PID controller, adjusting the coefficients of the controller in real time. The algorithm accelerates the learning of the agent by continuously changing the soft update parameter during training, which improves training efficiency, reduces training cost, and yields a robust strategy.
The innovations detailed in this paper are as follows:
1) A deep-reinforcement-learning based PID controller for realizing optimal stack temperature control in the PEMFC is proposed.
2) The Improved Twin Delayed Deep Deterministic Policy Gradient (ITD3) algorithm is proposed as a tuner of the PID controller, as it can adjust the coefficients of the controller in real time.
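The idea of a PID controller whose coefficients are supplied by an external tuner can be illustrated with a minimal sketch. The class `TunablePID` and its method names are hypothetical and are not taken from the paper; in the proposed scheme the values passed to `set_gains` would come from the trained agent, which is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class TunablePID:
    """Discrete PID whose gains may be changed between steps (hypothetical helper)."""
    kp: float
    ki: float
    kd: float
    dt: float
    integral: float = 0.0
    prev_error: float = 0.0

    def set_gains(self, kp: float, ki: float, kd: float) -> None:
        # Assumption: an external tuner (e.g. an RL agent) supplies new gains.
        self.kp, self.ki, self.kd = kp, ki, kd

    def step(self, setpoint: float, measurement: float) -> float:
        # Standard discrete PID law on the tracking error.
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The integral and previous-error states are kept when the gains change, so retuning at every control step does not reset the controller. For a cooling loop, the sign of the error or of the gains would have to be chosen so that a higher stack temperature demands more cooling water flow.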

PEMFC HEAT MANAGEMENT SYSTEM
Heat Management System
To keep the fuel cell stack operating in a safe, stable and efficient state, it is necessary to sustain a suitable temperature range. This is the core principle of the PEMFC heat management system. We propose a heat management system model for a water-cooled PEMFC, the design parameters of which reflect the law of conservation of energy. The heat management system in the water-cooled PEMFC stack adjusts the internal temperature of the fuel cell by controlling the temperature and flow rate of the cooling water entering and leaving the stack, which determines the heat carried away by the cooling water. The system comprises a cooling water circulating pump, a radiator fan, a controller and a sensor. The cooling water pump circulates the cooling water through the stack at a certain flow rate. As it passes through the stack, the cooling water absorbs and removes heat, and so its temperature increases. When the cooling water leaving the stack flows through the radiator, the radiator fan rotates to create convection between the air and the cooling water, so that excess heat is removed and the inlet temperature of the cooling water is restored to an acceptable level. In the method proposed in this paper, the cooling water flow rate is treated as the control quantity: the stack temperature is controlled by adjusting the cooling water flow rate, and the radiator is set to run at a fixed speed that meets the heat dissipation requirements.

PEMFC Stack Temperature
According to the law of the conservation of energy, when the hydrogen in the PEMFC reacts with oxygen, all the chemical energy released is converted into electric energy and heat. Then, according to the heat balance equation $Q = CM\Delta T$, excluding the effective power generation and heat from various channels, the remaining energy of the generated chemical energy will affect the internal temperature of the reactor, and so the temperature change of the reactor per unit of time is closely related to the heat generation and heat dissipation of the reactor [44]: (1)

Chemical Energy
According to Figure 1, the chemical energy converted from hydrogen per unit time is:

Electric Power
The output electric power of the PEMFC stack is:

Gas Cooling
The stack gas cooling system is designed in accordance with the laws of conservation of energy and matter. Gas and water are consumed and generated in the stack. Based on the energy difference between the intake and the exhaust, the heat carried away by the exhaust can be calculated as follows:

Circulating Cooling Water for Heat Dissipation
Circulating cooling water is the main method for dissipating heat in the PEMFC stack. The circulating water pump provides pressure, which drives the cooling water through the stack at a certain flow rate, thus removing excess heat, so that the stack can operate at a safe and efficient temperature. The heat dissipation is calculated as follows:
The cooling water absorbs heat when it flows through the stack, and so the outlet temperature is much higher than the inlet temperature. In order to ensure that the inlet temperature remains at 339.15 K, a cooling fan supplies air flow sufficient for transferring heat from the cooling water to the air. The relationship is as follows:
$$Q_{\mathrm{water}} = C_{\mathrm{water}} W_{\mathrm{water}} (T_{\mathrm{st}} - T_{\mathrm{in}}) = C_{\mathrm{air}} W_{\mathrm{air}} (T_{\mathrm{fan,out}} - T_{\mathrm{atm}}) \quad (6)$$
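The balance in Eq. 6 can be worked numerically. The sketch below assumes that both sides of Eq. 6 are equal, that the specific heats are in J/(kg K) and the mass flow rates in kg/s, and uses hypothetical function names; none of the numerical values used with it come from the paper.

```python
def water_heat_removal(c_water, w_water, t_stack, t_in):
    """Heat flow (W) taken up by the cooling water: C_water * W_water * (T_st - T_in)."""
    return c_water * w_water * (t_stack - t_in)


def fan_outlet_temperature(q_water, c_air, w_air, t_atm):
    """Air temperature after the radiator, solving
    Q_water = C_air * W_air * (T_fan,out - T_atm) for T_fan,out."""
    return t_atm + q_water / (c_air * w_air)
```

For example, with an assumed water flow of 0.1 kg/s, a specific heat of 4186 J/(kg K) and a stack temperature 10 K above the 339.15 K inlet temperature, the water removes roughly 4.2 kW, and the second function gives the air temperature rise needed to reject that heat for a given air flow.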

Heat Radiation
Any material of a sufficient temperature will radiate heat in the form of electromagnetic radiation, and the same is true for the PEMFC stack. The heat radiated by it is related to the temperature of the stack:
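Taken together, the terms of this section enter a single energy balance for the stack temperature. The sketch below is an assumption about how they combine: a lumped heat capacity, a net heat flow equal to chemical power minus electric power, gas cooling, water cooling and radiation, and one explicit Euler step. The function name and arguments are hypothetical, and the expressions for the individual terms are not reproduced.

```python
def stack_temperature_step(t_stack, q_chem, p_elec, q_gas, q_water, q_rad,
                           heat_capacity, dt):
    """Advance the stack temperature (K) by one explicit Euler step.

    Assumed lumped balance: heat_capacity * dT/dt =
        q_chem - p_elec - q_gas - q_water - q_rad   (all heat flows in W)
    heat_capacity is the product of stack mass and specific heat (J/K).
    """
    net_heat = q_chem - p_elec - q_gas - q_water - q_rad
    return t_stack + dt * net_heat / heat_capacity
```

When the removed heat equals the generated heat, the temperature stays constant; increasing `q_water` through a higher cooling water flow rate lowers the temperature, which is the lever used by the controller.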

INTELLIGENT CONTROL OF STACK TEMPERATURE BASED ON ITD3 ALGORITHM
DDPG
The Deep Deterministic Policy Gradient (DDPG) is an improved algorithm based on the Deep Q-Network (DQN), which effectively solves the problem of multi-dimensional continuous action output. In addition, like other model-free reinforcement learning algorithms, the DDPG algorithm is capable of black-box learning: it only needs the state, action, and reward value at runtime, rather than a detailed mathematical model of the system. The loss function of the current value network is calculated as follows:
The loss function of the real value network is calculated as follows:
By using the gradient descent method to find the minimum value of the loss function $J(\theta^{p})$, the maximum action value $Q(S_t^j, A_t^j \mid \theta^{Q})$ can be determined. The target value network and target strategy network are updated in the following ways:
Here τ is the soft update coefficient, and thus the update speed of the neural network can be controlled by adjusting τ. In order to avoid excessive updating of the neural network, τ usually ranges between 0.01 and 0.1. The update frequency of the target value network and the target strategy network is specified by the parameter f: every time the time step t reaches an integer multiple of f, the target network is updated once.
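The soft update of the target networks can be sketched as follows. This uses the standard DDPG rule, in which each target parameter moves a fraction τ toward the corresponding online parameter; parameters are represented as plain lists of floats, and the function names are hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Standard soft update: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]


def maybe_update_target(step, update_every, target_params, online_params, tau):
    """Apply the soft update only when step is an integer multiple of update_every (f)."""
    if step % update_every == 0:
        return soft_update(target_params, online_params, tau)
    return list(target_params)
```

A small τ makes the target networks change slowly, which stabilizes the bootstrapped targets at the price of slower learning.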

Clipped Double Q-Learning
According to Figure 2, in ITD3, the Clipped Double Q-learning method is used to calculate the target value:
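In the standard TD3 formulation, this target uses the smaller of the two target critics' estimates for the next state, which counteracts overestimation of the action value. A minimal sketch under that assumption follows; the discount factor `gamma`, the `done` flag and the function name are not taken from the paper.

```python
def clipped_double_q_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """TD target y = r + gamma * min(Q1', Q2'), without bootstrapping at episode end."""
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return reward + bootstrap
```

Both critics are then regressed toward the same target value.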

Policy Delay Update
The actor network is updated once after every d updates of the critic network. This ensures that the actor network is updated with a low Q error, which improves the update efficiency of the actor network.
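The update schedule can be sketched as a loop that counts updates. The function below is a hypothetical illustration of the delay only; the actual gradient steps are replaced by counters.

```python
def run_delayed_updates(num_steps, policy_delay):
    """Count critic and actor updates when the actor is updated every policy_delay (d) steps."""
    critic_updates = 0
    actor_updates = 0
    for step in range(1, num_steps + 1):
        critic_updates += 1              # the critics are updated at every step
        if step % policy_delay == 0:
            actor_updates += 1           # the actor is updated only every d-th step
    return critic_updates, actor_updates
```

With d = 2, the critics receive twice as many updates as the actor.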

Smooth Regularization of Target Strategy
The ITD3 algorithm introduces a regularization method for reducing the variance of the target value, and smooths the Q value estimation by bootstrapping the estimated value of similar state-action pairs:
$$y_t = r(s_t, a_t) + \mathbb{E}_{\varepsilon}\left[Q_{\theta'}\left(s_{t+1}, \pi_{\phi'}(s_{t+1}) + \varepsilon\right)\right]$$
Smooth regularization is achieved by adding a random noise to the target strategy and averaging on the mini-batch:
Frontiers in Energy Research | www.frontiersin.org September 2021 | Volume 9 | Article 763099 3 Li et al.
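The noise added to the target strategy can be sketched as in standard TD3, where Gaussian noise is clipped to a small range before it is added to the target action, and the result is clipped to the valid action range. The noise scale `sigma`, the clip value `noise_clip`, the action bounds and the function name are assumptions for illustration.

```python
import random


def smoothed_target_action(target_action, sigma, noise_clip,
                           action_low, action_high, rng):
    """Target action plus clipped Gaussian noise, limited to the action range."""
    noise = rng.gauss(0.0, sigma)
    noise = max(-noise_clip, min(noise_clip, noise))   # keep the noise small
    return max(action_low, min(action_high, target_action + noise))
```

Passing a seeded `random.Random` instance as `rng` makes the sampled noise reproducible. Averaging the resulting targets over a mini-batch approximates the expectation over ε.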

Changeable Soft Update Coefficient
The DDPG algorithm uses a soft update method to update the target deep neural network parameters; however, with a fixed coefficient this method undermines the training efficiency of the DDPG algorithm and increases the training cost. To overcome this problem, the soft update coefficient is increased as the number of episodes increases, as detailed below:
In the experiment, the operating time in the working condition is 120 s.
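One possible form of the changeable soft update coefficient described above is a schedule that grows with the episode index. The linear schedule below is purely an assumption for illustration and is not the expression used in the paper; the bounds 0.01 and 0.1 follow the usual range of τ, and the function name is hypothetical.

```python
def scheduled_tau(episode, total_episodes, tau_start=0.01, tau_end=0.1):
    """Soft update coefficient that increases linearly with the episode index (assumed form)."""
    fraction = min(max(episode / total_episodes, 0.0), 1.0)
    return tau_start + (tau_end - tau_start) * fraction
```

Under this assumed schedule, early episodes update the target networks cautiously, and later episodes let them follow the online networks more quickly.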
Figure 3 shows that when the load current changes in steps, the ITD3 controller realizes stack temperature control and regulates the output characteristics of the PEMFC more effectively than the other algorithms. The overshoot of the output voltage is small and the response is quick. The ITD3 controller has better adaptive ability and robustness: it responds faster in the early stage of a disturbance to restore the temperature and is more stable in the later stage of the disturbance, which leads to less overshoot of the stack temperature and no static error once the system is stable. In addition, because the proposed method learns from a large number of samples under different load conditions during offline training, it is highly adaptive and robust, and it can automatically arrive at the best decision for the current state from the collected PEMFC state. Therefore, the proposed method can smoothly control the stack temperature and obtain better control performance under variable load disturbances. By comparison, the TD3 algorithm, DDPG algorithm and NNC algorithm are less robust due to their low exploration ability and excessive reliance on samples. The other conventional algorithms in the simulation lack the capacity for adapting to the time-varying characteristics and nonlinearity of the PEMFC environment.
The ITD3 algorithm has better static and dynamic performance and is able to control the output voltage more effectively than the existing algorithms involved in the simulation.

CONCLUSION
In this paper, we have proposed a deep-reinforcement-learning-based PID controller for optimal stack temperature control of the PEMFC. To this end, we have devised and tested what we term the ITD3 algorithm. This serves as the tuner of the PID controller by adjusting the coefficients of the controller in real time. The algorithm introduces Clipped Double Q-learning, delayed strategy updates, smooth regularization of the target strategy, and a changeable soft update coefficient into the training process in order to speed up agent learning, thereby improving agent training efficiency, reducing training costs, and obtaining a robust strategy.
The simulation results indicate that the proposed control algorithm can achieve effective control of the temperature of the PEMFC stack. In addition, it has been compared with other control methods, including other RL control methods, the adaptive FOPID algorithm, the adaptive PID algorithm, the PID algorithm with optimized parameters, and the neural network control algorithm. In summary, the results demonstrate that the proposed control method achieves better control performance and robustness.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

FUNDING
This work was jointly supported by the National Natural Science Foundation of China (U2066212).