- 1Electric Power Research Institute, China Southern Power Grid, Guangzhou, China
- 2Guangzhou Power Supply Bureau of Guangdong Power Grid Co., Ltd., Guangzhou, China
In modern power systems, the Modular Multilevel Converter (MMC) plays an important role thanks to its advantages of convenient maintenance and easy expansion. However, its structure inherently gives rise to circulating currents, which pose a challenge to stable operation, so developing efficient and reliable circulating current suppression technology is of great significance. This paper introduces a Deep Reinforcement Learning (DRL) method for adaptive tuning of controller parameters, addressing the difficulty of parameter adjustment in MMC circulating current suppression strategies that employ a quasi-PR controller. It analyzes the feasibility of using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to tune the parameters of the PR controller, and designs a suitable neural network and reward function to train the control agent. Simulation results demonstrate the superiority of the TD3-based adaptive quasi-PR controller over the traditional fixed-parameter quasi-PR controller: the adaptive controller suppresses the MMC circulating current more effectively, with a faster dynamic response and a lower total harmonic distortion (THD). This provides an effective solution for promoting the large-scale application of MMCs and enhancing the performance of power systems.
1 Introduction
The Modular Multilevel Converter (MMC) has become a preferred choice in medium- to high-voltage modern power systems, owing to its distinctive advantages such as scalability, low switching losses, and ease of maintenance (Aslam and Raza, 2025; Sánchez-Sánchez et al., 2020). Relevant studies (Steckler et al., 2022; Zhang et al., 2025; Nougain et al., 2021; Farias et al., 2021; Vipin and Mohan, 2025) demonstrate that in high-voltage direct current (HVDC) transmission scenarios, MMCs effectively enhance transmission efficiency, reduce line energy losses, and ensure stable and efficient power delivery. Regarding power quality management, MMCs can precisely compensate harmonic components and rationally regulate reactive power, thereby significantly improving grid power supply quality. Leveraging these outstanding features, MMCs have become indispensable core power conversion equipment in emerging power engineering applications such as renewable energy grid integration and flexible AC transmission systems.
However, MMCs inherently suffer from structural limitations during practical operation. Since the three-phase arms share a common DC link and energy storage elements are distributed within individual submodules, capacitor voltage imbalance inevitably occurs across the arms during steady-state operation. This imbalance induces circulating currents within the converter (Dinkel et al., 2022). As elaborated in reference (Luo et al., 2023), while these circulating currents do not directly affect the output current, they distort the arm currents, which substantially raises the converter’s power losses and jeopardizes the stable operation of MMCs. The increased power losses not only reduce energy utilization efficiency but also lead to device overheating, shortened equipment lifespan, and potentially even system failures. Consequently, developing efficient and reliable techniques for suppressing inter-phase circulating currents in MMCs has become a critical research focus to enable their large-scale deployment. This advancement holds considerable practical value, as it enhances the stability, reliability, and economic performance of power systems.
To address the issue of circulating current suppression in MMCs, the academic community has developed several methodological approaches (Li and Zhu, 2025). Traditional PI control was an early common solution. A plug-in repetitive control scheme is presented in Reference (He et al., 2015), which combines the fast dynamic response of a PI controller with the excellent steady-state harmonic suppression of a repetitive controller while minimizing their mutual interference. However, it still depends on a PI controller with well-chosen parameters to achieve the best effect.
To improve the adaptability of the PI controller, fuzzy adaptive technology has been introduced. Reference (Chao et al., 2023) proposes a fuzzy adaptive PI circulating current suppression controller, which continuously adapts the PI parameters in real time through fuzzy logic, using both the system state error and its derivative as inputs. Simulation results indicate a marked improvement compared to conventional PI control. Reference (Li et al., 2019) puts forward a hybrid particle swarm optimization (HPSO) algorithm, which merges the strengths of particle swarm optimization and the genetic algorithm, dynamically adjusts the inertia weight, and introduces a disturbance mechanism to avoid local optima. It is used to optimize the PI parameters of the circulating current suppressor. Simulation results demonstrate that the proposed approach can reduce the circulating current amplitude by 83.33% and markedly enhances the converter’s dynamic response. Reference (Fang et al., 2023) presents a hybrid linear predictive control framework that integrates Model Predictive Control (MPC) with a PIR controller. This design eliminates the need for coordinate transformation and phase decoupling, enables direct and static-error-free control of AC components, and achieves linear circulating current control. However, MPC entails a large computational burden and a certain degree of control complexity. Reference (Zuo et al., 2020) proposes a method for the adaptive adjustment of quasi-PR controller parameters, which realizes circulating current compensation by combining the controller with proportional negative feedback of the arm current. Only three quasi-PR controllers are needed to suppress the three-phase circulating currents, simplifying the system design. An optimization strategy for the proportional negative feedback link, integrating the traditional PR controller with the arm current, is proposed in Reference (Shi et al., 2021). Although it enhances system stability, it does so at the cost of increased system complexity, which complicates parameter tuning. Reference (Wang et al., 2018) proposes an adaptive quasi-PR circulating current suppression method with feedforward compensation, analyzes in detail the influence of the controller’s resonance coefficient on the system, gives adjustment rules for the resonance coefficient, and shows that the scheme suppresses the circulating current effectively. This method needs only one resonant controller to suppress all even-order circulating currents simultaneously and can adapt to different load sizes.
With the advancement of artificial intelligence (AI) technology, reinforcement learning has shown significant potential for control applications and has already been applied in fields such as power electronics and power systems (Shi et al., 2021; Chen et al., 2024; Jiang et al., 2021; Cao et al., 2020). The application of reinforcement learning in control systems is mainly reflected in two aspects. The first is direct control of the system. For example, Ye et al. applied reinforcement learning methods to buck converters and single-inductor multiple-output (SIMO) converters, verifying the feasibility of using the DDPG and TD3 algorithms for direct control of power electronic converters (Gheisarnejad et al., 2021; Ye et al., 2024a; Ye et al., 2024b). The second is parameter tuning and optimization. To improve the speed tracking accuracy of brushless DC motors, Reference (Lu et al., 2021) combines the Deep Deterministic Policy Gradient (DDPG) algorithm with PID control, with the reinforcement learning algorithm compensating the controller’s proportional, integral, and derivative components. Reference (Park et al., 2022) introduces a dynamic PI gain self-tuning method using DRL. It employs the DQN algorithm to train agents in a simulation environment, producing a reference gain table. Vehicle tests show that this method reduces the root mean square error by nearly 46.8% compared to traditional fuzzy PI control. Reference (Kumar and Detroja, 2022) designs a parameterized adaptive controller based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. By dynamically adjusting the action space and reward function, it achieves faster convergence and better performance in unstable system control. These studies have verified the advantages of reinforcement learning in the adaptive optimization of controller parameters, providing new ideas for parameter optimization in MMC circulating current suppression.
Synthesizing existing research, traditional control methods exhibit limitations in parameter adaptability and multi-harmonic suppression. In contrast, reinforcement learning possesses capabilities for dynamic decision-making and adaptive optimization. Combining RL with PR controllers holds promise for realizing real-time optimization of circulating current suppression parameters. Building on this concept, a reinforcement learning-based PR controller can be proposed: using the circulating current error and its rate of change as state inputs, the PR controller parameter adjustments as action outputs, and designing an objective function incorporating error penalties and convergence rewards. A DRL algorithm is then employed to train an agent to autonomously learn parameter adjustment strategies. This framework aims to achieve precise suppression of multi-frequency circulating currents in MMCs, enhancing system stability and robustness under complex operating conditions. Ultimately, the performance of the proposed control strategy was confirmed through Matlab/Simulink simulations.
2 Mathematical models and problem description
2.1 MMC topology and circulating current mechanism
The typical topology of the MMC is shown in Figure 1. As shown, the MMC has six arms, each consisting of N identical sub-modules (SMs) in series with an inductor Lm and a resistor Rm. The upper and lower arms of the same phase form a phase unit. Each sub-module is a half-bridge consisting of two series-connected IGBTs with anti-parallel diodes, connected in parallel with a DC energy-storage capacitor. The output voltages of the upper and lower arm sub-modules are upj and unj respectively, and the system-side voltages and currents are usj and isj (j = a, b, c) respectively.
In the ideal operating state, the switching of the upper- and lower-arm sub-modules in each phase of the modular multilevel converter follows a complementary pattern: if each arm contains N sub-modules, the number of sub-modules inserted in the upper arm (m) and lower arm (n) at any time must satisfy the constraint m + n = N. Under this condition, the capacitor voltage of each sub-module should be stably maintained at the rated value of Udc/N.
However, in practical engineering scenarios, the frequent switching operations of the submodules inevitably induce capacitor voltage fluctuations. These voltage fluctuations prevent the output voltages of all three-phase units from maintaining consistent levels, consequently leading to circulating currents. Given the structural symmetry of the MMC’s three-phase arms, phase A can be selected for detailed analysis.
The system analysis is simplified using the single-phase MMC equivalent circuit shown in Figure 2. Applying Kirchhoff’s voltage and current laws, the mathematical model of the single-phase MMC can be derived as Equation 1:
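Under the sign convention that the upper-arm current ipj flows from the positive DC rail toward the AC terminal and the lower-arm current inj flows from the AC terminal toward the negative rail (so that isj = ipj − inj), the arm voltage equations take the standard form sketched below; this is a reconstruction from the surrounding definitions, and the exact convention of Figure 2 may differ.

```latex
% Single-phase arm voltage equations (Equation 1, reconstructed in a standard form):
\begin{aligned}
\frac{U_{dc}}{2} - u_{pj} - L_m\frac{\mathrm{d}i_{pj}}{\mathrm{d}t} - R_m i_{pj}
  &= u_{sj} + L_s\frac{\mathrm{d}i_{sj}}{\mathrm{d}t} + R_s i_{sj},\\
-\frac{U_{dc}}{2} + u_{nj} + L_m\frac{\mathrm{d}i_{nj}}{\mathrm{d}t} + R_m i_{nj}
  &= u_{sj} + L_s\frac{\mathrm{d}i_{sj}}{\mathrm{d}t} + R_s i_{sj}.
\end{aligned}
```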
where Lm, Rm, Ls, and Rs are the arm inductance, arm resistance, AC-side inductance, and AC-side resistance, respectively.
Based on Equation 1, by eliminating the AC grid voltage usj and the DC-link voltage Udc, we obtain the following Equation 2:
The expressions for the differential-mode voltage udiffj, common-mode voltage ucomj, and the circulating current icirj flowing through the arms are given by:
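These quantities are commonly defined as follows; the form below is a standard definition consistent with the averaging of the arm currents described in Section 2.2, and the authors’ exact notation in Equation 3 may differ slightly.

```latex
% Differential-mode voltage, common-mode voltage, and circulating current:
u_{\mathrm{diff}j} = \frac{u_{nj} - u_{pj}}{2},\qquad
u_{\mathrm{com}j}  = \frac{u_{pj} + u_{nj}}{2},\qquad
i_{\mathrm{cir}j}  = \frac{i_{pj} + i_{nj}}{2}.
```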
By combining Equation 1 and Equation 3, further simplification can be obtained as follows:
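Substituting these definitions decouples the AC-side and circulating-current dynamics. In their standard form, consistent with the equivalent inductance L = Ls + Lm/2 and resistance R = Rs + Rm/2 quoted below Figure 3, the two resulting relations (presumably Equations 4 and 5) read:

```latex
% Decoupled AC-side and circulating-current dynamics (reconstructed):
\begin{aligned}
u_{\mathrm{diff}j} - u_{sj} &= \Bigl(L_s + \tfrac{L_m}{2}\Bigr)\frac{\mathrm{d}i_{sj}}{\mathrm{d}t}
                              + \Bigl(R_s + \tfrac{R_m}{2}\Bigr) i_{sj},\\
\frac{U_{dc}}{2} - u_{\mathrm{com}j} &= L_m\frac{\mathrm{d}i_{\mathrm{cir}j}}{\mathrm{d}t}
                              + R_m\, i_{\mathrm{cir}j}.
\end{aligned}
```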
Analyzing Equations 4, 5 yields the mathematical models of the system’s AC-equivalent and DC-equivalent currents, as shown in Figure 3.
Figure 3. Small signal equivalent model. (a) AC small signal equivalent model (b) DC small signal equivalent model.
As shown in Figure 3, L = Ls + Lm/2 represents the equivalent inductance of the system, and R = Rs + Rm/2 represents the equivalent resistance of the system. It can be seen from Equations 4, 5 that the AC current isj can be controlled by adjusting the differential-mode voltage udiffj, thereby regulating the AC-side power, while the arm circulating current icirj can be controlled by adjusting the common-mode voltage ucomj.
2.2 Circulation suppression strategies
Due to voltage fluctuations in the sub-modules, maintaining a consistent total output voltage from the upper and lower arms of each phase is difficult during stable system operation. Consequently, a voltage difference exists between each phase unit and the DC bus, which in turn induces internal circulating currents. Through analysis, the expression of the circulating current can be obtained as follows:
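A standard expression for the circulating current, consistent with the components listed below, is sketched here; the exact form of Equation 6 is an assumption, with ω denoting the fundamental angular frequency and the higher-order terms lumped into the trailing ellipsis.

```latex
% Circulating current of phase j (reconstructed form of Equation 6):
i_{\mathrm{cir}j} = \frac{I_{dc}}{3}
  + I_{\mathrm{cir}j1}\sin(\omega t + \varphi_1)
  + I_{\mathrm{cir}j2}\sin(2\omega t + \varphi_2) + \cdots
```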
where Idc is the DC bus current; Icirj1 is the amplitude of the fundamental-frequency component of the circulating current; and Icirj2 is the amplitude of the second-harmonic component of the circulating current.
From Equation 6, it can be concluded that the system’s circulating current mainly consists of a DC component, a fundamental-frequency component, a second-harmonic component, and higher-order harmonic components. The proportion of the higher-order harmonic components is relatively small and can be neglected. The DC component flows in the DC line and serves as the operating current for normal operation of the DC side. The second-harmonic component has negative-sequence characteristics; it neither flows into the DC side of the converter nor appears on the AC side, but circulates entirely within the three-phase arms. During normal operation, this causes unnecessary losses in the power electronic devices, reducing system efficiency and reliability. To improve efficiency, the second-harmonic component of the circulating current should be eliminated. The mathematical expression of the three-phase circulating current in the system is shown in Equation 7:
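Because the dominant second-harmonic component forms a negative-sequence set, the three-phase circulating currents are usually written in the form below; this is a reconstruction under that assumption, with Icir2 the common second-harmonic amplitude and φ0 its initial phase.

```latex
% Three-phase circulating currents with DC and negative-sequence 2nd-harmonic terms:
\begin{aligned}
i_{\mathrm{cira}} &= \frac{I_{dc}}{3} + I_{\mathrm{cir}2}\sin(2\omega t + \varphi_0),\\
i_{\mathrm{cirb}} &= \frac{I_{dc}}{3} + I_{\mathrm{cir}2}\sin\!\Bigl(2\omega t + \varphi_0 + \frac{2\pi}{3}\Bigr),\\
i_{\mathrm{circ}} &= \frac{I_{dc}}{3} + I_{\mathrm{cir}2}\sin\!\Bigl(2\omega t + \varphi_0 - \frac{2\pi}{3}\Bigr).
\end{aligned}
```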
In the formula, ω is the fundamental angular frequency, Icir2 is the common amplitude of the second-harmonic circulating current, and φ0 is its initial phase.
An ideal Proportional-Resonant (PR) controller can achieve zero steady-state error tracking of AC quantities at its designated frequency and enables per-phase control, eliminating the need for inter-phase decoupling. However, owing to its inherently narrow bandwidth and poor disturbance rejection, the ideal PR controller’s gain drops sharply when the frequency deviates from the nominal value, which degrades suppression performance. The quasi-PR (QPR) controller is more robust: its wider bandwidth keeps the gain high in the vicinity of the resonant frequency, so frequency variations have far less impact on suppression performance.
This paper proposes a suppression strategy that employs a Quasi-PR controller as its core framework, augmented by reinforcement learning-based dynamic parameter optimization to enhance system stability and robustness. The transfer function of the Quasi-PR controller is given by Equation 9:
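The quasi-PR controller is widely written in the standard form below; Equation 9 in this paper is presumed to follow this structure, with the symbols defined immediately after.

```latex
% Quasi-PR controller transfer function:
G_{QPR}(s) = k_P + \frac{2\,k_R\,\omega_c\, s}{s^2 + 2\,\omega_c\, s + \omega_0^2}
```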
where: kP is the proportional gain; kR is the resonant gain; ωc is the cutoff frequency; ω0 is the resonant frequency. The cutoff frequency ωc determines the system bandwidth and open-loop gain. Assuming a grid frequency fluctuation of ±0.5 Hz and a quasi-PR controller bandwidth of ωc/π = 1 Hz, it follows that ωc = π rad/s ≈ 3.14 rad/s. Based on this and the MMC topology, an adaptive quasi-PR controller-based MMC circulating current suppression structure is designed, as illustrated in Figure 4.
Taking phase A as an example: the circulating current icira is calculated as the average of the upper and lower arm currents (ipa and ina). Subtracting the DC component Idc/3 yields the second-harmonic component icira_2. Since this paper focuses solely on suppressing the dominant second-harmonic component (higher-order harmonics being negligible), the goal is to drive the AC content of this second-harmonic arm circulating current to zero. To achieve this, the following steps are taken (a discrete-time sketch of the resulting loop is given after the list):
1. icira_2 is compared to its reference value of 0, resulting in the error signal -icira_2.
2. This error signal -icira_2 is fed into a quasi-PR controller for tracking.
3. By setting the resonant frequency ω0 of the quasi-PR controller to twice the fundamental grid frequency (2ωg), the controller effectively reduces the second-harmonic component within the circulating current.
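To make this loop concrete, the Python sketch below discretizes the quasi-PR controller with the Tustin (bilinear) transform and applies it to the second-harmonic error each sampling period. It is illustrative only: the gains kP and kR, the 50 Hz grid frequency, and the class/variable names are assumptions rather than values from the paper, apart from the 2 × 10−5 s step size quoted in Section 4 and ωc = π rad/s from the bandwidth discussion above.

```python
import math

class QuasiPR:
    """Discrete quasi-PR controller G(s) = kP + 2*kR*wc*s / (s^2 + 2*wc*s + w0^2),
    discretized with the Tustin transform at sample time Ts."""

    def __init__(self, kP, kR, wc, w0, Ts):
        self.Ts, self.wc, self.w0 = Ts, wc, w0
        self.set_gains(kP, kR)
        self.e1 = self.e2 = 0.0   # previous error samples
        self.y1 = self.y2 = 0.0   # previous resonant-path outputs

    def set_gains(self, kP, kR):
        """Recompute the IIR coefficients; in the adaptive scheme the trained agent
        would call this every control interval to retune kP and kR online."""
        K = 2.0 / self.Ts
        a0 = K**2 + 2.0 * self.wc * K + self.w0**2
        self.kP = kP
        self.a1 = 2.0 * (self.w0**2 - K**2) / a0
        self.a2 = (K**2 - 2.0 * self.wc * K + self.w0**2) / a0
        self.b0 = 2.0 * kR * self.wc * K / a0
        self.b2 = -self.b0            # b1 = 0 for this transfer function

    def update(self, error):
        """One control step; error = 0 - icira_2 (reference minus 2nd-harmonic current)."""
        y = self.b0 * error + self.b2 * self.e2 - self.a1 * self.y1 - self.a2 * self.y2
        self.e2, self.e1 = self.e1, error
        self.y2, self.y1 = self.y1, y
        return self.kP * error + y    # common-mode voltage correction command

# Example with hypothetical gains; a 50 Hz grid is assumed, so the resonance sits at 2*wg.
ctrl = QuasiPR(kP=1.0, kR=50.0, wc=math.pi, w0=2 * 2 * math.pi * 50, Ts=2e-5)
u_com_correction = ctrl.update(0.0)   # fed with -icira_2 at every sampling instant
```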
The control objective of this paper is to intelligently tune the parameters kP and kR in the quasi-PR controller using a DRL algorithm, enabling the quasi-PR controller to adaptively adjust its parameters according to different operating conditions. This will allow for rapid suppression of circulating currents. Additionally, the control effect of this method will be compared with that of a controller with fixed PR parameters.
3 Deep reinforcement learning and training strategies
This section enhances the traditional quasi-PR controller by using reinforcement learning for parameter tuning, and introduces the basic principles of the TD3 algorithm adopted in this paper. Then, on this basis, the training process of parameter tuning using the TD3 algorithm is elaborated. Finally, the complete structure diagram of the adaptive quasi-PR controller based on the reinforcement learning algorithm is presented.
3.1 Principle of deep reinforcement learning
As a major branch of artificial intelligence, DRL combines the perceptual capabilities of deep learning with the decision-making mechanisms of reinforcement learning, enabling agents to learn and optimize autonomously in complex environments. Formulated as a Markov decision process, this framework facilitates learning through dynamic interaction between the agent and its environment.
Specifically, the agent observes the state of the environment and outputs actions according to its policy. After the environment responds to the action, it feeds back a new state and an immediate reward, forming a closed-loop interaction chain of “state–action–reward–new state”. Reinforcement learning problems usually involve large state and action spaces as well as environmental uncertainty, but deep learning models can effectively learn and adapt to the complex patterns and features in these spaces, thereby improving the robustness of the system.
3.2 TD3 algorithm principle
The TD3 algorithm is an important improved method in the field of DRL for problems involving continuous action spaces. It builds on the Deep Deterministic Policy Gradient (DDPG) algorithm and the actor-critic framework, introducing multiple enhancement mechanisms to alleviate value overestimation and training instability. Compared with on-policy algorithms such as PPO, TD3 is generally more sample-efficient, making it well suited to environments where interactions are computationally expensive, such as the high-fidelity power electronics simulations used in this work. As a typical actor-critic implementation, the TD3 algorithm simultaneously maintains a policy network that generates deterministic actions and value networks that evaluate action values, optimizing both collaboratively through temporal-difference learning.
As shown in Figure 5, the core improvements of the TD3 algorithm are reflected in three key mechanisms. First, the dual Critic network structure trains two Q-networks (Q1 and Q2) with identical architecture but independent parameters in parallel. When calculating target values, the minimum of the two is selected as the target Q-value, effectively suppressing the policy bias caused by value overestimation in traditional DDPG. Second, the delayed policy update mechanism sets the update frequency of the Critic networks to twice that of the policy network (typically updating the Actor network only after every two updates of the Critic networks), reducing the interference of policy parameter fluctuations on value estimation and improving the stability of the training process. Finally, target action noise injection adds small random noise when generating actions for the target policy, with clipping operations to limit the noise range, enhancing the policy’s robustness against environmental disturbances while promoting more thorough exploration of the action space.
In the specific training process, the TD3 algorithm uses a replay buffer to store trajectory samples from agent-environment interactions. Random sampling is applied to break data correlations. The Critic networks are updated by minimizing a loss function, while the Actor network improves the policy by maximizing the Q-values from the Critic, guided by the deterministic policy gradient. Target network parameters are softly updated to track the main networks slowly, enhancing training stability.
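The first and third mechanisms are concentrated in the target-value computation. The following is a minimal PyTorch sketch of that step, assuming actor_target, critic1_target, and critic2_target are the target networks, actions are normalized to [−1, 1], and the hyperparameter values are illustrative rather than those in Table 1.

```python
import torch

def td3_target(critic1_target, critic2_target, actor_target,
               next_states, rewards, dones, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
    """Compute the TD3 target Q-value with target-policy smoothing and clipped double-Q."""
    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target action.
        next_a = actor_target(next_states)
        noise = (torch.randn_like(next_a) * noise_std).clamp(-noise_clip, noise_clip)
        next_a = (next_a + noise).clamp(act_low, act_high)
        # Clipped double-Q: take the element-wise minimum of the two target critics.
        q_min = torch.min(critic1_target(next_states, next_a),
                          critic2_target(next_states, next_a))
        # Bootstrapped target used in both critics' loss functions.
        return rewards + gamma * (1.0 - dones) * q_min
```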
These designs enable the TD3 algorithm to exhibit better convergence and stability than DDPG in continuous control tasks. Especially in high-dimensional action spaces and environments with sparse rewards, its ability to balance exploration and exploitation through these mechanisms is particularly evident, making it an important technical choice for DRL in practical scenarios such as robot control and industrial optimization.
3.3 Training process
As mentioned earlier, in reinforcement learning, an agent generates actions based on its own policy using the state information of the environment. During the interaction with the environment, it receives rewards and updates the weights in its neural network according to these rewards. Therefore, the design of state variables, action variables, reward functions, and neural networks is crucial to the entire agent. Before introducing the design of the aforementioned variables, a brief description of the training process is given as follows: the simulation time for each episode is 1.5 s. Circulating current suppression is activated at 0.3 s. The active power steps from −2 per-unit (p.u.) to 2 p.u. at 0.6 s and subsequently steps back to −2 p.u. at 1 s.
3.3.1 State
The state variables, serving as the sole source of information for the agent to perceive the environmental dynamics, directly influence the decision-making quality and convergence efficiency of the policy network. Therefore, the state variables selected in this paper are given in Equation 10:
where: t is the simulation running time; Ms is a Boolean flag identifying the current operating phase of the system (Ms = 1 after an active power step change, Ms = 0 otherwise), which explicitly informs the agent of the current system phase and enables it to adapt its control strategy accordingly; Pref and Perr are the reference value and error of the active power, respectively; icirA is the circulating current in phase A; THD is the total harmonic distortion of the phase-A arm current. To eliminate the influence of differing dimensions among the state variables, normalization is applied, ensuring all state components reside within a comparable magnitude range.
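Assembling the components defined above (each normalized), the state vector presumably takes the form shown below; the ordering is an assumption.

```latex
% State vector (Equation 10, reconstructed from the component definitions):
s_t = \bigl[\, t,\; M_s,\; P_{\mathrm{ref}},\; P_{\mathrm{err}},\; i_{\mathrm{cirA}},\; \mathrm{THD} \,\bigr]
```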
It is worth mentioning that the inclusion of the simulation time t in the state vector serves to inform the agent about the specific phase of the training episode. The training episodes have a fixed structure, with events at 0.3 s, 0.6 s, and 1.0 s. This allows the agent to anticipate scheduled events, such as the activation of the controller or the power steps, and to adopt a more proactive control strategy, which aids convergence in the defined training scenario.
3.3.2 Action
Since the agent’s objective is to tune the parameters of the quasi-PR controller, the action space is defined as a two-dimensional vector containing the adjustments of {kP, kR}. According to reference (He et al., 2015), a larger resonant gain kR enhances the suppression of the second-harmonic circulating current. However, as kR increases, the closed-loop poles move closer to the imaginary axis, causing the system’s stability margin to diminish; consequently, kR cannot be excessively large. Taking the agent’s exploration requirements into account, the allowable ranges of the action-space parameters are defined in Equation 11:
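Because the numeric bounds in Equation 11 are specific to this design, the sketch below only illustrates the mechanism: the actor's tanh-bounded output in [−1, 1]² is affinely mapped onto the allowable (kP, kR) ranges. The bound values are placeholders, not the values used in the paper.

```python
import numpy as np

# Placeholder bounds -- the actual limits of Equation 11 are defined by the authors.
KP_MIN, KP_MAX = 0.1, 5.0     # hypothetical proportional-gain range
KR_MIN, KR_MAX = 10.0, 200.0  # hypothetical resonant-gain range (kept moderate for stability)

def action_to_gains(action):
    """Map a normalized TD3 action a in [-1, 1]^2 to controller gains (kP, kR)."""
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    kP = KP_MIN + (a[0] + 1.0) * 0.5 * (KP_MAX - KP_MIN)
    kR = KR_MIN + (a[1] + 1.0) * 0.5 * (KR_MAX - KR_MIN)
    return kP, kR
```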
3.3.3 Reward
In the reinforcement learning framework, the reward function serves as the core bridge connecting the agent’s decision-making and control objectives. Its rationality directly determines whether the agent can learn an optimization strategy that meets actual needs. For the problem of MMC circulating current suppression, an effective reward function needs to accurately reflect the system’s comprehensive requirements for active power control accuracy, circulating current suppression effect, and power quality. At the same time, it should guide the agent to make adaptive adjustments when the operating conditions change abruptly (such as active power steps). Therefore, the design of the reward function must take into account both multi-objective optimization and the specificity of dynamic operating conditions.
For the intelligent tuning of the PR controller parameters in MMC circulating current suppression, this paper designs the reward function as given in Equations 12–15:
This reward function consists of three core components: an active power error penalty, a circulating current amplitude penalty, and a total harmonic distortion (THD) penalty, and incorporates a time-segmented mechanism to meet the control demands under various operating conditions. Specifically:
a. The active power error penalty is derived from the absolute value of the per-unitized active power error. During the active power step period (0.6–1 s), the penalty is enhanced by a factor of 1.5 to strengthen the requirement for step response speed.
b. The circulating current amplitude penalty is calculated from the average of the absolute values of the three-phase circulating currents. It takes effect after circulating current suppression is activated at 0.3 s, and the penalty weight is increased by a factor of 2 during the step period. Additionally, under conventional fixed-parameter QPR control the average circulating current amplitude is about 50 A; therefore, to make the agent achieve a better suppression effect than the traditional QPR control and to ensure safe system operation, an additional penalty is imposed when the circulating current amplitude exceeds 40 A.
c. The THD penalty is included from 0.4 s onwards. It penalizes excessive harmonic distortion with 2% as the reference value, and imposes a more severe fixed penalty in extreme cases where the THD exceeds 10%, to guarantee power quality.
The final reward is obtained by summing all components, and the overall reward is limited to the range [−1,000, 100] to avoid the impact of numerical saturation on training stability.
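The reward logic described above can be summarized in the sketch below. The structure (time-segmented penalties, the 1.5× and 2× step-period weights, the 40 A and 10% hard-penalty thresholds, and the [−1,000, 100] clipping) follows the text, while the base weights w_p, w_c, w_thd and the fixed-penalty magnitudes are placeholders rather than the coefficients used in Equations 12–15.

```python
import numpy as np

def reward(t, p_err_pu, i_cir_abc, thd,
           w_p=1.0, w_c=1.0, w_thd=1.0,
           cir_limit_penalty=50.0, thd_limit_penalty=100.0):
    """Time-segmented reward for quasi-PR parameter tuning (illustrative weights)."""
    step_period = 0.6 <= t <= 1.0            # active power step interval

    # a) active power error penalty, strengthened by 1.5x during the power step
    r_p = -w_p * abs(p_err_pu) * (1.5 if step_period else 1.0)

    # b) circulating current penalty, active after suppression starts at 0.3 s,
    #    doubled during the step, plus an extra penalty above 40 A
    r_c = 0.0
    if t >= 0.3:
        i_avg = float(np.mean(np.abs(i_cir_abc)))
        r_c = -w_c * i_avg * (2.0 if step_period else 1.0)
        if i_avg > 40.0:
            r_c -= cir_limit_penalty

    # c) THD penalty from 0.4 s onward, referenced to 2%, with a severe fixed
    #    penalty when THD exceeds 10%
    r_thd = 0.0
    if t >= 0.4:
        r_thd = -w_thd * max(thd - 2.0, 0.0)
        if thd > 10.0:
            r_thd -= thd_limit_penalty

    # total reward, clipped to [-1000, 100] to avoid numerical saturation
    return float(np.clip(r_p + r_c + r_thd, -1000.0, 100.0))
```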
3.3.4 Neural network
The Actor network in the algorithm is shown in Figure 6. It takes the time t, condition signal Ms, active power reference value Pref, active power error Perr, circulating current icirA, and total harmonic distortion THD as inputs. Through multi-layer fully connected hidden layers containing learnable parameters θπ, it performs non-linear transformation and deep fusion of the input state information to capture complex feature relationships. Then, through the output layer and a scaling layer, the processed features are mapped to the control actions a1(t) and a2(t) executable by the converter, realizing the conversion from environmental states to control actions. The Critic network is divided into two paths, one for the state and one for the action; the network structure along each path is similar to that of the Actor network, as shown in Figure 7.
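A minimal PyTorch sketch of such an Actor network is given below. The six inputs and two scaled outputs follow the description above; the hidden-layer sizes (two layers of 128 units) and the tanh output followed by affine scaling are assumptions, since Figure 6 and Table 1 define the authors' actual architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the 6-dimensional state [t, Ms, Pref, Perr, icirA, THD] to the
    2-dimensional action (adjustments of kP and kR)."""

    def __init__(self, state_dim=6, action_dim=2, hidden=128,
                 action_low=(-1.0, -1.0), action_high=(1.0, 1.0)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bounded output in [-1, 1]
        )
        low, high = torch.tensor(action_low), torch.tensor(action_high)
        # Scaling layer: map [-1, 1] onto the allowed action range.
        self.register_buffer("scale", (high - low) / 2.0)
        self.register_buffer("bias", (high + low) / 2.0)

    def forward(self, state):
        return self.net(state) * self.scale + self.bias
```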
The selection of the agent’s training hyperparameters, listed in Table 1, is crucial for stable and efficient learning. The values were determined through an iterative tuning process, starting with commonly used values from DRL literature for continuous control tasks (Ye, et al., 2024b). Minor adjustments were then made based on the observed training stability and reward convergence from preliminary simulation runs. For instance, the learning rate of 0.0005 was found to provide a good balance between convergence speed and stability, avoiding large, unstable policy updates.
4 Training and simulation results
4.1 TD3 algorithm training results
The simulation step size is set to 2 × 10−5 s, and the total simulation duration per episode is 1.5 s. In each episode, starting from the initial state, Gaussian exploration noise with a given standard deviation is added to the action generated by the policy network. After the agent takes the action, the reward is calculated according to Equation 12, and the rewards at each time step in the episode are accumulated to obtain the cumulative episode reward. The cumulative rewards obtained over multiple episodes of simulation are shown in Figure 8.
The entire process, spanning over 140 episodes, took approximately 3–4 h to converge to a stable and high-performing policy. The duration is primarily influenced by the high fidelity of the MMC simulation model and the number of interaction steps required for the agent to learn effectively.
As can be seen from Figure 8, the reward of the agent fluctuates continuously during the 0–60 episodes. This is because the agent is exploring appropriate parameters, and different parameters have different adjustment effects on the system, thus leading to reward fluctuations. After 60 episodes, the reward tends to converge, indicating that the agent has learned an effective parameter optimization strategy for the system.
Figure 9 shows the control parameters to which the agent finally converged, from which it can be observed that the agent continuously adjusts kP and kR according to the operating conditions. This demonstrates a learned, logical relationship between the system state and the controller parameters, even if it is not a simple linear one.
4.2 Simulation results
This paper evaluates the feasibility of using an agent trained by the TD3 algorithm to adjust quasi-PR controller parameters. A simulation model of a three-phase MMC was developed on the Matlab/Simulink platform for this purpose. The system parameters are listed in Table 2.
The waveforms for phase A in the absence of a circulating current controller are depicted in Figure 10. Distortion of the arm current caused by high-order harmonics within the circulating current can be observed in Figure 10a, resulting in a non-sinusoidal waveform. The circulating current after removal of the DC component is shown in Figure 10b. It is evident that without suppression, this circulating current exhibits significant fluctuations, severely compromising MMC operational efficiency. Figure 10c presents the Fast Fourier Transform (FFT) analysis of the arm current, revealing a Total Harmonic Distortion (THD) of 26.31%. This verifies the presence of even-order harmonic components beyond the DC offset, with the second-harmonic component exhibiting particularly large amplitude fluctuations.
Figure 10. Pre-circulating current suppression waveform. (a) A-phase bridge arm current (b) A-phase circulating current (c) FFT analysis of A-phase bridge arm current.
Figure 11 depicts the results of circulating current suppression using a quasi-proportional-resonant (QPR) controller with fixed parameters. After the introduction of the QPR controller, the arm current waveform is improved to a certain extent. As can be seen from Figure 11b in conjunction with Figure 11c, the average circulating current amplitude decreases to 20 A after 0.4 s, with a positive peak value of 30.2 A and a negative peak value of 36.8 A. FFT results indicate that the second-harmonic component is significantly suppressed, with a THD of 3.09%. Although the QPR controller provides some suppression of the circulating current, its response time is relatively long, requiring at least 0.4 s to achieve the optimal suppression effect.
Figure 11. Fixed parameter QPR control. (a) A-phase bridge arm current (b) A-phase circulating current (c) FFT analysis of A-phase bridge arm current.
Figure 12 presents the results of adaptive parameter tuning of the QPR controller using the agent trained with the TD3 algorithm. It can be observed that after the control is activated, the arm current waveform is also significantly improved, and the circulating current amplitude is rapidly suppressed within 0.001 s. The average amplitude of the suppressed circulating current is comparable to that achieved by the fixed-parameter QPR controller, with a positive peak value of 30.6 A and a negative peak value of 37.1 A. FFT analysis reveals that the second-harmonic component of the circulating current is significantly suppressed, with a measured THD of 3.01%, and the dynamic response is improved compared with the fixed-parameter QPR controller.
Figure 12. Adaptive QPR control based on TD3 algorithm. (a) A-phase bridge arm current (b) A-phase circulating current (c) FFT analysis of A-phase bridge arm current.
To further analyze the dynamic characteristics of the agent-based control, a step change is applied to the per-unit value of the AC-side output power: at 0.6 s, the active power steps from −2 to 2, and then steps back from 2 to −2 at 1 s. The variations of the phase-A arm current and circulating current amplitude are observed, with the results presented in Figures 13, 14. It can be seen that both control strategies enable the phase-A arm current to recover to a steady state within 0.1 s when the load suddenly increases or decreases, and the difference in circulating current amplitude at steady state is negligible.
Figure 13. Waveforms of MMC when power changes——QPR. (a) A-phase circulating current (b) A-phase bridge arm current.
Figure 14. Waveforms of MMC when power changes——QPR + DRL. (a) A-phase circulating current (b) A-phase bridge arm current.
Comparing Figures 13a, 14a, when the load increases, the circulating current step under the adaptive QPR control is 112 A, whereas that under the fixed-parameter QPR controller is 376 A; when the load decreases, the circulating current step of the adaptive control is −104 A, while that of the fixed-parameter control is −198 A. A comprehensive comparison of the arm current and circulating current indicates that, compared with the fixed-parameter QPR control, the adaptive QPR control exhibits smaller overshoot and shorter settling time when the output power changes, which demonstrates the superiority of the proposed method.
Figures 15, 16 demonstrate the stability of the proposed control at the system level. Figure 15 displays the waveforms of the average capacitor voltage of all submodules in the upper arm of phase A under both fixed-parameter QPR control and adaptive DRL control, plotted in the same coordinate system. When circulating current suppression control is activated at 0.5 s, both control methods reduce the capacitor voltage ripple to some extent; however, the voltage ripple under fixed-parameter control is larger than that under adaptive DRL control, highlighting the effectiveness and robustness of the DRL approach. Figure 16 presents the three-phase upper-arm currents and the three-phase output AC currents. As shown in Figure 16a, when circulating current suppression is initiated, the second-harmonic component in the arm current is significantly reduced, resulting in a more sinusoidal current after 0.5 s. Figure 16b illustrates the three-phase output currents during steady-state operation, demonstrating that while suppressing the internal circulating currents, the MMC maintains uncompromised power quality in its external output.
Figure 16. Three-phase current waveforms. (a) Three-phase arm current waveforms of phase A (b) Three-phase output AC current waveforms.
5 Conclusion
The limitations of the conventional quasi-PR controller-based method for MMC circulating current suppression, which include difficulties in parameter tuning and the low adaptability of fixed parameters, are the primary focus of this paper. To solve these problems, a method for adaptive tuning of controller parameters through deep reinforcement learning is proposed. This paper analyzes the feasibility of using the TD3 algorithm in DRL to tune the parameters of the PR controller, trains an agent capable of adaptively changing parameters by designing a reasonable neural network and reward function, and finally compares the advantages and disadvantages of the adaptive strategy with the traditional quasi-PR method through simulation. The results confirm that the proposed adaptive QPR controller more effectively suppresses MMC circulating currents than the traditional quasi-PR controller. While achieving equivalent suppression, it operates with superior dynamic characteristics and a lower THD.
While this study demonstrates the effectiveness of the proposed method in a simulation environment, future work will focus on hardware-in-the-loop testing and physical deployment. The deployment process involves embedding the trained Actor network onto a real-time digital controller, such as a DSP or FPGA. Since the computationally intensive training process is performed offline, the real-time implementation only requires executing a forward pass of the lightweight Actor network each control cycle. The network’s modest size, consisting of a few hidden layers, implies that its memory footprint (for weights and biases) and computational requirements (for matrix multiplications) are well within the capabilities of standard industrial controllers. Future research will validate the real-time performance, including inference latency and resource utilization on a target hardware platform, to confirm its feasibility for practical applications.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
YC: Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing – original draft. XL: Conceptualization, Writing – original draft. YL: Methodology, Writing – review and editing. FD: Software, Visualization, Writing – review and editing. ZG: Validation, Writing – review and editing. TY: Formal Analysis, Project administration, Writing – review and editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work is supported by the National Major Science and Technology Projects (2024ZD0802600).
Conflict of interest
Authors YC, YL, and ZG were employed by China Southern Power Grid. Authors XL, FD, and TY were employed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co., Ltd.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Aslam, A., and Raza, M. (2025). Design and implementation of active control method for minimizing circulating current in MMC-VSC system. IEEE Access 13, 124471–124482. doi:10.1109/access.2025.3588713
Cao, D., Hu, W., Zhao, J., Zhang, G., Zhang, B., Liu, Z., et al. (2020). Reinforcement learning and its applications in modern power and energy systems: a review. J. Mod. Power Syst. Clean Energy 8 (6), 1029–1042. doi:10.35833/mpce.2020.000552
Chao, W., Huang, J., Deng, C., and Dai, L. (2023). “Fuzzy adaptive PI circulating current suppressing control for MMC-HVDC,” in 2023 IEEE 6th information technology, networking, electronic and automation control conference (ITNEC); February 24–26, 2023. Chongqing, China, 1163–1167.
Chen, P., Zhao, J., Liu, K., Zhou, J., Dong, K., Li, Y., et al. (2024). A review on the applications of reinforcement learning control for power electronic converters. IEEE Trans. Industry Appl. 60 (6), 8430–8450. doi:10.1109/tia.2024.3435170
Dinkel, D., Hillermeier, C., and Marquardt, R. (2022). Direct multivariable control for modular multilevel converters. IEEE Trans. Power Electron. 37 (7), 7819–7833. doi:10.1109/tpel.2022.3148578
Fang, Y., Xu, N., and Liu, Y. (2023). “Hybrid linear predictive control scheme based on PIR and MPC for MMC,” in 2023 IEEE 2nd international power electronics and application symposium (PEAS); November 10–13, 2023. Guangzhou, China, 491–495.
Farias, J. V. M., Cupertino, A. F., Pereira, H. A., Seleme, S. I., and Teodorescu, R. (2021). On converter fault tolerance in MMC-HVDC systems: a comprehensive survey. IEEE J. Emerg. Sel. Top. Power Electron. 9 (6), 7459–7470. doi:10.1109/jestpe.2020.3032393
Gheisarnejad, M., Farsizadeh, H., and Khooban, M. H. (2021). A novel nonlinear deep reinforcement learning controller for DC–DC power buck converters. IEEE Trans. Industrial Electron. 68 (8), 6849–6858. doi:10.1109/tie.2020.3005071
He, L., Zhang, K., Xiong, J., and Fan, S. (2015). A repetitive control scheme for harmonic suppression of circulating current in modular multilevel converters. IEEE Trans. Power Electron. 30 (1), 471–481. doi:10.1109/tpel.2014.2304978
Jiang, H., Chen, Y., and Kang, Y. (2021). “Application of neural network controller and policy gradient reinforcement learning on modular multilevel converter (MMC) - a proof of concept,” in 2021 IEEE 4th international electrical and energy conference (CIEEC), May 20–30, 2021. Wuhan, China, 1–6.
Kumar, K. P., and Detroja, K. P. (2022). “Parameterized adaptive controller design using reinforcement learning and deep neural networks,” in 2022 eighth indian control conference (ICC), December 14–16, 2022. IEEE, 121–126.
Li, F., and Zhu, C. (2025). Research on MMC circulating current suppression based on feedforward compensation PR control. Power Syst. Automation 47 (4), 62–65.
Li, C., Zhang, Y., Zhang, X., and Liu, Z. (2019). “Circulating current suppression for MMC with hybrid particle swarm optimization,” in 2019 Chinese control conference (CCC), July 27–30, 2019. Guangzhou, China, 7316–7321.
Lu, P., Huang, W., and Xiao, J. (2021). “Speed tracking of brushless DC motor based on deep reinforcement learning and PID,” in 2021 7th international conference on condition monitoring of machinery in non-stationary operations (CMMNO), June 11–13, 2021. Guangzhou, China, 130–134.
Luo, Y., Yao, J., Huang, S., and Liu, K. (2023). “Small signal stability analysis of MMC-HVDC grid-connected system and optimization control of zero-sequence circulating current controller,” in 2023 IEEE international conference on power science and technology (ICPST), May 5–7, 2023. IEEE, 414–419.
Nougain, V., Mishra, S., Misyris, G. S., and Chatzivasileiadis, S. (2021). Multiterminal DC fault identification for MMC-HVDC systems based on modal analysis—A localized protection scheme. IEEE J. Emerg. Sel. Top. Power Electron. 9 (6), 6650–6661. doi:10.1109/jestpe.2021.3068800
Park, J., Kim, H., Hwang, K., and Lim, S. (2022). Deep reinforcement learning based dynamic proportional-integral (PI) gain auto-tuning method for a robot driver system. IEEE Access 10, 31043–31057. doi:10.1109/access.2022.3159785
Sánchez-Sánchez, E., Groß, D., Prieto-Araujo, E., Dörfler, F., and Gomis-Bellmunt, O. (2020). Optimal multivariable MMC energy-based control for DC voltage regulation in HVDC applications. IEEE Trans. Power Deliv. 35 (2), 999–1009. doi:10.1109/tpwrd.2019.2933771
Shi, X., Chen, N., Wei, T., Wu, J., and Xiao, P. (2021). “A reinforcement learning-based online-training AI controller for DC-DC switching converters,” in 2021 6th international conference on integrated circuits and microsystems (ICICM), October 22–24, 2021. Nanjing, China (Piscataway, NJ: IEEE), 435–438.
Steckler, P.-B., Gauthier, J.-Y., Lin-Shi, X., and Wallart, F. (2022). Differential flatness-based, full-order nonlinear control of a modular multilevel converter (MMC). IEEE Trans. Control Syst. Technol. 30 (2), 547–557. doi:10.1109/tcst.2021.3067887
Vipin, V. N., and Mohan, N. (2025). Sensitivity analysis of the high-frequency-link MMC to DC link voltage ripples in a back-to-back connected MMC-based power electronic transformer. IEEE Trans. Power Electron. 40 (6), 8691–8708. doi:10.1109/tpel.2025.3538605
Wang, Y., Wang, J., Tong, L., and Ye, Q. (2018). Research on MMC circulation control strategy based on adaptive quasi-PR controller. Adv. Technol. Electr. Eng. Energy 37 (12), 24–31.
Ye, J., Guo, H., Wang, B., and Zhang, X. (2024a). Deep deterministic policy gradient algorithm based reinforcement learning controller for single-inductor multiple-output DC–DC converter. IEEE Trans. Power Electron. 39 (4), 4078–4090. doi:10.1109/tpel.2024.3350181
Ye, J., Guo, H., Zhao, D., Wang, B., and Zhang, X. (2024b). TD3 algorithm based reinforcement learning control for multiple-input multiple-output DC–DC converters. IEEE Trans. Power Electron. 39 (10), 12729–12742. doi:10.1109/tpel.2024.3416911
Zhang, W., Li, J., Zhang, M., Yang, X., and Zhong, D. (2025). Research on circulating-current suppression strategy of MMC based on passivity-based integral sliding mode control for multiphase wind power grid-connected systems. Electronics 14 (13), 2722. doi:10.3390/electronics14132722
Keywords: modular multilevel converter (MMC), quasi-PR controller, circulating current suppression, deep reinforcement learning (DRL), optimized control
Citation: Chen Y, Luo X, Lu Y, Duan F, Guo Z and Yan T (2025) An adaptive quasi-PR controller for modular multilevel converters based on deep reinforcement learning. Front. Energy Res. 13:1716873. doi: 10.3389/fenrg.2025.1716873
Received: 01 October 2025; Accepted: 27 October 2025;
Published: 27 November 2025.
Edited by:
Xuewei Pan, Harbin Institute of Technology, Shenzhen, China
Reviewed by:
Shunfeng Yang, Southwest Jiaotong University, China
Ameen Ullah, Shenzhen University, China
Benfei Wang, Sun Yat-sen University, China
Copyright © 2025 Chen, Luo, Lu, Duan, Guo and Yan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Yukun Chen, chenyk1@csg.cn; Xin Luo