
ORIGINAL RESEARCH article

Front. Energy Res., 27 November 2025

Sec. Smart Grids

Volume 13 - 2025 | https://doi.org/10.3389/fenrg.2025.1716873

This article is part of the Research Topic: Advanced Operation, Control, and Planning of Urban Power Grid.

An adaptive quasi-PR controller for modular multilevel converters based on deep reinforcement learning

Yukun Chen1*, Xin Luo2, Yuxin Lu1, Fei Duan2, Zhu Guo1 and Tianyou Yan2
  • 1Electric Power Research Institute, China Southern Power Grid, Guangzhou, China
  • 2Guangzhou Power Supply Bureau of Guangdong Power Grid Co., Ltd., Guangzhou, China

In modern power systems, the Modular Multilevel Converter (MMC) plays an important role owing to its convenient maintenance and easy expansion. However, its structure makes it prone to internal circulating currents, which challenge stable operation, so developing efficient and reliable circulating current suppression techniques is of great significance. This paper introduces a Deep Reinforcement Learning (DRL) method for adaptive tuning of controller parameters, addressing the difficulty of parameter adjustment in MMC circulating current suppression strategies that employ a quasi-PR controller. It analyzes the feasibility of using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to tune the parameters of the PR controller, and designs a suitable neural network and reward function to train the control agent. Simulation results demonstrate the superiority of the TD3-based adaptive quasi-PR controller over the traditional fixed-parameter quasi-PR controller: it suppresses the MMC circulating current more effectively, with a better dynamic response and a smaller THD. This provides an effective solution for promoting the large-scale application of MMCs and enhancing the performance of power systems.

1 Introduction

The Modular Multilevel Converter (MMC) has become a preferred choice in medium- to high-voltage modern power systems, owing to its distinctive advantages such as scalability, low switching losses, and ease of maintenance (Aslam and Raza, 2025; Sánchez-Sánchez et al., 2020). Relevant studies (Steckler et al., 2022; Zhang et al., 2025; Nougain et al., 2021; Farias et al., 2021; Vipin and Mohan, 2025) demonstrate that in high-voltage direct current (HVDC) transmission scenarios, MMCs effectively enhance transmission efficiency, reduce line energy losses, and ensure stable and efficient power delivery. Regarding power quality management, MMCs can precisely compensate harmonic components and rationally regulate reactive power, thereby significantly improving grid power supply quality. Leveraging these outstanding features, MMCs have become indispensable core power conversion equipment in emerging power engineering applications such as renewable energy grid integration and flexible AC transmission systems.

However, MMCs inherently suffer from structural limitations during practical operation. Since the three-phase arms share a common DC link and energy storage elements are distributed within individual submodules, capacitor voltage imbalance inevitably occurs across the arms during steady-state operation. This imbalance induces circulating currents within the converter (Dinkel et al., 2022). As elaborated in reference (Luo et al., 2023), while these circulating currents do not directly affect the output current, they distort the arm currents, which substantially raises the converter’s power losses and jeopardizes the stable operation of MMCs. The increased power losses not only reduce energy utilization efficiency but also lead to device overheating, shortened equipment lifespan, and potentially even system failures. Consequently, developing efficient and reliable techniques for suppressing inter-phase circulating currents in MMCs has become a critical research focus to enable their large-scale deployment. This advancement holds considerable practical value, as it enhances the stability, reliability, and economic performance of power systems.

To address the issue of circulating current suppression in MMCs, the academic community has developed several methodological approaches (Li and Zhu, 2025). Traditional PI control was an early common solution. A plug-in repetitive control scheme is presented in Reference (He et al., 2015), which features the high dynamic characteristics of a PI controller and the exceptional steady-state harmonic suppression of a repetitive controller, thereby minimizing their mutual interference. However, it needs to be matched with a PI controller with appropriate parameters to achieve the best effect.

To improve the adaptability of the PI controller, fuzzy adaptive technology has been introduced. Reference (Chao et al., 2023) proposes a fuzzy adaptive PI circulating current suppression controller, which continuously adapts the PI parameters in real-time through fuzzy logic, using both the system state error and its derivative as inputs. Simulation results indicate a marked improvement compared to conventional PI control. Reference (Li et al., 2019) puts forward a hybrid particle swarm optimization (HPSO) algorithm, which merges the strengths of both particle swarm optimization and genetic algorithm, dynamically adjusts the inertia weight and introduces a disturbance mechanism to avoid local optima. It is used to optimize the PI parameters of the circulating current suppressor. Simulation results demonstrate that the proposed approach can reduce the circulating current amplitude by 83.33% and markedly enhances the converter’s dynamic response. Reference (Fang et al., 2023) presents a hybrid linear predictive control framework that integrates Model Predictive Control (MPC) with a PIR controller. This design eliminates the need for coordinate transformation and phase decoupling, enables direct and static-error-free control of AC components, and achieves linear circulating current control. However, MPC involves a heavy computational burden and a certain degree of control complexity. Reference (Zuo et al., 2020) proposes a method for the adaptive adjustment of quasi-PR controller parameters, which realizes circulating current compensation by combining with the proportional negative feedback of the arm current. Only three quasi-PR controllers are needed to suppress the three-phase circulating current, simplifying the system design. An optimization strategy for the proportional negative feedback link, integrating the traditional PR controller with arm current, is proposed in Reference (Shi et al., 2021). Although it enhances system stability, it does so at the cost of increased system complexity, which complicates parameter tuning. Reference (Wang et al., 2018) proposes an adaptive quasi-PR circulating current suppression method with feedforward compensation, analyzes in detail the influence of the resonance coefficient in the controller on the system, gives the adjustment rules of the resonance coefficient, and shows that the scheme works well for suppressing circulating current. This method only needs one resonance controller to suppress all even-order circulating currents simultaneously and can adapt to different load sizes.

With the advancement of artificial intelligence (AI) technology, reinforcement learning has shown significant potential for control applications and has already been applied in fields such as power electronics and power systems (Shi et al., 2021; Chen et al., 2024; Jiang et al., 2021; Cao et al., 2020). The application of reinforcement learning in control systems is mainly reflected in two aspects. One is the direct control of the system. For example, Ye et al. applied reinforcement learning methods to Buck circuits and SIMO circuits, verifying the feasibility of using DDPG and TD3 algorithms for direct control of power electronic converters (Gheisarnejad et al., 2021; Ye et al., 2024a; Ye et al., 2024b). On the other hand, reinforcement learning is also used for parameter tuning and optimization. To improve the speed tracking accuracy of brushless DC motors, Reference (Lu et al., 2021) combines the Deep Deterministic Policy Gradient (DDPG) algorithm with PID control. The reinforcement learning algorithm compensates for the controller’s proportional, integral, and derivative components. Reference (Park et al., 2022) introduces a dynamic PI gain self-tuning method using DRL. It employs the DQN algorithm to train agents in a simulation environment, producing a reference gain table. Vehicle tests show that this method reduces the root mean square error by nearly 46.8% compared to traditional fuzzy PI control. Reference (Kumar and Detroja, 2022) designs a parameterized adaptive controller based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. By dynamically adjusting the action space and reward function, it achieves faster convergence and better performance in unstable system control. These studies have verified the advantages of reinforcement learning in adaptive optimization of controller parameters, providing new ideas for parameter optimization of MMC circulating current suppression.

Synthesizing existing research, traditional control methods exhibit limitations in parameter adaptability and multi-harmonic suppression. In contrast, reinforcement learning possesses capabilities for dynamic decision-making and adaptive optimization. Combining RL with PR controllers holds promise for realizing real-time optimization of circulating current suppression parameters. Building on this concept, a reinforcement learning-based PR controller is proposed: the circulating current error and its rate of change serve as state inputs, the PR controller parameter adjustments serve as action outputs, and the objective function incorporates error penalties and convergence rewards. A DRL algorithm is then employed to train an agent to autonomously learn parameter adjustment strategies. This framework aims to achieve precise suppression of multi-frequency circulating currents in MMCs, enhancing system stability and robustness under complex operating conditions. Ultimately, the performance of the proposed control strategy is verified through Matlab/Simulink simulations.

2 Mathematical models and problem description

2.1 MMC topology and circulating current mechanism

The typical topological structure of MMC is shown in Figure 1. As shown, the MMC has six arms, where each one consists of N identical sub-modules (SMs) in series with an inductor Lm and a resistor Rm. The upper and lower arms of the same phase form a phase unit. Each sub-module is a half H-bridge composed of two IGBTs with anti-parallel diodes connected in series and then paralleled with a DC energy storage capacitor. The output voltages of the upper and lower arm sub-modules are upj and unj respectively, and the system-side voltages and currents are usj and isj (j = a, b, c) respectively.


Figure 1. The structure of MMC system.

Under ideal operating conditions, the switching of the upper and lower arm sub-modules of each phase of the modular multilevel converter follows a complementary pattern: if each arm contains N sub-modules, the numbers of sub-modules inserted in the upper arm (m) and lower arm (n) at any time must satisfy the constraint m + n = N. The capacitor voltage of each sub-module should then be stably maintained at the rated value of Udc/N.

However, in practical engineering scenarios, the frequent switching operations of the submodules inevitably induce capacitor voltage fluctuations. These voltage fluctuations prevent the output voltages of all three-phase units from maintaining consistent levels, consequently leading to circulating currents. Given the structural symmetry of the MMC’s three-phase arms, phase A can be selected for detailed analysis.

The system analysis is simplified using the single-phase MMC equivalent circuit shown in Figure 2. Based on Kirchhoff’s voltage and current laws, the mathematical model of the single-phase MMC can be derived:

$$\begin{cases} u_{sj}+L_s\dfrac{di_{sj}}{dt}+R_s i_{sj}=\dfrac{U_{dc}}{2}-L_m\dfrac{di_{pj}}{dt}-R_m i_{pj}-u_{pj}\\[2mm] u_{sj}+L_s\dfrac{di_{sj}}{dt}+R_s i_{sj}=-\dfrac{U_{dc}}{2}+L_m\dfrac{di_{nj}}{dt}+R_m i_{nj}+u_{nj} \end{cases} \tag{1}$$


Figure 2. The single-phase MMC equivalent circuit.

where Lm, Rm, Ls, and Rs are the arm inductance, arm resistance, AC-side inductance, and AC-side resistance, respectively.

Based on Equation 1, adding the two equations eliminates the DC-link voltage Udc, while subtracting them eliminates the AC grid voltage usj; this yields the following Equation 2:

$$\begin{cases} \dfrac{1}{2}\left(u_{nj}-u_{pj}\right)=u_{sj}+\left(L_s+\dfrac{L_m}{2}\right)\dfrac{di_{j}}{dt}+\left(R_s+\dfrac{R_m}{2}\right)i_{j}\\[2mm] \dfrac{1}{2}\left(u_{nj}+u_{pj}\right)=\dfrac{U_{dc}}{2}-\dfrac{L_m}{2}\dfrac{d\left(i_{pj}+i_{nj}\right)}{dt}-\dfrac{R_m}{2}\left(i_{pj}+i_{nj}\right) \end{cases} \tag{2}$$

The expressions for the differential-mode voltage udiffj, common-mode voltage ucomj, and the circulating current icirj flowing through the arms are given by:

$$\begin{cases} u_{diffj}=\dfrac{1}{2}\left(u_{nj}-u_{pj}\right)\\[1mm] u_{comj}=\dfrac{1}{2}\left(u_{nj}+u_{pj}\right)\\[1mm] i_{cirj}=\dfrac{1}{2}\left(i_{nj}+i_{pj}\right) \end{cases} \tag{3}$$

Substituting the definitions in Equation 3 into Equation 2, further simplification gives:

$$u_{diffj}-u_{sj}=\left(L_s+\dfrac{L_m}{2}\right)\dfrac{di_{j}}{dt}+\left(R_s+\dfrac{R_m}{2}\right)i_{j} \tag{4}$$
$$\dfrac{1}{2}U_{dc}-u_{comj}=L_m\dfrac{di_{cirj}}{dt}+R_m i_{cirj} \tag{5}$$

Analysis of Equations 4, 5 yields the AC and DC equivalent circuit models of the system, as shown in Figure 3.


Figure 3. Small signal equivalent model. (a) AC small signal equivalent model (b) DC small signal equivalent model.

As shown in Figure 3, L = Ls + Lm/2 represents the equivalent inductance of the system, and R = Rs + Rm/2 represents the equivalent resistance of the system. It follows from Equations 4, 5 that the AC current ij can be controlled by adjusting the magnitude of the differential-mode voltage udiffj, thereby regulating the AC-side power, while the arm circulating current icirj can be controlled by adjusting the magnitude of the common-mode voltage ucomj.

2.2 Circulation suppression strategies

Due to voltage fluctuations in the sub-modules, maintaining a consistent total output voltage from the upper and lower arms of each phase is difficult during stable system operation. Consequently, a voltage difference with respect to the DC bus arises, which in turn induces internal circulating currents. Through analysis, the expression of the circulating current can be obtained as follows:

$$i_{cirj}=\dfrac{I_{dc}}{3}+I_{cirj1}\cos\left(\omega t+\xi_j\right)+I_{cirj2}\cos\left(2\omega t+\psi_j\right)+H \tag{6}$$

where Idc is the DC bus current; Icirj1 is the amplitude of the fundamental frequency component of the circulating current; Icirj2 is the amplitude of the second harmonic component of the circulating current; H represents the higher-order harmonic components; ω is the fundamental angular frequency; ξj is the phase angle of the fundamental circulating current; ψj is the phase angle of the second harmonic component of the circulating current.

From Equation 6, it can be concluded that the system’s circulating current is mainly composed of a DC component, a fundamental frequency component, a second harmonic component, and high-order harmonic components. The proportion of high-order harmonic components is relatively small and can be neglected. The DC component flows in the DC line and serves as the operating current for the normal operation of the DC side. The second harmonic component has negative sequence characteristics; it neither flows into the DC side of the converter nor operates on the AC side of the converter, but flows entirely through the three-phase arms. During normal operation, this causes unnecessary losses in power electronic devices, reducing system efficiency and reliability. To improve efficiency, the second harmonic component in the circulating current should be eliminated. The mathematical expression of the three-phase circulating current in the system is shown in Equation 7:

$$\begin{cases} i_{cira}=\dfrac{I_{dc}}{3}+I_{cira1}\cos\left(\omega t+\xi_a\right)+I_{cira2}\cos\left(2\omega t+\psi_o\right)\\[1mm] i_{cirb}=\dfrac{I_{dc}}{3}+I_{cirb1}\cos\left(\omega t+\xi_b\right)+I_{cirb2}\cos\left(2\omega t+\psi_o+\dfrac{2\pi}{3}\right)\\[1mm] i_{circ}=\dfrac{I_{dc}}{3}+I_{circ1}\cos\left(\omega t+\xi_c\right)+I_{circ2}\cos\left(2\omega t+\psi_o-\dfrac{2\pi}{3}\right) \end{cases} \tag{7}$$
$$\begin{bmatrix}\dfrac{U_{dc}}{2}\\[1mm]\dfrac{U_{dc}}{2}\\[1mm]\dfrac{U_{dc}}{2}\end{bmatrix}-\begin{bmatrix}u_{coma}(t)\\ u_{comb}(t)\\ u_{comc}(t)\end{bmatrix}=L_m\dfrac{d}{dt}\begin{bmatrix}i_{cira}(t)\\ i_{cirb}(t)\\ i_{circ}(t)\end{bmatrix}+R_m\begin{bmatrix}i_{cira}(t)\\ i_{cirb}(t)\\ i_{circ}(t)\end{bmatrix} \tag{8}$$

In the formula, ψo is the initial phase angle of the second harmonic component. Extending Equation 5 to three-phase form, the time-domain mathematical model of the MMC system in the abc coordinate system is obtained as Equation 8.
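To make this composition concrete, the short Python sketch below (not part of the original study) synthesizes a phase-A circulating current according to Equation 6 and identifies its DC, fundamental, and second-harmonic components by FFT; all amplitudes, phases, and the sampling rate are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: synthesize a phase-A circulating current per Equation 6
# and identify its dominant frequency components by FFT. All numeric values
# below are example assumptions, not measured data from the paper.
f0 = 50.0                      # fundamental grid frequency [Hz]
w = 2 * np.pi * f0             # fundamental angular frequency [rad/s]
fs = 50_000                    # sampling rate [Hz] (matches a 2e-5 s step)
t = np.arange(0, 0.2, 1 / fs)  # 0.2 s window = 10 fundamental cycles

Idc = 300.0                    # DC bus current [A] -> DC component Idc/3
I1, xi = 5.0, 0.3              # fundamental component amplitude/phase (example)
I2, psi = 60.0, -0.8           # second-harmonic amplitude/phase (dominant term)

i_cir = Idc / 3 + I1 * np.cos(w * t + xi) + I2 * np.cos(2 * w * t + psi)

# One-sided FFT; bins are spaced by fs/N = 5 Hz for this window length.
mag = np.abs(np.fft.rfft(i_cir)) / len(t)
mag[1:] *= 2                   # convert to single-sided amplitudes

for f_target in (0.0, 50.0, 100.0):
    k = int(round(f_target / (fs / len(t))))
    print(f"{f_target:5.0f} Hz component ≈ {mag[k]:6.1f} A")
# The 100 Hz (second-harmonic) bin dominates the AC content, which is the
# component the circulating current suppression controller must remove.
```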

An ideal Proportional-Resonant (PR) controller can achieve zero steady-state error tracking of AC quantities at its designated frequency and enables phase-separated control, eliminating the need for inter-phase decoupling. However, because of its inherently narrow bandwidth and poor disturbance rejection, the gain of an ideal PR controller drops sharply when the frequency deviates from the nominal value, which degrades suppression performance. The quasi-PR (QPR) controller is more robust: its wider bandwidth keeps the gain high under small frequency variations, so its suppression performance is far less sensitive to deviations from the resonant frequency.

This paper proposes a suppression strategy that employs a Quasi-PR controller as its core framework, augmented by reinforcement learning-based dynamic parameter optimization to enhance system stability and robustness. The transfer function of the Quasi-PR controller is given by Equation 9:

$$G_k(s)=k_P+\dfrac{2k_R\omega_c s}{s^2+2\omega_c s+\omega_0^2} \tag{9}$$

where kP is the proportional gain; kR is the resonant gain; ωc is the cutoff frequency; ω0 is the resonant frequency. The cutoff frequency ωc determines the system bandwidth and open-loop gain. Assuming a grid frequency fluctuation of ±0.5 Hz and a quasi-PR controller bandwidth of ωc/π = 1 Hz, it follows that ωc = π ≈ 3 rad/s. Based on this and the MMC topology, an adaptive quasi-PR controller-based MMC circulating current suppression structure is designed, as illustrated in Figure 4.


Figure 4. Quasi-PR controller.
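As a quick check of how the controller of Equation 9 behaves around its resonant frequency, the short Python sketch below evaluates its magnitude response; the gains kp and kr are illustrative values rather than the tuned parameters of the paper, and ωc is taken as π rad/s consistent with the bandwidth assumption above.

```python
import numpy as np

def qpr_gain(f, kp, kr, wc=np.pi, f0=100.0):
    """Magnitude of the quasi-PR controller of Equation 9,
    G(s) = kp + 2*kr*wc*s / (s^2 + 2*wc*s + w0^2),
    evaluated at s = j*2*pi*f. The resonant frequency is set to
    twice the 50 Hz grid frequency (100 Hz)."""
    w0 = 2 * np.pi * f0
    s = 1j * 2 * np.pi * f
    return np.abs(kp + 2 * kr * wc * s / (s**2 + 2 * wc * s + w0**2))

# Example gains (illustrative, not the values learned in this paper):
kp, kr = 200.0, 800.0
for f in (95.0, 99.0, 100.0, 101.0, 105.0):
    print(f"|G(j2*pi*{f:5.1f} Hz)| = {qpr_gain(f, kp, kr):10.1f}")
# The gain peaks at the 100 Hz resonant frequency and, thanks to the nonzero
# cutoff frequency wc, remains high for small grid-frequency deviations.
```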

Taking phase A as an example: the circulating current icira is calculated by summing the upper and lower arm currents (ipa and ina) and taking half of this sum. Subtracting the DC component Idc/3 yields the second-harmonic component icira_2. Since this paper focuses solely on suppressing the dominant second-harmonic component (higher-order harmonics being negligible), the goal is to drive the AC content of this second-harmonic arm circulating current to zero. To achieve this, the following steps are applied (a brief code sketch of the procedure is given after the list):

1. icira_2 is compared to its reference value of 0, resulting in the error signal -icira_2.

2. This error signal -icira_2 is fed into a quasi-PR controller for tracking.

3. By setting the resonant frequency ω0 of the quasi-PR controller to twice the fundamental grid frequency (2ωg), the controller effectively reduces the second-harmonic component within the circulating current.
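The following Python sketch outlines one possible discrete-time implementation of this per-phase procedure, assuming the 2 × 10⁻⁵ s simulation step used later in the paper; the state-space realization, the Tustin discretization via scipy, and the example gains are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.signal import cont2discrete

class QuasiPR:
    """Minimal discretized quasi-PR controller (Equation 9); a sketch, not the
    authors' code. The resonant term 2*kr*wc*s/(s^2 + 2*wc*s + w0^2) is
    realized in state space and discretized with the bilinear (Tustin) method."""
    def __init__(self, kp, kr, wc=np.pi, w0=2 * np.pi * 100.0, Ts=2e-5):
        A = np.array([[0.0, 1.0], [-w0**2, -2.0 * wc]])
        B = np.array([[0.0], [1.0]])
        C = np.array([[0.0, 2.0 * kr * wc]])
        D = np.array([[0.0]])
        self.Ad, self.Bd, self.Cd, self.Dd, _ = cont2discrete(
            (A, B, C, D), Ts, method='bilinear')
        self.kp = kp
        self.x = np.zeros((2, 1))   # resonant-filter state

    def step(self, error):
        y = (self.Cd @ self.x + self.Dd * error).item()
        self.x = self.Ad @ self.x + self.Bd * error
        return self.kp * error + y

ctrl_a = QuasiPR(kp=200.0, kr=800.0)   # illustrative gains

def phase_a_suppression(i_pa, i_na, I_dc):
    """Mirror of the Figure 4 signal path for phase A (names illustrative)."""
    i_cira = 0.5 * (i_pa + i_na)      # average of upper and lower arm currents
    i_cira_2 = i_cira - I_dc / 3.0    # remove the DC share -> dominant 2nd harmonic
    error = 0.0 - i_cira_2            # the AC content is regulated to zero
    return ctrl_a.step(error)         # common-mode correction voltage u_cira_ref
```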

The control objective of this paper is to intelligently tune the parameters kP and kR in the quasi-PR controller using a DRL algorithm, enabling the quasi-PR controller to adaptively adjust its parameters according to different operating conditions. This will allow for rapid suppression of circulating currents. Additionally, the control effect of this method will be compared with that of a controller with fixed PR parameters.

3 Deep reinforcement learning and training strategies

This section enhances the traditional quasi-PR controller by using reinforcement learning for parameter tuning, and introduces the basic principles of the TD3 algorithm adopted in this paper. Then, on this basis, the training process of parameter tuning using the TD3 algorithm is elaborated. Finally, the complete structure diagram of the adaptive quasi-PR controller based on the reinforcement learning algorithm is presented.

3.1 Principle of deep reinforcement learning

As a major branch of artificial intelligence, DRL combines the perceptual capabilities of deep learning with the decision-making mechanisms of reinforcement learning, enabling agents to autonomously learn and optimize in complex environments. Based on the Markov Decision Processes, this framework facilitates learning through dynamic interaction between the agent and its environment.

In the specific mechanism, the agent obtains the state information of the environment through sensors and outputs actions based on strategies. After the environment responds to the action, it will feed back a new state and an immediate reward, forming a closed-loop interaction chain of “State-Action-Reward-new State”. Reinforcement learning problems usually involve modeling large-scale states and actions, and there is environmental uncertainty in tasks. However, deep learning models can effectively learn and adapt to complex patterns and features in these spaces, thereby improving the robustness of the system.

3.2 TD3 algorithm principle

The TD3 algorithm is an important improved method in the field of DRL for problems involving continuous action spaces. It builds on the Deep Deterministic Policy Gradient (DDPG) algorithm, introducing multiple enhancement mechanisms to alleviate policy overestimation and training instability. Compared to on-policy algorithms such as PPO, TD3 is generally more sample-efficient, making it well-suited to environments where interactions are computationally expensive, such as the high-fidelity power electronics simulations used in this work. As a typical implementation of the actor-critic framework, TD3 simultaneously maintains a policy network for generating deterministic actions and value networks for evaluating action values, and optimizes both jointly through temporal difference learning.

As shown in Figure 5, the core improvements of the TD3 algorithm are reflected in three key mechanisms. Firstly, the dual Critic network structure trains two Q-networks (Q1 and Q2) with identical architecture but independent parameters in parallel. When calculating target values, the minimum value of the two is selected as the target Q-value, effectively suppressing policy biases caused by overestimation in the value networks of traditional DDPG. Secondly, the delayed policy update mechanism sets the update frequency of the Critic networks to twice that of the policy network (usually updating the Actor network only after every two updates of the Critic networks), reducing the interference of policy parameter fluctuations on value estimation and improving the stability of the training process. Finally, target action noise injection adds small random noise when generating actions for the target policy, with clipping operations to limit the noise range, enhancing the policy’s robustness against environmental disturbances while promoting more thorough exploration of the action space.


Figure 5. TD3 algorithm principle.
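A minimal Python sketch of the target-value computation implied by these mechanisms is given below; the discount factor, noise parameters, and stand-in target networks are typical defaults or placeholders rather than values reported in the paper, and actions are assumed to be normalized to [-1, 1] before being mapped to the ranges of Equation 11.

```python
import numpy as np

# Sketch of the TD3 target computation: target-policy smoothing plus the
# clipped double-Q minimum. q1_target/q2_target/pi_target stand for the
# target critics and target actor; here they are trivial placeholders.
def td3_target(reward, next_state, done, q1_target, q2_target, pi_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0):
    # Target-policy smoothing: perturb the target action with clipped noise.
    noise = np.clip(np.random.normal(0.0, noise_std, size=2),
                    -noise_clip, noise_clip)
    next_action = np.clip(pi_target(next_state) + noise, a_low, a_high)
    # Clipped double-Q: take the smaller of the two target critic estimates.
    q_min = min(q1_target(next_state, next_action),
                q2_target(next_state, next_action))
    return reward + gamma * (1.0 - done) * q_min

# Demo with stand-in target networks:
pi_t = lambda s: np.array([0.1, -0.2])
q1_t = lambda s, a: 10.0
q2_t = lambda s, a: 12.0
y = td3_target(reward=-5.0, next_state=np.zeros(6), done=0.0,
               q1_target=q1_t, q2_target=q2_t, pi_target=pi_t)
print(y)   # -5.0 + 0.99 * 10.0 = 4.9
```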

In the specific training process, the TD3 algorithm uses a replay buffer to store trajectory samples from agent-environment interactions. Random sampling is applied to break data correlations. The Critic networks are updated by minimizing a loss function, while the Actor network improves the policy by maximizing the Q-values from the Critic, guided by the deterministic policy gradient. Target network parameters are softly updated to track the main networks slowly, enhancing training stability.
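The soft (Polyak) target update and the delayed actor-update cadence can be sketched as follows; parameters are represented as plain numpy arrays, and τ = 0.005 with a policy delay of 2 are common defaults rather than values confirmed by the paper.

```python
import numpy as np

# Minimal runnable sketch of two TD3 mechanics described above: the soft
# update of target-network parameters and the delayed actor-update cadence.
def soft_update(target, main, tau=0.005):
    """theta_target <- tau * theta_main + (1 - tau) * theta_target"""
    for name in main:
        target[name] = tau * main[name] + (1.0 - tau) * target[name]

rng = np.random.default_rng(0)
main = {"W": rng.normal(size=(4, 4)), "b": np.zeros(4)}
target = {k: v.copy() for k, v in main.items()}

policy_delay = 2
for step in range(1, 7):
    # ... both critic networks would be updated here every step ...
    if step % policy_delay == 0:
        # ... the actor (policy) update would run here ...
        soft_update(target, main)          # targets track the main nets slowly
        print(f"step {step}: actor + target update")
    else:
        print(f"step {step}: critic-only update")
```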

These designs enable the TD3 algorithm to exhibit better convergence performance and stability than DDPG in continuous control tasks. Especially in high-dimensional action spaces and environments with sparse rewards, its ability to balance exploration and exploitation through multiple mechanisms is significantly demonstrated, making it an important technical choice for DRL in practical scenarios such as robot control and industrial optimization.

3.3 Training process

As mentioned earlier, in reinforcement learning, an agent generates actions based on its own policy using the state information of the environment. During the interaction with the environment, it receives rewards and updates the weights in its neural network according to these rewards. Therefore, the design of state variables, action variables, reward functions, and neural networks is crucial to the entire agent. Before introducing the design of the aforementioned variables, a brief description of the training process is given as follows: the simulation time for each episode is 1.5 s. Circulating current suppression is activated at 0.3 s. The active power steps from −2 per-unit (p.u.) to 2 p.u. at 0.6 s and subsequently steps back to −2 p.u. at 1 s.

3.3.1 State

The state variables, serving as the sole source of information for the agent to perceive the environmental dynamics, directly influence the decision-making quality and convergence efficiency of the policy network. Therefore, the state variables selected in this paper are given in Equation 10:

$$S=\left[t,\ M_s,\ P_{ref},\ P_{err},\ i_{cirA},\ THD\right] \tag{10}$$

where: t is the simulation running time; Ms is a Boolean value indicating the operating condition, identifying the current system phase (Ms = 1 after an active power step change, Ms = 0 otherwise). This design explicitly informs the agent of the current system phase, enabling it to adapt its control strategy accordingly. Pref and Perr are the reference value and error of the active power, respectively; icirA is the circulating current in phase A; THD is the Total Harmonic Distortion rate of the phase A arm current. To eliminate the influence of differing dimensions among the state variables, normalization is applied, ensuring all state components reside within a comparable magnitude range.

It is worth mentioning that the inclusion of the simulation time t in the state vector serves to inform the agent about the specific phase of the training episode. The training episodes have a fixed structure, with events at 0.3 s, 0.6 s, and 1.0 s. This allows the agent to anticipate scheduled events, such as the activation of the controller or the power steps, and to adopt a more proactive control strategy, which aids convergence in the defined training scenario.
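A possible way to assemble and normalize this state vector is sketched below; the normalization constants are assumptions, since the paper only states that all components are scaled to a comparable magnitude range.

```python
import numpy as np

# Sketch of assembling the state vector of Equation 10. The scaling constants
# are illustrative assumptions, not values given in the paper.
T_EPISODE = 1.5       # episode length [s]
P_BASE = 2.0          # per-unit power bound used for clipping
I_CIR_MAX = 400.0     # assumed circulating-current scaling [A]
THD_MAX = 100.0       # THD expressed in percent

def build_state(t, power_step_active, p_ref, p_err, i_cirA, thd):
    Ms = 1.0 if power_step_active else 0.0
    return np.array([
        t / T_EPISODE,                       # simulation time
        Ms,                                  # operating-condition flag
        np.clip(p_ref / P_BASE, -1, 1),      # active-power reference (p.u.)
        np.clip(p_err / P_BASE, -1, 1),      # active-power error (p.u.)
        np.clip(i_cirA / I_CIR_MAX, -1, 1),  # phase-A circulating current
        np.clip(thd / THD_MAX, 0, 1),        # arm-current THD
    ])

s = build_state(t=0.62, power_step_active=True, p_ref=2.0, p_err=0.15,
                i_cirA=35.0, thd=3.1)
```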

3.3.2 Action

Since the agent’s objective is to tune the parameters of the quasi-PR controller, the action space is defined as a two-dimensional vector containing the adjustments for { kP, kR }. According to reference (He et al., 2015), a larger resonant gain kR enhances the suppression of the second-harmonic circulating current. However, as kR increases, the closed-loop poles move closer to the imaginary axis, causing the system’s stability margin to diminish. Consequently, kR cannot be excessively large. Integrating the agent’s exploration requirements, the allowable ranges for the action space parameters are defined as Equation 11:

$$\begin{cases} 0<k_P<1200\\ 0<k_R<1200 \end{cases} \tag{11}$$
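If the actor's raw output is bounded, for example by a tanh layer in [-1, 1], it can be mapped onto the ranges of Equation 11 as in the following sketch; the tanh convention is an assumption, as the paper only specifies the final bounds.

```python
import numpy as np

# Sketch of mapping a bounded actor output onto the admissible parameter
# ranges of Equation 11 (assumed, not the authors' exact scaling layer).
K_MAX = 1200.0

def action_to_gains(raw_action):
    """raw_action: two values in [-1, 1] -> (kP, kR) in (0, 1200)."""
    a = np.clip(np.asarray(raw_action, dtype=float), -1.0, 1.0)
    return tuple(K_MAX * (a + 1.0) / 2.0)   # affine map [-1, 1] -> [0, 1200]

kP, kR = action_to_gains([0.1, -0.4])        # e.g. kP = 660, kR = 360
```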

3.3.3 Reward

In the reinforcement learning framework, the reward function serves as the core bridge connecting the agent’s decision-making and control objectives. Its rationality directly determines whether the agent can learn an optimization strategy that meets actual needs. For the problem of MMC circulating current suppression, an effective reward function needs to accurately reflect the system’s comprehensive requirements for active power control accuracy, circulating current suppression effect, and power quality. At the same time, it should guide the agent to make adaptive adjustments when the operating conditions change abruptly (such as active power steps). Therefore, the design of the reward function must take into account both multi-objective optimization and the specificity of dynamic operating conditions.

For the intelligent tuning of PR controller parameters in MMC circulating current suppression, this paper designs the reward function as shown in Equations 12–15:

$$reward=\begin{cases}0, & t<0.3\,\mathrm{s}\\ R_{perr}+R_{circ}+R_{thd}, & t\geq 0.3\,\mathrm{s}\end{cases} \tag{12}$$

Where,

$$R_{perr}=\begin{cases}-400\cdot\left|P_{err}\right|, & t<0.6\,\mathrm{s}\ \text{or}\ t\geq 1\,\mathrm{s}\\ -400\cdot 1.5\cdot\left|P_{err}\right|, & 0.6\,\mathrm{s}\leq t<1\,\mathrm{s}\end{cases} \tag{13}$$
$$R_{circ}=\begin{cases}0, & t<0.3\,\mathrm{s}\\ -2\cdot\bar{i}_{cir}, & 0.3\,\mathrm{s}\leq t<0.6\,\mathrm{s}\ \text{or}\ t\geq 1\,\mathrm{s}\\ -2\cdot 2\cdot\bar{i}_{cir}, & 0.6\,\mathrm{s}\leq t<1\,\mathrm{s}\end{cases},\qquad R_{circ}\leftarrow R_{circ}-100\ \ \text{if}\ \bar{i}_{cir}>40 \tag{14}$$
$$R_{thd}=\begin{cases}0, & t<0.4\,\mathrm{s}\\ -10\cdot\left(THD-2\right), & t\geq 0.4\,\mathrm{s}\ \text{and}\ THD\leq 10\\ -50, & t\geq 0.4\,\mathrm{s}\ \text{and}\ THD>10\end{cases} \tag{15}$$

where $\bar{i}_{cir}=\frac{1}{3}\left(\left|i_{cira}\right|+\left|i_{cirb}\right|+\left|i_{circ}\right|\right)$ denotes the average absolute value of the three-phase circulating currents.

This reward function consists of three core components: an active power error penalty, a circulating current amplitude penalty, and a total harmonic distortion (THD) penalty, and it incorporates a time-segmented mechanism to meet the control demands under various operating conditions. Specifically:

a. The active power error penalty is derived from the absolute value of the per-unitized active power error. During the active power step period (0.6–1s), the penalty is enhanced by a factor of 1.5 to strengthen the requirement for step response speed.

b. The circulating current amplitude penalty is calculated from the average of the absolute values of the three-phase circulating currents. It takes effect after circulating current suppression is activated at 0.3 s, and its weight is doubled during the step period. In addition, since the average circulating current amplitude under conventional fixed-parameter QPR control is about 50 A, an extra penalty is imposed whenever the amplitude exceeds 40 A, so that the agent achieves a better suppression effect than traditional QPR control while ensuring safe system operation.

c. The THD penalty is included from 0.4s onwards. It penalizes excessive harmonic distortion with 2% as the reference value, and imposes a more severe fixed penalty on extreme cases where THD exceeds 10% to guarantee power quality.

The final reward is the sum of all components and is limited to the range [−1000, 100] to prevent numerical saturation from affecting training stability.
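One way to express Equations 12–15 in code is sketched below; the penalty signs and thresholds follow the description above, but this is an interpretation of the reward design rather than the authors' exact implementation.

```python
import numpy as np

# Sketch of the time-segmented reward of Equations 12-15 (interpretation only).
def reward_fn(t, p_err, i_cir_abc, thd):
    if t < 0.3:
        return 0.0
    # (13) active-power error penalty, weighted 1.5x during the power step
    w_p = 1.5 if 0.6 <= t < 1.0 else 1.0
    r_perr = -400.0 * w_p * abs(p_err)
    # (14) circulating-current penalty on the mean absolute three-phase value
    i_avg = np.mean(np.abs(i_cir_abc))
    w_c = 2.0 if 0.6 <= t < 1.0 else 1.0
    r_circ = -2.0 * w_c * i_avg
    if i_avg > 40.0:
        r_circ -= 100.0          # extra penalty above the 40 A safety limit
    # (15) THD penalty, active from 0.4 s, referenced to 2%
    if t < 0.4:
        r_thd = 0.0
    elif thd <= 10.0:
        r_thd = -10.0 * (thd - 2.0)
    else:
        r_thd = -50.0
    return float(np.clip(r_perr + r_circ + r_thd, -1000.0, 100.0))

r = reward_fn(t=0.75, p_err=0.05, i_cir_abc=[25.0, -22.0, 30.0], thd=3.0)
```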

3.3.4 Neural network

The Actor network in the algorithm is shown in Figure 6. It takes time t, condition signal Ms, active power reference value Pref, active power error Perr, circulating current icircA, and total harmonic distortion rate THD as inputs. Through multi-layer fully connected hidden layers containing learnable parameters θπ, it performs non-linear transformation and deep fusion on the input state information to capture complex feature relationships. Then, through the output layer and scaling layer, the processed features are mapped and adapted to the control quantities a1, a2(t) executable by the converter, realizing the conversion from environmental states to control action strategies. The Critic network is divided into two paths: state quantity and action quantity. The network structure under each path is similar to the Actor network, as shown in Figure 7.


Figure 6. Proposed actor net.


Figure 7. Proposed critic net.
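A PyTorch sketch of an actor consistent with Figure 6 is given below; the layer widths, activations, and the tanh-plus-scaling output are assumptions, since the paper does not list them explicitly.

```python
import torch
import torch.nn as nn

# Sketch of an actor network matching the structure of Figure 6: six state
# inputs, fully connected hidden layers, and a bounded output rescaled to the
# gain ranges of Equation 11. Sizes and activations are assumptions.
class Actor(nn.Module):
    def __init__(self, state_dim=6, action_dim=2, hidden=64, k_max=1200.0):
        super().__init__()
        self.k_max = k_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # output in [-1, 1]
        )

    def forward(self, state):
        # scaling layer: map [-1, 1] to (0, k_max) for kP and kR
        return self.k_max * (self.net(state) + 1.0) / 2.0

actor = Actor()
state = torch.tensor([[0.4, 1.0, 1.0, 0.05, 0.1, 0.03]])  # [t, Ms, Pref, Perr, icirA, THD]
kp_kr = actor(state)   # tensor of shape (1, 2): the tuned [kP, kR]
```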

The selection of the agent’s training hyperparameters, listed in Table 1, is crucial for stable and efficient learning. The values were determined through an iterative tuning process, starting with commonly used values from DRL literature for continuous control tasks (Ye, et al., 2024b). Minor adjustments were then made based on the observed training stability and reward convergence from preliminary simulation runs. For instance, the learning rate of 0.0005 was found to provide a good balance between convergence speed and stability, avoiding large, unstable policy updates.


Table 1. Agent training parameters.

4 Training and simulation results

4.1 TD3 algorithm training results

The simulation step size is set to 2 × 10⁻⁵ s, and the total simulation duration per episode is 1.5 s. In each episode, the agent starts from the initial state, and Gaussian random noise with a certain standard deviation is added to the action generated by the policy function. After the agent takes the action, the reward is calculated according to Equation 12, and the rewards at each moment in the episode are accumulated to obtain the cumulative reward of the episode. The cumulative rewards obtained over multiple episodes are shown in Figure 8.


Figure 8. Agent training process.
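The episodic interaction just described can be sketched as follows; the environment class is a trivial stand-in for the Simulink co-simulation, the policy is a fixed placeholder, and the noise level and number of steps are illustrative (a full episode at the 2 × 10⁻⁵ s step would contain 75,000 steps).

```python
import numpy as np

# Sketch of one training episode: the policy proposes gain settings, Gaussian
# exploration noise is added, and per-step rewards accumulate into the return.
class DummyEnv:
    """Trivial stand-in for the MMC Simulink co-simulation (illustrative only)."""
    def reset(self):
        return np.zeros(6)
    def step(self, action):
        next_state = np.zeros(6)
        reward = -abs(action[0] - 600.0) * 1e-3   # placeholder reward signal
        return next_state, reward, False

def run_episode(env, policy, noise_std=50.0, n_steps=1000):
    state = env.reset()
    episode_return = 0.0
    for _ in range(n_steps):
        action = policy(state) + np.random.normal(0.0, noise_std, size=2)
        state, reward, done = env.step(action)
        episode_return += reward
        if done:
            break
    return episode_return

policy = lambda s: np.array([600.0, 600.0])   # stand-in for the actor network
print(run_episode(DummyEnv(), policy))
```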

The entire process, spanning over 140 episodes, took approximately 3–4 h to converge to a stable and high-performing policy. The duration is primarily influenced by the high fidelity of the MMC simulation model and the number of interaction steps required for the agent to learn effectively.

As can be seen from Figure 8, the reward of the agent fluctuates continuously during the 0–60 episodes. This is because the agent is exploring appropriate parameters, and different parameters have different adjustment effects on the system, thus leading to reward fluctuations. After 60 episodes, the reward tends to converge, indicating that the agent has learned an effective parameter optimization strategy for the system.

Figure 9 shows the control parameters produced by the trained agent, from which it can be observed that the agent continuously adjusts kP and kR according to the operating conditions. This demonstrates a learned, logical relationship between the system state and the controller parameters, even if it is not a simple linear one.


Figure 9. kP, kR changing process.

4.2 Simulation results

This paper evaluates the feasibility of using an agent trained by the TD3 algorithm to adjust quasi-PR controller parameters. A simulation model of a three-phase MMC was developed on the Matlab/Simulink platform for this purpose. The system parameters are listed in Table 2.


Table 2. Parameters for simulation.

The waveforms for phase A in the absence of a circulating current controller are depicted in Figure 10. Distortion of the arm current caused by high-order harmonics within the circulating current can be observed in Figure 10a, resulting in a non-sinusoidal waveform. The circulating current after removal of the DC component is shown in Figure 10b. It is evident that without suppression, this circulating current exhibits significant fluctuations, severely compromising MMC operational efficiency. Figure 10c presents the Fast Fourier Transform (FFT) analysis of the arm current, revealing a Total Harmonic Distortion (THD) of 26.31%. This verifies the presence of even-order harmonic components beyond the DC offset, with the second-harmonic component exhibiting particularly large amplitude fluctuations.


Figure 10. Pre-circulating current suppression waveform. (a) A-phase bridge arm current (b) A-phase circulating current (c) FFT analysis of A-phase bridge arm current.
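For reference, THD figures like those quoted with the FFT analyses can in principle be computed as the ratio of the RMS of the harmonic components to the fundamental; the Python sketch below demonstrates this on a synthetic arm current, whose composition, and the exact THD definition used by the authors' FFT tool, are assumptions.

```python
import numpy as np

# Sketch of a THD computation: RMS of harmonic components relative to the
# 50 Hz fundamental of the arm current. The synthetic signal is illustrative.
def thd_percent(signal, fs, f0=50.0, n_harmonics=40):
    n = len(signal)
    mag = np.abs(np.fft.rfft(signal)) / n
    mag[1:] *= 2.0                           # single-sided amplitudes
    df = fs / n
    fund = mag[int(round(f0 / df))]
    harmonics = [mag[int(round(k * f0 / df))] for k in range(2, n_harmonics + 1)]
    return 100.0 * np.sqrt(np.sum(np.square(harmonics))) / fund

fs = 50_000
t = np.arange(0, 0.2, 1 / fs)
i_arm = 1500 * np.cos(2 * np.pi * 50 * t) + 120 * np.cos(2 * np.pi * 100 * t) \
        + 500.0                              # fundamental + 2nd harmonic + DC offset
print(f"THD = {thd_percent(i_arm, fs):.2f} %")   # about 8 % for this example
```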

Figure 11 depicts the results of circulating current suppression using a quasi-proportional-resonant (QPR) controller with fixed parameters. After the introduction of the QPR controller, the waveform of the leg current is improved to a certain extent. As can be seen from Figure 11b in conjunction with Figure 11c, the average value of the circulating current amplitude decreases to 20 A after 0.4 s, with a positive peak value of 30.2 A and a negative peak value of 36.8 A. FFT results indicate that the second-harmonic component is significantly suppressed, with a THD of 3.09%. Although the QPR controller provides some suppression of the circulating current, its control response time is relatively long, requiring at least 0.4 s to achieve the optimal suppression effect.


Figure 11. Fixed parameter QPR control. (a) A-phase bridge arm current (b) A-phase circulating current (c) FFT analysis of A-phase bridge arm current.

Figure 12 presents the results of adaptive parameter tuning for the QPR controller using the agent trained with the TD3 algorithm. It can be observed that after the control is activated, the waveform of the leg current is also significantly improved, and the circulating current amplitude can be rapidly suppressed within 0.001 s. The average amplitude of the suppressed circulating current is comparable to that achieved by the QPR controller, with a positive peak value of 30.6 A and a negative peak value of 37.1 A. FFT analysis reveals that the second-harmonic component in the circulating current is significantly suppressed, with a measured THD of 3.01%, exhibiting a certain improvement in dynamic response performance compared to the QPR controller with fixed parameters.


Figure 12. Adaptive QPR control based on TD3 algorithm. (a) A-phase bridge arm current (b) A-phase circulating current (c) FFT analysis of A-phase bridge arm current.

To further analyze the dynamic characteristics of the agent-based control, a step change is applied to the per-unit value of the AC-side output power: at 0.6 s, the active power steps from −2 to 2, and then steps from 2 to −2 at 1 s. The variations of the phase-A leg current and circulating current amplitude are observed, with the results presented in Figures 13, 14. It can be seen that both control strategies enable the phase-A leg current to recover to a steady state within 0.1 s when the load suddenly increases or decreases, and the difference in circulating current amplitude at steady state is negligible.


Figure 13. Waveforms of MMC when power changes (QPR). (a) A-phase circulating current (b) A-phase bridge arm current.


Figure 14. Waveforms of MMC when power changes (QPR + DRL). (a) A-phase circulating current (b) A-phase bridge arm current.

Comparing Figures 13a, 14a, when the load increases, the circulating current step caused by the adaptive QPR control is 112 A, whereas that caused by the QPR controller with fixed parameters is 376 A; when the load decreases, the circulating current step of the adaptive control is −104 A, while that of the fixed-parameter control is −198 A. Comprehensive comparison of the leg current and circulating current indicates that, compared with the fixed-parameter QPR control, the adaptive QPR control exhibits smaller overshoot and shorter adjustment time when the output power changes, which demonstrates the superiority of the proposed method.

Figures 15, 16 demonstrate the stability of the proposed control at the system level. Figure 15 displays the waveforms of the average capacitor voltage for all submodules in the upper arm of phase A under both fixed-parameter QPR control and adaptive DRL control, plotted within the same coordinate system. When circulating current suppression control is activated at 0.5 s, both control methods reduce the capacitor voltage ripple to some extent. However, the voltage ripple under fixed-parameter control is larger than that under adaptive DRL control, highlighting the effectiveness and robustness of the DRL approach. Figure 16 presents the three-phase bridge currents and three-phase output AC currents of the upper arm in phase A. As shown in Figure 16a, when circulating current suppression is initiated, the second-harmonic component in the bridge current is significantly eliminated, resulting in a more sinusoidal current after 0.5 s. Figure 16b illustrates the three-phase output currents during steady-state operation, demonstrating that while suppressing internal circulating currents, the MMC maintains uncompromised power quality in its external output.


Figure 15. Phase-A upper arm average capacitor voltage.


Figure 16. Three-phase current waveforms. (a) Three-phase arm current waveforms of phase A (b) Three-phase output AC current waveforms.

5 Conclusion

The limitations of the conventional quasi-PR controller-based method for MMC circulating current suppression, which include difficulties in parameter tuning and the low adaptability of fixed parameters, are the primary focus of this paper. To solve these problems, a method for adaptive tuning of controller parameters through deep reinforcement learning is proposed. This paper analyzes the feasibility of using the TD3 algorithm in DRL to tune the parameters of the PR controller, trains an agent capable of adaptively changing parameters by designing a reasonable neural network and reward function, and finally compares the advantages and disadvantages of the adaptive strategy with the traditional quasi-PR method through simulation. The results confirm that the proposed adaptive QPR controller more effectively suppresses MMC circulating currents than the traditional quasi-PR controller. While achieving equivalent suppression, it operates with superior dynamic characteristics and a lower THD.

While this study demonstrates the effectiveness of the proposed method in a simulation environment, future work will focus on hardware-in-the-loop testing and physical deployment. The deployment process involves embedding the trained Actor network onto a real-time digital controller, such as a DSP or FPGA. Since the computationally intensive training process is performed offline, the real-time implementation only requires executing a forward pass of the lightweight Actor network each control cycle. The network’s modest size, consisting of a few hidden layers, implies that its memory footprint (for weights and biases) and computational requirements (for matrix multiplications) are well within the capabilities of standard industrial controllers. Future research will validate the real-time performance, including inference latency and resource utilization on a target hardware platform, to confirm its feasibility for practical applications.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

YC: Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing – original draft. XL: Conceptualization, Writing – original draft. YL: Methodology, Writing – review and editing. FD: Software, Visualization, Writing – review and editing. ZG: Validation, Writing – review and editing. TY: Formal Analysis, Project administration, Writing – review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work is supported by the National Major Science and Technology Projects (2024ZD0802600).

Conflict of interest

Authors YC, YL, and ZG were employed by China Southern Power Grid. Authors XL, FD, and TY were employed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co., Ltd.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Aslam, A., and Raza, M. (2025). Design and implementation of active control method for minimizing circulating current in MMC-VSC system. IEEE Access 13, 124471–124482. doi:10.1109/access.2025.3588713

Cao, D., Hu, W., Zhao, J., Zhang, G., Zhang, B., Liu, Z., et al. (2020). Reinforcement learning and its applications in modern power and energy systems: a review. J. Mod. Power Syst. Clean Energy 8 (6), 1029–1042. doi:10.35833/mpce.2020.000552

Chao, W., Huang, J., Deng, C., and Dai, L. (2023). “Fuzzy adaptive PI circulating current suppressing control for MMC-HVDC,” in 2023 IEEE 6th information technology, networking, electronic and automation control conference (ITNEC); February 24–26, 2023. Chongqing, China, 1163–1167.

Chen, P., Zhao, J., Liu, K., Zhou, J., Dong, K., Li, Y., et al. (2024). A review on the applications of reinforcement learning control for power electronic converters. IEEE Trans. Industry Appl. 60 (6), 8430–8450. doi:10.1109/tia.2024.3435170

Dinkel, D., Hillermeier, C., and Marquardt, R. (2022). Direct multivariable control for modular multilevel converters. IEEE Trans. Power Electron. 37 (7), 7819–7833. doi:10.1109/tpel.2022.3148578

Fang, Y., Xu, N., and Liu, Y. (2023). “Hybrid linear predictive control scheme based on PIR and MPC for MMC,” in 2023 IEEE 2nd international power electronics and application symposium (PEAS); November 10–13, 2023. Guangzhou, China, 491–495.

Farias, J. V. M., Cupertino, A. F., Pereira, H. A., Seleme, S. I., and Teodorescu, R. (2021). On converter fault tolerance in MMC-HVDC systems: a comprehensive survey. IEEE J. Emerg. Sel. Top. Power Electron. 9 (6), 7459–7470. doi:10.1109/jestpe.2020.3032393

Gheisarnejad, M., Farsizadeh, H., and Khooban, M. H. (2021). A novel nonlinear deep reinforcement learning controller for DC–DC power buck converters. IEEE Trans. Industrial Electron. 68 (8), 6849–6858. doi:10.1109/tie.2020.3005071

He, L., Zhang, K., Xiong, J., and Fan, S. (2015). A repetitive control scheme for harmonic suppression of circulating current in modular multilevel converters. IEEE Trans. Power Electron. 30 (1), 471–481. doi:10.1109/tpel.2014.2304978

Jiang, H., Chen, Y., and Kang, Y. (2021). “Application of neural network controller and policy gradient reinforcement learning on modular multilevel converter (MMC) - a proof of concept,” in 2021 IEEE 4th international electrical and energy conference (CIEEC), May 20–30, 2021. Wuhan, China, 1–6.

Kumar, K. P., and Detroja, K. P. (2022). “Parameterized adaptive controller design using reinforcement learning and deep neural networks,” in 2022 eighth indian control conference (ICC), December 14–16, 2022. IEEE, 121–126.

Li, F., and Zhu, C. (2025). Research on MMC circulating current suppression based on feedforward compensation PR control. Power Syst. Automation 47 (4), 62–65.

Li, C., Zhang, Y., Zhang, X., and Liu, Z. (2019). “Circulating current suppression for MMC with hybrid particle swarm optimization,” in 2019 Chinese control conference (CCC), July 27–30, 2019. Guangzhou, China, 7316–7321.

Lu, P., Huang, W., and Xiao, J. (2021). “Speed tracking of brushless DC motor based on deep reinforcement learning and PID,” in 2021 7th international conference on condition monitoring of machinery in non-stationary operations (CMMNO), June 11–13, 2021. Guangzhou, China, 130–134.

Luo, Y., Yao, J., Huang, S., and Liu, K. (2023). “Small signal stability analysis of MMC-HVDC grid-connected system and optimization control of zero-sequence circulating current controller,” in 2023 IEEE international conference on power science and technology (ICPST), May 5–7, 2023. IEEE, 414–419.

Nougain, V., Mishra, S., Misyris, G. S., and Chatzivasileiadis, S. (2021). Multiterminal DC fault identification for MMC-HVDC systems based on modal analysis—A localized protection scheme. IEEE J. Emerg. Sel. Top. Power Electron. 9 (6), 6650–6661. doi:10.1109/jestpe.2021.3068800

Park, J., Kim, H., Hwang, K., and Lim, S. (2022). Deep reinforcement learning based dynamic proportional-integral (PI) gain auto-tuning method for a robot driver system. IEEE Access 10, 31043–31057. doi:10.1109/access.2022.3159785

Sánchez-Sánchez, E., Groß, D., Prieto-Araujo, E., Dörfler, F., and Gomis-Bellmunt, O. (2020). Optimal multivariable MMC energy-based control for DC voltage regulation in HVDC applications. IEEE Trans. Power Deliv. 35 (2), 999–1009. doi:10.1109/tpwrd.2019.2933771

Shi, X., Chen, N., Wei, T., Wu, J., and Xiao, P. (2021). “A reinforcement learning-based online-training AI controller for DC-DC switching converters,” in 2021 6th international conference on integrated circuits and microsystems (ICICM), October 22–24, 2021. Nanjing, China (Piscataway, NJ: IEEE), 435–438.

Steckler, P.-B., Gauthier, J.-Y., Lin-Shi, X., and Wallart, F. (2022). Differential flatness-based, full-order nonlinear control of a modular multilevel converter (MMC). IEEE Trans. Control Syst. Technol. 30 (2), 547–557. doi:10.1109/tcst.2021.3067887

Vipin, V. N., and Mohan, N. (2025). Sensitivity analysis of the high-frequency-link MMC to DC link voltage ripples in a back-to-back connected MMC-based power electronic transformer. IEEE Trans. Power Electron. 40 (6), 8691–8708. doi:10.1109/tpel.2025.3538605

Wang, Y., Wang, J., Tong, L., and Ye, Q. (2018). Research on MMC circulation control strategy based on adaptive quasi-PR controller. Adv. Technol. Electr. Eng. Energy 37 (12), 24–31.

Ye, J., Guo, H., Wang, B., and Zhang, X. (2024a). Deep deterministic policy gradient algorithm based reinforcement learning controller for single-inductor multiple-output DC–DC converter. IEEE Trans. Power Electron. 39 (4), 4078–4090. doi:10.1109/tpel.2024.3350181

Ye, J., Guo, H., Zhao, D., Wang, B., and Zhang, X. (2024b). TD3 algorithm based reinforcement learning control for multiple-input multiple-output DC–DC converters. IEEE Trans. Power Electron. 39 (10), 12729–12742. doi:10.1109/tpel.2024.3416911

Zhang, W., Li, J., Zhang, M., Yang, X., and Zhong, D. (2025). Research on circulating-current suppression strategy of MMC based on passivity-based integral sliding mode control for multiphase wind power grid-connected systems. Electronics 14 (13), 2722. doi:10.3390/electronics14132722

Zuo, J., Xie, X., and Luo, Y. (2020). “Suppression strategy of circulating current in MMC-HVDC based on Quasi-PR controller,” in 2020 IEEE 5th information technology and mechatronics engineering conference (ITOEC), June 12–14, 2020. Chongqing, China (Piscataway, NJ: IEEE), 149–155.

Keywords: modular multilevel converter (MMC), quasi-PR controller, circulating current suppression, deep reinforcement learning (DRL), optimized control

Citation: Chen Y, Luo X, Lu Y, Duan F, Guo Z and Yan T (2025) An adaptive quasi-PR controller for modular multilevel converters based on deep reinforcement learning. Front. Energy Res. 13:1716873. doi: 10.3389/fenrg.2025.1716873

Received: 01 October 2025; Accepted: 27 October 2025;
Published: 27 November 2025.

Edited by:

Xuewei Pan, Harbin Institute of Technology, Shenzhen, China

Reviewed by:

Shunfeng Yang, Southwest Jiaotong University, China
Ameen Ullah, Shenzhen University, China
Benfei Wang, Sun Yat-sen University, China

Copyright © 2025 Chen, Luo, Lu, Duan, Guo and Yan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yukun Chen, Y2hlbnlrMUBjc2cuY24=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.