Deep Reinforcement Learning Algorithms for Multiple Arc-Welding Robots

The application of deep reinforcement learning to arc welding by multi-robot systems is presented, where the states and actions of each robot are continuous and obstacles are present in the welding environment. To adapt to the time-varying welding task and the local information available to each robot, a multi-agent deep deterministic policy gradient (MADDPG) algorithm is designed with a new set of rewards. Following the paradigm of centralized training with distributed execution, the proposed MADDPG algorithm runs in a distributed manner. Simulation results demonstrate the effectiveness of the proposed method.


INTRODUCTION
Welding control is an important technology in industrial manufacturing because its performance largely determines product quality (Shan et al., 2017). With the development of information technology, coordinated welding control, in which multiple arc-welding robots accomplish a complex welding task together, has received increasing attention; see, for example, Hvilshøj et al. (2012), Feng et al. (2020), and Zhao and Wu (2020). The key to coordinated welding control is to optimize the collaborative welding paths without collision.
The classical approach in this line of research is trajectory planning. In Cao et al. (2006), an artificial potential field algorithm is presented; however, such a method only achieves local optimization. In Enayattabar et al. (2019), a greedy method, the Dijkstra algorithm, is designed only for graphs with positively weighted edges. The so-called A* algorithm is proposed in Song et al. (2019); however, its running time grows exponentially with the size of the space. Therefore, there is a trend toward intelligent algorithms for the welding control problem, such as the bioinspired neural network (Luo and Yang, 2008), the genetic algorithm (Hu and Yang, 2004), the ant colony algorithm (Karaboga and Akay, 2009), and particle swarm optimization (Kennedy and Eberhart, 1995). Because these basic intelligent algorithms suffer from heavy computation and slow convergence in increasingly complex tasks, many improved methods built upon them have been proposed. In Luo et al. (2019), an improved bioinspired neural network is designed to reduce the time cost and the mathematical complexity of trajectory planning. In Nazarahari et al. (2019), an enhanced genetic algorithm is given to improve initial paths in continuous space and find the optimal path between start and destination locations. In Pu et al. (2020), an improved ant colony optimization algorithm integrated with a pseudo-random state transition strategy is designed for three-dimensional space. In Mohammed et al. (2020), an enhanced particle swarm optimization algorithm is presented to find a safer path. In Chen et al. (2017) and Chen et al. (2019), a coordinated path-following control law is designed without any optimization. It is noted that the above methods rely on accurate mathematical models and are therefore difficult to apply in dynamic environments and complex scenarios.
Recently, reinforcement learning methods have stood out in various competitions, for example, the game of Go (Silver et al., 2016) and StarCraft (Vinyals et al., 2019; Sutton and Barto, 2018). Such methods, which use reward values and information from the environment to update an intelligent agent, shed light on coordinated trajectory planning. In Tang et al. (2019), the idea of multi-agent reinforcement learning is introduced for trajectory planning: with knowledge of the whole environment, a rule-based shallow-trial reinforcement learning algorithm is given. In Qie et al. (2019), a reinforcement learning method for continuous state and action spaces is given based on the Actor-Critic (AC) framework. In Lowe et al. (2017), a MADDPG algorithm is designed based on the structure of distributed execution and centralized training. As far as we know, reinforcement learning has not yet been applied to the coordinated welding control problem. This paper deals with the coordinated welding control problem of multi-robot systems. To achieve the time-varying welding task, the optimization of robot trajectories, and collision avoidance, a MADDPG algorithm with a new set of rewards is designed based on the local information available to each robot in the welding environment. To the best of our knowledge, this is the first application of deep reinforcement learning to the coordinated welding control problem.
The remainder of the paper is structured as follows: Section 2 presents the problem formulation of coordinated welding. Section 3 provides the MADDPG with a new set of rewards. Section 4 gives the validation of the algorithm by simulations. Conclusions are given in Section 5.

PROBLEM FORMULATION
Since two or three mechanical arms are generally used in actual ship welding, let us consider n ≥ 2 welding robots, denoted by r_1, . . . , r_n, and m ≥ n welding arcs in the two-dimensional (2D) space, as shown in Figure 1A. Each robot is modeled as a kinematic point with the second-order (double-integrator) dynamics ṗ_i(t) = v_i(t), v̇_i(t) = u_i(t), where p_i(t) ∈ R², v_i(t) ∈ R², and u_i(t) ∈ [−1, 1]² are its position, velocity, and control input, respectively.
The objective of this paper is to optimally accomplish all the welding arcs without any collision. In this paper, the following assumptions are required: 1) a welding arc can be welded by a robot at a constant speed; 2) once a welding arc is accomplished, it cannot be welded again; 3) the states of the welding arcs are accessible to all robots, but each robot only knows its neighbors' states and the obstacle status from local measurements; 4) without loss of generality, all robots and obstacles are round and all welding arcs are straight line segments.
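As a rough illustration of this setup, the following Python sketch implements the double-integrator update and a circular collision check under the assumptions above; the integration step, the class layout, and the default radius are our own illustrative choices and not part of the original formulation.

```python
import numpy as np

DT = 0.1  # illustrative integration step (not specified in the paper)

class WeldingRobot:
    """Point robot with double-integrator dynamics in the 2D plane."""

    def __init__(self, position, radius=0.01):
        self.p = np.asarray(position, dtype=float)  # position p_i
        self.v = np.zeros(2)                        # velocity v_i
        self.radius = radius                        # safe radius D_i

    def step(self, u):
        """Apply a control input u_i in [-1, 1]^2 for one time step."""
        u = np.clip(u, -1.0, 1.0)
        self.v += DT * u        # v̇_i = u_i
        self.p += DT * self.v   # ṗ_i = v_i

def collides(robot, obstacle_center, obstacle_radius):
    """Circular collision check between a robot and a round obstacle."""
    return np.linalg.norm(robot.p - obstacle_center) <= robot.radius + obstacle_radius
```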

MADDPG ALGORITHM
In this section, the main design process is given, referring to MADDPG (Lowe et al., 2017). The environment consists of the agent model, the action space, and the state space, where each agent moves in the 2D environment and the states {p_i, v_i} and the actions u_i are defined in the previous section.
In our algorithm, an actor network, a critic network, and two target networks are used for each robot. Specifically, the actor network consists of two multi-layer perceptron (MLP) layers of 256 and 128 neurons with rectified linear unit (ReLU) activation and softmax action selection on the output layer. The critic network for each robot consists of three MLP layers of 256, 128, and 64 neurons with ReLU activation and softmax action selection on the output layer. The structures of the two target networks are the same as those of the actor network and the critic network, respectively, but their updates are not synchronized with them, which preserves the independence and the distribution of the sampled data. During execution, the actor network outputs the action of each robot for exploration based on the states observed by that robot; the environment then outputs the rewards and the states at the next moment according to the actions. During training, the critic network evaluates the action chosen by each actor network and improves the actor network by constructing a loss function. A mini-batch of tuples is sampled uniformly from the replay buffer D, which stores the states and actions of all robots at the current moment together with the rewards and the states of all robots at the next moment; this mini-batch is the input of each critic network. Learning is organized in episodes, and an episode terminates when all welds are executed or the number of steps reaches the maximum.
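A minimal PyTorch sketch of the actor and critic architectures described above is given below; the input and output dimensions (obs_dim, act_dim, full_obs_dim, full_act_dim) are placeholder names we introduce for illustration, and the critic's output activation is simplified to a linear scalar Q-value here.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: two MLP layers (256, 128) with ReLU and softmax output, as described in the text."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, obs):
        # Softmax action selection on the output layer.
        return torch.softmax(self.net(obs), dim=-1)

class Critic(nn.Module):
    """Critic: three MLP layers (256, 128, 64) with ReLU; takes all robots' states and actions."""

    def __init__(self, full_obs_dim, full_act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(full_obs_dim + full_act_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),  # scalar Q-value (output activation simplified here)
        )

    def forward(self, full_obs, full_act):
        return self.net(torch.cat([full_obs, full_act], dim=-1))
```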
The total reward r_i^total(t) = k_1 r_iw(t) + k_2 r_id(t) + k_3 r_ic(t) consists of three terms, where k_1, k_2, and k_3 are weight coefficients. Each term is described as follows. In Equation 1, the welding-based reward r_iw(t) forces the robots to find the welding arcs that have not been welded. In Equation 2, the distance-based reward r_id(t) is defined in terms of d(i, j, t), the distance from robot i to the starting point of welding arc j at time t, and σ, a small positive value that avoids an invalid distance. Here, sh(j, t) equals 1 when welding arc j has been welded at time t and sh(j, t) = 0 otherwise; sa(i, t) equals 1 when robot i is welding and sa(i, t) = 0 otherwise. The distance-based reward drives each free robot toward the nearest unwelded arc, which achieves the trajectory optimization of each robot. In Equation 3, the collision-avoidance-based reward r_ic(t) is a punishment/reward term for collision avoidance, where D_i and D_m denote the safe radii of robot i and obstacle O_k, respectively, and p_{O_k} denotes the position of the center of obstacle O_k. Let P_t represent the random exploration noise drawn from the standard Gaussian distribution N(0, 1). x(t) = {o_1(t), . . . , o_N(t)} denotes the joint observation of all robots, where o_i(t) is the observation of robot i. μ_{θ_i} denotes the N continuous policies with parameters θ_i, and a_i denotes the action of robot i. Q_i^μ(x, a_1, . . . , a_N) and y^j represent the action-value function and the target action-value of sample j computed by the target critic network, respectively. S, c, j, k, and τ denote the size of the random mini-batch of samples, the discount factor, the index of samples, the index of actions, and the update speed of the target network, respectively.
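The explicit piecewise definitions of the three reward terms (Equations 1-3) are not reproduced here; the Python sketch below shows one plausible shaping that is consistent with the description above, but the specific constants and functional forms are our own illustrative assumptions rather than the paper's equations.

```python
import numpy as np

def total_reward(robot_i, arcs, obstacles, k1=1.0, k2=1.0, k3=1.0, sigma=1e-3):
    """Weighted sum of welding, distance, and collision-avoidance terms (illustrative)."""
    # Welding-based term: reward the robot while it is welding an unfinished arc.
    r_w = 1.0 if robot_i.is_welding else 0.0

    # Distance-based term: a free robot is pulled toward the nearest unwelded arc.
    r_d = 0.0
    unwelded = [a for a in arcs if not a.done]
    if not robot_i.is_welding and unwelded:
        d_min = min(np.linalg.norm(robot_i.p - a.start) for a in unwelded)
        r_d = -d_min / (d_min + sigma)  # bounded penalty shrinking as the robot approaches

    # Collision-avoidance term: punish the robot for entering an obstacle's safe region.
    r_c = 0.0
    for obs in obstacles:
        if np.linalg.norm(robot_i.p - obs.center) <= robot_i.radius + obs.radius:
            r_c = -1.0
            break

    return k1 * r_w + k2 * r_d + k3 * r_c
```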
Based on the above settings, the pseudocode of MADDPG for the multiple arc-welding robots is given in Algorithm 1.
Algorithm 1 Coordinated welding algorithm
1: for episode = 1 to Max-episode do
2:   Initialize a random process P for action exploration.
3:   Receive the initial states x(0).
4:   for t = 1 to Max-step do
5:     For each robot i, select the action a_i = μ_{θ_i}(o_i) + P_t.
6:     Execute the actions a_i, calculate the rewards r_i as in Equations 2-5, and acquire the new state x′ by interacting with the environment based on Equation 1.
7:     Store (x, a, r, x′) in the replay buffer D.
8:     x ← x′
9:     for agent i = 1 to N do
10:      Sample a random mini-batch of S samples (x^j, a^j, r^j, x′^j) from D.
11:      Set y^j = r_i^j + c Q_i^{μ′}(x′^j, a′_1, . . . , a′_N), where a′_k = μ′_k(o_k^j).
12:      Update the critic by minimizing the loss L(θ_i) = (1/S) Σ_j (y^j − Q_i^μ(x^j, a_1^j, . . . , a_N^j))².
13:      Update the actor using the sampled policy gradient.
14:    end for
15:    Update the target network parameters of each agent: θ′_i ← τθ_i + (1 − τ)θ′_i.
16:  end for
17: end for
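As a rough illustration of steps 9-15, the following PyTorch sketch shows the per-agent critic and actor updates together with the soft target update; the batch layout, the agent attribute names, and the optimizer handles are illustrative assumptions layered on the Actor/Critic modules sketched earlier.

```python
import torch
import torch.nn.functional as F

def maddpg_update(agents, batch, gamma=0.95, tau=1e-2):
    """One MADDPG training step for all agents (illustrative sketch)."""
    obs, acts, rews, next_obs = batch  # lists indexed by agent, each a tensor of shape (S, dim)

    # Joint quantities seen by the centralized critics.
    full_obs = torch.cat(obs, dim=-1)
    full_acts = torch.cat(acts, dim=-1)
    full_next_obs = torch.cat(next_obs, dim=-1)
    # Target actions a'_k = mu'_k(o'_k) from every agent's target actor.
    next_acts = torch.cat([ag.target_actor(o) for ag, o in zip(agents, next_obs)], dim=-1)

    for i, ag in enumerate(agents):
        # Critic update: y^j = r_i^j + gamma * Q'_i(x'^j, a'_1, ..., a'_N).
        with torch.no_grad():
            y = rews[i] + gamma * ag.target_critic(full_next_obs, next_acts)
        critic_loss = F.mse_loss(ag.critic(full_obs, full_acts), y)
        ag.critic_optim.zero_grad(); critic_loss.backward(); ag.critic_optim.step()

        # Actor update: ascend Q_i with agent i's own action replaced by mu_i(o_i).
        acts_i = list(acts)
        acts_i[i] = ag.actor(obs[i])
        actor_loss = -ag.critic(full_obs, torch.cat(acts_i, dim=-1)).mean()
        ag.actor_optim.zero_grad(); actor_loss.backward(); ag.actor_optim.step()

        # Soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'.
        for tgt, src in ((ag.target_actor, ag.actor), (ag.target_critic, ag.critic)):
            for tp, sp in zip(tgt.parameters(), src.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * sp.data)
```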

SIMULATION RESULTS AND ANALYSIS
The simulation environment is built with PyTorch and includes three welding robots and four welding arcs. The radii of the robots are 0.01 m and the radii of the obstacles are 0.08 m. The hyperparameters of the neural networks in training are set as follows. The size of the replay buffer D is 100,000. The learning rate of the Adam optimizer is 10⁻³. The discount factor c is 0.95. Training starts after 30 episodes, and the parameter τ is 10⁻². Four sets of weight coefficients are selected for the experiments: k_1 = k_2 = k_3 = 1; k_1 = 5, k_2 = 1, k_3 = 1; k_1 = 1, k_2 = 5, k_3 = 1; and k_1 = 1, k_2 = 1, k_3 = 5. In the simulation, 10,000 training episodes are run to show the performance of the three robots, where each episode consists of 200 step iterations, and the reward curves are averaged every five episodes. A Savitzky-Golay filter is applied in Figure 1 to smooth the data and mitigate the remaining noise. Figure 1B shows that the cumulative rewards for the different weight coefficients increase gradually and finally reach stable values, which implies that a good policy is learned in each case. From Figure 1B, one can also see that the selection of the coefficients does not significantly affect the learning results; in other words, the different parameters do not change the convergence speed much. Figures 2A,B present the trajectories of the robots in two situations with different obstacle positions. Figure 2A shows that robots 1, 2, and 3 first find the nearest welding arcs 1, 2, and 3, and that robot 1 then continues to accomplish welding arc 4 after finishing arc 1. Similar precedence is shown in Figure 2B when the positions of the obstacles are changed. From these figures, we conclude that all robot trajectories are nearly the shortest and there is no collision between the robots and the obstacles.
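For reference, the reward-curve post-processing described above (averaging every five episodes followed by Savitzky-Golay smoothing) can be reproduced with a short Python snippet like the one below; the window length and polynomial order of the filter are illustrative choices, not values reported in the paper.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_rewards(episode_rewards, block=5, window=51, polyorder=3):
    """Average cumulative rewards over blocks of episodes, then apply Savitzky-Golay smoothing."""
    rewards = np.asarray(episode_rewards, dtype=float)
    n_blocks = len(rewards) // block
    block_means = rewards[:n_blocks * block].reshape(n_blocks, block).mean(axis=1)
    # The filter window must be odd and no longer than the number of points.
    window = min(window, n_blocks if n_blocks % 2 == 1 else n_blocks - 1)
    return savgol_filter(block_means, window_length=window, polyorder=polyorder)
```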

CONCLUSION AND FUTURE WORK
A MADDPG algorithm with a new set of rewards is designed for the coordinated welding of multiple arc-welding robots. The proposed MADDPG algorithm is distributed, and only local information is available to each arc-welding robot. In ongoing work, we will address the coordinated welding control problem in three-dimensional space and the situation in which one welding arc is operated by multiple robots.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.