<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">702845</article-id>
<article-id pub-id-type="doi">10.3389/frobt.2021.702845</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Reinforcement Learning and Control of a Lower Extremity Exoskeleton for Squat Assistance</article-title>
<alt-title alt-title-type="left-running-head">Luo et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">Reinforcement Learning Control of Exoskeleton</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Luo</surname>
<given-names>Shuzhen</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1298219/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Androwis</surname>
<given-names>Ghaith</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/375621/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Adamovich</surname>
<given-names>Sergei</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/143498/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Su</surname>
<given-names>Hao</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Nunez</surname>
<given-names>Erick</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zhou</surname>
<given-names>Xianlian</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1241854/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<label>
<sup>1</sup>
</label>Department of Biomedical Engineering, New Jersey Institute of Technology, <addr-line>Newark</addr-line>, <addr-line>NJ</addr-line>, <country>United&#x20;States</country>
</aff>
<aff id="aff2">
<label>
<sup>2</sup>
</label>Kessler Foundation, <addr-line>West Orange</addr-line>, <addr-line>NJ</addr-line>, <country>United&#x20;States</country>
</aff>
<aff id="aff3">
<label>
<sup>3</sup>
</label>Department of Mechanical Engineering, The City University of New York, <addr-line>New York</addr-line>, <addr-line>NY</addr-line>, <country>United&#x20;States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/191261/overview">Nitin Sharma</ext-link>, North Carolina State University, United&#x20;States</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/274505/overview">Yue Chen</ext-link>, University of Arkansas, United&#x20;States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1028007/overview">Xiao Xiao</ext-link>, Southern University of Science and Technology, China</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Xianlian Zhou, <email>alexzhou@njit.edu</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Biomedical Robotics, a section of the journal Frontiers in Robotics and&#x20;AI</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>19</day>
<month>07</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>8</volume>
<elocation-id>702845</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>04</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>29</day>
<month>06</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Luo, Androwis, Adamovich, Su, Nunez and Zhou.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Luo, Androwis, Adamovich, Su, Nunez and Zhou</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>A significant challenge for the control of a robotic lower extremity rehabilitation exoskeleton is to ensure stability and robustness during programmed tasks or motions, which is crucial for the safety of the mobility-impaired user. Because users present various levels of disability, the human-exoskeleton interaction forces and external perturbations are unpredictable and can vary substantially, causing conventional motion controllers to behave unreliably or the robot to fall. In this work, we propose a new reinforcement learning-based motion controller for a lower extremity rehabilitation exoskeleton, aiming to perform collaborative squatting exercises with efficiency, stability, and strong robustness. Unlike most existing rehabilitation exoskeletons, ours has ankle actuation in both the sagittal and frontal planes and is equipped with multiple foot force sensors to estimate the center of pressure (CoP), an important indicator of system balance. The proposed motion controller takes advantage of the CoP information by incorporating it into the state input of the control policy network and adding it to the reward during learning, so that a well-balanced system state is maintained during motion. In addition, we apply dynamics randomization and adversarial force perturbations, including large human interaction forces, during training to further improve control robustness. To evaluate the effectiveness of the learned controller, we conduct numerical experiments with different settings, demonstrating its ability to control the exoskeleton to repetitively perform well-balanced and robust squatting motions under strong perturbations and realistic human interaction forces.</p>
</abstract>
<kwd-group>
<kwd>lower extremity rehabilitation exoskeleton</kwd>
<kwd>reinforcement learning</kwd>
<kwd>center of pressure</kwd>
<kwd>balanced squatting control</kwd>
<kwd>human-exoskeleton interaction</kwd>
</kwd-group>
<contract-sponsor id="cn001">National Institute on Disability, Independent Living, and Rehabilitation Research<named-content content-type="fundref-id">10.13039/100009157</named-content>
</contract-sponsor>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Due to the aging population and other factors, an increasing number of people suffer from neurological disorders, such as stroke, central nervous system disorders, and spinal cord injury (SCI), that affect the patient&#x2019;s mobility. Emerging from the field of robotics, robotic exoskeletons have become a promising solution for enabling mobility-impaired people to perform the activities of daily living (ADLs) (<xref ref-type="bibr" rid="B24">Mergner and Lippi, 2018</xref>; <xref ref-type="bibr" rid="B42">Vouga et&#x20;al., 2017</xref>; <xref ref-type="bibr" rid="B49">Zhang et&#x20;al., 2018</xref>). Lower-limb rehabilitation exoskeletons are wearable bionic devices equipped with powerful actuators that help people regain lower leg function and mobility. With a built-in multi-sensor system, an exoskeleton can recognize the wearer&#x2019;s motion intentions and assist the wearer&#x2019;s motion accordingly (<xref ref-type="bibr" rid="B7">Chen et&#x20;al., 2016</xref>). Compared to traditional physical therapy, rehabilitation exoskeleton robots offer more intensive and repetitive patient training, better quantitative feedback, and improved quality of life for patients (<xref ref-type="bibr" rid="B7">Chen et&#x20;al., 2016</xref>).</p>
<p>A degradation or loss of balance resulting from neuromusculoskeletal disorders or impairments is a common symptom, for instance, in patients with SCI or stroke. Balance training in the presence of external perturbations (<xref ref-type="bibr" rid="B14">Horak et&#x20;al., 1997</xref>) is considered one of the most important factors in evaluating patients&#x2019; rehabilitation performance. A rehabilitation exoskeleton can be employed for balance training to achieve static stability (quiet standing) or dynamic stability (squatting, sit-to-stand, and walking) (<xref ref-type="bibr" rid="B5">Bayon et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B25">Mungai and Grizzle, 2020</xref>; <xref ref-type="bibr" rid="B32">Rajasekaran et&#x20;al., 2015</xref>). Squatting exercises are very common in resistance-training programs because their multi-joint movements are characteristic of most sports and daily living activities. In rehabilitation, squatting is commonly performed as an important exercise during recovery from various lower extremity injuries (<xref ref-type="bibr" rid="B23">McGinty et&#x20;al., 2000</xref>; <xref ref-type="bibr" rid="B34">Salem and Powers, 2001</xref>; <xref ref-type="bibr" rid="B9">Crossley et&#x20;al., 2011</xref>; <xref ref-type="bibr" rid="B45">Yu et&#x20;al., 2019</xref>). Squatting, which is symmetric by nature, can help coordinate bilateral muscle activities and strengthen weaker muscles on one side (e.g., among hemiplegia patients) or on both sides. Compared to walking, squatting is often perceived to be safer for patients who are unable to perform these activities independently. In addition, the range of motion and the joint torques required for squatting are often greater than those for walking (<xref ref-type="bibr" rid="B46">Yu et&#x20;al., 2019</xref>). With a reliable lower extremity rehabilitation exoskeleton, performing squatting exercises without external help (e.g., from a clinician) will be a confidence boost for patients to use the exoskeleton independently. However, for exoskeletons to cooperate with the human without risk of harm, advanced balance controllers must be developed that can robustly perform squatting motions under a broad range of environmental conditions and external perturbations.</p>
<p>Most existing lower extremity rehabilitation exoskeletons require the human operator to use canes for additional support or to have a clinician or helper provide balance assistance to avoid falling. They often offer assistance via pre-planned gait trajectories and provide limited control for performing diverse human motions. Some well-known exoskeletons include the ReWalk (ReWalk Robotics), Ekso (Ekso Bionics), Indego (Parker Hannifin), TWIICE (<xref ref-type="bibr" rid="B42">Vouga et&#x20;al., 2017</xref>), and VariLeg (<xref ref-type="bibr" rid="B36">Schrade et&#x20;al., 2018</xref>). When holding crutches, the patient&#x2019;s interactions with the environment are also limited (<xref ref-type="bibr" rid="B4">Baud et&#x20;al., 2019</xref>); for example, a patient is unlikely to perform squatting with long canes or crutches. Very few exoskeletons can assist human motions such as walking without crutches or helpers, with the exception of a few well-known ones: the Rex (Rex Bionics) (<xref ref-type="bibr" rid="B6">Bionics, 2020</xref>) and the Atalante (Wandercraft) (<xref ref-type="bibr" rid="B43">Wandercraft, 2020</xref>). These exoskeletons free the user&#x2019;s hands but come at the cost of a very low walking speed and an increased overall weight (38&#xa0;kg for the Rex, 60&#xa0;kg for the Atalante), and they are very expensive (<xref ref-type="bibr" rid="B42">Vouga et&#x20;al., 2017</xref>). In this paper, we introduce a relatively lightweight lower extremity exoskeleton with a sufficient number of degrees of freedom (DoFs) and strong actuation. On each side, this exoskeleton has a one-DoF hip flexion/extension joint, a one-DoF knee flexion/extension joint, and a two-DoF ankle joint that can perform dorsi/plantar flexion as well as inversion/eversion, allowing the center of mass to be swung laterally in the frontal plane. Moreover, four force sensors are mounted on each foot for accurate measurement of ground reaction forces (GRFs) and estimation of the center of pressure (CoP), enabling automatic balance control without external crutch assistance.</p>
<p>Designing a robust balance control policy for a lower extremity exoskeleton is particularly important and represents a crucial challenge due to the balance requirement and safety concerns for the user (<xref ref-type="bibr" rid="B7">Chen et&#x20;al., 2016</xref>; <xref ref-type="bibr" rid="B19">Kumar et&#x20;al., 2020</xref>). First, the control policy needs to run in real time within the limited sensing and computational capabilities of the exoskeleton. Second, due to the varying levels of patients&#x2019; disability, the human-exoskeleton interaction forces are unpredictable, could vary substantially, and could cause conventional motion controllers to behave unreliably or the robot to fall. Virtual testing of a controller with realistic human interactions in simulation is very challenging, and the risk of testing on real humans is even greater. To the best of our knowledge, robust controllers against large and uncertain perturbation forces (e.g., due to human interactions) have rarely been investigated, as biped balance control even without perturbation is itself a challenging task. Most existing balance controller designs for lower extremity rehabilitation exoskeletons have focused on trajectory tracking and conventional control methods such as Proportional&#x2013;Integral&#x2013;Derivative (PID) control (<xref ref-type="bibr" rid="B44">Xiong, 2014</xref>), model-based predictive control (<xref ref-type="bibr" rid="B38">Shi et&#x20;al., 2019</xref>), fuzzy control (<xref ref-type="bibr" rid="B2">Ayas and Altas, 2017</xref>), impedance control (<xref ref-type="bibr" rid="B15">Hu et&#x20;al., 2012</xref>; <xref ref-type="bibr" rid="B17">Karunakaran et&#x20;al., 2020</xref>), and momentum-based control for standing (<xref ref-type="bibr" rid="B5">Bayon et&#x20;al., 2020</xref>). Although trajectory tracking approaches can be easily applied to regular motions, their robustness against unexpected large perturbations is limited. On the other hand, model-based predictive control can be ineffective or even unstable due to inaccurate dynamics modeling, and it typically requires laborious task-specific parameter tuning. Momentum-based control strategies, first applied in humanoid robotics to achieve standing and walking balance (<xref ref-type="bibr" rid="B21">Lee and Goswami, 2012</xref>; <xref ref-type="bibr" rid="B18">Koolen et&#x20;al., 2016</xref>), have also been applied to impose standing balance on exoskeletons (<xref ref-type="bibr" rid="B5">Bayon et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B11">Emmens et&#x20;al., 2018</xref>). This method aims to simultaneously regulate both the linear and angular components of the whole-body momentum to maintain balance with the desired GRF and CoP at each support foot. The movement of the system CoP is an important indicator of system balance (<xref ref-type="bibr" rid="B21">Lee and Goswami, 2012</xref>). When the CoP leaves or is about to leave the support base, toppling of a foot or loss of balance is imminent, and the control goal is to bring the CoP back inside the support base to maintain balance and stability. Although the CoP can sometimes be estimated from the robot dynamics (<xref ref-type="bibr" rid="B21">Lee and Goswami, 2012</xref>), the reliability of such an estimate depends heavily on the accuracy of the robot model and the sensing of joint states. When a human user is involved, it is almost impossible to estimate the CoP accurately because of the difficulty of estimating the user&#x2019;s dynamic properties or real-time joint motions. Therefore, it is highly desirable to measure the foot CoP directly and accurately. In most existing lower extremity rehabilitation exoskeletons, the mechanical structure of the foot is either relatively simple, with no force or pressure sensors, or has sensors but no capability to process the GRF and CoP information for real-time fall detection or balance control. In this work, we present a lower extremity rehabilitation exoskeleton with force sensors mounted on each foot for accurate estimation of the CoP. Inspired by the CoP&#x2019;s association with balance and stability (<xref ref-type="bibr" rid="B21">Lee and Goswami, 2012</xref>), this paper explores a robust motion controller that encourages the system CoP to stay inside a stable region when subjected to uncertain human interactions and perturbations.</p>
<p>Recently, model-free control methods such as reinforcement learning (RL) have promised to overcome the limitations of model-based approaches that require an accurate dynamic model. RL has gained considerable attention in legged robot control for its capability to produce controllers that can perform a wide range of complicated tasks (<xref ref-type="bibr" rid="B29">Peng et&#x20;al., 2016</xref>; <xref ref-type="bibr" rid="B30">Peng and van de Panne, 2017</xref>; <xref ref-type="bibr" rid="B28">Peng et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B16">Hwangbo et&#x20;al., 2019</xref>; <xref ref-type="bibr" rid="B31">Peng et&#x20;al., 2020</xref>). However, RL-based balance control of lower extremity rehabilitation exoskeletons performing squatting motion has not been investigated before, especially when balancing with a human strapped inside the exoskeleton. Since the coupling between the human and the exoskeleton can lead to unexpected perturbation forces, it is highly desirable to develop a robust controller that learns collaborative human-robot squatting skills. In this paper, we propose a novel RL-based robust control framework to train a control policy that operates on the exoskeleton in real time and overcomes external perturbations and unpredictable, varying human-exoskeleton interaction&#x20;forces.</p>
<p>The central contributions of this work are summarized in the following:<list list-type="simple">
<list-item>
<p>&#x2022; We build a novel RL-based motion control framework for a lower extremity rehabilitation exoskeleton to imitate realistic human squatting motion under random adversarial perturbations or large, uncertain human-exoskeleton interaction forces.</p>
</list-item>
<list-item>
<p>&#x2022; We take advantage of the foot CoP information by incorporating it into the state input of the control policy as well as the reward function to produce a balance controller that is robust against various perturbations.</p>
</list-item>
<list-item>
<p>&#x2022; We demonstrate that the lightweight exoskeleton can carry a human to perform robust and well-balanced squatting motions in a virtual environment with an integrated exoskeleton and full-body human musculoskeletal&#x20;model.</p>
</list-item>
</list>
</p>
<p>To demonstrate the effectiveness and robustness of the proposed control framework, we conduct a set of numerical experiments under external random perturbations and varying human-exoskeleton interactions. Dynamics randomization is incorporated into the training to minimize the effects of model inaccuracy and to prepare for sim-to-real transfer.</p>
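Dynamics randomization, as used here, resamples model parameters at the start of each training episode so the learned policy does not overfit to one exact simulation model. A minimal sketch follows; the parameter names and the &#xb1;20% range are illustrative assumptions, not the paper&#x2019;s actual randomization settings:

```python
import random

def sample_dynamics(nominal, scale=0.2):
    """Return a randomized copy of nominal dynamics parameters.

    Each parameter is perturbed uniformly within +/- `scale` of its
    nominal value at the start of every training episode (hypothetical
    parameter names; the paper's ranges may differ).
    """
    return {k: v * random.uniform(1.0 - scale, 1.0 + scale)
            for k, v in nominal.items()}

# Example: resample exoskeleton dynamics for one training episode.
nominal = {"link_mass": 20.4, "joint_friction": 0.05, "motor_gain": 1.0}
episode_params = sample_dynamics(nominal)
```

A policy trained across many such resampled models tends to remain stable when the real hardware deviates from the nominal model, which is the point of preparing for sim-to-real transfer.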
</sec>
<sec id="s2">
<title>2 Exoskeleton and Interaction Modeling</title>
<sec id="s2-1">
<title>2.1 Mechanical Design of a Lower Extremity Robotic Exoskeleton</title>
<p>A lower extremity robotic exoskeleton device (<xref ref-type="bibr" rid="B1">Androwis et&#x20;al., 2017</xref>) is currently under development by the authors to assist patients with ADLs such as balance, ambulation, and gait rehabilitation. The physical prototype of this exoskeleton is shown in <xref ref-type="fig" rid="F1">Figure&#x20;1B</xref>. The total mass of the exoskeleton is <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mn>20.4</mml:mn>
<mml:mi>k</mml:mi>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and the frame of the exoskeleton has been manufactured from Onyx (Markforged&#x2019;s nylon with chopped carbon fiber), reinforced with continuous carbon fiber between layers, using Markforged&#x2019;s Mark Two printer (Markforged, Inc., MA). The exoskeleton has 14 independent DoFs (including six global DoFs for the pelvis root joint): on each side of the body, one DoF for hip flexion/extension, one DoF for knee flexion/extension, and two DoFs for ankle dorsiflexion/plantarflexion and inversion/eversion. The six-DoF pelvis (root) joint is a free, unactuated joint, and the remaining eight DoFs on both sides are actuated. Unlike most commercial rehabilitation exoskeletons, which have either passive or fixed ankles, the ankle of our system includes two powered DoFs to assist with dorsiflexion/plantarflexion and inversion/eversion (<xref ref-type="bibr" rid="B26">Nunez et&#x20;al., 2017</xref>). All joints of the robotic exoskeleton are powered by smart actuators (Dynamixel Pro H54-200-S500-R motors).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>The lower extremity exoskeleton and integrated human-exoskeleton model. <bold>(A)</bold> The multibody dynamics model of the exoskeleton with joint frames (red/green/blue axes) rendered in place. <bold>(B)</bold> The physical prototype of the exoskeleton. <bold>(C)</bold> The foot model with force sensor locations and the two independent ankle DoFs indicated. The joint axes of the model are displayed as well. <bold>(D)</bold> The integrated human and lower extremity exoskeleton model. <bold>(E)</bold> Spring (yellow) connections between the human and the straps on the robot. <bold>(F)</bold> Side view.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g001.tif"/>
</fig>
<p>Both the hip and knee joints are driven by bevel gears for a compact design. The ankle is actuated by two parallel motors that are attached to the posterior side of the shank support and operate simultaneously in a closed loop to flex or abduct the ankle. <xref ref-type="fig" rid="F1">Figure&#x20;1C</xref> shows the foot model of our exoskeleton system. At the bottom of each foot plate, four 2000&#xa0;N 3-axis force transducers (OptoForce Kft., Hungary) are installed to measure GRFs. These measured forces can be used to determine the CoP in real time. The CoP is the point on the ground where the tipping moment acting on the foot equals zero (<xref ref-type="bibr" rid="B35">Sardain and Bessonnet, 2004</xref>), with the tipping moment defined as the moment component tangential to the foot support surface. Let <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:msub>
<mml:mi>O</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> denote the location of the <italic>i</italic>th force sensor and <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> be its measured force, <inline-formula id="inf5">
<mml:math id="m5">
<mml:mi>n</mml:mi>
</mml:math>
</inline-formula> denote the unit normal vector of the foot support surface, and <inline-formula id="inf6">
<mml:math id="m6">
<mml:mi>C</mml:mi>
</mml:math>
</inline-formula> denote the CoP point where the tipping moment vanishes:<disp-formula id="e1">
<mml:math id="m7">
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mn>4</mml:mn>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>O</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mi>C</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>from which <inline-formula id="inf7">
<mml:math id="m8">
<mml:mi>C</mml:mi>
</mml:math>
</inline-formula> can be computed.</p>
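For a flat support surface with unit normal pointing vertically and the four sensors lying in that surface, Eq. 1 admits a closed-form solution: the CoP is the average of the sensor locations weighted by their normal (vertical) force components. The sketch below illustrates this computation under those assumptions (hypothetical function names; not the authors&#x2019; firmware implementation):

```python
def center_of_pressure(sensor_positions, sensor_forces):
    """Estimate the foot CoP from four 3-axis force sensor readings (Eq. 1).

    Assumes a flat support surface with unit normal n = (0, 0, 1) and
    sensors lying in that surface, so the tipping-moment condition
    reduces to averaging sensor locations weighted by the normal (z)
    force components.  Illustrative sketch only.
    """
    fz_total = sum(f[2] for f in sensor_forces)
    if fz_total <= 0.0:
        return None  # foot unloaded: CoP is undefined
    cx = sum(p[0] * f[2] for p, f in zip(sensor_positions, sensor_forces)) / fz_total
    cy = sum(p[1] * f[2] for p, f in zip(sensor_positions, sensor_forces)) / fz_total
    return (cx, cy, 0.0)
```

With equal loads on four corner sensors the CoP falls at the centroid of the foot plate; as load shifts toward the toes the CoP moves forward, and its approach to the support-base boundary signals an impending loss of balance.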
</sec>
<sec id="s2-2">
<title>2.2 Exoskeleton Modeling</title>
<p>A multibody dynamics model of the exoskeleton, shown in <xref ref-type="fig" rid="F1">Figure&#x20;1A</xref>, is created with mass and inertia properties estimated from each part&#x2019;s 3D geometry and density or from its measured mass. Simple revolute joints are used for the hips and knees. Each ankle has two independent rotational DoFs, but their rotation axes are located at different positions (<xref ref-type="fig" rid="F1">Figure&#x20;1C</xref>). These two DoFs are physically driven by a closed loop of two ankle motors together with a linkage of universal and screw joints. When extending or flexing the ankle, both motors work in sync to move the two long screws up or down simultaneously. When adducting or abducting the ankle, the motors move the screws in opposite directions. The motors can generate up to 160<inline-formula id="inf8">
<mml:math id="m9">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> of dorsi/plantar flexion torque (<xref ref-type="bibr" rid="B26">Nunez et&#x20;al., 2017</xref>). In this study, we do not directly control these ankle motors; instead, we assume that the two ankle DoFs can be independently controlled with sufficient torque from these motors.</p>
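The in-sync/opposite-direction drive described above behaves like a differential coupling between the two screw motors and the two ankle DoFs. A minimal sketch under an idealized linear assumption follows (the real universal-joint and screw-joint linkage has its own, more complex kinematics):

```python
def ankle_motor_commands(flexion_cmd, eversion_cmd):
    """Map desired ankle motions to the two parallel screw motors.

    Idealized differential coupling (an illustrative assumption):
    equal screw displacements flex/extend the ankle, while opposite
    displacements evert/invert it.
    """
    left = flexion_cmd + eversion_cmd
    right = flexion_cmd - eversion_cmd
    return left, right

def ankle_motion(left, right):
    """Inverse map: recover flexion and eversion from screw commands."""
    return (left + right) / 2.0, (left - right) / 2.0
```

Under this idealization the two maps are exact inverses, which is why the paper can treat the two ankle DoFs as independently controllable given sufficient motor torque.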
</sec>
<sec id="s2-3">
<title>2.3 Modeling of Human Exoskeleton Interactions</title>
<sec id="s2-3-1">
<title>2.3.1 Human Musculoskeletal Model</title>
<p>To simulate realistic human-exoskeleton interactions, the exoskeleton is integrated with a full-body human musculoskeletal model (<xref ref-type="bibr" rid="B22">Lee et&#x20;al., 2019</xref>) that has a total mass of <inline-formula id="inf9">
<mml:math id="m10">
<mml:mrow>
<mml:mn>61</mml:mn>
<mml:mi>k</mml:mi>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and includes 284 musculotendon units. The integrated human and lower extremity exoskeleton model is shown in <xref ref-type="fig" rid="F1">Figure&#x20;1D</xref>. Each muscle&#x2019;s geometry is modeled as a polyline that starts at the origin of the muscle, passes through a sequence of way points, and ends at the insertion (<xref ref-type="bibr" rid="B22">Lee et&#x20;al., 2019</xref>). The muscle dynamics are simulated using a Hill-type model (<xref ref-type="bibr" rid="B40">Thelen, 2003</xref>; <xref ref-type="bibr" rid="B10">Delp et&#x20;al., 2007</xref>), which includes the muscle force-velocity and force-length relations as well as the muscle activation as follows:<disp-formula id="e2">
<mml:math id="m11">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>L</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>V</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>l</mml:mi>
<mml:mo>&#x2d9;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>where <inline-formula id="inf10">
<mml:math id="m12">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the maximum isometric muscle force, <inline-formula id="inf11">
<mml:math id="m13">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mn>0,1</mml:mn>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the muscle activation, and <italic>l</italic> is the normalized muscle length. <inline-formula id="inf12">
<mml:math id="m14">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>L</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf13">
<mml:math id="m15">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>V</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are normalized force-length and force-velocity functions, respectively. The normalized passive force-length relationship of muscle <inline-formula id="inf14">
<mml:math id="m16">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is represented by an exponential function:<disp-formula id="e3">
<mml:math id="m17">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mi>P</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>l</mml:mi>
<mml:mi>M</mml:mi>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>&#x3b5;</mml:mi>
<mml:mn>0</mml:mn>
<mml:mi>M</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mi>P</mml:mi>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>where <inline-formula id="inf16">
<mml:math id="m19">
<mml:mrow>
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mi>P</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is an exponential shape factor set to 4, and <inline-formula id="inf17">
<mml:math id="m20">
<mml:mrow>
<mml:msubsup>
<mml:mi>&#x3b5;</mml:mi>
<mml:mn>0</mml:mn>
<mml:mi>M</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is a fixed passive muscle&#x20;strain (at which point <inline-formula id="inf18">
<mml:math id="m21">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) and is set to 0.6. When the muscle is fully passive (zero activation, i.e., without active muscle contraction), the muscle can only develop the passive force <inline-formula id="inf19">
<mml:math id="m22">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> because of its background elasticity due to stretch. In this paper, we do not consider active muscle contraction of the human operator, since the operator could be a patient with spinal cord injury or stroke who has very limited or no control of his or her own body. Nonetheless, the passive muscle forces <inline-formula id="inf20">
<mml:math id="m23">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> produced from all musculotendon units during movement are incorporated in the training to influence the human dynamics.</p>
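The passive force-length relation of Eq. 3 can be sanity-checked numerically: the normalized force is 0 at the resting fiber length and reaches 1 at the fixed passive strain. The following is a minimal Python sketch of Eq. 3; the function and variable names are ours, not from the paper.

```python
import math

def passive_force(l_m, k_p=4.0, eps0=0.6):
    """Normalized passive muscle force F_P (Eq. 3).

    l_m:  normalized muscle fiber length l^M
    k_p:  exponential shape factor k^P (set to 4 in the paper)
    eps0: passive muscle strain eps_0^M at which F_P = 1 (set to 0.6)
    """
    return (math.exp(k_p * (l_m - 1.0) / eps0) - 1.0) / (math.exp(k_p) - 1.0)

# The passive force of a musculotendon unit then scales with its maximum
# isometric force: F_passive = passive_force(l_m) * F_max.
```

At the resting length (`l_m = 1`) the passive force vanishes, and at a strain of 0.6 (`l_m = 1.6`) it equals the maximum isometric force, as stated in the text.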
</sec>
<sec id="s2-3-2">
<title>2.3.2&#x20;Human-Exoskeleton Interactions</title>
<p>To integrate the human musculoskeletal model with the exoskeleton, the root of the human model (pelvis) is attached to the exoskeleton hip through a prismatic joint that only allows relative movement along the vertical (up and down) direction (all rotations and the translations along the lateral and fore-and-aft directions are fixed). The assembled exoskeleton, as shown in <xref ref-type="fig" rid="F1">Figure&#x20;1D</xref>, has straps around the hip, femur and tibia to constrain the human motion. Here we utilize spring models to simulate the strap forces. <xref ref-type="fig" rid="F1">Figure&#x20;1E</xref> shows the spring connections between the human and exoskeleton at the strap locations, with the yellow zigzag lines illustrating the springs. As shown in <xref ref-type="fig" rid="F1">Figure&#x20;1E</xref>, an elastic spring with stiffness <inline-formula id="inf21">
<mml:math id="m24">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>h</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mn>10000</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is utilized to generate a vertical force and simulate the interaction between the human pelvis and the robot waist structure. At each femur strap location, we use four springs (arranged 90&#xb0; apart) with stiffness <inline-formula id="inf22">
<mml:math id="m25">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mn>2000</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> to simulate the connections. A similar setup with stiffness <inline-formula id="inf23">
<mml:math id="m26">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mn>2000</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> are used for the tibia strap connection. We choose these spring stiffness values based on empirical testing; the selected values produce a tight connection that allows only a small, realistic relative movement between the human and exoskeleton. During movement, the spring models generate varying interaction forces that act on both the human and the exoskeleton robot. We assume there is no relative motion between the human foot and the exoskeleton foot due to tight coupling and model that connection as a weld constraint. For simplicity, the joints of the human upper limbs are fixed as weld joints.</p>
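The strap interaction described above is ordinary Hooke's-law elasticity, so each spring contributes an equal-and-opposite force pair proportional to the relative displacement of its attachment points. A minimal sketch, assuming a 1-D spring along its axis (function names are ours):

```python
def strap_force(k, x_human, x_exo):
    """Elastic strap force (Hooke's law) applied to the human side.

    k:        spring stiffness in N/m
    x_human:  human attachment-point position (m) along the spring axis
    x_exo:    exoskeleton attachment-point position (m)
    The exoskeleton receives the opposite force (Newton's third law).
    """
    return -k * (x_human - x_exo)

# Stiffness values used in the paper (chosen empirically):
K_HIP, K_FEMUR, K_TIBIA = 10000.0, 2000.0, 2000.0

# Example: a 5 mm vertical offset at the hip produces a 50 N restoring force.
f_on_human = strap_force(K_HIP, 0.805, 0.800)
f_on_exo = -f_on_human
```

The stiffer hip spring keeps the pelvis closely tracked while the softer femur and tibia springs permit slightly larger relative motion, consistent with the "tight but compliant" coupling the text describes.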
</sec>
</sec>
</sec>
<sec id="s3">
<title>3 Learning Controller for Balanced Motion</title>
<p>In this section, we propose a robust motion control training and testing framework based on RL that enables the exoskeleton to learn a squatting skill with strong balance and robustness. <xref ref-type="fig" rid="F2">Figure&#x20;2</xref> shows the overall learning process of the motion controller. The details of the learning process are introduced in the following sections.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>The overall learning process of the integrated robust motion controller. We construct the learning controller as a Multi-Layer Perceptron (MLP) neural network that consists of three hidden layers. The neural network parameters are updated using the policy gradient method. The network produces joint target positions, which are translated into torque-level commands by PD (Proportional-Derivative) control.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g002.tif"/>
</fig>
<sec id="s3-1">
<title>3.1 Reinforcement Learning With Motion Imitation</title>
<p>The controller (or control policy) is learned through a continuous RL process. We design the control policy as a neural network with parameters <italic>&#x3b8;</italic>, denoting the weights and biases of the network. The control policy can be expressed as <inline-formula id="inf24">
<mml:math id="m27">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>&#x3b8;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and the agent (neural network model) learns to update the parameters <italic>&#x3b8;</italic> to achieve the maximum reward. The learning controller (i.e.,&#x20;control policy network) is implemented as a Multi-Layer Perceptron (MLP) network that consists of three fully connected layers with ReLU activation functions; as illustrated in <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>, the sizes of the three layers are set to 256, 256 and 128, respectively. At every time step <italic>t</italic>, the agent observes the state <inline-formula id="inf25">
<mml:math id="m28">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> from the environment, and selects an action <inline-formula id="inf26">
<mml:math id="m29">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> according to its control policy <inline-formula id="inf27">
<mml:math id="m30">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. The control policy network <inline-formula id="inf28">
<mml:math id="m31">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is in the form of the probability distribution of actions in a given state. This action distribution is modeled as a Gaussian, with a state dependent mean <inline-formula id="inf29">
<mml:math id="m32">
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> defined by the network, and a standard deviation (STD) <inline-formula id="inf30">
<mml:math id="m33">
<mml:mtext>&#x3a3;</mml:mtext>
</mml:math>
</inline-formula> that is learned as part of the neural network parameters:<disp-formula id="e4">
<mml:math id="m34">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="script">N</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mtext>&#x3a3;</mml:mtext>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>The action <inline-formula id="inf31">
<mml:math id="m35">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is sampled from this distribution, and then converted to control commands that drive the exoskeleton in the environment, which results in a new state <inline-formula id="inf32">
<mml:math id="m36">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and a scalar reward <inline-formula id="inf33">
<mml:math id="m37">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> immediately. The objective is to learn a control policy that maximizes the discounted sum of rewards:<disp-formula id="e5">
<mml:math id="m38">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>arg</mml:mtext>
<mml:munder>
<mml:mrow>
<mml:mtext>max</mml:mtext>
</mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:munder>
<mml:msub>
<mml:mi mathvariant="double-struck">E</mml:mi>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x223c;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:msup>
<mml:mi>&#x3b3;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msup>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>where <inline-formula id="inf34">
<mml:math id="m39">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>0,1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the discount factor, <italic>&#x3c4;</italic> is the trajectory <inline-formula id="inf35">
<mml:math id="m40">
<mml:mrow>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf36">
<mml:math id="m41">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> denotes the likelihood of a trajectory <italic>&#x3c4;</italic> under a given control policy <italic>&#x3c0;</italic>, and <italic>T</italic> is the horizon of an episode.</p>
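The two building blocks just defined, the Gaussian action distribution of Eq. 4 and the discounted return of Eq. 5, can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation; names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(mu, sigma):
    """Draw a_t ~ N(mu(s), Sigma) as in Eq. 4.

    mu:    state-dependent mean produced by the policy network
    sigma: learned per-dimension standard deviation
    """
    return mu + sigma * rng.standard_normal(mu.shape)

def discounted_return(rewards, gamma=0.99):
    """Sum_{t=0}^{T-1} gamma^t * r_t, the quantity maximized in Eq. 5."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```

The policy objective is the expectation of `discounted_return` over trajectories generated by repeatedly calling `sample_action` and stepping the environment.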
<p>The control policy is task- or motion-specific and starts with motion imitation. Within the learning process (<xref ref-type="fig" rid="F2">Figure&#x20;2</xref>), the input of the control policy network is defined by <inline-formula id="inf37">
<mml:math id="m42">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, in which <italic>p</italic> and <italic>v</italic> are joint positions and velocities of the exoskeleton, and <inline-formula id="inf38">
<mml:math id="m43">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the action history of the three previous steps. To learn a particular skill, we utilize the corresponding target joint poses from the task motion at six future time steps <inline-formula id="inf39">
<mml:math id="m44">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> as the motion prior for feasible control strategies. Considering the importance of CoP as an indicator of system balance and its ready availability from our exoskeleton design, we incorporate the CoP position history <inline-formula id="inf40">
<mml:math id="m45">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> into the state as feedback to the controller. As summarized in <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>, the learning controller takes as input the combination of the state history, the action history and the future target motions, and outputs the joint target positions as actions. The use of task motion data alleviates the need to design task-specific reward functions and thereby facilitates a general framework for learning a diverse array of behaviors.</p>
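The state definition above (three steps of joint positions, velocities, and CoP history, three previous actions, and six future target poses) amounts to concatenating several rolling buffers into one vector. A minimal sketch of this bookkeeping, with illustrative class and method names and example dimensions that are ours, not the paper's:

```python
from collections import deque

import numpy as np

class StateBuffer:
    """Assemble s = {p_{t-2:t}, v_{t-2:t}, c_{t-2:t}, a_{t-3:t-1},
    p_hat_{t+1:t+6}} from rolling histories (Section 3.1)."""

    def __init__(self):
        self.p = deque(maxlen=3)  # joint position history p_{t-2:t}
        self.v = deque(maxlen=3)  # joint velocity history v_{t-2:t}
        self.c = deque(maxlen=3)  # CoP position history c_{t-2:t}
        self.a = deque(maxlen=3)  # previous actions a_{t-3:t-1}

    def observe(self, p_t, v_t, c_t, a_prev):
        self.p.append(p_t)
        self.v.append(v_t)
        self.c.append(c_t)
        self.a.append(a_prev)

    def state(self, target_future):
        """target_future: target joint poses for the six future steps."""
        parts = list(self.p) + list(self.v) + list(self.c) + list(self.a)
        return np.concatenate([np.ravel(x) for x in parts + [target_future]])
```

With, say, 10 actuated joints and a 2-D CoP, each call to `state` yields a fixed-length vector the MLP can consume directly.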
</sec>
<sec id="s3-2">
<title>3.2 Learning With Proximal Policy Optimization</title>
<p>To train the control policy network, we use a model-free RL algorithm known as Proximal Policy Optimization (PPO). An effective solution to many RL problems is the family of policy gradient algorithms, in which the gradient of the expected return with respect to the policy parameters is computed and used to update the parameters <italic>&#x3b8;</italic> through gradient ascent. PPO is a model-free policy gradient algorithm that samples data through interaction with the environment and optimizes a &#x201c;surrogate&#x201d; objective function (<xref ref-type="bibr" rid="B37">Schulman et&#x20;al., 2017</xref>). It utilizes a trust-region-like constraint on the control policy update to ensure that the new policy does not deviate too far from the old policy. The probability ratio <inline-formula id="inf41">
<mml:math id="m46">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is defined by:<disp-formula id="e6">
<mml:math id="m47">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>&#x3b8;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>&#x3b8;</mml:mi>
</mml:msub>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>This probability ratio is a measure of how different the current policy is from the old policy <inline-formula id="inf42">
<mml:math id="m48">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>&#x3b8;</mml:mi>
</mml:msub>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (the policy before the last update). A large value of this ratio means that there is a large change in the updated policy compared to the old one. PPO also introduces a modified objective function that adopts clipped probability ratios, which form a pessimistic estimate of the policy&#x2019;s performance and avoid a reduction in performance during training. The &#x201c;surrogate&#x201d; objective function incorporates the clipped objective:<disp-formula id="e7">
<mml:math id="m49">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi mathvariant="double-struck">E</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>A</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="italic">&#x3f5;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">&#x3f5;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>A</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>where <italic>&#x3f5;</italic> is a small positive constant that constrains the probability ratio <inline-formula id="inf43">
<mml:math id="m50">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. <inline-formula id="inf44">
<mml:math id="m51">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>A</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> denotes the advantage value at time step <italic>t</italic>. <inline-formula id="inf45">
<mml:math id="m52">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the clipping function. Clipping the probability ratio discourages the policy from changing too much, and taking the minimum results in using the lower, pessimistic bound of the unclipped objective. Thus a change in the probability ratio is included only when it makes the objective worse, and is ignored otherwise. This prevents the policy from changing too quickly and leads to more stable learning. The control policy is updated by maximizing the clipped discounted total reward in <xref ref-type="disp-formula" rid="e7">Eq. 7</xref> with gradient ascent.</p>
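The probability ratio of Eq. 6 and the clipped surrogate of Eq. 7 translate directly into a few lines of NumPy. This is a sketch for intuition, not the training code; the clip constant `eps=0.2` is a common PPO default, not a value reported in the paper.

```python
import numpy as np

def ppo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective L(theta) of Eq. 7.

    logp_new, logp_old: per-sample log pi_theta(a_t|s_t) under the
    current and old policies; advantages: estimates of A_hat_t.
    """
    ratio = np.exp(logp_new - logp_old)             # r_t(theta), Eq. 6
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)  # clip(r_t, 1-eps, 1+eps)
    # Take the elementwise minimum: the pessimistic bound on the objective.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When the new and old policies coincide the ratio is 1 and the objective reduces to the mean advantage; when the ratio strays outside `[1-eps, 1+eps]` with a positive advantage, the clipped term caps the incentive to move further.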
</sec>
<sec id="s3-3">
<title>3.3 Proportional-Derivative-Based Torque</title>
<p>It has been reported that the performance of RL for continuous control depends on the choice of action types (<xref ref-type="bibr" rid="B30">Peng and van de Panne, 2017</xref>): learning works better when the control policy outputs PD position targets rather than joint torques directly. In this study, we therefore let the learning controller output the joint target positions as actions (shown in <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>). To obtain smooth motions, actions from the control policy network are first processed by a second-order low-pass filter before being applied to the robot. Our learning process allows the control policy network and the environment to operate at different frequencies, since the environment often requires a small time step for integration. During each integration step, we apply preprocessed actions that are linearly interpolated from two consecutive filtered actions. The preprocessed actions are then specified as PD targets, and the final PD-based torques applied to each joint are calculated as<disp-formula id="e8">
<mml:math id="m53">
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>p</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x2d9;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(8)</label>
</disp-formula>where <inline-formula id="inf46">
<mml:math id="m54">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>p</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf47">
<mml:math id="m55">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are the proportional and derivative gains, respectively. The function <inline-formula id="inf48">
<mml:math id="m56">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> returns the upper bound <inline-formula id="inf49">
<mml:math id="m57">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> or the lower bound <inline-formula id="inf50">
<mml:math id="m58">
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> if the torque <italic>&#x3c4;</italic> exceeds the&#x20;limit.</p>
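Eq. 8 is a standard saturated PD law and can be sketched in one short function. Gains and the torque limit below are placeholders, not values from the paper.

```python
import numpy as np

def pd_torque(a_t, p_t, pdot_t, kp, kv, tau_max):
    """PD-based joint torque of Eq. 8, saturated at +/- tau_max.

    a_t:     PD target positions (actions from the policy)
    p_t:     current joint positions
    pdot_t:  current joint velocities
    kp, kv:  proportional and derivative gains
    """
    tau = kp * (a_t - p_t) - kv * pdot_t
    return np.clip(tau, -tau_max, tau_max)
```

The clip keeps commanded torques within actuator limits, so a large tracking error produces at most the bound torque rather than an unrealizable command.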
</sec>
<sec id="s3-4">
<title>3.4 Reward Function</title>
<p>We design the reward function to encourage the control policy to imitate a target joint motion <inline-formula id="inf51">
<mml:math id="m59">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> of the exoskeleton while robustly maintaining balance. The reward function consists of the pose reward <inline-formula id="inf52">
<mml:math id="m60">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, velocity reward <inline-formula id="inf53">
<mml:math id="m61">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>v</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, end-effector reward <inline-formula id="inf54">
<mml:math id="m62">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, root reward <inline-formula id="inf55">
<mml:math id="m63">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, center of mass reward <inline-formula id="inf56">
<mml:math id="m64">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, CoP reward <inline-formula id="inf57">
<mml:math id="m65">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, and torque reward <inline-formula id="inf58">
<mml:math id="m66">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, which is defined by:<disp-formula id="e9">
<mml:math id="m67">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi>p</mml:mi>
</mml:msup>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi>v</mml:mi>
</mml:msup>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>v</mml:mi>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi>e</mml:mi>
</mml:msup>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
<label>(9)</label>
</disp-formula>
<disp-formula id="equ1">
<mml:math id="m68">
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</disp-formula>where <inline-formula id="inf59">
<mml:math id="m69">
<mml:mrow>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi>p</mml:mi>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi>v</mml:mi>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi>e</mml:mi>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> are their respective weights. The pose reward <inline-formula id="inf60">
<mml:math id="m70">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and velocity reward <inline-formula id="inf61">
<mml:math id="m71">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>v</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> match the current and task (target) motions in terms of the joint positions <inline-formula id="inf62">
<mml:math id="m72">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and velocities <inline-formula id="inf63">
<mml:math id="m73">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x2d9;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>:<disp-formula id="e10">
<mml:math id="m74">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>p</mml:mi>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mi>p</mml:mi>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>j</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(10)</label>
</disp-formula>
<disp-formula id="e11">
<mml:math id="m75">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>v</mml:mi>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>j</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x2d9;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="true">&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x2d9;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(11)</label>
</disp-formula>where <italic>j</italic> is the index of joints, <inline-formula id="inf64">
<mml:math id="m76">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf65">
<mml:math id="m77">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>&#x2d9;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="true">&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> are target position and velocity, respectively. The end-effector reward is defined to encourage the robot to track the target positions of selected end-effectors:<disp-formula id="e12">
<mml:math id="m78">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mo>&#x7c;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msubsup>
<mml:mo>&#x7c;</mml:mo>
<mml:msup>
<mml:mo>&#x7c;</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(12)</label>
</disp-formula>where <italic>i</italic> is the index of the end-effector. Let <inline-formula id="inf66">
<mml:math id="m79">
<mml:mrow>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> be the position of the end-effectors (the left and right feet) relative to the moving coordinate frame attached to the root (waist structure). The end-effector motions are expected to match well if the joint angles match well, and vice&#x20;versa.</p>
<p>We also design the root reward function <inline-formula id="inf67">
<mml:math id="m80">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> to track the task root motion including the root&#x2019;s position <inline-formula id="inf68">
<mml:math id="m81">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and rotation <inline-formula id="inf69">
<mml:math id="m82">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>q</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>.<disp-formula id="e13">
<mml:math id="m83">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>q</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>q</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(13)</label>
</disp-formula>The overall center of mass reward <inline-formula id="inf70">
<mml:math id="m84">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is also considered in the learning process, enabling the control policy to track the target height during the complete squatting cycle.<disp-formula id="e14">
<mml:math id="m85">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(14)</label>
</disp-formula>
</p>
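The tracking rewards in Eqs. 10&#x2013;14 all share one exponential structure: a negative scale factor times a summed squared error. A minimal sketch of this structure is given below; the &#x3c3; scale values are illustrative placeholders, since their numerical values are not stated in this section.

```python
import numpy as np

def pose_reward(p_target, p_current, sigma_p=2.0):
    # Eq. 10 structure: exp(-sigma_p * sum_j ||p_hat_t^j - p_t^j||^2).
    # sigma_p is an assumed placeholder value.
    err = np.asarray(p_target) - np.asarray(p_current)
    return float(np.exp(-sigma_p * np.sum(err ** 2)))

def velocity_reward(v_target, v_current, sigma_v=0.1):
    # Eq. 11 structure, applied to joint velocities instead of positions.
    err = np.asarray(v_target) - np.asarray(v_current)
    return float(np.exp(-sigma_v * np.sum(err ** 2)))
```

Perfect tracking gives a reward of exactly 1, and the reward decays smoothly toward 0 as the squared tracking error grows, which keeps the learning signal dense.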
<p>The movement of the system CoP is an important indicator of system balance, as set forth in the introduction. When the CoP leaves or is about to leave the foot support polygon, toppling of the foot or loss of balance is imminent. During squatting (or walking), one or both feet are always on the ground and the support polygon of the contacting foot persists; the CoP criterion is therefore highly relevant for characterizing the tipping equilibrium of a bipedal robot (<xref ref-type="bibr" rid="B35">Sardain and Bessonnet, 2004</xref>). When the CoP lies within the foot support polygon, it can be assumed that the biped robot can keep its balance during squatting. Considering the importance of the CoP, we incorporate the CoP positions in the state input as feedback to the controller and also add a CoP reward function to encourage the current CoP (<inline-formula id="inf71">
<mml:math id="m86">
<mml:mrow>
<mml:msubsup>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>) to stay inside a stable region around the center of the foot support. Considering the foot geometry in the lower extremity exoskeleton design, the stable region for the foot CoP is defined as a rectangular area <italic>S</italic> around the foot geometric center whose width and length are set to <inline-formula id="inf72">
<mml:math id="m87">
<mml:mrow>
<mml:mn>7</mml:mn>
<mml:mi>c</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf73">
<mml:math id="m88">
<mml:mrow>
<mml:mn>11</mml:mn>
<mml:mi>c</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, respectively (narrower in the lateral direction than in the forward direction). The CoP reward function is expressed as<disp-formula id="e15">
<mml:math id="m89">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mtext>if&#xa0;</mml:mtext>
<mml:msubsup>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mtext>if&#xa0;</mml:mtext>
<mml:msubsup>
<mml:mi>c</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2209;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(15)</label>
</disp-formula>where <inline-formula id="inf74">
<mml:math id="m90">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the Euclidean distance between the CoP and the center of <italic>S</italic>. The goal of this CoP reward is to encourage the controller to predict an action that will improve the balance and robustness of the exoskeleton&#x2019;s motion.</p>
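The piecewise CoP reward of Eq. 15 can be sketched directly from the stated geometry: a 7&#xa0;cm &#xd7; 11&#xa0;cm rectangle <italic>S</italic> around the foot center, with an exponential reward inside and zero outside. The &#x3c3; value is an assumed placeholder, since this section does not give it.

```python
import math

# Stable region S: rectangle around the foot geometric center,
# 7 cm wide (lateral) and 11 cm long (forward), per the text.
HALF_W, HALF_L = 0.07 / 2, 0.11 / 2

def cop_reward(cop_xy, sigma_cop=50.0):
    # Eq. 15 structure: exp(-sigma_cop * D(c, S)^2) if the CoP is in S, else 0.
    # cop_xy is the CoP position relative to the foot geometric center;
    # sigma_cop is an assumed placeholder value.
    x, y = cop_xy
    if abs(x) <= HALF_W and abs(y) <= HALF_L:
        d = math.hypot(x, y)  # Euclidean distance D(., .) to the center of S
        return math.exp(-sigma_cop * d ** 2)
    return 0.0
```

The hard zero outside <italic>S</italic> gives the policy no credit at all for states where the foot is close to tipping, which steers training toward centered, well-balanced contacts.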
<p>Finally, we design the torque reward to reduce energy consumption, improve efficiency, and prevent damage to the joint actuators during deployment.<disp-formula id="e16">
<mml:math id="m91">
<mml:mrow>
<mml:msubsup>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(16)</label>
</disp-formula>where <italic>i</italic> is the joint index.</p>
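A sketch of the torque term (Eq. 16) and of how the weighted sum in Eq. 9 combines the individual terms; the weights shown are the ones reported later in Section 4.1, while &#x3c3;<sub>torque</sub> is an assumed placeholder.

```python
import numpy as np

def torque_reward(torques, sigma_torque=1e-4):
    # Eq. 16 structure: exp(-sigma_torque * sum_i ||tau_i||^2),
    # penalizing large joint torques to save energy and protect actuators.
    return float(np.exp(-sigma_torque * np.sum(np.square(torques))))

# Eq. 9 is a weighted sum of the individual reward terms; these weights
# are the values reported in Section 4.1 of the paper.
WEIGHTS = {"p": 0.8, "cop": 0.8, "v": 0.1, "ee": 0.7,
           "com": 0.4, "root": 0.7, "torque": 0.1}

def total_reward(terms):
    # terms maps each reward name to its value in [0, 1].
    return sum(WEIGHTS[k] * terms[k] for k in WEIGHTS)
```

With every term at its maximum of 1, the total reward equals the sum of the weights (3.6), so the per-step reward is bounded and easy to normalize.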
</sec>
<sec id="s3-5">
<title>3.5 Dynamics Randomization</title>
<p>Due to the model discrepancy between the physics simulation and the real-world environment, commonly known as the reality or sim-to-real gap (<xref ref-type="bibr" rid="B47">Yu et&#x20;al., 2018</xref>), a trained control policy usually performs poorly in the real environment. To improve the robustness of the controller against model inaccuracy and bridge the sim-to-real gap, we need a robust control policy capable of handling environments with different dynamics characteristics. To this end, we adopt dynamics randomization (<xref ref-type="bibr" rid="B33">Sadeghi and Levine, 2016</xref>; <xref ref-type="bibr" rid="B41">Tobin et&#x20;al., 2017</xref>) in our training strategy, in which the dynamics parameters of the simulation environment are randomly sampled from a uniform distribution for each episode. The objective in <xref ref-type="disp-formula" rid="e5">Eq. 5</xref> is then modified to maximize the expected reward across a distribution of dynamics characteristics <inline-formula id="inf75">
<mml:math id="m92">
<mml:mrow>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3bc;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>:<disp-formula id="e17">
<mml:math id="m93">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>arg</mml:mtext>
<mml:munder>
<mml:mrow>
<mml:mtext>max</mml:mtext>
</mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:munder>
<mml:msub>
<mml:mi mathvariant="double-struck">E</mml:mi>
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
<mml:mo>&#x223c;</mml:mo>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3bc;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="double-struck">E</mml:mi>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x223c;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>&#x3bc;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mi>&#x3b3;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msup>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(17)</label>
</disp-formula>where <italic>&#x3bc;</italic> represents the values of the dynamics parameters that are randomized during training. By training policies to adapt to variability in environment dynamics, the resulting policy will be more robust when transferred to the real&#x20;world.</p>
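Per-episode dynamics randomization can be sketched as uniform sampling over the training ranges later listed in Table&#x20;2 (most expressed as multipliers on the default model values, latency in seconds); the dictionary keys are illustrative names, not identifiers from the paper.

```python
import random

# Training ranges from Table 2; all but latency are multipliers applied
# to the default model values, latency is in seconds.
TRAIN_RANGES = {
    "friction": (0.9, 1.6),
    "mass": (0.8, 1.2),
    "motor_strength": (0.8, 1.2),
    "latency": (0.0, 0.04),
    "inertia": (0.5, 1.5),
    "com": (0.9, 1.2),
}

def sample_dynamics(ranges=TRAIN_RANGES, rng=random):
    # Draw one value mu uniformly per parameter at the start of each episode.
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
```

A fresh sample at every episode reset exposes the policy to a distribution of dynamics rather than a single nominal model, which is exactly the expectation over &#x3c1;(&#x3bc;) in Eq. 17.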
</sec>
</sec>
<sec sec-type="results|discussion" id="s4">
<title>4 Numerical Experiment Results and Discussion</title>
<p>We design a set of numerical experiments aiming to answer the following questions: 1) Can the learning process generate feasible control policies to control the exoskeleton to perform well-balanced squatting motions? 2) Will the learned control policies be robust enough under large random external perturbation? 3) Will the learned control policies be robust enough to sustain stable motions when subjected to uncertain human-exoskeleton interaction forces from a disabled human operator?</p>
<sec id="s4-1">
<title>4.1 Simulation and Reinforcement Learning Settings</title>
<p>To demonstrate the effectiveness of our RL-based robust controller, we train the lower extremity exoskeleton to imitate a 4&#xa0;s reference squatting motion that is manually created based on human squatting motion. The reference squatting motion provides guidance for motion mimicking but need not be generated precisely. The exoskeleton contains eight joint DoFs actuated with motors, all of which are controlled by the RL controller. Each leg has four actuators: one for hip flexion/extension, one for knee flexion/extension, and two for ankle dorsiflexion/plantarflexion and ankle inversion/eversion, respectively. The open source library DART (<xref ref-type="bibr" rid="B20">Lee et&#x20;al., 2018</xref>) is utilized to simulate the exoskeleton dynamics. The GRFs are computed by a Dantzig LCP (linear complementarity problem) solver (<xref ref-type="bibr" rid="B3">Baraff, 1994</xref>). We utilize PyTorch (<xref ref-type="bibr" rid="B27">Paszke et&#x20;al., 2019</xref>) to implement the neural network and the PPO method for the learning process. The networks are initialized by the Xavier uniform method (<xref ref-type="bibr" rid="B13">Glorot and Bengio, 2010</xref>). We use a desktop computer with an Intel<sup>&#xae;</sup> Xeon(R) CPU E5-1660 v3 at 3.00&#xa0;GHz <inline-formula id="inf76">
<mml:math id="m94">
<mml:mo>&#xd7;</mml:mo>
</mml:math>
</inline-formula> 16 to generate samples in parallel during training. In total, about 20 million samples are collected in each simulation. The policy and value networks are updated at a learning rate of <inline-formula id="inf77">
<mml:math id="m95">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, which is linearly decreased to 0 as the 20 million samples are collected. The PPO algorithm is robust in that hyperparameter initialization is relatively forgiving and it can handle a wide variety of RL tasks. We do not deliberately tune hyperparameters and simply use the common settings from the literature (<xref ref-type="bibr" rid="B37">Schulman et&#x20;al., 2017</xref>; <xref ref-type="bibr" rid="B39">Tan et&#x20;al., 2018</xref>). Hyperparameter settings for training using PPO are shown in <xref ref-type="table" rid="T1">Table&#x20;1</xref>.</p>
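The learning-rate schedule described above (linear decay from 10<sup>&#x2212;4</sup> to 0 over the 20 million collected samples) can be sketched as a simple function of the sample count; the function name is illustrative, not from the paper.

```python
def linear_lr(samples_collected, base_lr=1e-4, total_samples=20_000_000):
    # Linearly anneal the learning rate from base_lr down to 0 as the
    # sample budget (20 M samples) is consumed; clamp at 0 afterwards.
    frac = min(samples_collected / total_samples, 1.0)
    return base_lr * (1.0 - frac)
```

In a PyTorch training loop this would typically be applied by resetting the Adam optimizer's `lr` field (or via a `LambdaLR` scheduler) after each policy update.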
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Hyperparameter settings for training.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Parameters</th>
<th align="center">Value</th>
<th align="center">Parameters</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Discount factor</td>
<td align="char" char=".">0.99</td>
<td align="left">Epochs</td>
<td align="char" char=".">10</td>
</tr>
<tr>
<td align="left">Policy Adam learning rate</td>
<td align="center">
<inline-formula id="inf78">
<mml:math id="m96">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Clip threshold</td>
<td align="char" char=".">0.2</td>
</tr>
<tr>
<td align="left">Batch size</td>
<td align="char" char=".">128</td>
<td align="left">Memory buffer</td>
<td align="char" char=".">2048</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To verify the robustness of the trained controller, we test the control policies in out-of-distribution simulated environments, where the dynamics parameters of the exoskeleton are sampled randomly from a larger range of values than during training. <xref ref-type="table" rid="T2">Table&#x20;2</xref> details the dynamics parameters of the exoskeleton and their ranges during training and testing. Note that the observation latency denotes the observation time delay present in the real physical system due to sensor noise and delays during information transfer. Accounting for the observation latency improves the realism of the simulations and further increases the difficulty of policy training. The simulation frequency (time step for the environment simulation) and control policy output frequency are set to 900 and 30&#xa0;Hz, respectively. In the PD torque law (<xref ref-type="disp-formula" rid="e8">Eq. 8</xref>), the proportional gain <inline-formula id="inf79">
<mml:math id="m97">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>p</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and derivative gain <inline-formula id="inf80">
<mml:math id="m98">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are set to 900 and 40, respectively. The derivative gain <inline-formula id="inf81">
<mml:math id="m99">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is chosen to be sufficiently high to prevent unwanted oscillation of the exoskeleton robot. In our experience, the control performance is robust to variations in the gains to a certain extent. For instance, increasing or decreasing the position gain <inline-formula id="inf82">
<mml:math id="m100">
<mml:mrow>
<mml:msub>
<mml:mi>k</mml:mi>
<mml:mi>p</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> to 1200 or 800 does not noticeably change the control performance. We tried different sets of reward weights (in <xref ref-type="disp-formula" rid="e9">Eq. 9</xref>) within a reasonable range, found that the control performance was similar across them, and chose the best set: <inline-formula id="inf83">
<mml:math id="m101">
<mml:mrow>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi>p</mml:mi>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.8</mml:mn>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.8</mml:mn>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi>v</mml:mi>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.7</mml:mn>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.4</mml:mn>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.7</mml:mn>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. The torque limit for each joint is set to 100<inline-formula id="inf84">
<mml:math id="m102">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> in the simulation.</p>
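The PD actuation described above, with the stated gains and per-joint torque limit, can be sketched as follows. The exact form of Eq. 8 is not reproduced in this excerpt, so the zero-desired-velocity PD law used here is an assumption based on the standard formulation; the function name is illustrative.

```python
import numpy as np

KP, KV = 900.0, 40.0     # proportional and derivative gains from Section 4.1
TORQUE_LIMIT = 100.0     # per-joint torque limit (N m) used in the simulation

def pd_torque(q_target, q, qdot):
    # Standard PD law (assumed form of Eq. 8): tau = kp*(q_target - q) - kv*qdot,
    # where q_target is the PD target position predicted by the policy network.
    # Torques are clipped to the simulated actuator limit.
    tau = KP * (np.asarray(q_target) - np.asarray(q)) - KV * np.asarray(qdot)
    return np.clip(tau, -TORQUE_LIMIT, TORQUE_LIMIT)
```

For example, a 0.1&#xa0;rad position error at rest yields 90&#xa0;N&#xb7;m, just under the limit, while larger errors saturate at &#xb1;100&#xa0;N&#xb7;m, mirroring how the simulation bounds actuator effort.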
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Dynamics parameters and their respective ranges of values used during training and testing. A larger range of values is used during testing to evaluate the generalization ability of the control policies under dynamics uncertainties.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Dynamic parameters</th>
<th align="center">Training range</th>
<th align="center">Testing range</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Friction coefficient</td>
<td align="left">[0.9,1.6]&#x2a;default value</td>
<td align="left">[0.7,2.0]&#x2a;default value</td>
</tr>
<tr>
<td align="left">Mass</td>
<td align="left">[0.8,1.2]&#x2a;default value</td>
<td align="left">[0.7,1.5]&#x2a;default value</td>
</tr>
<tr>
<td align="left">Motor strength</td>
<td align="left">[0.8,1.2]&#x2a;default value</td>
<td align="left">[0.7,1.3]&#x2a;default value</td>
</tr>
<tr>
<td align="left">Observation latency</td>
<td align="left">[0,0.04]s</td>
<td align="left">[0,0.06]s</td>
</tr>
<tr>
<td align="left">Inertia</td>
<td align="left">[0.5,1.5]&#x2a;default value</td>
<td align="left">[0.4,1.6]&#x2a;default value</td>
</tr>
<tr>
<td align="left">Center of Mass</td>
<td align="left">[0.9,1.2]&#x2a;default value</td>
<td align="left">[0.8,1.3]&#x2a;default value</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-2">
<title>4.2 Learned Squatting Skill</title>
<sec id="s4-2-1">
<title>4.2.1 Case 1&#x2013;Feasibility Demonstration</title>
<p>In the first case, a squatting motion controller is learned from the 4&#xa0;s reference squatting motion without considering external perturbation. It is worth noting that, compared with training, we use larger-range dynamics randomization of the exoskeleton model to demonstrate the generalization ability of our learned controller (as shown in <xref ref-type="table" rid="T2">Table&#x20;2</xref>). A series of snapshots of the squatting behavior of the lower extremity exoskeleton under the learned control policy is shown in <xref ref-type="fig" rid="F3">Figure&#x20;3A</xref>. The lower extremity exoskeleton can perform the squatting and stand-up cycle with a nearly symmetric motion. We test the learned controller in 200 out-of-distribution simulated environments, where the dynamics parameters are sampled from a larger range of values than those used during training (as shown in <xref ref-type="table" rid="T2">Table&#x20;2</xref>). <xref ref-type="fig" rid="F4">Figure&#x20;4A</xref> displays the statistical results for the hip flexion/extension, knee flexion/extension, ankle dorsiflexion/plantarflexion and ankle inversion/eversion joint angles in the first squatting cycle. Joint torque statistics are depicted in <xref ref-type="fig" rid="F4">Figure&#x20;4B</xref>; the peak torque at the knee joint for the squatting is around <inline-formula id="inf85">
<mml:math id="m103">
<mml:mrow>
<mml:mn>13.5</mml:mn>
<mml:mi>N</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. From <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>, it is observed that the produced joint angles have very consistent mean values with nearly vanishing variances (a result of the very small STDs of the action distributions (<xref ref-type="disp-formula" rid="e4">Eq. 4</xref>) learned by the neural network). The joint torques also show good consistency in the predicted mean torques, with slightly higher variances, especially for the knee and ankle dorsiflexion/plantarflexion joints. Due to the weight of the waist support structure and the battery (around <inline-formula id="inf86">
<mml:math id="m104">
<mml:mrow>
<mml:mn>2.2</mml:mn>
<mml:mi>k</mml:mi>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>) mounted on its back, the hip joint consistently produces a positive torque to prevent the waist structure from rotating downwards under gravity. The controller is able to achieve a low average joint angle tracking error (about <inline-formula id="inf87">
<mml:math id="m105">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>1.22</mml:mn>
</mml:mrow>
<mml:mo>&#x2218;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>) compared with the target squatting motion.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Snapshots of the squatting motion of the exoskeleton. The RL-based controller enables the lower extremity exoskeleton to perform the squatting skill under different types of external perturbations with various intensities. <bold>(A)</bold> Performing the squatting skill without perturbation forces. <bold>(B)</bold> Performing the squatting skill with large random perturbation forces (red, cyan and blue arrows show the random perturbation forces applied to the hip, femur and tibia, respectively). <bold>(C)</bold> Performing the squatting skill with human interaction.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Case1: Joint behavior statistics in the first squatting cycle (curve: mean; shade: STD) with respect to time for 200 simulated environments. <bold>(A)</bold> Hip, knee, ankle dorsiflexion/plantarflexion, ankle inversion/eversion joint angles with respect to time during the first squatting cycle. Since the learned STDs <inline-formula id="inf88">
<mml:math id="m106">
<mml:mtext>&#x3a3;</mml:mtext>
</mml:math>
</inline-formula> (in <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>) of the joint angles are very small (less than 0.05), the variances of the joint angles are indiscernible in the figure. <bold>(B)</bold> Corresponding joint torques during the first squatting&#x20;cycle.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g004.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F5">Figure&#x20;5</xref> shows the best obtained foot CoP trajectories (left and right feet) in the lateral and forward directions, which are calculated in real time using the ground contact force information. As can be seen, the exoskeleton controller keeps the foot CoP well inside the stable region in both the lateral and forward directions over a complete squatting cycle. Note that at the beginning the CoP is close to the back edge (due to its initial state), but it is gradually brought near the center. This indicates that the exoskeleton controller is able to recognize the current state of the CoP and is capable of bringing it to a more stable state. The balance in the lateral direction is better than in the forward direction due to the symmetric nature of the squatting motion. The right foot CoP trajectories have patterns very similar to those of the left foot&#x20;CoP.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Case1: Foot CoP trajectories during the first squatting cycle. Green dotted line depicts the stable region border of CoP. <bold>(A)</bold> Left foot CoP and right foot CoP trajectories (lateral direction). <bold>(B)</bold> Left foot CoP and right foot CoP trajectories (forward direction).</p>
</caption>
<graphic xlink:href="frobt-08-702845-g005.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F6">Figure&#x20;6</xref> presents the performance of the learned controller over multiple squatting cycles. We clearly observe high CoP rewards, indicating good system balance while the exoskeleton performs the squatting motion. The relatively high joint position tracking and end-effector tracking rewards illustrate the strong tracking performance of the control system. The second panel shows the torques for the hip, knee, and ankle joints. The last panel shows the predicted actions (PD target positions) for these joints. It is clear from these plots that the actions predicted by the policy network are smooth and exhibit clear cyclic patterns.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Case 1: Performance of the RL based controller without external perturbation during multiple squatting cycles. The first figure demonstrates the foot CoP reward (green and red lines), end-effector tracking reward (magenta line) and joint position tracking reward (black line) calculated according to <xref ref-type="disp-formula" rid="e10">Eqs 10</xref>&#x2013;<xref ref-type="disp-formula" rid="e15">15</xref>. The bottom figure depicts the actions for the hip, knee and ankle joints predicted by the neural network.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g006.tif"/>
</fig>
</sec>
<sec id="s4-2-2">
<title>4.2.2 Case 2&#x2013;Robustness Against Random Perturbation Forces</title>
<p>In the second case, we aim to verify the robustness of the controller under random external perturbation forces. From our tests, the learned control policy from case 1, trained without any perturbation forces, could perform well with random perturbation forces up to <inline-formula id="inf89">
<mml:math id="m107">
<mml:mrow>
<mml:mn>100</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. To further improve the robustness of our controller in handling greater perturbation forces, we now introduce random perturbation forces during the training process of the learning controller. The perturbation forces are applied to three parts of the exoskeleton: hip, femur, and tibia. For the femur and tibia, the magnitudes of the forces are randomly sampled in the range <inline-formula id="inf90">
<mml:math id="m108">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>0,100</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and no restriction is set for the direction. For the hip, we randomly sample the magnitude of force in <inline-formula id="inf91">
<mml:math id="m109">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>0,200</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> but restrict the direction to <inline-formula id="inf92">
<mml:math id="m110">
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>&#x223c;</mml:mo>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> degrees from the vertical direction, assuming no large lateral pushing perturbations.</p>
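A minimal sketch of this perturbation sampling, assuming vertical is the +z axis and an isotropic direction distribution for the femur and tibia forces (implementation details not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_direction():
    """Uniform random unit vector in 3D, for unrestricted force directions."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def sample_perturbations():
    """One set of perturbation forces following the ranges in the text:
    femur/tibia magnitudes in (0, 100) N with unrestricted direction;
    hip magnitude in (0, 200) N within 20 degrees of vertical (+z here)."""
    forces = {}
    for part in ("femur", "tibia"):
        forces[part] = rng.uniform(0.0, 100.0) * random_direction()
    # Hip: tilt the vertical axis by an angle in (0, 20) deg about a random azimuth
    theta = np.deg2rad(rng.uniform(0.0, 20.0))
    phi = rng.uniform(0.0, 2.0 * np.pi)
    direction = np.array([np.sin(theta) * np.cos(phi),
                          np.sin(theta) * np.sin(phi),
                          np.cos(theta)])
    forces["hip"] = rng.uniform(0.0, 200.0) * direction
    return forces

forces = sample_perturbations()  # resampled at random times during training
```

Restricting the hip-force direction to a cone about the vertical reflects the assumption of no large lateral pushing perturbation.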
<p>
<xref ref-type="fig" rid="F3">Figure&#x20;3B</xref> shows a series of snapshots of the lower extremity exoskeleton behavior with the newly trained controller when tested with random, large perturbation forces during squatting motion. Statistical results of the Joint angles and torques at the hip, knee and ankle joint in the first squatting cycle under 200 simulated environments are shown in <xref ref-type="fig" rid="F7">Figure&#x20;7</xref>. We can clearly observe that the motion is still relative smooth and the torques calculated from <xref ref-type="disp-formula" rid="e8">Eq. 8</xref> have more ripples in response to the random perturbation forces. <xref ref-type="fig" rid="F7">Figure&#x20;7C</xref> shows randomly varying perturbation forces applied on the hip, femur and tibia. Compared to the joint angles and torques without external perturbation (Case 1, <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>), the joint torques under random perturbation forces are almost doubled but the joint angles are relatively close. Under the large, random perturbation forces, the controller can achieve the target squatting motion with a low average joint angle tracking error (about <inline-formula id="inf93">
<mml:math id="m111">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>2.64</mml:mn>
</mml:mrow>
<mml:mo>&#x2218;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>). Moreover, under the random perturbations that could influence the balance in the frontal plane, the learned controller can still maintain balance and stability through the actuation of the ankle inversion/eversion&#x20;joint.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Case 2: Joint behavior statistics (curve: mean; shade: STD) under random perturbation forces in the first squatting cycle with respect to time for 200 simulated environments. <bold>(A)</bold> Hip, knee, ankle dorsiflexion/plantarflexion, ankle inversion/eversion joint angles with respect to time in the first squatting cycle. The learned STDs <inline-formula id="inf94">
<mml:math id="m112">
<mml:mtext>&#x3a3;</mml:mtext>
</mml:math>
</inline-formula> (in <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>) of the joint angles are very small (less than 0.08). <bold>(B)</bold> Corresponding joint torques in the first squatting cycle. <bold>(C)</bold> Magnitudes of the random perturbation forces applied on the three parts of the exoskeleton: hip, femur and tibia during testing.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g007.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F8">Figure&#x20;8</xref> shows the CoP trajectories of the left and right feet in both lateral and forward directions in the first squatting cycle. As shown in this figure, the foot CoP trajectories have oscillation under the random perturbation forces compared with Case1. But the robot can still keep the CoP inside the safe region of the support foot to guarantee stability and balance. It further validates that this controller enables the robot to perform squatting motion with strong stability and robustness. To further demonstrate the robustness of the learning controller, we increase the random perturbation forces up to <inline-formula id="inf95">
<mml:math id="m113">
<mml:mrow>
<mml:mn>75</mml:mn>
<mml:mtext>%</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> greater than the training perturbation (e.g., up to <inline-formula id="inf96">
<mml:math id="m114">
<mml:mrow>
<mml:mn>350</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> for the hip force) and find that the exoskeleton can still perform the squatting motion without losing its balance, as illustrated by the left foot CoP stability in <xref ref-type="fig" rid="F8">Figures 8C, D</xref>. Real-time tracking results of the exoskeleton over multiple squatting cycles are shown in <xref ref-type="fig" rid="F9">Figure&#x20;9</xref>. From <xref ref-type="fig" rid="F9">Figure&#x20;9</xref>, both the foot CoP reward and the end-effector reward remain high under the random perturbation forces. The torques are greater than those without external perturbation forces; the peak torque for the knee joint is close to <inline-formula id="inf97">
<mml:math id="m115">
<mml:mrow>
<mml:mn>40</mml:mn>
<mml:mi>N</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. The actions predicted by the policy network are also smooth and exhibit cyclic patterns.</p>
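The stability criterion used throughout these tests, keeping the CoP inside the safe region of the support foot, reduces to a simple bounds check in the foot frame. The rectangular bounds below are assumed values for illustration, not the exoskeleton's actual foot dimensions:

```python
def cop_inside_region(cop, x_bounds, y_bounds):
    """Return True when the CoP lies inside the rectangular stable region
    of the support foot (bounds given in the foot frame, meters)."""
    x, y = cop
    return x_bounds[0] <= x <= x_bounds[1] and y_bounds[0] <= y <= y_bounds[1]

# Illustrative foot-sole bounds (assumed): lateral half-width 0.05 m,
# forward extent from heel (-0.07 m) to toe (0.13 m)
LATERAL = (-0.05, 0.05)
FORWARD = (-0.07, 0.13)

print(cop_inside_region((0.01, 0.02), LATERAL, FORWARD))  # True: near center
print(cop_inside_region((0.01, 0.20), LATERAL, FORWARD))  # False: past the toe
```

Evaluating this check at every control step during the perturbed squats is what the CoP trajectories of Figure 8 visualize.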
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>Case 2: Foot CoP trajectories under random external perturbations in the first squatting cycle, and a robustness test of foot CoP stability under perturbation forces 75% greater than the training setting. <bold>(A)</bold> Left and right foot CoP trajectories (lateral direction). <bold>(B)</bold> Left and right foot CoP trajectories (forward direction). <bold>(C)</bold> Left foot CoP trajectory (lateral direction) under perturbation forces 75% greater than the training setting. <bold>(D)</bold> Left foot CoP trajectory (forward direction) under perturbation forces 75% greater than the training setting.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g008.tif"/>
</fig>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>Case 2: Real-time performance of the reinforcement-learning based controller with large, random external perturbations during multiple squatting cycles. The first figure demonstrates the real-time foot CoP reward (green and red lines), end-effector tracking reward (magenta line) and joint position tracking reward (black line) calculated according to <xref ref-type="disp-formula" rid="e10">Eqs 10</xref>&#x2013;<xref ref-type="disp-formula" rid="e15">15</xref>. The bottom figure depicts the actions for the hip, knee and ankle joints predicted by the neural network.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g009.tif"/>
</fig>
</sec>
<sec id="s4-2-3">
<title>4.2.3 Case 3&#x2013;Robustness Under Human-Exoskeleton Interaction</title>
<p>In the third case, the human musculoskeletal model is integrated with the exoskeleton to simulate more realistic perturbation forces. As described in section 2.3, the springs at the strap locations generate varying interaction forces between the human and the exoskeleton during the motion, which are applied to both the human and the exoskeleton. We first train the network with the integrated human-exoskeleton model to account for the interaction forces. Here we consider neither active muscle contraction of the human operator nor actuation torques on the human joints, since the operator could be a patient suffering from spinal cord injury or stroke, with very limited or no control of his or her own body. Nonetheless, the passive muscle forces described in <xref ref-type="disp-formula" rid="e3">Eq. 3</xref> during movement are incorporated. The squatting skill learned by the exoskeleton and the performance of the motion controller are shown in <xref ref-type="fig" rid="F3">Figure&#x20;3C</xref> and <xref ref-type="fig" rid="F10">Figures 10</xref>&#x2013;<xref ref-type="fig" rid="F13">13</xref>.</p>
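The strap interaction described above can be sketched with a linear spring model; the stiffness value below is an assumption for illustration, not a parameter reported in the paper:

```python
import numpy as np

def strap_force(x_human, x_exo, k=2000.0):
    """Linear-spring model of one strap (hip, femur, or tibia): the
    interaction force is proportional to the relative displacement of the
    human and exoskeleton attachment points and acts equal-and-opposite
    on the two bodies. Stiffness k (N/m) is an assumed value."""
    f_on_exo = k * (np.asarray(x_human, dtype=float) - np.asarray(x_exo, dtype=float))
    return f_on_exo, -f_on_exo  # forces on the exoskeleton and on the human

# A 1 cm relative displacement along x pulls the exoskeleton with about 20 N
f_exo, f_human = strap_force([0.01, 0.0, 0.0], [0.0, 0.0, 0.0])
```

Because the straps load both bodies simultaneously, the exoskeleton controller experiences the human's passive dynamics as a time-varying disturbance.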
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>Case 3: Joint behavior statistics under human-exoskeleton interactions during the first squatting cycle with respect to time for 200 simulated environments. <bold>(A)</bold> Hip, knee, ankle dorsiflexion/plantarflexion, ankle inversion/eversion joint angles in the first squatting cycle. The learned STDs <inline-formula id="inf98">
<mml:math id="m116">
<mml:mtext>&#x3a3;</mml:mtext>
</mml:math>
</inline-formula> (in <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>) of the joint angles are very small (less than 0.06). <bold>(B)</bold> Corresponding joint torques in the first squatting&#x20;cycle.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g010.tif"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="F3">Figure&#x20;3C</xref>, the rehabilitation exoskeleton is able to assist the human to perform the squatting motion without external assistance. Statistical results of the angles and torques at the hip, knee and ankle joints of the left leg in the first cycle under 200 simulated environments are shown in <xref ref-type="fig" rid="F10">Figures 10A, B</xref>. The torques at the hip, knee and ankle joints are greater than those without the human interaction while still below the maximum torques. The controller can still maintain a low average joint angle tracking error (about <inline-formula id="inf102">
<mml:math id="m120">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>2.49</mml:mn>
</mml:mrow>
<mml:mo>&#x2218;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>) under the varying human-exoskeleton forces. <xref ref-type="fig" rid="F11">Figure&#x20;11</xref> shows the CoP trajectories of the left and right feet in the lateral and forward directions under the predicted human-exoskeleton interaction forces (shown in <xref ref-type="fig" rid="F12">Figure&#x20;12A</xref>). From <xref ref-type="fig" rid="F11">Figure&#x20;11A</xref>, there exists a symmetric pattern between the left and right foot CoP trajectories in the lateral direction, which indicates that the squatting motion is well balanced in the frontal plane. Real-time tracking results of the controlled exoskeleton under the human-exoskeleton interactions over multiple squatting cycles are shown in <xref ref-type="fig" rid="F12">Figure&#x20;12</xref>. As shown in <xref ref-type="fig" rid="F12">Figure&#x20;12B</xref>, the high CoP reward in the first figure indicates that the proposed control system has strong stability and robustness to the human interaction forces. The peak torques for all joints are less than <inline-formula id="inf103">
<mml:math id="m121">
<mml:mrow>
<mml:mn>70</mml:mn>
<mml:mi>N</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. The largest peak torque occurs at the ankle dorsiflexion/plantarflexion joint (<inline-formula id="inf104">
<mml:math id="m122">
<mml:mrow>
<mml:mn>68</mml:mn>
<mml:mi>N</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>), which is much smaller than its 160<inline-formula id="inf105">
<mml:math id="m123">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> capacity (<xref ref-type="bibr" rid="B26">Nunez et&#x20;al., 2017</xref>). The actions predicted by the policy network, shown in the last figure of <xref ref-type="fig" rid="F12">Figure&#x20;12</xref>, are still smooth and cyclic in general.</p>
<fig id="F11" position="float">
<label>FIGURE 11</label>
<caption>
<p>Case3: Foot CoP trajectories under human-exoskeleton interactions in the first squatting cycle. The stable region border is marked with green dotted lines. <bold>(A)</bold> Left foot CoP and right foot CoP trajectories (lateral direction). <bold>(B)</bold> Left foot CoP and right foot CoP trajectories (forward direction).</p>
</caption>
<graphic xlink:href="frobt-08-702845-g011.tif"/>
</fig>
<fig id="F12" position="float">
<label>FIGURE 12</label>
<caption>
<p>Case 3: Performance of the RL based controller under human-exoskeleton interactions during multiple squatting cycles. <bold>(A)</bold> Human-exoskeleton interaction (strap) forces during multiple squatting cycles. hip<inline-formula id="inf99">
<mml:math id="m117">
<mml:mo>&#x2013;</mml:mo>
</mml:math>
</inline-formula>force, femur<inline-formula id="inf100">
<mml:math id="m118">
<mml:mo>&#x2013;</mml:mo>
</mml:math>
</inline-formula>force and tibia<inline-formula id="inf101">
<mml:math id="m119">
<mml:mo>&#x2013;</mml:mo>
</mml:math>
</inline-formula>force in the figure represent the interaction forces between the human and the corresponding straps on the exoskeleton. <bold>(B)</bold> Performance of the RL based controller. The first figure demonstrates the foot CoP reward (green and red lines), end-effector reward (magenta line) and joint position tracking reward (black line) calculated according to <xref ref-type="disp-formula" rid="e10">Eqs 10</xref>&#x2013;<xref ref-type="disp-formula" rid="e15">15</xref>. The bottom figure depicts the actions for the hip, knee and ankle joints predicted by the neural network.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g012.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F13">Figure&#x20;13</xref> visualizes the performance of the learned controller in 200 simulated environments with different dynamics. <xref ref-type="fig" rid="F13">Figure&#x20;13A</xref> depicts the rewards statistics (mean and STD) with respect to time calculated from <xref ref-type="disp-formula" rid="e10">Eqs 10</xref>&#x2013;<xref ref-type="disp-formula" rid="e15">15</xref> under 200 simulated environments. The end-effector reward indicating the foot tracking performance consistently maintains a high value, revealing the exoskeleton robot has no falling condition in 200 simulated environments with unfamiliar dynamics and it can stand on the ground with stationary feet when performing the complete squatting motion. The joint position tracking and foot CoP also achieve a high reward with less variance under more diverse dynamics of the exoskeleton robot. <xref ref-type="fig" rid="F13">Figure&#x20;13B</xref> shows the average reward of a complete squatting cycle for each simulated environment. These results suggest that the learned controller is able to effortlessly generalize to environments that differ from those encountered during training and achieve good control performance under very diverse dynamics. The extensive testing performed with the integrated human-exoskeleton demonstrate that the RL controller is robust enough to sustain stable motions when subjected to uncertain human-exoskeleton interaction forces.</p>
<fig id="F13" position="float">
<label>FIGURE 13</label>
<caption>
<p>Performance of learned controller under human-exoskeleton interactions in 200 simulated environments with different dynamics. The dynamic model of the exoskeleton is randomly initialized. <bold>(A)</bold> Rewards statistics (curve: mean; shade: STD) with respect to time for 200 simulated environments. <bold>(B)</bold> Average reward of a complete squatting cycle for each simulated environment.</p>
</caption>
<graphic xlink:href="frobt-08-702845-g013.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5 Discussion</title>
<p>Through these designed numerical tests, we verified that our learning framework can produce effective neural network-based control policies for the lower extremity exoskeleton to perform well-balanced squatting motions. By incorporating adversarial perturbations in training, the learned control policies are robust against large random external perturbations during testing, and they can sustain stable motions when subjected to uncertain human-exoskeleton interaction forces from a disabled human user. In all numerical tests performed, the effectiveness and robustness of the learned balance controller are demonstrated by its capability to maintain the CoP inside the foot stable region along both the lateral and forward directions.</p>
<p>In this study, we evaluated the controllers in a specific case for which the human musculoskeletal model has only a passive muscle response (i.e.,&#x20;without active muscle contraction). In reality, the human-exoskeleton interaction forces might vary substantially for different users with different weights and levels of disability. One may assume that a user with good body or muscle control tends to minimize the interaction forces on the straps, whereas a passive human musculoskeletal model tends to generate larger interaction forces. Therefore, using a passive musculoskeletal model can be considered a more difficult task for the exoskeleton. Further investigations with an active human musculoskeletal model are possible but will likely need additional information on the health condition of the&#x20;user.</p>
<p>Through dynamics randomization, our learned controller is robust against modeling inaccuracy and differences between the training and testing environments (<xref ref-type="bibr" rid="B39">Tan et&#x20;al., 2018</xref>). Nonetheless, it is still beneficial to validate and further improve the dynamic exoskeleton model. This can be done through a model identification process, for which we can conveniently rely on the real-time measurement of GRFs and derived CoP information from the foot force sensors during programmed motions. Experiments can be conducted to correlate the CoP position with the exoskeleton posture and the CoP movement with motion. Stable squatting motions of different velocities can be used to record the dynamic responses of the exoskeleton and the collected data then can be used to further identify or optimize the model&#x2019;s parameters (such as inertia properties and joint friction coefficients).</p>
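As a hypothetical illustration of such a model identification step, a viscous joint-friction coefficient could be estimated by least squares from the residual between measured and model-predicted torques during programmed motions; the data and coefficient below are synthetic, not from the exoskeleton:

```python
import numpy as np

# Synthetic recorded data: joint velocities during a programmed motion, and
# the torque residual (measured minus nominal-model torque), which we
# attribute to viscous friction b * qdot plus measurement noise.
rng = np.random.default_rng(1)
qdot = rng.uniform(-2.0, 2.0, size=500)              # recorded velocities (rad/s)
b_true = 0.35                                        # unknown friction coefficient
residual = b_true * qdot + rng.normal(0.0, 0.01, 500)  # tau_meas - tau_model (Nm)

# Closed-form least-squares estimate: b = (qdot . residual) / (qdot . qdot)
b_hat = qdot @ residual / (qdot @ qdot)
print(round(b_hat, 2))  # recovers a value close to 0.35
```

The same regression structure extends to other linearly parameterized model terms, such as inertia corrections, given sufficiently rich recorded motions.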
<p>By incorporating motion imitation into the learning process, the proposed robust control framework has the potential to learn a diverse array of human behaviors without the need to design sophisticated skill-specific control structures and reward functions. Common rehabilitation motions, such as sit-to-stand and walking on flat or inclined ground, can be learned by feeding proper target motions. Since we incorporate the reference motion as the input of the control policy network, a reference motion trajectory is required for each new activity; in exchange, the use of reference motion data alleviates the need to design sophisticated task-specific reward functions and thereby facilitates a general framework for learning a diverse array of behaviors. Note that the reference motion provides guidance and need not be precise or even physically feasible. In this paper, the reference motion is manually generated by mimicking a human squatting, and we use it to guide the exoskeleton squatting, knowing the motion may not be feasible for the exoskeleton to follow exactly due to the differences in mass and inertia properties. Nonetheless, during training, the dynamic simulation environment automatically generates dynamically consistent (thus physically feasible) motions while trying to maximize the tracking reward. For other rehabilitation motions, such as walking on flat or inclined ground and sit-to-stand, reference motions can be generated similarly with little effort (unlike conventional motion or trajectory prediction or optimization methods). In addition, due to the nature of imitation learning and CoP based balance control, we foresee minimal changes to the learning framework with the exception of crafting different target motions for imitation. The learning process will automatically create specific controllers that can produce physically feasible and stable target motions (even when the target motion is coarsely generated and may not be physically feasible).</p>
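A tracking reward of the kind commonly used in imitation learning (the exact form of the paper's Eqs. 10&#x2013;15 may differ) can be sketched as an exponential of the negative squared tracking error, so the reward approaches 1 as the motion matches the reference:

```python
import numpy as np

def tracking_reward(q, q_ref, scale=5.0):
    """Imitation tracking term: exponential of the negative squared error
    between the current and reference joint angles. The scale factor is an
    assumed value controlling how sharply reward decays with error."""
    err = np.sum((np.asarray(q, dtype=float) - np.asarray(q_ref, dtype=float)) ** 2)
    return float(np.exp(-scale * err))

print(tracking_reward([0.1, -0.2, 0.05], [0.1, -0.2, 0.05]))  # 1.0 at perfect tracking
print(tracking_reward([0.3, -0.2, 0.05], [0.1, -0.2, 0.05]))  # smaller when off-track
```

Because the reward only rewards closeness rather than demanding exact reproduction, an imprecise or infeasible reference still steers learning toward a dynamically consistent approximation.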
<p>Transferring or adapting RL control policies trained in simulation to real hardware remains a challenge in spite of some demonstrated successes. To bridge the so-called &#x201c;sim-to-real&#x201d; gap, we have adopted dynamics randomization during training, which is often used to prepare for sim-to-real transfer (<xref ref-type="bibr" rid="B39">Tan et&#x20;al., 2018</xref>). In a recent work by <xref ref-type="bibr" rid="B12">Exarchos et&#x20;al. (2020)</xref>, it is shown that kinematic domain randomization can also be effective for policy transfer. Additionally, efficient adaptation techniques such as latent space methods (<xref ref-type="bibr" rid="B8">Clavera et&#x20;al., 2019</xref>; <xref ref-type="bibr" rid="B45">Yu et&#x20;al., 2019</xref>) or meta-learning (<xref ref-type="bibr" rid="B48">Yu et&#x20;al., 2020</xref>) can be applied to further improve the performance of pre-trained policies in the real environment. We plan to construct an adaptation strategy to realize sim-to-real control policy transfer for the lower extremity exoskeleton robot in the near future.</p>
</sec>
<sec id="s6">
<title>6 Conclusion</title>
<p>In this work, we have presented a robust, RL-based controller for exoskeleton squatting assistance with human interaction. A relatively lightweight lower extremity exoskeleton is presented and used to build a human-exoskeleton interaction model in the simulation environment for learning the robust controller. The exoskeleton foot CoP information collected from the force sensors is leveraged as feedback for balance control, and adversarial perturbations and uncertain human interaction forces are used to train the controller. We have successfully demonstrated the lower extremity exoskeleton&#x2019;s capability to carry a human through squatting motions with a moderate torque requirement and provided evidence of its effectiveness and robustness. With the actuation of the ankle inversion/eversion joint, the learned controllers are also capable of maintaining balance within the frontal plane under large perturbation or interaction forces. The success demonstrated in this study has implications for those seeking to apply reinforcement learning to control robots to imitate diverse human behaviors with strong balance and increased robustness even when subjected to large perturbations. Most recently, we have extended the proposed controller to perform flat ground walking with additional rewards and improved training methods and obtained stable interactive human-exoskeleton walking motions, demonstrating the versatility of the method; these results will be presented in a future work. In the near future, we plan to further extend this framework to learn a family of controllers that can perform a variety of human skills such as sit-to-stand, inclined ground walking and stair climbing. Lastly, these learned controllers will be deployed to the hardware through a sim-to-real transfer method, and experimental tests will be performed to validate the controllers&#x2019; performance.</p>
</sec>
</body>
<back>
<sec id="s7">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="sec" rid="s11">Supplementary Material</xref>; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s8">
<title>Author Contributions</title>
<p>In this work, XZ first proposed the research idea and approach of the paper and provided the project guidance. The code implementation, development and data analysis were done by SL and XZ. The CAD design, development and fabrication of the robotic exoskeleton were done by GJA and EN and its multibody model was created by XZ with support from GJA, EN, and SA. The first draft of the manuscript was written by SL and XZ. SA, GJA, and HS provided valuable suggestions and feedback on the draft, and HS also made some important revisions to the final paper. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s9">
<title>Funding</title>
<p>This work was partially supported by the National Institute on Disability, Independent Living, and Rehabilitation Research (NIDILRR) funded Rehabilitation Engineering Research Center Grant 90RE5021-01-00, and by the National Science Foundation (NSF) Major Research Instrumentation grant 1625644. HS was partially supported by the NIDILRR grant 90DPGE0011.</p>
</sec>
<sec sec-type="COI-statement" id="s10">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s11">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frobt.2021.702845/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frobt.2021.702845/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="Video1.MP4" id="SM1" mimetype="application/MP4" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Androwis</surname>
<given-names>G. J.</given-names>
</name>
<name>
<surname>Karunakaran</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Nunez</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Michael</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Yue</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Foulds</surname>
<given-names>R. A.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Research and Development of New Generation Robotic Exoskeleton for over Ground Walking in Individuals with Mobility Disorders (Novel Design and Control)</article-title>. In <conf-name>2017 International Symposium on Wearable Robotics and Rehabilitation (WeRob)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>1</fpage>&#x2013;<lpage>2</lpage>. </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ayas</surname>
<given-names>M. S.</given-names>
</name>
<name>
<surname>Altas</surname>
<given-names>I. H.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Fuzzy Logic Based Adaptive Admittance Control of a Redundantly Actuated Ankle Rehabilitation Robot</article-title>. <source>Control. Eng. Pract.</source> <volume>59</volume>, <fpage>44</fpage>&#x2013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1016/j.conengprac.2016.11.015</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baraff</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>1994</year>). <article-title>Fast Contact Force Computation for Nonpenetrating Rigid Bodies</article-title>. In <conf-name>Proceedings of the 21st annual conference on Computer graphics and interactive techniques</conf-name>. <fpage>23</fpage>&#x2013;<lpage>34</lpage>. </citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Baud</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Fasola</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Vouga</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ijspeert</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bouri</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Bio-inspired Standing Balance Controller for a Full-Mobilization Exoskeleton</article-title>. In <conf-name>2019 IEEE 16th International Conference on Rehabilitation Robotics (ICORR)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>849</fpage>&#x2013;<lpage>854</lpage>. <pub-id pub-id-type="doi">10.1109/ICORR.2019.8779440</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bayon</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Emmens</surname>
<given-names>A. R.</given-names>
</name>
<name>
<surname>Afschrift</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Van Wouwe</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Keemink</surname>
<given-names>A. Q. L.</given-names>
</name>
<name>
<surname>Van Der Kooij</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Can Momentum-Based Control Predict Human Balance Recovery Strategies?</article-title> <source>IEEE Trans. Neural Syst. Rehabil. Eng.</source> <volume>28</volume>, <fpage>2015</fpage>&#x2013;<lpage>2024</lpage>. <pub-id pub-id-type="doi">10.1109/tnsre.2020.3005455</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bionics</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Rex Technology</source>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.rexbionics.com/us/product-information/">https://www.rexbionics.com/us/product-information/</ext-link> </comment>(<comment>Accessed April, 2020</comment>).</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Qin</surname>
<given-names>L.-Y.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>K.-M.</given-names>
</name>
<name>
<surname>Law</surname>
<given-names>S.-W.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>Recent Developments and Challenges of Lower Extremity Exoskeletons</article-title>. <source>J.&#x20;Orthopaedic Translation</source> <volume>5</volume>, <fpage>26</fpage>&#x2013;<lpage>37</lpage>. <pub-id pub-id-type="doi">10.1016/j.jot.2015.09.007</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clavera</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Nagabandi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Fearing</surname>
<given-names>R. S.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning</article-title>. In <conf-name>International Conference on Learning Representations</conf-name>. </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Crossley</surname>
<given-names>K. M.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.-J.</given-names>
</name>
<name>
<surname>Schache</surname>
<given-names>A. G.</given-names>
</name>
<name>
<surname>Bryant</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Cowan</surname>
<given-names>S. M.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Performance on the Single-Leg Squat Task Indicates Hip Abductor Muscle Function</article-title>. <source>Am. J.&#x20;Sports Med.</source> <volume>39</volume>, <fpage>866</fpage>&#x2013;<lpage>873</lpage>. <pub-id pub-id-type="doi">10.1177/0363546510395456</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Delp</surname>
<given-names>S. L.</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>F. C.</given-names>
</name>
<name>
<surname>Arnold</surname>
<given-names>A. S.</given-names>
</name>
<name>
<surname>Loan</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Habib</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>John</surname>
<given-names>C. T.</given-names>
</name>
<etal/>
</person-group> (<year>2007</year>). <article-title>OpenSim: Open-Source Software to Create and Analyze Dynamic Simulations of Movement</article-title>. <source>IEEE Trans. Biomed. Eng.</source> <volume>54</volume>, <fpage>1940</fpage>&#x2013;<lpage>1950</lpage>. <pub-id pub-id-type="doi">10.1109/tbme.2007.901024</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Emmens</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>van Asseldonk</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Masciullo</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Arquilla</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Pisotta</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Tagliamonte</surname>
<given-names>N. L.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Improving the Standing Balance of Paraplegics through the Use of a Wearable Exoskeleton</article-title>. In <conf-name>2018 7th IEEE International Conference on Biomedical Robotics and Biomechatronics (Biorob)</conf-name>. <fpage>707</fpage>&#x2013;<lpage>712</lpage>. <pub-id pub-id-type="doi">10.1109/BIOROB.2018.8488066</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Exarchos</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C. K.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Policy Transfer via Kinematic Domain Randomization and Adaptation</article-title>. <comment>arXiv preprint arXiv:2011.01891</comment> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Glorot</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Understanding the Difficulty of Training Deep Feedforward Neural Networks</article-title>. In <conf-name>Proceedings of the thirteenth international conference on artificial intelligence and statistics</conf-name>. <fpage>249</fpage>&#x2013;<lpage>256</lpage>. </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Horak</surname>
<given-names>F. B.</given-names>
</name>
<name>
<surname>Henry</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Shumway-Cook</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>1997</year>). <article-title>Postural Perturbations: New Insights for Treatment of Balance Disorders</article-title>. <source>Phys. Ther.</source> <volume>77</volume>, <fpage>517</fpage>&#x2013;<lpage>533</lpage>. <pub-id pub-id-type="doi">10.1093/ptj/77.5.517</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hou</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Training Strategies for a Lower Limb Rehabilitation Robot Based on Impedance Control</article-title>. In <conf-name>2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society</conf-name>, <volume>2012</volume>. <publisher-name>IEEE</publisher-name>, <fpage>6032</fpage>&#x2013;<lpage>6035</lpage>. <pub-id pub-id-type="doi">10.1109/EMBC.2012.6347369</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hwangbo</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dosovitskiy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bellicoso</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Tsounis</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Koltun</surname>
<given-names>V.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Learning Agile and Dynamic Motor Skills for Legged Robots</article-title>. <source>Sci. Robot.</source> <volume>4</volume>. <pub-id pub-id-type="doi">10.1126/scirobotics.aau5872</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karunakaran</surname>
<given-names>K. K.</given-names>
</name>
<name>
<surname>Abbruzzese</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Androwis</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Foulds</surname>
<given-names>R. A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A Novel User Control for Lower Extremity Rehabilitation Exoskeletons</article-title>. <source>Front. Robot. AI</source> <volume>7</volume>, <fpage>108</fpage>. <pub-id pub-id-type="doi">10.3389/frobt.2020.00108</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koolen</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Bertrand</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>De Boer</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>Design of a Momentum-Based Control Framework and Application to the Humanoid Robot Atlas</article-title>. <source>Int. J.&#x20;Hum. Robot.</source> <volume>13</volume>, <fpage>1650007</fpage>. <pub-id pub-id-type="doi">10.1142/s0219843616500079</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kumar</surname>
<given-names>V. C.</given-names>
</name>
<name>
<surname>Ha</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sawicki</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C. K.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Learning a Control Policy for Fall Prevention on an Assistive Walking Device</article-title>. In <conf-name>2020 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>4833</fpage>&#x2013;<lpage>4840</lpage>. </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Grey</surname>
<given-names>M. X.</given-names>
</name>
<name>
<surname>Ha</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kunz</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Jain</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Dart: Dynamic Animation and Robotics Toolkit</article-title>. <source>J.&#x20;Open Source Softw.</source> <volume>3</volume> (<issue>22</issue>), <fpage>500</fpage>. <pub-id pub-id-type="doi">10.21105/joss.00500</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>S.-H.</given-names>
</name>
<name>
<surname>Goswami</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>A Momentum-Based Balance Controller for Humanoid Robots on Non-level and Non-stationary Ground</article-title>. <source>Auton. Robot</source> <volume>33</volume>, <fpage>399</fpage>&#x2013;<lpage>414</lpage>. <pub-id pub-id-type="doi">10.1007/s10514-012-9294-z</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Scalable Muscle-Actuated Human Simulation and Control</article-title>. <source>ACM Trans. Graph.</source> <volume>38</volume>, <fpage>1</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1145/3306346.3322972</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>McGinty</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Irrgang</surname>
<given-names>J.&#x20;J.</given-names>
</name>
<name>
<surname>Pezzullo</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>Biomechanical Considerations for Rehabilitation of the Knee</article-title>. <source>Clin. Biomech.</source> <volume>15</volume>, <fpage>160</fpage>&#x2013;<lpage>166</lpage>. <pub-id pub-id-type="doi">10.1016/s0268-0033(99)00061-3</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mergner</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Lippi</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Posture Control-Human-Inspired Approaches for Humanoid Robot Benchmarking: Conceptualizing Tests, Protocols and Analyses</article-title>. <source>Front. Neurorobot.</source> <volume>12</volume>, <fpage>21</fpage>. <pub-id pub-id-type="doi">10.3389/fnbot.2018.00021</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mungai</surname>
<given-names>M. E.</given-names>
</name>
<name>
<surname>Grizzle</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Feedback Control Design for Robust Comfortable Sit-To-Stand Motions of 3D Lower-Limb Exoskeletons</article-title>. <source>IEEE Access</source>. </citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Nunez</surname>
<given-names>E. H.</given-names>
</name>
<name>
<surname>Michael</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Foulds</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>2-dof Ankle-Foot System: Implementation of Balance for Lower Extremity Exoskeletons</article-title>. In <conf-name>2017 International Symposium on Wearable Robotics and Rehabilitation (WeRob)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>1</fpage>&#x2013;<lpage>2</lpage>. </citation>
</ref>
<ref id="B27">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Paszke</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gross</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Massa</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Lerer</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bradbury</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chanan</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Pytorch: An Imperative Style, High-Performance Deep Learning Library</article-title>. In <conf-name>Advances in Neural Information Processing Systems 32</conf-name>. <publisher-name>Curran Associates, Inc.</publisher-name>, <fpage>8024</fpage>&#x2013;<lpage>8035</lpage>. </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>X. B.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>van de Panne</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills</article-title>. <source>ACM Trans. Graph.</source> <volume>37</volume>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1145/3197517.3201311</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>X. B.</given-names>
</name>
<name>
<surname>Berseth</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>van de Panne</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Terrain-adaptive Locomotion Skills Using Deep Reinforcement Learning</article-title>. <source>ACM Trans. Graph.</source> <volume>35</volume>, <fpage>1</fpage>&#x2013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1145/2897824.2925881</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>X. B.</given-names>
</name>
<name>
<surname>Berseth</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>van de Panne</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning</article-title>. In <conf-name>Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation</conf-name>, <fpage>1</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1145/3072959.3073602</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>X. B.</given-names>
</name>
<name>
<surname>Coumans</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>T.-W.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Learning Agile Robotic Locomotion Skills by Imitating Animals</article-title>. <comment>arXiv preprint arXiv:2004.00784</comment>. </citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rajasekaran</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Aranda</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Casals</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pons</surname>
<given-names>J.&#x20;L.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>An Adaptive Control Strategy for Postural Stability Using a Wearable Robot</article-title>. <source>Robotics Autonomous Syst.</source> <volume>73</volume>, <fpage>16</fpage>&#x2013;<lpage>23</lpage>. <pub-id pub-id-type="doi">10.1016/j.robot.2014.11.014</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sadeghi</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Cad2rl: Real Single-Image Flight without a Single Real Image</article-title>. <comment>arXiv preprint arXiv:1611.04201</comment>. </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salem</surname>
<given-names>G. J.</given-names>
</name>
<name>
<surname>Powers</surname>
<given-names>C. M.</given-names>
</name>
</person-group> (<year>2001</year>). <article-title>Patellofemoral Joint Kinetics during Squatting in Collegiate Women Athletes</article-title>. <source>Clin. Biomech.</source> <volume>16</volume>, <fpage>424</fpage>&#x2013;<lpage>430</lpage>. <pub-id pub-id-type="doi">10.1016/s0268-0033(01)00017-1</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sardain</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Bessonnet</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>Forces Acting on a Biped Robot. Center of Pressure-Zero Moment Point</article-title>. <source>IEEE Trans. Syst. Man. Cybern. A.</source> <volume>34</volume>, <fpage>630</fpage>&#x2013;<lpage>637</lpage>. <pub-id pub-id-type="doi">10.1109/tsmca.2004.832811</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schrade</surname>
<given-names>S. O.</given-names>
</name>
<name>
<surname>D&#xe4;twyler</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>St&#xfc;cheli</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Studer</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>T&#xfc;rk</surname>
<given-names>D. A.</given-names>
</name>
<name>
<surname>Meboldt</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Development of Varileg, an Exoskeleton with Variable Stiffness Actuation: First Results and User Evaluation from the Cybathlon 2016</article-title>. <source>J.&#x20;Neuroeng Rehabil.</source> <volume>15</volume>, <fpage>18</fpage>. <pub-id pub-id-type="doi">10.1186/s12984-018-0360-4</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schulman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wolski</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Dhariwal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Radford</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Klimov</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Proximal Policy Optimization Algorithms</article-title>. <comment>arXiv preprint arXiv:1707.06347</comment>. </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A Review on Lower Limb Rehabilitation Exoskeleton Robots</article-title>. <source>Chin. J.&#x20;Mech. Eng.</source> <volume>32</volume>, <fpage>74</fpage>. <pub-id pub-id-type="doi">10.1186/s10033-019-0389-8</pub-id> </citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Coumans</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Iscen</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bai</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Hafner</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Sim-to-real: Learning Agile Locomotion for Quadruped Robots</article-title>. <comment>arXiv preprint arXiv:1804.10332</comment>. </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thelen</surname>
<given-names>D. G.</given-names>
</name>
</person-group> (<year>2003</year>). <article-title>Adjustment of Muscle Mechanics Model Parameters to Simulate Dynamic Contractions in Older Adults</article-title>. <source>J.&#x20;Biomech. Eng.</source> <volume>125</volume>, <fpage>70</fpage>&#x2013;<lpage>77</lpage>. <pub-id pub-id-type="doi">10.1115/1.1531112</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tobin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fong</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Ray</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zaremba</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World</article-title>. In <conf-name>2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>23</fpage>&#x2013;<lpage>30</lpage>. </citation>
</ref>
<ref id="B42">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Vouga</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Baud</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Fasola</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bouri</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bleuler</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>TWIICE - A Lightweight Lower-Limb Exoskeleton for Complete Paraplegics</article-title>. In <conf-name>2017 International Conference on Rehabilitation Robotics (ICORR)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>1639</fpage>&#x2013;<lpage>1645</lpage>. <pub-id pub-id-type="doi">10.1109/ICORR.2017.8009483</pub-id> </citation>
</ref>
<ref id="B43">
<citation citation-type="book">
<collab>Wandercraft</collab> (<year>2020</year>). <source>Atalante</source>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.wandercraft.eu/en/exo/">https://www.wandercraft.eu/en/exo/</ext-link>
</comment>(<comment>Accessed April, 2020</comment>).</citation>
</ref>
<ref id="B44">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xiong</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Research on the Control System of the Lower Limb Rehabilitation Robot under the Single Degree of Freedom</article-title>. In <conf-name>Applied Mechanics and Materials</conf-name>. <publisher-loc>Switzerland</publisher-loc>: <publisher-name>Trans Tech Publ</publisher-name>, <fpage>15</fpage>&#x2013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.4028/www.scientific.net/amm.643.15</pub-id> </citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>T.-H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Lynn</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Sayd</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Silivanov</surname>
<given-names>V.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Sim-to-Real Transfer for Biped Locomotion</article-title>. In <conf-name>IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>, <conf-loc>Macau, China</conf-loc>, <volume>4</volume>, <fpage>4579</fpage>&#x2013;<lpage>4586</lpage>. <pub-id pub-id-type="doi">10.1109/lra.2019.2931427</pub-id> </citation>
</ref>
<ref id="B46">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>V. C.</given-names>
</name>
<name>
<surname>Turk</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C. K.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Sim-to-real Transfer for Biped Locomotion</article-title>. In <conf-name>2019 IEEE/RSJ International Conference on Intelligent Robots and Systems</conf-name>. <publisher-name>IROS</publisher-name>, <fpage>3503</fpage>&#x2013;<lpage>3510</lpage>. </citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C. K.</given-names>
</name>
<name>
<surname>Turk</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Policy Transfer with Strategy Optimization</article-title>. <comment>arXiv preprint arXiv:1810.05751</comment>. </citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bai</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Coumans</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Ha</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Learning Fast Adaptation with Meta Strategy Optimization</article-title>. <source>IEEE Robot. Autom. Lett.</source> <volume>5</volume>, <fpage>2950</fpage>&#x2013;<lpage>2957</lpage>. <pub-id pub-id-type="doi">10.1109/lra.2020.2974685</pub-id> </citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tran</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Design and Experimental Verification of Hip Exoskeleton with Balance Capacities for Walking Assistance</article-title>. <source>Ieee/asme Trans. Mechatron.</source> <volume>23</volume>, <fpage>274</fpage>&#x2013;<lpage>285</lpage>. <pub-id pub-id-type="doi">10.1109/TMECH.2018.2790358</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>