Sensor-Based Control for Collaborative Robots: Fundamentals, Challenges, and Opportunities

The objective of this paper is to present a systematic review of existing sensor-based control methodologies for applications that involve direct interaction between humans and robots, in the form of either physical collaboration or safe coexistence. To this end, we first introduce the basic formulation of the sensor-servo problem, and then, present its most common approaches: vision-based, touch-based, audio-based, and distance-based control. Afterwards, we discuss and formalize the methods that integrate heterogeneous sensors at the control level. The surveyed body of literature is classified according to various factors such as: sensor type, sensor integration method, and application domain. Finally, we discuss open problems, potential applications, and future research directions.


INTRODUCTION
Robot control is a mature field: one that is already being heavily commercialized in industry. However, the methods required to regulate interaction and collaboration between humans and robots have not been fully established yet. These issues are the subject of research in the fields of physical human-robot interaction (pHRI) (Bicchi et al., 2008) and collaborative robotics (CoBots) (Colgate et al., 1996). The authors of De Luca and  presented a paradigm that specifies three nested layers of consistent behaviors that the robot must follow to achieve safe pHRI: • Safety is the first and most important feature in collaborative robots. Although there has been a recent push toward standardization of robot safety (e.g., the ISO 13482:2014 for robots and robotic devices; ISO 13482:2014, 2014), we are still in the initial stages. Safety is generally addressed through collision avoidance (with both humans or obstacles; Khatib, 1985), a feature that requires high reactivity (high bandwidth) and robustness at both the perception and control layers. • Coexistence is the robot capability of sharing the workspace with humans. This includes applications involving a passive human (e.g., medical operations where the robot is intervening on the patients' body; Azizian et al., 2014), as well as scenarios where robot and human work together on the same task, without contact or coordination. • Collaboration is the capability of performing robot tasks with direct human interaction and coordination. There are two modes for this: physical collaboration (with explicit and intentional contact between human and robot), and contactless collaboration (where the actions are guided by an exchange of information, e.g., in the form of body gestures, voice commands, or other modalities). Especially for the second mode, it is crucial to establish means for intuitive control by the human operators, which may be non-expert users. The robot should be proactive in realizing the requested tasks, and it should be capable of inferring the user's intentions, to interact more naturally from the human viewpoint.
All three layers are hampered by the unpredictability of human actions, which vary according to situations and individuals, complicating modeling (Phoha, 2014), and use of classic control. In the robotics literature, two major approaches for task execution have emerged: path/motion planning (La Valle, 2006) and sensor-based control (Chaumette and Hutchinson, 2006). The planning methods rely on a priori knowledge of the future robot and environment states over a time window. Although they have proved their efficiency in well-structured applications, these methods are hardly applicable to human-robot collaboration, because of the unpredictable and dynamic nature of humans. It is in the authors' view that sensor-based control is more efficient and flexible for pHRI, since it closes the perception-toaction loop at a lower level than path/motion planning. Note also that sensor-based control strategies strongly resemble the processes of our central nervous system (Berthoz, 2002), and can trace their origins back to the servomechanism problem (Davison and Goldenberg, 1975). The most known example is image-based visual servoing (Chaumette and Hutchinson, 2006) which relies directly on visual feedback to control robot motion, without requiring a cognitive layer nor a precise model of the environment.
The aim of this article is to survey the current state of art in sensor-based control, as a means to facilitate the interaction between robots, humans, and surrounding environments. Although we acknowledge the need for other techniques within a complete human-robot collaboration framework (e.g., path planning as mentioned, machine learning, etc.), here we review and classify the works which exploit sensory feedback to directly command the robot motion.
The timing and relevance of this survey is twofold. On one hand, while there have been previous reviews on topics such as (general) human-robot collaboration (Ajoudani et al., 2017;Villani et al., 2018) and human-robot safety (Haddadin et al., 2017), there is no specific review on the use of sensor-based control for human-robot collaborative tasks. On the other hand, we introduce a unifying paradigm for designing controllers with four sensing modalities. This feature gives our survey a valuable tutorial-like nature.
The rest of this manuscript is organized as follows: Section 2 presents the basic formulation of the sensor-based control problem; Section 3 describes the common approaches that integrate multiple sensors at the control level. Section 4 provides several classifications of the reviewed works. Section 5 presents insights and discusses open problems and areas of opportunity. Conclusions are given in section 6.

SENSING MODALITIES FOR CONTROL
Recent developments on bio-inspired measurement technologies have made sensors affordable and lightweight, easing their use on robots. These sensors include RGB-D cameras, tactile skins, force/moment transducers, etcetera (see Figure 1). The works reviewed here rely on different combinations of sensing modalities, depending on the task at stake. We consider the following four robot senses: • Vision. This includes methods for processing and understanding images, to produce numeric or symbolic information reproducing human sight. Although image processing is complex and computationally expensive, the richness of this sense is unique. Robotic vision is fundamental for understanding the environment and human intention, so as to react accordingly. • Touch. In this review, touch includes both proprioceptive force and tact, with the latter involving direct physical contact with an external object. Proprioceptive force is analogous to the sense of muscle force (Proske and Gandevia, 2012). The robot can measure it either from the joint position errors or via torque sensors embedded in the joints; it can then use both methods to infer and adapt to human intentions, by relying on force control (Raibert and Craig, 1981;Hogan, 1985;Morel et al., 1998;Villani and De Schutter, 2008). Human tact (somatosensation), on the other hand, results from activation of neural receptors, mostly in the skin. These have inspired the design of artificial tactile skins (Wettels et al., 2008;Schmitz et al., 2011), thoroughly used for human-robot collaboration. • Audition. In humans, localization of sound is performed by using binaural audition (i.e., two ears). By exploiting auditory cues in the form of level/time/phase differences between left and right ears we can determine the source's horizontal position and its elevation (Rayleigh, 1907). Microphones artificially emulate this sense, and allow robots to "blindly" locate sound sources. Although robotic hearing typically uses two microphones mounted on a motorized head, other nonbiological configurations exist, e.g., a head instrumented with a single microphone or an array of several omni-directional microphones (Nakadai et al., 2006). • Distance. This is the only sense among the four that humans cannot directly measure. Yet, numerous examples exist in the mammal kingdom (e.g., bats and whales), in the form of echolocation. Robots measure distance with optical (e.g., infrared or lidar), ultrasonic, or capacitive (Göger et al., 2010) sensors. The relevance of this particular "sense" in humanrobot collaboration is motivated by the direct relationship existing between the distance from obstacles (here, the human) and safety.
Roboticists have designed other bio-inspired sensors, to smell (see Kowadlo and Russell, 2008 for a comprehensive survey and Russell, 2006;Gao et al., 2016;Rahbar et al., 2017 for 3D tracking applications) and taste (Shimazu et al., 2007;Kobayashi et al., 2010;Ha et al., 2015). However, in our opinion, artificial smell and taste are not yet mature enough for human-robot collaboration. Most of the current work on these senses is for localization/identification of hazardous gases/substances. It is also worth mentioning the increasing popularity of physiological signals for controlling robots. These include, for example, Electromyography and Brain-Computer Interfaces (Ajoudani et al., 2017). Albeit promising, these technologies generally provide unilateral (from human to robot) control, without feedback loop closure. For these reasons, this review will focus only on the four senses mentioned above, namely vision, touch, audition, and distance.

Formulation of Sensor-Based Control
Sensor-based control aims at deriving the robot control input u (operational space velocity, joint velocity, displacement, etc.) that minimizes a trajectory error e = e(u), which can be estimated by sensors and depends on u. A general way of formulating this controller [accounting for actuation redundancy dim(u) > dim(e), sensing redundancy dim(u) < dim(e), and task constraints] is as the quadratic minimization problem: u = arg min u 1 2 e(u) 2 subject to task constraints. (1) This formulation encompasses the classic inverse kinematics problem (Whitney, 1969) of controlling the robot joint velocities (u =q), so that the end-effector operational space position x converges to a desired value x * . By defining the desired endeffector rate asẋ * = −λ (x − x * ), for λ > 0, and setting e = Jq −ẋ * for J = ∂x/∂q as the Jacobian matrix, it is easy to show that the solution to (1) (in the absence of constraints) iṡ q = J +ẋ * , with J + the generalized inverse of J. This leads to the set-point controller 1 : In the following sections, we show how each of the four senses (vision, touch, audition and distance) has been used for robot control, either with (1), or with similar techniques. Figure 2 shows relevant variables for the four cases. For simplicity, we assume there are no constraints in (1), although off-the-shelf quadratic programming solvers (Nocedal and Wright, 2000) could account for them.

Formulation
Visual servoing refers to the use of vision to control the robot motion (Chaumette and Hutchinson, 2006). The camera may be mounted on a moving part of the robot, or fixed in the workspace. These two configurations are referred to as "eye-in-hand" and "eye-to-hand" visual servoing, respectively. The error e is defined with regards to some image features, here denoted by s, to be regulated to a desired configuration s * (s is analogous to x in the inverse kinematic formulation above). The visual error is: Visual servoing schemes are called image-based if s is defined in image space, and position-based if s is defined in the 3D operational space. Here we only briefly recall the image-based approach (on its eye-in-hand modality), since the positionbased one consists in projecting the task from the image to the operational space to obtain x and then apply (2). The simplest image-based controller uses s = [X, Y] ⊤ , with X and Y as the coordinates of an image pixel, to generate u that drives s to a reference s * = [X * , Y * ] ⊤ (in Figure 2A the centroid of the human hand). This is done by defining e as: If we use the camera's 6D velocity as the control input u = v c , the image Jacobian matrix 2 relating [Ẋ,Ẏ] ⊤ and u is: where ζ denotes the depth of the point with respect to the camera. In the absence of constraints, the solution of (1) is:

Application to Human-Robot Collaboration
Humans generally use vision to teach the robot relevant configurations for collaborative tasks. For example, Cai et al. (2016) demonstrate an application where a human operator used a QR code to specify the target poses for a 6 degrees-offreedom (dof) robot arm. In Gridseth et al. (2016), the user provided target tasks via a tablet-like interface that sent the robot the desired reference view; here, the human specified 2 Also known as interaction matrix in the visual servoing literature.
various motions such as point-to-point, line-to-line, etc., that were automatically performed via visual feedback. The authors of Gridseth et al. (2015) presented a grasping system for a teleoperated dual arm robot, where the user specified the object to be manipulated, and the robot completed the task using visual servoing. Assistive robotics represents another very common application domain for visual servoing. The motion of robotic wheelchairs has been semi-automated at various degrees. For instance, Narayanan et al. (2016) presented a corridor following method that exploited the projection of parallel lines. In this work, the user provided target directions with a haptic interface, and the robot corrected the trajectories with visual feedback. Other works have focused on mobile manipulation. The authors of Tsui et al. (2011) developed a vision-based controller for a robotic arm mounted on a wheelchair; in this work, the user manually specified the object to be grasped and retrieved by the robot. A similar approach was reported in Dune et al. (2008), where the desired poses were provided with "clicks" on an screen interface.
Medical robotics is another area that involves sensor-based interactions between humans and robots, and where vision has huge potential (see Azizian et al., 2014 for a comprehensive review). For example, the authors of Agustinos et al. (2014) developed a laparoscopic camera, which regulated its pan/tilt motions to track human-held instruments.

Formulation
Touch (or force) control requires the measurement of one or multiple (in the case of tactile skins) wrenches h, which are (at most) composed of three translational forces, and three torques; h is fed to the controller that moves the robot so that it exerts a desired interaction force with the human or environment. Force control strategies can be grouped into the following two classes (Villani and De Schutter, 2008): • Direct control regulates the contact wrench to obtain a desired wrench h * . Specifying h * requires an explicit model of the task and environment. A widely adopted strategy is hybrid position/force control (Raibert and Craig, 1981), which regulates the velocity and wrench along unconstrained and constrained task directions, respectively. Referring to (1), this is equivalent to setting with S = S ⊤ ≥ 0 a binary diagonal selection matrix, and I as the identity matrix. Applying a motion u that nullifies e in (7) guarantees that the components ofẋ (respectively h) specified via S (respectively I − S) converge toẋ * (respectively h * ). • Indirect control (illustrated in Figure 2B) does not require an explicit force feedback loop. To this category belong impedance control and its dual admittance control (Hogan, 1985). It consists in modeling the deviation of the contact point from a reference trajectory x r (t) associated to the desired h * , via a virtual mechanical impedance with adjustable parameters (inertia M, damping B, and stiffness K). Referring to (1), this is equivalent to setting: Here, x represents the "deviated" contact point pose, withẋ andẍ as time derivatives. When e = 0, the displacement x − x r responds as a mass-spring-damping system under the action of an external force h − h * . In most cases, x r (t) is defined for motion in free space (h * = 0). The general formulation in (1) and (8) can account for both impedance control (x is measured and u = h) and admittance control (h measured and u = x).

Application to Human-Robot Collaboration
The authors of Bauzano et al. (2016) used direct force control for collaborative human-robot laparoscopic surgery. In their method, the instruments are controlled with a hybrid position/force approach. In Cortesao and Dominici (2017), a robot regulated the applied forces onto a beating human heart. Since the end-effector's 3 linear dof were fully-constrained, position control could not be performed, i.e., S = 0 in (7). A drawback of direct control is that it can realize only the tasks which can be described via constraint surfaces. If their location is unknown and/or the contact geometry is complex-as often in human-robot collaboration-indirect control is more suited since: (i) it allows to define a priori how the robot should react to unknown external force disturbances, (ii) it can use a reference trajectory x r (t) output by another sensor (e.g., vision). In the next paragraph, we review indirect force control methods.
By sensing force, the robot can infer the motion commands (e.g., pushing, pulling) from the human user. For example, Maeda et al. (2001) used force sensing and human motion estimation (based on minimum jerk) within an indirect (admittance) control framework for cooperative manipulation. In Suphi Erden and Tomiyama (2010) and Suphi Erden and Maric (2011), an assistant robot suppressed involuntary vibrations of a human, who controlled direction and speed of a welding operation. By exploiting kinematic redundancy, Ficuciello et al. (2013) also addressed a manually guided robot operation. The papers (Bussy et al., 2012;Wang et al., 2015) presented admittance controllers for two-arm robots moving a table in collaboration with a human. In Baumeyer et al. (2015), a human controlled a medical robot arm with an admittance controller. Robot tele-operation is another common human-robot collaboration application where force feedback plays a crucial role; see Passenberg et al. (2010) for a comprehensive review on the topic.
All these works relied on local force/moment measurements. Up to this date, tactile sensors and skins (measuring the wrench along the robot body, see Argall and Billard, 2010 for a review) have been used for object exploration (Natale and Torres-Jara, 2006) or recognition (Abderrahmane et al., 2018), but not for control as expressed in (1). One reason is that they are at a preliminary design stage, which still requires complex calibration (Del Prete et al., 2011;Lin et al., 2013) that constitutes a research topic per se. An exception is Li et al. (2013), which presented a method that used tactile measures within (1). Similarly, in Zhang and Chen (2000), tactile sensing was used to regulate interaction with the environment. Yet, neither of these works considered pHRI. In our opinion, there is huge potential in the use of skins and tactile displays for humanrobot collaboration.

Formulation
The purpose of audio-based control is to locate the sound source, and move the robot toward it. For simplicity, we present the twodimensional binaural (i.e., with two microphones) configuration in Figure 2C, with the angular velocity of the microphone rig as control input: u =α. We hereby review the two most popular methods for defining error e in (1) • ITD-based aural servoing uses the difference τ between the arrival times of the sound on each microphone; τ must be regulated to a desired τ * . The controller can be represented with (1), by setting e =τ −τ * , with the desired ratė τ * = −λ (τ − τ * ) (to obtain set-point regulation to τ * ). Feature τ can be derived in real-time by using standard crosscorrelation of the signals (Youssef et al., 2012). Under a far field assumption: with c the sound celerity and b the microphones baseline.
From (9), the scalar ITD Jacobian is: The motion that minimizes e is: which is locally defined for α ∈ (0, π), to ensure that |J τ | = 0. • ILD-based aural servoing uses ρ, the difference in intensity between the left and right signals. This can be obtained in a time window of size N as ρ = E l /E r , where the E l,r = N n=0 γ l,r [n] 2 denote the signals' sound energies and the γ l,r [n] are the intensities at iteration n. To regulate ρ to a desired ρ * , one can set e =ρ −ρ * withρ * = −λ (ρ − ρ * ). Assuming spherical propagation and slowly varying signal: where y s is the sound source frontal coordinate in the moving auditory frame, and L r the distance between the right microphone and the source. From (11), the scalar ILD Jacobian is J ρ = y s (ρ + 1)b/L 2 r . The motion that minimizes e is: where J −1 ρ is defined for sources located in front of the rig. In contrast with ITD-servoing, here the source location (i.e., y s and L r ) must be known or estimated.
While the methods above only control the angular velocity of the rig (u =α), Magassouba extended both to also regulate 2D translations of a mobile platform (ITD in Magassouba et al., 2015, 2016cand ILD in Magassouba et al., 2016a.

Application to Human-Robot Collaboration
Due to the nature of this sense, audio-based controllers are mostly used in contact-less applications, to enrich other senses (e.g., force, distance) with sound, or to design intuitive humanrobot interfaces.
Audio-based control is currently (in our opinion) an underdeveloped research area with great potential for humanrobot collaboration, e.g., for tracking a speaker. Besides the cited works (Magassouba et al., 2015(Magassouba et al., , 2016a, that closely followed the framework of section 3, others have formulated the problem differently. For example, the authors of Kumon et al. (2003Kumon et al. ( , 2005 proposed a linear model to describe the relation between the pan motion of a robot head and the difference of intensity between its two microphones. The resulting controllers were much simpler than (10) and (12). Yet, their operating range was smaller, making them less robust than their-more analytical-counterparts.

Formulation
The simplest (and most popular) distance-based controller is the artificial potential fields method (Khatib, 1985). Despite being prone to local minima, it has been thoroughly deployed both on manipulators and on autonomous vehicles for obstacle avoidance. Besides, it is acceptable that a collaborative robot stops (e.g., because of local minima) as long as it avoids the human user. The potential fields method consists in modeling each obstacle as a source of repulsive forces, related to the robot distance from the obstacle (see Figure 2D). All the forces are summed up resulting in a velocity in the most promising direction. Given d, the position of the nearest obstacle in the robot frame, the original version (Khatib, 1985) consists in applying operational space velocity Here d o > 0 is the (arbitrarily tuned) minimal distance required for activating the controller. Since the quadratic denominator in (13) yields abrupt accelerations, more recent versions adopt a linear behavior. Referring to (1), this can be obtained by setting e =ẋ −ẋ * withẋ * = λ 1 − d 0 / d d as reference velocity: By defining as control input u =ẋ, the solution to (1) is:  (15), but smoothed by a sigmoid. Proximity servoing is a similar technique, which regulates-via capacitive sensors-the distance between the robot surface and the human. In Schlegl et al. (2013), these sensors modified the position and velocity of a robot arm when a human approached it, to avoid collisions. The authors of Bergner et al. (2017), Leboutet et al. (2016), and Dean-Leon et al. (2017) developed a new capacitive skin for a dual-arm robot. They designed a collision avoidance method based on an admittance model similar to (8), which relied on the joint torques (measured by the skin) to control the robot motion.

INTEGRATION OF MULTIPLE SENSORS
In section 3, we presented the most common sensor-based methods used for collaborative robots. Just like natural senses, artificial senses provide complementary information about the environment. Hence, to effectively perform a task, the robot should measure (and use for control) multiple feedback modalities. In this section, we review various methods for integrating multiple sensors in a unique controller.
Inspired by how humans merge their percepts (Ernst and Banks, 2002), researchers have traditionally fused heterogeneous sensors to estimate the state of the environment. This can be done in the sensors' Cartesian frames (Smits et al., 2008) by relying on an Extended Kalman Filter (EKF) (Taylor and Kleeman, 2006). Yet the sensors must be related to a single quantity, which is seldom the case when measuring different physical phenomena (Nelson and Khosla, 1996). An alternative is to use the sensed feedback directly in (1). This idea, proposed for position-force control in Raibert and Craig (1981) and extended to vision in Nelson et al. (1995), brings new challenges to the control design, e.g., sensor synchronization, task compatibility, and task representation. For instance, the designer should take care when transforming 6 D velocities or wrenches to a unique frame. This requires (when mapping from frame A to frame B) multiplication by for a velocity, and by B V ⊤ A for a wrench. In (16), B R A is the rotation matrix from A to B and B t A × the skew-symmetric matrix associated to translation B t A .
According to Nelson et al. (1995), the three methods for combining N sensors within a controller are: FIGURE 3 | The most common scheme for shared vision/touch (admittance) control, used in Morel et al. (1998), Agravante et al. (2013Agravante et al. ( , 2014. The goal is to obtain desired visual features s * and wrench h * , based on current image I and wrench h. The outer visual servoing loop based on error (3) outputs a reference velocityẋ r that is then deformed by the inner admittance control loop based on error (8), to obtain the desired robot position x.
• Traded: the sensors control the robot one at a time. Predefined conditions on the task trigger the switches: (17) • Shared: All sensors control the robot throughout operation. A common way is via nested control loops, as shown-for shared vision/touch control-in In the example of Figure 3: u = x, u o =ẋ r , e o = e v applying (3) and e i = e t applying (8). • Hybrid: the sensors act simultaneously, but on different axes of a predefined Cartesian task-frame (Baeten et al., 2003). The directions are selected by binary diagonal matrices S j , j = 1, . . . , N with the dimension of the task space, and such that N j=1 S = I: To express all e j in the same task frame, one should apply B V A and/or B V ⊤ A . Note the analogy between (19) and the hybrid position/force control framework (7).
We will use this classification to characterize the works reviewed in the rest of this Section.

Traded Control
The paper (Cherubini et al., 2016) presented a human-robot manufacturing cell for collaborative assembly of car joints. The approach (traded vision/touch) could manage physical contact between robot and human, and between robot and environment, via admittance control (8). Vision would take over in dangerous situations to trigger emergency stops. The switching condition was determined by the position of the human wrt the robot.
In Okuno et al. (2001Okuno et al. ( , 2004, a traded vision/audio controller enabled a mobile robot to exploit sound source localization for visual control. The robot head would automatically rotate toward the estimated direction of the human speaker, and then visually track him/her. The switching condition is that the sound source is visible. The audio-based task is equivalent to regulating τ to 0 or ρ to 1, as discussed in section 3.4. Paper (Hornstein et al., 2006) presented another traded vision/audio controller for the iCub robot head to localize a human speaker. This method constructed audio-motor maps and integrated visual feedback to update the map. Again, the switching condition is that the speaker's face is visible. In Chan et al. (2012), another traded vision/audio controller was deployed on a mobile robot, to drive it toward an unknown sound source; the switching condition is defined by a threshold on the frontal localization error.
The authors of Papageorgiou et al. (2014) presented a mobile assistant for people with walking impairments. The robot was equipped with: two wrench sensors to measure physical interaction with the human, an array of microphones for audio commands, laser sensors for detecting obstacles, and an RGB-D camera for estimating the users' state. Its controller integrated audio, touch, vision, and distance in a traded manner, with switching conditions determined by a knowledgebased layer.
The work (Navarro et al., 2014) presented an object manipulation strategy, integrating distance (capacitive proximity sensors) and touch (tactile sensors). While the method did not explicitly consider humans, it may be applied for human-robot collaboration, since proximity sensors can detect humans if vision is occluded. The switching condition between the two modes is the contact with the object.
Another example of traded control-here, audio/distanceis Huang et al. (1999), which presented a method for driving a mobile robot toward hidden sound sources, via an omnidirectional array of microphones. The controller switched to ultrasound-based obstacle avoidance in the presence of humans/objects. The detection of a nearby obstacle is the switching condition.

Shared Control
In applications where the robot and environment/human are in permanent contact (e.g., collaborative object transportation), shared control is preferable. Let us first review a pioneer controller (Morel et al., 1998) that relied on shared vision/touch, as outlined in Figure 3; Morel et al. (1998) addressed teleoperated peg-in-hole assembly, by placing the visual loop outside the force loop. The reference trajectoryẋ r output by visual servoing was deformed in the presence of contact by the admittance controller, to obtain the robot position command x. Human interaction was not considered in this work.
The authors of Natale et al. (2002) estimated sensorymotor responses to control a pan-tilt robot head with shared visual/audio feedback from humans. They assumed local linear relations between the robot motions and the ITD/ILD measures. This resulted in a controller which is simpler than the one presented in section 3.4. The scheme is similar to Figure 3, with an outer vision loop generating a reference motion, and audio modifying it. Pomares et al. (2011) proposed a hybrid vision/touch controller for grasping objects, using a robot arm equipped with a hand. Visual feedback drives an active camera (installed on the robot tip) to observe the object and detect humans to be avoided, whereas touch feedback moves the fingers, to grasp the object. The authors defined matrix S in (7) to independently control arm and fingers with the respective sensor.

Hybrid Control
In Chatelain et al. (2017), a hybrid scheme controlled an ultrasonic probe in contact with the abdomen of a patient. The goal was to center the lesions in the ultrasound image observed by the surgeon. The probe was moved by projecting, via S, the touch, and vision (from the ultrasound image) tasks in orthogonal directions.

Other Control Schemes
Some works do not strictly follow the classification given above. These are reviewed below.
The authors of Agravante et al. (2013Agravante et al. ( , 2014 combined vision and touch to address joint human-humanoid table carrying. The table must stay flat, to prevent objects on top from falling off. Vision controlled the table inclination, whereas the forces exchanged with the human made the robot follow his/her intention. The approach is shared, with visual servoing in the outer loop of admittance control (Figure 3), to make all dof compliant. However, it is also hybrid, since some dof are controlled only with admittance. Specifically vision regulated only the table height in Agravante et al. (2013), and both table height and roll angle in Agravante et al. (2014).
The works (Cherubini and Chaumette, 2013;Cherubini et al., 2014) merged vision and distance to guarantee lidarbased obstacle avoidance during camera-based navigation. While following a pre-taught path, the robot must avoid obstacles which were not present before. Meanwhile, it moves the camera pan angle, to maintain scene visibility. Here, the selection matrix in (19) was a scalar function S ∈ [0, 1] dependent on the timeto-collision. In the safe context (S = 0), the robot followed the taught path, with camera looking forward. In the unsafe context (S = 1) the robot circumnavigated the obstacles. Therefore, the scheme is hybrid when S = 0 or S = 1 (i.e., vision and distance operate on independent components of the task vector), and shared when S ∈ (0, 1).
In Dean-Leon et al. (2016), proximity (distance) and tactile (touch) measurements controlled a robot arm in a pHRI scenario to avoid obstacles or-when contact is inevitable-to generate compliant behaviors. The framework linearly combined the two senses, and provided this signal to an inner admittance-like control loop (as in the shared scheme of Figure 3). Since the operation principle of both senses was complementary (one requires contact while the other does not), the integration can also be seen as traded.
The authors of Cherubini et al. (2015) enabled a robot to adapt to changes in the human behavior, during a humanrobot collaborative screwing task. In contrast with classic hybrid vision-touch-position control, their scheme enabled smooth transitions, via weighted combinations of the tasks. The robot could execute vision and force tasks, either exclusively on different dof (hybrid approach) or simultaneously (shared approach).

CLASSIFICATION OF WORKS AND DISCUSSION
In this section, we use five criteria to classify all the surveyed papers which apply sensor-based control to collaborative robots. This taxonomy then serves as an inspiration to drive the following discussion on design choices, limitations, and future challenges.

References
Sense ( vision provides the richest perceptual information to structure the world and perform motion (Hoffman, 1998). Touch is the second most commonly used sensor (18 papers) and fundamental in pHRI, since it is the only one among the four that can be exploited directly to modulate contact. Also note that, apart from Papageorgiou et al. (2014), no paper integrates more than two sensors. Given the sensors wide accessibility and the recent progress in computation power, this is probably due to the difficulty in designing a framework capable of managing such diverse and broad data. Another reason may be the presumed (but disputable) redundancy of the three contactless senses, which biases toward opting for vision, given its diffusion and popularity (also in terms of software). Touchthe only sensor measuring contact-is irreplaceable. This may also be the reason why, when merging two sensors, researchers have generally opted for vision+touch (7 out of 17 papers). The most popular among the three integration methods is traded control, probably because it is the easiest to set up. In recent years, however, there has been a growing interest toward the shared+hybrid combination, which guarantees nice properties in terms of control smoothness.
Finally, Table 5 gives a classification based on the robotic platform. We can see that (unsurprisingly) most works use fixed base arms. The second most used platforms here are wheeled robots. Then, the humanoids category, which refers to robots with anthropomorphic design (two arms and biped locomotion capabilities). Finally, we consider robot heads, which are used exclusively for audio-based control. While robot heads are commonly used for face tracking in Social Human Robot Interaction, such works are not reviewed in this survey as they do not generally involve contact.

CONCLUSIONS
This work presents a systematic review of sensor-based controllers which enable collaboration and/or interaction between humans and robots. We considered four senses:
Finally, some open problems must be addressed, to develop robust controllers for real-world applications. For example, the use of task constraints has not been sufficiently explored when multiple sensors are integrated. Also, difficulty in obtaining models describing and predicting human behavior hampers the implementation of human-robot collaborative tasks. The use of multimodal data such as RGB-D cameras with multiple proximity sensors may be an interesting solution for this human motion sensing and estimation problem. More research needs to be conducted in this direction.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary materials, further inquiries can be directed to the corresponding author/s.