Vision-based safe autonomous UAV docking with panoramic sensors

The remarkable growth of unmanned aerial vehicles (UAVs) has also sparked concerns about safety measures during their missions. To advance towards safer autonomous aerial robots, this work presents a vision-based solution for ensuring safe autonomous UAV landings with minimal infrastructure. During docking maneuvers, UAVs pose a hazard to people in the vicinity. In this paper, we propose the use of a single omnidirectional panoramic camera pointing upwards from a landing pad to detect and estimate the position of people around the landing area. The images are processed in real time on an embedded computer, which communicates with the onboard computer of approaching UAVs to transition between landing, hovering, or emergency landing states. While landing, the ground camera also aids in finding an optimal landing position, which can be required in case of low battery or when hovering is no longer possible. We use a YOLOv7-based object detection model and an XGBoost model for localizing nearby people, and the open-source ROS and PX4 frameworks for communication, interfacing, and control of the UAV. We present both simulation and real-world indoor experimental results to show the efficiency of our methods.


Introduction
Recently, unmanned aerial vehicles (UAVs, or drones) have seen an unprecedented rise in their adoption rate, primarily thanks to technological advancements improving their availability and dependability [1]. They have been vital components in multiple civil applications, ranging from remote sensing [2] to aerial delivery [3].
One of the key issues stopping wider adoption of UAVs for civilian applications in urban areas is safety and security [4]. Autonomous UAVs flying over populated areas pose inherent hazards. The risk increases significantly during take-off and docking maneuvers, which can endanger nearby passers-by. This paper seeks to address the safety of persons near a landing area and define a framework for safety-aware autonomous UAV landing with minimal ground infrastructure.
Specifically, the aim is to first design and develop a solution with minimal infrastructure footprint and commercial off-the-shelf components. Then, we validate the functionality of the system through a series of experiments in the Gazebo simulator and our 9 × 8 × 5 m indoor test area. Our goal is to provide a solution that can further enhance the safety of autonomous UAV landing operations. One of the fundamental aspects of UAV landing safety is the avoidance of potential hazards on the landing path. While the concept of hazard avoidance during UAV landing is vast, we narrow the scope to protecting pedestrians near the landing area.
Leveraging the recent rapid development of deep-learning-enabled computer vision on embedded hardware [5] and the high potential of 360° panoramic sensors, we approach the problem with an on-ground vision-based system for landing area monitoring to identify people who are at risk from the landing UAV. Our envisioned system is a lightweight landing pad with a single panoramic camera in the center providing a bottom-view that gives the system a 360° view of the surroundings. An embedded computing unit processes the images and packages the results as lightweight, efficient messages that are sent to the UAV so that it can adjust its landing trajectory. Figure 1 illustrates our envisioned system and the intended behavior. On-ground approaches for safe UAV landing have two significant advantages over their onboard counterparts. First, they widen the options for computing platforms and sensors. UAV payloads are limited, so for tasks such as aerial delivery, every gram of weight that can be saved by replacing heavy companion computers and sensors with more lightweight options is directly transferred to the weight that their primary task requires them to carry. The solution described and implemented in this paper does not involve a very high-end computing platform. Second, ground-based solutions are potentially more robust to limited environment observability from UAVs, and can also serve as a redundant way of ensuring safety in such critical scenarios.
Moreover, we design and implement the safe UAV landing software based on open-source libraries. Our software components include the detection module, which consists of an object detector and a distance estimator to identify and localize people in a two-dimensional space, and an autonomous flight program that allows the UAV to land safely while maintaining complete autonomy by using the information about the surroundings provided by the detection module. The functionality of each software component and the communication between them are facilitated by the free and open-source Robot Operating System (ROS), which has become the de facto standard for robotic applications in recent years. The popular autopilot library PX4 is also utilized for high-level UAV control and integration of autonomous flight algorithms.
The rest of this paper is organized as follows. Section 2 introduces related work in computer vision for panoramic sensors and vision-based approaches used in UAV landing. Section 3 introduces our methodology for a ground-based, vision-based safe UAV landing framework. Section 4 then reports our experimental setup and results. Finally, Section 5 concludes the work.

Vision-based systems for autonomous UAV landing
In the literature, research on vision-enabled autonomous landing systems for UAVs, primarily multi-rotor vehicles, can be divided into two main categories: onboard and on-ground. According to the survey by [6], the former approach is the more predominant and well-studied one, with multiple systems developed for landing on known, unknown, and moving areas, while works on on-ground vision-based landing systems are still scarce.
A common point among on-ground systems is that they utilize a diverse range of sensing units because these systems are not restricted by UAV payload. However, most of the work in this category focuses on the pose estimation and control of the UAV rather than the monitoring of the landing site. In one of the earlier studies on on-ground monitoring systems, [7] introduced a computer-controlled camera platform to identify square markers of known size attached to micro aircraft in order to measure their three-dimensional coordinates. The main limitation of this method was the camera's narrow field of view (FOV) and its reliance on a stepper motor to shift its orientation to access other viewpoints. [8] later introduced a system that can estimate a UAV's position in real time based on onboard key features by extracting information provided by a trinocular camera system on the ground. Alternatively, instead of standard RGB cameras, [9] presented a ground-based guidance system utilizing an array of near-infrared cameras, which significantly increases the detection range, to detect, track, and autonomously land a fixed-wing UAV without reliance on GPS data.
Several other works have also presented onboard methods that select safe landing zones by detecting potential hazards on the landing path [10]. In these papers, the authors utilize lightweight convolutional neural networks such as YOLO [11] and MobileNet [12] to detect safe landing zones, which are away from individuals or groups of people in populated areas [13], or flat and obstacle-free areas [14].

Object detection on panoramic images
Object detection on panoramic images is less well studied in the literature than its pinhole counterpart. One approach that has been researched for adapting object detection models to fisheye imagery, which frequently contains oriented and radially distorted objects, is the use of alternative representations in place of standard bounding boxes. [15] explored the usage of curved boxes, oriented boxes, ellipses, and polygons. YOLOv3 [16] was adapted and modified to output these different representations. The results show that 24-sided polygons achieved the most reasonable tradeoff between model complexity and accuracy. Further analysis also reports no drop in inference speed when increasing the number of vertices. Alternatively, [17] proposed a simple framework for oriented box representation by gliding each vertex of the original horizontal box on its corresponding side to get more accurate coverage of the detected object, and demonstrated the method's effectiveness in object detection on aerial images, texts, and pedestrians in fisheye images.
[18] presented a localization method that leverages top-view fisheye images from a UAV together with altitude data. The proposed framework first acquires the pixel positions of objects using an object detection model based on RetinaNet [19] with a MobileNet [20] backbone for more efficient computing. Then, by fusing the camera's parameters with height data from other sensors, a series of coordinate transforms is performed to obtain the object's position in world coordinates.
In addition, some public fisheye image datasets have been published to facilitate the development of this field of research. Most notable is the Woodscape dataset [21] for autonomous driving, comprising over 100,000 images from four surrounding cameras. Later, in 2022, the KITTI-360 dataset [22] was released as a successor to the popular KITTI dataset [23]. It expanded on the original work with more data for suburban driving from multiple sensor units, including two 180° fisheye cameras on each side of the station wagon.

Detection module
To identify whether the landing spot is safe, we implement a system that detects people around the area and estimates their distances to the camera. For the rest of this work, we will refer to this combination of human detection and distance estimation as the detection module.
Human detection. This project's object detection model is based on YOLOv7 [24], whose official project provides different model versions with varying sizes and complexity. The standard models are the tiny version, which optimizes for high throughput and minimal footprint to run on edge GPUs; the normal version, namely YOLOv7, for regular consumer-grade GPUs; and the more powerful, cloud-GPU-oriented YOLOv7-W6. To further optimize YOLOv7-tiny for edge GPUs, the authors use the Rectified Linear Unit (ReLU) as the activation function. The other versions use the Sigmoid-weighted Linear Unit (SiLU) [25] instead. For this work, we mainly consider the tiny and the normal versions of YOLOv7, as empirical testing shows they are more suitable for deployment on our embedded platform.
Distance estimation. Monocular depth estimation is a challenging topic that has received much attention recently. The most common approach is to train a deep learning model to predict depth from an arbitrary input image [26,27].
The training data can come from multiple measuring tools such as LIDAR, RGB-D, and stereo cameras. Unfortunately, most public datasets only have depth images in perspective view. To our knowledge, no pre-trained monocular depth estimation model is available that was trained on data with the same characteristics as ours. Another possible approach that has been studied is integrating a distance estimator head into the object detector's architecture [28].
The goal of the system is not to predict the distance of a person to the camera precisely, but to get a rough estimate of whether the person is close to or far from the camera in order to determine if the surrounding area is safe for UAV landing. Therefore, we choose a more straightforward solution that integrates well with the rest of our system and requires little computational power at inference time. Specifically, we use the bounding box information from the object detector as input for a regression model that predicts each person's distance to the camera. The regression model of choice is the gradient-boosted decision trees algorithm implemented with the XGBoost library [29]. While Figure 2 shows that the bounding box area is correlated with the distance, better results can be obtained when also using other bounding box details, including its center point coordinates (x, y) and its dimensions (w, h), since two pictures showing the same person at the same distance to the camera can have very different bounding box shapes when the orientation changes. Furthermore, varying poses, e.g., crouching and sitting, can drastically change the shapes of the bounding boxes as well.
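As an illustration of the regression input described above, the following minimal sketch converts a detector box into the (x, y, w, h) features. The corner-format input, the normalization by image size, and the function name `bbox_features` are assumptions made for this example and are not taken from our implementation.

```python
def bbox_features(x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space box given by two corners into regression features.

    ASSUMPTION: boxes arrive in (x1, y1, x2, y2) corner format and are
    normalized by the image size; the text only states that the center
    point (x, y) and the dimensions (w, h) are used.
    """
    w = (x2 - x1) / img_w              # normalized box width
    h = (y2 - y1) / img_h              # normalized box height
    x = (x1 + x2) / (2.0 * img_w)      # normalized center, horizontal
    y = (y1 + y2) / (2.0 * img_h)      # normalized center, vertical
    return [x, y, w, h]


# One feature row per detection; rows like this, paired with measured
# distances, could then be used to fit a regressor such as
# xgboost.XGBRegressor (not imported here to keep the sketch dependency-free).
features = bbox_features(100, 200, 300, 400, img_w=1000, img_h=1000)
```

Keeping the features this small is what makes the distance estimator cheap at inference time: the regressor only sees four numbers per detection.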

Vision-based localization
The bounding boxes from the object detector can provide insight into the relative orientation of a detected person to the camera, and the distance predicted by the XGBoost model can estimate how far they are from the camera. Fusing these two pieces of information allows a person to be sufficiently localized in a two-dimensional space. Initially, the image coordinate system must be transformed to one that matches the camera's coordinate system. To simplify the experiment, we position the camera and vehicle to align their coordinate axes with the world coordinate system (in this case, the coordinates of the MOCAP system). In the standard image coordinate system, the origin lies in the top left corner, with a horizontal x-axis from left to right and a vertical y-axis pointing downwards. The image coordinates (in pixels) are then transformed to the camera coordinate system depicted in Figure 3. If the camera is not aligned with the world coordinate system, the offset angle between the world's and the camera's coordinate systems must be known in advance, and a two-dimensional rotation must be performed to obtain the coordinates of the detected people with respect to the world coordinate system.
From the transformed coordinate system, the orientation of the directional vector pointing to detected people can be obtained. For simplicity, we define this directional vector as the normalized vector from the origin of the camera coordinate system to the center point of the corresponding bounding box. Then, the predicted distance from the XGBoost model is used to scale this vector to an approximate position of the detected person in two-dimensional space. We denote the coordinates of a person in the world coordinate system as $X_p$, and the formula for calculating it is:
$$X_p = \begin{bmatrix} d^{x}_{cam} \\ d^{y}_{cam} \end{bmatrix} + d_{pred}\,\hat{v}$$
where $\hat{v}$ is the normalized directional vector defined above, $d^{x}_{cam}$ and $d^{y}_{cam}$ are the camera's position in the world coordinate system, and $d_{pred}$ is the distance prediction result from the distance estimator. Figure 4 summarizes the localization process, from acquiring the people's coordinates in a frame to projecting them into the world coordinates.
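The localization steps can be sketched as follows. This is only an illustrative example: the camera-frame convention (origin at the image center, x to the right, y upwards), the parameter `yaw_offset`, and the function name `localize_person` are assumptions, since the actual axes are those of Figure 3.

```python
import math


def localize_person(box_center_px, image_size_px, d_pred,
                    cam_pos=(0.0, 0.0), yaw_offset=0.0):
    """Estimate a person's 2D world position from a box center and a distance.

    ASSUMPTIONS (hypothetical, for illustration only):
    - the camera frame has its origin at the image center, x right, y up;
    - yaw_offset is the known angle (radians) between camera and world axes.
    """
    u, v = box_center_px
    width, height = image_size_px

    # Image frame (origin top-left, y down) -> assumed camera frame.
    x_cam = u - width / 2.0
    y_cam = height / 2.0 - v

    norm = math.hypot(x_cam, y_cam)
    if norm == 0.0:
        # Box centered exactly on the camera: direction is undefined.
        return tuple(cam_pos)

    # Normalized directional vector towards the detected person.
    ux, uy = x_cam / norm, y_cam / norm

    # Two-dimensional rotation for a camera not aligned with the world frame.
    c, s = math.cos(yaw_offset), math.sin(yaw_offset)
    wx = c * ux - s * uy
    wy = s * ux + c * uy

    # Scale by the predicted distance and shift by the camera position.
    return (cam_pos[0] + d_pred * wx, cam_pos[1] + d_pred * wy)


position = localize_person((960, 480), (1280, 960), d_pred=2.0)
```

Only the direction of the box center matters here; the radial pixel distance is discarded and replaced by the regressor's estimate, which avoids having to model the panoramic lens distortion.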

Safe landing program
System behavior. In our deployment, both in the Gazebo simulation and in a real-world indoor experiment, the vehicle operates in offboard mode, which allows it full autonomy, with position mode as the fallback in case of failure. In indoor environments, the UAV utilizes local coordinates for localization and for determining mission setpoints. The experimental flight mission consists of four phases: taking off, performing the flight mission, pre-landing, and landing. The first, second, and last phases are self-explanatory, while the pre-landing phase activates the safe landing mechanism. We define pre-landing as going to a setpoint at a height that is safe for a complete landing while continuously communicating over a ROS network with the detection module on the ground for information about the surroundings, as shown in Figure 5.
During pre-landing, the vehicle will retreat to a safe position and hover if the detection module detects a person within a predefined safe threshold. After a set period, the vehicle switches to adaptive emergency landing mode and searches for an optimal landing position. The vehicle can also resort to this behavior in circumstances where hovering is impossible, e.g., when the payload is over a threshold or when the battery is low. We assume that aside from people around the landing area, there are no other direct threats to the landing procedure.
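The pre-landing behavior described above can be summarized as a small decision function. The sketch below is a simplified illustration, not the onboard program; the names `Action`, `prelanding_action`, `safe_threshold`, and `hover_timeout` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    LAND = "land"
    HOVER = "hover"
    EMERGENCY_LAND = "emergency_land"


def prelanding_action(person_distances, safe_threshold,
                      hover_elapsed, hover_timeout, can_hover=True):
    """Choose the next action during the pre-landing phase.

    person_distances: estimated distances (m) of detected people to the pad.
    can_hover: False when hovering is impossible (e.g., low battery).
    All parameter names are illustrative assumptions.
    """
    person_nearby = any(d <= safe_threshold for d in person_distances)
    if not person_nearby:
        return Action.LAND                 # landing area is clear
    if not can_hover or hover_elapsed >= hover_timeout:
        return Action.EMERGENCY_LAND       # search for an optimal landing spot
    return Action.HOVER                    # retreat and wait for people to leave


action = prelanding_action([2.1, 5.0], safe_threshold=3.0,
                           hover_elapsed=4.0, hover_timeout=10.0)
```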

Adaptive emergency landing
When the vehicle can no longer hover at a safe position and must land immediately, the optimal position for landing, $X_o = (x_o, y_o)$, considering the vehicle's current position, which is also the camera's position, must satisfy multiple criteria. Firstly, it must be as far away from the surrounding people as possible. Secondly, the landing position must be away from each person by a specific range. Last but not least, we must ensure that the landing position is within a certain threshold, so the search range needs to be limited. To satisfy all requirements mentioned above, we reformulate the optimal landing position search as an optimization problem and utilize a solver to get the results. We implemented the landing spot search with SciPy's minimize function using the Sequential Least Squares Programming (SLSQP) method, which is suitable for constrained optimization. Figure 6 illustrates how we approach the problem.
The positions of detected people are denoted as $X_1, \ldots, X_{n_p}$, and the position of the camera, which is also the hovering position of the UAV, is denoted as $X_c$. We define a search zone as a circular area with radius $r_l$ where the optimizer can search for a landing spot. Around each detected person is a danger zone with range $r_d$, which the vehicle should avoid. Finally, the scan zone with range $r_s$ is where all the people are considered to be in danger and should be avoided. The scan zone is also the area in which the emergency state for the flight controller is triggered, causing the vehicle to initially retreat to a safe position and hover. To ensure that the vehicle does not leave the scan zone, outside of which the camera and the detection module do not provide enough information to conclude whether there are people, we require that $r_l \leq r_s$.
Because we want the UAV to land as far as possible from the people standing in close vicinity of the camera/UAV's initial landing spot, for $n_p$ humans detected in the scan zone, the optimization solver maximizes an objective function (Function 3) of the distances between the candidate landing position $X_o$ and the detected people $X_1, \ldots, X_{n_p}$. Then, to ensure the solution is not within any danger zone, the first constraint is formulated as:
$$\lVert X_o - X_i \rVert \geq r_d, \quad i = 1, \ldots, n_p$$
Lastly, the selected landing position must be in the predefined search zone:
$$\lVert X_o - X_c \rVert \leq r_l$$
While maximizing Function 3 results in a landing position that is the furthest from all detected people, it is sometimes safer to emphasize the people who are closer to the camera, which is also the hovering position of the UAV. To do so, we introduce into Function 3 another term that accounts for how close each person is to the UAV. The parameter α in the rewritten function controls how much the distance of each detected person in the scan zone to the UAV's current position impacts the selection of the landing spot. In other words, the higher α is, the more the UAV tries to avoid people close to it.
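To make the formulation concrete, the following is a minimal sketch of a constrained landing spot search with SciPy's `minimize` and the SLSQP method. The specific objective (a weighted sum of distances), the weighting `(1 + d)^(-alpha)`, the multi-start strategy, and the name `find_landing_spot` are assumptions chosen for illustration; they are not necessarily the exact form of Function 3.

```python
import numpy as np
from scipy.optimize import minimize


def find_landing_spot(people, x_c, r_l, r_d, alpha=0.0):
    """Search for a landing spot inside the search zone, outside danger zones.

    ASSUMPTION: the objective is a weighted sum of distances to people, with
    weights (1 + d_i)^(-alpha), where d_i is person i's distance to x_c.
    alpha = 0 gives a plain sum; larger alpha emphasizes closer people.
    """
    people = np.asarray(people, dtype=float).reshape(-1, 2)
    x_c = np.asarray(x_c, dtype=float)
    if len(people) == 0:
        return x_c  # nobody detected: land at the original spot

    d_cam = np.linalg.norm(people - x_c, axis=1)
    weights = (1.0 + d_cam) ** (-alpha)

    def neg_objective(x):
        # minimize() minimizes, so negate the quantity to be maximized.
        return -np.sum(weights * np.linalg.norm(people - x, axis=1))

    # Squared distances keep the constraints smooth for SLSQP.
    constraints = [{"type": "ineq",
                    "fun": lambda x: r_l ** 2 - np.sum((x - x_c) ** 2)}]
    for p in people:
        constraints.append({"type": "ineq",
                            "fun": lambda x, p=p: np.sum((x - p) ** 2) - r_d ** 2})

    # Multi-start (an added safeguard, not described in the text above).
    angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
    starts = [x_c] + [x_c + 0.5 * r_l * np.array([np.cos(a), np.sin(a)])
                      for a in angles]

    best = None
    for x0 in starts:
        res = minimize(neg_objective, x0, method="SLSQP", constraints=constraints)
        feasible = all(c["fun"](res.x) >= -1e-4 for c in constraints)
        if res.success and feasible and (best is None or res.fun < best.fun):
            best = res
    return None if best is None else best.x


spot = find_landing_spot([(1.0, 0.0)], (0.0, 0.0), r_l=1.5, r_d=1.0)
```

The function returns `None` when no feasible spot is found, in which case a caller would need a fallback policy; that policy is outside the scope of this sketch.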

Offboard navigation
Algorithm 1 summarizes the safe landing program on the UAV's companion computer. It combines the vision-based localization algorithm, the adaptive emergency landing algorithm, and a simple finite state machine determining each mission phase's setpoint. Several parameters related to the flight mission must be pre-determined, including the take-off height, the mission waypoints, and the pre-landing height above the landing pad. Furthermore, the parameters mentioned in Section 3.3 should be tuned for different situations. The condition mentioned in line 1 of the algorithm is used for the simulated and indoor experiments, and only sets a timeout period for the UAV's hovering. This condition can be extended to adapt to more types of emergencies that require immediate landing.

Data preparation
The inputs for the detector described in Section 3.1 are image frames collected from a single PICAM360 panoramic camera module. The images are retained in their original circular panoramic form to minimize the amount of pre-processing required and to streamline implementation on embedded platforms. Our dataset comprises a training set with 5,062 images from 7 ROS bags and a test set with 2,030 images from 2 ROS bags.
To further enrich the dataset, data augmentation is a viable option that has proven effective in improving deep learning models' performance in various domains, including computer vision [30,31]. We applied rotational transformations to the original training and test sets with angle θ ∈ {90°, 180°, 270°}. Conventionally, augmenting the test set is not advisable because it is crucial to maintain the authenticity of the unseen data that the model might encounter in the real world. However, rotating a circular fisheye image is valid in the context of this work, as it simulates changes in the camera placement angle, which are very likely to happen in our application. Unlike pinhole cameras, which have to be held upright when taking regular photos, the panoramic camera in this setup does not have to be in any specific orientation. Furthermore, while people move around the camera when the dataset is collected, the background remains static, so hypothetically, rotation will also enhance the robustness of the trained models to background changes.
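When an image is rotated by a multiple of 90°, its bounding box labels must be rotated accordingly. The sketch below shows one way to do this for normalized center-format boxes; the label format, the counter-clockwise direction, and the function name `rotate_box` are assumptions for illustration.

```python
def rotate_box(box, angle_deg):
    """Rotate a normalized (xc, yc, w, h) box counter-clockwise by 90/180/270 degrees.

    ASSUMPTION: coordinates are normalized to [0, 1] with the origin at the
    top-left corner and y pointing down. The direction matches rotating the
    image array with numpy.rot90(image, k), which is counter-clockwise.
    """
    if angle_deg % 90 != 0:
        raise ValueError("only multiples of 90 degrees are supported")
    xc, yc, w, h = box
    for _ in range((angle_deg // 90) % 4):
        # One counter-clockwise quarter turn: (x, y) -> (y, 1 - x); w and h swap.
        xc, yc, w, h = yc, 1.0 - xc, h, w
    return (xc, yc, w, h)


# The three augmentation angles applied to a single label.
augmented = [rotate_box((0.25, 0.5, 0.1, 0.2), a) for a in (90, 180, 270)]
```

Because quarter turns map axis-aligned boxes to axis-aligned boxes, no box enlargement or re-fitting is needed, unlike for arbitrary rotation angles.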
Another requirement for this project is estimating the distance between each detected human and the camera module. We experimented with two methods to get this information: Decawave's ultra-wideband (UWB) module DWM1001 and the Optitrack MOCAP system. Each person in the experiment holds a UWB module or a set of reflective markers, and another module is placed where the camera is. The distance between the person and the camera is calculated as the Euclidean distance in two-dimensional space between the module they are holding and the camera.

Evaluation metric
The trained object detector should be able to reliably detect potential hazards to the landing operation, in this case, people around the landing area, and in particular those within the safety range defined for the experiment, i.e., the distance within which a detected person is considered close to the camera. To evaluate the trained model's performance on people at different ranges, we use a slightly modified version of COCO's AP across object sizes metric. Examining the distance data illustrated in Figure 2 shows that this distance negatively correlates with the normalized area of the corresponding bounding box, so it is reasonable to use the box area as a rough estimate to separate instances that are close to or far from the camera. The median normalized bounding box area of samples within 3 ± 0.2 m of the camera in the dataset is approximately 0.0135, so the metrics that we use are:
- $AP_F$: AP for far objects, bounding boxes with area $< 0.0135\,h_{im} w_{im}$
- $AP_N$: AP for near objects, bounding boxes with area $\geq 0.0135\,h_{im} w_{im}$
- $AP_{all}$: AP for all objects
Fine-tuning on panoramic dataset. To leverage the well-initialized weights of the pre-trained models, we use them as the foundation and fine-tune them on our training set. With this technique, it is possible to obtain a model capable of performing in the target environment with a relatively small amount of data compared to the large-scale COCO dataset. As mentioned in Section 3.1, we focus on training two model versions, YOLOv7-tiny and YOLOv7. Both models are trained with a base input size of 640 and with multi-resolution training. Multi-resolution training is a technique that varies the training resolution within ±50% of the base resolution, which should improve the model's robustness to scaling changes and its prediction performance on small objects. Previous research has reported promising results on this training technique's effectiveness in improving object detection models' resolution scalability [32,33].
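The resolution sampling used in multi-resolution training can be sketched as below. The snapping of sizes to multiples of a 32-pixel stride is an assumption that is typical of YOLO-style training but is not stated above, and `sample_training_size` is a hypothetical helper name.

```python
import math
import random


def sample_training_size(base=640, scale=0.5, stride=32, rng=random):
    """Sample a training resolution within +/- scale of the base resolution.

    ASSUMPTION: sizes are restricted to multiples of `stride` so that the
    network's downsampling layers divide the input evenly.
    """
    lo = math.ceil(base * (1.0 - scale) / stride)    # smallest allowed multiple
    hi = math.floor(base * (1.0 + scale) / stride)   # largest allowed multiple
    return stride * rng.randint(lo, hi)


# With the defaults, every sampled size lies in [320, 960].
rng = random.Random(0)
sizes = [sample_training_size(rng=rng) for _ in range(5)]
```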

Ablation studies
We now analyze the effect of data augmentation and multi-resolution training on the performance of the trained model.

Rotational augmentation
When forming the dataset, we apply rotational augmentation to enrich it. We conducted this ablation study by evaluating the performance of the fine-tuned models on both the unaugmented and augmented datasets. Table 1 shows that the models trained on the augmented dataset outperform those trained only on unrotated data in all test cases.

Multi-resolution training
We utilized multi-resolution training to improve inference accuracy and robustness to scaling changes. As shown in Table 1, this training method improves the model's performance compared to the model trained with a fixed resolution.

Deployment on embedded platform
Once the models have been trained on the custom dataset and are ready for deployment, they are converted to TensorRT engines and deployed on the target embedded platform, the NVIDIA Jetson Xavier NX. When integrating with the ROS detection node running on the ground computing platform, the ROS bounding box and distance messages must be published at a high frequency to address the tight latency requirements of real-time applications. The maximum frequency at which these messages can be published is 30 Hz, the highest supported frame rate of the PICAM360 module. Because the XGBoost prediction has little to no effect on the topic's frequency during our experiments, the essential factor in the detection module's speed is the object detector's throughput. Table 2 shows the frequency of the bounding box and distance messages when the ROS node is running on the target embedded computer. We gather these measurements using the rostopic tool.

Distance estimator training details
We divided the training data into five sets of bounding box data from 5 different ROS bags. To optimize and validate the performance of the distance estimator, we keep one holdout set and perform a randomized search with cross-validation (Scikit-learn's RandomizedSearchCV) using the training set to obtain the optimal hyperparameters as follows:
- max_depth: 3
- learning_rate: 0.05
- n_estimators: 500
- colsample_bytree: 0.5
- colsample_bylevel: 0.8
- subsample: 0.6
These hyperparameters slightly improve the performance on the holdout set over the default ones, and the results in mean absolute error (MAE), median absolute error (MedAE), maximum error (MaxErr), and explained variance (ExpVar) are shown in Table 3.
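For reference, the four reported metrics can be computed as in the following dependency-free sketch. The definitions follow the common conventions (explained variance as one minus the ratio of residual variance to target variance); the function name `regression_report` and the sample values are made up for illustration.

```python
from statistics import mean, median, pvariance


def regression_report(y_true, y_pred):
    """Compute MAE, MedAE, MaxErr, and explained variance for distance predictions."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    abs_err = [abs(r) for r in residuals]
    return {
        "MAE": mean(abs_err),          # mean absolute error
        "MedAE": median(abs_err),      # median absolute error
        "MaxErr": max(abs_err),        # worst-case absolute error
        # Explained variance: 1 - Var(y - y_hat) / Var(y).
        "ExpVar": 1.0 - pvariance(residuals) / pvariance(y_true),
    }


# Illustrative distances in meters (not data from the experiments).
report = regression_report([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 5.0])
```

Reporting MedAE and MaxErr next to MAE is useful here because a few large misses on distant people can dominate the mean while mattering little for the safe or unsafe decision.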

Evaluation of vision-based localization algorithm
As both components of the detection module have been trained and tested, we proceed to evaluate the performance of the localization algorithm based on visual information. To simplify the evaluation process, we select three datasets with one person walking around the camera for testing. Three temporary XGBoost models were also trained, each without the corresponding test set, so that the reported accuracy is not inflated by overfitting. The object detection model used for this test was YOLOv7-tiny.
After applying the algorithm mentioned in Section 3.1, the resulting trajectories are recorded and visualized in Figure 7.
We evaluate these results with cosine similarity (Cossim) and average positioning error (APE) metrics. The former gives insight into how accurately our method determines the direction of the person with respect to the camera. The latter quantifies how accurate the predicted trajectories in Figure 7 are. The experimental results are shown in Figure 8. For simplicity, frames with missing bounding boxes and frames without corresponding Optitrack data are omitted when calculating the metrics.
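The two metrics can be sketched as follows. The sketch assumes that APE is the mean Euclidean distance between predicted and reference positions and that cosine similarity is computed between the direction vectors from the camera to each position; the function names are hypothetical.

```python
import math


def cosine_similarity(pred, truth, cam=(0.0, 0.0)):
    """Cosine of the angle between predicted and reference directions from the camera."""
    px, py = pred[0] - cam[0], pred[1] - cam[1]
    tx, ty = truth[0] - cam[0], truth[1] - cam[1]
    denom = math.hypot(px, py) * math.hypot(tx, ty)
    if denom == 0.0:
        raise ValueError("direction undefined for a point at the camera position")
    return (px * tx + py * ty) / denom


def average_positioning_error(preds, truths):
    """ASSUMPTION: APE is the mean Euclidean distance over matched frames."""
    return sum(math.dist(p, t) for p, t in zip(preds, truths)) / len(preds)


# Illustrative values (not data from the experiments).
preds = [(1.0, 0.0), (0.0, 2.0)]
truths = [(1.0, 0.0), (0.0, 1.0)]
ape = average_positioning_error(preds, truths)
cossims = [cosine_similarity(p, t) for p, t in zip(preds, truths)]
```

In this example the second prediction points in exactly the right direction but overshoots the distance, so the cosine similarity stays at 1 while the APE increases, which is why the two metrics are reported separately.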

Autonomous landing experiments
Simulation. The autonomous flight programs are thoroughly tested in a simulation environment before deployment to ensure safety. The simulation environment was implemented with PX4 Gazebo SITL. The tests are conducted on a laptop with an NVIDIA RTX3070 GPU to run the YOLOv7 models on image frames from the PICAM360. From the simulated results, we validate that both designed behaviors, hovering and adaptive emergency landing, function correctly during the pre-landing phase.
For the real-world indoor experiments, we deliberately choose a small range for the search zone to keep the experimental UAV within the flight zone (see Section 3.3 for more information on the algorithm). Furthermore, to simplify the experiment, the mission only consists of the UAV taking off and landing at the same spot afterward, because we are most interested in the landing behavior for the scope of this work. Because the safe landing software is still in development, to ensure the safety of the people involved in the experiments, as well as to protect the equipment of the on-ground monitoring system, we place the embedded computer running the detection module and the panoramic camera away from the drone during the experiment and interpolate the positions of the camera and people to the drone's position while analyzing the results. The experimental setup is described in Figure 9. We experimented with one and two people approaching the camera from multiple directions. The hovering and adaptive emergency landing behaviors work as expected in all experiments. Figure 10 presents the optimal landing position selection results for clearer visualization. In Figure 10C, while the optimal landing position maximizes the distance to the detected people, it does not prioritize the person closest to the drone. As explained in Section 3.3, this behavior can be altered by modifying the parameter α in Function 3 to increase the algorithm's emphasis on the distance of each detected person to the camera/drone's position. This effect is demonstrated in Figure 10C-F, which shows that increasing α increases the distance between the landing position and the closest person.

Conclusion
In this paper, we propose a novel on-ground vision-based solution for safe UAV landing by leveraging the omnidirectional view capability of panoramic sensors. The detection module, comprising a YOLOv7-based object detector and an XGBoost-based distance estimator, demonstrates high capability in detecting and localizing humans near the landing zone while delivering real-time performance. Furthermore, a series of indoor experiments has demonstrated the system's reliability in enabling landing UAVs to avoid surrounding pedestrians. Rather than completely replacing available onboard methods [13,14], our solution serves as an extra layer of safety for UAV landing applications. Our ultimate goal is a collaborative autonomy approach where sensor and detection data from the micro-airports are fused with the UAVs' sensors and computational capabilities to enhance the system's reliability, safety, and efficiency.

Fig. 2. Relationship between the normalized bounding box area and the distance between the object and the camera. It is worth noting that high position accuracy is not needed; instead, a high recall in the detection (low probability of false negatives) and good classification accuracy (safe or unsafe distances) are more important.

Fig. 4. Illustration of how image coordinates are transformed to positions of detected people. Because of the panoramic nature of the sensor, images are heavily distorted near the ground level or image edges.

Fig. 5. Communication between the detection module and the flight controller over a ROS network. Only the detection and position estimation results are sent to the UAV.

Fig. 6. Optimization problem to solve to obtain a safe landing position.

Fig. 7. Vision-based localization trajectories in comparison with the trajectories from the Optitrack system (missed detections are omitted).

Fig. 8. Evaluation of the vision-based localization method. Each label Ri denotes one of the three experiments.

Table 1. Performance of the fine-tuned (FT) models and ablation study on the effect of multi-resolution (MR) training and training on rotationally augmented data (Aug).

Table 2. Frequency of the /yolov7/boundindboxes_dist ROS topic at inference time with PyTorch and TensorRT implementations on an NVIDIA Jetson Xavier NX.

Table 3. Performance comparison between XGBoost models with default and tuned hyperparameters.