- Informatics, Cobots and Intelligent Construction (ICIC) Lab, Engineering School of Sustainable Infrastructure & Environment, University of Florida, Gainesville, FL, United States
Autonomous drones are increasingly deployed for navigation, inspection, and monitoring in urban building and infrastructure environments that are dynamic, partially observable, and safety critical. These missions must balance conflicting objectives such as goal completion, wind avoidance, collision avoidance, signal coverage, and flight efficiency, making Multi-Objective Reinforcement Learning (MORL) an attractive control method. However, current explainability methods rarely examine how MORL policies prioritize different sensor channels during urban drone operations, leaving objective trade-offs and input priorities opaque to human operators. This paper introduces a lightweight group-gating architecture that augments MORL policies with an interpretable priority interface. The module aggregates raw observations into several meaningful categories (goal information, kinematics, wind, position, signal coverage, penalties, obstacle distance) and learns a gate vector that reweights these groups at every decision step. Integrated into a Proximal Policy Optimization (PPO) agent and evaluated in high-fidelity Unity simulations of urban operations with dynamic wind fields, the architecture preserves task performance while revealing stable priority patterns. Based on the results, three main findings emerge. First, the group-gating layer preserves asymptotic reward and value loss relative to ungated baselines. Second, gate dynamics exhibit dual-mode behavior, with a shared component that tracks global task difficulty and category-specific reallocations that differentiate wind and obstacle distance. Third, observation priorities align with environmental dynamics, with Dynamic Time Warping analysis showing 39% improved alignment for wind and 19% for obstacle distance when tracking changes rather than absolute levels. The resulting protocol provides a basis for real-time monitoring and for exploring adaptive sensor scheduling and early fault-detection heuristics in autonomous urban drone operations.
1 Introduction
While autonomous drones have been widely used for navigation, inspection, and monitoring in construction and infrastructure operations, deploying these drones effectively and safely in urban settings remains a significant challenge (Choi et al., 2023; Choi et al., 2024; Liu et al., 2024). The operating environment of urban settings is not only complex but also inherently dynamic and partially observable: cluttered skylines, unpredictable wind gusts, and moving equipment together create uncertainties and risks that complicate real-time decision-making (Son et al., 2022; Zajic et al., 2011). Therefore, the algorithms guiding these drones must be trained to handle complex multi-objective control problems, such as avoiding collisions while reaching targets, compensating for wind disturbances (Wu et al., 2024a), efficiently planning charging strategies to prevent power depletion (Das et al., 2022; Dash et al., 2025), coordinating with other drones to execute missions, and optimizing paths through wireless sensor networks (Das and Dash, 2023a; Das and Dash, 2023b). To tackle these challenges, researchers increasingly turn to Deep Learning frameworks like Multi-Objective Reinforcement Learning (MORL) (Roijers et al., 2013). MORL extends traditional RL to settings with multiple, often conflicting criteria. In practice, these settings are addressed either by approximating Pareto-efficient solution sets or by optimizing a scalarized utility under an explicit preference model (Van Moffaert and Nowé, 2014). In this study, we adopt the scalarized formulation and use preference-conditioned linear scalarization to train a single policy. This enables a policy to represent trade-offs, such as balancing mission success, energy efficiency, and collision risk, within a unified decision framework. Studies have shown that these algorithms can effectively control drones in complex, high-dimensional environments that adapt to varying operational priorities (Fei et al., 2024; Wu et al., 2024b).
While deep MORL frameworks have demonstrated strong performance in balancing competing objectives such as collision avoidance, energy conservation, and trajectory stability, their internal decision-making processes remain opaque. This opacity has direct operational implications. Without understanding why a drone selects one action over another in a complex situation, engineers and supervisors cannot perform meaningful post-hoc failure analysis or anticipate when the policy might behave unpredictably (Adadi and Berrada, 2018; Avacharmal, 2024). In safety-critical urban environments, such uncertainty undermines the foundation of human-autonomy collaboration. A human supervisor can only maintain calibrated trust if the drone’s behavior is legible and its priorities are predictable (Schött et al., 2023; Wu et al., 2025a). When the learned policy functions as a black box, trust either erodes, leading to overcautious interventions, or becomes misplaced, resulting in unsafe overreliance (Hancock et al., 2011). Explainable AI (XAI) methods offer potential remedies, particularly those that provide post-hoc state or action saliency to visualize which features and time steps influence decisions (Adebayo et al., 2018; Greydanus et al., 2018; Hausknecht and Stone, 2015). While a growing body of work explains reinforcement learning policies via state saliency or action attribution, existing methods rarely investigate how multi-objective policies in urban drone missions prioritize information across heterogeneous sensor channels (e.g., wind, obstacle distance, signal coverage) over time. This gap makes it difficult for practitioners to understand which environmental cues a MORL controller is relying on when balancing safety and mission performance (e.g., wind disturbance vs. obstacle avoidance), whether those priority signals are stable across episodes, and how well they correspond to actual environmental conditions. This constitutes the core knowledge gap addressed here: we lack practical, reproducible methods to obtain and validate interpretable priority-allocation traces that reveal how a deep RL policy distributes attention or priority across semantically grouped sensor channels over time in complex urban missions. Without such observation tools, the deployment of multi-criteria RL policies in safety-critical settings becomes harder to audit, harder to diagnose after failures, and more difficult to calibrate trust appropriately (Puiutta and Veith 2020a; Puiutta and Veith 2020b).
To address this gap, our goal is to investigate whether a lightweight group-gating layer can serve as a reliable and useful observation window into a MORL policy’s allocation strategy. We argue that a useful signal must (1) be stable and non-random, (2) exhibit partial decoupling across categories (allowing independent risk tracking), and (3) vary with external conditions in a way that matches the control problem, all while maintaining task performance. In practice, MORL settings are commonly handled either by approximating Pareto-efficient policy sets or by optimizing a scalarized utility under an explicit preference model. In this work, we adopt the scalarized formulation: we train a single PPO (Schulman et al., 2017) policy to maximize a scalar reward constructed via preference-conditioned linear scalarization of multiple reward components. We cast the task as a partially observable control problem and define several semantic observation groups (e.g., wind, obstacle distance, goal information). Experiments are conducted in a high-fidelity Unity simulation (Juliani et al., 2018) of an urban environment, complete with dynamic wind fields (Wu et al., 2024b). This scalarized multi-criteria setup is used as a controlled platform to evaluate the proposed group-gated priority traces, and we do not claim Pareto-front coverage or Pareto-optimality guarantees. Our analysis compares the internal priority signals with external environment measurements, using Dynamic Time Warping (DTW) and Spearman correlation to quantify temporal alignment and monotonic relationships (Senin, 2008). The remainder of the paper reviews related work, details the environment and analysis methods, reports the full experimental results, and discusses the implications of these findings for future work in verifiable and transparent autonomous systems.
2 Related work
2.1 Autonomous drones in construction operations
Unmanned aerial systems (UAS) have evolved from experimental demonstrations to routine tools for surveying, progress monitoring, façade inspection, and post-event assessment on construction sites (Albeaino et al., 2019; Zhou and Gheisari, 2018). Operating in dense urban environments, however, presents significant challenges that directly affect autonomous control. Tall structures block or reflect satellite signals, causing Global Positioning System (GPS) occlusion and multipath interference. Wind gusts channeled by street canyons, temporary occlusions from cranes, and proximity to scaffolds create dynamic disturbances and intermittent loss of line-of-sight (Kim et al., 2025). Studies using Global Navigation Satellite System (GNSS) and Remotely Piloted Aircraft System (RPAS) data confirm that these “urban canyon” effects bias position estimates and reduce flight reliability (Paradis and Chapdelaine, 2025; Zhang and Hsu, 2021; Zheng et al., 2024). As a result, perception and control degrade precisely when accurate behavior is most needed, a persistent difficulty for drones operating in real construction contexts (Jiang et al., 2017; Zheng et al., 2024).
Recent work no longer treats drones purely as data collectors but as autonomous agents making decisions during missions. In construction robotics, this shift introduces human-centered requirements such as supervision, diagnostic transparency, and calibrated trust alongside tracking accuracy and efficiency (Agrawal and Cleland-Huang, 2021; Fei et al., 2024; Gupta and Nair, 2023). Safety frameworks originally developed for factory-based collaborative robots (Standardization, 2016) remain relevant for aerial systems operating near workers and equipment. Task-specific risk assessment, interpretable interaction, and bounded velocity or force contribute to safer field operations. Overall, these conditions reveal a broader issue: as autonomy increases, understanding how the control policy makes decisions becomes as critical as maintaining precise flight performance. This realization motivates research on learning mechanisms capable of operating safely under uncertainty, and on methods that make these mechanisms transparent to human supervisors.
2.2 Learning in a partially observable environment
Autonomous drones must often act with incomplete information. In cluttered and dynamic construction sites, sensors provide only partial observations of the true environmental state. Occlusions, limited field-of-view, and stochastic wind all introduce uncertainty. Such conditions are naturally modeled by the Partially Observable Markov Decision Process (POMDP) framework (Lauri et al., 2022; Spaan, 2012), where policies must make decisions under uncertainty about hidden states.
To manage partial observability, modern controllers integrate learning architectures that infer or remember latent variables. Recurrent value and policy learners, such as the Deep Recurrent Q-Network (DRQN), compress sequences of past observations into hidden states for improved temporal reasoning (Fan et al., 2020). Latent-state models such as Deep Variational Reinforcement Learning (DVRL) further infer unobserved factors online (Igl et al., 2018). Recent sequence models extend this capability using memory that aggregates information over longer horizons with enhanced stability (Hausknecht and Stone, 2015; Igl et al., 2018; Parisotto et al., 2020). These techniques are particularly valuable when key cues, such as wind variation or obstacle motion, arrive intermittently.
In practice, construction missions involve multiple, sometimes competing, objectives: mission efficiency, collision avoidance, energy economy, and safety. MORL explicitly represents such trade-offs by modeling returns as vectors rather than scalar rewards (Roijers et al., 2013; Wu et al., 2025b). Complementary Constrained MDP (CMDP) formulations incorporate explicit safety constraints during training and execution (Achiam et al., 2017; Chow et al., 2018), while Lyapunov-based updates provide theoretical guarantees of near-constraint satisfaction. Broader reviews in robotics highlight design patterns for reducing unsafe exploration, especially for aerial vehicles operating in shared workspaces with humans (Chow et al., 2019; Garcıa and Fernández, 2015). Depending on the formulation, MORL methods may target Pareto-efficient policy sets or optimize a scalarized utility under an explicit preference model. In this paper, we use preference-conditioned linear scalarization with a single PPO policy, and we leverage this setup to study interpretable group-gated priority traces rather than proposing a new MORL optimizer.
Case studies demonstrate progress through end-to-end and hybrid policies, disturbance-aware control, and energy-sensitive planning in turbulent environments (Banerjee and Bradner, 2024). Surveys of sensing and training configurations underline the importance of effective history use, explicit preference modeling, and safety-aware updates for successful field transfer (Chen et al., 2024; Zhao et al., 2022). Yet, as these learning-based controllers grow more capable, they also become opaque: it remains unclear which cues they rely on to infer hidden states or resolve competing objectives. Understanding this internal reasoning, particularly in uncertain, safety-critical environments, requires additional interpretive tools.
2.3 Interpretable methods in reinforcement learning
Interpretability in reinforcement learning (RL) has been studied under several complementary paradigms, including feature-importance analyses, process-level visualizations, and inherently interpretable policy representations (Puiutta and Veith 2020a; Puiutta and Veith 2020b). Recent surveys of explainable reinforcement learning (XRL) organize existing methods into feature-importance, learning-process, and policy-level categories, highlighting both the progress and the remaining gaps in making RL decisions transparent to human users (Milani et al., 2024). A large body of work focuses on explaining how specific observations influence individual actions. Saliency-map approaches visualize which parts of the input state most affect the chosen action in visual domains, for example, by perturbing pixels and measuring the impact on the action-value or policy output (Huber et al., 2022; Puri et al., 2019). Other methods reconstruct actions via surrogate models to attribute importance to input features, extending feature-attribution ideas from supervised learning to deep RL (Chen et al., 2020; Guo et al., 2021). These approaches provide fine-grained importance scores for individual state dimensions but typically remain local (per state-action pair) and are not designed to summarize stable prioritization patterns over semantically grouped sensor channels.
Other work pursues interpretability by design, replacing opaque neural policies with more structured or symbolic representations. Programmatically interpretable RL searches in a restricted space of human-readable policies, using neural networks only as oracles to guide the search (Verma et al., 2018). Formal-methods-based RL combines temporal-logic specifications with control barrier functions, producing policies whose safety and high-level behavior can be verified and explained (Li et al., 2019). More recent approaches use neural-symbolic logic or Shapley-based decompositions to derive stable, interpretable policies while preserving performance (Ma et al., 2020; Xing et al., 2023). In multi-objective settings, interpretability has been explored through Pareto-front structure or explicit regularization on interpretable preference representations (Rosero and Dusparic, 2025; Xia and Herrmann, 2025). These methods, however, primarily explain trade-offs in objective space or constrain policy form, rather than explicitly revealing how heterogeneous observations are prioritized during execution. In robotics and navigation, explainable RL has been applied to mobile robots and unmanned aerial vehicles, often using visual saliency or rule-based policy structures to provide human-understandable rationales for path planning and obstacle avoidance (He et al., 2020; Potteiger and Koutsoukos, 2023). Such work demonstrates the value of interpretable controllers in safety-critical domains, yet explanations usually target high-level behavior (e.g., why a particular trajectory was chosen) or localized state importance.
Existing methods are valuable for understanding local feature importance, verifying safety, or summarizing policy structure. However, there remains a lack of methods that systematically characterize how an RL agent allocates priority across grouped observation channels (e.g., wind, obstacle geometry, signal coverage) over time, especially in multi-objective, autonomous urban drone operations. Current approaches seldom connect input-group prioritization to evolving environmental dynamics in a way that is directly aligned with domain concepts in building and infrastructure missions. This gap motivates the group-gating approach proposed in this work, which aims to expose real-time priority allocation over sensor groups within a MORL-based autonomous drone controller.
Table 1 shows a concise comparison of the representative research in attention-based policies, mixture-of-experts, and post-hoc attribution. We emphasize that attention-based policies, mixture-of-experts routing, and post-hoc attribution have each made substantial contributions to interpretable and scalable RL. Our contribution is complementary and more narrowly scoped. We focus on producing group-gated priority traces, meaning a low-dimensional, time-resolved gate signal over predefined semantic observation groups that are recorded during execution and can be used for auditing and event association in urban drone operations.
3 System design
3.1 Simulation environment settings
To investigate how a policy learns to prioritize competing signals (such as wind versus obstacles), we developed a simulator in the Unity engine with the ML-Agents package (Juliani et al., 2018), which serves as the testbed for training our multi-objective PPO policy with a group-gating layer and features a detailed model of a DJI Mavic 2 Pro drone. The core of the simulator is a custom-built landscape representing a 3,000-foot-square area of Manhattan. While realistic textures from the Unity Asset Store were used for visual fidelity, we manually constructed the primary building meshes (as shown in Figure 1). This approach was crucial, as it allowed us to create simplified colliders (bounding boxes) optimized for our physics-based wind simulation (Wu et al., 2024b). The street-level environment, including streets, sidewalks, and streetlights, was intentionally simplified to isolate the core navigation challenge and omits dynamic objects such as vehicles or pedestrians. Crucially, only the building meshes were equipped with colliders, defining them as the only physical obstacles in the flight path. The entire scene is illuminated by a fixed directional light set at a 45-degree angle to simulate consistent lighting conditions. Targets were placed at the map’s corners to create a complex flight route. A 200-foot altitude limit was imposed during training, forcing the drone to navigate through this custom-designed urban terrain.
To complement the static landscape, we developed a dynamic wind simulation to create a robust and realistic training environment. The primary challenge was generating authentic aerodynamic scenarios without the prohibitive computational cost of traditional Computational Fluid Dynamics (CFD), which is unsuitable for the thousands of iterations required by Deep Reinforcement Learning (DRL) (Abichandani et al., 2020). Our solution utilizes a custom-built Convolutional Autoencoder, a representation model that is highly efficient at approximating complex airflow. While the detailed technical methodology for this model is discussed in our previous publications, the process begins by defining an initial, global wind condition (speed and direction) across the entire landscape. At predefined intervals, we systematically alter this global wind state.
3.2 Network design and training
3.2.1 Overview of the structure
Figure 2 illustrates the proposed policy network for autonomous urban drone flight. At each decision step, the agent receives a 16-dimensional observation vector that is partitioned into seven semantic groups, encoded by dedicated category encoders, reweighted by a learned gate vector, and fused with the raw observations before a recurrent actor-critic head outputs the control action and value estimate.
3.2.2 Category encoders
We first group the 16 observation features into seven semantically interpretable subsets. Let $x_t^{(g)}$ (Equation 1) denote the features of group $g \in \{1, \dots, 7\}$ at decision step $t$, corresponding to goal, kinematics, wind, position, signal, penalty, and distance. Each group is passed through a dedicated encoder $f_g$ that produces a group embedding $e_t^{(g)} = f_g(x_t^{(g)})$, where each $f_g$ is a small fully connected network with its own parameters. The seven embeddings are concatenated along the feature dimension as Equations 4, 5: $E_t = [e_t^{(1)}; e_t^{(2)}; \dots; e_t^{(7)}]$.
To decide how strongly each group should influence the policy at time $t$, the network computes a gating input $u_t$ from the current observation context. The gating network first maps this 720-D vector to a 64-D hidden state and then to 7 logits as Equations 7, 8: $h_t = \phi(W_1 u_t + b_1)$ and $z_t = W_2 h_t + b_2$, where $\phi$ is a nonlinear activation and $W_1$, $b_1$, $W_2$, $b_2$ are learnable parameters. We then apply a sigmoid element-wise to obtain non-negative, independently scaled gates as Equation 9: $g_t = \sigma(z_t) \in (0, 1)^7$.
In the proposed network, the gate outputs $g_t^{(k)}$ are not constrained to sum to one, where $k$ indexes the seven groups; each gate scales its category independently. Each gating weight rescales its corresponding group embedding via element-wise multiplication as Equation 12 (broadcast across the embedding dimension): $\tilde{e}_t^{(k)} = g_t^{(k)} \odot e_t^{(k)}$. The reweighted representation is obtained by concatenating these gated embeddings as Equation 13: $\tilde{E}_t = [\tilde{e}_t^{(1)}; \dots; \tilde{e}_t^{(7)}]$.
The gated representation $\tilde{E}_t$ is passed to the fusion stage described in Section 3.2.3. For analysis, we also compute the normalized shares $s_t^{(k)} = g_t^{(k)} / \sum_{j} g_t^{(j)}$. These shares can be visualized as the fraction of total gating mass assigned to each category at time $t$.
The seven-group partition is a design choice made for semantic auditability and controlled interpretability analysis. When higher-dimensional sensing modalities are introduced (e.g., LiDAR, thermal imagery, dense point clouds), the same principle can be retained through hierarchical grouping, where modality-level encoders form coarse groups and sub-groups are defined within each modality based on task semantics. In addition, the grouping itself can be made learnable by introducing structured group assignment (e.g., sparse or clustered feature-to-group mapping) while maintaining an interpretable group interface. We leave these scalable regrouping strategies as future work, and in this study we focus on a fixed, semantically grounded grouping to ensure that the logged priority traces correspond to domain-meaningful sensor categories.
During inference, we log the raw gate vector $g_t$ and the normalized shares $s_t^{(k)}$ at every decision step; these logged traces constitute the grouped priority signals analyzed in Section 4.
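To make the gating computation concrete, the following is a minimal PyTorch sketch of the category encoders, the sigmoid gating network, and the element-wise reweighting described above. The feature-to-group index mapping, the embedding size, and the simplifying assumption that the gating network reads the concatenated group embeddings (rather than the 720-D input used in our trained model) are illustrative choices for this sketch, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

# Illustrative feature-to-group mapping over the 16-D observation vector.
# The exact index assignment is an assumption for this sketch.
GROUPS = {
    "goal": [0, 1, 2], "kinematics": [3, 4, 5], "wind": [6, 7, 8],
    "position": [9, 10, 11], "signal": [12], "penalty": [13, 14], "distance": [15],
}

class GroupGating(nn.Module):
    """Encode each semantic group, compute sigmoid gates, and reweight the embeddings."""

    def __init__(self, groups=GROUPS, embed_dim=32, gate_hidden=64):
        super().__init__()
        self.groups = groups
        # One small encoder per group (category encoders).
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(len(idx), embed_dim), nn.ReLU())
            for name, idx in groups.items()
        })
        # Gating network: hidden layer, then one logit per group (cf. Equations 7-8).
        gate_in = embed_dim * len(groups)  # assumption: gates read the concatenated embeddings
        self.gate_net = nn.Sequential(
            nn.Linear(gate_in, gate_hidden), nn.ReLU(),
            nn.Linear(gate_hidden, len(groups)),
        )

    def forward(self, obs):                                 # obs: (batch, 16)
        embeds = [self.encoders[name](obs[:, idx]) for name, idx in self.groups.items()]
        concat = torch.cat(embeds, dim=-1)                  # concatenated embeddings (Eqs. 4-5)
        gates = torch.sigmoid(self.gate_net(concat))        # unconstrained gates in (0, 1) (Eq. 9)
        gated = [g.unsqueeze(-1) * e                        # element-wise reweighting (Eq. 12)
                 for g, e in zip(gates.unbind(dim=-1), embeds)]
        fused = torch.cat(gated, dim=-1)                    # gated representation (Eq. 13)
        shares = gates / gates.sum(dim=-1, keepdim=True)    # normalized priority shares
        return fused, gates, shares
```

During inference, the returned `gates` and `shares` would be the quantities logged as raw gate vectors and priority percentages.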
3.2.3 Multilayer perceptron fusion
The gated representation $\tilde{E}_t$ is concatenated with the raw observation vector and passed through a stack of fully connected layers with nonlinear activations, where each layer refines the fused features into a compact representation for control. To account for partial observability and temporal dependencies (e.g., wind changes, motion history), we feed the fused representation into a recurrent memory module, where the hidden state summarizes recent observations before the actor and critic heads produce the action distribution and the value estimate. Overall, this architecture encodes heterogeneous observations into interpretable group embeddings, modulates them with learnable gates that quantify per-category importance, and fuses the gated representation with raw observations through a deep, recurrent control head. The gating variables $g_t$ are recorded at every decision step and serve as the per-category priority signals analyzed in the following sections.
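Continuing the previous sketch, the fragment below illustrates how the gated representation could be fused with the raw observation and passed through fully connected layers and a recurrent unit before the actor and critic heads. The layer widths, the use of a GRU as the memory module, and the flattened treatment of the three per-axis discrete action branches are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
# Reuses GroupGating from the previous sketch.

class GatedRecurrentPolicy(nn.Module):
    """Fuse gated group embeddings with raw observations, then apply a recurrent control head."""

    def __init__(self, obs_dim=16, embed_dim=32, hidden=128, n_action_logits=9):
        super().__init__()
        self.gating = GroupGating(embed_dim=embed_dim)
        fused_dim = embed_dim * 7 + obs_dim                 # gated embeddings + raw observations
        self.mlp = nn.Sequential(nn.Linear(fused_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)  # memory for partial observability
        self.actor = nn.Linear(hidden, n_action_logits)      # 3 axes x 3 choices, as flat logits
        self.critic = nn.Linear(hidden, 1)

    def forward(self, obs_seq, h0=None):                    # obs_seq: (batch, time, 16)
        b, t, d = obs_seq.shape
        fused, gates, shares = self.gating(obs_seq.reshape(b * t, d))
        x = torch.cat([fused, obs_seq.reshape(b * t, d)], dim=-1)
        x = self.mlp(x).reshape(b, t, -1)
        x, h = self.rnn(x, h0)
        return self.actor(x), self.critic(x), gates.reshape(b, t, -1), h
```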
3.3 Reward function and policy
The autonomous drone agent is the core intelligence of our system, designed to enable complex navigation through the challenging urban environment, and it is trained using a MORL framework (Nguyen et al., 2020). This approach is essential for enabling the agent to balance multiple, often conflicting, flight objectives such as minimizing travel time, ensuring complete obstacle avoidance, and mitigating the effects of dynamic wind. We model the task as a Partially Observable Markov Decision Process. For the training algorithm, we employed PPO (Schulman et al., 2017) due to its recognized stability, efficiency, and robustness in complex control tasks. It trains the agent by optimizing a policy, denoted as $\pi_\theta$, using PPO’s clipped surrogate objective.
To encourage steady progress toward the target, the agent receives a shaped reward based on distance improvement at each step as Equation 25, where the improvement is measured as the reduction in the distance between the drone and the current target from one step to the next. Additional terms penalize elapsed flight time and loss of signal coverage, where each term is computed from the corresponding observation channel at every step. When the agent reaches the target, it receives a terminal reward and the episode terminates with success. When the agent collides with obstacles or boundaries, it receives a penalty as Equation 30 and the episode terminates with failure.
To enable a single policy to handle diverse mission requirements, the objective weights are randomized at episode initialization as Equation 32, where the sampled pair of weights scales the time-penalty and signal-coverage terms of the scalarized reward and is exposed to the agent through the penalty observation group, so that the policy is conditioned on the current preference.
The action is a three-component discrete vector, where each component controls movement in one axis: −1 means negative direction, 0 means no movement, and +1 means positive direction. The action is converted to a commanded velocity as Equation 35, and the actual velocity includes the wind effect as Equation 36, where the local wind vector produced by the wind simulation is added to the commanded velocity and the resulting velocity drives the drone’s motion at each physics step.
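As a hedged illustration of the scalarization and action mapping described above, the sketch below randomizes the objective weights at episode start, combines per-objective reward terms into a scalar reward, and converts the per-axis discrete action into a commanded velocity to which the local wind is added. The weight ranges, the speed constant, and the reward component names are assumptions for this sketch rather than the exact values used in training.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_preferences():
    """Randomize objective weights at episode initialization (cf. Equation 32)."""
    w = rng.uniform(0.0, 1.0, size=2)         # (time-penalty, signal-coverage) weights, illustrative
    return w / w.sum()                        # illustrative normalization

def scalarized_reward(components, prefs):
    """Linear scalarization of per-objective reward terms under the sampled preferences."""
    # `components` holds per-step reward terms; the names are illustrative.
    fixed = components["distance_improvement"] + components["collision_penalty"]
    return fixed + prefs[0] * components["time_penalty"] + prefs[1] * components["signal_reward"]

def action_to_velocity(action, wind, max_speed=5.0):
    """Map per-axis discrete actions in {-1, 0, +1} to a velocity and add wind (cf. Eqs. 35-36)."""
    commanded = np.asarray(action, dtype=float) * max_speed
    return commanded + np.asarray(wind, dtype=float)

prefs = sample_preferences()
r = scalarized_reward({"distance_improvement": 0.4, "collision_penalty": 0.0,
                       "time_penalty": -0.01, "signal_reward": 0.2}, prefs)
v = action_to_velocity([1, 0, -1], wind=[0.8, 0.0, -0.3])
```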
4 Results
4.1 Network performance
We use the task success rate over the entire training process as the criterion for model convergence. During training, each episode is counted when the drone either reaches the goal, crashes, or times out, with a maximum episode length of 15,000 steps. When the task success rate (the number of successful episodes divided by the total number of episodes so far) remains stably above 95%, we regard the policy as converged. Under this criterion, the original network without group gating converged after about 7,000 episodes, the network with constrained group gating converged after roughly 9,000 episodes, and the network with unconstrained group gating converged after about 12,000 episodes.
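The convergence criterion can be monitored with a short script; the sketch below assumes a boolean success flag per finished episode and an illustrative stability window, since the text specifies only that the cumulative success rate must remain stably above 95%.

```python
import numpy as np

def is_converged(successes, threshold=0.95, window=500):
    """Cumulative success rate after each finished episode; converged once the rate stays
    above the threshold for the last `window` episodes (window length is an assumption)."""
    successes = np.asarray(successes, dtype=float)
    rates = np.cumsum(successes) / np.arange(1, len(successes) + 1)
    return len(rates) >= window and bool(np.all(rates[-window:] > threshold))
```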
The reward curves (Figure 3) show that adding a group-gated layer does not harm performance. All three policies converge toward the same reward band and remain stable once training passes the long plateau near the end of the run. The policy with unconstrained gating converges more slowly and exhibits larger variance early in training, then closes the gap. The policy with normalized gating (constrained to sum to one) enforces competition among observation categories, which can reduce variance early but limits representational flexibility. In contrast, the policy with unconstrained gating treats priorities independently and permits multiple observation categories to be emphasized simultaneously, which increases flexibility but can lead to slower calibration when gates saturate near boundary values.
The value-loss curves (Figure 4) corroborate this interpretation. Loss decreases over time for all three policies and stabilizes at a low level by the end of training. Independent gating remains in the higher-loss regime longer before reaching a loss floor comparable to the baseline and normalized gating variants. This combination of comparable final reward (Figure 3) and comparable final loss indicates that unconstrained priority assignment does not degrade the quality of the learned value function or policy at convergence; the cost appears primarily in training efficiency rather than asymptotic capability. For autonomous navigation applications, the practical implication is that group-gated priorities provide an interpretable mechanism for understanding which observation categories the agent prioritizes in different contexts, without sacrificing task performance. In our simulation study, this interpretability benefit incurs a modest training-time overhead. We view this trade-off as acceptable for applications requiring explainability, such as human-robot collaboration in Urban Search and Rescue scenarios, where priority traces could be surfaced to operators as decision-support signals after further validation in operational settings.
The maps overlay repeated inference rollouts using the exported well-trained model. In Figure 5, white polygons are buildings, and the color field encodes the ambient wind pattern. Each magenta trace is one flight from start to goal. Across both sectors the agent reaches the goal consistently and the trajectories cluster into a narrow corridor, which indicates a stable policy under repeated trials. Most variability appears where streets intersect or where the wind gradient is steep, and the deviations are short and self-correcting rather than failure modes. Route choice is consistent with a strategy that limits crosswind exposure by shadowing building edges and committing to straight street segments once a safe corridor is identified. In other words, the trained controller completes the task reliably and shows low run-to-run spread, which matches the reward and loss results reported earlier.
4.2 Grouped priority analysis
Figure 6 tracks the percentage of grouped priority values during inference as the drone completes the task. Goal info is the normalized relative offset to the target; Kinematics is the drone’s velocity vector; Wind is the wind vector in the environment; Position is the agent location in map coordinates. Signal is a binary indicator that equals 1 when the drone is within the radius of either tower, and we also track its running average. Penalty is the pair (time penalty, signal coverage penalty) sampled each episode; these weights are used to scalarize objectives and are included to make preferences explicit, favoring either short flight time or high signal coverage. Distance is the closest obstacle distance from raycasts, with ML-Agents ray-sensor outputs also available through the attached component. Each category remains within a stable range, which indicates that the trained policy has settled on a consistent allocation strategy rather than oscillating across inputs. Goal information sits in the top band and shows short bursts at wayfinding moments. Wind priority stays elevated and varies smoothly with the background field, which matches the path choices that hugged building edges in the trajectory maps. Position and distance remain in a mid range and rise when the agent approaches corridor transitions. Kinematics stays lower and smoother, which is natural once the velocity profile is regulated by the actor. The penalty channel is mostly quiet and spikes briefly near tight clearances or sharp heading corrections. The share of total weight lies in a narrow band of about 12%–17%, and the ranking matches the left panel. This near-conserved resource budget implies that the policy reallocates relative weights across channels rather than changing the total amount of priority.
Figure 7 shows uplift over the 0.5 baseline. Each category remains within a stable band over time and exhibits small, synchronous reallocation at key moments. Goal information maintains the highest uplift, often around 30%–45% later in the run, which indicates a strong goal drive throughout the trajectory. Position and wind form a second tier and show local increases during corridor transitions or when the wind gradient strengthens, suggesting that geometry and wind disturbance jointly shape route commitment and fine adjustments. Distance and signal stay at moderate levels and align with periods of turning or narrowing passages, reflecting local feasibility checks. Kinematics stays lowest across the run and its fluctuations gradually contract, which suggests that once the policy stabilizes, velocity and acceleration do not require persistent high priority and are instead maintained by the learned action regulation.
Normalized priority in Figure 6 shows relatively balanced allocation across categories (range: 12.45%–15.44%). To assess temporal consistency, we compute the coefficient of variation (CV = std/mean × 100%), which quantifies relative variability. Different groups exhibit different stability levels: wind, distance, and signal demonstrate highest consistency (CV < 2%, std <0.25%), while goal_info shows comparatively greater variability (CV = 3.4%, std = 0.52%). Examining priority uplifts relative to the baseline in Figure 7 further reveals temporal dynamics. Goal_info exhibits the largest uplift variability (std = 6.9%, range = 22.0%), while penalty and position maintain more stable uplifts (std = 1.0% and 1.7%, respectively). Kinematics shows high relative uplift variability (CV = 51.4%) despite low mean priority, suggesting selective activation in specific contexts. Overall, although priority distributions remain within a relatively stable range, observable fluctuations exist that reflect task-dependent adaptation.
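The stability statistics above can be reproduced with a few lines of analysis code; the sketch assumes `priority` is the per-step normalized priority share (in percent) logged for one category during inference.

```python
import numpy as np

def variability_stats(priority):
    """Coefficient of variation (CV = std / mean x 100%) and range for one priority trace."""
    priority = np.asarray(priority, dtype=float)
    mean, std = priority.mean(), priority.std()
    return {"mean": mean, "std": std,
            "cv_percent": 100.0 * std / mean,
            "range": priority.max() - priority.min()}
```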
We then evaluate whether grouped gate priorities operate independently or co-vary, as shown in the correlation matrix (Figure 8). Using the unconstrained gates (not the normalized shares), we compute the Spearman correlation matrix across time and summarize the mean absolute off-diagonal correlation (MAC) and the decoupling index DI = 1 − MAC. As shown in the analysis, the gates are partially decoupled rather than fully independent: MAC = 0.460 and DI = 0.540. The first principal component explains 0.532 of the variance, indicating a substantial co-varying mode. Consistent with this, co-activation independence ratios (CAIR) cluster around 0.46 for some pairs but vary for others. Overall, the evidence supports partial decoupling: groups can rise or fall without strict conservation, yet a shared mode still accounts for a significant portion of variability. This reveals two concurrent effects. There is common-mode modulation, where several gates increase or decrease together when the scene changes or the task becomes more difficult, evidenced by the moderate PC1 ratio and positive correlations. There is also category-specific reallocation, where relative weights shift among channels even when the overall level remains similar, reflected by the moderate DI and the heterogeneous off-diagonal structure. The pairwise pattern varies: wind aligns strongly with distance (ρ = 0.947) and goal information (ρ = 0.846), kinematics aligns most strongly with signal (ρ = 0.769), position shows weak coupling to goal information (ρ = 0.119) and wind (ρ = 0.142), while penalty exhibits varied coupling patterns to different channels. In subsequent correlation and DTW analyses, we consider both effects by examining raw gates and, when helpful, residualized gates after removing the common mode so that per-category alignment with observations reflects category-specific variation rather than global shifts.
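The decoupling statistics (MAC, DI, and the variance share of the first principal component) can be computed directly from the logged gate matrix, as sketched below under the assumption that `gates` is a time-by-seven array of raw gate values from one or more rollouts.

```python
import numpy as np
from scipy.stats import spearmanr

def decoupling_metrics(gates):
    """Mean absolute off-diagonal Spearman correlation (MAC), decoupling index DI = 1 - MAC,
    and explained-variance share of the first principal component of the gate traces."""
    gates = np.asarray(gates, dtype=float)            # shape: (time steps, 7)
    corr, _ = spearmanr(gates)                        # 7 x 7 rank-correlation matrix
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    mac = np.abs(off_diag).mean()
    centered = gates - gates.mean(axis=0)
    _, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1_ratio = (s[0] ** 2) / (s ** 2).sum()          # variance share of PC1
    return {"MAC": mac, "DI": 1.0 - mac, "PC1_ratio": pc1_ratio}
```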
Figures 9, 10 are included only as examples to help the reader see what our alignment tools reveal at the level of a single rollout. For these two plots we can describe the following patterns without claiming that they generalize. Wind priority and the wind observation tend to move together in short episodes with small leads or lags. Pointwise correlation looks weak because the priority often shifts a little before or after the local change in wind. DTW tolerates those small timing offsets and recovers a clear episodic match, which is why the alignment path stays near the diagonal and bends only at transition points. Goal info priority presents a different picture in this example. Over long-range intervals, its trend diverges from the distance signal, resulting in a negative correlation on the scatter plot. This indicates that the agent lowers the priority assigned to the relative goal offset as it moves farther from the target. DTW requires significant deformation at both ends of the curve to achieve alignment, resulting in flat or steep segments along the alignment path. In short, in both scenarios, wind acts as a rapid environmental driver, capable of capturing priority almost instantly. Distance, conversely, serves as a gradual process indicator, exerting influence primarily when the drone temporarily moves away from or approaches the target. We do not infer global strategies based on just two examples. To test whether similar patterns emerge across categories and runs, we repeated the same process for all seven groups, collecting ten inference trajectories under identical conditions. For each trajectory and category, we calculated normalized DTW distances and computed Spearman correlation coefficients at matching time indices. Trajectories were then aggregated to obtain average DTW distances and average correlation coefficients.
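A minimal sketch of this aggregation is given below. The DTW implementation is a standard dynamic-programming variant normalized by path length, and the rollout data structure (per-step priority and observation traces keyed by category) is an assumption about how the logs are organized, not a description of our exact pipeline.

```python
import numpy as np
from scipy.stats import spearmanr

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D series, normalized by combined length."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def zscore(x):
    x = np.asarray(x, float)
    return (x - x.mean()) / (x.std() + 1e-8)

def alignment_summary(rollouts, category):
    """Average normalized DTW distance and Spearman correlation for one category across rollouts;
    each rollout is assumed to hold per-step 'priority' and 'obs' traces keyed by category."""
    dtws, rhos = [], []
    for r in rollouts:
        p, o = zscore(r["priority"][category]), zscore(r["obs"][category])
        dtws.append(dtw_distance(p, o))
        rho, _ = spearmanr(p, o)
        rhos.append(rho)
    return np.mean(dtws), np.mean(rhos)
```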
Figure 9. DTW analysis and correlation analysis for priority to wind and raw observation of wind. Upper left: Original time series of priority and wind observation values over steps. Upper right: Normalized comparison of priority and observation time series, with the corresponding DTW distance. Bottom left: Spearman correlation analysis between priority and observation values, including the correlation coefficient and p-value. Bottom right: DTW alignment path illustrating the temporal correspondence between priority and observation sequences.
Figure 10. DTW analysis and correlation analysis for attention weight and raw observation values. Upper left: Original time series of attention weight and observation values across steps. Upper right: Normalized comparison of attention weight and observation time series, with the corresponding DTW distance. Bottom left: Spearman correlation analysis between attention weight and observation values, including the correlation coefficient and p-value. Bottom right: DTW alignment path illustrating the temporal correspondence between attention weight and observation sequences.
Table 3 shows DTW and correlation analysis for normalized observation and percentage of priority for each category. Read from the DTW column first, since it captures episode-level similarity under small temporal offsets. Wind shows the strongest match, indicating that the priority co-varies with local wind fluctuations across contiguous segments. Kinematics and distance occupy an intermediate band, consistent with priority being revisited periodically rather than tracked frame by frame. Goal info and position display weaker shape agreement, and signal and penalty are weaker still, which suggests that these channels are influenced more by task phase or decision context than by waveform similarity to their raw observations. The correlation columns provide the trend direction at the same time index. Goal info, position, and signal exhibit stable negative associations. A natural reading is that as the route stabilizes and information becomes more certain, these channels are down-weighted, with brief increases around turns or corridor switches. Kinematics and penalty are positively associated, in line with higher allocation when motion regulation or local risk rises. Wind and distance yield little pointwise correlation on the raw series, a result that is not at odds with their DTW behavior since episodic responses and slow trends rarely synchronize at the exact time index. The table indicates two recurrent modes of priority allocation. Inputs governed by environmental dynamics tend to produce segment-level coupling that DTW detects even when correlation is weak. Inputs tied to progress and geometry tend to produce slower reallocations for which correlation carries the signal even when shapes do not align closely. Because wind and distance show non-significant correlations on the raw series, we redefine these observations as one-second deltas and repeat the same DTW and correlation analyses to test whether the priority is more sensitive to changes than to original values.
Table 3. DTW and correlation analysis for normalized observation and percentage of priority for each category.
To test whether priority responds more to changes in the scene than to absolute levels, we redefined the observation for wind and distance as a one-second delta and ran the alignment tests again. These two categories (as shown in Figures 11, 12) were chosen because their pointwise correlations on the raw series were weak. The delta view suppresses slow drift and highlights local transitions, which is where the policy reallocates priority in many rollouts. With this transformation the shape match improves markedly. DTW drops by about 39% for wind and about 19% for distance, indicating that the priority traces and the delta series now share sharper onsets and offsets. Spearman correlation results move toward stronger association, and the corresponding p values decrease. The DTW alignment paths stay near the diagonal for long stretches and bend at the same transition points, which is the pattern expected when priority modulates around gusts, corridor entries, and brief approach phases rather than tracking raw levels frame by frame.
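The change-sensitive view amounts to differencing each signal over a one-second horizon before rerunning the alignment analysis; the logging rate assumed in the sketch below is illustrative, and the resulting delta series can be passed to the same `alignment_summary` routine sketched earlier.

```python
import numpy as np

def one_second_delta(series, steps_per_second=10):
    """Magnitude of change of a logged observation over a one-second horizon.
    The 10 Hz logging rate is an assumption for this sketch."""
    x = np.asarray(series, dtype=float)
    d = np.zeros_like(x)
    d[steps_per_second:] = np.abs(x[steps_per_second:] - x[:-steps_per_second])
    return d
```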
Figure 11. DTW analysis and correlation analysis for priority to distance and distance change intensity. (A) Original time series comparison between priority and delta distance observation across training steps. (B) Normalized comparison of priority and delta distance observation time series, with the corresponding DTW distance highlighted. (C) Spearman correlation analysis between priority and delta distance observation, including the correlation coefficient, p-value, and monotonic trend. (D) DTW alignment path illustrating the temporal correspondence between priority and delta distance observation sequences.
Figure 12. DTW analysis and correlation analysis for priority to wind and wind change intensity. (A) Original time series comparison between priority and delta wind observation (Wind_X, Wind_Y, Wind_Z aggregated) across training steps. (B) Normalized comparison of priority and delta wind observation time series, with the corresponding DTW distance highlighted. (C) Spearman correlation analysis between priority and delta wind observation, including the correlation coefficient, p-value, and monotonic trend. (D) DTW alignment path illustrating the temporal correspondence between priority and delta wind observation sequences.
Across ten inference trajectories, the learned policy exhibits two complementary priority allocation patterns. For change sensitive inputs such as wind, the priority and the observations align in bursts with small timing offsets. DTW captures this alignment even when pointwise correlation on raw levels is weak. After redefining wind as delta wind, the episodic match remains and correlations strengthen, which indicates that the gates react to gust onsets rather than to absolute wind level. For slow context such as goal progress and geometry, the priority allocation pattern follows broader monotonic shifts and is revisited mainly at route transitions. Distance fits this second pattern on raw levels and moves closer to the change sensitive regime once expressed as delta distance, where short approach or deceleration episodes carry more weight than the absolute value.
Placed alongside the earlier table, the categories separate cleanly without overstatement. Wind is the clearest episodic case. Goal info, position, and signal retain stable negative correlations with only moderate DTW, consistent with gradual reallocation as routes settle and brief upweighting near turns. Kinematics and penalty remain positively associated, reflecting higher gate values when motion regulation or risk rises, while their DTW alignment is weaker than that of wind because adjustments unfold across longer segments. Taken together, these results, along with the stable task performance and the partially decoupled gates, provide the final empirical basis of the study. The Discussion will consider mechanisms, limitations, and how these patterns can inform interface design and future field validation in construction settings.
5 Discussion
This study examined whether grouped priorities could provide an interpretable window on what a learned policy prioritizes during autonomous flight near structures, and whether that window can be obtained without loss of task performance. The motivation comes from construction practice, where autonomous drone operation under partial observability and multiple objectives needs signals that are predictable, auditable, and linked to safety-relevant events. Priority is treated here as an alignment pattern that can be measured against observations, not as a causal explanation of action. Performance results set the boundary condition. Adding the priority (group-gate) module increased training time, yet final reward and value loss converged to the levels of the baselines. During inference, repeated rollouts in the same environment reached the goal reliably and produced tightly clustered trajectories. We do not infer a specific navigation strategy from these paths. Their stability is sufficient to support the subsequent priority analysis, because unstable behavior would make any priority trace difficult to interpret. The structure of the gates clarifies how to read alignment metrics. The correlation matrix and principal component analysis reveal a common varying mode that moves several gates together when conditions change, alongside category-specific reallocations that adjust the relative mix. A portion of the variance therefore reflects global difficulty or phase, while the remainder reflects where the policy directs priority within a phase.
With that context, the allocation patterns are regular. During inference, category bands remain within narrow ranges and show brief increases at route transitions. Goal-related priority remains elevated, wind- and geometry-related cues increase around corridor changes, and kinematics and penalty increase when regulation or risk rises. The summary table aggregates ten trajectories by computing DTW between each priority series and its paired observation and by computing Spearman correlations at matched indices, followed by trajectory-level averaging. DTW captures episode-level similarity after allowing small temporal offsets. Spearman correlation analysis captures monotonic association that can be nonlinear. Agreement between DTW and correlation is not required because the measures target different properties of the series. Read together, they indicate that inputs driven by fast environmental dynamics tend to produce segment-level coupling that DTW detects even when correlation is weak, while signals related to task progress and geometry tend to produce slower reallocations that appear in correlation even when waveform shapes do not closely match. Wind and distance required a targeted follow-up because their correlations on the raw series were not statistically significant; these two categories change slowly or in bursts, which disadvantages pointwise tests.
Redefining these observations as one-second changes suppresses drift and emphasizes local transitions. Under this definition, shape matching improves for both categories and correlations strengthen. The priority gates, therefore, appear more sensitive to increments in wind and to short approach or deceleration episodes than to absolute levels for these signals. Other categories retain the raw series view because their correlations are already consistent with the DTW evidence. The sensing perspective translates these findings into practical guidance. For change-sensitive channels such as wind, sensing and preprocessing should preserve fast onsets with low latency and adequate temporal resolution. In many platforms this can be achieved without new hardware by computing short-window differences or derivatives from existing estimates and ensuring that these derived channels are available to downstream logic. Priority-aligned scheduling is also natural. When a category repeatedly shows short priority bursts at transitions, sampling rate or computation budget for that sensing chain can be raised during those segments and reduced during steady flight. Alignment statistics further provide a simple diagnostic. If the DTW and correlation profiles drift in a sustained way for a given channel, a sensor fault or timing issue may be detectable earlier than task-level reward changes, although validating this requires dedicated fault-injection and field-style evaluations.
The study also contributes a method for turning internal signals into external evidence that can support knowledge engineering and risk management for drone operations in building environments. Grouped priorities are defined over human-meaningful observation categories, which anchors model internals to site concepts such as wind exposure, obstacle clearance, motion regulation, and goal progress. The priority and observation comparison is summarized with trajectory-level statistics that can be reproduced across different initial wind settings, which fits practical needs for auditability and record keeping. The delta analysis for wind and distance shows how change-oriented features can be surfaced when raw levels are not informative, a transformation that can be implemented in software and deployed on existing platforms. Together these elements outline a path for supervisors and engineers to monitor priority shifts, to align them with site events in logs, and to connect alignment statistics to safety reviews and procedure updates. The contribution is not a user interface or a human study. It is an interpretable measurement protocol and a set of design implications that reflect the complexity of building environments while remaining compatible with existing sensing and operations.
This study has several limitations. First, all experiments are conducted in a single simulated urban environment. As a result, the stability of the observed priority regimes under different layouts, alternative turbulence or wind models, and substantially different sensing configurations remains untested. While we include limited randomization through a small set of precomputed wind-field realizations and modest variation in clutter patterns and sensing latencies, these variations do not fully reflect real-world variability. Second, the study is simulation-based and does not include user interface development or human-subject evaluation. Accordingly, any implications for operator understanding, calibrated trust, or oversight effectiveness should be viewed as prospective rather than validated outcomes. Third, the reported gate–observation alignment results should be interpreted as descriptive indicators of association. The Spearman correlation and DTW analyses quantify temporal alignment but do not establish causal or counterfactual relationships between observations and priority weights. Finally, the delta transformation is used only as an analysis-time lens to highlight change cues in key signals and is not part of the policy input during training or inference, nor does it modify the learned policy.
Future work will expand evaluation beyond the current setting by introducing additional environments, geometry layouts, obstacle densities, and a broader library of wind-field realizations, to test the robustness of both performance and priority allocation under stronger distribution shifts. A natural next step is hardware validation using real drones and field sensing pipelines to assess whether the proposed priority tracing remains informative under measurement noise, latency, and actuation disturbances. In parallel, prototype supervisory interfaces can be developed to present grouped priorities alongside mission context, and controlled user studies can examine how operators interpret these signals, whether they support appropriate interventions, and how presentation choices affect workload and trust calibration. Additional sensing studies can systematically vary sampling rate, latency, and sensor placement to characterize how sensing design influences priority alignment and to derive practical configuration guidance. Training efficiency is another practical direction for future work. While the unconstrained group-gating variant requires more training steps to reach comparable asymptotic performance in our setting, this overhead may be reduced through simple training schedules that do not change the core method. Examples include warm-starting the gated policy from a converged ungated baseline, using a two-stage schedule that enables gating only after the base policy stabilizes, initializing encoders from the baseline and briefly training only the gating module before end-to-end finetuning, and adding mild gate regularization to avoid early saturation. We did not conduct a dedicated acceleration study in this revision, and we present these items as actionable directions rather than validated results. Finally, richer interaction settings, such as multi-agent operation or environments with moving equipment, can be used to stress-test the framework and to refine trajectory-level aggregation and uncertainty estimates, including evaluating whether alignment statistics could serve as candidate online monitoring signals under realistic deployment constraints.
6 Conclusion
This study examined whether grouped priorities could provide a readable view of what a learned policy prioritizes when an autonomous drone operates near structures, and whether that view can be obtained without loss of task performance. Within a high-fidelity simulation of complex building environments, adding a lightweight group-gate module into the network of an autonomous drone increased training time but preserved final reward and value loss. Inference rollouts were stable across repeated trials, which supports the validity of post hoc priority analysis. Three empirical findings ground the contribution. First, the gate structure shows a shared varying mode together with category-specific reallocations. Several gates move together at scene changes while relative weights still shift across channels. Second, alignment metrics summarize how priority relates to observations across ten trajectories. DTW captures episode-level similarity under small timing offsets. Spearman correlation captures pointwise and monotonic tendencies at matched indices. Read together, the results indicate episodic coupling for environment dynamics and slower reallocations for signals linked to task progress and geometry. Third, replacing raw observations with their one-second changes improves shape matching and strengthens correlations, which suggests that priority is sensitive to increments in gusts and to short approach or deceleration episodes rather than to absolute levels.
These signals also matter in complex construction and urban building environments because they connect internal computation to site concepts that practitioners already track, such as wind exposure, obstacle clearance, motion regulation, and goal pressure. Grouped priorities over these categories can be logged alongside trajectories and summarized at the trajectory level for audit and review. The same statistics can guide sensing and processing choices. Change-sensitive channels should preserve rapid onsets with low latency, for example, by increasing sampling rate, refresh rate, or effective bandwidth when their priority rises and using coarser sensing during quiescent periods. The learned priority profiles can inform sensing strategy, indicating which signals may merit higher temporal resolution or communication capacity during critical transitions. In this way, alignment acts not only as an interpretable monitoring signal but also as a design handle for adaptive sensing and resource allocation in the field.
The work is simulation-based and does not include a user interface or a human study. Priority weights are alignment indicators rather than causal proofs. Correlation and DTW do not replace counterfactual tests. Within these boundaries, the contribution is a reproducible protocol for exposing and validating grouped priorities in a multi-objective, partially observable setting, together with evidence that the resulting signals are stable, intelligible, and compatible with performance requirements. This provides a path from interpretability to deployment decisions in construction robotics and supports field trials that connect alignment statistics to safety and productivity outcomes.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
BS: Formal Analysis, Writing – original draft, Software, Methodology, Investigation. HY: Methodology, Software, Validation, Writing – review and editing. JW: Writing – review and editing, Investigation, Validation, Visualization. JD: Writing – review and editing, Resources, Project administration, Supervision, Conceptualization, Methodology.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This material is supported by the Air Force Office of Scientific Research (AFOSR) under grant FA9550-22-1-0492. Any opinions, findings, conclusions, or recommendations expressed in this article are those of the authors and do not reflect the views of the AFOSR.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI was used only to polish the language (grammar and typo checking). The research was designed and conducted entirely by the authors.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Abichandani, P., Lobo, D., Ford, G., Bucci, D., and Kam, M. (2020). Wind measurement and simulation techniques in multi-rotor small unmanned aerial vehicles. IEEE Access 8, 54910–54927. doi:10.1109/access.2020.2977693
Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. International conference on machine learning.
Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi:10.1109/access.2018.2870052
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. (2018). Sanity checks for saliency maps. Adv. Neural Information Processing Systems 31.
Agrawal, A., and Cleland-Huang, J. (2021). “Explaining autonomous decisions in swarms of human-on-the-loop small unmanned aerial systems,” in Proceedings of the AAAI Conference on Human Computation and Crowdsourcing.
Albeaino, G., Gheisari, M., and Franz, B. W. (2019). A systematic review of unmanned aerial vehicle application areas and technologies in the AEC domain. J. Information Technology Construction 24.
Avacharmal, R. (2024). Explainable AI: bridging the gap between machine learning models and human understanding. J. Inf. Educ. Res. 4 (2). doi:10.52783/jier.v4i2.960
Banerjee, P., and Bradner, K. (2024). Energy-optimized path planning for uas in varying winds Via reinforcement learning. AIAA Aviat. Forum And Ascend 2024.
Chen, X., Wang, Z., Fan, Y., Jin, B., Mardziel, P., Joe-Wong, C., et al. (2020). Reconstructing actions to explain deep reinforcement learning.
Chen, S., Chen, S., Mo, Y., Wu, X., Xiao, J., and Liu, Q. (2024). Reinforcement learning-based energy-saving path planning for UAVs in turbulent wind. Electronics 13 (16), 3190. doi:10.3390/electronics13163190
Choi, H.-W., Kim, H.-J., Kim, S.-K., and Na, W. S. (2023). An overview of drone applications in the construction industry. Drones 7 (8), 515. doi:10.3390/drones7080515
Choi, W., Na, S., and Heo, S. (2024). Integrating drone imagery and AI for improved construction site management through building information modeling. Buildings 14 (4), 1106. doi:10.3390/buildings14041106
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. (2018). A lyapunov-based approach to safe reinforcement learning. Adv. Neural Information Processing Systems 31.
Chow, Y., Nachum, O., Faust, A., Duenez-Guzman, E., and Ghavamzadeh, M. (2019). Lyapunov-based safe policy optimization for continuous control. doi:10.48550/arXiv.1901.10031
Das, R., and Dash, D. (2023a). Collaborative data gathering and recharging using multiple mobile vehicles in wireless rechargeable sensor network. Int. J. Commun. Syst. 36 (15), e5573. doi:10.1002/dac.5573
Das, R., and Dash, D. (2023b). Joint on-demand data gathering and recharging by multiple mobile vehicles in delay sensitive WRSN using variable length GA. Comput. Commun. 204, 130–146. doi:10.1016/j.comcom.2023.03.022
Das, R., Dash, D., and Yadav, C. B. K. (2022). An efficient charging scheme using battery constrained mobile charger in wireless rechargeable sensor networks. Telecommun. Syst. 81 (3). doi:10.1007/s11235-022-00951-w
Dash, D., Das, R., and Yadav, C. B. K. (2025). Enhancing wireless sensor durability via on-demand Mobile charging and energy estimation. Concurrency Comput. Pract. Exp. 37 (18-20), e70205. doi:10.1002/cpe.70205
Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020). "A theoretical analysis of deep Q-learning," in Learning for dynamics and control.
Fei, W., Xiaoping, Z., Zhou, Z., and Yang, T. (2024). Deep-reinforcement-learning-based UAV autonomous navigation and collision avoidance in unknown environments. Chin. J. Aeronautics 37 (3), 237–257.
Garcıa, J., and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16 (1), 1437–1480.
Greydanus, S., Koul, A., Dodge, J., and Fern, A. (2018). “Visualizing and understanding atari agents,” in International conference on machine learning.
Guo, W., Wu, X., Khan, U., and Xing, X. (2021). Edge: explaining deep reinforcement learning policies. Adv. Neural Information Processing Systems 34, 12222–12236.
Gupta, S., and Nair, S. (2023). A review of the emerging role of UAVs in construction site safety monitoring. Mater. Today Proc. doi:10.1016/j.matpr.2023.03.135
Hancock, P. A., Billings, D. R., Schaefer, K. E., Chen, J. Y., De Visser, E. J., and Parasuraman, R. (2011). A meta-analysis of factors affecting trust in human-robot interaction. Hum. Factors 53 (5), 517–527. doi:10.1177/0018720811417254
Hausknecht, M., and Stone, P. (2015). Deep recurrent Q-Learning for partially observable MDPs. doi:10.48550/arXiv.1507.06527
He, L., Nabil, A., and Song, B. (2020). Explainable deep reinforcement learning for UAV autonomous navigation. arXiv preprint arXiv:2009.14551.
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9 (8), 1735–1780. doi:10.1162/neco.1997.9.8.1735
Huber, T., Limmer, B., and André, E. (2022). Benchmarking perturbation-based saliency maps for explaining atari agents. Front. Artif. Intell. 5, 903875. doi:10.3389/frai.2022.903875
Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S. (2018). “Deep variational reinforcement learning for POMDPs,” in International conference on machine learning.
Jiang, S., Jiang, W., Huang, W., and Yang, L. (2017). UAV-based oblique photogrammetry for outdoor data acquisition and offsite visual inspection of transmission line. Remote Sens. 9 (3), 278. doi:10.3390/rs9030278
Juliani, A., Berges, V.-P., Teng, E., Cohen, A., Harper, J., Elion, C., et al. (2018). Unity: a general platform for intelligent agents. doi:10.48550/arXiv.1809.02627
Kim, H.-I., and Park, K.-D. (2025). Satellite positioning accuracy improvement in urban canyons through a new weight model utilizing GPS signal strength variability. Sensors 25 (15), 4678. doi:10.3390/s25154678
Lauri, M., Hsu, D., and Pajarinen, J. (2022). Partially observable markov decision processes in robotics: a survey. IEEE Trans. Robotics 39 (1), 21–40. doi:10.1109/tro.2022.3200138
Li, X., Serlin, Z., Yang, G., and Belta, C. (2019). A formal methods approach to interpretable reinforcement learning for robotic planning. Sci. Robotics 4 (37), eaay6276. doi:10.1126/scirobotics.aay6276
Liu, P., Sun, B., Wang, Y., and Tang, P. (2024). Bridge inspection strategy analysis through human-drone interaction games. Comput. Civ. Eng. 2023, 597–605. doi:10.1061/9780784485224.072
Ma, Z., Zhuang, Y., Weng, P., Li, D., Shao, K., Liu, W., et al. (2020). Interpretable reinforcement learning with neural symbolic logic.
Milani, S., Topin, N., Veloso, M., and Fang, F. (2024). Explainable reinforcement learning: a survey and comparative review. ACM Comput. Surv. 56 (7), 1–36. doi:10.1145/3616864
Nguyen, T. T., Nguyen, N. D., Vamplew, P., Nahavandi, S., Dazeley, R., and Lim, C. P. (2020). A multi-objective deep reinforcement learning framework. Eng. Appl. Artif. Intell. 96, 103915. doi:10.1016/j.engappai.2020.103915
Paradis, N., and Chapdelaine, B. (2025). Effects of urban canyons and electromagnetic interference on RPAS performance.
Parisotto, E., Song, F., Rae, J., Pascanu, R., Gulcehre, C., Jayakumar, S., et al. (2020). "Stabilizing transformers for reinforcement learning," in International conference on machine learning.
Potteiger, N., and Koutsoukos, X. (2023). Safe explainable agents for autonomous navigation using evolving behavior trees. 2023 IEEE international conference on Assured Autonomy (ICAA).
Puiutta, E., and Veith, E. M. (2020a). "Explainable reinforcement learning: a survey," in International cross-domain conference for machine learning and knowledge extraction.
Puiutta, E., and Veith, E. M. S. P. (2020b). Explainable reinforcement learning: a survey. Lecture Notes in Computer Science, 77–95. doi:10.1007/978-3-030-57321-8_5
Puri, N., Verma, S., Gupta, P., Kayastha, D., Deshmukh, S., Krishnamurthy, B., et al. (2019). Explain your move: understanding agent actions using specific and relevant feature attribution. arXiv Preprint arXiv:1912.12191.
Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. (2013). A survey of multi-objective sequential decision-making. J. Artif. Intell. Res. 48, 67–113. doi:10.1613/jair.3987
Rosero, J. C., and Dusparic, I. (2025). Explainable multi-objective Reinforcement Learning: challenges and considerations.
Schött, S. Y., Amin, R. M., and Butz, A. (2023). A literature survey of how to convey transparency in co-located human–robot interaction. Multimodal Technol. Interact. 7 (3), 25. doi:10.3390/mti7030025
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Senin, P. (2008). Dynamic time warping algorithm review. Honolulu, HI, USA: Information and Computer Science Department, University of Hawaii at Manoa.
Son, M., Lee, J.-I., Kim, J.-J., Park, S.-J., Kim, D., Kim, D.-Y., et al. (2022). Evaluation of the wind environment around multiple urban canyons using numerical modeling. Atmosphere 13 (5), 834. doi:10.3390/atmos13050834
Spaan, M. T. (2012). “Partially observable markov decision processes,” in Reinforcement learning: state-of-the-art (Springer), 387–414.
Van Moffaert, K., and Nowé, A. (2014). Multi-objective reinforcement learning using sets of pareto dominating policies. J. Mach. Learn. Res. 15 (1), 3483–3512.
Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. (2018). Programmatically interpretable reinforcement learning. International conference on machine learning.
Wu, J., Ye, Y., and Du, J. (2024a). Multi-objective reinforcement learning for autonomous drone navigation in urban area. Constr. Res. Congr. 2024, 707–716. doi:10.1061/9780784485262.072
Wu, J., Ye, Y., and Du, J. (2024b). Multi-objective reinforcement learning for autonomous drone navigation in urban areas with wind zones. Automation Constr. 158, 105253. doi:10.1016/j.autcon.2023.105253
Wu, J., Sun, B., You, H., and Du, J. (2025a). Enhancing Human-AI perceptual alignment through visual-haptic feedback system for autonomous drones. Int. J. Industrial Ergonomics 109, 103780. doi:10.1016/j.ergon.2025.103780
Wu, J., You, H., Sun, B., and Du, J. (2025b). LLM-driven pareto-optimal multi-mode reinforcement learning for adaptive UAV navigation in urban wind environments. IEEE Access. doi:10.1109/ACCESS.2025.3611336
Xia, Q., and Herrmann, J. M. (2025). Interpretability by design for efficient multi-objective reinforcement learning. arXiv preprint arXiv:2506.04022.
Xing, J., Nagata, T., Zou, X., Neftci, E., and Krichmar, J. L. (2023). Achieving efficient interpretability of reinforcement learning via policy distillation and selective input gradient regularization. Neural Netw. 161, 228–241. doi:10.1016/j.neunet.2023.01.025
Yoon, J., Arik, S., and Pfister, T. (2020). Data valuation using reinforcement learning. International Conference on Machine Learning.
Zajic, D., Fernando, H. J. S., Calhoun, R., Princevac, M., Brown, M. J., Pardyjak, E. R., et al. (2011). Flow and turbulence in an urban canyon. J. Appl. Meteorology Climatol. 50 (1), 203–223. doi:10.1175/2010JAMC2525.1
Zhang, G., and Hsu, L. T. (2021). Performance assessment of GNSS diffraction models in urban areas. NAVIGATION 68 (2), 369–389. doi:10.1002/navi.417
Zhao, J., Liu, H., Sun, J., Wu, K., Cai, Z., Ma, Y., et al. (2022). Deep reinforcement learning-based end-to-end control for UAV dynamic target tracking. Biomimetics 7 (4), 197. doi:10.3390/biomimetics7040197
Zheng, S., Zeng, K., Li, Z., Wang, Q., Xie, K., Liu, M., et al. (2024). Improving the prediction of GNSS satellite visibility in urban canyons based on a graph transformer. NAVIGATION J. Inst. Navigation 71 (4), navi.676. doi:10.33012/navi.676
Keywords: autonomous drones, explainable artificial intelligence, human-AI alignment, multi-objective reinforcement learning (MORL), urban operations
Citation: Sun B, You H, Wu J and Du J (2026) Interpretable group gated priority traces for scalarized multi-objective reinforcement learning in autonomous drone operations. Front. Built Environ. 12:1747709. doi: 10.3389/fbuil.2026.1747709
Received: 20 November 2025; Accepted: 09 January 2026;
Published: 04 February 2026.
Edited by:
Pengkun Liu, Hong Kong Center for Construction Robotics, China
Reviewed by:
Hassan Sarmadi, Ferdowsi University of Mashhad, Iran
Junren Luo, National University of Defense Technology, China
Rupayan Das, Institute of Engineering and Management (IEM), India
Copyright © 2026 Sun, You, Wu and Du. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jing Du, eric.du@essie.ufl.edu; Jiahao Wu