
ORIGINAL RESEARCH article

Front. Built Environ., 04 February 2026

Sec. Structural Sensing, Control and Asset Management

Volume 12 - 2026 | https://doi.org/10.3389/fbuil.2026.1747709

This article is part of the Research Topic "AI Foundation Models and Knowledge Engineering for Smart Construction and Infrastructure Systems".

Interpretable group gated priority traces for scalarized multi-objective reinforcement learning in autonomous drone operations

  • Informatics, Cobots and Intelligent Construction (ICIC) Lab, Engineering School of Sustainable Infrastructure & Environment, University of Florida, Gainesville, FL, United States

Autonomous drones are increasingly deployed for navigation, inspection, and monitoring in urban building and infrastructure environments that are dynamic, partially observable, and safety critical. These missions must balance conflicting objectives such as goal completion, wind avoidance, collision avoidance, signal coverage, and flight efficiency, making Multi-Objective Reinforcement Learning (MORL) an attractive control method. However, current explainability methods rarely examine how MORL policies prioritize different sensor channels during urban drone operations, leaving objective trade-offs and input priorities opaque to human operators. This paper introduces a lightweight group-gating architecture that augments MORL policies with an interpretable priority interface. The module aggregates raw observations into several meaningful categories (goal information, kinematics, wind, position, signal coverage, penalties, obstacle distance) and learns a gate vector that reweights these groups at every decision step. Integrated into a Proximal Policy Optimization (PPO) agent and evaluated in high-fidelity Unity simulations of urban operations with dynamic wind fields, the architecture preserves task performance while revealing stable priority patterns. Based on the results, three main findings emerge. First, the group-gating layer preserves asymptotic reward and value loss relative to ungated baselines. Second, gate dynamics exhibit dual-mode behavior, with a shared component that tracks global task difficulty and category-specific reallocations that differentiate wind and obstacle distance. Third, observation priorities align with environmental dynamics, with Dynamic Time Warping analysis showing 39% improved alignment for wind and 19% for obstacle distance when tracking changes rather than absolute levels. The resulting protocol provides a basis for real-time monitoring and for exploring adaptive sensor scheduling and early fault-detection heuristics in autonomous urban drone operations.

1 Introduction

While autonomous drones have been widely used for navigation, inspection, and monitoring in construction and infrastructure operations, deploying these drones effectively and safely in urban settings remains a significant challenge (Choi et al., 2023; Choi et al., 2024; Liu et al., 2024). The operating environment of urban settings is not only complex but also inherently dynamic and partially observable: cluttered skylines, unpredictable wind gusts, and moving equipment together create uncertainties and risks that complicate real-time decision-making (Son et al., 2022; Zajic et al., 2011). Therefore, the algorithms guiding these drones must be trained to handle complex multi-objective control problems, such as avoiding collisions while reaching targets, compensating for wind disturbances (Wu et al., 2024a), efficiently planning charging strategies to prevent power depletion (Das et al., 2022; Dash et al., 2025), coordinating with other drones to execute missions, and optimizing paths through wireless sensor networks (Das and Dash, 2023a; Das and Dash, 2023b). To tackle these challenges, researchers increasingly turn to Deep Learning frameworks like Multi-Objective Reinforcement Learning (MORL) (Roijers et al., 2013). MORL extends traditional RL to settings with multiple, often conflicting criteria. In practice, these settings are addressed either by approximating Pareto-efficient solution sets or by optimizing a scalarized utility under an explicit preference model (Van Moffaert and Nowé, 2014). In this study, we adopt the scalarized formulation and use preference-conditioned linear scalarization to train a single policy. This enables a policy to represent trade-offs, such as balancing mission success, energy efficiency, and collision risk, within a unified decision framework. Studies have shown that these algorithms can effectively control drones in complex, high-dimensional environments that adapt to varying operational priorities (Fei et al., 2024; Wu et al., 2024b).

While deep MORL frameworks have demonstrated strong performance in balancing competing objectives such as collision avoidance, energy conservation, and trajectory stability, their internal decision-making processes remain opaque. This opacity has direct operational implications. Without understanding why a drone selects one action over another in a complex situation, engineers and supervisors cannot perform meaningful post-hoc failure analysis or anticipate when the policy might behave unpredictably (Adadi and Berrada, 2018; Avacharmal, 2024). In safety-critical urban environments, such uncertainty undermines the foundation of human-autonomy collaboration. A human supervisor can only maintain calibrated trust if the drone’s behavior is legible and its priorities are predictable (Schött et al., 2023; Wu et al., 2025a). When the learned policy functions as a black box, trust either erodes, leading to overcautious interventions, or becomes misplaced, resulting in unsafe overreliance (Hancock et al., 2011). Explainable AI (XAI) methods offer potential remedies, particularly those that provide post-hoc state or action saliency to visualize which features and time steps influence decisions (Adebayo et al., 2018; Greydanus et al., 2018; Hausknecht and Stone, 2015). While a growing body of work explains reinforcement learning policies via state saliency or action attribution, existing methods rarely investigate how multi-objective policies in urban drone missions prioritize information across heterogeneous sensor channels (e.g., wind, obstacle distance, signal coverage) over time. This gap makes it difficult for practitioners to understand which environmental cues a MORL controller is relying on when balancing safety and mission performance (e.g., wind disturbance vs. obstacle avoidance), whether those priority signals are stable across episodes, and how well they correspond to actual environmental conditions. This constitutes the core knowledge gap addressed here: we lack practical, reproducible methods to obtain and validate interpretable priority-allocation traces that reveal how a deep RL policy distributes attention or priority across semantically grouped sensor channels over time in complex urban missions. Without such observation tools, the deployment of multi-criteria RL policies in safety-critical settings becomes harder to audit, harder to diagnose after failures, and more difficult to calibrate trust appropriately (Puiutta and Veith 2020a; Puiutta and Veith 2020b).

To address this gap, our goal is to investigate whether a lightweight group-gating layer can serve as a reliable and useful observation window into a MORL policy’s allocation strategy. We argue that a useful signal must (1) be stable and non-random, (2) exhibit partial decoupling across categories (allowing independent risk tracking), and (3) vary with external conditions in a way that matches the control problem, all while maintaining task performance. In practice, MORL settings are commonly handled either by approximating Pareto-efficient policy sets or by optimizing a scalarized utility under an explicit preference model. In this work, we adopt the scalarized formulation: we train a single PPO (Schulman et al., 2017) policy to maximize a scalar reward constructed via preference-conditioned linear scalarization of multiple reward components. We cast the task as a partially observable control problem and define several semantic observation groups (e.g., wind, obstacle distance, goal information). Experiments run in a high-fidelity Unity simulation (Juliani et al., 2018) of an urban environment, complete with dynamic wind fields (Wu et al., 2024b). This scalarized multi-criteria setup is used as a controlled platform to evaluate the proposed group-gated priority traces, and we do not claim Pareto-front coverage or Pareto-optimality guarantees. Our analysis compares the internal priority signals with external environment measurements, using Dynamic Time Warping (DTW) and Spearman correlation to quantify temporal alignment and monotonic relationships (Senin, 2008). The remainder of the paper reviews related work, details the environment and analysis methods, reports the full experimental results, and discusses the implications of these findings for future work in verifiable and transparent autonomous systems.

2 Related work

2.1 Autonomous drones in construction operations

Unmanned aerial systems (UAS) have evolved from experimental demonstrations to routine tools for surveying, progress monitoring, façade inspection, and post-event assessment on construction sites (Albeaino et al., 2019; Zhou and Gheisari, 2018). Operating in dense urban environments, however, presents significant challenges that directly affect autonomous control. Tall structures block or reflect satellite signals, causing Global Positioning System (GPS) occlusion and multipath interference. Wind gusts channeled by street canyons, temporary occlusions from cranes, and proximity to scaffolds create dynamic disturbances and intermittent loss of line-of-sight (Kim et al., 2025). Studies using Global Navigation Satellite System (GNSS) and Remotely Piloted Aircraft System (RPAS) data confirm that these “urban canyon” effects bias position estimates and reduce flight reliability (Paradis and Chapdelaine, 2025; Zhang and Hsu, 2021; Zheng et al., 2024). As a result, perception and control degrade precisely when accurate behavior is most needed, a persistent difficulty for drones operating in real construction contexts (Jiang et al., 2017; Zheng et al., 2024).

Recent work no longer treats drones purely as data collectors but as autonomous agents making decisions during missions. In construction robotics, this shift introduces human-centered requirements such as supervision, diagnostic transparency, and calibrated trust alongside tracking accuracy and efficiency (Agrawal and Cleland-Huang, 2021; Fei et al., 2024; Gupta and Nair, 2023). Safety frameworks originally developed for factory-based collaborative robots (Standardization, 2016) remain relevant for aerial systems operating near workers and equipment. Task-specific risk assessment, interpretable interaction, and bounded velocity or force contribute to safer field operations. Overall, these conditions reveal a broader issue: as autonomy increases, understanding how the control policy makes decisions becomes as critical as maintaining precise flight performance. This realization motivates research on learning mechanisms capable of operating safely under uncertainty, and on methods that make these mechanisms transparent to human supervisors.

2.2 Learning in a partially observable environment

Autonomous drones must often act with incomplete information. In cluttered and dynamic construction sites, sensors provide only partial observations of the true environmental state. Occlusions, limited field-of-view, and stochastic wind all introduce uncertainty. Such conditions are naturally modeled by the Partially Observable Markov Decision Process (POMDP) framework (Lauri et al., 2022; Spaan, 2012), where policies must make decisions under uncertainty about hidden states.

To manage partial observability, modern controllers integrate learning architectures that infer or remember latent variables. Recurrent value and policy learners, such as the Deep Recurrent Q-Network (DRQN), compress sequences of past observations into hidden states for improved temporal reasoning (Fan et al., 2020). Latent-state models like Data Valuation Reinforcement Learning (DVRL) further infer unobserved factors online (Yoon et al., 2020). Recent sequence models extend this capability using memory that aggregates information over longer horizons with enhanced stability (Hausknecht and Stone, 2015; Igl et al., 2018; Parisotto et al., 2020). These techniques are particularly valuable when key cues, such as wind variation or obstacle motion, arrive intermittently.

In practice, construction missions involve multiple, sometimes competing, objectives: mission efficiency, collision avoidance, energy economy, and safety. MORL explicitly represents such trade-offs by modeling returns as vectors rather than scalar rewards (Roijers et al., 2013; Wu et al., 2025b). Complementary Constrained MDP (CMDP) formulations incorporate explicit safety constraints during training and execution (Achiam et al., 2017; Chow et al., 2018), while Lyapunov-based updates provide theoretical guarantees of near-constraint satisfaction. Broader reviews in robotics highlight design patterns for reducing unsafe exploration, especially for aerial vehicles operating in shared workspaces with humans (Chow et al., 2019; Garcıa and Fernández, 2015). Depending on the formulation, MORL methods may target Pareto-efficient policy sets or optimize a scalarized utility under an explicit preference model. In this paper, we use preference-conditioned linear scalarization with a single PPO policy, and we leverage this setup to study interpretable group-gated priority traces rather than proposing a new MORL optimizer.

Case studies demonstrate progress through end-to-end and hybrid policies, disturbance-aware control, and energy-sensitive planning in turbulent environments (Banerjee and Bradner, 2024). Surveys of sensing and training configurations underline the importance of effective history use, explicit preference modeling, and safety-aware updates for successful field transfer (Chen et al., 2024; Zhao et al., 2022). Yet, as these learning-based controllers grow more capable, they also become opaque: it remains unclear which cues they rely on to infer hidden states or resolve competing objectives. Understanding this internal reasoning, particularly in uncertain, safety-critical environments, requires additional interpretive tools.

2.3 Interpretable methods in reinforcement learning

Interpretability in reinforcement learning (RL) has been studied under several complementary paradigms, including feature-importance analyses, process-level visualizations, and inherently interpretable policy representations (Puiutta and Veith 2020a; Puiutta and Veith 2020b). Recent surveys of explainable reinforcement learning (XRL) organize existing methods into feature-importance, learning-process, and policy-level categories, highlighting both the progress and the remaining gaps in making RL decisions transparent to human users (Milani et al., 2024). A large body of work focuses on explaining how specific observations influence individual actions. Saliency-map approaches visualize which parts of the input state most affect the chosen action in visual domains, for example, by perturbing pixels and measuring the impact on the action-value or policy output (Huber et al., 2022; Puri et al., 2019). Other methods reconstruct actions via surrogate models to attribute importance to input features, extending feature-attribution ideas from supervised learning to deep RL (Chen et al., 2020; Guo et al., 2021). These approaches provide fine-grained importance scores for individual state dimensions but typically remain local (per state-action pair) and are not designed to summarize stable prioritization patterns over semantically grouped sensor channels.

Other work pursues interpretability by design, replacing opaque neural policies with more structured or symbolic representations. Programmatically interpretable RL searches in a restricted space of human-readable policies, using neural networks only as oracles to guide the search (Verma et al., 2018). Formal-methods-based RL combines temporal-logic specifications with control barrier functions, producing policies whose safety and high-level behavior can be verified and explained (Li et al., 2019). More recent approaches use neural-symbolic logic or Shapley-based decompositions to derive stable, interpretable policies while preserving performance (Ma et al., 2020; Xing et al., 2023). In multi-objective settings, interpretability has been explored through Pareto-front structure or explicit regularization on interpretable preference representations (Rosero and Dusparic, 2025; Xia and Herrmann, 2025). These methods, however, primarily explain trade-offs in objective space or constrain policy form, rather than explicitly revealing how heterogeneous observations are prioritized during execution. In robotics and navigation, explainable RL has been applied to mobile robots and unmanned aerial vehicles, often using visual saliency or rule-based policy structures to provide human-understandable rationales for path planning and obstacle avoidance (He et al., 2020; Potteiger and Koutsoukos, 2023). Such work demonstrates the value of interpretable controllers in safety-critical domains, yet explanations usually target high-level behavior (e.g., why a particular trajectory was chosen) or localized state importance.

Existing methods are valuable for understanding local feature importance, verifying safety, or summarizing policy structure. However, there remains a lack of methods that systematically characterize how an RL agent allocates priority across grouped observation channels (e.g., wind, obstacle geometry, signal coverage) over time, especially in multi-objective, autonomous urban drone operations. Current approaches seldom connect input-group prioritization to evolving environmental dynamics in a way that is directly aligned with domain concepts in building and infrastructure missions. This gap motivates the group-gating approach proposed in this work, which aims to expose real-time priority allocation over sensor groups within a MORL-based autonomous drone controller.

Table 1 shows a concise comparison of the representative research in attention-based policies, mixture-of-experts, and post-hoc attribution. We emphasize that attention-based policies, mixture-of-experts routing, and post-hoc attribution have each made substantial contributions to interpretable and scalable RL. Our contribution is complementary and more narrowly scoped. We focus on producing group-gated priority traces, meaning a low-dimensional, time-resolved gate signal over predefined semantic observation groups that are recorded during execution and can be used for auditing and event association in urban drone operations.

Table 1. Concise comparison.

3 System design

3.1 Simulation environment settings

To investigate how a policy learns to prioritize competing signals (such as wind versus obstacles), we developed a simulator in the Unity engine with the ML-Agents package (Juliani et al., 2018), which serves as the testbed for training our multi-objective PPO policy with a group-gating layer and features a detailed model of a DJI Mavic 2 Pro drone. The core of the simulator is a custom-built landscape representing a 3,000-foot square area of Manhattan. While realistic textures from the Unity Asset Store were used for visual fidelity, we manually constructed the primary building meshes (as shown in Figure 1). This approach was crucial, as it allowed us to create simplified colliders (bounding boxes) optimized for our physics-based wind simulation (Wu et al., 2024b). The street-level environment was intentionally simplified to isolate the core navigation challenge: it includes streets, sidewalks, and streetlights but omits dynamic objects such as vehicles and pedestrians. Crucially, only the building meshes were equipped with colliders, defining them as the only physical obstacles in the flight path. The entire scene is illuminated by a fixed directional light set at a 45-degree angle to simulate consistent lighting conditions. Targets were placed at the map’s corners to create a complex flight route. A 200-foot altitude limit was imposed during training, forcing the drone to navigate through this custom-designed urban terrain.

Figure 1. Simulator overview.

To complement the static landscape, we developed a dynamic wind simulation to create a robust and realistic training environment. The primary challenge was generating authentic aerodynamic scenarios without the prohibitive computational cost of traditional Computational Fluid Dynamics (CFD), which is unsuitable for the thousands of iterations required by Deep Reinforcement Learning (DRL) (Abichandani et al., 2020). Our solution utilizes a custom-built Convolutional Autoencoder, a representation model that is highly efficient at approximating complex airflow. While the detailed technical methodology for this model is discussed in our previous publications, the process begins by defining an initial, global wind condition (speed and direction) across the entire landscape. At predefined intervals, we systematically alter this global wind state.

3.2 Network design and training

3.2.1 Overview of the structure

Figure 2 illustrates the proposed policy network for autonomous urban drone flight. At each decision step t, the agent receives a 16-dimensional observation vector otR16. The network is organized into two main modules. For Category encoders with group gating, the 16 raw features are partitioned into seven groups (goal information, kinematics, wind, position, signal, penalty, and distance). Each group is encoded by a small multilayer perceptron (MLP) into a higher-dimensional embedding. A lightweight gating MLP then outputs one scalar weight per group, which rescales the corresponding embedding multiplicatively. This yields an interpretable representation in which each contiguous block corresponds to a specific observation category and is modulated by a single gate. For weighted fusion and recurrent control head, the concatenated, reweighted embeddings are fused with a parallel encoding of the raw 16-D input, passed through a stack of fully connected layers, and finally through a Long Short Term Memory (LSTM) block (Hochreiter and Schmidhuber, 1997) that captures temporal dependencies under partial observability before producing the action output.

Figure 2. Network architecture.

3.2.2 Category encoders

We first group the 16 features into seven semantically interpretable subsets. Let Equation 1:

$o_t^{(k)} \in \mathbb{R}^{d_k}, \quad k = 1, \dots, 7,$ (1)

denote the features of the group k at time t. In our implementation (As shown in Equation 2):

$(d_1, \dots, d_7) = (3, 3, 3, 3, 1, 2, 1), \quad \sum_{k=1}^{7} d_k = 16,$ (2)

corresponding to goal, kinematics, wind, position, signal, penalty, and distance.

Each group is passed through a dedicated encoder $f_{\phi_k}$ (a small MLP) to obtain an embedding (As shown in Equation 3):

$z_t^{(k)} = f_{\phi_k}\big(o_t^{(k)}\big) \in \mathbb{R}^{m_k},$ (3)

where $m_k \in \{128, 64\}$. The four 3-D groups (goal, kinematics, wind, position) use 128 units to capture richer nonlinear structure, and the three smaller groups use 64 units to avoid unnecessary parameters, giving the total embedding dimension

$\sum_{k=1}^{7} m_k = 4 \times 128 + 3 \times 64 = 704.$ (4)

The seven embeddings are concatenated along the feature dimension as Equation 5:

$Z_t = \big[z_t^{(1)}; z_t^{(2)}; \dots; z_t^{(7)}\big] \in \mathbb{R}^{704}.$ (5)

To decide how strongly each group should influence the policy at time $t$, we feed both the encoded features and the original observation into a gating MLP as Equation 6. The gating input is

$s_t^{\text{gate}} = [Z_t; o_t] \in \mathbb{R}^{704+16} = \mathbb{R}^{720}.$ (6)

The gating network first maps this 720-D vector to a 64-D hidden state and then to 7 logits as Equations 7, 8:

$h_t^{\text{gate}} = \rho\big(W_{g1} s_t^{\text{gate}} + b_{g1}\big) \in \mathbb{R}^{64},$ (7)
$a_t = W_{g2} h_t^{\text{gate}} + b_{g2} \in \mathbb{R}^{7},$ (8)

where $\rho(\cdot)$ is a ReLU nonlinearity, $W_{g1} \in \mathbb{R}^{64 \times 720}$, $b_{g1} \in \mathbb{R}^{64}$, $W_{g2} \in \mathbb{R}^{7 \times 64}$, and $b_{g2} \in \mathbb{R}^{7}$.

We then apply a sigmoid element-wise to obtain non-negative, independently scaled gates as Equation 9:

$g_t = \sigma(a_t) \in \mathbb{R}^{7}, \quad \text{with } 0 < g_{t,k} < 1 \text{ for } k = 1, \dots, 7.$ (9)

In the proposed network, the gate outputs $g_{t,k}$ are used as independent scaling factors during both training and inference. No sum-to-one constraint is enforced, which allows multiple categories to be emphasized simultaneously (e.g., wind and distance to obstacles during a gust) and avoids introducing artificial inter-group coupling. For comparison, we also consider a constrained gating baseline where a sum-to-one normalization is enforced during both training and inference. We define normalized gates as Equation 10:

$\hat{g}_{t,k} = \dfrac{g_{t,k}}{\sum_{j=1}^{7} g_{t,j} + \varepsilon}, \quad k = 1, \dots, 7,$ (10)

where $\varepsilon$ is a small constant for numerical stability, so that $\sum_{k=1}^{7} \hat{g}_{t,k} \approx 1$ at each time step. To unify the two variants, we denote the gating weights applied to each group embedding as Equation 11:

$\beta_{t,k} = \begin{cases} g_{t,k}, & \text{unconstrained gating (main method)} \\ \hat{g}_{t,k}, & \text{constrained gating (baseline)} \end{cases} \quad k = 1, \dots, 7$ (11)

Each gating weight rescales its corresponding group embedding via element-wise multiplication as Equation 12 (broadcast across the embedding dimension):

$\tilde{z}_t^{(k)} = \beta_{t,k}\, z_t^{(k)} \quad \text{for } k = 1, \dots, 7.$ (12)

The reweighted representation is obtained by concatenating these gated embeddings as Equation 13:

$\tilde{Z}_t = \big[\tilde{z}_t^{(1)}; \tilde{z}_t^{(2)}; \dots; \tilde{z}_t^{(7)}\big] \in \mathbb{R}^{704}.$ (13)

The gated representation $\tilde{Z}_t$ is then passed to the subsequent weighted fusion module and the recurrent policy network (LSTM), so gating is applied before feature fusion and temporal aggregation. For interpretability, we also report normalized gate “shares” as Equation 14:

$p_{t,k} = \dfrac{g_{t,k}}{\sum_{j=1}^{7} g_{t,j} + \varepsilon}, \quad k = 1, \dots, 7.$ (14)

These shares can be visualized as the fraction of total gating mass assigned to each category at time $t$. Importantly, Equation 14 is used only for analysis and visualization of the unconstrained model’s logged $g_t$ and does not affect the policy forward pass. For the constrained baseline, the gating weights used in the forward pass are already $\hat{g}_t$ and therefore sum to one by construction.

The seven-group partition is a design choice made for semantic auditability and controlled interpretability analysis. When higher-dimensional sensing modalities are introduced (e.g., LiDAR, thermal imagery, dense point clouds), the same principle can be retained through hierarchical grouping, where modality-level encoders form coarse groups and sub-groups are defined within each modality based on task semantics. In addition, the grouping itself can be made learnable by introducing structured group assignment (e.g., sparse or clustered feature-to-group mapping) while maintaining an interpretable group interface. We leave these scalable regrouping strategies as future work, and in this study we focus on a fixed, semantically grounded grouping to ensure that the logged priority traces correspond to domain-meaningful sensor categories.

During inference, we log the raw gate vector $g_t$ (and, when needed, the constrained normalized gates $\hat{g}_t$ for the baseline). Unless otherwise stated, all priority-trace figures and interpretability analyses reported in this paper are based on the unconstrained gating model.
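To make the gating computation concrete, the following is a minimal PyTorch-style sketch of Equations 1–14; class and variable names (e.g., GroupGatedEncoder) are illustrative rather than our exact training code, and the group and embedding sizes follow Equations 2 and 4.

```python
# Minimal PyTorch-style sketch of the category encoders and group gating (Eqs. 1-14).
# Names are illustrative, not the exact training code.
import torch
import torch.nn as nn

GROUP_DIMS = [3, 3, 3, 3, 1, 2, 1]             # goal, kinematics, wind, position, signal, penalty, distance
EMBED_DIMS = [128, 128, 128, 128, 64, 64, 64]  # 4 x 128 + 3 x 64 = 704

class GroupGatedEncoder(nn.Module):
    def __init__(self, constrained=False, eps=1e-8):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, m), nn.ReLU())
            for d, m in zip(GROUP_DIMS, EMBED_DIMS)
        )
        self.gate = nn.Sequential(                                  # Eqs. 7-9
            nn.Linear(sum(EMBED_DIMS) + sum(GROUP_DIMS), 64), nn.ReLU(),
            nn.Linear(64, len(GROUP_DIMS)), nn.Sigmoid(),
        )
        self.constrained = constrained
        self.eps = eps

    def forward(self, obs):                                         # obs: (batch, 16)
        groups = torch.split(obs, GROUP_DIMS, dim=-1)               # Eq. 1
        z = [enc(g) for enc, g in zip(self.encoders, groups)]       # Eq. 3
        Z = torch.cat(z, dim=-1)                                    # Eq. 5
        g = self.gate(torch.cat([Z, obs], dim=-1))                  # Eq. 9, values in (0, 1)
        shares = g / (g.sum(dim=-1, keepdim=True) + self.eps)       # Eqs. 10/14
        beta = shares if self.constrained else g                    # Eq. 11
        gated = [beta[..., k:k + 1] * z[k] for k in range(len(z))]  # Eq. 12
        return torch.cat(gated, dim=-1), g, shares                  # Eq. 13 plus logged gates
```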

3.2.3 Multilayer perceptron fusion

The gated representation $\tilde{Z}_t$ is then fused with a parallel encoding of the raw observation. The motivation is to preserve a direct path from the original measurements to the control head while still allowing the policy to exploit the structured categorical representation. First, $\tilde{Z}_t$ is passed through two fully connected layers as Equations 15, 16:

$h_t^{(1)} = \rho\big(W_{f1} \tilde{Z}_t + b_{f1}\big) \in \mathbb{R}^{512},$ (15)
$h_t^{(2)} = \rho\big(W_{f2} h_t^{(1)} + b_{f2}\big) \in \mathbb{R}^{512},$ (16)

with $W_{f1} \in \mathbb{R}^{512 \times 704}$ and $W_{f2} \in \mathbb{R}^{512 \times 512}$. In parallel, the raw observation $o_t$ is encoded by a separate MLP as Equation 17:

$r_t = \rho\big(W_r o_t + b_r\big) \in \mathbb{R}^{512},$ (17)

where $W_r \in \mathbb{R}^{512 \times 16}$. The two 512-D vectors are concatenated and compressed back to 512 dimensions as Equations 18, 19:

$c_t = \big[h_t^{(2)}; r_t\big] \in \mathbb{R}^{1024},$ (18)
$x_t = \rho\big(W_c c_t + b_c\big) \in \mathbb{R}^{512},$ (19)

with $W_c \in \mathbb{R}^{512 \times 1024}$. The resulting 512-D feature $x_t$ is then processed by a stack of $N = 8$ additional fully connected layers of size $512 \rightarrow 512$, providing sufficient capacity for complex control policies while keeping the latent width fixed as Equation 20:

$x_t^{(i)} = \rho\big(W_i x_t^{(i-1)} + b_i\big), \quad i = 1, \dots, 8,$ (20)

where each $W_i \in \mathbb{R}^{512 \times 512}$ and $b_i \in \mathbb{R}^{512}$. We denote the output of this stack by Equation 21:

$\bar{x}_t = x_t^{(8)} \in \mathbb{R}^{512}.$ (21)

To account for partial observability and temporal dependencies (e.g., wind changes, motion history), we feed $\bar{x}_t$ into an LSTM as Equation 22:

$\big(h_t^{\text{LSTM}}, s_t\big) = \mathrm{LSTM}\big(\bar{x}_t, s_{t-1}\big),$ (22)

where $s_t$ is the recurrent hidden state (cell and hidden vectors). Finally, separate MLP heads map the LSTM output to the policy and value function used by PPO as Equation 23:

$\pi_\theta\big(a_t \mid o_{1:t}\big) = \mathrm{MLP}_\pi\big(h_t^{\text{LSTM}}\big), \qquad V_\theta\big(o_{1:t}\big) = \mathrm{MLP}_V\big(h_t^{\text{LSTM}}\big).$ (23)

Overall, this architecture encodes heterogeneous observations into interpretable group embeddings, modulates them with learnable gates that quantify per-category importance, and fuses the gated representation with raw observations through a deep, recurrent control head. The gating variables $g_{t,k}$ and their normalized shares $p_{t,k}$ provide a principled way to analyze which observation groups the policy prioritizes, without sacrificing control performance. Table 2 shows the parameters and settings used during agent training.
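For completeness, a compact sketch of the fusion trunk and recurrent head in Equations 15–23 is given below, in the same illustrative PyTorch style; the discrete-action head is a placeholder (e.g., three 3-way branches flattened to 9 logits) rather than the exact Unity ML-Agents output specification.

```python
# Compact sketch of the weighted-fusion trunk and recurrent control head (Eqs. 15-23).
# Layer sizes follow the text; the action head is a placeholder, not the exact ML-Agents setup.
import torch
import torch.nn as nn

class FusionRecurrentHead(nn.Module):
    def __init__(self, gated_dim=704, obs_dim=16, hidden=512, n_extra=8, n_action_logits=9):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(gated_dim, hidden), nn.ReLU(),       # Eqs. 15-16
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.raw_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())      # Eq. 17
        self.compress = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())  # Eqs. 18-19
        layers = []
        for _ in range(n_extra):                                                 # Eq. 20: 8 x (512 -> 512)
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)                    # Eq. 22
        self.policy_head = nn.Linear(hidden, n_action_logits)                    # Eq. 23 (policy logits)
        self.value_head = nn.Linear(hidden, 1)                                   # Eq. 23 (value)

    def forward(self, z_gated, obs, state=None):
        h = self.fuse(z_gated)
        r = self.raw_enc(obs)
        x = self.compress(torch.cat([h, r], dim=-1))
        x = self.trunk(x)
        out, state = self.lstm(x.unsqueeze(1), state)   # one decision step at a time
        out = out.squeeze(1)
        return self.policy_head(out), self.value_head(out), state
```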

Table 2. Parameters and settings for the network.

3.3 Reward function and policy

The autonomous drone agent is the core intelligence of our system, designed to enable complex navigation through the challenging urban environment, trained using a MORL framework (Nguyen et al., 2020). This approach is essential for enabling the agent to balance multiple, often conflicting, flight objectives such as minimizing travel time, ensuring complete obstacle avoidance, and mitigating the effects of dynamic wind. We model the task as a Partially Observable Markov Decision Process. For the training algorithm, we employed PPO (Schulman et al., 2017) due to its recognized stability, efficiency, and robustness in complex control tasks. It trains the agent by optimizing a policy, denoted as $\pi_\theta(a_t \mid s_t)$, which defines the probability of taking an action $a$ in a given state $s$. The total reward at time step $t$ is defined as Equation 24:

$R_t = R_t^{\text{distance}} + R_t^{\text{time}} + R_t^{\text{coverage}} + R^{\text{success}} + R^{\text{collision}} + R^{\text{detour}},$ (24)

the agent receives a shaped reward based on distance improvement at each step as Equation 25:

$R_t^{\text{distance}} = \dfrac{d_{t-1} - d_t}{d_0} \times 100,$ (25)

where $d_t = \lVert p_t^{\text{agent}} - p^{\text{target}} \rVert$ is the Euclidean distance to the target at time $t$, $d_0 = \lVert p_0^{\text{agent}} - p^{\text{target}} \rVert$ is the initial distance (normalization factor), $d_{t-1} - d_t > 0$ when moving closer to the target (positive reward), and $d_{t-1} - d_t < 0$ when moving away from the target (negative reward). A small negative reward at each step encourages faster task completion as Equation 26:

$R_t^{\text{time}} = -0.01 \times \omega_0,$ (26)

where $\omega_0 \in [0, 1]$ is the goal-reaching weight (randomized per episode) to prevent the agent from taking unnecessarily long paths. When the agent is not in a coverage zone, it receives a penalty as Equation 27:

$R_t^{\text{coverage}} = \begin{cases} 0, & \text{if } C_t = 1 \\ -0.01 \times \omega_1, & \text{if } C_t = 0, \end{cases}$ (27)

where $C_t$ is the binary coverage indicator at time $t$ and $\omega_1 = 1 - \omega_0 \in [0, 1]$ is the coverage weight. The coverage indicator is defined as Equation 28:

$C_t = \begin{cases} 1, & \text{if } \lVert p_t^{\text{agent}} - c_i \rVert < r_{\text{cover}} \\ 0, & \text{otherwise,} \end{cases}$ (28)

where $c_i$ is the position of the center of the signal coverage area and $r_{\text{cover}} = 250$ meters is the coverage radius. When the agent reaches and stays at the target for $\tau_{\text{hold}} = 20$ frames, it receives a reward as Equation 29:

$R^{\text{success}} = +10,$ (29)

and the episode terminates with success. When the agent collides with obstacles or boundaries, it receives a penalty as Equation 30:

$R^{\text{collision}} = -10,$ (30)

and the episode terminates with failure. $R^{\text{detour}}$ encourages the drone to find alternative routes that avoid the stronger wind zones. In each episode, the drone receives a positive reward if it encounters a strong-wind zone and avoids it, and a negative reward for each step it spends inside a strong-wind zone, which encourages it to minimize time in those areas and to route around them. As shown in Equation 31, the penalty is linear in the distance to the wind zone: the closer the drone is, the larger the penalty, pushing it away from strong-wind regions.

$R^{\text{detour}} = \begin{cases} 0.1 \times (\mathrm{dis}_{\text{wind}} - 20) & \text{per step, if in a strong-wind zone} \\ +10 & \text{if the strong-wind zone is avoided} \end{cases}$ (31)

To enable a single policy to handle diverse mission requirements, the objective weights are randomized at episode initialization as Equation 32:

$\omega_0 \sim U(0, 1), \qquad \omega_1 = 1 - \omega_0,$ (32)

where $U(0, 1)$ denotes the uniform distribution over $[0, 1]$. These weights control the trade-off between speed priority ($\omega_0$ high: reach the target quickly, less emphasis on coverage) and coverage priority ($\omega_1$ high: maintain signal coverage, less emphasis on speed). The weights are included in the observation space, enabling task-conditioned policy learning. As shown in Equation 33, the observation vector at time $t$ is:

$o_t = \big[r_t, v_t, w_t, p_t, C_t, \omega_0, \omega_1, d_{\min}\big]^{T} \in \mathbb{R}^{16},$ (33)

where $r_t \in \mathbb{R}^3$ is the relative position to the target (normalized), $v_t \in \mathbb{R}^3$ is the agent velocity (normalized), $w_t \in \mathbb{R}^3$ is the wind force vector, $p_t \in \mathbb{R}^3$ is the agent's absolute position (normalized), $C_t \in \{0, 1\}$ is the binary coverage indicator, $\omega_0, \omega_1 \in [0, 1]$ are the objective weights, and $d_{\min} \in \mathbb{R}$ is the distance to the closest obstacle (normalized). The agent outputs discrete actions for 3D movement as Equation 34:

$a_t = \big[a_x, a_y, a_z\big]^{T}, \quad a_i \in \{-1, 0, 1\},$ (34)

where each component controls movement along one axis: −1 means the negative direction, 0 means no movement, and +1 means the positive direction. The action is converted to velocity as Equation 35:

$v_t^{\text{intended}} = a_t \times v_{\max},$ (35)

the actual velocity includes wind effect as Equation 36:

$v_t^{\text{actual}} = v_t^{\text{intended}} + w_t,$ (36)

where $v_{\max} = 15$ m/s is the maximum speed and $w_t$ is the wind force computed from the CFD-based model. We use PPO with the standard clipped surrogate objective and generalized advantage estimation (Schulman et al., 2017), as implemented in Unity ML-Agents. We do not modify the optimizer. The algorithm’s objective is to maximize the expected cumulative reward, which is formally defined as Equation 37:

$J(\theta) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} R_t\Big],$ (37)

where $\gamma$ is the discount factor that balances immediate and future rewards, $R_t$ is the reward received at time step $t$, and $\theta$ represents the parameters of the neural network policy that are adjusted during training.
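The per-step scalarized reward can be summarized by the following hedged Python sketch; the helper flags (in_coverage, reached_goal, collided, etc.) are illustrative names, and the detour term reflects our reading of Equation 31 rather than verified training code.

```python
# Hedged Python sketch of the scalarized per-step reward (Equations 24-32).
import numpy as np

def step_reward(d_prev, d_now, d_init, in_coverage, w0, w1,
                reached_goal, collided, in_strong_wind,
                dist_to_wind_zone, avoided_wind_zone):
    r = (d_prev - d_now) / d_init * 100.0      # Eq. 25: shaped progress toward the goal
    r += -0.01 * w0                            # Eq. 26: per-step time penalty
    if not in_coverage:
        r += -0.01 * w1                        # Eq. 27: coverage penalty outside the 250 m radius
    if reached_goal:                           # Eq. 29: target held for 20 frames
        r += 10.0
    if collided:                               # Eq. 30: collision with a building or boundary
        r += -10.0
    if in_strong_wind:                         # Eq. 31: larger penalty the closer to the zone (assumed form)
        r += 0.1 * (dist_to_wind_zone - 20.0)
    if avoided_wind_zone:                      # Eq. 31: bonus for routing around a strong-wind zone
        r += 10.0
    return r

# Episode-level preference weights (Eq. 32), randomized per episode
w0 = np.random.uniform(0.0, 1.0)
w1 = 1.0 - w0
```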

4 Results

4.1 Network performance

We use the task success rate over the entire training process as the criterion for model convergence. During training, each episode is counted when the drone either reaches the goal, crashes, or times out, with a maximum episode length of 15,000 steps. When the task success rate (the number of successful episodes divided by the total number of episodes so far) remains stably above 95%, we regard the policy as converged. Under this criterion, the original network without group gating converged after about 7,000 episodes, the network with constrained group gating converged after roughly 9,000 episodes, and the network with unconstrained group gating converged after about 12,000 episodes.
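A minimal illustration of this convergence check, assuming the success rate is computed cumulatively over all episodes seen so far:

```python
# Minimal illustration of the convergence criterion (cumulative success rate above 95%).
def converged(episode_successes, threshold=0.95):
    """episode_successes: list of booleans, True if the episode reached the goal."""
    if not episode_successes:
        return False
    return sum(episode_successes) / len(episode_successes) > threshold
```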

The reward curves (Figure 3) show that adding a group-gated layer does not harm performance. All three policies converge toward the same reward band and remain stable once training passes the long plateau near the end of the run. The policy with unconstrained gating converges more slowly and exhibits larger variance early in training, then closes the gap. The policy with normalized gating (constrained to sum to one) enforces competition among observation categories, which can reduce variance early but limits representational flexibility. In contrast, the policy with unconstrained gating treats priorities independently and permits multiple observation categories to be emphasized simultaneously, which increases flexibility but can lead to slower calibration when gates saturate near boundary values.

Figure 3. Cumulative reward during training process for different networks.

The value-loss curves (Figure 4) corroborate this interpretation. Loss decreases over time for all three policies and stabilizes at a low level by the end of training. Independent gating remains in the higher-loss regime longer before reaching a loss floor comparable to the baseline and normalized gating variants. This combination of comparable final reward (Figure 3) and comparable final loss indicates that unconstrained priority assignment does not degrade the quality of the learned value function or policy at convergence; the cost appears primarily in training efficiency rather than asymptotic capability. For autonomous navigation applications, the practical implication is that group-gated priorities provide an interpretable mechanism for understanding which observation categories the agent prioritizes in different contexts, without sacrificing task performance. In our simulation study, this interpretability benefit incurs a modest training-time overhead. We view this trade-off as acceptable for applications requiring explainability, such as human-robot collaboration in Urban Search and Rescue scenarios, where priority traces could be surfaced to operators as decision-support signals after further validation in operational settings.

Figure 4. Value loss during training process for different networks.

The maps overlay repeated inference rollouts using the exported well-trained model. In Figure 5, white polygons are buildings, and the color field encodes the ambient wind pattern. Each magenta trace is one flight from start to goal. Across both sectors the agent reaches the goal consistently and the trajectories cluster into a narrow corridor, which indicates a stable policy under repeated trials. Most variability appears where streets intersect or where the wind gradient is steep, and the deviations are short and self-correcting rather than failure modes. Route choice is consistent with a strategy that limits crosswind exposure by shadowing building edges and committing to straight street segments once a safe corridor is identified. In other words, the trained controller completes the task reliably and shows low run-to-run spread, which matches the reward and loss results reported earlier.

Figure 5. Trajectories of drone agent during inference phase.

4.2 Grouped priority analysis

Figure 6 tracks the percentage of grouped priority values during inference as the drone completes the task. Goal info is the normalized relative offset to the target. Kinematics is the drone’s velocity vector. Wind is the wind vector in the environment. Position is the agent location in map coordinates. Signal is a binary indicator that equals 1 when the drone is within the radius of either tower, and we also track its running average. Penalty is the pair (time penalty, signal coverage penalty) sampled each episode; these weights are used to scalarize objectives and are included to make preferences explicit, favoring either short completion time or high signal coverage. Distance is the closest obstacle distance from raycasts, with ML-Agents ray-sensor outputs also available through the attached component. Each category remains within a stable range, which indicates that the trained policy has settled on a consistent allocation strategy rather than oscillating across inputs. Goal information sits at the top band and shows short bursts at wayfinding moments. Wind priority stays elevated and varies smoothly with the background field, which matches the path choices that hugged building edges in the trajectory maps. Position and distance remain in a mid range and rise when the agent approaches corridor transitions. Kinematics stays lower and smoother, which is natural once the velocity profile is regulated by the actor. The penalty channel is mostly quiet and spikes briefly near tight clearances or sharp heading corrections. The share of total weight lies in a narrow band of about 12%–17%, and the ranking matches the left panel. This near-conserved resource budget implies that the policy reallocates relative weights across channels rather than changing the total amount of priority.

Figure 6. Percentage of priority for each category.

Figure 7 shows uplift over the 0.5 baseline. Each category remains within a stable band over time and exhibits small, synchronous reallocation at key moments. Goal information maintains the highest uplift, often around 30%–45% later in the run, which indicates a strong goal drive throughout the trajectory. Position and wind form a second tier and show local increases during corridor transitions or when the wind gradient strengthens, suggesting that geometry and wind disturbance jointly shape route commitment and fine adjustments. Distance and signal stay at moderate levels and align with periods of turning or narrowing passages, reflecting local feasibility checks. Kinematics stays lowest across the run and its fluctuations gradually contract, which suggests that once the policy stabilizes, velocity and acceleration do not require persistent high priority and are instead maintained by the learned action regulation.

Figure 7. Priority uplift of each category relative to the 0.5 baseline.

Normalized priority in Figure 6 shows relatively balanced allocation across categories (range: 12.45%–15.44%). To assess temporal consistency, we compute the coefficient of variation (CV = std/mean × 100%), which quantifies relative variability. Different groups exhibit different stability levels: wind, distance, and signal demonstrate highest consistency (CV < 2%, std <0.25%), while goal_info shows comparatively greater variability (CV = 3.4%, std = 0.52%). Examining priority uplifts relative to the baseline in Figure 7 further reveals temporal dynamics. Goal_info exhibits the largest uplift variability (std = 6.9%, range = 22.0%), while penalty and position maintain more stable uplifts (std = 1.0% and 1.7%, respectively). Kinematics shows high relative uplift variability (CV = 51.4%) despite low mean priority, suggesting selective activation in specific contexts. Overall, although priority distributions remain within a relatively stable range, observable fluctuations exist that reflect task-dependent adaptation.
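The coefficient of variation reported above can be reproduced from the logged shares with a few lines of Python; the array layout (time steps by seven categories) is an assumption about the logging format.

```python
# Coefficient of variation (CV = std / mean x 100%) per category from logged priority shares.
import numpy as np

def coefficient_of_variation(shares):
    shares = np.asarray(shares, dtype=float)   # assumed shape (T, 7): p_{t,k} over time
    return shares.std(axis=0) / shares.mean(axis=0) * 100.0
```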

Then we evaluate whether grouped gate priorities operate independently or co-vary, as shown in the correlation matrix (Figure 8). Using the unconstrained gates (not the normalized shares), we compute the Spearman correlation matrix across time and summarize the mean absolute off-diagonal correlation (MAC) and the decoupling index DI = 1 − MAC. As shown in the analysis, the gates are partially decoupled rather than fully independent: MAC = 0.460 and DI = 0.540. The first principal component explains 0.532 of the variance, indicating a substantial co-varying mode. Consistent with this, co-activation independence ratios (CAIR) cluster around 0.46 for some pairs but vary for others. Overall, the evidence supports partial decoupling: groups can rise or fall without strict conservation, yet a shared mode still accounts for a significant portion of variability. This reveals two concurrent effects. There is common-mode modulation, where several gates increase or decrease together when the scene changes or the task becomes more difficult, evidenced by the moderate PC1 ratio and positive correlations. There is also category-specific reallocation, where relative weights shift among channels even when the overall level remains similar, reflected by the moderate DI and the heterogeneous off-diagonal structure. The pairwise pattern varies: wind aligns strongly with distance (ρ = 0.947) and goal information (ρ = 0.846), kinematics aligns most strongly with signal (ρ = 0.769), position shows weak coupling to goal information (ρ = 0.119) and wind (ρ = 0.142), while penalty exhibits varied coupling patterns to different channels. In subsequent correlation and DTW analyses, we consider both effects by examining raw gates and, when helpful, residualized gates after removing the common mode so that per-category alignment with observations reflects category-specific variation rather than global shifts.
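A sketch of how these decoupling statistics can be computed from the logged unconstrained gates is shown below; MAC, DI, and the PC1 variance ratio follow the definitions above, while the exact CAIR computation is omitted.

```python
# Sketch of the gate-decoupling statistics from the logged unconstrained gates g_t
# (assumed array shape: T time steps x 7 categories).
import numpy as np
from scipy.stats import spearmanr

def decoupling_metrics(gates):
    gates = np.asarray(gates, dtype=float)
    rho, _ = spearmanr(gates)                                  # (7, 7) Spearman matrix over time
    off_diag = np.abs(rho[~np.eye(rho.shape[0], dtype=bool)])
    mac = off_diag.mean()                                      # mean absolute off-diagonal correlation
    di = 1.0 - mac                                             # decoupling index
    eigvals = np.linalg.eigvalsh(np.cov(gates, rowvar=False))
    pc1_ratio = eigvals[-1] / eigvals.sum()                    # variance share of the first PC
    return mac, di, pc1_ratio
```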

Figure 8. Pairwise Spearman correlation matrices of the gate signals: raw gates (top) and residualized gates after removing the common mode (bottom).

Figures 9, 10 are included only as examples to help the reader see what our alignment tools reveal at the level of a single rollout. For these two plots we can describe the following patterns without claiming that they generalize. Wind priority and the wind observation tend to move together in short episodes with small leads or lags. Pointwise correlation looks weak because the priority often shifts a little before or after the local change in wind. DTW tolerates those small timing offsets and recovers a clear episodic match, which is why the alignment path stays near the diagonal and bends only at transition points. Goal info priority presents a different picture in this example. Over long-range intervals, its trend diverges from the distance signal, resulting in a negative correlation on the scatter plot. This indicates that the agent lowers the priority assigned to relative position as it moves farther from the target. DTW requires significant deformation at both ends of the curve to achieve alignment, resulting in flat or steep segments along the alignment path. In short, in both scenarios, wind acts as a rapid environmental driver that can capture priority almost instantly, whereas distance serves as a gradual process indicator that matters mainly when the drone is temporarily moving away from or approaching a target. We do not infer global strategies from just two examples. To test whether similar patterns emerge across categories and runs, we repeated the same process for all seven groups, collecting ten inference trajectories under identical conditions. For each trajectory and category, we calculated normalized DTW distances and computed Spearman correlation coefficients at matching time indices. Trajectories were then aggregated to obtain average DTW distances and average correlation coefficients.
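The aggregation procedure can be sketched as follows; a basic quadratic-time DTW with path-length normalization is shown for clarity, and the exact normalization used for the reported figures may differ in detail.

```python
# Per-trajectory alignment analysis: normalized DTW distance between each priority trace
# and its paired (z-scored) observation, plus Spearman correlation at matched time indices,
# averaged over rollouts.
import numpy as np
from scipy.stats import spearmanr

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-8)

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)          # path-length-normalized distance

def alignment_summary(priority_runs, observation_runs):
    """Each argument is a list of 1-D arrays, one pair per inference rollout."""
    dtws, rhos = [], []
    for p, o in zip(priority_runs, observation_runs):
        dtws.append(dtw_distance(zscore(p), zscore(o)))
        rho, _ = spearmanr(p, o)
        rhos.append(rho)
    return float(np.mean(dtws)), float(np.mean(rhos))
```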

Figure 9. DTW analysis and correlation analysis for priority to wind and raw observation of wind. Upper left: Original time series of priority and wind observation values over steps. Upper right: Normalized comparison of priority and observation time series, with the corresponding DTW distance. Bottom left: Spearman correlation analysis between priority and observation values, including the correlation coefficient and p-value. Bottom right: DTW alignment path illustrating the temporal correspondence between priority and observation sequences.

Figure 10. DTW analysis and correlation analysis for attention weight and raw observation values. Upper left: Original time series of attention weight and observation values across steps. Upper right: Normalized comparison of attention weight and observation time series, with the corresponding DTW distance. Bottom left: Spearman correlation analysis between attention weight and observation values, including the correlation coefficient and p-value. Bottom right: DTW alignment path illustrating the temporal correspondence between attention weight and observation sequences.

Table 3 shows DTW and correlation analysis for normalized observation and percentage of priority for each category. Read from the DTW column first, since it captures episode-level similarity under small temporal offsets. Wind shows the strongest match, indicating that the priority co-varies with local wind fluctuations across contiguous segments. Kinematics and distance occupy an intermediate band, consistent with priority being revisited periodically rather than tracked frame by frame. Goal info and position display weaker shape agreement, and signal and penalty are weaker still, which suggests that these channels are influenced more by task phase or decision context than by waveform similarity to their raw observations. The correlation columns provide the trend direction at the same time index. Goal info, position, and signal exhibit stable negative associations. A natural reading is that as the route stabilizes and information becomes more certain, these channels are down-weighted, with brief increases around turns or corridor switches. Kinematics and penalty are positively associated, in line with higher allocation when motion regulation or local risk rises. Wind and distance yield little pointwise correlation on the raw series, a result that is not at odds with their DTW behavior since episodic responses and slow trends rarely synchronize at the exact time index. The table indicates two recurrent modes of priority allocation. Inputs governed by environmental dynamics tend to produce segment-level coupling that DTW detects even when correlation is weak. Inputs tied to progress and geometry tend to produce slower reallocations for which correlation carries the signal even when shapes do not align closely. Because wind and distance show non-significant correlations on the raw series, we redefine these observations as one-second deltas and repeat the same DTW and correlation analyses to test whether the priority is more sensitive to changes than to original values.

Table 3. DTW and correlation analysis for normalized observation and percentage of priority for each category.

To test whether priority responds more to changes in the scene than to absolute levels, we redefined the observation for wind and distance as a one-second delta and ran the alignment tests again. These two categories (as shown in Figures 11, 12) were chosen because their pointwise correlations on the raw series were weak. The delta view suppresses slow drift and highlights local transitions, which is where the policy reallocates priority in many rollouts. With this transformation the shape match improves markedly. DTW drops by about 39% for wind and about 19% for distance, indicating that the priority traces and the delta series now share sharper onsets and offsets. Spearman correlation results move toward stronger association, and the corresponding p values decrease. The DTW alignment paths stay near the diagonal for long stretches and bend at the same transition points, which is the pattern expected when priority modulates around gusts, corridor entries, and brief approach phases rather than tracking raw levels frame by frame.
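The delta transformation itself is simple; the sketch below assumes a fixed number of control steps per second, which is an illustrative parameter rather than a value reported here.

```python
# One-second delta view applied to wind and obstacle distance before re-running
# the DTW and Spearman tests; `steps_per_second` is an illustrative parameter.
import numpy as np

def one_second_delta(series, steps_per_second=10):
    """Return x[t] - x[t - steps_per_second], zero-padded over the first second."""
    series = np.asarray(series, dtype=float)
    delta = np.zeros_like(series)
    delta[steps_per_second:] = series[steps_per_second:] - series[:-steps_per_second]
    return delta
```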

Figure 11. DTW analysis and correlation analysis for priority to distance and distance change intensity. (A) Original time series comparison between priority and delta distance observation across training steps. (B) Normalized comparison of priority and delta distance observation time series, with the corresponding DTW distance highlighted. (C) Spearman correlation analysis between priority and delta distance observation, including the correlation coefficient, p-value, and monotonic trend. (D) DTW alignment path illustrating the temporal correspondence between priority and delta distance observation sequences.

Figure 12. DTW analysis and correlation analysis for priority to wind and wind change intensity. (A) Original time series comparison between priority and delta wind observation (Wind_X, Wind_Y, Wind_Z aggregated) across training steps. (B) Normalized comparison of priority and delta wind observation time series, with the corresponding DTW distance highlighted. (C) Spearman correlation analysis between priority and delta wind observation, including the correlation coefficient, p-value, and monotonic trend. (D) DTW alignment path illustrating the temporal correspondence between priority and delta wind observation sequences.

Across ten inference trajectories, the learned policy exhibits two complementary priority allocation patterns. For change sensitive inputs such as wind, the priority and the observations align in bursts with small timing offsets. DTW captures this alignment even when pointwise correlation on raw levels is weak. After redefining wind as delta wind, the episodic match remains and correlations strengthen, which indicates that the gates react to gust onsets rather than to absolute wind level. For slow context such as goal progress and geometry, the priority allocation pattern follows broader monotonic shifts and is revisited mainly at route transitions. Distance fits this second pattern on raw levels and moves closer to the change sensitive regime once expressed as delta distance, where short approach or deceleration episodes carry more weight than the absolute value.

Placed alongside the earlier table, the categories separate cleanly without overstatement. Wind is the clearest episodic case. Goal info, position, and signal retain stable negative correlations with only moderate DTW, consistent with gradual reallocation as routes settle and brief upweighting near turns. Kinematics and penalty remain positively associated, reflecting higher gate values when motion regulation or risk rises, while their DTW alignment is weaker than that of wind because adjustments unfold across longer segments. Taken together, these results, along with the stable task performance and the partially decoupled gates, provide the final empirical basis of the study. The Discussion will consider mechanisms, limitations, and how these patterns can inform interface design and future field validation in construction settings.

5 Discussion

This study examined whether grouped priorities could provide an interpretable window on what a learned policy prioritizes during autonomous flight near structures, and whether that window can be obtained without loss of task performance. The motivation comes from construction practice, where autonomous drone operation under partial observability and multiple objectives needs signals that are predictable, auditable, and linked to safety-relevant events. Priority is treated here as an alignment pattern that can be measured against observations, not as a causal explanation of action.

Performance results set the boundary condition. Adding the priority (group-gate) module increased training time, yet final reward and value loss converged to the levels of the baselines. During inference, repeated rollouts in the same environment reached the goal reliably and produced tightly clustered trajectories. We do not infer a specific navigation strategy from these paths; their stability is sufficient to support the subsequent priority analysis, because unstable behavior would make any priority trace difficult to interpret. The structure of the gates clarifies how to read the alignment metrics. The correlation matrix and principal component analysis reveal a common varying mode that moves several gates together when conditions change, alongside category-specific reallocations that adjust the relative mix. A portion of the gate variance therefore reflects global difficulty or phase, while the remainder reflects where the policy directs priority within a phase.
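To illustrate how a shared mode can be separated from category-specific reallocations, the minimal sketch below applies a PCA-style decomposition to a matrix of gate traces. The array shape, the seven-category layout, and the synthetic values are assumptions made for illustration, not the study's analysis code.

```python
import numpy as np

# Synthetic gate traces: one row per decision step, one column per
# observation category (goal, kinematics, wind, position, signal,
# penalty, obstacle distance).
rng = np.random.default_rng(1)
gates = rng.random((1000, 7))

# Center each category and use SVD as a PCA: the leading component is
# the shared mode that moves several gates together, and the residual
# holds the category-specific reallocations.
centered = gates - gates.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

explained = S**2 / np.sum(S**2)                # variance share per component
shared_mode = np.outer(U[:, 0] * S[0], Vt[0])  # rank-1 shared component
category_specific = centered - shared_mode     # what remains per category

print("variance explained by shared mode:", round(float(explained[0]), 3))
```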

With that context, the allocation patterns are regular. During inference, category bands remain within narrow ranges and show brief increases at route transitions. Goal-related priority remains elevated, wind- and geometry-related cues increase around corridor changes, and kinematics and penalty increase when regulation or risk rises. The summary table aggregates ten trajectories by computing DTW between each priority series and its paired observation and by computing Spearman correlations at matched indices, followed by trajectory-level averaging. DTW captures episode-level similarity after allowing small temporal offsets. Spearman correlation analysis captures monotonic association that can be nonlinear. Agreement between DTW and correlation is not required because the measures target different properties of the series. Read together, they indicate that inputs driven by fast environmental dynamics tend to produce segment-level coupling that DTW detects even when correlation is weak, while signals related to task progress and geometry tend to produce slower reallocations that appear in correlation even when waveform shapes do not closely match. Wind and distance required a targeted follow-up because their correlations on the raw series were not statistically significant; both categories change slowly or in bursts, which disadvantages pointwise tests.
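A minimal sketch of this aggregation protocol, under the assumption that each trajectory provides equal-length priority and observation arrays, is given below. The dynamic-programming DTW is written out for illustration rather than taken from the study's code, and scipy.stats.spearmanr supplies the pointwise monotonic test.

```python
import numpy as np
from scipy.stats import spearmanr

def zscore(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / (x.std() + 1e-8)

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain O(n*m) dynamic-programming DTW between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def summarize(trajectories):
    """trajectories: list of (priority, observation) array pairs, one per rollout."""
    dtw_vals, rho_vals = [], []
    for priority, obs in trajectories:
        # DTW on normalized series captures episode-level shape similarity.
        dtw_vals.append(dtw_distance(zscore(priority), zscore(obs)))
        # Spearman correlation at matched indices captures monotonic association.
        rho, _ = spearmanr(priority, obs)
        rho_vals.append(rho)
    # Trajectory-level averaging, as in the summary table.
    return float(np.mean(dtw_vals)), float(np.mean(rho_vals))
```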

Redefining these observations as disturbances (step-to-step changes) suppresses drift and emphasizes local transitions. Under this definition, shape matching improves for both categories and correlations strengthen. The priority gates therefore appear more sensitive to increments in wind and to short approach or deceleration episodes than to absolute levels for these signals. Other categories retain the raw-series view because their correlations are already consistent with the DTW evidence. The sensing perspective translates these findings into practical guidance. For change-sensitive channels such as wind, sensing and preprocessing should preserve fast onsets with low latency and adequate temporal resolution. In many platforms this can be achieved without new hardware by computing short-window differences or derivatives from existing estimates and ensuring that these derived channels are available to downstream logic. Priority-aligned scheduling is also natural. When a category repeatedly shows short priority bursts at transitions, the sampling rate or computation budget for that sensing chain can be raised during those segments and reduced during steady flight. Alignment statistics further provide a simple diagnostic. If the DTW and correlation profiles drift in a sustained way for a given channel, a sensor fault or timing issue may be detectable earlier than task-level reward changes, although validating this requires dedicated fault-injection and field-style evaluations.
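One way such priority-aligned scheduling and alignment-drift diagnostics could be prototyped is sketched below; the rate values, window length, and drop threshold are illustrative assumptions rather than validated settings.

```python
import numpy as np

def schedule_rate(priority: float, base_hz: float = 10.0,
                  burst_hz: float = 50.0, threshold: float = 0.6) -> float:
    """Raise the sampling rate of a sensing chain during priority bursts
    and fall back to the base rate during steady flight."""
    return burst_hz if priority >= threshold else base_hz

def alignment_drift(rho_history: np.ndarray, window: int = 20,
                    drop: float = 0.3) -> bool:
    """Flag a sustained drop in per-episode priority-observation correlation.

    rho_history holds one channel's Spearman correlations over recent
    episodes; a sustained fall relative to the long-run mean is treated
    as a candidate indicator of a sensor fault or timing issue.
    """
    if len(rho_history) < 2 * window:
        return False
    recent = float(np.mean(rho_history[-window:]))
    baseline = float(np.mean(rho_history[:-window]))
    return (baseline - recent) > drop
```

Any such rule would need to be tuned per channel and validated with fault-injection experiments before it could be relied on operationally.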

The study also contributes a method for turning internal signals into external evidence that can support knowledge engineering and risk management for drone operations in building environments. Grouped priorities are defined over human-meaningful observation categories, which anchors model internals to site concepts such as wind exposure, obstacle clearance, motion regulation, and goal progress. The priority and observation comparison is summarized with trajectory-level statistics that can be reproduced across different initial wind settings, which fits practical needs for auditability and record keeping. The delta analysis for wind and distance shows how change-oriented features can be surfaced when raw levels are not informative, a transformation that can be implemented in software and deployed on existing platforms. Together these elements outline a path for supervisors and engineers to monitor priority shifts, to align them with site events in logs, and to connect alignment statistics to safety reviews and procedure updates. The contribution is not a user interface or a human study. It is an interpretable measurement protocol and a set of design implications that reflect the complexity of building environments while remaining compatible with existing sensing and operations.
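As one possible realization of this record-keeping idea, the sketch below appends trajectory-level alignment summaries to a JSON-lines log; the field names, file layout, and example values are hypothetical and serve only to show how grouped priorities could be tied to auditable records.

```python
import json
import time

def write_audit_record(path: str, trajectory_id: str, category_stats: dict) -> None:
    """Append one trajectory-level summary of grouped priorities.

    category_stats maps a category name (e.g., "wind", "obstacle_distance")
    to its DTW distance and Spearman correlation for that trajectory.
    """
    record = {
        "trajectory_id": trajectory_id,
        "timestamp": time.time(),
        "categories": category_stats,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Placeholder values for illustration only, not results from the study.
write_audit_record("priority_audit.jsonl", "trajectory_01",
                   {"wind": {"dtw": 0.0, "rho": 0.0}})
```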

This study has several limitations. First, all experiments are conducted in a single simulated urban environment. As a result, the stability of the observed priority regimes under different layouts, alternative turbulence or wind models, and substantially different sensing configurations remains untested. While we include limited randomization through a small set of precomputed wind-field realizations and modest variation in clutter patterns and sensing latencies, these variations do not fully reflect real-world variability. Second, the study is simulation-based and does not include user interface development or human-subject evaluation. Accordingly, any implications for operator understanding, calibrated trust, or oversight effectiveness should be viewed as prospective rather than validated outcomes. Third, the reported gate–observation alignment results should be interpreted as descriptive indicators of association. The Spearman correlation and DTW analyses quantify temporal alignment but do not establish causal or counterfactual relationships between observations and priority weights. Finally, the delta transformation is used only as an analysis-time lens to highlight change cues in key signals and is not part of the policy input during training or inference, nor does it modify the learned policy.

Future work will expand evaluation beyond the current setting by introducing additional environments, geometry layouts, obstacle densities, and a broader library of wind-field realizations to test the robustness of both performance and priority allocation under stronger distribution shifts. A natural next step is hardware validation using real drones and field sensing pipelines to assess whether the proposed priority tracing remains informative under measurement noise, latency, and actuation disturbances. In parallel, prototype supervisory interfaces can be developed to present grouped priorities alongside mission context, and controlled user studies can examine how operators interpret these signals, whether they support appropriate interventions, and how presentation choices affect workload and trust calibration. Additional sensing studies can systematically vary sampling rate, latency, and sensor placement to characterize how sensing design influences priority alignment and to derive practical configuration guidance.

Training efficiency is another practical direction for future work. While the unconstrained group-gating variant requires more training steps to reach comparable asymptotic performance in our setting, this overhead may be reduced through simple training schedules that do not change the core method. Examples include warm-starting the gated policy from a converged ungated baseline, using a two-stage schedule that enables gating only after the base policy stabilizes, initializing encoders from the baseline and briefly training only the gating module before end-to-end finetuning, and adding mild gate regularization to avoid early saturation. We did not conduct a dedicated acceleration study in this revision, and we present these items as actionable directions rather than validated results. Finally, richer interaction settings, such as multi-agent operation or environments with moving equipment, can be used to stress-test the framework and to refine trajectory-level aggregation and uncertainty estimates, including evaluating whether alignment statistics could serve as candidate online monitoring signals under realistic deployment constraints.
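To indicate what a two-stage schedule of this kind might look like, the sketch below keeps a hypothetical gating head frozen until the base policy has trained for a warm-up number of steps; the module structure, names, and step threshold are assumptions, and the sketch is a direction for exploration rather than a validated acceleration recipe.

```python
import torch.nn as nn

class GatedPolicy(nn.Module):
    """Toy stand-in for a policy network with a group-gating head."""
    def __init__(self, obs_dim: int = 32, n_groups: int = 7, act_dim: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.gate = nn.Linear(64, n_groups)   # hypothetical gating head
        self.head = nn.Linear(64, act_dim)

def set_gating_trainable(policy: GatedPolicy, trainable: bool) -> None:
    """Freeze or unfreeze only the gating head's parameters."""
    for p in policy.gate.parameters():
        p.requires_grad = trainable

policy = GatedPolicy()
WARMUP_STEPS = 200_000   # illustrative threshold, not a tuned value

# Stage 1: train the base policy with the gating head frozen;
# Stage 2: enable gating once the base policy has stabilized.
for step in range(0, 400_000, 50_000):
    set_gating_trainable(policy, trainable=(step >= WARMUP_STEPS))
    # ... one PPO update per iteration would go here ...
```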

6 Conclusion

This study examined whether grouped priorities could provide a readable view of what a learned policy prioritizes when an autonomous drone operates near structures, and whether that view can be obtained without loss of task performance. Within a high-fidelity simulation of complex building environments, adding a lightweight group-gate module to the drone's policy network increased training time but preserved final reward and value loss. Inference rollouts were stable across repeated trials, which supports the validity of post hoc priority analysis. Three empirical findings ground the contribution. First, the gate structure shows a shared varying mode together with category-specific reallocations: several gates move together at scene changes while relative weights still shift across channels. Second, alignment metrics summarize how priority relates to observations across ten trajectories. DTW captures episode-level similarity under small timing offsets, and Spearman correlation captures pointwise monotonic tendencies at matched indices. Read together, the results indicate episodic coupling for environment dynamics and slower reallocations for signals linked to task progress and geometry. Third, replacing raw observations with their step-to-step changes improves shape matching and strengthens correlations, which suggests that priority is sensitive to increments in gusts and to short approach or deceleration episodes rather than to absolute levels.

These signals also matter in complex construction and urban building environments because they connect internal computation to site concepts that practitioners already track, such as wind exposure, obstacle clearance, motion regulation, and goal pressure. Grouped priorities over these categories can be logged alongside trajectories and summarized at the trajectory level for audit and review. The same statistics can guide sensing and processing choices. Change-sensitive channels should preserve rapid onsets with low latency, for example, by increasing sampling rate, refresh rate, or effective bandwidth when their priority rises and using coarser sensing during quiescent periods. The learned priority profiles can inform sensing strategy, indicating which signals may merit higher temporal resolution or communication capacity during critical transitions. In this way, alignment acts not only as an interpretable monitoring signal but also as a design handle for adaptive sensing and resource allocation in the field.

The work is simulation-based and does not include a user interface or a human study. Priority weights are alignment indicators rather than causal proofs, and correlation and DTW do not replace counterfactual tests. Within these boundaries, the contribution is a reproducible protocol for exposing and validating grouped priorities in a multi-objective, partially observable setting, together with evidence that the resulting signals are stable, intelligible, and compatible with performance requirements. This provides a path from interpretability to deployment decisions in construction robotics and supports field trials that connect alignment statistics to safety and productivity outcomes.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

BS: Formal Analysis, Writing – original draft, Software, Methodology, Investigation. HY: Methodology, Software, Validation, Writing – review and editing. JW: Writing – review and editing, Investigation, Validation, Visualization. JD: Writing – review and editing, Resources, Project administration, Supervision, Conceptualization, Methodology.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This material is supported by the Air Force Office of Scientific Research (AFOSR) under grant FA9550-22-1-0492. Any opinions, findings, conclusions, or recommendations expressed in this article are those of the authors and do not reflect the views of the AFOSR.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI was used only to polish the language (checking grammar and typos); the research was designed and conducted entirely by the authors.


Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.


Keywords: autonomous drones, explainable artificial intelligence, human-AI alignment, multi-objective reinforcement learning (MORL), urban operations

Citation: Sun B, You H, Wu J and Du J (2026) Interpretable group gated priority traces for scalarized multi-objective reinforcement learning in autonomous drone operations. Front. Built Environ. 12:1747709. doi: 10.3389/fbuil.2026.1747709

Received: 20 November 2025; Accepted: 09 January 2026;
Published: 04 February 2026.

Edited by:

Pengkun Liu, Hong Kong Center for Construction Robotics, China

Reviewed by:

Hassan Sarmadi, Ferdowsi University of Mashhad, Iran
Junren Luo, National University of Defense Technology, China
Rupayan Das, Institute of Engineering and Management (IEM), India

Copyright © 2026 Sun, You, Wu and Du. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jing Du, eric.du@essie.ufl.edu
