Abstract
Accurate detection of photovoltaic (PV) module defects remains challenging due to environmental variability and the limited fault visibility of single-modality imaging. While RGB and electroluminescence (EL) images provide structural and subsurface information, they fail to capture thermal fault characteristics associated with hotspots, cell mismatch, and localized heating. Integrating infrared (IR) imagery offers complementary thermal cues that are critical for comprehensive PV inspection. This paper proposes a multimodal PV defect segmentation framework based on a modified Mask R-CNN architecture that fuses RGB, IR, and EL modalities at the feature level. A dedicated alignment pipeline combining homography transformation and enhanced correlation coefficient refinement ensures geometric consistency across modalities. A Fusion Attention Block adaptively weights modality-specific features, enabling effective cross-modal representation learning. Model hyperparameters and fusion weights are automatically optimized using the HawkFish Optimization Algorithm to improve convergence stability and segmentation robustness. Experiments conducted on statistically paired RGB–EL–IR datasets demonstrate that incorporating IR imagery significantly improves the detection of thermally driven defects and reduces false negatives in low-contrast and ambiguous regions. The proposed framework consistently outperforms unimodal and bimodal baselines, achieving state-of-the-art segmentation accuracy and enhanced defect localization, particularly for heat-related fault patterns. The results confirm that thermal information provides critical diagnostic value that cannot be recovered from RGB or EL data alone. The adaptive fusion strategy and optimization-driven tuning further enhance generalization under real-world conditions. These findings highlight the importance of IR-integrated multimodal learning for reliable and scalable PV module inspection systems.
1 Introduction
The rapid expansion of photovoltaic (PV) installations has intensified the need for reliable and automated defect detection to maintain energy yield and system longevity (Libra et al., 2023). Conventional inspection techniques (manual assessments or single-modality imaging) often fail to capture subtle surface or sub-surface faults, especially under variable environmental conditions. Recent progress in deep learning, particularly instance segmentation models such as Mask R-CNN, has improved defect localization on PV modules, while nature-inspired metaheuristic algorithms offer effective strategies for tuning complex models (Rabaia et al., 2021). These advances highlight the growing interest in combining multimodal sensing with intelligent optimization to improve diagnostic robustness (He et al., 2022). Figure 1 below shows the delamination of the PV panel and channels for moisture penetration from the back side.
FIGURE 1
However, current approaches remain limited in practice. RGB imagery is sensitive to shadows, reflections, and soiling, whereas infrared and electroluminescence (EL) provide complementary thermal and subsurface information but are rarely exploited jointly (Mathew et al., 2023). In addition, segmentation accuracy is highly dependent on hyperparameter choices, which are often selected heuristically (Sodhi et al., 2022). To address these constraints, this work introduces a multimodal framework that integrates RGB, infrared, and EL imagery within a Mask R-CNN backbone, with hyperparameters and fusion behavior automatically tuned using the HawkFish Optimization Algorithm (HFOA). By co-registering heterogeneous modalities and adaptively optimizing the segmentation model, the proposed method aims to deliver more reliable detection of surface and latent defects in diverse operating conditions. EL imaging remains central to this effort, offering fine-grained characterization of micro-cracks and cell-level discontinuities that are undetectable in standard RGB photographs. Zhang et al. (2024) introduced a lightweight neural architecture, optimized via neural architecture search and knowledge distillation, achieving high detection accuracy on EL images with significantly reduced computational overhead. Building on the publicly available ELPV dataset, Demir and Ataberk (Demir and Necati, 2024) developed a customized 2D CNN that robustly distinguishes intrinsic from extrinsic defects, demonstrating the value of tailored network designs for this modality. Hassan and Dhimish’s dual spin max-pooling CNN (Hassan and Dhimish, 2023) further improved crack detection by combining conventional convolutional layers with a novel pooling strategy that preserves edge details. Chen X. et al. (2023) proposed an automatic crack segmentation framework that fuses deep feature extraction with morphological post-processing to delineate defect boundaries accurately, while Akram et al. 
(2021) showed that a straightforward CNN, when properly trained on EL imagery, can outperform classical feature‐based methods by learning hierarchical defect representations. More recently, El Yanboiy (2024) adapted the YOLOv5 object detector to EL data, enabling real‐time, end‐to‐end localization and classification of multiple defect types within high‐resolution electroluminescence images. Thermal infrared imaging provides complementary insights by highlighting hot spots caused by cell mismatches or faults under load. Tao et al. (2025) explored a complementary direction by integrating modulated photocurrent signals with machine learning techniques to both diagnose and localize array faults, highlighting the importance of electrical domain information in improving diagnostic reliability. Animashaun and Hussain (2023) advanced image based micro crack detection by introducing a regularized convolutional network supported by ground modeling, which improved robustness against noise and structural variability in manufacturing environments.
Other studies relied on more classical learning approaches. Singh et al. (2023) proposed a segmentation method for micro-crack detection using support vector machines, showing that carefully selected features can still offer meaningful performance in well-controlled settings. Hybrid deep learning models have also been investigated: Liu et al. (2025) introduced ResGRU, a combination of residual networks and gated recurrent units, to diagnose compound faults under dust-affected conditions, emphasizing the growing need for architectures that handle temporal and environmental dependencies. Unsupervised and clustering-based approaches have been explored as well and are summarized in Table 1 below.
TABLE 1
| Reference | Modality | Model/Approach | Dataset |
|---|---|---|---|
| Zhang et al. (2024) | EL imaging | NAS-designed lightweight CNN + knowledge distillation | Self-curated EL cell images |
| Demir and Necati (2024) | EL imaging | Customized 2D CNN tailored to micro-crack features | ELPV (2,624 images) |
| Hassan and Dhimish (2023) | EL imaging | Dual-spin max-pooling CNN preserving edge details | ELPV |
| Chen X. et al. (2023) | EL imaging | Mask-RCNN–style segmentation + morphological post-processing | Proprietary EL cell set |
| Akram et al. (2021) | EL imaging | Standard CNN classification | ELPV |
| El Yanboiy (2024) | EL imaging | YOLOv5 object detection for multi-defect localization | Custom EL images (high-res) |
| Tao et al. (2025) | Electrical photocurrent signals | Modulated photocurrent + ML models | Fault diagnosis and localization |
| Animashaun and Hussain (2023) | EL images in manufacturing | Regularized convolutional network | Micro crack Dataset |
| Singh et al. (2023) | Solar cell images | SVM based segmentation | Micro crack dataset |
| Liu et al. (2025) | Dust affected PV array data | ResGRU hybrid deep model | Compound fault diagnosis dataset |
Summary of related works.
Despite significant progress in photovoltaic defect detection, several key gaps persist in the current literature. First, the vast majority of studies remain confined to single‐modality inputs—whether electroluminescence, infrared thermography, or visible‐light imaging—resulting in limited sensitivity to faults that only manifest in complementary spectral bands (Chen L. et al., 2023; Jia et al., 2024; Patel et al., 2020). Second, most works have focused on classification frameworks that provide image‐level or patch‐level defect labels, rather than leveraging instance segmentation to deliver precise, pixel‐level localization of cracks, soiling, or hot spots. Third, hyperparameter tuning and modality‐fusion strategies are typically performed manually or via grid search, leading to suboptimal performance and high computational overhead. Finally, although some approaches demonstrate strong accuracy in controlled or small‐scale datasets, few have been validated on large, real‐world installations under variable environmental conditions, nor have they been packaged into open, reproducible toolkits for field deployment.
Our proposed methodology directly addresses each of these shortcomings. By co‐registering RGB, infrared, and electroluminescence imagery, we establish a true multimodal foundation that captures both surface and subsurface anomalies. Integrating this fused data into a Mask R-CNN backbone enables instance‐level segmentation, ensuring that each defect is precisely delineated rather than merely flagged.
The novelty of this work lies in the integrated design of a multimodal PV defect-segmentation framework that unifies RGB, infrared (IR), and electroluminescence (EL) imagery through a dedicated alignment pipeline, a feature-level fusion mechanism, and metaheuristic-driven optimization. Unlike prior studies that rely on single-modality data or simple concatenation, the proposed method incorporates geometric co-registration with ECC refinement, enabling meaningful cross-modality correspondence even when images originate from different sensors. A tailored Fusion Attention Block then learns adaptive modality weights, allowing the network to emphasize structural, thermal, or subsurface cues depending on the defect type. Furthermore, the HawkFish Optimization Algorithm (HFOA) automatically tunes critical hyperparameters and fusion behavior, improving robustness under diverse imaging conditions. Together, these components form a cohesive system that addresses limitations of existing PV inspection approaches by enhancing defect visibility, reducing false detections, and leveraging complementary sensory information more effectively than any modality alone.
The remainder of this paper is organized as follows. In Section 2, we detail our proposed multimodal fusion framework, describing the co-registration of RGB, infrared, and electroluminescence images, the integration into a Mask R-CNN backbone, and the HawkFish Optimization Algorithm for automated hyperparameter and fusion-weight tuning. In Section 3, we outline the experimental setup and evaluation metrics used to benchmark segmentation accuracy, localization precision, and computational efficiency on both laboratory and field-collected data. Section 4 reports the results of these experiments, provides a comprehensive discussion comparing our method against single-modality baselines, and analyzes robustness under varying environmental conditions. Finally, Section 5 concludes the paper by summarizing our key findings, highlighting practical implications for PV maintenance, and suggesting directions for future research.
2 Methodology
The proposed method in this section begins by synchronously acquiring RGB, infrared, and electroluminescence images of each PV module under controlled irradiance, then spatially co-registering these modalities to ensure pixel-level alignment. Next, we extract deep feature maps from each modality using parallel convolutional streams and merge them via a learnable fusion layer before feeding the combined representation into a Mask R-CNN backbone tailored for multimodal input as further illustrated in Figure 2.
FIGURE 2
To automate and refine both the fusion weights and the network’s hyperparameters, we introduce the HawkFish Optimization Algorithm, which iteratively evaluates segmentation performance and adjusts parameters according to a foraging-inspired search strategy. Once the optimized model is trained, inference proceeds by generating instance-level masks for cracks, hot spots, and soiling regions, followed by thresholding and morphological filtering to produce clean defect maps. Finally, detected anomalies are cross-verified against concurrent I–V curve deviations for severity ranking, enabling prioritized maintenance scheduling.
2.1 Datasets preprocessing and fusion
The proposed method employs two complementary public datasets to construct our multimodal defect-detection corpus. The first is the Electroluminescence Photovoltaic (ELPV) dataset (Lu et al., 2023), which comprises 2,624 grayscale EL images of individual solar cells at 300 × 300 pixels. Each image has been expertly annotated with pixel-level binary masks for intrinsic micro-cracks, finger interruptions, and black-core defects. To prepare these images, we normalize pixel intensities to [0, 1], apply contrast-limited adaptive histogram equalization to enhance crack visibility, and retain the original aspect ratio during resizing. The second dataset is the Photovoltaic Panel Defect Dataset from NeurobotData on Kaggle (Neurobotdata, 2021), containing 1,200 RGB images of full PV modules. Images are labeled for three defect categories (cracks, soiling, and snail trails) with bounding-box annotations, as shown in Figure 3.
FIGURE 3
In the absence of publicly available datasets that provide physically paired RGB–IR–EL imagery of the same photovoltaic module, this study adopts a statistically aligned multimodal pairing strategy rather than a physically matched one. The ELPV dataset supplies cell-level EL images (captured under controlled electroluminescence conditions), whereas the NeurobotData RGB dataset contains full-module photographs acquired in natural lighting. Because these sources differ in scale, acquisition modality, and imaging conditions, a direct one-to-one physical correspondence between EL and RGB samples is not possible. Instead, we constructed synthetic multimodal pairs by matching EL and RGB images based on defect category, visual morphology, and module characteristics (e.g., crack orientation, defect severity, background texture). The purpose of this pairing is not to recreate the exact physical module across modalities but to simulate complementary spectral cues—EL providing subsurface micro-crack visibility, RGB revealing surface degradation, and IR highlighting thermal anomalies—and to evaluate how a fusion-based segmentation pipeline behaves when such complementary information is jointly available. After statistical pairing, images were geometrically normalized and co-registered using homography to emulate spatial alignment, acknowledging that this alignment is conceptual rather than physically exact. We explicitly recognize that this synthetic pairing introduces limitations, as the true pixel-level correlation across modalities cannot be guaranteed. Nevertheless, this approach is consistent with prior multimodal feasibility studies (Waqar Akram and Bai, 2025; Mohamed et al., 2025) and enables controlled exploration of whether multimodal cues, when fused and optimized through HFOA, can enhance segmentation accuracy.
For our instance-segmentation task, we manually converted these boxes into pixel-level masks and then standardized all images by subtracting the ImageNet mean and dividing by the standard deviation per channel. We also applied data augmentation (random horizontal flips, small random rotations, and brightness jitter) to improve robustness under variable lighting. To fuse these two modalities, we first formed 1,000 EL–RGB image pairs by matching module identifiers and defect types. We co-registered each pair via homography estimation using SIFT keypoints on module frames, then resampled to a common grid. Finally, we fused the normalized EL and RGB channels into a four-channel tensor X ∈ ℝ^(256×256×4). This unified input is passed through a learnable fusion layer in our Mask R-CNN backbone, allowing the network to weight each modality adaptively.
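As an illustrative sketch (not the exact preprocessing code used in the experiments), the channel standardization and four-channel stacking can be written as follows. The ImageNet statistics are the standard published values; the 256 × 256 working resolution and the min–max scaling of the EL channel are assumptions consistent with the text:

```python
import numpy as np

# Standard ImageNet channel statistics used for the RGB channels.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def build_four_channel_tensor(rgb, el):
    """Stack a standardized RGB image (H, W, 3) and a normalized EL map
    (H, W) into the four-channel input tensor described in the text.

    rgb: float array in [0, 1], shape (H, W, 3)
    el:  float array, arbitrary range, shape (H, W)
    """
    rgb_norm = (rgb - IMAGENET_MEAN) / IMAGENET_STD            # per-channel standardization
    el_norm = (el - el.min()) / (el.max() - el.min() + 1e-8)   # scale EL to [0, 1]
    return np.concatenate([rgb_norm, el_norm[..., None]], axis=-1)

# Hypothetical 256 x 256 inputs standing in for real co-registered images.
tensor = build_four_channel_tensor(
    np.random.rand(256, 256, 3), np.random.rand(256, 256)
)
```

The resulting array has shape (256, 256, 4) and is what the learnable fusion layer would consume.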
In Figure 4, ImageNet heatmap refers to a Grad-CAM–style feature activation map generated using a ResNet-50 model pretrained on ImageNet. This visualization is not used for training or fusion in our pipeline; rather, it serves as a qualitative tool to illustrate how a generic, pretrained convolutional backbone responds to defect-related textures in EL images. ImageNet pretrained networks are widely adopted as universal feature extractors because their early and mid-level filters capture edges, contours, and structural patterns that transfer well across domains. By projecting these activation intensities as a heatmap, we highlight regions where the backbone naturally focuses (such as cracks, hotspots, and cell-level irregularities) demonstrating the motivation behind using feature-level fusion in our Mask R-CNN architecture. Thus, the term “manual fusion” refers only to overlaying backbone feature activations on the EL image for interpretability, not to any step in the actual multimodal fusion algorithm.
FIGURE 4
The fusion pipeline begins by establishing correspondence between the electroluminescence and RGB datasets. Individual EL images are first matched to their RGB counterparts based on module identifiers and defect labels, yielding paired samples that capture both surface and sub-surface anomalies. Feature-point correspondences are then extracted using scale-invariant keypoint detectors (e.g., SIFT) on module frames, and a robust homography is estimated via RANSAC to align each EL image to the spatial geometry of the higher-resolution RGB image. All paired images are resampled to a common pixel grid, ensuring that defect regions coincide across modalities.
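The homography solve at the heart of this step can be sketched with a plain Direct Linear Transform. In the actual pipeline the correspondences come from SIFT matching filtered by RANSAC; here a small set of clean inlier matches is assumed so that only the estimation and warping math is shown:

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: solve for H (3x3) such that dst ~ H @ src.
    src, dst: (N, 2) arrays of matched keypoints, N >= 4.
    In the full pipeline the matches come from SIFT + RANSAC; here they
    are assumed to be inlier correspondences already."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)          # null-space vector = homography entries
    return H / H[2, 2]                # normalize so H[2, 2] = 1

def apply_homography(H, pts):
    """Map (N, 2) points through H with perspective division."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Synthetic example: recover a known scale + translation homography.
H_true = np.array([[1.2, 0.0, 5.0], [0.0, 0.9, -3.0], [0.0, 0.0, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
dst = apply_homography(H_true, src)
H_est = estimate_homography(src, dst)
```

With noise-free correspondences the DLT recovers the transform exactly (up to scale); RANSAC is what makes this robust to mismatches in practice.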
Once co-registration is complete, each aligned pair is normalized channel-wise: EL intensities are scaled to [0, 1] after contrast-limited adaptive histogram equalization, while RGB channels undergo mean subtraction and standard-deviation normalization using ImageNet statistics. The normalized EL map is then concatenated with the three RGB channels to form a four-channel tensor X ∈ ℝ^(256×256×4). During training, a learnable fusion layer applies modality-specific weights w_m (with Σ_m w_m = 1) at each spatial location, producing a fused feature map as given in Equation 1:

F(x, y) = Σ_{m ∈ {EL, R, G, B}} w_m(x, y) · F_m(x, y)   (1)

where F_m denotes the feature map extracted from channel m. This adaptive weighting allows the segmentation backbone to emphasize the most informative modality cues (surface texture from RGB or crack patterns from EL) dynamically across different regions of the panel.
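The per-pixel weighted fusion of Equation 1 can be sketched as follows, assuming the weights come from a softmax over learnable per-modality logits (the softmax parameterization is an illustrative choice that guarantees the weights sum to 1 at every location):

```python
import numpy as np

def fuse_modalities(features, logits):
    """Per-pixel weighted fusion in the form of Equation 1.
    features: (M, H, W) stack of modality feature maps
    logits:   (M, H, W) learnable scores; a softmax over the modality
              axis yields weights w_m(x, y) that sum to 1 per pixel."""
    w = np.exp(logits - logits.max(axis=0, keepdims=True))  # numerically stable softmax
    w /= w.sum(axis=0, keepdims=True)
    return (w * features).sum(axis=0), w

feats = np.random.rand(4, 8, 8)   # EL + R, G, B channels (toy sizes)
logits = np.zeros((4, 8, 8))      # zero logits -> equal weights of 0.25
fused, w = fuse_modalities(feats, logits)
```

With zero logits every modality receives weight 0.25, so the fused map reduces to the plain channel mean; training would move the logits away from this uniform starting point.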
The proposed Multimodal Fusion–Mask R-CNN–HFOA framework begins by capturing synchronized RGB, infrared, and electroluminescence images of each photovoltaic panel, then geometrically aligns the three modalities through a SIFT-RANSAC homography so that every pixel across channels corresponds to the same physical location. After contrast-limited adaptive histogram equalization and mean–variance normalization, the aligned channels are concatenated into a single four-channel tensor that preserves both surface-level visual cues and sub-surface thermal or electrical anomalies. This tensor is fed into a Mask R-CNN whose anchor sizes, region-of-interest pooling dimensions, fusion weights, and learning hyper-parameters collectively form a design vector θ. Instead of relying on manual tuning, the Enhanced HawkFish Optimization Algorithm (HFOA) maintains a population of candidate θ vectors that iteratively train lightweight Mask R-CNN surrogates, evaluate their segmentation loss and mean average precision, and move each candidate toward its own historical best and the global best via Lévy-flight exploration tempered by an energy-aware attraction rule; the cycle repeats until convergence, yielding an optimally configured θ.
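Since the full HFOA update rules are not reproduced in this paper, the following toy sketch only illustrates the general scheme described above: a population of candidate θ vectors, attraction toward each candidate's personal best and the global best, and heavy-tailed exploration. All coefficients are illustrative, and a cheap quadratic stands in for the expensive segmentation-loss evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)

def hfoa_minimize(fitness, dim, bounds, pop=12, iters=60):
    """Schematic population search in the spirit of the HawkFish
    optimizer described in the text (personal/global-best attraction
    plus heavy-tailed 'Levy-like' exploration). The exact HFOA update
    rules are not reproduced here; this is an illustrative sketch."""
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(pop, dim))
    pbest, pfit = X.copy(), np.array([fitness(x) for x in X])
    g = pbest[pfit.argmin()].copy()                     # global best
    for _ in range(iters):
        for i in range(pop):
            step = rng.standard_cauchy(dim) * 0.05 * (hi - lo)  # heavy-tailed move
            X[i] += rng.random() * (pbest[i] - X[i]) \
                  + rng.random() * (g - X[i]) + step
            X[i] = np.clip(X[i], lo, hi)
            f = fitness(X[i])
            if f < pfit[i]:                             # update personal best
                pbest[i], pfit[i] = X[i].copy(), f
        g = pbest[pfit.argmin()].copy()
    return g, pfit.min()

# Toy objective standing in for the segmentation loss over a 3-D theta.
best, val = hfoa_minimize(lambda x: np.sum((x - 0.3) ** 2), dim=3, bounds=(0.0, 1.0))
```

In the real pipeline each fitness evaluation trains a lightweight Mask R-CNN surrogate and scores its loss and mAP, which is why keeping the population and iteration counts modest matters.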
Algorithm 1

A final Mask R-CNN is then trained with θ⋆ on the full dataset and deployed for inference: each new multimodal sample is aligned, fused, and passed through the network to produce pixel-accurate masks for cracks, hotspots, and soiling, which are subsequently cleaned with morphological filters. If current–voltage (I–V) curves are available, the spatial extent of each mask is correlated with electrical deviation to generate a severity ranking that prioritizes maintenance actions as shown in Figure 5 below.
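The morphological cleanup step can be sketched as a binary opening (erosion followed by dilation), written here in pure NumPy rather than with a library routine; the 3 × 3 structuring element is an illustrative choice:

```python
import numpy as np

def binary_open(mask, k=3):
    """Morphological opening with a k x k square structuring element,
    used to strip speckle noise from thresholded defect masks while
    preserving solid defect regions. Pure-NumPy sketch."""
    pad = k // 2

    def erode(m):
        p = np.pad(m, pad, constant_values=0)
        out = np.ones_like(m)
        for dy in range(k):
            for dx in range(k):
                out &= p[dy:dy + m.shape[0], dx:dx + m.shape[1]]
        return out

    def dilate(m):
        p = np.pad(m, pad, constant_values=0)
        out = np.zeros_like(m)
        for dy in range(k):
            for dx in range(k):
                out |= p[dy:dy + m.shape[0], dx:dx + m.shape[1]]
        return out

    return dilate(erode(mask))

mask = np.zeros((10, 10), dtype=bool)
mask[2:7, 2:7] = True   # a genuine 5x5 defect region
mask[9, 9] = True       # single-pixel speckle
cleaned = binary_open(mask)
```

The opening removes the isolated pixel but returns the 5 × 5 block unchanged, which is exactly the behavior wanted before severity ranking.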
FIGURE 5
2.1.1 Infrared (IR) dataset description
To incorporate thermal information into the multimodal fusion pipeline, an additional set of infrared (IR) images was collected independently under controlled laboratory conditions. A total of 480 IR images of photovoltaic modules were acquired using a long-wave infrared (LWIR) thermal camera (FLIR-class device, 7.5–13 μm spectral range) with a native resolution of 320 × 256 pixels. The modules were operated under variable electrical loading (0.6–1.0 Isc) to stimulate thermal contrast and highlight fault-related hotspots such as cell mismatch, cracked fingers, and bypass-diode anomalies. All images were captured indoors at an ambient temperature of 23 °C–25 °C, which ensured repeatable thermal behavior and minimized environmental drift.
Because no publicly available dataset provides physically paired RGB–IR–EL image triplets of the same panel, the IR samples used in this study do not correspond to the same physical modules as those in the RGB and EL datasets. Instead, the IR dataset serves as an independent thermal modality representing realistic hotspot patterns commonly observed in defective PV modules. To enable multimodal integration, IR images were geometrically normalized to 256 × 256 pixels and co-registered to the RGB reference frame using the same homography + ECC alignment procedure used for EL–RGB pairs. This produces a conceptual spatial correspondence that allows Mask R-CNN to learn cross-modal feature relationships, while acknowledging that the alignment is synthetic rather than physically exact.
This IR dataset is therefore intended to evaluate the feasibility of incorporating a thermal modality into a fusion-based defect segmentation framework, rather than to reconstruct the exact same module across modalities. The acquisition of physically paired RGB–IR–EL data from a single PV installation remains an open direction for future work and is required for full physical validation of the fusion strategy.
2.1.2 Train–validation–test splitting strategy
To ensure fair evaluation and prevent data leakage across the multimodal fusion pipeline, a strict and reproducible splitting protocol was implemented for all EL, RGB, and IR samples. Since the datasets originate from different sources and the multimodal pairs are synthetically constructed, we adopted a grouped, defect-aware splitting strategy that assigns all samples derived from the same original instance to a single partition. This prevents augmented variants or modality counterparts (EL, RGB, IR) of the same synthetic triplet from appearing across different splits.
We used a 70/15/15 division for training, validation, and testing, respectively. For the ELPV dataset, images were grouped by defect category (micro-crack, finger interruption, black core) to avoid category imbalance and to ensure that test performance reflects generalization across defect types rather than memorization. For the RGB dataset, grouping was performed based on module identifier metadata provided with the Kaggle annotations. The IR dataset, collected independently under controlled laboratory conditions, was similarly partitioned into non-overlapping subsets based on module identity during acquisition.
All synthetic multimodal triplets (RGB–EL–IR) created for fusion were treated as indivisible units: each triplet was assigned entirely to the train, validation, or test set to ensure that no modality representation of the same conceptual sample could leak into another partition.
Data augmentation—including random flips, rotations, brightness jitter, and histogram perturbation—was applied only after the dataset was split, and exclusively to the training set. Neither validation nor test sets were augmented. This prevents inflated metrics caused by augmented variants appearing in multiple splits. Across all three modalities, the final distribution consisted of 1,260 training samples, 270 validation samples, and 270 test samples after fusion and filtering as illustrated in Table 2. To assess stability, all experiments were repeated with five different random seeds, and performance metrics are reported as the mean ± standard deviation over these runs.
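A hash-based sketch of the grouped splitting rule: every sample carrying the same group key (e.g., all three modality samples of one synthetic RGB–EL–IR triplet) hashes to the same bucket, so no group straddles train/validation/test. The identifier format is hypothetical:

```python
import hashlib

def grouped_split(sample_ids, group_of, ratios=(0.70, 0.15, 0.15)):
    """Assign every sample of a group to a single partition by hashing
    the group key, guaranteeing that modality counterparts of one
    conceptual sample never leak across splits."""
    splits = {"train": [], "val": [], "test": []}
    for sid in sample_ids:
        # Deterministic bucket in [0, 100) derived from the group key only.
        h = int(hashlib.md5(str(group_of(sid)).encode()).hexdigest(), 16) % 100
        if h < ratios[0] * 100:
            splits["train"].append(sid)
        elif h < (ratios[0] + ratios[1]) * 100:
            splits["val"].append(sid)
        else:
            splits["test"].append(sid)
    return splits

# Hypothetical IDs: 600 triplets, each contributing three modality samples.
ids = [f"trip{t:04d}_{m}" for t in range(600) for m in ("rgb", "el", "ir")]
splits = grouped_split(ids, group_of=lambda s: s.rsplit("_", 1)[0])
```

Because the bucket depends only on the group key, augmentations applied later to training samples can never produce a leaked variant in validation or test.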
TABLE 2
| Modality | Total samples | Train | Validation | Test | Grouping criterion |
|---|---|---|---|---|---|
| EL | 2624 | 1836 | 394 | 394 | Defect type/EL ID |
| RGB | 1200 | 840 | 180 | 180 | Module ID |
| IR | 480 | 336 | 72 | 72 | Module ID |
| Fused Triplets | 1800 | 1260 | 270 | 270 | Triplet grouping |
Multimodal dataset split summary.
2.2 Modality alignment and registration strategy
To ensure accurate pixel-level correspondence across RGB, infrared, and electroluminescence modalities, homography transformations were computed using SIFT keypoint detection and RANSAC-based robust estimation (de Gioia et al., 2020). However, due to variable environmental conditions and slight misalignments in sensor calibration or UAV positioning, the registration process was subject to geometric uncertainty. To quantify and mitigate these effects, we introduced an alignment error tolerance threshold of 2 pixels based on average inter-keypoint residuals.
Once the homography is computed, the IR and EL images are warped to align with the RGB reference frame at a standardized resolution of 256 × 256 pixels. During this process, we assess alignment quality using the average reprojection error between matched keypoints. A threshold of 2.0 pixels is used to identify misaligned samples. If the reprojection error exceeds this threshold, a secondary refinement step is triggered using the Enhanced Correlation Coefficient (ECC) optimization method (Figure 6), which fine-tunes the alignment by maximizing intensity correlation between modalities as shown in Table 3. To further safeguard against residual misalignment, a spatial confidence map is computed for each pixel based on local homography consistency. These maps are later used during training to modulate the segmentation loss, effectively down-weighting regions with high geometric uncertainty. This multi-tiered registration strategy not only improves pixel-wise fusion integrity but also enhances the robustness of downstream instance segmentation.
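The decision logic that triggers the secondary refinement can be sketched as below; the ECC refinement itself (e.g., OpenCV's findTransformECC in a real implementation) is omitted, and only the 2-pixel residual test from Table 3 is shown:

```python
import numpy as np

REPROJECTION_THRESHOLD = 2.0  # pixels, as specified in Table 3

def mean_reprojection_error(H, src, dst):
    """Average distance between H-projected source keypoints and their
    matched destination keypoints."""
    pts = np.hstack([src, np.ones((len(src), 1))]) @ H.T
    proj = pts[:, :2] / pts[:, 2:3]
    return float(np.linalg.norm(proj - dst, axis=1).mean())

def needs_ecc_refinement(H, src, dst):
    """True when the homography alone leaves more than 2 px of average
    residual, i.e., when the ECC stage should be invoked."""
    return mean_reprojection_error(H, src, dst) > REPROJECTION_THRESHOLD

# Toy keypoints: identity warp matches perfectly; a 5 px shift does not.
src = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
H_id = np.eye(3)
```

A perfectly aligned pair yields zero residual and skips refinement, while a systematic 5-pixel offset exceeds the threshold and falls through to ECC.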
FIGURE 6
TABLE 3
| Parameter/Metric | Value/Setting | Purpose/Description |
|---|---|---|
| Reference modality | RGB | Used as geometric base for alignment |
| Keypoint detector | SIFT | Detects invariant features across modalities |
| Matching algorithm | FLANN + Lowe’s Ratio Test | Matches descriptors between modalities |
| Homography estimator | RANSAC | Robust model for geometric warping |
| Target resolution after warping | 256 × 256 pixels | Unified spatial dimensions for all modalities |
| Reprojection error threshold | 2.0 pixels | Max average keypoint error before triggering refinement |
| Refinement method | ECC (Enhanced Correlation Coefficient) | Optimizes intensity alignment when keypoints are unreliable |
| Max ECC iterations | 50 | Avoids overfitting during local intensity refinement |
| Confidence map resolution | 256 × 256 | Encodes spatial reliability of alignment at each pixel |
| Confidence integration in training | Segmentation-loss weighting | Reduces influence of uncertain regions during segmentation |
Parameters and thresholds for multimodal alignment strategy.
2.3 Mask R-CNN
Mask R-CNN extends Faster R-CNN by adding a parallel mask-prediction branch that performs pixel-level segmentation for each detected object (Li et al., 2019). Given an input image (or fused multimodal map) X, a backbone CNN produces a feature map F. A Region Proposal Network then generates a set of candidate boxes {b_i} (Chen et al., 2021). For each b_i, RoIAlign extracts a fixed-size feature tensor. These features are fed into three sibling heads: (1) a classification head that outputs class probabilities p_i, (2) a box-regression head that predicts offsets t_i, and (3) a mask head that predicts a binary mask M_i via a small fully convolutional network. The overall training objective combines three losses, as shown in Equation 2 (Tao et al., 2021):

L = L_cls + L_box + L_mask   (2)

where L_cls is cross-entropy on the true class, L_box is smooth L1 between predicted and true box deltas, and L_mask is pixel-wise binary cross-entropy against the ground-truth mask.
To adapt Mask R-CNN for multimodal PV defect detection, we introduced a learnable feature-fusion layer before the backbone that combines modality-specific feature maps into a unified representation, as shown in Equation 3:

F_fused = Σ_{m ∈ {RGB, IR, EL}} α_m F_m,  with Σ_m α_m = 1   (3)
Additionally, we modified the RPN anchor scales and aspect ratios to match the typical geometry of PV panel defects (elongated cracks vs. blob-like hot spots) and replaced batch normalization with group normalization to maintain stable gradients under small batch sizes. Finally, the mask head was reconfigured to produce higher-resolution masks and incorporate a spatial attention module that re-weights features by learned attention maps before the final convolutional prediction.
Within our pipeline, Mask R-CNN serves as the core defect-segmentation engine, delivering instance-level masks whose accuracy directly influences both maintenance prioritization and the fitness evaluation in HFOA. During inference, each mask is thresholded to a binary map M, and its quality is quantified via the Intersection-over-Union metric shown in Equation 4:

IoU(M, G) = |M ∩ G| / |M ∪ G|   (4)

which feeds into our compound HFOA objective. By providing precise spatial delineation of cracks, soiling, and hot spots, Mask R-CNN enables not only accurate localization but also severity scoring when combined with electrical deviations, thus underpinning both diagnosis and decision-support in our multimodal defect detection framework.
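The IoU of Equation 4 amounts to the following computation on binary masks (a straightforward sketch):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between a thresholded predicted mask and
    the ground-truth mask, as in Equation 4."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, gt).sum() / union

# Toy masks: two 4x4 squares overlapping in a 2x2 region.
pred = np.zeros((8, 8)); pred[0:4, 0:4] = 1
gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1
```

Here the intersection covers 4 pixels and the union 28, giving IoU = 1/7; identical masks score exactly 1.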
2.3.1 Multimodal fusion layer design in mask R-CNN
To integrate RGB, EL, and IR modalities within Mask R-CNN, we employ a mid-level feature fusion strategy positioned between the backbone and the Feature Pyramid Network (FPN). Each modality is first processed through its own lightweight ResNet-50 backbone to extract modality-specific feature maps at multiple scales. These feature maps are then aligned spatially (Section 2.2) and passed to a dedicated Fusion Attention Block that computes cross-modal feature interactions.
The fusion layer operates by concatenating modality-specific features channel-wise and applying a learnable attention mechanism that produces modality weights through a squeeze-and-excitation operation. These weights are learned end-to-end by backpropagation along with all Mask R-CNN parameters, enabling the network to adaptively emphasize the modality that provides the most discriminative information for each image region. The fused feature maps are subsequently fed into the standard FPN, Region Proposal Network (RPN), and Mask Head of Mask R-CNN.
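A minimal sketch of the squeeze-and-excitation weighting described above, with randomly initialized MLP weights standing in for the learned parameters (the shapes, hidden size, and use of a softmax over modalities are illustrative assumptions):

```python
import numpy as np

def se_modality_weights(feats, W1, W2):
    """Squeeze-and-excitation over modalities: global-average-pool each
    modality's feature map ('squeeze'), pass the pooled vector through a
    small two-layer MLP ('excitation'), and softmax so the M modality
    weights are positive and sum to 1.
    feats: (M, C, H, W); W1: (M*C, hidden); W2: (hidden, M)."""
    z = feats.mean(axis=(2, 3)).reshape(-1)   # squeeze: one vector of length M*C
    h = np.maximum(z @ W1, 0.0)               # excitation with ReLU
    logits = h @ W2
    w = np.exp(logits - logits.max())         # stable softmax over modalities
    return w / w.sum()

rng = np.random.default_rng(1)
feats = rng.random((3, 8, 16, 16))            # RGB, EL, IR feature maps (toy sizes)
w = se_modality_weights(feats,
                        rng.standard_normal((24, 6)),
                        rng.standard_normal((6, 3)))
```

In the full network these weights are broadcast back over the concatenated feature maps and trained end-to-end with the rest of Mask R-CNN.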
All modalities are normalized independently prior to feature extraction to preserve statistical integrity. RGB images follow ImageNet normalization, while EL and IR channels are normalized using per-dataset mean and variance. No joint normalization is applied, preventing cross-modality statistical leakage. This design ensures that EL contributes fine-grained crack visibility, RGB provides structural context, and IR highlights thermal anomalies, all of which reinforce the segmentation performance of the final network.
Figure 7 illustrates the overall multimodal architecture used to integrate RGB, EL, and IR information within the Mask R-CNN framework. Each modality is first passed through its own lightweight backbone network to extract modality-specific feature maps that capture complementary characteristics: structural context from RGB, subsurface crack patterns from EL, and thermal anomalies from IR. The outputs of these backbones are then fed into a dedicated Fusion Attention Block, which performs feature-level integration by learning adaptive weights that determine how much each modality contributes at every spatial location. This fused representation is subsequently forwarded to the Feature Pyramid Network (FPN), enabling multiscale reasoning before the Region Proposal Network (RPN) and segmentation heads generate the final instance masks. By separating backbone extraction per modality and performing learnable fusion prior to FPN processing, the architecture ensures that the network benefits from modality complementarity while preserving spatial alignment and maintaining compatibility with the standard Mask R-CNN pipeline.
FIGURE 7
Table 4 enumerates every hyper-parameter that governs the behaviour of Mask R-CNN in the proposed pipeline, grouping them by functional block: backbone, optimiser, training schedule, region-proposal network (RPN), ROI heads, and inference thresholds. For each parameter the final value chosen after HawkFish optimization is shown alongside the search space explored by the algorithm, giving readers the context needed to replicate or further tune the model. Critical design choices, such as adopting ResNet-101 for richer spatial features, using a conservative batch size constrained by the four-channel input, and weighting the mask loss more heavily than the box loss, are justified in the “Rationale” column, linking numerical settings to the physical characteristics of photovoltaic defects (thin cracks, small hotspots, high-resolution EL speckles). Presenting the configuration in this structured form not only enhances transparency but also enables straightforward comparison with alternative segmentation baselines.
TABLE 4
| Category | Parameter (symbol) | Final value | HFOA search range | Rationale/Notes |
|---|---|---|---|---|
| Backbone | Depth (ResNet-ℓ) | ℓ = 101 layers | {50, 101, 152} | ResNet-101 offers a favourable trade-off between receptive field and inference cost for 4-channel inputs |
| | Feature-pyramid levels (P_2–P_5) | 4 levels | fixed | Ensures detection of millimetre-scale cracks while retaining context |
| Optimiser | Initial learning rate (η₀) | 2 × 10⁻⁴ | [1 × 10⁻⁴, 5 × 10⁻⁴] | Lower than ImageNet default owing to small batch and multi-modal noise |
| | Weight decay (λ_wd) | 1 × 10⁻⁴ | [1 × 10⁻⁵, 5 × 10⁻⁴] | Controls over-fitting on limited defect masks |
| | Momentum (μ) | 0.9 | fixed | Standard for SGD with momentum |
| | LR scheduler | cosine-anneal, T_max = 50 epochs | {step, cosine} | Cosine decay proved smoother during HFOA search |
| Training | Batch size (B) | 4 images/GPU | [2, 8] | Limited by 4-channel tensor memory on 16 GB GPU |
| | Epochs (E) | 60 | [30, 100] | Converged mAP plateaued at ∼55 epochs; 60 leaves a safety margin |
| | Gradient-clip norm | 5 | [1, 10] | Stabilises backbone-head co-learning |
| RPN & anchors | Anchor sizes (s) | {16, 32, 64, 128, 256} px | powers of two | Covers defect blobs from 1 mm to 8 cm at 150 dpi imagery |
| | Anchor ratios (r) | {0.5, 1.0, 2.0} | fixed | Standard aspect diversity |
| | NMS threshold (τ_nms) | 0.5 | [0.3, 0.7] | Balances merging of adjacent crack fragments versus suppression |
| ROI heads | ROIAlign output (w × h) | 14 × 14 | {7, 14, 28} | 14 px preserves thin crack edges without heavy RAM usage |
| | Mask head conv layers (n_conv) | 4 layers, 256 ch | [3, 5] | Deeper head improved fine-grained EL details |
| Loss and inference | Mask loss weight (ω_mask) | 1.5 | [1.0, 2.0] | Emphasises pixel accuracy over box IoU |
| | IoU threshold for positive ROI (θ_IoU) | 0.55 | [0.5, 0.7] | Tightens foreground definition for narrow defects |
| | Confidence threshold (p̂) | 0.7 | [0.5, 0.9] | Reduces false positives in homogeneous sky backgrounds |
Mask R-CNN hyper-parameters adopted in the multimodal PV-defect framework.
2.4 HawkFish optimization
The HawkFish Optimization Algorithm (HFOA), proposed by Alkharsan and Ata (2025), emulates the cooperative foraging behavior of hawkfish in a multi-dimensional search space. HFOA is used in this work because of its strong balance between exploration and exploitation, which makes it well suited to tuning complex, multimodal segmentation models. Unlike traditional optimizers or gradient-free heuristics, HFOA combines adaptive movement patterns with selective intensification, allowing it to efficiently search the high-dimensional hyperparameter spaces that govern fusion weights, backbone depth, learning rate, and attention coefficients. This flexibility enables HFOA to avoid premature convergence (a common limitation of PSO or GA) while maintaining stable convergence toward configurations that improve Mask R-CNN performance across heterogeneous RGB, IR, and EL inputs. These characteristics make HFOA particularly appropriate for applications where cross-modal interactions are nonlinear and sensitive to hyperparameter choices.
A population of N candidate solutions is maintained, each with an associated “memory” m_i. At iteration t, the global best position g^t is identified by evaluating a fitness function f(·). Each hawkfish updates its position according to Equation 5:

x_i^{t+1} = x_i^t + c_1(t) r_1 (m_i^t − x_i^t) + c_2(t) r_2 (g^t − x_i^t),   (5)

where c_1(t) and c_2(t) are time-varying coefficients that balance exploration and exploitation, and r_1, r_2 are uniform random vectors. After movement, if f(x_i^{t+1}) < f(m_i^t), the memory is updated via Equation 6:

m_i^{t+1} = x_i^{t+1}, otherwise m_i^{t+1} = m_i^t.   (6)
This dual-attractor mechanism, which pulls each candidate towards both its personal memory and the global best, enables both local refinement and convergence to global optima.
To tailor HFOA for multimodal Mask R-CNN tuning, we introduced three key modifications. First, we incorporated Lévy-flight steps to enhance long-distance exploration, using the formula in Equation 7:

x_i^{t+1} = x_i^t + α ⊗ Lévy(β),   (7)

where β controls the jump distribution and α is a step-size factor. Second, we introduced an energy-depletion factor that adaptively scales c_1(t) and c_2(t), forcing early exploration and later exploitation. Finally, we expanded the memory bank to retain the top personal bests and select the most diverse one for update, using a crowding-distance metric to maintain population diversity.
In our defect-detection framework, HFOA serves as the automatic hyperparameter and fusion-weight tuner. We define a compound objective in Equation 8 as follows:

F(θ) = L_seg(θ) + λ (1 − mAP(θ)),   (8)

where θ encapsulates Mask R-CNN’s learning rate, anchor scales, ROI-pool size, and fusion-layer weights, L_seg is the instance-segmentation loss, and mAP is the mean average precision on a validation set. HFOA iteratively proposes candidate θ, evaluates F(θ), updates memories, and converges to the θ* that minimizes F. By replacing manual grid or random searches, HFOA ensures a balanced trade-off between localization accuracy and generalization, yielding a robust, field-ready model.
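The dual-attractor update, the memory rule, and the Lévy-flight modification described above can be sketched in numpy on a toy sphere objective. This is a minimal illustration, not the tuning code used in the paper: the coefficient schedules, the jump probability, and the search bounds are illustrative assumptions, and the real objective would be the Mask R-CNN validation loss rather than a closed-form function.

```python
import math
import numpy as np

def levy_step(dim, beta, rng):
    """Mantegna's algorithm for heavy-tailed Levy-flight step sizes."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def hfoa(f, dim, n=30, iters=40, w_p=0.6, w_g=1.0, decay=0.05, beta=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5.0, 5.0, (n, dim))       # candidate positions
    mem = x.copy()                             # personal-best memories
    fit = np.apply_along_axis(f, 1, mem)       # fitness of each memory
    for t in range(iters):
        g = mem[fit.argmin()]                  # global best position
        energy = math.exp(-decay * t)          # energy-depletion factor
        r1, r2 = rng.random((n, dim)), rng.random((n, dim))
        # dual-attractor move toward personal memory and global best
        x = x + energy * (w_p * r1 * (mem - x) + w_g * r2 * (g - x))
        # occasional Levy jumps for long-distance exploration
        jumpers = rng.random(n) < 0.1
        steps = [levy_step(dim, beta, rng) for _ in range(jumpers.sum())]
        x[jumpers] += 0.1 * energy * np.array(steps).reshape(-1, dim)
        # memory update: keep whichever of x and memory scores better
        fx = np.apply_along_axis(f, 1, x)
        better = fx < fit
        mem[better], fit[better] = x[better], fx[better]
    return mem[fit.argmin()], float(fit.min())

best_x, best_f = hfoa(lambda v: float(np.sum(v ** 2)), dim=5)
```

On this 5-dimensional sphere function the population contracts toward the origin within the 40-iteration budget, mirroring the convergence behaviour discussed in Section 3.3.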
Table 5 condenses the nine knobs that most strongly shape HFOA’s search behaviour, showing how each was tuned and the bounds explored during calibration. A population of 30 candidate solutions (N = 30) iterates for at most 40 cycles (max_iter = 40), a combination that gave broad hyper-parameter coverage without inflating GPU time. Global exploration is governed by a Lévy-flight exponent of 1.5, which yields occasional long jumps, and a step-size factor of 0.7 that keeps average hops moderate; together they prevent early stagnation. Once candidates are evaluated, they are pulled toward both their personal best (weight 0.6) and the global best (weight 1.0), with the stronger global pull accelerating collective convergence. The energy-decay rate of 0.05 gradually shortens search strides as the optimiser “tires,” shifting emphasis from exploration to exploitation. To avoid crowding around a single solution, the worst individuals are replaced each iteration (replacement rate 0.15), injecting fresh diversity. Finally, an early-stopping tolerance halts the optimiser when the mean-average-precision gain between iterations becomes negligible, ensuring that computational effort is not wasted once performance plateaus. Together, these settings deliver a reproducible balance between exploration breadth, convergence speed, and computational cost.
TABLE 5
| Parameter | Symbol | Final value | Search range |
|---|---|---|---|
| Population size | N | 30 | 20–50 |
| Maximum iterations | max_iter | 40 | 20–60 |
| Lévy exponent | — | 1.5 | 1.3–1.9 |
| Step-size factor | — | 0.7 | 0.3–1.0 |
| Personal-best weight | — | 0.6 | — |
| Global-best weight | — | 1.0 | 0.8–1.2 |
| Energy decay | — | 0.05 | 0.02–0.10 |
| Replacement rate | — | 0.15 | 0.10–0.30 |
| Fitness tolerance | — | — | — |
Core hyper-parameters of the Enhanced HawkFish Optimisation Algorithm (HFOA).
3 Results
This section presents a comprehensive evaluation of the proposed multimodal PV defect detection framework, integrating RGB, infrared, and electroluminescence imagery with a Mask R-CNN segmentation backbone and HawkFish Optimization Algorithm (HFOA) for hyperparameter tuning. We assess the system’s performance across multiple dimensions, including segmentation accuracy, localization precision, alignment robustness, and computational efficiency. Both qualitative and quantitative results are reported to demonstrate the effectiveness of modality fusion and automated optimization in enhancing defect detection. Experiments were conducted on a combined dataset comprising manually aligned EL and RGB images from the ELPV and NeurobotData repositories, supplemented with IR imagery captured in controlled settings. The results compare the proposed method against single-modality baselines and evaluate improvements gained from the fusion strategy, parameter optimization, and uncertainty-aware alignment. Metrics such as mean Average Precision (mAP), Intersection-over-Union (IoU), and F1 score are used to benchmark segmentation quality, while runtime and memory usage are reported to assess deployability in real-world scenarios.
3.1 Segmentation performance metrics
Figure 8 illustrates the effectiveness of the proposed Mask R-CNN pipeline enhanced by multimodal fusion and HawkFish Optimization Algorithm (HFOA) by comparing key performance metrics across five configurations. The results highlight how each design decision—modal fusion and hyperparameter tuning—contributes to overall detection accuracy. The first three bars represent single-modality baselines: RGB, EL, and IR inputs alone. Among them, EL images yield slightly better scores due to their ability to capture micro-cracks invisible in the RGB or IR spectra.
FIGURE 8
However, all three modalities individually underperform when compared to fused configurations. The fourth bar shows the performance when RGB and EL images are fused, but without HFOA. This improves mean Average Precision (mAP) and F1-score notably, indicating that multimodal input provides complementary cues for defect localization. Still, without automated optimization, the segmentation remains suboptimal due to manually tuned hyperparameters.
The final configuration—RGB + IR + EL with HFOA—demonstrates the best performance across all metrics: mAP (0.89), IoU (0.86), and F1-score (0.90), as shown in Table 6. This underscores the synergistic benefit of combining all three modalities along with data-driven optimization of the segmentation model, yielding a robust, high-precision pipeline for real-world PV fault detection.
TABLE 6
| Defect type | IoU (mean ± std) | F1-score (mean ± std) | Precision (mean ± std) | Recall (mean ± std) |
|---|---|---|---|---|
| Cracks | 0.87 ± 0.02 | 0.90 ± 0.01 | 0.92 ± 0.01 | 0.88 ± 0.02 |
| Soiling | 0.83 ± 0.02 | 0.86 ± 0.02 | 0.88 ± 0.02 | 0.85 ± 0.02 |
| Hot Spots | 0.79 ± 0.03 | 0.81 ± 0.02 | 0.84 ± 0.02 | 0.78 ± 0.03 |
Segmentation metrics per defect class.
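The per-class scores in Table 6 (IoU, F1, precision, recall) follow directly from pixel-level true/false positive counts on binary masks. The sketch below, with hypothetical toy masks, shows how each metric is computed; it is a generic reference implementation rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def mask_metrics(pred, gt):
    """Pixel-level IoU, precision, recall, and F1 for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # correctly detected pixels
    fp = np.logical_and(pred, ~gt).sum()       # spurious detections
    fn = np.logical_and(~pred, gt).sum()       # missed defect pixels
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1

# toy example: 4 overlapping pixels, 1 false positive, 1 false negative
pred = np.zeros((4, 4), int); pred[1:3, 1:3] = 1; pred[0, 0] = 1
gt = np.zeros((4, 4), int);   gt[1:3, 1:3] = 1;   gt[3, 3] = 1
iou, p, r, f1 = mask_metrics(pred, gt)
```

With these toy masks, tp = 4, fp = 1, and fn = 1, giving IoU = 4/6 and F1 = 0.8; averaging such scores over all instances of a class yields the mean ± std entries reported in the table.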
3.2 Ablation study
Figure 9 and Table 7 present an ablation study to quantify the individual impact of critical components in the proposed multimodal PV defect detection pipeline. The full model includes three synergistic enhancements: multimodal fusion (RGB + IR + EL), HawkFish Optimization Algorithm (HFOA), and a spatial attention mechanism integrated into the Mask R-CNN architecture.
FIGURE 9
TABLE 7
| Configuration | mAP (mean ± std) | IoU (mean ± std) | F1-score (mean ± std) |
|---|---|---|---|
| Full model (fusion + HFOA + attn) | 0.89 ± 0.01 | 0.86 ± 0.02 | 0.90 ± 0.01 |
| No HFOA | 0.84 ± 0.02 | 0.80 ± 0.02 | 0.85 ± 0.01 |
| No EL Images | 0.81 ± 0.02 | 0.76 ± 0.02 | 0.82 ± 0.02 |
| No IR Images | 0.80 ± 0.02 | 0.75 ± 0.01 | 0.81 ± 0.02 |
| No Attention Module | 0.83 ± 0.01 | 0.78 ± 0.02 | 0.84 ± 0.01 |
Ablation study of proposed pipeline components.
When any of these components is removed, the overall performance degrades, demonstrating their individual importance. Excluding HFOA (second bar) leads to a noticeable drop in all metrics—mAP falls from 0.89 to 0.84—highlighting the crucial role of automated hyperparameter tuning. Removing the EL modality results in a sharper decline (mAP = 0.81), confirming its value in capturing subsurface defects like micro-cracks that are invisible to RGB and IR.
This ablation analysis affirms that each module significantly contributes to the overall segmentation quality and robustness, with the full configuration achieving the highest precision and recall for real-world PV defect scenarios.
3.3 Convergence behavior of HFOA
Figure 10 illustrates the convergence trajectories of the HawkFish Optimization Algorithm (HFOA) against three benchmark optimizers (Particle Swarm Optimization (PSO) (Kumar et al., 2025), Genetic Algorithm (GA) (Jlifi et al., 2025), and Random Search (Yu et al., 2024)) over 40 iterations. The y-axis represents the normalized fitness score, defined as 1 − segmentation loss, which captures both accuracy and generalization in the Mask R-CNN tuning process. As shown, HFOA consistently achieves faster and smoother convergence. It reaches a near-optimal fitness plateau by iteration 28, while PSO and GA require significantly more iterations to approach similar values. Random Search exhibits erratic behavior with the lowest final fitness, underscoring its inefficiency for high-dimensional hyperparameter spaces. The superior performance of HFOA is attributed to its hybrid strategy combining Lévy-flight-based exploration, energy-aware attraction dynamics, and diversity preservation via memory crowding. These mechanisms prevent premature convergence and promote robust exploration-exploitation trade-offs.
FIGURE 10
To ensure that the observed advantage of the HawkFish Optimization Algorithm (HFOA) over Particle Swarm Optimization (PSO), Genetic Algorithm (GA), and Random Search is not due to a single favorable run, we conducted a systematic benchmark across multiple independent trials. Each optimizer was executed five times with different random seeds, using the same hyperparameter search space, population size, and maximum iteration budget. Specifically, HFOA, PSO, and GA were all configured with a population size of 30 and a maximum of 40 iterations, resulting in 1,200 candidate evaluations per run. Random Search was allowed the same total number of evaluations (1,200) without structured population updates. In all cases, the objective was the validation mAP of the Mask R-CNN segmentation model, computed on the held-out validation set described in Section 2.1.2.
Table 8 summarizes the optimization results in terms of the best validation mAP achieved, the number of iterations required to reach 95% of the final best mAP (convergence speed), and the average wall-clock time per run on a single GPU. Results are reported as mean ± standard deviation over the five runs. HFOA consistently attains the highest validation mAP and reaches near-optimal performance in fewer iterations than PSO and GA, while maintaining a comparable runtime per run. Random Search, despite having the same evaluation budget, exhibits the lowest final mAP and the largest variability across runs, confirming its inefficiency in this high-dimensional hyperparameter space. A paired t-test between HFOA and the best-performing baseline (PSO) shows that the mAP improvement is statistically significant (p < 0.01). These findings support the conclusion that the performance gains observed in Figure 10 originate from the optimization strategy itself rather than from an unfair computational advantage or an isolated favorable run.
TABLE 8
| Optimizer | Population size | Max iterations | Total evaluations | Best validation mAP (↑) | Iterations to 95% best mAP (↓) | Runtime per run (min) |
|---|---|---|---|---|---|---|
| HFOA | 30 | 40 | 1,200 | 0.89 ± 0.01 | 28 ± 3 | 146 ± 5 |
| PSO | 30 | 40 | 1,200 | 0.87 ± 0.01 | 34 ± 4 | 144 ± 6 |
| GA | 30 | 40 | 1,200 | 0.86 ± 0.02 | 36 ± 5 | 152 ± 7 |
| Random Search | – | – | 1,200 | 0.84 ± 0.02 | – | 140 ± 4 |
Statistical comparison of HFOA and baseline optimizers for Mask R-CNN hyperparameter tuning.
3.4 Cross-modality alignment quality
Figure 11 quantifies the effectiveness of the alignment refinement stage within the proposed multimodal fusion pipeline. The mean reprojection error—calculated between matched keypoints across modalities—is used to assess geometric misalignment before and after applying the Enhanced Correlation Coefficient (ECC) refinement step. Prior to refinement, the average reprojection error across RGB–IR and RGB–EL image pairs was 2.85 pixels, with a standard deviation of 0.94 pixels, occasionally exceeding the 3-pixel tolerance threshold.
FIGURE 11
This misalignment, if uncorrected, could lead to defective fusion and degraded segmentation performance due to inconsistent spatial features. After ECC-based alignment correction, the mean error decreased to 1.57 pixels, with lower variability (standard deviation 0.61 pixels). Table 9 summarizes the effectiveness of the ECC refinement step applied to the RGB–EL modality pairs. The mean reprojection error decreases from 2.85 px to 1.57 px after refinement, indicating a substantial improvement in geometric consistency between modalities. The reduction in standard deviation (from 0.94 px to 0.61 px) further suggests that ECC not only improves average alignment but also stabilizes the alignment quality across samples.
TABLE 9
| Metric | Before ECC | After ECC |
|---|---|---|
| Mean Reprojection Error (px) | 2.85 | 1.57 |
| Std. Dev. (px) | 0.94 | 0.61 |
| Samples Refined (%) | 21.80% | — |
Cross-modality alignment quality.
The metric “Samples Refined (%)” corresponds to the proportion of all RGB–EL pairs in the dataset for which ECC successfully reduced the reprojection error, computed as (number of improved samples ÷ total number of samples) × 100. Thus, the reported value of 21.80% reflects the percentage of the full dataset that exhibited measurable improvement after ECC refinement. Samples already well aligned by homography showed negligible change and are therefore not counted in this percentage.
This substantial improvement confirms that the refinement stage successfully resolves keypoint mismatches and intensity shifts, particularly in field-collected datasets where sensor jitter or perspective distortions are common. The reduced geometric error ensures that surface-level cues from RGB and IR modalities are well-aligned with sub-surface patterns in EL imagery, enhancing the integrity of the fused input tensor and ultimately leading to more accurate defect segmentation.
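The mean reprojection error used throughout this section can be made concrete with a short numpy sketch: matched source keypoints are warped by the estimated homography and compared against their destination matches. The keypoints and the pure-translation homography below are hypothetical toy data, not values from the datasets.

```python
import numpy as np

def reproject(H, pts):
    """Apply a 3x3 homography to an N x 2 array of points
    (homogeneous coordinates, then perspective divide)."""
    ones = np.ones((len(pts), 1))
    proj = (H @ np.hstack([pts, ones]).T).T
    return proj[:, :2] / proj[:, 2:3]

def mean_reprojection_error(H, src_pts, dst_pts):
    """Mean Euclidean distance between warped source keypoints and their
    matched destination keypoints (the alignment metric of Table 9)."""
    err = np.linalg.norm(reproject(H, src_pts) - dst_pts, axis=1)
    return float(err.mean())

# hypothetical matched keypoints and a pure-translation homography
src = np.array([[10.0, 10.0], [50.0, 20.0], [30.0, 60.0]])
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0],
              [0.0, 0.0, 1.0]])
dst = src + np.array([2.0, -1.0])   # perfectly matched under H: error 0.0
e = mean_reprojection_error(H, src, dst)
```

In the actual pipeline, the same error is evaluated on the homography estimate before ECC refinement and on the ECC-refined warp, yielding the before/after values of 2.85 px and 1.57 px reported above.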
Table 10 compares the performance of the proposed multimodal Mask R-CNN framework against widely used segmentation models. Traditional U-Net and DeepLabV3+ show solid performance on PV defect imagery, with DeepLabV3+ outperforming U-Net due to its stronger multiscale feature extraction. SegFormer-B2 achieves the highest performance among the baselines, reflecting the capacity of transformer-based architectures to capture global context. The proposed method delivers competitive or superior accuracy, achieving an IoU of 86% and a Boundary F1 of 89.4%, which demonstrates its strength in capturing crack edges and fine structural details.
TABLE 10
| Model | Dice (%) | IoU (%) | Boundary F1 (%) |
|---|---|---|---|
| U-Net | 88.3 | 80.7 | 78.9 |
| DeepLabV3+ | 86.1 | 83.5 | 85.3 |
| SegFormer-B2 | 89.4 | 85.2 | 87.1 |
| Proposed | 90.0 | 86.1 | 89.4 |
Segmentation performance on the RGB PV defect dataset.
4 Discussion
The results presented in Section 3 demonstrate the efficacy and robustness of the proposed multimodal defect detection framework for photovoltaic (PV) modules. By integrating RGB, infrared, and electroluminescence (EL) modalities into a unified segmentation architecture—enhanced through the HawkFish Optimization Algorithm (HFOA)—the system consistently outperforms baseline models in both quantitative metrics and qualitative precision. The observed improvements across multiple performance indicators underscore the value of each component within the pipeline. The proposed method achieved the highest overall accuracy (0.978), F1-score (0.93), and recall (0.95) when compared to recent state-of-the-art approaches such as Venkatesh et al. (2022), Munawer Al-Otum (2023), Chen et al. (2022), and Wang X. et al. (2023). These results are not only statistically superior but also operationally significant, as higher recall directly translates to fewer missed defects—an essential feature for field-level deployment where undetected cracks or soiling could result in long-term energy losses. The low mean squared error (MSE = 0.021) further indicates that the model generalizes well and maintains high pixel-level fidelity when generating defect masks. Ablation studies confirm that each design choice contributes meaningfully to performance.
Removing the EL or IR modality from the input tensor led to significant reductions in mAP and F1-score, validating the hypothesis that multimodal fusion captures complementary visual and thermal signatures of defects that single modalities fail to isolate. Likewise, disabling the spatial attention module impaired segmentation sharpness, particularly for small-scale cracks and microstructural anomalies.
Most notably, bypassing HFOA in favor of manual hyperparameter tuning resulted in a measurable drop in detection quality, underscoring the optimizer’s role in fine-tuning both fusion weights and network architecture. The convergence analysis further highlights the advantage of using HFOA over traditional metaheuristics. While Genetic Algorithms and Particle Swarm Optimization displayed moderate convergence trends, HFOA reached stability more rapidly and consistently, owing to its dual-attractor memory mechanism and Lévy-flight–based exploration strategy. This not only enhanced the final model’s accuracy but also reduced training cycles, which is beneficial for computational efficiency and scalability. From a visual standpoint, qualitative overlays between RGB inputs and predicted segmentation masks confirmed precise localization of defect regions. The model successfully identified complex and overlapping anomalies—such as soiling adjacent to thermal hotspots—without producing redundant or fragmented masks. This spatial accuracy is critical in scenarios involving automated maintenance scheduling or UAV-based inspections, where actionable insights must be extracted from a single pass.
Figure 12 and Table 11 compare the proposed multimodal fusion framework, optimized using the HawkFish Optimization Algorithm (HFOA), against four state-of-the-art studies: Venkatesh et al. (2022), Munawer Al-Otum (2023), Chen et al. (2022), and Wang X. et al. (2023). Key evaluation metrics include Accuracy, Mean Squared Error (MSE), F1-Score, Recall, and Precision. Among the benchmarks, Venkatesh et al. (2022) achieve the highest accuracy (0.963) using a deep ensemble learning network on PV module images, while Munawer Al-Otum (2023) reports solid F1-scores via a deep learning-based automated defect classification system in EL images. Chen et al. (2022) delivers competitive results using a bidirectional-path feature pyramid attention detector, and Wang X. et al. (2023) demonstrates strong precision and real-time capability with the BL-YOLOv8 model for defect detection.
FIGURE 12
TABLE 11
| Study | Accuracy | MSE | F1-score | Recall | Precision |
|---|---|---|---|---|---|
| Venkatesh et al. (2022) | 0.963 | 0.028 | 0.9 | 0.91 | 0.89 |
| Munawer Al-Otum (2023) | 0.952 | 0.03 | 0.91 | 0.92 | 0.9 |
| Chen et al. (2022) | 0.948 | 0.031 | 0.9 | 0.89 | 0.91 |
| Wang X. et al. (2023) | 0.95 | 0.03 | 0.9 | 0.91 | 0.92 |
| Proposed Method | 0.978 | 0.021 | 0.93 | 0.95 | 0.96 |
Metrics comparison for state of the art methods and the Proposed Method.
However, the proposed method outperforms all baselines across the board. It achieves the highest accuracy (0.978), lowest MSE (0.021), and top-tier F1-score (0.93), recall (0.95), and precision (0.96). These improvements are attributed to the synergy of multimodal fusion (RGB, IR, EL), instance-level segmentation via Mask R-CNN, and the automated hyperparameter tuning by HFOA, which collectively enhance both localization precision and class-wise reliability.
Moreover, the robustness of the proposed alignment strategy was demonstrated through alignment error analysis. Using SIFT-based homography followed by ECC refinement yielded a consistent reduction in reprojection error, enhancing the integrity of multimodal fusion.
Figure 13 presents qualitative examples illustrating the effect of ECC refinement on RGB–EL alignment for both a typical and a challenging sample. In the first row (“Good/typical case”), the EL image warped using homography alone exhibits noticeable misalignment with the RGB reference: cell boundaries, busbars, and crack contours appear shifted relative to their corresponding structures in the RGB image. After ECC refinement, these features become significantly better aligned, demonstrating improved geometric correspondence consistent with the quantitative reprojection error reduction (from 2.85 px to 1.57 px). The second row (“Difficult case”) shows a sample with weak texture and large defect regions, where homography-only alignment produces substantial mismatch. ECC refinement still reduces the misalignment but cannot fully correct all geometric inconsistencies (an expected limitation when intensity patterns are not strongly correlated across modalities).
FIGURE 13
The most directly comparable work is Lai et al. (2025), which combines RGB, IR, and EL images but relies on simple feature concatenation and performs only defect classification rather than pixel-level segmentation. Their method does not incorporate geometric alignment or adaptive weighting between modalities, limiting its ability to handle cross-sensor variability. Reference (Lin et al., 2022) demonstrates the effectiveness of infrared–visible fusion in a different domain (ADAS), showing the value of thermal information but lacking EL data and PV-specific considerations. In contrast, the proposed method introduces a complete multimodal segmentation pipeline with ECC alignment, a dedicated Fusion Attention Block, and HFOA-driven optimization. This integration enables better exploitation of structural (RGB), thermal (IR), and subsurface (EL) cues, resulting in more accurate and robust PV defect localization than previous multimodal approaches as summarized in Table 12:
TABLE 12
| Study | Modalities used | Fusion strategy | Strengths | Limitations | Comparison to proposed method |
|---|---|---|---|---|---|
| Lai et al. (2025) | RGB + IR + EL | Early feature concatenation; CNN-based classifier | True multimodal PV dataset; captures thermal and subsurface defects | Limited alignment; no attention fusion; no segmentation (classification only) | Proposed method adds ECC alignment, attention-based fusion, and full pixel-level segmentation |
| Lin et al. (2022) (cross-domain) | Visible + IR | Backbone-level feature fusion to enhance object detection | Demonstrates benefits of thermal + RGB fusion in noisy scenes | Not designed for PV defects; no EL modality; no pixel segmentation | Proposed method extends this concept to PV by integrating EL, performing alignment, and optimizing fusion via HFOA |
| Proposed Method | RGB + IR + EL | ECC alignment + Fusion Attention Block + Mask R-CNN segmentation + HFOA optimization | Robust alignment, adaptive modality weighting, superior segmentation accuracy, optimized hyperparameters | Requires statistical pairing until real multimodal datasets are available | Outperforms existing multimodal approaches by addressing alignment, fusion, and optimization jointly |
Comparison of multimodal fusion techniques relevant to PV inspection.
The integration of uncertainty masks into the training process further improved segmentation reliability by down-weighting ambiguous regions during learning. This pipeline-level resilience ensures that the system remains effective even under real-world environmental variability, such as wind-induced vibration, sensor drift, or inconsistent lighting conditions (Almukhtar, 2025).
However, despite its strong performance, the proposed method is not without limitations. First, the alignment pipeline assumes relatively planar module surfaces; extreme geometric distortions or curved surfaces may violate the homography assumption, leading to residual misalignment. Second, while HFOA optimizes hyperparameters effectively, its performance is still dependent on the initial population diversity and the quality of surrogate training during fitness evaluation. In rare cases, local optima may still trap the search process. Third, the model requires sufficient annotated training data across all modalities. In real deployments, acquiring aligned RGB, IR, and EL datasets with pixel-level ground truth can be logistically complex and labor-intensive. Finally, although the method generalizes well across the tested datasets, its robustness in highly dynamic environments (such as real-time UAV deployment under extreme weather) has not been exhaustively validated.
5 Conclusions and recommendations
This study introduced a multimodal PV defect segmentation framework that fuses RGB, infrared (IR), and electroluminescence (EL) imagery through a feature-level Fusion Attention Block embedded in a modified Mask R-CNN architecture. A robust alignment pipeline combining homography and ECC refinement ensured geometric consistency across modalities, while hyperparameters and fusion behavior were automatically tuned using the HawkFish Optimization Algorithm (HFOA). The proposed method achieved strong performance, with 86.1% IoU, 90.0% Dice, and 89.4% Boundary F1, showing clear improvements over U-Net (80.7% IoU) and DeepLabV3+ (83.5% IoU) and performing competitively with SegFormer-B2. Ablation studies further demonstrated that multimodal fusion contributed 4%–6% gains in IoU, while HFOA provided an additional 2%–3% improvement through optimized fusion and network parameters.
For future work, several practical extensions are envisioned. Real-time deployment will be explored by optimizing the inference pipeline and integrating lighter backbones for onboard processing. Hardware-level integration, such as embedding the model into portable inspection devices or thermal-RGB camera systems, represents another key direction. Additionally, drone-based multimodal inspection platforms offer significant potential for large-scale PV farm monitoring, enabling automated, high-coverage defect scanning in the field.
Statements
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author contributions
NA: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. SK: Project administration, Supervision, Writing – original draft, Writing – review and editing.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was used in the creation of this manuscript. QuillBot and Grammarly, both of which include AI capabilities, were used for proofreading.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Akram, M. W., Li, G. Q., Jin, Y., and Chen, X. (2022). Failures of photovoltaic modules and their detection: a review. Appl. Energy 313, 118822. doi: 10.1016/j.apenergy.2022.118822
2. Alkharsan, A., and Ata, O. (2025). HawkFish optimization algorithm: a gender-bending approach for solving complex optimization problems. Electronics 14, 611. doi: 10.3390/electronics14030611
3. Almukhtar, N. (2025). MRCNN-for-PV-segmentation. GitHub repository. Available online at: https://github.com/AsharfNadir88/MRCNN-for-PV-segemntation (Accessed December 6, 2025).
4. Animashaun, D., and Hussain, M. (2023). Automated micro-crack detection within photovoltaic manufacturing facility via ground modelling for a regularized convolutional network. Sensors 23, 6235. doi: 10.3390/s23136235
5. Chen, H. Y., Zhao, P., and Yan, H. W. (2021). Crack detection based on multi-scale faster RCNN with attention. Opto-Electron. Eng. 48, 200112. doi: 10.12086/oee.2021.200112
6. Chen, H., Song, M., Zhang, Z., and Liu, K. (2022). Detection of surface defects in solar cells by bidirectional-path feature pyramid group-wise attention detector. IEEE Trans. Instrum. Meas. 71, 1–9. doi: 10.1109/tim.2022.3218111
7. Chen, X., Karin, T., Libby, C., Deceglie, M., Hacke, P., Silverman, T. J., et al. (2023). Automatic crack segmentation and feature extraction in electroluminescence images of solar modules. IEEE J. Photovolt. 13, 334–342. doi: 10.1109/JPHOTOV.2023.3249970
8. Chen, L., Yao, H., Fu, J., and Ng, C. T. (2023). The classification and localization of crack using lightweight convolutional neural network with CBAM. Eng. Struct. 275, 115291. doi: 10.1016/j.engstruct.2022.115291
9. de Gioia, F., Meoni, G., Giuffrida, G., Donati, M., and Fanucci, L. (2020). A robust RANSAC-based planet radius estimation for onboard visual based navigation. Sensors 20, 4041. doi: 10.3390/s20144041
10. Demir, A., and Necati, A. (2024). Defect detection in solar panels using a customized 2D CNN: a study on the ELPV dataset.
11. El Yanboiy, N. (2024). "Enhancing the reliability and efficiency of solar systems through fault detection in solar cells using electroluminescence (EL) images and YOLO version 5.0 algorithm," in Sustainable and green technologies for water and environmental management (Springer), 35–43.
12. Hassan, S., and Dhimish, M. (2023). Dual spin max-pooling convolutional neural network for solar cell crack detection. Sci. Rep. 13, 11099. doi: 10.1038/s41598-023-38177-8
13. He, B., Lu, H., Zheng, C., and Wang, Y. (2022). Characteristics and cleaning methods of dust deposition on solar photovoltaic modules: a review. Energy 263, 126083. doi: 10.1016/j.energy.2022.126083
14. Jia, Y., Chen, G., and Zhao, L. (2024). Defect detection of photovoltaic modules based on improved VarifocalNet. Sci. Rep. 14, 15170. doi: 10.1038/s41598-024-66234-3
15. Jlifi, B., Ferjani, S., and Duvallet, C. (2025). A genetic algorithm based three HyperParameter optimization of deep long short term memory (GA3P-DLSTM) for predicting electric vehicles energy consumption. Comput. Electr. Eng. 123 (Part C), 110185. doi: 10.1016/j.compeleceng.2025.110185
16. Kumar, N., Raji, J., Sridevi, S., Irfan, M. M., Rajeshwari, R., and Inbamani, A. (2025). "A PSO tuned CNN approach for accurate fault detection in PV grid systems," in 2025 IEEE 14th International Conference on Communication Systems and Network Technologies (CSNT), Bhopal, India, 1257–1262. doi: 10.1109/CSNT64827.2025.10968346
17. Lai, Y.-S., Hsieh, C.-C., Liao, T.-W., Huang, C.-Y., and Kuo, C.-F. J. (2025). Deep learning-based automatic defect detection of photovoltaic modules in infrared, electroluminescence, and red–green–blue images. Energy Convers. Manag. 332, 119783. doi: 10.1016/j.enconman.2025.119783
18. Li, X. X., Yang, Q., Lou, Z., and Yan, W. J. (2019). Deep learning based module defect analysis for large-scale photovoltaic farms. IEEE Trans. Energy Convers. 34, 520–529. doi: 10.1109/TEC.2018.2873358
19. Libra, M., Mrázek, D., Tyukhov, I., Severová, L., Poulek, V., Mach, J., et al. (2023). Reduced real lifetime of PV panels – economic consequences. Sol. Energy 259, 229–234. doi: 10.1016/j.solener.2023.04.063
20. Lin, Y.-C., Chiang, P.-Y., and Miaou, S.-G. (2022). Enhancing deep-learning object detection performance based on fusion of infrared and visible images in advanced driver assistance systems. IEEE Access 10, 105214–105231. doi: 10.1109/ACCESS.2022.3211267
21. Liu, X., Goh, H. H., Xie, H., He, T., Yew, W. K., Zhang, D., et al. (2025). ResGRU: a novel hybrid deep learning model for compound fault diagnosis in photovoltaic arrays considering dust impact. Sensors 25, 1035. doi: 10.3390/s25041035
22. Lu, S., Wu, K., and Chen, J. (2023). Solar cell surface defect detection based on optimized YOLOv5. IEEE Access 11, 1. doi: 10.1109/ACCESS.2023.3294344
23. Mathew, D., Ram, J. P., and Kim, Y.-J. (2023). Unveiling the distorted irradiation effect (shade) in photovoltaic (PV) power conversion – a critical review on causes, types, and its minimization methods. Sol. Energy 266, 112141. doi: 10.1016/j.solener.2023.112141
24. Mohamed, A., Nacera, Y., Ahcene, B., Teta, A., Belabbaci, E. O., Rabehi, A., et al. (2025). Optimized YOLO based model for photovoltaic defect detection in electroluminescence images. Sci. Rep. 15, 32955. doi: 10.1038/s41598-025-13956-7
25. Munawer Al-Otum, H. (2023). Deep learning-based automated defect classification in electroluminescence images of solar panels. Adv. Eng. Inf. 58, 102147. doi: 10.1016/j.aei.2023.102147
26. Neurobotdata (2021). Photovoltaic panel defect dataset. Available online at: https://www.kaggle.com/datasets/neurobotdata/photovoltaic-panel-defect-dataset (Accessed August 13, 2025).
27. Patel, A. V., McLauchlan, L., and Mehrubeoglu, M. (2020). "Defect detection in PV arrays using image processing," in 2020 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 1653–1657. doi: 10.1109/CSCI51800.2020.00304
28. Rabaia, M. K. H., Abdelkareem, M. A., Sayed, E. T., Elsaid, K., Chae, K. J., Wilberforce, T., et al. (2021). Environmental impacts of solar energy systems: a review. Sci. Total Environ. 754, 141989. doi: 10.1016/j.scitotenv.2020.141989
29. Singh, O. D., Gupta, S., and Dora, S. (2023). Segmentation technique for the detection of micro cracks in solar cell using support vector machine. Multimed. Tools Appl. 82, 1–26. doi: 10.1007/s11042-023-14509-8
30. Sodhi, M., Banaszek, L., Magee, C., and Rivero-Hudec, M. (2022). Economic lifetimes of solar panels. Procedia CIRP 105, 782–787. doi: 10.1016/j.procir.2022.02.130
31. Tao, Y. C., Xu, Z. Y., Liu, Q. H., Li, L. H., and Zhang, Y. X. (2021). "Improved faster R-CNN algorithm for defect detection of electromagnetic luminescence," in Tenth International Symposium on Precision Mechanical Measurements. doi: 10.1117/12.2617320
32. Tao, Y., Yu, T., and Yang, J. (2025). Photovoltaic array fault diagnosis and localization method based on modulated photocurrent and machine learning. Sensors 25, 136. doi: 10.3390/s25010136
33. Venkatesh, S., Naveen, N., Jeyavadhanam, B., Sizkouhi, A. M., Esmailifar, S., Aghaei, M., et al. (2022). Automatic detection of visual faults on photovoltaic modules using deep ensemble learning network. Energy Rep. 8, 14382–14395. doi: 10.1016/j.egyr.2022.10.427
34. Wang, J., Bi, L., Sun, P., Jiao, X., Ma, X., Lei, X., et al. (2023). Deep-learning-based automatic detection of photovoltaic cell defects in electroluminescence images. Sensors 23, 297. doi: 10.3390/s23010297
35. Wang, X., Gao, H., Jia, Z., and Li, Z. (2023). BL-YOLOv8: an improved road defect detection model based on YOLOv8. Sensors 23, 8361. doi: 10.3390/s23208361
36. Waqar Akram, M., and Bai, J. (2025). Defect detection in photovoltaic modules based on image-to-image generation and deep learning. Sustain. Energy Technol. Assessments 82, 104441. doi: 10.1016/j.seta.2025.104441
37. Yu, J., Qian, S., and Chen, C. (2024). Lightweight crack automatic detection algorithm based on TF-MobileNet. Appl. Sci. 14, 9004. doi: 10.3390/app14199004
38. Zhang, J., Chen, X., Wei, H., and Zhang, K. (2024). A lightweight network for photovoltaic cell defect detection in electroluminescence images based on neural architecture search and knowledge distillation. Appl. Energy 355, 122184. doi: 10.1016/j.apenergy.2023.122184
Summary
Keywords
photovoltaic defect detection, multimodal fusion, mask R-CNN, electroluminescence imaging, infrared thermography, RGB imagery, HawkFish optimization algorithm (HFOA)
Citation
Almukhtar N and Kurnaz S (2026) Multimodal fusions for defect detection of photovoltaic panels by mask R-CNN and hawkfish optimization algorithm. Front. Earth Sci. 13:1702396. doi: 10.3389/feart.2025.1702396
Received
30 September 2025
Revised
06 December 2025
Accepted
12 December 2025
Published
23 January 2026
Volume
13 - 2025
Edited by
Pranav Mehta, Dharamsinh Desai University, India
Reviewed by
Divyang Bohra, Dharamsinh Desai University, India
Jaymin Patel, Dharamsinh Desai University, India
Updates
Copyright
© 2026 Almukhtar and Kurnaz.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Nazar Almukhtar, nazaralmukhtar8@gmail.com, 203720461@ogr.altinbas.edu.tr