
ORIGINAL RESEARCH article

Front. Astron. Space Sci., 14 October 2025

Sec. Extragalactic Astronomy

Volume 12 - 2025 | https://doi.org/10.3389/fspas.2025.1656917

This article is part of the Research Topic: Strong Lensing Studies Towards Next Generation Surveys.

LenNet: direct detection and localization of strong gravitational lenses in wide-field sky survey images

Pufan Liu1,2, Hui Li1, Ziqi Li1,2, Xiaoyue Cao1,3,4, Rui Li1*, Hao Su5, Ran Li6, Nicola R. Napolitano7, Léon V. E. Koopmans8, Valerio Busillo9, Crescenzo Tortora9 and Liang Gao1,6
  • 1Institute for Astrophysics, School of Physics, Zhengzhou University, Zhengzhou, China
  • 2International College, Zhengzhou University, Zhengzhou, China
  • 3School of Astronomy and Space Science, University of Chinese Academy of Sciences, Beijing, China
  • 4National Astronomical Observatories, Chinese Academy of Sciences, Beijing, China
  • 5Department of Physics “E. Pancini”, University of Naples Federico II, Naples, Italy
  • 6School of Physics and Astronomy, Beijing Normal University, Beijing, China
  • 7Department of Physics “E. Pancini”, University Federico II, Naples, Italy
  • 8Kapteyn Astronomical Institute, University of Groningen, Groningen, Netherlands
  • 9INAF–Osservatorio Astronomico di Capodimonte, Naples, Italy

Strong gravitational lenses are invaluable for tackling fundamental astrophysics questions, such as the nature of dark matter and cosmic expansion. However, current sky surveys’ “crop-and-classify” lens search method faces a critical challenge: it creates massive computational and storage bottlenecks when dealing with billions of potential host galaxies, which is unsustainable for future large-scale surveys. To address this, we propose LenNet, an object detection model that directly identifies lenses in large, original survey images, eliminating the inefficient cropping step. LenNet is first trained on simulated data to learn gravitational lens features. Then, transfer learning is used to fine-tune it on a limited set of real, labeled samples from the Kilo-Degree Survey (KiDS). Experiments show LenNet performs exceptionally well on real survey data, validating its ability as an efficient and scalable solution for lens discovery in massive astronomical surveys. LenNet’s success in direct lens detection in large images resolves the computational and storage issues of traditional methods. The strategy of using simulated data for initial training and transfer learning with real KiDS data is effective, especially given limited real labeled data. Looking forward, LenNet can enable more efficient lens discovery in future large-scale surveys, accelerating research on dark matter and cosmic expansion.

1 Introduction

According to Einstein’s Theory of General Relativity, light from a distant source galaxy travels to the observer along geodesics in spacetime. If a foreground galaxy is closely aligned with such a source, its gravitational potential distorts the intervening spacetime, deflecting the light to form multiple images or extended arcs (Schneider et al., 1992). This phenomenon, analogous to optical lensing, is known as galaxy-galaxy strong gravitational lensing (GGSL) and constitutes a powerful astrophysical tool (Shajib et al., 2024; Treu, 2010). GGSL provides robust estimates of the lens galaxy’s mass distribution (Koopmans et al., 2006; Auger et al., 2010; Sonnenfeld et al., 2015; Shajib et al., 2021; Sheu et al., 2025), including its dark matter subhalos (Vegetti et al., 2012; Nightingale et al., 2024; Ballard et al., 2024; Cao et al., 2025; He et al., 2025); when combined with stellar population synthesis models, these measurements yield constraints on the galaxy’s initial mass function (e.g., Sonnenfeld et al., 2015; Li et al., 2025). Furthermore, GGSL acts as a natural “cosmic telescope”, magnifying background sources to enable the study of their structure (Dye et al., 2015; Shu et al., 2016b; Cheng et al., 2020; Rizzo et al., 2020; Li et al., 2024a), which would otherwise be beyond the capabilities of current instruments. Finally, because light propagates through the cosmological background, GGSL serves as an independent cosmographic probe when the background source exhibits temporal variability (Treu et al., 2022; Birrer et al., 2024) or when multiple sources lie at different redshifts (Gavazzi et al., 2008; Collett and Auger, 2014; Euclid Collaboration et al., 2025a).

GGSL is a rare astrophysical event that necessitates precise alignment, typically within a few arcseconds, between foreground and background galaxies. Consequently, identifying individual lenses within contemporary sky surveys often requires extensive visual inspection of numerous galaxy images, rendering the development of automated detection methods imperative, particularly as forthcoming surveys will catalogue billions of galaxies (e.g., LSST Science Collaboration et al., 2009). Traditional automatic lens-finding approaches include spectrum-based techniques (Bolton et al., 2006; Shu et al., 2016a; Cao et al., 2020), which identify anomalous emission lines in foreground galaxies indicative of background sources, and image-based methods (Cabanac et al., 2006; More et al., 2012; Nightingale et al., 2025), which directly inspect galaxy images for characteristic lensing features such as bluish arcs. While these conventional methods have successfully confirmed several hundred GGSLs, recent advances in deep learning, particularly convolutional neural networks (CNNs), have significantly enhanced lens discovery capabilities, yielding thousands of new candidates (e.g., Petrillo et al., 2017; Jacobs et al., 2019; Li et al., 2021; Nagam et al., 2025). Nevertheless, existing lens samples are still insufficient, in both size and completeness, for many strong-lensing science applications (Shajib et al., 2024). Ongoing and upcoming telescope facilities, including Euclid (Euclid Collaboration et al., 2025b), the China Space Station Telescope (CSST, Zhan, 2021), the Nancy Grace Roman Space Telescope (Spergel et al., 2015), and the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST, Ivezić et al., 2019), are expected to detect hundreds of thousands of GGSLs (Collett, 2015; Cao et al., 2024; Wedig et al., 2025). This anticipated surge in available data promises to revolutionise the field, but concurrently underscores the critical need to develop more efficient and robust automated lens-finding algorithms to fully exploit this unprecedented opportunity.

The conventional pipeline for identifying gravitational lenses using neural networks typically comprises three main stages. First, potential lens candidates, such as massive elliptical galaxies, are pre-selected from a parent source catalogue generated by software such as Source Extractor (Bertin and Arnouts, 1996). Second, small cutout images centred on these candidates are produced. Third, a neural network, commonly a CNN, classifies each cutout to determine if it contains a strong lens. This methodology has been extensively employed in various large-scale surveys (e.g., Petrillo et al., 2019; Li et al., 2021; Euclid Collaboration et al., 2025b), achieving notable success. However, this approach has several limitations. Firstly, in crowded fields, the source-detection algorithm may fail, leading to cutouts that blend multiple neighbouring objects. Such blending can hinder the network’s ability to accurately recognise genuine lensing features. Secondly, choosing an appropriate cutout size involves a critical trade-off that directly influences detection accuracy. If the cutout is too small, lensed arcs of systems with large Einstein radii may lie outside the field of view, resulting in missed detections. Conversely, overly large cutouts may include nearby contaminating objects, especially in crowded regions, potentially causing the network to misidentify these contaminants as lensing features, thus generating false positives. Thirdly, the cutout-based approach inherently neglects information about the broader environmental context of the lens system. This context can provide subtle yet valuable clues for lens identification; for example, GGSLs are more likely to occur within dense galaxy cluster environments (Oguri et al., 2005). Finally, the pre-selection stage can systematically exclude rare yet scientifically interesting systems, such as those with very low mass (Shu et al., 2017) or disk lensing galaxies (Treu et al., 2011).

To address the limitations inherent in conventional neural-network-based gravitational lens finders, we introduce LenNet, a neural network specifically designed to identify GGSLs directly from full-survey images. This approach eliminates the necessity of pre-selecting target sources and extracting individual image cutouts1. LenNet incorporates advanced feature-extraction techniques explicitly tailored for gravitational lens identification. Initially trained on simulated lensing datasets, LenNet subsequently employs transfer learning to improve its generalisation capabilities for application to real observational data. On mock lensing datasets, LenNet achieves a precision of 96.95 per cent with a recall of 97.27 per cent, while on real KiDS lensing datasets, it attains a precision of 97.91 per cent with a recall of 89.47 per cent. Additionally, we compare the performance of LenNet with several classical object detection neural networks (e.g., YOLO; Khanam and Hussain, 2024; Redmon and Farhadi, 2018) and find that LenNet demonstrates superior performance.

This paper is structured as follows. Section 2 describes the methodology used to generate the simulated dataset. Section 3 details the architecture, key components, and training procedure of LenNet. Section 4 presents the results, including performance evaluations of LenNet on both simulated and real observational data. Finally, Section 5 summarises our conclusions.

2 Dataset generation

2.1 KiDS DR4

The Kilo-Degree Survey (KiDS) is a Stage-III optical imaging survey conducted using the OmegaCAM wide-field camera on the VLT Survey Telescope (VST) at the ESO Paranal Observatory (Kuijken et al., 2015). Initiated in 2011, KiDS was designed to map a large area of the extragalactic sky with exceptional image quality. Although its primary scientific goal is weak gravitational lensing for cosmological studies, the survey's deep, multi-band imaging data provide a rich resource for a broad range of astrophysical investigations, including strong lensing, galaxy evolution, and the characterisation of galaxy clusters. The KiDS footprint covers 1,350 deg², divided into two large patches on the North and South Galactic Caps. The fourth data release (DR4) encompasses 1,006 deg² of observations in four broad-band optical filters (u, g, r, and i; Kuijken et al., 2019). To enable precise galaxy shape measurements, observations in the r-band were prioritised for optimal seeing conditions, requiring a point-spread function (PSF) with a full-width at half-maximum (FWHM) of less than 0.8″; the median seeing achieved in the r-band is 0.7″. Furthermore, the instrument optics ensure a uniform PSF across the 1 deg² field of view, establishing KiDS as one of the sharpest large-area, ground-based surveys. The survey is notably deep, reaching 5σ limiting magnitudes in a 2″ aperture of u = 24.2, g = 25.1, r = 25.0, and i = 23.7. This depth, when combined with near-infrared data from the VISTA Kilo-Degree Infrared Galaxy (VIKING) survey (Edge et al., 2013), allows the galaxy population to be probed to a median redshift of z ≈ 0.8 and beyond.

2.2 Lens simulation

Identifying GGSLs using neural networks requires substantial amounts of training data. However, confirmed lenses remain scarce within the KiDS survey (Petrillo et al., 2019; Li et al., 2020; Grespan et al., 2024; Chowdhary et al., 2024). Therefore, we have developed a specialized GGSL simulation pipeline to generate realistic training samples. This method, described extensively in Li et al. (2021), involves injecting simulated lensing images into real KiDS galaxy observations, as detailed below.

It should be noted that lens-free images are not essential for the model training and validation in this study. On one hand, if lens-free images were to be used for model optimization, they would first require manual visual inspection to screen out valid samples free of interference and errors, a process that consumes significant labour and time and thereby reduces research efficiency. On the other hand, the core objective of this study is to improve the model's accuracy in recognizing lenses. Lens-free images do not contribute to the loss calculation, and their auxiliary role in helping the model learn the key features of lenses is minimal; even without them, the model maintains excellent detection capabilities.

Luminous Red Galaxies (LRGs) are selected from the KiDS DR4 catalogue to act as potential lens galaxies. Following the methodology of Li et al. (2021, Section 3.1), the selection is based on both colour and magnitude criteria, with a magnitude limit of r < 20.0 and colours satisfying a slightly adapted LRG colour–magnitude condition. This restricts the selection to massive galaxies, which are the most efficient deflectors2. The mass distribution of each LRG is modelled using a singular isothermal ellipsoid (SIE) profile, supplemented by an external shear component to account for environmental effects. The Einstein radius θE, ellipticity, and shear strength are randomly drawn from the distributions specified in Li et al. (2021). Background sources are constructed using morphological and colour information from the CosmoDC2 simulation catalogue (Korytov et al., 2019), with redshifts selected between 0.8 and 6 and satisfying the criterion zsource > zlens + 0.3. Each randomly selected source is then strongly lensed by a prospective foreground LRG within a large 501×501-pixel survey image (0.2″/pixel, corresponding to 100.2″×100.2″) to generate an idealised lensed image; the position of this prospective deflector is randomly assigned within the survey image. To replicate the spatial resolution of real observations, this idealised lensed image is subsequently convolved with the local point spread function (PSF) extracted from the corresponding region of the KiDS imaging data. The PSF-convolved lensed image is then injected into the KiDS survey image, centred on the prospective foreground LRG, thereby creating a realistic training sample. For the object detection task, the precise location of the injected lens must be recorded. We determine the coordinates of the lensed source by defining a bounding box that tightly encloses the full extent of the simulated lensed arcs or ring. These bounding box coordinates, together with the class label indicating a lens, constitute the ground truth for training the object detection model.

This simulation pipeline yielded 9,147 mock lenses with known labels for network training.

Notably, each simulated field contains only a single lens. The full sample of lenses was simulated with Einstein radii in the range of 1–5 arcsec, lens redshifts (zlens) from 0.08 to 0.80, and source redshifts (zsource) from 0.80 to 4.84. Figure 1 presents six representative examples, illustrating the complexity and realism achieved by our simulations. These simulations successfully reproduce the diverse morphologies of strong lenses, whilst preserving the noise characteristics, PSF effects, and galaxy populations typical of KiDS observations.
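To make the ray-shooting and labelling steps concrete, the sketch below lenses a simple background source through an isothermal deflector and derives the bounding-box label from the resulting arcs. It deliberately simplifies our pipeline: a singular isothermal sphere (SIS) replaces the SIE-plus-shear model, a circular Gaussian replaces the CosmoDC2 source morphologies, and the PSF convolution and image injection are omitted; all parameter values and the surface-brightness threshold are illustrative.

```python
import numpy as np

def lensed_image_and_bbox(theta_e=2.0, src_beta=(0.3, 0.1), src_sigma=0.3,
                          npix=501, pixscale=0.2, lens_center=(250, 250)):
    """Ray-shoot a circular Gaussian source through an SIS deflector and return
    the lensed image plus the tight pixel bounding box used as the label.
    All angles are in arcsec."""
    # Image-plane grid in arcsec, centred on the deflector
    y_idx, x_idx = np.mgrid[0:npix, 0:npix]
    x = (x_idx - lens_center[0]) * pixscale
    y = (y_idx - lens_center[1]) * pixscale

    # SIS deflection: alpha = theta_E * theta / |theta|
    r = np.hypot(x, y) + 1e-12
    ax, ay = theta_e * x / r, theta_e * y / r

    # Lens equation: beta = theta - alpha(theta)
    bx, by = x - ax, y - ay

    # Evaluate the Gaussian source on the source plane
    img = np.exp(-0.5 * ((bx - src_beta[0])**2 + (by - src_beta[1])**2) / src_sigma**2)

    # Bounding box of pixels above a surface-brightness threshold
    rows, cols = np.where(img > 0.05 * img.max())
    bbox = (int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max()))
    return img, bbox  # bbox = (x_min, y_min, x_max, y_max) in pixels

img, bbox = lensed_image_and_bbox()
print("Bounding-box label (pixels):", bbox)
```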

Figure 1
A composite image of six panels showing sections of the night sky filled with numerous stars and galaxies. Each panel highlights a central galaxy with a red box and includes an inset zooming in on the galaxy. The main background is dark, with scattered bright celestial bodies. The sizes of the insets vary, emphasizing the galaxies' details.

Figure 1. Representative examples of simulated strong lenses in KiDS DR4 imaging. Each system is marked by a red box, and the inset shows a zoomed-in view of the lens.

3 Methods

The performance of LenNet is determined by three core components: its model architecture, loss function, and training procedure. The architecture and loss function define the model’s theoretical performance ceiling, while the training strategy determines how effectively this potential is achieved. The following subsections detail each of these three components.

3.1 Network architecture

Figure 2 provides a simplified schematic of this architecture, omitting intermediate layers for clarity. The model’s architecture is composed of two primary stages: a feature extraction backbone and a feature fusion neck. These modules are followed by a detection head, which processes the fused feature maps at three distinct scales. This multi-scale approach enables the model to effectively detect gravitational lenses of various apparent sizes. The operational workflow diverges for training and inference. During the training phase, the model’s predictions are compared against ground-truth labels to compute a loss value, which guides parameter updates via backpropagation. Conversely, during inference, the raw predictions are aggregated and refined using Non-Maximum Suppression (NMS) (Neubeck and Van Gool, 2006) to produce the final set of detections.
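As an illustration of this inference-time post-processing, the snippet below applies torchvision's standard NMS to a handful of hypothetical detections aggregated from the three output scales; the box coordinates, scores, and the 0.5 IoU threshold are placeholders rather than LenNet's actual settings.

```python
import torch
from torchvision.ops import nms

# Raw detections aggregated from the three output scales (values are placeholders):
# each box is (x1, y1, x2, y2) in pixels, with one confidence score per box.
boxes = torch.tensor([[120.0, 130.0, 180.0, 190.0],
                      [122.0, 128.0, 182.0, 188.0],   # near-duplicate of the first box
                      [300.0, 310.0, 350.0, 360.0]])
scores = torch.tensor([0.92, 0.88, 0.75])

# Keep only the highest-scoring box within each cluster of overlapping predictions
keep = nms(boxes, scores, iou_threshold=0.5)
final_boxes, final_scores = boxes[keep], scores[keep]
print(final_boxes, final_scores)
```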

Figure 2
Simplified LenNet network structure diagram with three main sections: Backbone Network, Feature Fusion Neck, and Output Head. The Backbone Network includes components like Focus and Conv with Feature Extractors. The Feature Fusion Neck shows processes such as downsampling and upsampling. The Feature Extractor section lists modules like ECA, SE, and CBAM. Outputs are fed into multiple heads for final processing.

Figure 2. Simplified LenNet network structure. This figure is intended solely to provide readers with a general understanding of the overall architecture of LenNet and does not encompass all components in detail. For example, in the upsampling and downsampling process of the feature fusion neck (i.e., the FPN network), the intermediate feature extraction layers interleaved between these operations have been omitted for clarity. In the diagram, dashed lines represent the connections between different components, with rectangular blocks in the Backbone Network and circular nodes in the Feature Fusion Neck both denoting intermediate feature maps.

The feature extraction module serves as the backbone of LenNet, designed to transform raw input images into a hierarchy of representative and discriminative features. This is achieved by progressively downsampling the spatial dimensions while extracting information critical for lens identification. The module's architecture is a carefully orchestrated sequence of attention mechanisms, a bespoke Multi-Feature Extraction block (MFEBlock; Xie et al., 2023), and the efficient C3 module from CSPNet (Wang et al., 2023; Khanam and Hussain, 2024). It starts with a lightweight ECA block (Wang et al., 2020) for initial channel-wise weighting. Following this, an SE block (Hu et al., 2018) further refines these channel weights, focusing computational resources on the most informative channels. Building on this, a CBAM block (Woo et al., 2018) introduces a spatial attention component, creating a comprehensive feature map refined in both channel and spatial domains. After this attention-based refinement, the core feature extraction is performed. The MFEBlock employs dilated convolutions to systematically enlarge the receptive field, enabling the capture of broad contextual information essential for identifying lens systems. The subsequent C3 module then effectively aggregates the multi-scale features generated by the MFEBlock, utilizing residual connections and feature concatenation to produce a semantically rich representation. Finally, the sequence concludes with a Coordinate Attention (CA) block (Hou et al., 2021) to capture direction-aware and position-sensitive information, which is crucial for the fine-grained localization of the target lens. In the final feature extraction stage only, an SPPF block (He et al., 2015) is appended to fuse features at multiple scales efficiently. The output of this entire module is then passed to the next stage of the backbone or serves as one of the three final feature maps for the detection head. The unique presence of the SPPF block in the last stage is highlighted in Figure 2.
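As a concrete reference for one of the attention components above, the following is a minimal PyTorch sketch of the Squeeze-and-Excitation (SE) channel-attention block (Hu et al., 2018). The reduction ratio of 16 is the default from the original paper and an assumption here; LenNet's exact channel widths and hyper-parameters may differ.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (Hu et al., 2018)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global average per channel
        self.fc = nn.Sequential(                 # excitation: learn per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                       # reweight the input feature channels

# Recalibrate a backbone feature map of shape (batch, channels, height, width)
feature_map = torch.randn(2, 192, 80, 80)
print(SEBlock(192)(feature_map).shape)           # torch.Size([2, 192, 80, 80])
```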

In the feature fusion neck section, to effectively integrate features across different scales, we employ a feature fusion neck based on the Path Aggregation Network (PANet) architecture, a well-established enhancement to the classical Feature Pyramid Network (FPN) (Lin et al., 2017). This neck structure creates bidirectional information flow, combining a top-down path to propagate rich semantic features with a bottom-up path to preserve precise localization information. Let P3, P4, and P5 denote the multi-scale feature maps produced by the backbone. Prior to fusion, each map is passed through an SE block to recalibrate channel-wise feature responses. The process then unfolds in two stages.

Top-Down Pathway: This path propagates high-level semantic information from deeper layers (P5) to shallower ones. Specifically, the P5 feature map is upsampled and fused (e.g., via concatenation) with P4. This combined map is then processed and upsampled again to be fused with P3, creating a set of semantically enhanced feature maps.

Bottom-Up Pathway: Following the top-down pass, a complementary bottom-up pathway is constructed. This path transmits low-level, high-resolution positional information from the shallower layers (e.g., the newly enhanced P3) upwards to the deeper layers. This is achieved through a series of downsampling and fusion operations, ensuring that the final output pyramids contain both strong semantic and precise spatial information.

The resulting feature maps from this bidirectional fusion process—P3, P4, and P5—are information-rich and serve as the final inputs to the detection head for generating predictions.
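The bidirectional fusion described above can be sketched as follows. The channel and spatial sizes (192, 384, and 768 channels at 80×80, 40×40, and 20×20) follow the P3-P5 shapes reported in Figure 8; the per-scale SE recalibration and the intermediate feature-extraction layers of the real neck are omitted, and the 1×1/3×3 convolutions are illustrative stand-ins rather than LenNet's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PANetNeckSketch(nn.Module):
    """Schematic PANet-style neck: a top-down pass spreads semantic context,
    then a bottom-up pass restores precise localisation information."""

    def __init__(self, c3=192, c4=384, c5=768):
        super().__init__()
        self.fuse_td4 = nn.Conv2d(c5 + c4, c4, 1)          # top-down fusion at the P4 scale
        self.fuse_td3 = nn.Conv2d(c4 + c3, c3, 1)          # top-down fusion at the P3 scale
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        self.fuse_bu4 = nn.Conv2d(c3 + c4, c4, 1)          # bottom-up fusion at the P4 scale
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.fuse_bu5 = nn.Conv2d(c4 + c5, c5, 1)          # bottom-up fusion at the P5 scale

    def forward(self, p3, p4, p5):
        # Top-down pathway: upsample deeper maps and concatenate with shallower ones
        t4 = self.fuse_td4(torch.cat([F.interpolate(p5, scale_factor=2), p4], dim=1))
        t3 = self.fuse_td3(torch.cat([F.interpolate(t4, scale_factor=2), p3], dim=1))
        # Bottom-up pathway: downsample the enhanced shallow maps and fuse again
        o4 = self.fuse_bu4(torch.cat([self.down3(t3), t4], dim=1))
        o5 = self.fuse_bu5(torch.cat([self.down4(o4), p5], dim=1))
        return t3, o4, o5   # fused P3, P4, P5 maps fed to the detection head

p3 = torch.randn(1, 192, 80, 80)
p4 = torch.randn(1, 384, 40, 40)
p5 = torch.randn(1, 768, 20, 20)
print([tuple(f.shape) for f in PANetNeckSketch()(p3, p4, p5)])
```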

3.2 Loss function

The overall loss function, given in Equation 4, is a composite objective designed to train the model on three distinct tasks: objectness confidence prediction, class prediction, and bounding box localization. It is a weighted sum of a confidence loss (Equation 1), a classification loss (Equation 2), and a localization loss (Equation 3). Both the confidence and classification losses are implemented using the standard Binary Cross-Entropy (BCE) loss. The localization loss is calculated using the Generalized Intersection over Union (GIoU) (Rezatofighi et al., 2019), which offers more stability than the standard IoU for non-overlapping boxes. The individual loss components are defined as follows:

$$\mathcal{L}_{\mathrm{conf}}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i^{\mathrm{conf}}\ln p_i^{\mathrm{conf}}+\left(1-y_i^{\mathrm{conf}}\right)\ln\left(1-p_i^{\mathrm{conf}}\right)\right]\tag{1}$$

$$\mathcal{L}_{\mathrm{cls}}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i^{\mathrm{cls}}\ln p_i^{\mathrm{cls}}+\left(1-y_i^{\mathrm{cls}}\right)\ln\left(1-p_i^{\mathrm{cls}}\right)\right]\tag{2}$$

$$\mathcal{L}_{\mathrm{loc}}=\mathcal{L}_{\mathrm{GIoU}}=1-\mathrm{IoU}+\frac{|C|-|A\cup B|}{|C|}\tag{3}$$

where $N$ is the number of predictions; $y_i^{\mathrm{conf}}$ and $p_i^{\mathrm{conf}}$ are the ground-truth and predicted confidence scores for the $i$-th prediction, respectively; and $y_i^{\mathrm{cls}}$ and $p_i^{\mathrm{cls}}$ are the ground-truth and predicted class probabilities. $\mathrm{IoU}$ denotes the Intersection over Union between the ground-truth box $A$ and the predicted box $B$, and $|C|$ is the area of the smallest convex box enclosing both $A$ and $B$. The final objective function combines these components, weighted according to their importance for the single-class lens detection task. Since accurate localization is the primary goal, we prioritize the confidence and localization losses. Following a similar strategy to LSGBnet (Su et al., 2024), the classification loss is assigned a minimal weight, mainly serving to stabilize training in the multi-task framework. The final loss is:

$$\mathcal{L}_{\mathrm{total}}=\omega_{\mathrm{conf}}\mathcal{L}_{\mathrm{conf}}+\omega_{\mathrm{cls}}\mathcal{L}_{\mathrm{cls}}+\omega_{\mathrm{loc}}\mathcal{L}_{\mathrm{loc}}\tag{4}$$

where the weights are set to $\omega_{\mathrm{conf}}=10$, $\omega_{\mathrm{cls}}=0.005$, and $\omega_{\mathrm{loc}}=1$.
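A minimal sketch of this composite objective in PyTorch is shown below, assuming predictions have already been matched to targets. It uses torchvision's generalized_box_iou_loss for the GIoU term; the function interface and the toy tensors are illustrative, not LenNet's actual implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def composite_detection_loss(pred_conf, gt_conf, pred_cls, gt_cls,
                             pred_boxes, gt_boxes,
                             w_conf=10.0, w_cls=0.005, w_loc=1.0):
    """Weighted sum of BCE confidence/class terms and a GIoU localisation term,
    mirroring Equations 1-4. Confidence/class inputs are probabilities in [0, 1];
    boxes are (x1, y1, x2, y2) pairs already matched between prediction and target."""
    l_conf = F.binary_cross_entropy(pred_conf, gt_conf)                       # Eq. 1
    l_cls = F.binary_cross_entropy(pred_cls, gt_cls)                          # Eq. 2
    l_loc = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")  # Eq. 3
    return w_conf * l_conf + w_cls * l_cls + w_loc * l_loc                    # Eq. 4

# Toy matched predictions and targets (values are placeholders)
pred_conf, gt_conf = torch.rand(8), torch.randint(0, 2, (8,)).float()
pred_cls, gt_cls = torch.rand(8), torch.ones(8)
gt_boxes = torch.tensor([[12.0, 9.0, 52.0, 48.0]]).repeat(8, 1)
pred_boxes = gt_boxes + torch.rand(8, 4)
print(composite_detection_loss(pred_conf, gt_conf, pred_cls, gt_cls, pred_boxes, gt_boxes))
```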

3.3 Training

In object-detection frameworks like ours, the training data already includes negative examples via the background regions of images that contain lenses. With enough training images, these backgrounds provide ample negatives, allowing the model to learn discriminative features without additional pure negative images. This follows standard practice in modern object detection networks (e.g., YOLO, Faster R-CNN), which typically do not train on entire negative images.

In our simulation-based pretraining, we supply many simulated lens images, whose background regions already yield numerous negative samples. Adding pure negative images would likely not improve performance and would introduce practical challenges: verifying that large numbers of candidate negatives truly contain no lenses is labor-intensive and error-prone, especially at the scale of the simulated dataset. However, during transfer learning on real data (see Section 4.2), where the training set is much smaller (200–360 images), we included a small set of carefully verified negative samples. This improves the model's ability to distinguish lenses from specific background features in real observations, and manual verification is feasible at this scale.

We generated a comprehensive dataset of simulated gravitational lens images for this study. The complete dataset was partitioned into training, validation, and test sets following a standard 8:1:1 ratio. All splits were sampled from the same underlying data distribution to ensure an unbiased evaluation of the model's final performance.

The model was trained using the Adam optimizer (Kingma and Ba, 2014) with a batch size of 32. The initial learning rate was set to 2×10⁻⁵ and was managed by a ReduceLROnPlateau scheduler, which halved the learning rate whenever the validation loss stagnated for five consecutive epochs. We conducted the training on a single NVIDIA A100 GPU, using the PyTorch deep learning framework (version 2.4.1) with Python 3.12.7 and CUDA 12.0. The training spanned 150 epochs and required a total of 9 GPU hours. Throughout the training, the GPU achieved an average utilization rate of 95%, while the memory and CPU utilization rates averaged 36% and 67%, respectively. Additionally, during the testing phase, the test dataset comprised 915 images containing lenses and 1,200 images without lenses, and LenNet took only 28 s to make predictions on all of them. It is particularly worth noting that these 1,200 lens-free images are real-world samples selected completely at random.
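The optimiser and scheduler configuration can be reproduced with a few lines of PyTorch, as sketched below. The tiny stand-in model and the placeholder validation loss are for illustration only; the hyper-parameters (Adam, learning rate 2×10⁻⁵, halving on a five-epoch plateau, 150 epochs) follow the values reported above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 16, 3)  # stand-in for LenNet
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

for epoch in range(150):
    # ... one pass over the training set, then evaluate on the validation split ...
    val_loss = 1.0 / (epoch + 1)        # placeholder for the measured validation loss
    scheduler.step(val_loss)            # halves the LR when the loss plateaus
```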

4 Results and discussion

To comprehensively assess the performance of LenNet, we employ a combination of quantitative metrics and qualitative analysis. For the quantitative evaluation, we adopt three standard metrics from the field of object detection: Precision, Recall, and the F1-score. The Recall measures the model’s ability to identify all actual gravitational lenses within the dataset, thus quantifying its completeness. Precision measures the reliability of the predictions, indicating what fraction of the identified candidates are actual lenses. A fundamental trade-off exists between these two metrics; for instance, a model can achieve high recall by lowering its detection threshold, but this often leads to an increase in false positives and a decrease in precision. The F1-score, which is the harmonic mean of precision and recall, provides a single, balanced metric that summarizes the model’s overall accuracy. These metrics are defined as:

$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},\qquad \mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\qquad \mathrm{F1\text{-}score}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}.$$

where TP (True Positive) denotes a predicted bounding box whose IoU with a ground-truth box exceeds a predefined threshold (e.g., IoU > 0.5), FP (False Positive) corresponds to a predicted box with no matching ground-truth object, and FN (False Negative) is a ground-truth object that the model failed to detect.
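For reference, the three metrics follow directly from these counts, as in the short sketch below; the counts used in the example are illustrative, not LenNet's actual numbers.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from matched-detection counts, where a
    prediction counts as a TP when its IoU with a ground-truth box exceeds 0.5."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only
print(detection_metrics(tp=890, fp=28, fn=25))
```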

Beyond these numerical scores, it is crucial to understand how the model arrives at its predictions. To this end, we perform a qualitative analysis by visualizing the model’s internal feature maps. This interpretability technique allows us to peer inside the model’s “black box” and gain direct, intuitive insight into its decision-making process. Specifically, this analysis enables us to verify that LenNet learns to activate on physically meaningful structures—such as Einstein rings, arcs, or multiple images—rather than relying on spurious artifacts or background noise. This confirmation is vital for building trust in the model’s scientific utility and ensuring that its high performance is rooted in a genuine understanding of the underlying astrophysics.

4.1 The performance on the simulated dataset

This section details the quantitative performance of LenNet on our simulated dataset. We first characterize the model’s standalone effectiveness using standard evaluation metrics and then contextualize these results through a rigorous comparative analysis against established, state-of-the-art object detection models.

The initial phase of our evaluation focused on quantifying the intrinsic performance of LenNet. On the simulated test set, LenNet attains an F1-score of 97.11%. Evaluating performance across probability thresholds, we set the threshold to 0.3, which yields a recall of 97.27% and a precision of 96.95%. The high recall indicates that the model successfully identifies nearly all true gravitational lenses, while the high precision demonstrates that its predictions are highly reliable with a low false positive rate. The F1-score, as the harmonic mean of these two, confirms the model's excellent overall balance and accuracy. To supplement these metrics, we conducted an in-depth analysis of LenNet's performance by examining the distribution of prediction confidence scores (Figure 3) and the Precision-Recall curve with its Average Precision (Panel A in Figure 4). The confidence distribution reveals that the majority of LenNet's predictions have scores exceeding 0.8; that is, most detected objects are predicted with a confidence above 80%. Moreover, the Precision-Recall curve for LenNet is stable and lies closest to the top-left corner, achieving an average precision of 97.80% (Figure 4). This demonstrates the model's excellent precision and recall.
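For reference, an average precision of the kind plotted in Figure 4 can be computed from per-detection confidence scores and match flags as sketched below. The values are illustrative, the matching to ground truth (IoU > 0.5) is assumed to have been done beforehand, and no curve interpolation is applied.

```python
import numpy as np

def pr_curve_and_ap(scores, matched, n_ground_truth):
    """Precision-recall curve and AP for a single-class detector. `scores` holds the
    confidence of each predicted box; `matched` is 1 if that prediction has IoU > 0.5
    with an unclaimed ground-truth lens, else 0."""
    order = np.argsort(-scores)            # sweep the threshold from high to low
    tp = np.cumsum(matched[order])
    fp = np.cumsum(1 - matched[order])
    precision = tp / (tp + fp)
    recall = tp / n_ground_truth           # missed lenses keep the recall below 1
    ap = np.trapz(precision, recall)       # area under the PR curve
    return precision, recall, ap

# Illustrative scores and match flags, not LenNet's actual predictions
scores = np.array([0.95, 0.91, 0.88, 0.72, 0.40, 0.35, 0.20])
matched = np.array([1, 1, 1, 1, 0, 1, 0])
precision, recall, ap = pr_curve_and_ap(scores, matched, n_ground_truth=6)
print(f"AP = {ap:.3f}")
```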

Figure 3

Figure 3. Confidence distribution of LenNet's predicted candidates among the positive samples in the simulated dataset.

Figure 4
Four precision-recall curves, labeled A to D, compare different models for lens classification. A: LenNet with 97.80% AP, B: YOLO v5 l with 96.11% AP, C: YOLO v5 m with 94.83% AP, D: YOLO v3 with 74.08% AP. All charts have recall on the x-axis and precision on the y-axis, demonstrating varying performance among models.

Figure 4. Precision-Recall curves and Average Precision (AP), i.e., the area under the precision-recall curve, which summarizes a model's performance by averaging precision across all recall levels, for four models on the "lens" category: (A) LenNet, (B) YOLOv5l, (C) YOLOv5m, and (D) YOLOv3. Note that the results of model (D) (YOLOv3) differ significantly from those of the other three models (A–C), so its axis scale has been adjusted.

To benchmark LenNet's performance against existing technologies, we conducted a comparative analysis against several models from the YOLO family, which are widely regarded as state-of-the-art in general-purpose object detection. We selected three representative baselines: the classic YOLOv3 and two modern variants, YOLOv5m (medium) and YOLOv5l (large). All models, including LenNet and every baseline under test, were trained and evaluated under identical experimental conditions on our dataset to ensure a fair and unbiased comparison. The summarized results are presented in Table 1. The results unequivocally demonstrate that LenNet achieves superior performance across all evaluated metrics. Notably, LenNet's F1-score of 97.11% surpasses that of the strongest baseline, YOLOv5l (94.52%), by a significant margin of 2.59 percentage points. This performance gap underscores the primary advantage of a specialized architecture. While the YOLO models are highly optimized for general-purpose object detection, LenNet's design, particularly its MFEBlock modules, is explicitly tailored to discern the subtle and complex morphological features characteristic of gravitational lenses. This domain-specific adaptation allows it to achieve a level of precision and recall that even state-of-the-art generalist models cannot match, thereby validating our architectural design choices and establishing LenNet as a highly effective tool for this specialized scientific task.

Table 1

Table 1. Comparison of detection performance among different models on the simulated dataset, evaluated using F1-score, recall, and precision. LenNet achieves superior results across all metrics, significantly outperforming other baseline models.

4.2 Transfer learning to the real dataset

Although deep learning models demonstrate excellent performance on simulated data, their scientific utility is ultimately determined by their ability to generalize to real-world observations. A critical challenge in this transition is the “sim-to-real” gap, where subtle differences between simulated and real data can degrade performance.

For instance, in tasks such as ionized nebula classification, strong gravitational lens substructure detection, and galaxy merger identification, differences between real and simulated data in spectral features, image noise, and data distributions have been shown to reduce model accuracy (Belfiore et al., 2025; Alexander et al., 2023; Ćiprijanović et al., 2021), underscoring the need for techniques such as transfer learning.

In this work, the real observations used for transfer learning consist of high-quality lens candidates identified in KiDS DR4 (Petrillo et al., 2019; Li et al., 2020; Grespan et al., 2024; Chowdhary et al., 2024). Since the criteria for a "high-quality" candidate vary across these publications, we adopt the classification specified in each respective source. While most of these candidates lack spectroscopic confirmation, their distinct morphological features in the images suggest a high likelihood of being genuine lenses. For the purpose of testing our networks, we treated these candidates as confirmed lenses. We collected a total of 401 lens candidates, placed at randomized positions within large 501×501-pixel images, as illustrated in Figure 5. These images were then randomly split into training and testing sets for the transfer learning experiments. To improve the data diversity of the transfer learning dataset, we additionally incorporated non-lens images as supplementary data, in a quantity equal to that of the real lenses. For instance, a transfer dataset built on 200 real lens samples includes not only these 200 samples but also 200 real non-lens images.

Figure 5
Six-panel image of galaxy observations showing various celestial formations against a dark sky with stars. Each panel includes coordinates and features enlarged insets. Colors range from bright blues and reds to softer yellows and greens.

Figure 5. Examples of gravitational lens candidates from KiDS DR4 used for transfer learning. Each 501×501 pixel image contains a galaxy-galaxy strong lensing (GGSL) event, with insets highlighting the lensed features.

To quantify the impact of the "sim-to-real" gap and the necessity of transfer learning, we first evaluated LenNet and the YOLO models on real images without transfer learning. Neither LenNet nor YOLO achieves satisfactory performance in this setting. When tested on 402 real-world images (201 with lenses and 201 without), LenNet detects only 93 targets with 13 false detections at a confidence threshold of 0.1. Under the same threshold, even YOLOv5l, the best-performing YOLO variant on the simulated dataset, detects only 77 targets while producing 73 false detections. These results clearly show the drawbacks of directly applying models trained on simulated data to real-world observations, highlighting why transfer learning is indispensable.

Furthermore, the scarcity of labelled real-world astronomical data often precludes the reliable training of deep learning models from scratch. We employ transfer learning to address this challenge; this technique leverages knowledge from a model trained on a large, data-rich source domain (our simulations) and adapts it to a target domain with limited data (real observations) (Weiss et al., 2016). As established in Section 4.1, the superior feature extraction capabilities of LenNet ensure that its pre-trained weights offer a robust foundation for fine-tuning on real images. This approach avoids training a model from scratch on a small dataset, a process prone to overfitting, by instead fine-tuning the existing powerful feature representations to the specific nuances of real survey data.

The transfer learning experiment proceeded as follows: we used the LenNet model pre-trained on the comprehensive simulated dataset as the initial weights for fine-tuning on real-world data. Benefiting from the pre-trained weights' ability to extract general features, this transfer-learning approach exhibits greater stability. To accelerate convergence, we moderately increased the learning rate and fine-tuned the model on two small, independent datasets of real-world gravitational lens images. One dataset contains 400 images (200 real lens-containing samples and 200 real lens-free samples); the other contains 720 images (360 real lens-containing samples and 360 real lens-free samples). Since the lens-containing samples drive the training, the 400-image dataset is referred to as the "200-set" and the 720-image dataset as the "360-set" in what follows. To provide a direct performance benchmark, we also applied the identical fine-tuning procedure to YOLOv5m (the strongest baseline from our previous analysis), using the 200-set. The comparative performance metrics are presented in Table 2.

The fine-tuning experiments yield two key insights. Firstly, when both models are fine-tuned on the identical, limited dataset of 200 real images, LenNet significantly outperforms YOLOv5m: LenNet achieves an F1-score of 89.47%, exceeding YOLOv5m's score of 81.77% by about eight percentage points. This outcome strongly suggests that the specialised features learned by LenNet during pre-training are more robust and directly transferable to real-world lens detection. LenNet's architecture, tailored for astrophysical morphologies, provides a more effective starting point, allowing it to adapt more efficiently with limited target data than its general-purpose counterpart. Secondly, the performance of LenNet scales positively with the quantity of training data. When the fine-tuning dataset was expanded from the 200-set to the 360-set, the model's F1-score improved from 87.63% to 89.47%. This improvement confirms that while LenNet is effective with minimal data, its performance can be systematically enhanced as more labelled examples become available.
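The fine-tuning procedure itself is simple to express in PyTorch, as in the sketch below. A tiny stand-in network replaces LenNet, and the checkpoint path, fine-tuning learning rate, epoch count, and the tensors standing in for the "200-set"/"360-set" loaders are illustrative assumptions rather than the values used in our experiments.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 5))

# 1) Start from the weights pre-trained on the simulated dataset
torch.save(model.state_dict(), "pretrained_on_simulations.pt")   # stands in for the real checkpoint
model.load_state_dict(torch.load("pretrained_on_simulations.pt", weights_only=True))

# 2) Fine-tune all parameters on the small real-lens set with a moderately higher LR
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
real_images = torch.randn(8, 4, 64, 64)      # stands in for real KiDS training images
real_targets = torch.randn(8, 5)             # stands in for box + confidence targets
for epoch in range(10):
    loss = nn.functional.mse_loss(model(real_images), real_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```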

Table 2

Table 2. Performance comparison of the LenNet and YOLOv5 models. Both models were pre-trained on the simulated dataset and subsequently fine-tuned on real lensing images via transfer learning. The second column specifies the number of real lensing images used for the fine-tuning phase.

In conclusion, these findings validate the effectiveness of our two-stage training strategy (simulation pre-training followed by real-data fine-tuning). The results not only highlight LenNet’s strong potential for immediate application in current astronomical surveys but also illuminate a path forward for its continuous improvement. As new gravitational lenses are discovered and validated, they can be incorporated into the fine-tuning set, creating a virtuous cycle where the model’s detection capabilities are progressively refined and enhanced over time.

4.3 Interpreting LenNet through its feature maps

Visualizing the model's internal feature maps provides qualitative insight into its decision-making process. This interpretability analysis is critical for verifying that the model learns to identify physically meaningful structures and for fostering confidence in its application as a scientific instrument. For this analysis, we selected a representative test image (Figure 6) that features an unobscured gravitational lens. Crucially, this image also contains other astronomical objects with morphologies or sizes that could potentially confound the lens finders. We trace the data's progression through the two primary architectural stages of LenNet: the backbone, which performs the initial feature extraction, and the FPN within the Feature Fusion Neck, which carries out feature fusion and refinement.

Figure 6
A star field with various colored celestial objects scattered across a dark background. The celestial object to be detected is highlighted with a red square.

Figure 6. An example of a simulated gravitational lensing image containing an unobscured lens, used to illustrate the interpretability of the LenNet model. The red bounding box indicates the GGSL identified by the model.

The backbone serves as the foundational perception module of the network. Its role is to decompose the raw input image into a rich, hierarchical set of feature representations at multiple scales. Figure 7 displays the feature maps extracted from different layers of the backbone. As observed, the shallower layers capture low-level primitives such as edges, textures, and simple shapes. These activations are distributed across nearly all celestial objects in the frame, indicating that at this stage, the network is performing a broad, class-agnostic analysis of the visual information. As we progress to deeper layers, the features become more abstract and semantically complex, combining the initial primitives into representations of whole objects or significant parts thereof. At this stage, while activations are still present on multiple objects, the network has successfully generated a multi-scale inventory of all potentially salient structures in the image, providing the raw material for the subsequent detection stages.

Figure 7
Five images display feature maps labeled by their dimensions.

Figure 7. Feature maps from the LenNet backbone for the input survey image presented in Figure 6. In the image, the blue areas indicate regions where the network does not focus its attention. The “stem” layer denotes the feature map following initial feature extraction, with “dark2” to “dark5” representing feature maps processed by subsequent feature extraction blocks. The tensor shapes of these layers are also illustrated in the model architecture diagram.

The features extracted by the backbone are then processed by the FPN. The FPN’s critical function is to fuse information across different scales, enhancing features relevant to the target class (gravitational lenses) while suppressing background noise and features from irrelevant objects. Figure 8 visualizes the output feature maps from the FPN. The transformative effect of the FPN is immediately apparent. A stark contrast emerges between the diffuse activations in the backbone and the highly localized, high-intensity activations in the FPN maps. Across all three scales shown, the features corresponding to the gravitational lens are sharply and decisively enhanced. Simultaneously, the activations on neighboring galaxies and other distractor objects have been significantly attenuated. This visualization powerfully demonstrates that the FPN has learned to function as a highly effective spatial and semantic filter. It successfully integrates the multi-scale information from the backbone to converge on a high-confidence representation of the gravitational lens. This process creates a precise and unambiguous saliency map that provides clear guidance to the final detection head, enabling it to localize the lens with high accuracy. The clarity and focus of these final feature maps directly explain the high precision and recall scores reported in our quantitative analysis.
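Intermediate feature maps of this kind can be captured for visualisation with forward hooks, as in the sketch below. The two-layer model merely stands in for LenNet's backbone and neck stages, and the 4-channel input is an assumption about the number of image bands; only the hook mechanism itself reflects how such maps are typically extracted.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(4, 16, 3, stride=2, padding=1),   # stand-in "backbone" stage
    nn.Conv2d(16, 32, 3, stride=2, padding=1),  # stand-in "neck" stage
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

model[0].register_forward_hook(save_activation("backbone"))
model[1].register_forward_hook(save_activation("neck"))

image = torch.randn(1, 4, 320, 320)             # stands in for a multi-band survey cutout
model(image)

# Channel-averaged activations of this kind are what Figures 7 and 8 display
for name, feature in captured.items():
    heatmap = feature.mean(dim=1)               # average over channels -> (1, H, W)
    print(name, tuple(heatmap.shape))
```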

Figure 8
Three color-coded heatmaps labeled P3, P4, and P5, showing different channel configurations: P3 with 80x80 size and 192 channels, P4 with 40x40 size and 384 channels, and P5 with 20x20 size and 768 channels. Each map features a central bright yellow area, surrounded by gradients of green and purple.

Figure 8. The Feature Pyramid Network (FPN) feature maps of the LenNet model for the input survey image presented in Figure 6. These three images represent the final output of the Feature Fusion Neck, which corresponds to the ultimate feature maps of LenNet. Notably, the yellow regions in the center are highly consistent across the images, suggesting a strong likelihood of the presence of a gravitational lens at these locations. In contrast, the blue areas indicate regions where the model does not focus its attention, implying that it considers these areas unlikely to contain gravitational lenses.

5 Conclusion

This paper presents LenNet, a high-performance model for the automated detection of gravitational lenses. Our results show that LenNet not only excels on simulated datasets but also achieves outstanding performance on real-world data. It significantly outperforms several mainstream detection models, with a recall of 97.27%, a precision of 96.95%, and a high F1-score of 97.11% on simulated test data. Furthermore, feature visualization analysis demonstrates LenNet’s ability to extract key gravitational lens characteristics even in complex astronomical backgrounds, highlighting its robustness and interpretability.

Following its success with simulated data, LenNet maintains strong performance under a transfer learning strategy. The model adapts well to new datasets and requires only a small amount of training data. This approach not only reduces training costs but also preserves high performance, underscoring LenNet's efficiency and adaptability. Experimental results show that even with just the 200-set, LenNet achieves a recall of 84.58%, a precision of 90.91%, and an F1-score of 87.63%. With the 360-set, the overall performance improves further: precision rises to 97.14% and the F1-score reaches 89.47%, at a recall of 82.93%. These findings validate the model's generalization capabilities and demonstrate its practical utility in low-resource scenarios. Compared to training from scratch, transfer learning greatly reduces computational costs, enhancing the model's scalability.

LenNet clearly exhibits strong potential. With continued training on newly identified lenses, its performance can be further improved. This dynamic adaptability makes LenNet a valuable and evolving tool for gravitational lens detection and astronomical research. Looking ahead, as the volume of sky survey data continues to grow and new samples are accumulated, LenNet can be incrementally optimized through continuous learning. This positions it as a powerful solution for gravitational lens detection in large-scale astronomical imaging. With future deployments, LenNet is expected to support a wide range of practical applications and serve as an important tool in advancing astrophysics and cosmological studies.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

PL: Writing – original draft, Writing – review and editing, Investigation, Software, Methodology. HL: Writing – review and editing, Writing – original draft, Investigation, Data curation. ZL: Writing – original draft, Writing – review and editing, Methodology, Investigation. XC: Writing – original draft, Writing – review and editing. RuL: Writing – original draft, Investigation, Data curation, Project administration, Methodology, Writing – review and editing, Supervision. HS: Writing – review and editing, Software. RaL: Writing – review and editing. NN: Writing – review and editing. LK: Writing – review and editing. VB: Writing – review and editing. CT: Writing – review and editing. LG: Writing – review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work is supported by the National Natural Science Foundation of China (Grant No. 12588202) and the China Manned Space Program (Grant No. CMS-CSST-2025-A03). Rui Li acknowledges the National Natural Science Foundation of China (No. 12203050). Xiaoyue Cao acknowledges the support of the National Natural Science Foundation of China (No. 12303006). Ran Li is supported by the National Natural Science Foundation of China (Nos. 11773032, 12022306).

Acknowledgments

We acknowledge the use of OpenAI’s ChatGPT for assistance with language editing and refinement of this manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Generative AI statement

The author(s) declare that Generative AI was used in the creation of this manuscript. OpenAI's ChatGPT was used for assistance with language editing and refinement of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1Li et al. (2024b) also employ a transformer-based neural network to directly search for GGSLs in full-survey images. However, their method still assumes a fixed "window size", potentially missing lenses whose Einstein radius exceeds this predefined size.

2Due to this choice, the LenNet neural network trained in this study is optimised to find GGSLs with LRG deflectors. Nevertheless, LenNet's capabilities could be extended to identify GGSLs with disc or low-mass galaxy deflectors by introducing greater diversity and realism into the training set generation.

References

Alexander, S., Gleyzer, S., Parul, H., Reddy, P., Tidball, M., and Toomey, M. W. (2023). Domain adaptation for simulation-based dark matter searches with strong gravitational lensing. Astrophysical J. 954, 28. doi:10.3847/1538-4357/acdfc7

Auger, M. W., Treu, T., Bolton, A. S., Gavazzi, R., Koopmans, L. V. E., Marshall, P. J., et al. (2010). The sloan lens ACS survey. X. Stellar, dynamical, and total mass correlations of massive early-type galaxies. Astrophys. J. 724, 511–525. doi:10.1088/0004-637X/724/1/511

Ballard, D. J., Enzi, W. J. R., Collett, T. E., Turner, H. C., and Smith, R. J. (2024). Gravitational imaging through a triple source plane lens: revisiting the ΛCDM-defying dark subhalo in SDSSJ0946+1006. Mon. Not. R. Astron. Soc. 528, 7564–7586. doi:10.1093/mnras/stae514

Belfiore, F., Ginolfi, M., Blanc, G., Boquien, M., Chevance, M., Congiu, E., et al. (2025). Machine learning the gap between real and simulated nebulae-a domain-adaptation approach to classify ionised nebulae in nearby galaxies. Astronomy Astrophysics 694, A212. doi:10.1051/0004-6361/202451934

Bertin, E., and Arnouts, S. (1996). SExtractor: software for source extraction. Astronomy Astrophysics Suppl. Ser. 117, 393–404. doi:10.1051/aas:1996164

Birrer, S., Millon, M., Sluse, D., Shajib, A. J., Courbin, F., Erickson, S., et al. (2024). Time-delay cosmography: measuring the hubble constant and other cosmological parameters with strong gravitational lensing. Space Sci. Rev. 220, 48. doi:10.1007/s11214-024-01079-w

Bolton, A. S., Burles, S., Koopmans, L. V. E., Treu, T., and Moustakas, L. A. (2006). The sloan lens acs survey. i. a large spectroscopically selected sample of massive early-type lens galaxies. Astrophysical J. 638, 703–724. doi:10.1086/498884

Cabanac, R. A., Alard, C., Dantel-Fort, M., Fort, B., Gavazzi, R., Gomez, P., et al. (2006). The cfhtls strong lensing legacy survey: I. survey overview and t0002 release sample. Astronomy. Astrophysics 461, 813–821. doi:10.1051/0004-6361:20065810

Cao, X., Li, R., Shu, Y., Mao, S., Kneib, J.-P., and Gao, L. (2020). LESSER: a catalogue of spectroscopically selected sample of Lyman-α emitters lensed by galaxies. Mon. Not. R. Astron. Soc. 499, 3610–3619. doi:10.1093/mnras/staa3058

Cao, X., Li, R., Li, N., Li, R., Chen, Y., Ding, K., et al. (2024). CSST strong lensing preparation: forecasting the galaxy-galaxy strong lensing population for the China space station telescope. Mon. Not. R. Astron. Soc. 533, 1960–1975. doi:10.1093/mnras/stae1865

Cao, X., Li, R., Nightingale, J. W., Massey, R., He, Q., Amvrosiadis, A., et al. (2025). Probing dark matter substructures with free-form modelling: a case study of the ‘jackpot’ strong lens. arXiv Preprint.

Cheng, C., Cao, X., Lu, N., Li, R., Yang, C., Rigopoulou, D., et al. (2020). ALMA [{N} {II}] 205 μm Imaging Spectroscopy of the Lensed Submillimeter Galaxy ID 141 at Redshift 4.24. Astrophys. J. 898, 33. doi:10.3847/1538-4357/ab980b

Chowdhary, N. B., Koopmans, L. V. E., Valentijn, E. A., Verdoes Kleijn, G., de Jong, J. T. A., Napolitano, N., et al. (2024). Automation of finding strong gravitational lenses in the kilo degree survey with U - DenseLens (denseLens + segmentation). Mon. Not. R. Astron. Soc. 533, 1426–1441. doi:10.1093/mnras/stae1882

Ćiprijanović, A., Kafkes, D., Downey, K., Jenkins, S., Perdue, G. N., Madireddy, S., et al. (2021). Deepmerge–ii. building robust deep learning algorithms for merging galaxy identification across domains. Mon. Notices R. Astronomical Soc. 506, 677–691. doi:10.1093/mnras/stab1677

Collett, T. E. (2015). The population of galaxy-galaxy strong lenses in forthcoming optical imaging surveys. Astrophys. J. 811, 20. doi:10.1088/0004-637X/811/1/20

Collett, T. E., and Auger, M. W. (2014). Cosmological constraints from the double source plane lens SDSSJ0946+1006. Mon. Not. R. Astron. Soc. 443, 969–976. doi:10.1093/mnras/stu1190

Dye, S., Furlanetto, C., Swinbank, A. M., Vlahakis, C., Nightingale, J. W., Dunne, L., et al. (2015). Revealing the complex nature of the strong gravitationally lensed system h-atlas j090311.6+003906 using alma. Mon. Notices R. Astronomical Soc. 452, 2258–2268. doi:10.1093/mnras/stv1442

Edge, A., Sutherland, W., Kuijken, K., Driver, S., McMahon, R., Eales, S., et al. (2013). The VISTA kilo-degree infrared galaxy (VIKING) survey: bridging the gap between low and high redshift. Messenger 154, 32–34. Available online at: https://ui.adsabs.harvard.edu/abs/2013Msngr.154...32E/abstract.

Euclid Collaboration Li, T., Collett, T. E., Walmsley, M., Lines, N. E. P., et al. (2025a). Euclid quick data release (q1). The strong lensing discovery engine d – double-source-plane lens candidates. arXiv Preprint.

Euclid Collaboration, Walmsley, M., Holloway, P., Lines, N. E. P., Rojas, K., Collett, T. E., et al. (2025b). Euclid Quick Data Release (Q1): the strong lensing discovery engine A - system overview and lens catalogue. arXiv e-prints, arXiv:2503.15324. doi:10.48550/arXiv.2503.15324

Gavazzi, R., Treu, T., Koopmans, L. V. E., Bolton, A. S., Moustakas, L. A., Burles, S., et al. (2008). The Sloan Lens ACS Survey. VI. Discovery and analysis of a double Einstein ring. Astrophys. J. 677, 1046–1059. doi:10.1086/529541

Grespan, M., Thuruthipilly, H., Pollo, A., Lochner, M., Biesiada, M., and Etsebeth, V. (2024). TEGLIE: transformer encoders as strong gravitational lens finders in KiDS - from simulations to surveys. Astron. Astrophys. 688, A34. doi:10.1051/0004-6361/202449929

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1904–1916. doi:10.1109/tpami.2015.2389824

He, Q., Robertson, A., Nightingale, J. W., Amvrosiadis, A., Cole, S., Frenk, C. S., et al. (2025). Not so dark, not so dense: an alternative explanation for the lensing subhalo in SDSSJ0946+1006. arXiv preprint.

Hou, Q., Zhou, D., and Feng, J. (2021). Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. IEEE. p. 13713–13722.

Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018. IEEE. p. 7132–7141.

Ivezić, Ž., Kahn, S. M., Tyson, J. A., Abel, B., Acosta, E., Allsman, R., et al. (2019). LSST: from science drivers to reference design and anticipated data products. Astrophys. J. 873, 111. doi:10.3847/1538-4357/ab042c

Jacobs, C., Collett, T., Glazebrook, K., McCarthy, C., Qin, A. K., Abbott, T. M. C., et al. (2019). Finding high-redshift strong lenses in DES using convolutional neural networks. Mon. Not. R. Astron. Soc. 484, 5330–5349. doi:10.1093/mnras/stz272

Khanam, R., and Hussain, M. (2024). What is YOLOv5: a deep look into the internal features of the popular object detector. arXiv preprint arXiv:2407.20892. Available online at: https://arxiv.org/abs/2407.20892.

Kingma, D. P., and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Koopmans, L. V. E., Treu, T., Bolton, A. S., Burles, S., and Moustakas, L. A. (2006). The Sloan Lens ACS Survey. III. The structure and formation of early-type galaxies and their evolution since z ∼ 1. Astrophys. J. 649, 599–615. doi:10.1086/505696

Korytov, D., Hearin, A., Kovacs, E., Larsen, P., Rangel, E., Hollowed, J., et al. (2019). CosmoDC2: a synthetic sky catalog for dark energy science with LSST. Astrophys. J. Suppl. Ser. 245, 26. doi:10.3847/1538-4365/ab510c

Kuijken, K., Heymans, C., Hildebrandt, H., Nakajima, R., Erben, T., de Jong, J. T. A., et al. (2015). Gravitational lensing analysis of the kilo-degree survey. Mon. Not. R. Astron. Soc. 454, 3500–3532. doi:10.1093/mnras/stv2140

Kuijken, K., Heymans, C., Dvornik, A., Hildebrandt, H., de Jong, J. T. A., Wright, A. H., et al. (2019). The fourth data release of the kilo-degree survey: ugri imaging and nine-band optical-IR photometry over 1000 square degrees. Astron. Astrophys. 625, A2. doi:10.1051/0004-6361/201834918

Li, R., Napolitano, N. R., Tortora, C., Spiniello, C., Koopmans, L. V. E., Huang, Z., et al. (2020). New high-quality strong lens candidates with deep learning in the kilo-degree survey. Astrophys. J. 899, 30. doi:10.3847/1538-4357/ab9dfa

Li, R., Napolitano, N. R., Spiniello, C., Tortora, C., Kuijken, K., Koopmans, L. V. E., et al. (2021). High-quality strong lens candidates in the final kilo-degree survey footprint. Astrophys. J. 923, 16. doi:10.3847/1538-4357/ac2df0

Li, R., Napolitano, N. R., Xie, L., Li, R., Guo, X., Sergeyev, A., et al. (2024a). Multiband analysis of strong gravitationally lensed post-blue nugget candidates from the Kilo-Degree Survey. Astrophys. J. 973, 145. doi:10.3847/1538-4357/ad684c

Li, X., Sun, R., Lv, J., Jia, P., Li, N., Wei, C., et al. (2024b). CSST strong-lensing preparation: a framework for detecting strong lenses in the multicolor imaging survey by the China Survey Space Telescope (CSST). Astron. J. 167, 264. doi:10.3847/1538-3881/ad395e

Li, R., Napolitano, N. R., Ago, G. D., Shalyapin, V. N., Zhu, K., Guo, X., et al. (2025). Optical+NIR analysis of a newly found Einstein ring at z ∼ 1 from the Kilo-Degree Survey: dark matter fraction, total and dark matter density slope and IMF. arXiv e-prints, arXiv:2503.10180. doi:10.48550/arXiv.2503.10180

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017 July 21–26; Honolulu, HI, USA: IEEE. p. 2117–2125.

More, A., Cabanac, R., More, S., Alard, C., Limousin, M., Kneib, J.-P., et al. (2012). The CFHTLS-Strong Lensing Legacy Survey (SL2S): investigating the group-scale lenses with the SARCS sample. Astrophys. J. 749, 38. doi:10.1088/0004-637x/749/1/38

Nagam, B. C., Barroso, J. A. A., Wilde, J., Andika, I. T., Manjón-García, A., Pearce-Casey, R., et al. (2025). Euclid: finding strong gravitational lenses in the early release observations using convolutional neural networks.

Neubeck, A., and Van Gool, L. (2006). Efficient non-maximum suppression. In: 18th international conference on pattern recognition (ICPR’06); 2006 Aug 20-24: IEEE, Vol. 3, 850–855. doi:10.1109/ICPR.2006.479

Nightingale, J. W., He, Q., Cao, X., Amvrosiadis, A., Etherington, A., Frenk, C. S., et al. (2024). Scanning for dark matter subhaloes in Hubble Space Telescope imaging of 54 strong lenses. Mon. Not. R. Astron. Soc. 527, 10480–10506. doi:10.1093/mnras/stad3694

Nightingale, J., Mahler, G., McCleary, J., He, Q., Hogg, N. B., Amvrosiadis, A., et al. (2025). The COSMOS-Web Lens Survey (COWLS) I: discovery of >100 high-redshift strong lenses in contiguous JWST imaging. arXiv preprint arXiv:2503.08777. doi:10.1093/mnras/staf1253

Oguri, M., Keeton, C. R., and Dalal, N. (2005). The impact of lens galaxy environments on the image separation distribution. Mon. Not. R. Astron. Soc. 364, 1451–1458. doi:10.1111/j.1365-2966.2005.09697.x

Petrillo, C. E., Tortora, C., Chatterjee, S., Vernardos, G., Koopmans, L. V. E., Verdoes Kleijn, G., et al. (2017). Finding strong gravitational lenses in the Kilo Degree Survey with convolutional neural networks. Mon. Not. R. Astron. Soc. 472, 1129–1150. doi:10.1093/mnras/stx2052

Petrillo, C. E., Tortora, C., Vernardos, G., Koopmans, L. V. E., Verdoes Kleijn, G., Bilicki, M., et al. (2019). LinKS: discovering galaxy-scale strong lenses in the kilo-degree survey using convolutional neural networks. Mon. Not. R. Astron. Soc. 484, 3879–3896. doi:10.1093/mnras/stz189

Redmon, J., and Farhadi, A. (2018). YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.

Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019). Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019 June 15-20; Long Beach, CA, USA: IEEE. p. 658–666.

Rizzo, F., Vegetti, S., Powell, D., Fraternali, F., McKean, J. P., Stacey, H. R., et al. (2020). A dynamically cold disk galaxy in the early universe. Nature 584, 201–204. doi:10.1038/s41586-020-2572-6

Schneider, P., Ehlers, J., and Falco, E. E. (1992). Gravitational lenses. Berlin, Heidelberg: Springer. doi:10.1007/978-3-662-03758-4

LSST Science Collaboration, Abell, P. A., Allison, J., Anderson, S. F., Andrew, J. R., Angel, J. R. P., et al. (2009). LSST Science Book, version 2.0. arXiv preprint.

Shajib, A. J., Treu, T., Birrer, S., and Sonnenfeld, A. (2021). Dark matter haloes of massive elliptical galaxies at z ∼ 0.2 are well described by the Navarro-Frenk-White profile. Mon. Not. R. Astron. Soc. 503, 2380–2405. doi:10.1093/mnras/stab536

Shajib, A. J., Vernardos, G., Collett, T. E., Motta, V., Sluse, D., Williams, L. L. R., et al. (2024). Strong lensing by galaxies. Space Sci. Rev. 220, 87. doi:10.1007/s11214-024-01105-x

Sheu, W., Shajib, A. J., Treu, T., Sonnenfeld, A., Birrer, S., Cappellari, M., et al. (2025). Project Dinos II: redshift evolution of dark and luminous matter density profiles in strong-lensing elliptical galaxies across 0.1 < z < 0.9. Mon. Not. R. Astron. Soc., staf976. doi:10.1093/mnras/staf976

Shu, Y., Bolton, A. S., Kochanek, C. S., Oguri, M., Pérez-Fournon, I., Zheng, Z., et al. (2016a). The BOSS emission-line lens survey. III. Strong lensing of Lyα emitters by individual galaxies. Astrophys. J. 824, 86. doi:10.3847/0004-637X/824/2/86

Shu, Y., Bolton, A. S., Mao, S., Kochanek, C. S., Pérez-Fournon, I., Oguri, M., et al. (2016b). The BOSS emission-line lens survey. IV. Smooth lens models for the BELLS GALLERY sample. Astrophys. J. 833, 264. doi:10.3847/1538-4357/833/2/264

Shu, Y., Brownstein, J. R., Bolton, A. S., Koopmans, L. V. E., Treu, T., Montero-Dorta, A. D., et al. (2017). The Sloan Lens ACS Survey. XIII. Discovery of 40 new galaxy-scale strong lenses. Astrophys. J. 851, 48. doi:10.3847/1538-4357/aa9794

Sonnenfeld, A., Treu, T., Marshall, P. J., Suyu, S. H., Gavazzi, R., Auger, M. W., et al. (2015). The SL2S galaxy-scale lens sample. V. Dark matter halos and stellar IMF of massive early-type galaxies out to redshift 0.8. Astrophys. J. 800, 94. doi:10.1088/0004-637x/800/2/94

Spergel, D., Gehrels, N., Baltay, C., Bennett, D., Breckinridge, J., Donahue, M., et al. (2015). Wide-Field Infrared Survey Telescope - Astrophysics Focused Telescope Assets (WFIRST-AFTA) 2015 report. arXiv preprint.

Su, H., Yi, Z., Liang, Z., Du, W., Liu, M., Kong, X., et al. (2024). LSBGnet: an improved detection model for low-surface-brightness galaxies. Mon. Not. R. Astron. Soc. 528, 873–882. doi:10.1093/mnras/stae001

Treu, T. (2010). Strong lensing by galaxies. Annu. Rev. Astron. Astrophys. 48, 87–125. doi:10.1146/annurev-astro-081309-130924

Treu, T., Dutton, A. A., Auger, M. W., Marshall, P. J., Bolton, A. S., Brewer, B. J., et al. (2011). The SWELLS survey - I. A large spectroscopically selected sample of edge-on late-type lens galaxies. Mon. Not. R. Astron. Soc. 417, 1601–1620. doi:10.1111/j.1365-2966.2011.19378.x

Treu, T., Suyu, S. H., and Marshall, P. J. (2022). Strong lensing time-delay cosmography in the 2020s. Astron. Astrophys. Rev. 30, 8. doi:10.1007/s00159-022-00145-y

Vegetti, S., Lagattuta, D. J., McKean, J. P., Auger, M. W., Fassnacht, C. D., and Koopmans, L. V. E. (2012). Gravitational detection of a low-mass dark satellite galaxy at cosmological distance. Nature 481, 341–343. doi:10.1038/nature10669

Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., and Hu, Q. (2020). ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020 June 13-19; Seattle, WA: IEEE. p. 11534–11542.

Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2023). YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023 June 17-24; Vancouver, BC, Canada: IEEE. p. 7464–7475.

Wedig, B., Daylan, T., Birrer, S., Cyr-Racine, F.-Y., Dvorkin, C., Finkbeiner, D. P., et al. (2025). The Roman view of strong gravitational lenses. Astrophys. J. 986, 42. doi:10.3847/1538-4357/adc24f

Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. J. Big Data 3, 9–40. doi:10.1186/s40537-016-0043-6

Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018 September 8–14; Munich, Germany. p. 3–19.

Xie, L., Li, C., Wang, Z., Zhang, X., Chen, B., Shen, Q., et al. (2023). SHISRCNet: super-resolution and classification network for low-resolution breast cancer histopathology image. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI); 2023. Berlin, Heidelberg: Springer. p. 23–32.

Zhan, H. (2021). The wide-field multiband imaging and slitless spectroscopy survey to be carried out by the Survey Space Telescope of China manned space program. Chin. Sci. Bull. 66, 1290–1298. doi:10.1360/TB-2021-0016

Keywords: machine learning, gravitational lensing, strong lensing, object detection, transfer learning, KiDS

Citation: Liu P, Li H, Li Z, Cao X, Li R, Su H, Li R, Napolitano NR, Koopmans LVE, Busillo V, Tortora C and Gao L (2025) LenNet: direct detection and localization of strong gravitational lenses in wide-field sky survey images. Front. Astron. Space Sci. 12:1656917. doi: 10.3389/fspas.2025.1656917

Received: 30 June 2025; Accepted: 23 September 2025;
Published: 14 October 2025.

Edited by:

Emilia Järvelä, University of Oklahoma, United States

Reviewed by:

Margherita Grespan, University of Oxford, United Kingdom
Hanna Parul, Observatoire de Paris, France

Copyright © 2025 Liu, Li, Li, Cao, Li, Su, Li, Napolitano, Koopmans, Busillo, Tortora and Gao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Rui Li, liruiww@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.