ERGPNet: lesion segmentation network for COVID-19 chest X-ray images based on embedded residual convolution and global perception

The segmentation of infected areas from COVID-19 chest X-ray (CXR) images is of great significance for the diagnosis and treatment of patients. However, accurately and effectively segmenting infected areas of CXR images is still challenging due to the inherent ambiguity of CXR images and the cross-scale variations in infected regions. To address these issues, this article proposes ERGPNet, a network based on embedded residuals and global perception, to segment lesion regions in COVID-19 CXR images. First, to address the inherent fuzziness of CXR images, an embedded residual convolution structure is proposed to enhance the ability of internal feature extraction. Second, a global information perception module is constructed to guide the network in generating long-distance information flow, alleviating the interference of cross-scale variations with the algorithm’s discrimination ability. Finally, parallel spatial and serial channel attention modules improve the network’s sensitivity to target regions and suppress the interference of noise information. The interactions between the modules establish the mapping relationship between feature representation and information decision-making and improve the accuracy of lesion segmentation. Extensive experiments were conducted on three datasets of COVID-19 CXR images, and the results demonstrate that the proposed method outperforms other state-of-the-art segmentation methods for CXR images.


Introduction
COVID-19 is an acute respiratory infectious disease. The patients usually have uncertain symptoms such as ground-glass opacity, bilateral lower lobe consolidation (Zhang et al., 2023), diffuse airspace disease (Bougourzi et al., 2023), and pleural effusion (Jacobi et al., 2020) in the lungs. Accurately determining the lung disease areas of COVID-19 patients can help clinicians formulate appropriate treatment to prevent further deterioration of the patient. As an important means in the field of computer-aided diagnosis, image segmentation can assign semantic category information to each pixel. Therefore, it is widely used in practical tasks such as disease judgment (Wang et al., 2021), precise treatment (Lyu et al., 2022), and lesion monitoring (Chowdhury et al., 2020).
During the epidemic, many COVID-19 image segmentation methods based on deep learning were explored. Examples include methods based on convolutional neural networks (Huang et al., 2020; Zhou et al., 2020; Paluru et al., 2021), conditional generative adversarial networks (Bhattacharyya et al., 2022), lightweight capsule networks (Tiwari and Jain, 2022), and graph reasoning (Jia et al., 2023), as well as methods combined with transfer learning (Joshi et al., 2022; Tiwari et al., 2022). These methods have made effective contributions to the diagnosis and treatment of COVID-19 patients. However, due to the limited receptive field of conventional convolution operations, long-distance dependencies of feature information cannot be established. Therefore, it is difficult to make adequate judgments on diseased pixels when facing the following challenges.
The first challenge is that COVID-19 CXR images are characterized by sparse features and blurred backgrounds, making it difficult to form rich semantic representations. As depicted in the top row of Figure 1, the red arrows indicate the infected areas. However, the image does not exhibit clear infection characteristics, which makes it difficult for the network to accurately distinguish and classify infected pixels. To alleviate this issue, some researchers employ multi-task learning to improve the network's capability of capturing features of infected pixels. For instance, Zhao et al. (2022) proposed a cascaded segmentation-classification network that suppresses the interference of background regions during feature extraction by utilizing prior knowledge from the lung segmentation network. They improved the network's capability to extract features by combining key point extraction with a deep neural network. Munusamy et al. (2021) developed a novel Fractal CovNet architecture using Fractal blocks and U-Net for the segmentation of chest CT-scan images to localize the lesion region. Fan et al. (2022) proposed a segmentation network for COVID-19 infected regions. This network incorporates an edge-guided module and a reverse attention module to fully extract the blurred boundary details of the infected area. Chen et al. (2023) designed an unsupervised method for COVID-19 segmentation that utilizes a teacher-student network to learn rotation-invariant features for segmentation. However, multi-task learning imposes an additional computational burden on the network, and traditional cascaded convolutions have limited receptive fields and cannot capture deep feature information within the codec layer. Therefore, these methods struggle to adequately identify the details of infected pixels in COVID-19 CXR images.
The second challenge is that the outline and scale of the infected area in COVID-19 CXR images vary greatly, which increases the difficulty for the network to identify cross-regional weakly correlated features. As shown in the second row of Figure 1, the white area represents the region impacted by COVID-19. However, this change in scale and range blurs the details, making it difficult for the network to establish associations between local and global features, leading to misclassification. Therefore, enhancing the global long-distance dependencies and multi-scale feature mapping relations of the network is essential for alleviating the aforementioned problems. For instance, Mahmud et al. (2021) designed a horizontal expansion module for the multi-level encoder-decoder structure and combined it with pyramidal multi-scale feature fusion to minimize the semantic gap between features of varying scales. Wang et al. (2020) proposed an anti-noise Dice loss to effectively handle lung lesions of varying sizes and appearances. Mahmud et al. (2020) proposed a three-layer attention-based segmentation network, combining a three-layer attention mechanism with parallel multi-scale feature optimization to achieve precise segmentation of COVID lesions. Yu et al. (2022) improved the network's ability to perceive features in infection regions at different scales by combining a dual-branch encoder structure with spatial attention. (Li et al., 2022) [...] within the codec layer. The features inside the encoding and decoding layers are fused through the residual connection so that the network can extract multi-scale features inside the encoding and decoding layers. The GPM performs multi-dimensional perceptual integration of high-dimensional semantic information at the bottleneck and guides the decoder to perceive global semantic information in low dimensions. The attention module performs spatial and category weight corrections on feature information, respectively, to enhance the network's sensitivity to target information. Finally, the error of the prediction results is optimized through deep supervision.
(1) ERM is designed to extract deeper and wider feature information inside the encoding and decoding layers to reduce the impact of the inherent ambiguity of COVID-19 CXR images on network segmentation.
(2) GPM is proposed to promote the high-dimensional feature information of the codec structure to form global perception capabilities, and then guide the low-dimensional features to establish dependencies between long-distance feature information, thereby reducing the interference caused by cross-scale lesions on feature recognition.
(3) Spatial and channel attention are designed to correct the weights of feature information at different stages to improve the network's sensitivity to target information.

Materials and methodology

Data description
To validate the effectiveness of the method proposed in this paper, we conducted extensive experiments on two publicly available datasets and one dataset with enhanced images. The public datasets COVID-QU-Ex (Tahir et al., 2021) and QaTa-COV19 (Degerli et al., 2021) are from researchers at Qatar University and Tampere University. We used only the data for which COVID-19 infection segmentation masks are available; the details of the data are described below.
COVID-QU-Ex contains 33,920 CXRs, including 2913 COVID-19 samples with their corresponding ground-truth segmentation masks. The image size is 256 × 256 pixels, and the depth is 8-bit. These images are divided into a training set of 1864, a validation set of 466, and a test set of 583.
QaTa-COV19 contains 121,378 CXRs, including 9258 COVID-19 samples with their corresponding ground-truth segmentation masks. The image size is 224 × 224 pixels, and the depth is 8-bit. Among them, 5716 images were used as the training set, 1429 images as the validation set, and 2113 images as the test set.
For the COVID-19 CXR enhanced dataset, we use contrast-limited adaptive histogram equalization and gamma correction to enhance the original image, and then fuse the two enhanced images with the original image to obtain the final dataset. The COVID-19 CXR enhanced dataset contains 2400 images with a size of 256 × 256 pixels and a depth of 24-bit. Of these, 1600 images form the training set, 400 images the test set, and 400 images the validation set.
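The construction of the enhanced dataset can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the authors' pipeline: plain global histogram equalization stands in for contrast-limited adaptive histogram equalization, the gamma value of 0.8 is hypothetical, and stacking the original and the two enhanced images as three 8-bit channels is an assumption suggested by the stated 24-bit depth.

```python
def gamma_correct(img, gamma=0.8):
    """Gamma-correct an 8-bit grayscale image given as a list of rows."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in img]


def equalize(img):
    """Global histogram equalization (a simple stand-in for CLAHE)."""
    pixels = [p for row in img for p in row]
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of the intensity histogram.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    denom = max(len(pixels) - cdf_min, 1)
    return [[round(255 * (cdf[p] - cdf_min) / denom) for p in row] for row in img]


def fuse(img, gamma=0.8):
    """Stack original, equalized and gamma-corrected images as 3 channels."""
    eq, gc = equalize(img), gamma_correct(img, gamma)
    return [[(img[y][x], eq[y][x], gc[y][x]) for x in range(len(img[0]))]
            for y in range(len(img))]


image = [[10, 50], [100, 200]]  # toy 2 x 2 grayscale "CXR"
fused = fuse(image)             # each pixel now holds three 8-bit values
```

In practice a library implementation of CLAHE would replace `equalize`; the sketch only shows how two enhanced views and the original can be combined into one 24-bit image.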

Overview of the network
The overall architecture of ERGPNet is shown in Figure 2, which includes ERM, GPM, attention module, and deep supervision. ERM consists of deep embedded residuals (DER) and shallow embedded residuals (SER), which extract low-dimensional features and high-dimensional features, respectively, and mutually enhance the information obtained from each other. GPM radiates the global perception ability of high-dimensional features to low-dimensional space, guides the fusion of global contextual information, and captures feature relationships between cross-scale pixels. The attention module consists of parallel spatial attention and serial channel attention, which enhance the network's sensitivity to target regions and target channels, respectively, while reducing the influence of noise information on network discrimination. The deep supervision enables the network to calculate the loss in more detail, thereby achieving optimal prediction results.

Embedded residual module
The network needs to utilize more comprehensive information in order to establish an accurate mapping relationship between features. Most convolutional neural networks, however, utilize two linear convolutions of size 3 × 3 in the encoder-decoder layer to extract features. This method limits the receptive field of the network at the encoder-decoder layer and disregards deeper details, leading to the inability to accurately identify infected pixels. Inspired by U²-Net (Qin et al., 2020) and the residual structure, we propose ERM to extract deeper and wider feature information inside the encoder-decoder layer. Specifically, ERM has two structures, DER and SER.
The structure of DER is shown in Figure 3A. First, the input feature $f_{in}$ is sequentially passed through two convolution blocks to extract shallow features $f_i\ (i = 1, 2)$. Then, the shallow semantic features are passed through four convolution blocks successively to extract deep features of different scales $f_i\ (i = 3, 4, 5, 6)$. Shallow features highlight local fine-grained information, while deep features carry abstract information with better generalization. In addition, we use residual connections to merge the shallow features $f_i\ (i = 1, 2)$ with the features $t_i\ (i = 1, 2)$ during the feature recovery process. The merged features are then fed into the corresponding feature recovery convolution block, which emphasizes the representation of detailed information. The cross-scale deep features $f_i\ (i = 3, 4, 5, 6)$ are fused to obtain $f_c$, where $\mathrm{Cat}\{\cdot\}$ represents the channel concatenation operation and $D_n$ denotes downsampling by a factor of $n$. Then $f_c$ is upsampled by different factors and fed into the corresponding feature recovery convolution blocks to obtain $t_i\ (i = 2, 3, 4)$, respectively. The information from each feature recovery convolution block is fused with the multi-scale features $f_c$, resulting in the extraction of richer information.
The structure of SER is shown in Figure 3B. Because the feature information of the underlying encoder-decoder block has low resolution and is highly abstract, an excessively deep convolutional structure can lead to overfitting of features. Therefore, SER is designed with only three layers of convolutional extraction blocks. At the same time, we use dilated convolutions with different parameters instead of downsampling to prevent the loss of abstract information. The cross-level feature information is integrated through residual connections, which also increases the feature-aware range of the convolutional blocks. Thus, SER can effectively capture valuable information in high-dimensional features.
Frontiers in Physiology frontiersin.org
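The receptive-field argument behind ERM can be made concrete with a short calculation. The sketch below applies the standard formula for stacked stride-1 convolutions, RF = 1 + Σ (k − 1)·d, where k is the kernel size and d the dilation. The dilation rates (1, 2, 4) are hypothetical, since the text does not list the rates used in SER.

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


# Two ordinary 3 x 3 convolutions, as in a conventional encoder-decoder layer.
plain = receptive_field([(3, 1), (3, 1)])
# Three 3 x 3 convolutions with hypothetical dilation rates 1, 2 and 4.
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])
print(plain, dilated)
```

Under these assumed rates, three dilated layers see a 15 × 15 neighbourhood without any downsampling, compared with 5 × 5 for two ordinary 3 × 3 convolutions.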

Global perception module
The cross-scale variation of infected regions in COVID-19 CXR images poses a great challenge for network segmentation. Usually, low-dimensional semantic information helps identify small-scale detail features, while targets with large scale changes often require high-dimensional information with global perception capabilities as a guide. ASPP (Chen et al., 2017) can capture cross-regional features through multi-scale convolution kernels, and FPN (Lin et al., 2017) can obtain long-range feature dependencies by fusing prediction information at different scales. However, these methods may cause repulsion between features of different scales, resulting in the loss of some feature information. Therefore, after fully considering the characteristics of the encoder-decoder structure, we designed a simple and effective GPM at the bottleneck. This module guides the generation of low-dimensional features by leveraging the global awareness of high-dimensional features.
The structure of the GPM is shown in Figure 4. The input feature $f_{in}$ is processed by three different feature extraction methods. First, global average pooling is used to compress $f_{in}$ in the spatial dimension, reducing the amount of feature calculation while establishing the association between channel feature information and spatial feature information. Here, $C \in \{1, 2, 3, \ldots, 512\}$ indexes the feature channel, $F_c$ represents the feature of channel $C$ after spatial pooling, $f^{c}_{xy}$ represents the feature value at coordinates $(x, y)$ on channel $C$, and $\mathrm{Cat}\{\cdot\}$ represents the channel concatenation operation. Each point of the pooled one-dimensional feature $f_g$ contains the feature information of one spatial plane. Second, the channel dimension of $f_{in}$ is compressed to one through a 1 × 1 convolution, which establishes the relationship between channels so that the features of different channels can be learned interactively. The resulting single-channel feature map is then spatially downsampled by a factor of 2, and the downsampled feature is matrix-multiplied by $f_g$ to obtain $f_m$, where $\otimes$ represents matrix multiplication and $\mathrm{Conv}_{1\times1}$ denotes the 1 × 1 convolution operation. Each pixel in $f_m$ carries channel and spatial weight information. Third, a 3 × 3 convolution extracts features from $f_{in}$, and the result is fused with $f_m$ upsampled by a factor of 2 to obtain the final output feature $f_{out}$, where $\oplus$ denotes element-wise summation. Each pixel in $f_{out}$ perceives the information of other pixels. Finally, $f_{out}$ is upsampled and fused into the decoder side to guide the low-dimensional features in perceiving global information. This helps improve the robustness of the network when extracting features across scales.
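Read together, the three branches described above suggest the following formulation. It is a reconstruction from the prose rather than the authors' typeset equations; $H$ and $W$ (the spatial size of $f_{in}$) and $U_2$ (upsampling by a factor of 2) are notational assumptions, while $D_2$ follows the downsampling notation used for DER.

```latex
\begin{aligned}
F_c &= \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} f^{c}_{xy}, \qquad
f_g = \mathrm{Cat}\{F_1, F_2, \ldots, F_{512}\}, \\
f_m &= f_g \otimes D_2\big(\mathrm{Conv}_{1\times1}(f_{in})\big), \\
f_{out} &= \mathrm{Conv}_{3\times3}(f_{in}) \oplus U_2(f_m).
\end{aligned}
```

Under this reading, $f_g$ acts as a 512 × 1 column of channel weights and the downsampled single-channel map as a flattened row of spatial weights, so their product restores a 512-channel map in which every entry combines one channel weight with one spatial weight.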

Attention mechanism module
The attention mechanism can assign different weights to the feature information in order to enhance the network's ability to respond to the target area and category. However, since CXR images have more blurred features than natural images, the conventional single attention mechanism cannot maintain high sensitivity to feature information. To enhance the network's ability to perceive feature information of COVID-19 CXR images, we redesign the attention module. Specifically, in the feature information transfer process of the encoder-decoder structure, the encoding end is more inclined to extract regional feature information, while the decoding end is more inclined to extract category feature information. Therefore, we designed parallel spatial attention and serial channel attention to improve the sensitivity of the network to regional information and category information, respectively.
Figure 5A shows the parallel spatial attention. The input is an encoder output feature $f^{e} \in \mathbb{R}^{C_e \times H_e \times W_e}$, where $e \in \{1, 2, 3\}$ indexes the features output by different encoding layers, and $C$, $H$, and $W$ denote the depth, height, and width of the feature, respectively. Two pooling kernels are used to reshape $f^{e}$ along the height and width dimensions to obtain the two-dimensional feature matrices of the feature map in these dimensions, where $m$ and $n$ represent the number of two-dimensional feature matrices in the width and height dimensions, respectively. Then, we transpose the feature matrix $f^{e}_{w} \in \mathbb{R}^{C_e \times H_e}$ to obtain $f^{e}_{w} \in \mathbb{R}^{H_e \times C_e}$, and multiply it with $f^{e}_{h} \in \mathbb{R}^{C_e \times W_e}$ to obtain $f^{e}_{c} \in \mathbb{R}^{H_e \times W_e}$. SoftMax processing is then applied to obtain $f^{e} \in \mathbb{R}^{1 \times H_e \times W_e}$:

$$\mathrm{SoftMax}\left(f^{e}_{c,mn}\right) = \frac{\exp\left(f^{e}_{c,mn}\right)}{\sum_{c} \exp\left(f^{e}_{c,mn}\right)} \qquad (13)$$

where $f^{e}_{c,mn}$ represents the feature pixel at point $(m, n)$ in the feature matrix $f^{e}_{c}$. Next, a 1 × 1 convolution operation and Sigmoid activation are used to obtain the output feature $F^{e} \in \mathbb{R}^{C_e \times H_e \times W_e}$. Finally, it is fused with the input feature $f^{e}$ to output the feature map $M$ corrected by spatial attention. Compared to the previous method of directly connecting feature information between encoders and decoders, spatial attention correction enhances the representation of spatial feature information and improves the network's sensitivity to regional features.
Figure 5B shows the serial channel attention. The input is the feature $f^{d} \in \mathbb{R}^{C_d \times H_d \times W_d}$ output by the decoder, where $d \in \{2, 3, 4\}$ indexes the features output by different decoding layers. First, it is reshaped to $f^{d} \in \mathbb{R}^{C_d \times N_d}$. Then, matrix multiplication is performed between $f^{d}$ and its transpose, and SoftMax processing yields the channel feature matrix $f^{d} \in \mathbb{R}^{C_d \times C_d}$, where $(i, j)$ index different channels of $f^{d}$ and $f^{d}_{ij}$ represents the influence of channel $i$ on channel $j$. Adaptive pooling and a Sigmoid operation are then applied to the feature $f^{d} \in \mathbb{R}^{C_d \times N_d}$, and the result is used as a multiplicative weight. Finally, matrix multiplication with the channel feature matrix $f^{d} \in \mathbb{R}^{C_d \times C_d}$ gives the final output matrix $K$, where $\delta$ is a learnable parameter initialized to 0. This method combines two techniques: the non-local autocorrelation matrix operation and adaptive pooling. The goal is to enhance the interdependence between channel features and improve the network's sensitivity to the channel response of the target category.
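The channel-affinity step can be illustrated with a small, dependency-free sketch. It follows the common non-local channel attention pattern: a SoftMax over the C × C product of the reshaped feature with its transpose, followed by a residual connection scaled by δ. The adaptive pooling and Sigmoid branch is omitted, and the residual form `delta * (affinity @ f) + f` is an assumption, since the exact equation is not reproduced in the text.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(f, delta=0.0):
    """f is a C x N feature: C channel rows, each with N = H * W values."""
    channels, positions = len(f), len(f[0])
    # affinity[i][j]: normalized similarity of channels i and j, from f @ f^T.
    affinity = [softmax([sum(f[i][n] * f[j][n] for n in range(positions))
                         for j in range(channels)])
                for i in range(channels)]
    # Re-weight the channels with the affinity matrix, then add the input back.
    return [[delta * sum(affinity[i][j] * f[j][n] for j in range(channels)) + f[i][n]
             for n in range(positions)]
            for i in range(channels)]


feature = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]  # C = 2 channels, N = 3 positions
out_init = channel_attention(feature, delta=0.0)   # state at initialization
out_later = channel_attention(feature, delta=0.5)  # hypothetical learned delta
```

With δ initialized to 0 the module starts as an identity mapping, so the network can gradually learn how much cross-channel context to mix in.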

Deeply supervised loss function
Deep supervision can improve the reliability of the network's prediction outcomes. Therefore, this paper uses deep supervision to optimize the training process of ERGPNet. Specifically, we fuse feature prediction losses at different depths on the decoder side to guide the network's decisions on feature information. Here, $L^{d}_{p}\ (d = 1, 2, 3, 4)$ represents the loss of the prediction map at each decoder layer, $W^{d}_{p}$ denotes the weight of each layer's prediction loss, $L_p$ signifies the loss after merging the multi-level prediction maps, and $W$ represents the weight used to merge the multi-level prediction loss. Each level of loss $L$ is calculated with binary cross-entropy, where $(x, y)$ are the coordinates of the pixel, $(H, W)$ are the height and width of the image, $G^{t}_{(x,y)}$ represents the true label of the feature pixel, and $P^{t}_{(x,y)}$ represents the predicted label of the feature pixel. By stacking the prediction losses of multiple levels of feature maps, the error of the network's segmentation results is reduced.
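A minimal sketch of the loss described above, assuming the levels are combined as a weighted sum, $L = \sum_d W^{d}_{p} L^{d}_{p} + W L_p$. This combination and the example weights are assumptions, because the original equation is not reproduced in the text; only the binary cross-entropy term follows its standard definition. All prediction maps are assumed to have been resized to the mask resolution.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over an H x W probability map."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, g in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
            count += 1
    return total / count


def deep_supervision_loss(level_preds, fused_pred, target,
                          level_weights, fused_weight=1.0):
    """Weighted sum of per-level losses plus the fused-prediction loss."""
    loss = fused_weight * bce(fused_pred, target)
    for weight, pred in zip(level_weights, level_preds):
        loss += weight * bce(pred, target)
    return loss


mask = [[1, 0], [0, 1]]
good = [[0.9, 0.1], [0.1, 0.9]]  # confident, correct prediction
poor = [[0.6, 0.4], [0.4, 0.6]]  # hesitant prediction
total = deep_supervision_loss([good, poor, poor, good], good, mask,
                              level_weights=[0.25, 0.25, 0.25, 0.25])
```

The hesitant maps contribute a larger per-level loss than the confident ones, so every decoder depth receives its own gradient signal rather than only the final output.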

Evaluation metrics
We quantitatively evaluate the model's performance at the pixel level using a confusion matrix. First, the pixels in the infected area are marked as positive, and the background pixels are marked as negative. Then the following elements are counted: the number of pixels correctly predicted as the positive class (TP); the number of pixels correctly predicted as the negative class (TN); the number of pixels incorrectly predicted as the positive class (FP); and the number of pixels incorrectly predicted as the negative class (FN). Finally, we evaluate the model's performance using the following metrics: Accuracy, Precision, Recall, F1-score, and MIoU. The mathematical definitions of these evaluation metrics are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (21)$$

The accuracy is the ratio of correctly classified pixels to all pixels.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (22)$$

The precision is the probability that a pixel predicted as infected is actually an infected pixel.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (23)$$

The recall is the probability that a pixel that is actually infected is predicted as infected.

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (24)$$

The F1-score is the harmonic mean of precision and recall. It is often used to measure overall performance when both high precision and high recall are required.
The MIoU is used to evaluate the overlap ratio between the actual segmentation mask and the predicted segmentation mask.
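The metrics can be computed directly from the four confusion-matrix counts, as in the sketch below. Accuracy, precision, recall, and F1 follow their standard definitions; the MIoU is computed here as the mean of the infected-class and background-class IoU, which is an assumption about how the mean is taken. The counts in the example are made up for illustration.

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Pixel-level metrics from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_infected = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    miou = (iou_infected + iou_background) / 2  # assumed two-class mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "miou": miou}


# Illustrative counts for a single 1000-pixel image (not from the paper).
scores = segmentation_metrics(tp=80, tn=900, fp=10, fn=10)
```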

Implementation details
We conducted experiments on a workstation equipped with an Intel Xeon Gold 8350 CPU @ 2.60 GHz and a 12 GB NVIDIA GeForce RTX 3080Ti. The experiments were implemented in Python 3.8, and all models were run in PyTorch 1.10 with CUDA 11.3. During training, to balance memory usage and convergence efficiency, we use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The initial learning rate is set to 0.0001, and an adaptive learning rate decay strategy is adopted: if the loss on the validation set has not decreased after 10 epochs, the learning rate is reduced to 0.1 times its previous value. We set the batch size to 8, applied a weight decay of 0.0005, and used early stopping and gradient clipping to prevent overfitting. Finally, the model weights obtained from training are evaluated on the test set to obtain the corresponding evaluation metrics.
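The learning rate schedule described above can be sketched without any framework. The class below is a hypothetical helper, not the authors' code, and it reads the rule as "10 consecutive epochs without a new best validation loss"; in PyTorch, comparable behaviour is typically obtained with `torch.optim.lr_scheduler.ReduceLROnPlateau`, whose patience counting differs slightly from this sketch.

```python
class PlateauDecay:
    """Scale the learning rate by `factor` after `patience` stalled epochs."""

    def __init__(self, lr=1e-4, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


scheduler = PlateauDecay()
losses = [0.5, 0.4] + [0.4] * 10  # validation loss stalls after epoch 2
history = [scheduler.step(loss) for loss in losses]
```

In this toy run the rate stays at 0.0001 for eleven epochs and drops by a factor of 10 on the twelfth, once ten consecutive epochs have passed without improvement.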
Figure 6A shows the loss curves of all networks on the validation set over 100 epochs. For clarity, the loss curve of the proposed ERGPNet is shown in black. The loss of all networks levels off between 60 and 80 epochs and no longer decreases, which indicates that the networks have converged. Among them, ERGPNet, U-Net++, CE-Net, and AttentionU-Net use a deep supervision loss during training, resulting in higher loss values than the other networks.
Figure 6B shows the accuracy curves of all networks over 100 epochs on the validation set. Similarly, the accuracy curve of the proposed ERGPNet is shown in black. The validation accuracy curves of all networks fluctuate significantly. This fluctuation may be attributed to the complexity of segmenting the infected region in COVID-19 CXR images. Although the fitting process fluctuates strongly, the fluctuations decrease as the number of epochs increases, and the curves eventually reach a stable state. The accuracy of the proposed ERGPNet is better than that of the other methods.
In order to understand the structural advantages of ERGPNet, we compared it with the structures of other networks. The details are as follows.
(1) U-Net: The symmetric up-and-down sampling process and skip connections in this network provide a benchmark for the codec structure. However, due to the single convolution process of U-Net and the simple skip connections between encoders and decoders, network training is prone to overfitting. Therefore, as shown in Tables 1-3, U-Net obtained lower MIoU values of 80.47%, 79.56%, and 80.52% on the three datasets, respectively.
(2) U-Net++: This network changes the skip connection method in U-Net and adopts dense connections so that the decoder side obtains more information flow. Error correction is performed through deep supervision, which further improves the network's decision-making ability on feature information. However, dense connections also cause additional computation, and no attention is paid to the extraction of multi-scale information. Therefore, U-Net++ still does not perform well on the COVID-19 segmentation task.
(3) MiniSeg-Net: In order to reduce the computational load of densely connected networks, MiniSeg-Net uses the Downsampler Block and Attentive Hierarchical Spatial Pyramid Module as the basic modules. First, the network feature information is dimensionally reduced, and then information from receptive fields of different sizes is obtained through multi-scale feature fusion. This network has the fewest parameters and the shortest training time in the experiments but cannot obtain sufficiently rich feature information. Therefore, many infected pixels are missed (see Figures 7, 8, and 9).
(4) AttentionU-Net: This network adds attention-gating units in the skip connection process, mainly to highlight the salient features of specific local areas. However, single attention cannot enhance the network's sensitivity to target category information, so there are some errors in determining the category, resulting in mediocre performance.
(5) CE-Net: Since continuous pooling leads to the loss of spatial information, CE-Net proposes a contextual feature extraction module that captures broader and deeper semantic features by cascading multi-scale atrous convolutions and further obtains contextual information through multi-scale pooling operations. Because this network has powerful multi-scale spatial information extraction capabilities, it performs well on the COVID-19 segmentation task. Its MIoU in Table 1 and Table 3 reached the second-best values of 81.39% and 81.51%, respectively.
(6) COPLE-Net: This network proposes an anti-noise framework that adaptively integrates the student model and the teacher model to suppress the influence of noise, and it captures multi-scale feature information through residual connections and the ASPP module. However, because the network uses a bridge connection with simple channel compression, it is easy to create a semantic gap, which affects the performance of the network.
(7) Inf-Net: This network extracts edge information from low-dimensional features through the explicit edge attention module, and then aggregates high-level features through parallel partial decoders to generate regional information. Finally, the reverse attention module is used to guide the connection between edge information and regional information. This method corrects the network's attention to the target area but ignores the connection of hidden layer features outside the domain. This causes the network to over-segment long-distance areas, as shown in Figure 8.
Unlike the above network structures, ERGPNet changes the feature extraction method within the encoding and decoding layers: it extracts multi-scale information within these layers and reduces the problem of sparse features caused by the inherent blurriness of COVID-19 CXR images. Unlike the skip connections of U-Net and COPLE-Net, ERGPNet uses spatial attention for optimization in the connection process, which increases the weight of target area information while reducing the semantic gap between codecs. At the same time, the channel attention correction performed in the decoder enhances the network's sensitivity to target category information, making the information determination more accurate than in other networks. Finally, unlike other networks that extract global information through multi-scale convolution kernels or multi-scale feature fusion, ERGPNet globalizes the high-level semantic information at the bottleneck of the codec structure in different dimensions and establishes the correlation between local features and global features. As a result, ERGPNet achieved the best MIoU of 81.66%, 80.79%, and 81.73% on the three datasets, respectively.
To gain a more detailed understanding of the segmentation performance of the networks, we visually compare the segmentation results of all networks on the test dataset. Figure 7 shows the segmentation results on the COVID-QU-Ex test dataset. The irrelevant background pixels are shown in gray, the lung area pixels in black, and the COVID-19 infected area in white. Our method achieves better detail segmentation on the small-area infected images in the first to third rows. Additionally, the segmentation error rate of infected pixels is lower than that of the other networks. This is because DER and SER enable the network to accurately extract features at different levels, achieving a balance and interaction between feature information and obtaining more comprehensive feature representations.
Our method also achieved good performance on the large-area infected images in the fourth and fifth rows. This also verifies the robustness of ERGPNet when dealing with infected images of various sizes.
Figure 8 and Figure 9 show the segmentation results of the networks on the QaTa-COV19 and COVID-19 enhanced datasets, respectively. Pixels in the infected area are marked in white, while background pixels are marked in black. The proposed method identifies subtle regions more accurately, as in rows 3-6 of Figure 8 and rows 4-6 of Figure 9. This is because the ERM, combined with the attention mechanism, enhances the network's sensitivity to the detailed features of the target area, thereby preventing the loss of information during the segmentation of small areas. Comparing the infection segmentation results of different scales and contours in Figure 7, Figure 8, and Figure 9 shows that our method is more effective in distinguishing infected areas across various scales. This is because GPM establishes the cross-region dependency of pixel features, which improves the robustness of the network for cross-scale lesion segmentation. This visual comparison of the segmentation results further supports the effectiveness of each module of ERGPNet.

Ablation analysis
In order to assess the effectiveness of ERM, GPM, and attention module in ERGPNet, we conducted the ablation analysis in this section.The hyperparameters are set the same during the experiment to ensure the fairness of the results.Quantitative experimental results are shown in Table 4, where the baseline model represents the simplest U-Net network.Since our proposed ERM can better extract the feature information in the codec layer than the conventional convolution, F1 and MIoU are increased by 0.51% and 0.44%, respectively.GPM enhances the network's ability to perceive global information, so F1-Score and MIoU increase by 0.27% and 0.22%, respectively.The attention module enhances the sensitivity of the network to target region and channel features, so both F1-Score and MIoU are increased by 0.34%.In summary, each module can increase the segmentation performance of the network to a certain extent.
To further verify the contribution of each module, we selected a node in the decoder and visualized its feature map after adding each module, as shown in Figure 10. Column A shows the node features of the ordinary structure, column B the node features after adding the ERB, column C the node features after adding the GPM, and column D the node features after adding the attention module. First, comparing columns A and B shows that the ERB captures richer features. Second, column C shows that the GPM makes the network pay attention to global contour information. Finally, column D shows that the attention module enables the network to focus better on the target area and reduces the representation of irrelevant feature information. Overall, each module improves the network's ability to extract feature information.

Feature comparative analysis
To analyze in detail how features change during network computation, we visually compared the feature maps of ERGPNet and U-Net. Figure 11 shows the output features of the first two encoder layers, the last two decoder layers, the 1 × 1 convolutional layer, and the Sigmoid function of each network. For the encoder and decoder layers, we visualized four randomly selected feature channels for comparison. Although these features are fuzzy and abstract, the advantages of the proposed method can still be seen. Because the feature maps output by the 1 × 1 convolutional layer and the Sigmoid function contain only background and foreground channels, the difference is clearer there: the contours segmented by our method are more detailed and accurate, as indicated by the white circles in Figure 11. This is because each of our functional modules is specifically designed to improve the network's feature awareness of infected areas.

Grad-CAM analysis
To explore the regions of interest during network learning, we use Grad-CAM (Selvaraju et al., 2017) to visualize feature information as heat maps, as shown in Figure 12. The U-Net++ network pays attention to the feature information of many non-target areas, which may be caused by overfitting during feature extraction. Because of its lightweight design, MiniSeg-Net has difficulty generating sufficient attention to the target area. In contrast, CE-Net, Inf-Net, and our network generate sufficient attention to the target region. Overall, however, the heat maps of our network are noticeably more sharply concentrated on infected areas, which illustrates the robustness and specificity of our network in focusing on COVID-19 features.
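The core Grad-CAM computation behind such heat maps can be sketched as follows. This is a minimal NumPy illustration, assuming the activations of the chosen layer and the gradients of a target score with respect to them have already been extracted as arrays of shape (C, H, W). For segmentation, the target score is often taken as the sum of the foreground outputs over the pixels of interest, but that choice, the layer, and the function name `grad_cam` are assumptions here, not details given in this paper.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map from (C, H, W) arrays.

    activations: feature maps of the chosen layer.
    gradients:   d(score)/d(activations), same shape.
    """
    # Channel weights: global average of the gradients over space.
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # Weighted sum of the feature maps over channels.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # Keep only features with a positive influence on the score.
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] for display; leave an all-zero map unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy example with two channels on a 2 x 2 grid (not data from the paper).
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 2.0]]])
grads = np.array([[[1.0, 1.0], [1.0, 1.0]],
                  [[-1.0, -1.0], [-1.0, -1.0]]])
heat = grad_cam(acts, grads)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the CXR image, so a map that is concentrated on the lesion indicates that the layer's positive evidence comes from the infected area.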

Conclusion
This study proposes a novel network, ERGPNet, that can accurately segment lesion areas in COVID-19 CXR images with inherent blur and cross-scale lesions. First, we propose an ERB to replace conventional convolution, which extracts richer information in the encoder-decoder layers. Second, the GPM is designed to enhance the mapping relationship of global features and reduce the impact of cross-scale changes in infected regions on segmentation performance. Then, considering the characteristics of the encoder-decoder network, parallel spatial attention and serial channel attention are designed to enhance the network's sensitivity to pixels in infected regions. Finally, deep supervision is used to help the network converge to a better solution. The effectiveness and superiority of the proposed algorithm have been verified through segmentation experiments on three datasets. In addition, ablation experiments and visual analysis demonstrate the effectiveness of each functional module within the network. However, segmenting infected regions with complex contours is still a challenge, as shown in row 6 of Figure 9. Therefore, further improving the network's ability to identify edge information in infected areas is our future research direction.

FIGURE 2 Overall architecture of the ERGPNet.
proposed a multi-level attention-based lightweight segmentation network. It helps the network handle changes in scale by incorporating Atrous Pyramid Pooling at the encoding and decoding bottlenecks. However, most of these methods enhance the network's global perception ability by using multi-scale convolution kernels or by fusing encoder features from different scales. The detailed information in low-dimensional and high-dimensional features cannot be fully utilized, and the long-distance dependencies of high-order features are ignored. Therefore, these methods cannot effectively deal with the cross-scale variation of the infected area. To solve the above problems, this paper proposes a new global perception network (ERGPNet) based on embedded residual convolution. The network mainly consists of an embedded residual module (ERM), a global perception module (GPM), an attention module, and a deep supervision module. The ERM replaces the 3 × 3 convolution kernel which increases the convolution depth

FIGURE 4 The structure of the global perception module.
FIGURE 5 (A) Structure of the parallel spatial attention module. (B) Structure of the serial channel attention module.
FIGURE 6 (A) Loss curve on the validation set. (B) Accuracy curve on the validation set.

FIGURE 11 Visual comparison of feature changes between ERGPNet and U-Net.

TABLE 1 Quantitative evaluation metrics on the COVID-QU-Ex dataset. The optimal and suboptimal indicators are marked in bold.

TABLE 3 Quantitative evaluation on the feature-augmented COVID-19 image dataset. The optimal and suboptimal indicators are marked in bold.

TABLE 4 Ablation studies of ERM, GPM, and MixAttention on the COVID-QU-Ex dataset.