Low Dose CT Denoising by ResNet With Fused Attention Modules and Integrated Loss Functions

X-ray computed tomography (CT) is a non-invasive medical diagnostic tool that has raised public concern because of the health risks associated with the radiation dose delivered to patients. Reducing the radiation dose introduces noise artifacts, making low-dose CT images unreliable for diagnosis. Hence, low-dose CT (LDCT) image reconstruction has become an active research area. In this study, a deep neural network is proposed, specifically a residual network (ResNet) using dilated convolution, batch normalization, and rectified linear unit (ReLU) layers with fused spatial- and channel-attention modules to enhance the quality of LDCT images. The network is optimized using a combination of per-pixel loss, perceptual loss via VGG16-net, and dissimilarity index loss. An ablation experiment shows that these loss functions effectively prevent edge oversmoothing, improve image texture, and preserve structural details. Finally, comparative experiments show that the proposed network outperforms, both qualitatively and quantitatively, state-of-the-art denoising models such as block-matching 3D filtering (BM3D), Markovian-based patch generative adversarial network (patch-GAN), and dilated residual network with edge detection (DRL-E-MP).


INTRODUCTION
X-ray computed tomography (CT) is one of the most widely used diagnostic tools in medical imaging. It provides fine details of human internal structure noninvasively, which is ideal for detecting abnormalities in patients. However, this imaging modality requires X-rays to capture the region of interest. Exposure to such ionizing radiation can cause health risks including cancer (Z. Wang et al., 2020). Although some may argue that the effects of radiation from commercial CT scans are overstated, the dramatic expansion of CT usage has already increased the global annual cumulative ionizing radiation dose by 34% (Tahmasebzadeh et al., 2021). Hence, researchers have been exploring effective ways to reduce the radiation dose for medical imaging diagnosis without degrading image quality through the added noise.
Radiation reduction is usually performed by controlling the X-ray tube current or by minimizing the X-ray photon count (Kulathilake et al., 2021). This process degrades the signal-to-noise ratio (SNR) of the X-ray signals, resulting in lower-quality CT images with noise artifacts and making clinical diagnosis less reliable. Various methods have been introduced to address this and have already achieved improved results, including sinogram domain filtering, iterative reconstruction (IR), and image denoising using deep learning techniques, all of which aim to follow the "as low as reasonably achievable" (ALARA) principle (Yi and Babyn, 2018).
Projection domain filtering uses raw projection data before analytic CT image reconstruction. For noise removal, the noise present in the projection space should be well characterized (Wang et al., 2008). A recent study by Ma et al. (2021) proposed an attention deep residual dense convolutional neural network (CNN) that extracts noise features from the LDCT projection data in order to recover a clean sinogram for reconstruction. Although the fusion of local and global feature information during this pre-processing of the sinogram data obtained pleasing results, acquiring raw sinogram data from commercial CT scanners remains quite challenging (Ma et al., 2021). Model-based iterative reconstruction (MBIR) techniques perform image reconstruction based on object projections. Because this method continuously compares an image assumption with the real-time measured values, it was almost impossible for early scanners to perform (Pickhardt et al., 2012). However, with the rapid advancement of computer technology, this technique can now be handled and can achieve higher image quality in terms of image texture and spatial resolution (Hashimoto and Takamaeda-Yamazaki, 2021). The learned experts' assessment-based reconstruction network (LEARN) has been introduced, which utilizes the regularization and parameters used during the IR training process to effectively recover the images while trying to reduce the computational costs (Chen et al., 2018). A continued drawback is that the results are still susceptible to noise artifacts and the computational cost is high; as with sinogram domain filtering, there is also a limitation regarding the collection of projection data. The manifold and graph integrative convolutional (MAGIC) network simultaneously extracts pixel-level and topological features by using spatial and graph convolutions in an attempt to address the data limitation issue, but it still faces some potential issues regarding the optimization of the network design (Xia et al., 2021b).
For this study, CT image post-reconstruction is implemented using deep learning methods, which offer a more robust solution to the issues of the iterative methods mentioned above. Deep learning methods have evolved in recent years and have provided reliable outcomes in different fields, especially computer vision. These methods take advantage of graphics processing unit (GPU) parallel computing to accelerate the training of deeper network models, which tend to suffer from the vanishing gradient problem. Numerous state-of-the-art deep learning models have been developed for reducing noise artifacts in LDCT images. Generally, the CT reconstruction process involves mapping features of normal-dose CT (NDCT) images with the low-dose CT (LDCT) images, and this can be done through denoising algorithms. Block-matching 3D (BM3D) is a transformation domain technique in which similar patches are stacked into 3D groups by block matching and transformed into the wavelet domain during the reconstruction process (Dabov et al., 2007). Further, generative adversarial networks (GANs) have become popular in the LDCT denoising research community because of the framework's ability to produce fine details in denoised images (Goodfellow et al., 2020). A GAN framework typically consists of a generator that generates fake denoised images, which are then passed to the discriminator. The discriminator scores how the fake denoised images compare with the NDCT images. This sequence repeats until the generated image becomes acceptable (L. Chen et al., 2020). Even though this framework preserves the structural information of the images, problems like blurring remain noticeable. Sharpness-aware GAN (SAGAN) focused on addressing this blurring effect and introduced an additional sharpness detection network for measuring the sharpness of the denoised image (Yi and Babyn, 2018). Moreover, boosting attention fusion GAN (BAFGAN) implements sub-modules that can include long-range dependencies of the LDCT images to produce higher-quality denoised images (Lyu et al., 2020). Similarly, the U-Net-based discriminator in GAN framework (DU-GAN) simultaneously learns both local and global differences between the LDCT and NDCT images for a better regularization of the model (Huang et al., 2021). Although this framework can reliably provide exceptional outputs, the deep and complex architecture is also prone to instability due to the oscillation of the parameters during the training process. The local parameters of each sub-network of the GAN must be trained along with the parameters of the overall GAN, which is the main challenge with GAN architectures. As the accuracy of the discriminator increases, the performance of the generator gets worse during the training process. The unbalanced performance of the discriminator and the generator can cause vanishing gradients, making the whole system unstable (Arjovsky and Bottou, 2017).
A simpler but more stable denoising structure is the residual network (ResNet), in which skip connections between pre- and post-convolutional layers are implemented during the denoising process (He et al., 2016). The structure of a residual network provides lower computational costs than GANs without deteriorating the quality of the denoised images. A residual encoder-decoder CNN (RED-CNN) demonstrates the effectiveness of a symmetric convolutional and deconvolutional network with skip connections in denoising LDCT images at high computational speed (Chen et al., 2017a; Chen et al., 2017b). A parameter-dependent framework (PDF)-based RED-CNN network has also been introduced, which is trained simultaneously via two multilayer perceptrons (MLPs) that modulate the feature maps of the CT reconstruction process (Xia et al., 2021a). A ResNet merged with U-Net is able to learn both local and global image features while avoiding the vanishing gradient problem; its objective is similar to that of DU-GAN, and it achieves the same results with a very comprehensive architecture (Liu et al., 2021). The feasibility of a residual neural network was also explored by applying transfer learning to LDCT image denoising, especially when an unknown noise level is present (Zhong et al., 2020). In addition, dilated residual learning with an edge detection layer (DRL-E-MP), composed of a Sobel kernel, combined dilated convolutions, used in place of standard convolutions, with symmetric shortcut connections to conserve the data features and to better capture the structural details at the image boundaries (Gholizadeh-Ansari et al., 2019). Further, a similar network uses dilated residual learning with perceptual loss and structural dissimilarity (DRLPS), in which the focus is on the structural detail in low-contrast regions (Ataei et al., 2020a).
Inspired by the DRL-E-MP, DRLPS, and BAFGAN denoising models (Gholizadeh-Ansari et al., 2019; Ataei et al., 2020a; Lyu et al., 2020), a dilated residual learning network with fused attention modules (FAM-DRL) is introduced. The proposed network applies the concept of the attention modules from BAFGAN. Since BAFGAN has a complex architecture and faces instability issues, the proposed denoiser utilizes dilated convolutional layers and skip connections for faster network training, better stability, and more effective fusion of the feature attention modules. In this experiment, FAM-DRL is optimized using the combination of perceptual loss via VGG-16 Net for the prevention of edge oversmoothing, structural dissimilarity loss (DSSIM) for texture enhancement, and per-pixel loss for the symmetry between NDCT and LDCT images (Kulathilake et al., 2021). The main contribution of this paper is the architecture of the proposed denoising network, which achieves the following: 1) protection of edges from blurring, 2) enhancement of image textures, and 3) preservation of structural details of the CT images.
The remainder of this paper is organized as follows: Network Architecture provides full detail of the components used for the proposed network; Experiments discusses the data, training details and environment, and the evaluation method for the experiment. The Results section presents the quantitative and visual results, followed by the Discussion section where analytic observations are documented. Finally, the Conclusion summarizes the overall findings of this study.

NETWORK ARCHITECTURE
In this section, the proposed network containing the fused attention modules for the fusion of spatial- and channel-wise features of the images is presented.

Proposed Dilated Residual Network
As shown in Figure 1, the proposed denoiser network is constructed using 3 × 3 dilated convolution layers with a dilation rate of 2, batch normalization (BN), and ReLU layers in order to extract the shallow features. Further, the number of filters used for each convolutional layer follows the standard setting of 64 (Zhong et al., 2020). For this process, 512 × 512 LDCT images, x, are used as input. More details about the datasets are discussed in the Experiments section of this paper. Next is the generation of the multi-dimensional deep features in the cascaded boosting module groups (BMG). For this experiment, three BMG blocks are implemented. In each BMG, a stack of n ∈ {1, ..., N} boosting attention fusion blocks (BAFB) contains the fusion of the spatial and channel attention modules as shown in Figure 2, which is further discussed in Fused Attention Modules. Lastly, deconvolution + BN + ReLU make up the reconstruction layers, represented by the three post-convolutional layers after the BMG modules in Figure 1. To prevent the vanishing gradient problem, symmetric skip connections (SSC) between the pre- and post-convolutional blocks are applied. To test the accuracy of the overall network, peak signal-to-noise ratio (PSNR) and the structural similarity index metric (SSIM) are used for comparing the structural information of the NDCT-LDCT image pairs.
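To make the effect of the dilation rate concrete, the following is a minimal pure-Python sketch of a zero-padded 3 × 3 convolution with dilation 2. It is an illustration only, not the network's implementation: the function names `dilated_conv2d` and `receptive_field` are hypothetical, and the kernel of ones stands in for learned filter weights.

```python
def dilated_conv2d(image, kernel, dilation=2):
    """Zero-padded ('same') 2D cross-correlation with a dilated k x k kernel."""
    k = len(kernel)
    half = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    rr = r + (i - half) * dilation
                    cc = c + (j - half) * dilation
                    if 0 <= rr < h and 0 <= cc < w:  # outside the image counts as zero
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out


def receptive_field(kernel_size=3, dilation=2):
    """Side length of the area covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# A single bright pixel is seen by taps two pixels away, not by direct neighbours.
impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 1.0
ones = [[1.0] * 3 for _ in range(3)]  # placeholder weights, not learned filters
response = dilated_conv2d(impulse, ones, dilation=2)
```

With a dilation rate of 2, a 3 × 3 kernel covers a 5 × 5 area while still using nine weights, which is why dilated layers enlarge the receptive field without adding parameters.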

Fused Attention Modules
The long-range dependencies of the CT image can be obtained by passing the input through several convolutional layers. A simple Conv + BN + ReLU operation cannot by itself capture the high- and low-frequency information of the feature map present during the pre-convolutional process. Hence, as demonstrated inside the BAFBs in Figure 1, spatial- and channel-attention modules are integrated, as shown in Figure 2A. The fusion of these boosting attention modules captures the long-range dependencies of the image during the feature extraction process. The structure of the attention modules is based on the boosting modules used in BAFGAN (Lyu et al., 2020). Without additional supervision, the fused attention mechanism allows the network to focus on the most relevant features. Rather than relying on similar feature maps, it highlights the primary features that are useful for LDCT denoising tasks (Sinha and Dolz, 2021).

Spatial Attention Module
On the one hand, the spatial attention module (SAM), f_SAM(·), in Figure 2B uses the feature maps obtained from the third convolutional block, f_CR1(x), of the network as an input. The process can be represented as follows: Further, it uses the SoftMax activation function, also known as the normalized exponential function, for a smoother normalization across dimensions, so that each component lies in the interval [0, 1]. This helps incorporate prior assumptions about the spatial topology of the image structure. For this spatial network, the assumption is that the feature vectors depend on each other in a spatially smooth and consistent way (Miladinović et al., 2021). The main purpose is to improve the performance of FAM-DRL with the additional spatial dependency layers shown in Figure 2B.
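The role of the SoftMax normalization can be illustrated with a small pure-Python sketch. This is a generic softmax-weighted spatial gating under stated assumptions, not the SAM equation of the paper (which is not reproduced in this text): the score map is assumed to be given, and the names `spatial_softmax` and `apply_spatial_attention` are hypothetical.

```python
import math


def spatial_softmax(scores):
    """Softmax over all positions of a 2D score map; outputs lie in [0, 1] and sum to 1."""
    flat = [v for row in scores for v in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    width = len(scores[0])
    probs = [e / total for e in exps]
    return [probs[i:i + width] for i in range(0, len(probs), width)]


def apply_spatial_attention(feature, scores):
    """Re-weight a 2D feature map position by position with the softmax weights."""
    weights = spatial_softmax(scores)
    return [[f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feature, weights)]


scores = [[0.0, 1.0], [2.0, 3.0]]        # assumed attention scores
feature = [[1.0, 1.0], [1.0, 1.0]]
weights = spatial_softmax(scores)
attended = apply_spatial_attention(feature, scores)
```

Positions with larger scores receive larger weights, so the module emphasizes some spatial locations over others while keeping every weight inside [0, 1].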

Channel Attention Module
On the other hand, the channel attention module (CAM), f_CAM(·), uses the same input as SAM, but it captures channel-wise features in addition to the long-range dependencies. The channel attention module pipeline is demonstrated in Figure 2C, which can also be represented as follows: This module uses average pooling, which permits a small amount of invariance in the image and could extract more features than normal max pooling. This enhances the features from all the channels, increasing feature discriminability for preserving structural details of the image. A sigmoid activation function, or logistic function, is also used to capture nonlinearities, which allows the network to learn more complex structures in the data.
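The combination of average pooling and a sigmoid gate can be sketched as follows. This is a simplified illustration, not the CAM pipeline of Figure 2C: it assumes global average pooling per channel and omits any learned layers between the pooling and the sigmoid, and the name `channel_attention` is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(features):
    """Gate each channel of a C x H x W nested list by the sigmoid of its mean.

    Assumption: real channel-attention modules usually place learned layers
    between the pooling and the sigmoid; they are left out here.
    """
    gates = []
    for channel in features:
        count = sum(len(row) for row in channel)
        mean = sum(sum(row) for row in channel) / count  # global average pooling
        gates.append(sigmoid(mean))
    scaled = [[[v * g for v in row] for row in channel]
              for channel, g in zip(features, gates)]
    return scaled, gates


features = [
    [[0.0, 0.0], [0.0, 0.0]],  # channel with no response
    [[2.0, 2.0], [2.0, 2.0]],  # channel with a strong response
]
scaled, gates = channel_attention(features)
```

The channel with the stronger average response receives a gate closer to 1, so its features are passed on with less attenuation.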

Overall Attention Module
In order for the spatial and channel-wise characteristics to complement each other, a fusion, f_fused(·), between the two is applied, and inner skip connections are implemented. Mathematically, where ⊕ denotes element-wise addition and © represents channel concatenation in this case. At the end of this module, the newly generated features are fed into a convolutional layer, f_c, producing the spatial-channel attention features.
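The two operators named above can be illustrated on small nested lists. The wiring in `fuse` is a hypothetical arrangement chosen only to show both operators together; the paper's own fusion equation is not reproduced in this text, so the order of operations here is an assumption.

```python
def elementwise_add(a, b):
    """Element-wise addition of two C x H x W nested lists of identical shape."""
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def channel_concat(a, b):
    """Channel concatenation: the channels of b are stacked after those of a."""
    return a + b


def fuse(x, sam_out, cam_out):
    """Hypothetical wiring: add the block input to each branch (inner skip
    connections), then concatenate the two results along the channel axis."""
    return channel_concat(elementwise_add(x, sam_out),
                          elementwise_add(x, cam_out))


x = [[[1.0, 2.0], [3.0, 4.0]]]           # one 2 x 2 input channel
sam_out = [[[10.0, 10.0], [10.0, 10.0]]]  # assumed spatial-branch output
cam_out = [[[0.0, 0.0], [0.0, 1.0]]]      # assumed channel-branch output
fused = fuse(x, sam_out, cam_out)
```

Element-wise addition keeps the channel count unchanged, whereas concatenation doubles it, which is why a convolutional layer follows the fusion to bring the features back to a common width.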

Loss Functions
For this research, a combination of three loss functions, 1) mean-squared error (MSE), 2) perceptual loss, and 3) structural similarity index, is proposed for the optimization of the overall network.

Per Pixel Loss
Mean squared error (MSE), considered a per-pixel loss function, is one of the most common accuracy measurements; it calculates the difference between the LDCT, x_i, and NDCT, y_i, images. The squared errors between corresponding pixels are then averaged: The application of MSE can cause an oversmoothing problem along the edges of CT images during the training process, as observed in CycleGAN and FFDNet denoising models (Gu and Ye, 2021).
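Under the standard definition of MSE, and using the symbols x_i and y_i from the paragraph above, the per-pixel loss can be written as the following sketch. The symbol N (the number of pixels) and the name L_MSE are notational assumptions, not taken from the source.

```latex
L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - y_i \right)^2
```

Because every pixel contributes equally and large deviations are penalized quadratically, minimizing this quantity favours averaged, smooth outputs, which is consistent with the oversmoothing noted above.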

Perceptual Loss
In order to address the blurring issue, the proposed model also utilizes the perceptual loss calculated using the pretrained VGG16 network (Simonyan and Zisserman, 2015). Unlike MSE, perceptual loss takes high-level features into consideration in order to correspond more accurately to the human visual system. This is due to its ability to learn the features more accurately, as shown in DRL-E-MP and cascaded CNN (Gholizadeh-Ansari et al., 2019; Ataei et al., 2020b). This perceptual loss utilizes the feature maps, ϕ_i, that are extracted from the last convolutional layer in blocks i = 1, 2, 3, 4 of the VGG16-Net with size h_i × w_i × d_i, which can be expressed as follows:
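A commonly used form of such a feature-space loss, written with the symbols ϕ_i, h_i, w_i, and d_i defined above, is sketched below. The squared Euclidean norm, the symbols ŷ (denoised output) and y (NDCT reference), and the name L_PL are assumptions; the paper's own expression may differ in normalization.

```latex
L_{\mathrm{PL}} = \sum_{i=1}^{4} \frac{1}{h_i \, w_i \, d_i}
  \left\lVert \phi_i(\hat{y}) - \phi_i(y) \right\rVert_2^2
```

Dividing by h_i w_i d_i keeps the contribution of each VGG16 block comparable even though the feature maps shrink spatially and grow in depth from block to block.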

Structural Similarity Index Metrics
Finally, the structural similarity index metric (SSIM) compares the structural information of images, such as texture, contrast, luminance, and compression (Kulathilake et al., 2021). The SSIM between the LDCT and NDCT can be calculated as follows: where μ, σ, and σ_xy stand for the mean, sample standard deviation, and sample covariance, respectively. However, this cannot be applied directly to the network as a loss function, since SSIM is meant to be maximized toward 1 and would therefore yield higher loss values for better outputs. Therefore, structural dissimilarity (DSSIM), expressed in Eq. 7, is implemented, which is the SSIM equivalent as a kernel loss function.

Overall Objective Function
The overall objective function for the proposed network can be represented as follows: where γ_1, γ_2, γ_3 are the sum-to-one weights for the three loss components and (Ŷ, Y) is the LDCT and ground-truth image pair. Each of the weights is determined during the training process, where the maximum value of the losses after each epoch is used for updating the values of the weights. The loss function that obtained the greatest loss receives a higher scale than the other functions.
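The standard SSIM definition, written with the symbols μ, σ, and σ_xy named above, together with one common convention for DSSIM, is sketched below. The stabilizing constants c_1 and c_2 are part of the usual definition but are not named in the text, and the factor 1/2 in DSSIM is an assumption; the paper's Eq. 7 may use a different scaling.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_{xy} + c_2)}
       {(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},
\qquad
\mathrm{DSSIM}(x, y) = \frac{1 - \mathrm{SSIM}(x, y)}{2}
```

With this convention, identical images give SSIM = 1 and DSSIM = 0, so minimizing DSSIM is equivalent to maximizing SSIM.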
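The weighted combination and the per-epoch weight update can be sketched as follows. The text only states that the weights sum to one and that the component with the greatest loss receives the larger scale; the proportional rule in `update_weights` is one assumed way to satisfy both conditions, and the loss values below are made-up placeholders.

```python
def update_weights(epoch_losses):
    """Assumed rule: make each weight proportional to its loss after an epoch,
    so that the weights sum to one and the largest loss gets the largest weight."""
    total = sum(epoch_losses.values())
    if total == 0:
        # All losses are zero: fall back to equal weights.
        return {name: 1.0 / len(epoch_losses) for name in epoch_losses}
    return {name: value / total for name, value in epoch_losses.items()}


def total_loss(losses, weights):
    """Weighted sum of the three loss components."""
    return sum(weights[name] * losses[name] for name in losses)


# Placeholder loss values after one epoch (illustrative, not measured results).
epoch_losses = {"mse": 0.02, "perceptual": 0.10, "dssim": 0.08}
weights = update_weights(epoch_losses)
combined = total_loss(epoch_losses, weights)
```

In this example the perceptual term has the greatest loss and therefore receives the largest weight for the next epoch.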

Training Environment
The training operation of the model, especially the parameters in the BMGs, follows the implementation of BAFGAN (Lyu et al., 2020). The proposed network was trained for 200 epochs with a batch size of 4 using the ADAM optimizer with a learning rate of 0.0002, β_1 = 0.01, and β_2 = 0.999. The model was implemented with the TensorFlow-Keras API on a Windows operating system with an Intel® Core™ i7 CPU @ 2.80 GHz processor and an NVIDIA GeForce GTX 1080 graphics card.
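For reference, a single update of the standard Adam rule with the hyperparameters listed above can be written in a few lines of pure Python. This is a scalar sketch of the published algorithm, not the training code of the study; the value of `eps` is an assumption, since it is not stated in the text.

```python
import math


def adam_step(param, grad, m, v, t,
              lr=0.0002, beta1=0.01, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter at step t (t starts at 1).

    lr, beta1, and beta2 follow the values quoted in the text;
    eps is an assumed small constant for numerical stability.
    """
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from zero-initialised moments with a positive gradient.
new_param, m1, v1 = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by approximately the learning rate.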

Quantitative Results
This section provides the quantitative results of the model variations as well as the different algorithms: 1) modified BM3D, 2) patch-GAN, 3) DRL-E-MP, 4) FAM-DRL with MSE, 5) FAM-DRL with perceptual loss (PL), 6) FAM-DRL with SSIM, and 7) the proposed FAM-DRL with MSE + PL + SSIM. Table 2 summarizes the PSNR and SSIM obtained (bold values highlight the highest PSNR/SSIM value in each column), while Figures 3 and 4 show separate charts for the trend of the models in terms of PSNR and SSIM, respectively. Each model is run on the five datasets in order to obtain the average PSNR and SSIM. For the Piglet dataset, the average PSNR of the models ranges from 39 to 42, while the average SSIM ranges from 0.7 to 0.9, as shown in Table 2. In terms of PSNR, Figure 3 shows that the proposed FAM-DRL achieved a slightly higher PSNR (42.93) than the other models for the Piglet dataset. Looking at the SSIM trend in Figure 4, the difference between the average SSIM of the models is slightly smaller across the different datasets. Despite these small gaps between the SSIM of the models, the proposed model with the integration of the objective functions still ranks first in SSIM. Moreover, FAM-DRL with only the SSIM loss function ranks second, as expected, since the use of SSIM as a loss function aims to minimize the difference in structural information between the NDCT and LDCT image pairs. As for the other models, there is no clear pattern of which model comes next after the proposed model and the model with only the SSIM kernel function. This discrepancy is due to the variation of the structural information of the different datasets.
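The PSNR values reported above follow from the mean squared error between a reference image and a test image. A minimal pure-Python sketch of the standard definition is given below; the `data_range` argument (the maximum possible pixel value) and the toy images are assumptions, since the intensity scaling used in the study is not stated here.

```python
import math


def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    squared_error = 0.0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            squared_error += (r - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy example: every pixel of the test image is off by 0.1 on a [0, 1] scale.
reference = [[0.0, 1.0], [1.0, 0.0]]
test_image = [[0.1, 0.9], [0.9, 0.1]]
value = psnr(reference, test_image)
```

Because PSNR is logarithmic in the MSE, a gain of a few tenths of a dB corresponds to a small but consistent reduction in the average pixel error.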

Visual Results
In Figure 5 sample results are displayed utilizing the first slice of each dataset. The marked regions in Figure 5 correspond to structural details of the image where the differences between the algorithms are pronounced.
The marked regions shown in each dataset slice image in Figure 5 are highlighted in Figures 6i-10i along with the visual results of the algorithms, Figures 6-10ii-viii. Investigating the visual results of BM3D, there is an obvious oversmoothing problem that can be observed in Figures 6, 8, 10ii as well as apparent checkbox artifacts in Figures 7, 8, 9ii. Markovian patch-GAN and FAM-DRL (MSE) show slightly better visual results than BM3D but still display similar problems. This is because these algorithms only use MSE as the loss function, which is well known for causing oversmoothing along the edges. In comparison to the visual results of FAM-DRL with only perceptual loss in Figures 6iii-10iii, FAM-DRL (SSIM) shows evident artifacts in Figures 6-10iv. Despite the apparent artifacts, FAM-DRL (SSIM) is able to preserve the textural details of the images. The combination of these three objective functions embedded in the proposed network presents well-structured denoised images closer to the NDCT or ground-truth images, as demonstrated in Figures 6vii-10vii and Figures 6viii-10viii, respectively.

DISCUSSION
The overall results show that the proposed FAM-DRL with the integration of the three loss functions outperforms the benchmark models as well as variations of the model itself.
While FAM-DRL with only perceptual loss obtained a higher PSNR than the other two variations of the proposed model (FAM-DRL with MSE, FAM-DRL with SSIM), FAM-DRL with only SSIM gained higher SSIM values, as demonstrated in Figure 4. For the visual results, oversmoothing along edges is noticeable when only MSE is applied to the network; the perceptual quality is visibly enhanced, but some abnormalities are introduced, when perceptual loss is used; and image texture is more enhanced when SSIM is applied. Despite the drawbacks displayed by each loss function, combining the three in the network compensates for their individual limitations. Hence, the overall proposed model shows promising results when compared to the state-of-the-art models.
The modified BM3D and patch-GAN acquired the lowest PSNR and SSIM values, summarized in Table 2, which are slightly lower than the FAM-DRL with MSE loss function only. These models implemented the use of MSE loss function. Although the outputs from MSE are acceptable quantitatively, this does not guarantee having appealing visual results since this loss function typically causes blurring effects. The regions shown in Figures 6-10 correspond to structural details of the image where the differences between the algorithms are most pronounced as marked in Figure 5. The visual results of the models that utilized MSE as loss functions are shown in Figures 6-10ii-iii. Based on these results, blurring effects and noise artifacts stand out when compared to the variations of the proposed models.
According to Table 2, the PSNR/SSIM for DRL-E-MP is very close to the results obtained for FAM-DRL with perceptual loss only and for the proposed FAM-DRL with the combination of the objective functions. This is because both models used the same perceptual loss functions derived from the same blocks in VGG16-Net. This can also be observed in the PSNR and SSIM trends in Figures 3 and 4, respectively, not only in the quantitative results but also in the visual results demonstrated in Figures 6-10, in which the images show the specific regions selected for each dataset. The models that use perceptual loss display more natural and perceptually appealing results. Even though the use of perceptual loss seems effective enough, it can introduce some anomalies due to regularization and hyper-parameter tuning, since the perceptual loss applied for this experiment used the pre-trained VGG16-Net. For example, there is an apparent generation of a fracture in Figure 7v that could indicate a remodelled bone. Blurring effects can also be seen in Figures 8 and 9, which contain fine details of the images.
That being said, when perceptual loss was combined with the other loss functions, the proposed model obtained the highest PSNR and SSIM compared with the other models for all the datasets used. The overall visual result of the proposed model contains textural details close to the ground-truth image while also maintaining high SNR and avoiding the oversmoothing problem from applying MSE. Therefore, this study can be deemed successful, as it meets the expected results experimentally.
Although the improvements shown in this paper are attributed to the fused attention modules, which the benchmark models do not have, a further comparison could be carried out in future research that focuses on the effectiveness of the attention modules by testing the accuracy of the architecture with and without them. Moreover, comparison metrics such as contrast-to-noise ratio (CNR) and noise power spectrum (NPS), which measure the intensity difference at low-contrast regions and the texture quality of the LDCT and NDCT paired images, could be used for comparative studies (Brombal et al., 2019). However, CNR does not capture the dependence of the visibility of the image structure on the detail size, which PSNR can measure. Moreover, using NPS for measuring the accuracy would also require the background of the images to be removed, since it is highly dependent on the various characteristic image parameters (Dolly et al., 2016). These include the size and number of the regions of interest, which should be constant within the images; otherwise, statistical fluctuations would result. Therefore, a phantom-based study is desired for future work to examine the image quality of denoised LDCT images in terms of CNR and NPS measures.
In addition, almost all existing deep learning-based approaches, like the proposed network, usually require LDCT and NDCT paired training datasets. However, there is no guarantee that paired LDCT and NDCT images are readily available. The paired datasets for post-reconstruction of CT images can be acquired from multiple scans, like the datasets provided by the Mayo Clinic, or from data simulations for producing matches from unpaired data, like the datasets provided by Yi and Babyn. For the network model to be trained without LDCT-NDCT image pairs, an unsupervised learning method is recommended.

CONCLUSION
In this experiment, it was shown that creating feature maps by fusing spatial- and channel-attention modules can enhance the SNR of the images. The use of dilated convolutions and skip connections, the main components of the proposed model, also offset the increased computational costs that are commonly seen in GAN-based denoising models.
In addition, we demonstrated the individual contributions and limitations of each objective function used in the network, such as perceptual loss for enhancement of the perceptual visual results and the SSIM kernel loss function for image enhancement.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.