Intelligent identification of pavement cracks based on PSA-Net

The identification of pavement cracks is critical for ensuring road safety. Currently, manual crack detection is quite time-consuming, so automated pavement crack detection technology is required. However, automatic pavement crack recognition remains challenging owing to the intensity heterogeneity of cracks and the complexity of the background, e.g., the contrast between damage and background is low, and the background may contain shadows of comparable intensity. Motivated by breakthroughs in deep learning, we present a new network architecture that combines the feature pyramid with the attention mechanism (PSA-Net). Within the feature pyramid, the network integrates spatial information and underlying features for crack detection. During training, it improves the accuracy of automatic road crack recognition by using nested sample weighting to balance the loss contributed by simple and complex samples. To verify the effectiveness of the proposed technique, we tested it on a dataset of real road cracks and compared it with other crack detection methods.


Introduction
With the continuous development and improvement of China's road infrastructure, how to carry out scientific and intelligent maintenance management has become a fundamental research problem. The diagnosis of pavement diseases is an important issue in road maintenance, and cracking is the dominant type of pavement disease, so the identification of pavement cracks is critical for ensuring road safety. In this work, pavement crack detection methods that are not based on deep learning are referred to as traditional crack-detecting methods. In the past few years, many researchers have worked on the automated detection of cracks. These works can be divided into five categories: 1) Crack detection methods based on the wavelet transform: the wavelet transform decomposes the image into different frequency bands, and defects and noise are converted into wavelet coefficients of distinct amplitude, which allows the transform to be applied to pavement crack detection. Peggy et al. (2006) created a complex coefficient map by applying a multi-scale 2D wavelet transform (Peggy et al., 2006); the crack region is then determined by scanning the wavelet coefficients from the largest scale to the smallest scale. However, this method cannot handle cracks with limited continuity or significant curvature. 2) Crack detection methods based on image thresholding: scholars use preprocessing algorithms to reduce lighting artifacts and then threshold the image to obtain candidate cracks. The processed crack images are further refined using morphological techniques (Chambon and Jean-Marc, 2011; Huang and Zhang, 2012; Xu et al., 2013; Li et al., 2015). These methods were developed further, and newer graph-based methods can refine the crack candidates (Zou et al., 2012; Kelwin and Lucian, 2014; Marcos et al., 2016). 
3) Crack detection based on handcrafted features and classification: most conventional crack-detecting algorithms rely on handcrafted features and patch-based classifiers. Handcrafted features, such as HOG and LBP, are extracted from the image patches as descriptors. Then, a classifier, such as a support vector machine, is used for crack recognition and classification (Hu and Zhao, 2010; Srivatsan et al., 2014; Rafal et al., 2015). 4) Crack detection based on edge detection: Yan et al. (2007) introduced a morphological filter into crack detection and used an improved median filter to remove noise (Yan et al., 2007). Albert and Nii (2008) applied the Sobel edge detector to detect cracks and used a two-dimensional empirical mode decomposition algorithm to remove speckle noise (Albert and Nii, 2008). Random structured forests have also been combined with structural information for crack detection (Piotr and Lawrence, 2013; Shi et al., 2016). 5) Crack detection based on the minimal path: a shortest path approach was suggested for extracting simple open curves from images to achieve crack detection. Vivek et al. (2012) proposed an improved minimal path method that detects contours of the same type in the image structure to achieve crack detection (Vivek et al., 2012). This enhanced method requires less a priori knowledge about the topology and endpoints of the required curve. Amhaz et al. (2016) proposed a two-stage approach for crack detection: first, endpoints are selected at the local scale, and second, the minimal path is selected at the global scale (Amhaz et al., 2016). Nguyen et al. (2011) proposed a two-stage approach for crack detection that introduces free-form anisotropy features and simultaneously considers the intensity and the shape of cracks to complete their identification and detection (Nguyen et al., 2011). However, with traditional crack detection methods, cracks remain extremely challenging to identify and detect because of the strong influence of human factors and the low efficiency of transform-based detection, and these methods are not applicable to complex scenes. Their performance is still limited.
In the last several years, deep learning has made extraordinary progress in the field of computer vision, and scholars have made several attempts to use deep learning techniques for crack identification. Zhang et al. (2016) developed a patch-based crack detection neural network comprising four convolutional layers and two fully connected layers. In addition, Zhang et al. (2016) compared their approach with handcrafted features to demonstrate the advantages of deep learning methods in feature representation. Pauly et al. (2017) used a deeper neural network to identify road cracks (Pauly et al., 2017). Feng et al. (2017) proposed a deep active learning system to deal with the problem of limited labeling resources (Feng et al., 2017). Eisenbach et al. (2017) presented a road condition dataset for training a deep learning network and first evaluated and analyzed state-of-the-art methods for pavement distress detection. The approaches mentioned above treat crack detection as a patch-based classification problem: each picture is divided into small patches, and a deep neural network is then trained to determine whether or not each patch contains a crack (Eisenbach et al., 2017). However, this approach has a complex operational process and is sensitive to the size of the patches. With the rapid development of semantic segmentation (Jonathan et al., 2017; Vijay et al., 2017; Fan et al., 2018), many algorithms have been applied to different scenarios with good results. Schmugge et al. (2017) proposed a SegNet-based crack segmentation method that detects cracks by aggregating crack probabilities. In conclusion, deep learning-based techniques show great potential for road crack detection applications (Stephen et al., 2017). Joshi et al. (2022) adopted a segmentation-based deep learning method for surface crack detection. Wang et al. (2022) presented a fully convolutional network architecture for crack detection in fast-stitching images. 
To this end, we propose a new deep learning framework called PSA-Net for the particular task of road crack detection. It focuses on three aspects, namely, multi-scale feature information extraction, a spatio-temporal attention mechanism, and pyramid pooling, and it exploits the contextual semantic information and edge information of crack images to achieve end-to-end pavement crack detection, with the aim of improving the accuracy of intelligent crack recognition and detection.

FIGURE 1
Schematic diagram of the network framework.

Frontiers in Environmental Science
frontiersin.org

Methods
For segmentation tasks, contextual information affects the quality of the segmentation. Generally speaking, when we judge the category of an object, besides directly observing its appearance, we sometimes also rely on the environment in which it appears; ignoring this context can lead to wrong judgments. The intelligent identification and detection of pavement cracks in this paper is similar to a segmentation task, in which auxiliary information can help improve the accuracy and precision of crack identification. This subsection first describes the composition of the algorithm. The network has an encoder and decoder architecture, as shown in Figure 1. We use ResNet-101 as the backbone in the feature extraction stage (He et al., 2016). The encoder uses the pre-trained ResNet-101 model and the dilated convolution strategy to extract the feature map, which is 1/8 the size of the input. The pyramid pooling module fuses the feature map to obtain a fused feature carrying global information, which is upsampled and concatenated with the feature map from before pooling. Finally, the output is obtained by a convolutional layer.

Dilated convolution
Dilated convolution is a technique for semantic image segmentation that addresses the problem that downsampling reduces image resolution and causes information loss. By inserting holes to expand the receptive field, an original 3 × 3 convolution kernel can cover a receptive area of 5 × 5 (dilation rate = 2) or larger with the same number of parameters and the same amount of computation, thus eliminating the need for downsampling. Its advantage is that it enlarges the receptive field so that each convolution output contains a wider range of information, without pooling and the information loss that pooling causes, under the same computational conditions. Dilated convolution is often used in real-time image segmentation. It can be used when a network layer requires a larger receptive field but the number or size of convolution kernels cannot be increased owing to restricted computing resources. The feature extraction module of our network uses dilated convolution to enlarge the receptive field and further improve segmentation efficiency. The mathematical expression of dilated convolution is (Yu and Koltun, 2015)

$$M(i, j) = \sum_{x=1}^{X} \sum_{y=1}^{Y} h(i + atr \cdot x,\; j + atr \cdot y)\, g(x, y)$$

where i and j denote the positions of the image; X and Y represent the length and width of the input image, respectively; h(i, j) denotes the feature value of the input image at (i, j); atr denotes the dilation (atrous) rate; g is the convolution kernel function; and M(i, j) denotes the result of the input image after convolution. Dilated convolution expands the size of the convolution kernel by zero-filling in the standard convolution, so that it can better capture the context information on the feature map. The size of the dilated convolution is adjusted through the atrous rate.
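The way a dilation rate enlarges the receptive field without adding weights can be illustrated with a small NumPy sketch. This is a minimal illustration under stated assumptions, not the implementation used in PSA-Net: the function names `effective_kernel_size` and `dilated_conv2d` are hypothetical, and the sketch uses a single channel, stride 1, and no padding.

```python
import numpy as np


def effective_kernel_size(k: int, rate: int) -> int:
    """Side length covered by a k x k kernel at the given dilation rate."""
    return k + (k - 1) * (rate - 1)


def dilated_conv2d(h: np.ndarray, g: np.ndarray, rate: int = 1) -> np.ndarray:
    """Single-channel dilated cross-correlation, stride 1, no padding."""
    kh, kw = g.shape
    out_h = h.shape[0] - effective_kernel_size(kh, rate) + 1
    out_w = h.shape[1] - effective_kernel_size(kw, rate) + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for x in range(kh):
        for y in range(kw):
            # Each kernel tap reads the input at an offset scaled by the rate.
            out += g[x, y] * h[x * rate:x * rate + out_h,
                               y * rate:y * rate + out_w]
    return out


image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))
dense = dilated_conv2d(image, kernel, rate=1)    # covers 3 x 3 per output
dilated = dilated_conv2d(image, kernel, rate=2)  # covers 5 x 5, same 9 weights
```

With the 7 × 7 toy input, the rate-2 output is smaller (3 × 3 instead of 5 × 5) only because no padding is applied; in practice padding is usually chosen so that the resolution is preserved.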

Pyramid pooling
The pyramid pooling module aggregates contextual information from different regions and enhances the capacity to capture global information. Experiments demonstrate that such a prior representation (the pyramid structure of PSP) is effective and produces outstanding results on a variety of datasets. The module fuses features at four pyramid scales. The first row, in red, is the coarsest level, global pooling, which generates a single-bin output, and the next three rows are pooled features at different scales (as shown in Figure 1). To maintain the weight of the global features, if the pyramid has a total of N levels, a 1 × 1 convolution is applied after each level to reduce the number of channels to 1/N of the original. The low-dimensional feature maps are then upsampled by bilinear interpolation to the size they had before pooling and concatenated. The pooling kernel size of each pyramid level is adjustable and depends on the size of the input fed to the pyramid. We used four levels with kernel sizes of 1 × 1, 2 × 2, 3 × 3, and 6 × 6.
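The pooling, channel reduction, upsampling, and concatenation steps described above can be sketched in NumPy as follows. This is an illustrative sketch only: the function names are hypothetical, the random matrix stands in for a learned 1 × 1 convolution, and the bilinear interpolation assumes corner-aligned sampling, none of which is confirmed by the paper.

```python
import numpy as np


def adaptive_avg_pool(feat: np.ndarray, bins: int) -> np.ndarray:
    """Average-pool a (C, H, W) map into a (C, bins, bins) grid."""
    c, h, w = feat.shape
    out = np.zeros((c, bins, bins))
    for i in range(bins):
        r0, r1 = (i * h) // bins, ((i + 1) * h + bins - 1) // bins
        for j in range(bins):
            c0, c1 = (j * w) // bins, ((j + 1) * w + bins - 1) // bins
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out


def bilinear_upsample(feat: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinear resize of a (C, h, w) map (corner-aligned sampling assumed)."""
    _, h, w = feat.shape
    rows, cols = np.linspace(0, h - 1, out_h), np.linspace(0, w - 1, out_w)
    r0, c0 = np.floor(rows).astype(int), np.floor(cols).astype(int)
    r1, c1 = np.minimum(r0 + 1, h - 1), np.minimum(c0 + 1, w - 1)
    fr, fc = rows - r0, cols - c0
    top = feat[:, r0][:, :, c0] * (1 - fc) + feat[:, r0][:, :, c1] * fc
    bot = feat[:, r1][:, :, c0] * (1 - fc) + feat[:, r1][:, :, c1] * fc
    return top * (1 - fr)[None, :, None] + bot * fr[None, :, None]


def pyramid_pooling(feat: np.ndarray, bin_sizes=(1, 2, 3, 6), seed: int = 0):
    """Concatenate the input with N pooled, reduced, and upsampled branches."""
    c, h, w = feat.shape
    reduced_c = c // len(bin_sizes)  # each level keeps 1/N of the channels
    rng = np.random.default_rng(seed)
    branches = [feat]
    for bins in bin_sizes:
        pooled = adaptive_avg_pool(feat, bins)
        # Assumption: random weights stand in for a learned 1 x 1 convolution.
        weight = rng.standard_normal((reduced_c, c)) / np.sqrt(c)
        reduced = np.einsum("oc,chw->ohw", weight, pooled)
        branches.append(bilinear_upsample(reduced, h, w))
    return np.concatenate(branches, axis=0)


features = np.random.default_rng(1).standard_normal((8, 12, 12))
fused = pyramid_pooling(features)  # 8 input channels + 4 levels x 2 channels
```

Because each of the four levels contributes 1/4 of the input channels, the pooled context as a whole carries the same number of channels as the original feature map it is concatenated with.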

Feature fusion
Feature fusion is a popular component of current network architectures that merges features from distinct layers or branches. It is often performed using simple operations (such as summation or concatenation); however, this is not always the best option. We adopt a unified, general scheme for attentional feature fusion that applies to the most common scenarios, including short and long skip connections and the feature fusion within inception layers. As shown in Figure 1 and Figure 2, we present the multi-scale channel attention module, which addresses the challenge of fusing information supplied at distinct scales, so that features with inconsistent semantics and scales are fused better. We also show that the initial integration of feature maps can be a bottleneck, which can be alleviated by adding another attention level. With fewer parameters or network layers, our model outperforms the latest networks on both road crack segmentation datasets, suggesting that a more sophisticated attention mechanism for feature fusion has the potential to consistently produce better results than direct feature fusion.
In the channel attention module, the input feature map is passed through maximum pooling and average pooling, and both pooled results are fed into a shared MLP layer. The two outputs of the shared MLP layer are then summed elementwise and activated by the sigmoid function to obtain the feature map of the channel attention module. The channel attention module (CAM) thus compresses the feature map along the spatial dimensions into a one-dimensional vector and then operates on it; it focuses on what is essential in the feature map. During gradient backpropagation, mean pooling passes feedback to every pixel of the feature map, whereas maximum pooling passes gradient feedback only to the location where the response in the feature map is strongest.
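The channel attention computation described above can be sketched as follows. This is a minimal NumPy sketch under assumptions: the two-layer shared MLP with a ReLU, the reduction ratio of 4, and the names `channel_attention`, `w1`, and `w2` are illustrative choices, not details confirmed by the paper.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray):
    """Reweight the channels of a (C, H, W) map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r); together they
    form the shared MLP applied to both pooled vectors.
    """
    avg_vec = feat.mean(axis=(1, 2))  # spatial average pooling -> (C,)
    max_vec = feat.max(axis=(1, 2))   # spatial maximum pooling -> (C,)

    def shared_mlp(v: np.ndarray) -> np.ndarray:
        return w2 @ np.maximum(w1 @ v, 0.0)  # hidden ReLU is an assumption

    weights = sigmoid(shared_mlp(avg_vec) + shared_mlp(max_vec))
    return feat * weights[:, None, None], weights


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 6, 6))
w1 = rng.standard_normal((2, 8)) * 0.1  # hypothetical reduction ratio r = 4
w2 = rng.standard_normal((8, 2)) * 0.1
refined, channel_weights = channel_attention(features, w1, w2)
```

Each channel is scaled by one weight between 0 and 1, so the spatial layout of the feature map is unchanged while less informative channels are suppressed.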
The feature map generated by the channel attention module is used as the input to the spatial attention module (as shown in Figure 3). First, maximum pooling and average pooling are performed along the channel dimension, and the two resulting maps are concatenated. A convolution then reduces the result to one channel, and the sigmoid function produces the feature map of the spatial attention module.
The spatial attention module compresses the channels by performing mean and maximum pooling along the channel dimension. Maximum pooling extracts the greatest value across the channels at each position, so the number of extractions is H × W. Average pooling takes the average value across the channels at each position, and the number of extractions is also H × W. As a result, a two-channel feature map is generated.
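The spatial attention steps can be sketched in the same style. This is a minimal NumPy sketch assuming a single square convolution kernel with zero padding; the kernel size and the name `spatial_attention` are hypothetical, since the paper does not state them.

```python
import numpy as np


def spatial_attention(feat: np.ndarray, kernel: np.ndarray):
    """Reweight the positions of a (C, H, W) map.

    kernel has shape (2, k, k) with odd k and maps the stacked
    [max, mean] maps to a single-channel response.
    """
    max_map = feat.max(axis=0)    # H x W values, one per position
    avg_map = feat.mean(axis=0)   # H x W values, one per position
    stacked = np.stack([max_map, avg_map])  # two-channel map (2, H, W)

    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    h, w = max_map.shape
    response = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            taps = kernel[:, dy, dx, None, None]
            response += (taps * padded[:, dy:dy + h, dx:dx + w]).sum(axis=0)

    mask = 1.0 / (1.0 + np.exp(-response))  # sigmoid, one weight per position
    return feat * mask[None], mask


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 5))
kernel = rng.standard_normal((2, 3, 3)) * 0.1  # hypothetical 3 x 3 kernel
attended, spatial_mask = spatial_attention(features, kernel)
```

The mask has one value per spatial position, which complements the per-channel weights of the channel attention module.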

Loss function
The loss function measures the gap between the values predicted by the model and the actual values. It provides a criterion that helps the training mechanism optimize the parameters, so as to find the network parameters that give the highest accuracy. We want the predicted value to be as close as possible to the actual value, so the difference needs to be minimized, which is why a loss function is introduced. The choice of the loss function is critical: with some loss functions the gradient of the difference falls quickly, while with others it falls slowly.

FIGURE 4
Flow chart of algorithm implementation.

As the most common loss function, cross-entropy is not optimal for tasks like road crack image segmentation, in which the objects often occupy only a small area of the image, as is also the case in some medical image segmentation tasks such as retinal vessel and eye segmentation. We use the Dice coefficient loss instead of the common cross-entropy loss; the Dice coefficient measures the overlap with the ground truth and is widely used to evaluate segmentation performance. Suppose k is the class label, where k ∈ {1, 2, ..., K} and K ∈ N. The ground truth label vector and the predicted probability vector can be expressed as Y = {y_1(k), y_2(k), ..., y_i(k), ..., y_N(k)} and Ŷ = {ŷ_1(k), ŷ_2(k), ..., ŷ_i(k), ..., ŷ_N(k)}, where y_i(k) ∈ {0, 1} and ŷ_i(k) ∈ [0, 1]. The Dice loss function can be expressed as follows:

$$L_{Dice} = 1 - \sum_{k=1}^{K} \frac{2\,\omega_k \sum_{i=1}^{N} y_i(k)\,\hat{y}_i(k)}{\sum_{i=1}^{N} y_i^2(k) + \sum_{i=1}^{N} \hat{y}_i^2(k)}$$

where N denotes the number of pixels, and K and ω_k are the number of classes and the category weights, respectively. Figure 4 shows the flow chart of the algorithm of the proposed method in the paper.
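A class-weighted Dice loss of the kind described above can be sketched in NumPy. This is a sketch under assumptions, not the authors' code: the squared terms in the denominator, the uniform default weights of 1/K, the small `eps` constant, and the name `weighted_dice_loss` are all illustrative choices.

```python
import numpy as np


def weighted_dice_loss(y_true, y_prob, class_weights=None, eps=1e-7):
    """Class-weighted Dice loss.

    y_true: (K, N) one-hot ground truth, values in {0, 1}.
    y_prob: (K, N) predicted probabilities, values in [0, 1].
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    num_classes = y_true.shape[0]
    if class_weights is None:
        # Assumption: uniform weights that sum to 1.
        class_weights = np.full(num_classes, 1.0 / num_classes)
    intersection = (y_true * y_prob).sum(axis=1)
    # Assumption: squared terms in the denominator (one common Dice variant).
    denominator = (y_true ** 2).sum(axis=1) + (y_prob ** 2).sum(axis=1)
    dice_per_class = 2.0 * intersection / (denominator + eps)
    return 1.0 - float((np.asarray(class_weights) * dice_per_class).sum())


# Toy example: 4 pixels, class 0 = crack (1 pixel), class 1 = background.
truth = np.array([[1, 0, 0, 0],
                  [0, 1, 1, 1]])
perfect_loss = weighted_dice_loss(truth, truth)
worst_loss = weighted_dice_loss(truth, 1 - truth)
```

Because each class is normalized by its own pixel count, the single crack pixel in the toy example contributes as much to the loss as the three background pixels, which is the property that makes Dice-style losses attractive for thin cracks.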

Numerical experiments

Data set
We collected and labeled a road crack dataset containing a total of 600 images. We split the 600 images into a training set, a test set, and a validation set in the ratio of 8:1:1. We also conducted extensive experiments on several public benchmark datasets. The experimental results demonstrate that our method achieves state-of-the-art performance compared to the most common methods.
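An 8:1:1 split of 600 images gives 480 training, 60 test, and 60 validation images. A reproducible split of this kind could be sketched as follows; the file names, the fixed seed, and the function `split_dataset` are hypothetical, since the paper does not describe how the split was drawn.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into train/test/validation parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # remainder goes to validation
    return train, test, val


# Hypothetical file names standing in for the 600 labeled crack images.
image_ids = [f"crack_{i:04d}.png" for i in range(600)]
train_ids, test_ids, val_ids = split_dataset(image_ids)
```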

Evaluation indicators
To better evaluate the experimental results, we considered several performance metrics for experimental comparison, including accuracy, F1-score, recall, and precision:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP denotes the true-positive case, FP denotes the false-positive case, FN denotes the false-negative case, and TN denotes the true-negative case.
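For reference, the four metrics can be computed from the confusion counts as in the short sketch below. It uses the standard definitions of accuracy, precision, recall, and F1; the function name `crack_metrics` and the example counts are hypothetical and are not results from the paper.

```python
def crack_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard pixel-level metrics computed from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Made-up counts for a 1000-pixel image in which cracks are rare.
scores = crack_metrics(tp=80, fp=20, fn=10, tn=890)
```

In this made-up example the accuracy (0.97) is much higher than the F1-score (about 0.84), which shows why accuracy alone can be misleading when crack pixels are a small minority.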

Experimental details and experimental results
Our proposed network is an end-to-end architecture that uses a ResNet network pre-trained on ImageNet, with four downsampling operations in the encoder phase. We used the PyTorch deep learning framework to build our network. The training and testing platforms both ran Ubuntu 18.04, and the hardware consisted of two 3090 graphics cards, each with 24 GB of video memory. For training, we used mini-batch stochastic gradient descent (SGD) with a batch size of 8 and a learning rate of 0.0001. We also ran comparison experiments with the Adam and SGD optimizers and found that SGD usually performs better, whereas Adam converges in a shorter time. Although Adam converges faster, we made our choice by weighing performance against training time.
The results of our test on the dataset are shown in Figure 5, where image is the original RGB input image, ground truth represents our label, and predict is the result map predicted by our algorithm. It is worth noting that the test images were not exposed to the network in advance, to better validate the effectiveness of the proposed method. The experimental results show that the predictions of our algorithm on the crack images differ little from the ground truth, which supports the effectiveness and accuracy of our algorithm.
The performance comparison of the different methods, reported in the corresponding table, covers four metrics: precision, recall, accuracy, and F1. It can be seen that our method achieves the best performance on all four evaluation metrics compared with these state-of-the-art methods.
To further validate the performance of our method, we decomposed it into multiple parts and conducted a large number of combinatorial experiments to fully validate the contribution of each module. As shown in Table 2, the most basic CNN + CS (context spatial) yields a Dice metric of 0.87 on our dataset, and CNN + CS + ASPP (atrous spatial pyramid pooling) yields a Dice metric of 0.92. With ResNet as the backbone, the combination of ResNet + CS gives a Dice metric of 0.90 on our dataset, and ResNet + CS + ASPP gives a Dice metric of 0.94. In addition, the best performance of 0.98 is achieved when using a ResNet that has been pre-trained on ImageNet.
Step-by-step tests further validate the effectiveness of the proposed method in this study.

Conclusion
This study proposes a new deep learning network architecture called PSA-Net for road crack detection. Our algorithm is developed from three aspects, namely, multi-scale feature information extraction, spatio-temporal attention mechanism, and pyramidal pooling, focusing on the contextual semantic information and edge information on crack images, and it is an end-to-end segmentation algorithm. We designed PSA-Net for pavement crack detection without increasing the number of network parameters. We conducted our experiments on our road crack detection datasets. The experimental results show that our PSA-Net offers a significant advantage on various datasets, sufficient to prove the algorithm's effectiveness. In the future, we will continue to contribute to road crack detection and consider combining the detection model with the segmentation model to improve the performance of our algorithm further (H et al., 2013).

Data availability statement
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Author contributions
XL proposed the architecture of the algorithm and wrote the paper. JZ helped write the paper and implement some of the algorithms. DW and MaL provided experimental data. EM gave insight into the implementation process. MeL and FG labeled the images and helped further refine the paper.

Funding
This work was supported by the China National Key R&D Program under grant 2022YFF0802600.

Conflict of interest
DW and ML were employed by Chongqing Urban Construction Investment (Group) Co., Ltd.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.