Edge-Aware Pyramidal Deformable Network for Unsupervised Registration of Brain MR Images

Deformable image registration is of essential importance for clinical diagnosis, treatment planning, and surgical navigation. However, most existing registration solutions require a separate rigid alignment before deformable registration and may not handle large deformations well. We propose a novel edge-aware pyramidal deformable network (referred to as EPReg) for unsupervised volumetric registration. Specifically, we fully exploit the complementary information from multi-level feature pyramids to predict multi-scale displacement fields. Such coarse-to-fine estimation facilitates progressive refinement of the predicted registration field, which enables our network to handle large deformations between volumetric data. In addition, we integrate edge information with the original images as dual inputs, which enhances the texture structures of the image content and impels the proposed network to pay extra attention to edge-aware information for structure alignment. The efficacy of our EPReg was extensively evaluated on three public brain MRI datasets: Mindboggle101, LPBA40, and IXI30. Experiments demonstrate that EPReg consistently outperformed several cutting-edge methods with respect to the Dice similarity coefficient (DSC), Hausdorff distance (HD), and average symmetric surface distance (ASSD). The proposed EPReg is a general solution for the problem of deformable volumetric registration.


INTRODUCTION
Deformable image registration performs a spatial transformation between a pair of images and establishes a non-linear point-wise correspondence to achieve spatial consistency (Sotiras et al., 2013). By doing so, mono-/multi-modality information can be fused into the same coordinate system. It plays an important role in various medical imaging studies, providing complementary diagnostic information and enabling the investigation of changes in anatomical structures. Although many algorithms have been proposed over the past few decades (Sotiras et al., 2013; Shen et al., 2017; Haskins et al., 2020), registration remains a challenging task. Traditional registration methods can be computationally expensive and time-consuming due to their iterative optimization during the deformation estimation procedure. Moreover, most existing deformable registration solutions require a separate rigid alignment before non-rigid registration and may not handle large deformations well. Therefore, an efficient and accurate deformable registration scheme is still greatly needed to compensate for complicated non-rigid deformations.
As illustrated in Figure 1, the goal of registration is to match all corresponding anatomical structures in two images to the same spatial system through a plausible deformable transformation. To calculate the desired deformable transformation, several non-linear deformation algorithms (Klein et al., 2009) have been proposed, such as large deformation diffeomorphic metric mapping (LDDMM) (Beg et al., 2005; Auzias et al., 2011), standard symmetric normalization (Avants et al., 2008), and Demons (Vercauteren et al., 2009). These methods treat deformable registration as a procedure of iterative optimization for maximizing the similarity [such as mean square error (MSE) (Wolberg and Zokai, 2000), normalized mutual information (NMI) (Knops et al., 2006), and normalized cross-correlation (NCC) (Rao et al., 2014)] between the fixed and warped moving images. However, the iterative optimization strategy may take a relatively long time to deal with complicated volumetric deformations.
To address the aforementioned issue, deep neural networks have been widely investigated for the registration task in recent years (Haskins et al., 2020). Registration networks can aggregate abundant features from paired images to effectively predict the deformation field. Eppenhof et al. (2018) employed synthetic random transformations to train a registration framework based on a convolutional neural network (CNN). Fan et al. (2019) also applied a supervised CNN for image registration, using obtained ground-truth deformation fields as the supervision information. Uzunova et al. (2017) synthesized a large amount of realistic ground-truth data using a model-based strategy to train a registration network. Yang et al. (2017) proposed a patch-wise deformation prediction model, a deep encoder-decoder network devised to estimate the momentum parameterization of the LDDMM model. The major limitation of supervised registration networks (Cao et al., 2017; Sokooti et al., 2017; Uzunova et al., 2017; Yang et al., 2017; Eppenhof et al., 2018; Fan et al., 2019) is the prerequisite of ground-truth registration fields, whose quality highly affects network performance. However, unlike segmentation or detection tasks, it is generally difficult to obtain registration ground truth.
In contrast, some studies have focused on unsupervised deep learning algorithms, which have achieved great success in various registration tasks (Sheikhjafari et al., 2018; Kuang and Schmah, 2019). The mechanism of unsupervised registration networks is to obtain the deformation fields by maximizing the similarity between two images, thus removing the need for ground-truth deformations. Li and Fan (2018) proposed to predict deformation parameters using a fully convolutional network, but this is a 2D approach that tends to ignore the volumetric information. Rohé et al. (2017) utilized U-net (Ronneberger et al., 2015) to estimate the deformation field of 3D cardiac MR images and employed the sum of squared differences (SSD) as the similarity loss. Balakrishnan et al. (2018) proposed an end-to-end network with cross-correlation as its loss function and spatial transformer networks (STN) (Jaderberg et al., 2015) as the warping module. However, this network required a separate rigid alignment as a prerequisite. In addition to unsupervised learning, weakly supervised registration methods pay extra attention to the correspondences between the structural information of two images, such as corresponding anatomical landmarks extracted from prostate MR and ultrasound images (Hu et al., 2018). Weakly supervised networks with structural similarity can provide more reliable registration but still require a small amount of manual annotation.
One major challenge facing existing registration neural networks is an effective solution for large deformation compensation. To tackle this issue, Hu et al. (2019) proposed a registration network based on Balakrishnan et al. (2018), which warps multi-resolution feature maps to obtain the deformation field. However, in such a scheme, the low-resolution deformation field cannot be accurately propagated to the subsequent high-resolution features, which may degrade the registration accuracy. At the same time, many cascade/recursive networks (de Vos et al., 2019; Zhao et al., 2019a,b) have been proposed. The general idea of these networks is to progressively estimate the complicated transformation between the moving and fixed images, which is similar to the iterative optimization idea of traditional algorithms. For example, deep learning image registration (DLIR) (de Vos et al., 2019) combined affine and non-linear networks to calculate both affine alignment and non-linear registration. The volume tweening network (VTN) (Zhao et al., 2019b) cascaded several registration sub-networks, deforming the moving images multiple times according to the deformation estimates. The recursive cascade network (Zhao et al., 2019a) expanded the number of cascaded networks and calculated the similarity only at the last cascade for training. In general, cascade/recursive networks simplify the challenge of large deformation through progressive deformation estimation, but their performance can be affected by the training strategy and by cumulative errors introduced during cascaded propagation.
In this study, we devise a novel edge-aware pyramidal deformable network (EPReg) for unsupervised volumetric registration. The proposed EPReg is a dual-stream pyramid framework that composes dual inputs from the original images and their corresponding edge-aware maps, and generates multi-scale paired feature maps for recursively transforming the information between images into more representative features to predict a more accurate deformation field. The trained EPReg performs deformable registration in one forward pass. Extensive experiments on three 3D brain magnetic resonance imaging (MRI) datasets demonstrate that our proposed network achieves satisfactory registration performance.
The main contributions of our work are twofold:
1. We propose to fully exploit the complementary information from multi-level feature pyramids to predict multi-scale deformation fields. Such coarse-to-fine estimation facilitates progressive refinement of the predicted registration field, which enables our network to handle large deformations between volumetric images.
2. We integrate edge information with the original images as dual inputs, which enhances the texture structures of the image content and impels the proposed network to pay extra attention to edge-aware information for structure alignment.
The remainder of this paper is organized as follows. Section 2 presents the details of the edge-aware pyramidal deformable network. Section 3 shows the experimental results of the proposed EPReg for brain MRI registration. Section 4 provides the discussion of the proposed network, and the conclusion of this study is given in Section 5.

EDGE-AWARE PYRAMIDAL DEFORMABLE NETWORK
The proposed registration network is illustrated in Figure 2A, with its affine alignment block (Figure 2B) and deformable registration block (Figure 2C). We denote the input volumetric image pair as a fixed volume I_f and a moving volume I_m, and their edge maps as E_f and E_m, respectively. The EPReg network leverages a deformable pyramid to progressively transform the information between I_f-I_m and E_f-E_m into more representative features to predict a more accurate deformation field (φ_4 ∼ φ_1). The trained EPReg attains deformable registration in one forward pass. The following subsections first give a brief introduction to deformable registration and then present the details of our scheme, elaborating the novel edge-aware pyramidal deformable architecture.

Preliminaries
Volumetric registration establishes voxel-wise correspondences between different volumes (i.e., a fixed volume I_f and a moving volume I_m defined over the 3D spatial domain). The goal is to predict the optimal deformation field φ, so that the warped moving volume I_m • φ matches I_f. The optimization problem can be defined as:

φ̂ = argmin_φ L_sim(I_f, I_m • φ) + λ L_smooth(φ),

where I_m • φ denotes I_m warped by φ, L_sim defines the similarity criterion, L_smooth regularizes the deformation φ to match specific properties desired in the solution, and λ is a regularization parameter. There are several conventional formulations for L_sim and L_smooth, respectively. Common similarity measures include MSE, NMI, NCC, and the structural similarity index (SSIM) (Wang et al., 2004). L_smooth is often formulated as a regularizer on the spatial gradients of the displacement field.
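To make the objective concrete, the toy sketch below evaluates this registration objective for a 1-D signal, using MSE as L_sim and the squared spatial gradient of the displacement field as L_smooth. The helpers `warp_1d` and `registration_loss` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def warp_1d(m, phi):
    """Warp a 1-D signal m by displacement field phi (linear interpolation)."""
    x = np.arange(m.size, dtype=float) + phi   # sampling positions x + phi(x)
    x = np.clip(x, 0, m.size - 1)
    x0 = np.floor(x).astype(int)
    x1 = np.minimum(x0 + 1, m.size - 1)
    w = x - x0
    return (1 - w) * m[x0] + w * m[x1]

def registration_loss(f, m, phi, lam=1.0):
    """L_sim (MSE between fixed f and warped moving m) plus
    lam * L_smooth (mean squared spatial gradient of phi)."""
    l_sim = np.mean((f - warp_1d(m, phi)) ** 2)
    l_smooth = np.mean(np.gradient(phi) ** 2)
    return l_sim + lam * l_smooth
```

With the displacement field that exactly aligns the two signals, both the similarity and smoothness terms vanish, so the loss is zero.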

Pyramidal Deformable Network
The proposed EPReg is built on the dual-stream pyramid architecture shown in Figure 2A. The two encoder streams share parameters. As shown in Figure 3, the encoder consists of four down-sampling convolutional blocks. Each convolutional block contains a 3D strided convolution with a stride of 2, which halves the spatial dimensions. The second and third convolutional blocks employ residual connections (He et al., 2016); specifically, two residual blocks are applied successively, each consisting of two convolutional layers with a residual connection. For the last convolutional block, a 3D atrous spatial pyramid pooling (ASPP) (Wang et al., 2019) module resamples features at different scales to capture more representative multi-scale information. Batch normalization and rectified linear unit (ReLU) operations are applied in each convolutional block. The convolutional blocks capture hierarchical paired features (i.e., F_1 ∼ F_4 and the corresponding M_1 ∼ M_4) of the input volumetric pair, which are then used to progressively predict the multi-scale deformation fields (φ_4 ∼ φ_1). Such coarse-to-fine estimation based on paired feature pyramids enhances the capability for handling large-scale deformation estimation.
FIGURE 2 | The proposed edge-aware pyramidal deformable network for volumetric image registration. The schematic illustration of the proposed (A) dual-stream multi-level registration architecture, (B) affine alignment block, and (C) deformable registration block. The EPReg utilizes original images (fixed image I_f and moving image I_m) and their corresponding edge maps (E_f and E_m) to compose dual inputs, and generates multi-scale paired feature maps (F_1 ∼ F_4 and M_1 ∼ M_4) for transforming the information between images into more representative features to predict a more accurate deformation field (φ_4 ∼ φ_1).

We first perform rigid alignment on the feature maps F_4 and M_4, which carry high-level semantic information. Specifically, we devise an affine block to achieve global alignment. As shown in Figure 2B, the affine block consists of an affine convolutional layer (a residual block and a 1 × 1 × 1 convolution operation) and a global average pooling (GAP) layer. It takes the paired feature maps F_4-M_4 as input and outputs the affine deformation field φ_4, which contains 12 degrees of freedom:

φ_4 = f_a(F_4, M_4; θ_a),
where θ_a represents the parameters learned by the affine block f_a. According to the estimated φ_4, the 3D affine grid G_4 can be generated to warp the moving volume into rigid alignment with the fixed volume. Based on the paired feature pyramids, we then progressively carry out the non-rigid registration via the devised deformable block (see Figure 2C). The deformable block contains a deformable convolutional layer, which consists of two residual blocks and a 1 × 1 × 1 convolution operation. To estimate φ_i, the input of the deformable block includes three components, i.e., F_i, M_i warped by the previously estimated φ_{i+1}, and the fused previous feature maps (F_{i+1} and M_{i+1} warped by φ_{i+1}):

φ_i = f_d^i(F_i, M_i • φ_{i+1}, [F_{i+1}, M_{i+1} • φ_{i+1}]; θ_d^i), i = 1, 2, 3,

where θ_d^i represents the parameters learned by the i-th deformable block f_d^i. In such a way, based on the previous deformation estimate φ_{i+1} and the paired feature pyramids, each deformable block estimates an extra deformation φ_i, which can be integrated with φ_{i+1} to attain more accurate non-rigid registration. For the i-th deformable block, the 3D deformable grid G_i is the combination of G_{i+1} (with 2× upsampling) and φ_i (see Figure 2C). Figure 4 illustrates one example of the progressive deformation estimation and registration. It can be observed that the progressive deformation estimation gradually refines the whole deformation field. The low-resolution deformation field φ_3 contains coarse, global deformation information, while the high-resolution deformation field φ_1 captures more detailed local displacements. Thus, the whole deformation field can attain accurate registration.

FIGURE 4 | One example illustrating the multi-scale deformable registration fields and the procedure of progressive registration. It can be observed that the progressive deformation estimation gradually generates more explicit local displacements, thus providing increasingly accurate registration even for large deformation cases.
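The combination of G_{i+1} (2× upsampled) with the residual field φ_i can be sketched on a 1-D toy displacement field. The nearest-neighbour upsampling, and the doubling of displacement values when the voxel spacing halves, are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def upsample2(phi):
    """Nearest-neighbour 2x upsampling of a 1-D displacement field.
    Displacement values double because the voxel spacing halves."""
    return 2.0 * np.repeat(phi, 2)

def compose(coarse_to_fine_phis):
    """Combine deformation fields from coarse (phi_4) to fine:
    each level upsamples the running grid and adds its residual field."""
    grid = coarse_to_fine_phis[0]
    for phi in coarse_to_fine_phis[1:]:
        grid = upsample2(grid) + phi
    return grid
```

A single coarse displacement of one voxel thus becomes a four-voxel displacement at a resolution four times finer, with each level free to add local corrections.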
In summary, we utilize multi-level feature pyramids generated from paired volumes to estimate the multi-scale deformation fields. Our network generates the multi-scale deformation fields in a coarse-to-fine manner, which aggregates both high-level context information and low-level details. High-level context information is applied to the coarse and rigid alignment, while low-level details are devoted to the non-rigid registration.
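As a concrete illustration of the 12-degree-of-freedom affine alignment used at the coarsest level, the sketch below builds a 3-D sampling grid from a 3 × 4 affine matrix (9 linear plus 3 translation parameters). `affine_grid_3d` is a hypothetical numpy helper, not the paper's STN-based implementation:

```python
import numpy as np

def affine_grid_3d(theta, shape):
    """Build a 3-D sampling grid from a 12-parameter affine transform.

    theta: (3, 4) matrix, i.e., 9 linear + 3 translation parameters,
           matching the 12 degrees of freedom of the affine field.
    shape: (D, H, W) of the volume.
    Returns an array of shape (3, D, H, W) with transformed coordinates.
    """
    D, H, W = shape
    zz, yy, xx = np.meshgrid(np.arange(D), np.arange(H), np.arange(W),
                             indexing="ij")
    coords = np.stack([zz, yy, xx, np.ones_like(xx)], axis=0)  # homogeneous
    grid = theta @ coords.reshape(4, -1).astype(float)
    return grid.reshape(3, D, H, W)
```

With theta set to the identity [I | 0], the grid reproduces the original voxel coordinates; a non-zero last column translates them.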

Edge-Aware Dual Inputs
We integrate edge information with the original image as dual inputs for each encoding stream (see Figure 5), which enhances the texture structures of the image content and impels the proposed network to pay extra attention to edge-aware information for structure alignment.
Considering the effectiveness and easy implementation of the Sobel edge detector, a 3D Sobel operator is designed to extract the edges of the original volume. The 3D Sobel operator contains three filtering kernels, S_x, S_y, and S_z. Each kernel is a 3 × 3 × 3 tensor responsible for calculating the image gradient along the x-/y-/z-axis. Taking S_z as an example, it follows the standard separable construction: the derivative kernel [1, 0, -1] along the z-axis combined with the smoothing kernel [1, 2, 1] along each of the other two axes. Kernels S_x and S_y have the same kernel weights as S_z, but along different directions. S_x, S_y, and S_z are applied to convolve a volumetric image I, generating its corresponding edge map E as follows:

E = sqrt((S_x * I)^2 + (S_y * I)^2 + (S_z * I)^2),

where * denotes the convolution operation. The generated edge map impels the network to leverage edge-aware information for structure alignment.
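Under the separable-kernel construction described above, a minimal numpy sketch of the 3-D Sobel edge extraction might look as follows. `sobel_edges_3d` is an illustrative helper; a real implementation would use 3-D convolutions on the GPU, and the exact kernel weights of the paper are assumed to match the standard Sobel weights:

```python
import numpy as np

def _filter_along(vol, kernel, axis):
    """Separable 1-D convolution along one axis ('same' output size)."""
    return np.apply_along_axis(
        lambda line: np.convolve(line, kernel, mode="same"), axis, vol)

def sobel_edges_3d(vol):
    """Edge map E = sqrt(Gx^2 + Gy^2 + Gz^2) with a separable 3-D Sobel:
    derivative kernel [1, 0, -1] on one axis, smoothing [1, 2, 1] on the
    other two (the standard construction)."""
    d = np.array([1.0, 0.0, -1.0])   # derivative kernel
    s = np.array([1.0, 2.0, 1.0])    # smoothing kernel
    grads = []
    for ax in range(3):
        g = vol
        for a in range(3):
            g = _filter_along(g, d if a == ax else s, a)
        grads.append(g)
    return np.sqrt(sum(g ** 2 for g in grads))
```

A constant volume yields a zero edge response in the interior, while an intensity step produces a strong response at the boundary, which is exactly the structural cue the dual-input stream is meant to emphasize.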

Training Loss
We adopt the patch-based cross-correlation (Rao et al., 2014) as the similarity function:

L_sim(I_f, I_m • φ) = − Σ_{p∈Ω} [ Σ_{p_n} Ī_f(p_n) Ī_{m•φ}(p_n) ]² / ( Σ_{p_n} Ī_f(p_n)² · Σ_{p_n} Ī_{m•φ}(p_n)² ),

where Ī_f and Ī_{m•φ} denote the volumes with local mean intensities subtracted, p ∈ Ω denotes each voxel in the volumetric image, and Ω is the whole image domain. Voxel p_n iterates over the local v³ neighborhood (v = 9 in our implementation) of the volumetric patch centered at voxel p.
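A numpy sketch of this patch-wise cross-correlation loss is given below, with window sums computed by separable box filtering. `local_ncc` and `box_sum` are illustrative helpers, and the small ε added to the denominator is an assumed numerical-stability term not stated in the text:

```python
import numpy as np

def box_sum(x, v):
    """Sum over a local v^3 window via separable 'same' convolutions."""
    k = np.ones(v)
    for ax in range(x.ndim):
        x = np.apply_along_axis(
            lambda line: np.convolve(line, k, mode="same"), ax, x)
    return x

def local_ncc(f, m, v=9, eps=1e-5):
    """Patch-wise normalized cross-correlation loss (negated, to minimize):
    CC = sum_p [sum_pn fbar*mbar]^2 / (sum_pn fbar^2 * sum_pn mbar^2)."""
    n = float(v ** f.ndim)
    fbar = f - box_sum(f, v) / n          # subtract local mean intensity
    mbar = m - box_sum(m, v) / n
    cross = box_sum(fbar * mbar, v)
    f_var = box_sum(fbar * fbar, v)
    m_var = box_sum(mbar * mbar, v)
    cc = (cross ** 2) / (f_var * m_var + eps)
    return -np.mean(cc)
```

For identical inputs the local correlation approaches 1 everywhere, so the loss approaches its minimum of -1; for unrelated inputs the loss stays near 0.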
To avoid obtaining an implausible or discontinuous deformation field, we also add a diffusion regularizer L_smooth to impose a smoothness constraint on the spatial gradients of the overall deformation field φ:

L_smooth(φ) = Σ_{p∈Ω} ‖∇φ(p)‖²,

where φ = Σ_{i=1}^{4} upsample_{2^i}(φ_i) is the aggregation of the multi-scale deformation fields.
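The diffusion regularizer can be sketched as the squared spatial gradient of each displacement component, averaged over the field. This is an illustrative numpy helper; the exact normalization (sum vs. mean) is an assumption:

```python
import numpy as np

def smoothness_loss(phi):
    """Diffusion regularizer: mean squared spatial gradient of a
    displacement field phi of shape (3, D, H, W)."""
    loss = 0.0
    for c in range(phi.shape[0]):          # each displacement component
        grads = np.gradient(phi[c])        # finite differences per axis
        loss += sum(np.mean(g ** 2) for g in grads)
    return loss / phi.shape[0]
```

A constant field costs nothing, while a field whose displacement changes by one voxel per voxel along one axis incurs a unit squared-gradient penalty on that component.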
As the deformation is estimated progressively, we consider the similarity loss at each scale of the registration pyramid. The total loss is therefore defined as:

L_total = Σ_{i=1}^{4} L_sim( down_{2^i}(I_f), down_{2^i}(I_m) • G_i ) + λ L_smooth(φ),   (8)

where down_{2^i} denotes a down-sampling operation with a factor of 2^i, and G_i is the 3D deformation grid generated to warp the moving volume.

Materials
The study protocol was reviewed and approved by the Institution's Ethical Review Board. Experiments were carried out on three public brain MRI datasets with manually labeled regions of interest (ROIs): Mindboggle101 (Klein and Tourville, 2012), LPBA40 (Shattuck et al., 2008), and IXI30 (Serag et al., 2012).
• Mindboggle101 (101 T1-weighted MRI volumes): 62 volumes were used in the experiments, following Kuang and Schmah (2019). Specifically, 42 volumes (i.e., 1,722 pairs) from the NKI-RS-22 and NKI-TRT-20 subsets were used for training, and 20 volumes (i.e., 380 pairs) from OASIS-TRT-20 were used for testing.
• LPBA40 (40 T1-weighted MRI volumes): 30 volumes were randomly selected for training and the remaining 10 volumes were used for testing.
• IXI30 (30 T1-weighted MRI volumes): all 30 volumes were used for testing. To investigate the generalization ability of the network, we employed the model trained on LPBA40 to register images from the IXI30 dataset.
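The quoted pair counts follow from the number of ordered (moving, fixed) pairings of n volumes, i.e., n(n − 1); a quick sanity check:

```python
def ordered_pairs(n):
    """Number of ordered (moving, fixed) pairs among n volumes."""
    return n * (n - 1)
```

With 42 training volumes this gives 42 × 41 = 1,722 pairs, and with 20 test volumes 20 × 19 = 380 pairs, matching the counts above.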
All volumes were pre-processed by histogram and intensity normalization, and skull-stripping using FreeSurfer (Fischl, 2012).

Implementation Details
In our experiments, each input volumetric image was resized to 192 × 192 × 192. The network was trained on an NVIDIA Tesla V100 GPU. The regularization parameter λ was set empirically to 1,000. For the whole registration network, the number of epochs was set to 300. The network was implemented in PyTorch with the Adam optimizer (Kingma and Ba, 2014); the learning rate was initially set to 2e-4 and decayed by a factor of 0.5 after every 10 epochs.
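The stated schedule (initial rate 2e-4, factor 0.5 every 10 epochs) corresponds to a simple step decay. The helper below is an illustrative sketch of that schedule, under the assumption that the 0.5 factor applies to the learning rate:

```python
def learning_rate(epoch, base_lr=2e-4, decay=0.5, step=10):
    """Step-decay schedule: multiply the base rate by `decay`
    once every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

So epochs 0-9 train at 2e-4, epochs 10-19 at 1e-4, epochs 20-29 at 5e-5, and so on.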

Evaluation Metrics
To quantitatively evaluate the registration performance, the Dice similarity coefficient (DSC) (Dice, 1945), Hausdorff distance (HD) (Huttenlocher et al., 1993), and average symmetric surface distance (ASSD) (Taha and Hanbury, 2015) were calculated. The DSC is defined as:

DSC = 2 |R_{I_f} ∩ R_{I_m}| / (|R_{I_f}| + |R_{I_m}|),

where R_{I_f} and R_{I_m} are the segmented ROIs of I_f and I_m, respectively. The HD measures the longest distance over the shortest distances between the segmented ROIs of I_f and I_m. The ASSD is defined as:

ASSD = ( Σ_{p∈B_{I_f}} min_{q∈B_{I_m}} d(p, q) + Σ_{q∈B_{I_m}} min_{p∈B_{I_f}} d(q, p) ) / ( |B_{I_f}| + |B_{I_m}| ),

where B_{I_f} and B_{I_m} are the segmented surfaces of I_f and I_m, respectively, and d(·, ·) is the shortest Euclidean distance operator.
All evaluation metrics were calculated in 3D. A better registration shall have larger DSC, and smaller HD and ASSD.
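For small surface point sets, the three metrics can be sketched with brute-force distance computations. These are illustrative helpers; practical implementations on full 3-D volumes use distance transforms rather than pairwise distances:

```python
import numpy as np

def dice(a, b):
    """DSC = 2|A ∩ B| / (|A| + |B|) for boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def _dists(P, Q):
    """Pairwise Euclidean distances between point sets P (n, 3), Q (m, 3)."""
    return np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)

def hausdorff(P, Q):
    """HD: the longest of the shortest distances between two surfaces."""
    d = _dists(P, Q)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def assd(P, Q):
    """ASSD: symmetric mean of shortest surface-to-surface distances."""
    d = _dists(P, Q)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(P) + len(Q))
```

Note the contrast the text draws: DSC grows with overlap, while HD and ASSD shrink as the boundaries approach each other.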

Registration Accuracy
We compared our EPReg network with four cutting-edge brain MRI registration schemes: SyN (Avants et al., 2008), VoxelMorph (Balakrishnan et al., 2018), FAIM (Kuang and Schmah, 2019), and MSNet (Duan et al., 2019). For a fair comparison, we obtained their results either directly from the corresponding papers or by running the public code provided by the authors with the recommended parameter settings. In addition, we also compared against the network without the edge-aware input, denoted as PReg. Figure 6 shows visual comparisons of the different registration methods on the Mindboggle101 dataset. Our network generates more accurate registered images, and the internal structures are preserved more consistently. Table 1 further reports the numerical results on five regions of the Mindboggle101 images. It can be observed that our EPReg consistently achieved the best registration performance with respect to the DSC and HD metrics. Regarding the ASSD evaluation, our network obtained the best values on the occipital and temporal regions, and the second best on the frontal, parietal, and cingulate regions. It is worth noting that both our EPReg and PReg networks outperformed the other cutting-edge methods by a large margin in terms of DSC, which demonstrates that the proposed deformable pyramid contributed to the improvement of registration.
For the LPBA40 dataset, the visual and quantitative results (covering seven regions) are shown in Figure 7 and Table 2, respectively. It can be observed that our network achieved overall satisfactory registration performance on LPBA40. We further calculated the DSC, HD, and ASSD values of 54 corresponding sub-regions from the warped and fixed volumes. The comparison results are illustrated in Figure 8. Among the 54 sub-regions, the proposed EPReg achieved the best DSC, HD, and ASSD values on 41, 32, and 38 sub-regions, respectively.
The numerical results on IXI30 are illustrated in Figure 9. The IXI30 dataset has 95 sub-regions, but 30 of them are extremely small; we therefore calculated the DSC, HD, and ASSD values of the remaining 65 sub-regions. The proposed EPReg achieved the best DSC, HD, and ASSD values on 49, 35, and 49 sub-regions, respectively, which shows that our network has satisfactory generalization ability. Figure 10 visualizes the registered images from the different methods. Our network again attained overall satisfactory registration performance on this dataset.

Statistical Analyses
To investigate the statistical significance of the proposed network over the other compared registration methods, Student's t-tests were conducted. Specifically, two-sample, two-tailed t-tests were employed for pairwise comparisons of registration performance between our method and the other five methods on the three datasets (see Table 3). It can be observed that the null hypotheses for the five comparison pairs on the DSC metric were rejected at the 0.05 level; our method can therefore be regarded as significantly better than the other five compared methods on the DSC metric. In addition, the null hypotheses for the SyN-EPReg, VM-EPReg, and FAIM-EPReg pairs were rejected at the 0.05 level on all three metrics, which demonstrates that our method was significantly better than these three methods on all three metrics. The p-values of PReg-EPReg on the HD and ASSD metrics for the Mindboggle101 dataset were above the 0.05 level, indicating that our method and PReg achieved similar performance with regard to HD and ASSD on that dataset. The MSNet-EPReg pair showed similar results on the HD and ASSD metrics for the IXI dataset. In general, the statistical analyses show that our method had overall better registration performance than the other compared cutting-edge methods.
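For reference, the pooled-variance two-sample t statistic underlying such tests can be sketched as follows. `two_sample_t` is an illustrative helper; the two-tailed p-value is then obtained from the t distribution with n1 + n2 − 2 degrees of freedom (e.g., via a statistics library):

```python
import math

def two_sample_t(x, y):
    """Two-sample Student's t statistic with pooled variance."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    s1 = sum((v - m1) ** 2 for v in x) / (n1 - 1)   # sample variances
    s2 = sum((v - m2) ** 2 for v in y) / (n2 - 1)
    sp = math.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
```

Identical samples give t = 0 (null hypothesis clearly retained), while a consistent per-subject gap between two methods drives |t| up and the p-value below the 0.05 threshold.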

Comparison of Time Efficiency
We further compared time efficiency. Table 4 lists the inference time for registering a pair of images using the different methods. For affine alignment, the time spent by our affine component (less than 0.3 s) was much less than that of the traditional affine alignment method (i.e., ANTs; Avants et al., 2008), which relies on iterative optimization. Considering the overall registration time, our end-to-end registration network was much faster than the other registration networks, which require an extra affine alignment (i.e., alignment using ANTs; Avants et al., 2008) before deformable registration.

DISCUSSION
We evaluated the proposed edge-aware pyramidal deformable network on three different brain MR datasets and compared it with four cutting-edge registration methods. The DSC, HD, and ASSD were employed to evaluate all methods; specifically, HD and ASSD assess the differences in boundary shape between two volumes. The numerical results and statistical analyses show that our method had overall better registration accuracy than the other compared cutting-edge methods. Furthermore, our registration network provides a very efficient inference procedure, achieving accurate volumetric registration within 1 s. The experimental results demonstrate that the proposed progressive deformation estimation scheme contributed to the improvement of registration accuracy and efficiency.
The proposed registration method focuses on the large deformation estimation by progressively predicting the residual deformation. It is beneficial for the estimation of large-scale deformation. Further validations on other body parts (e.g., chest or abdominal CT images) will be our future work.
The problem of registration validation in clinical settings is still an open issue. Most recent research articles on new registration approaches employ DSC as the primary metric to evaluate registration accuracy (Balakrishnan et al., 2018; Cao et al., 2018; Loi et al., 2018, 2020; Fan et al., 2019; Huang et al., 2021). For a fair comparison, we also applied DSC to evaluate registration performance. However, DSC suffers from the limitation of being dependent on the volume of the structures. Thus, we further employed HD and ASSD to evaluate boundary differences between two volumetric regions. Although there are no guaranteed thresholds on DSC, HD, and ASSD for quality assessment of registration on brain MR images, the comparison results on a series of sub-regions (over 50 sub-regions) show that the proposed network consistently outperformed other cutting-edge registration methods with respect to DSC, HD, and ASSD. In addition, the visual results in Figures 6, 7, 10 illustrate the satisfactory registration performance obtained by our method. In future work, a phantom study that can generate ground-truth deformation fields for straightforward registration validation will be conducted.

CONCLUSION
In this paper, we have presented an edge-aware pyramidal deformable network for unsupervised volumetric registration. The proposed network focuses on the large deformation estimation by progressively predicting the residual deformation.
Our key idea is to fully exploit the useful complementary multi-level information from paired features to predict multi-scale deformation fields. We achieve this by developing a deformable pyramid architecture, which generates multi-scale paired feature maps for progressively transforming the paired information into more representative features to predict a more accurate deformation field. In addition, we leverage extra edge information to impel the network to pay attention to edge-aware alignment. Extensive experiments on several 3D MRI datasets demonstrate that our edge-aware pyramidal deformable network achieves satisfactory registration performance. The coarse-to-fine progressive registration procedure helps compensate for large-scale deformation and can be regarded as a general solution for deformable volumetric registration.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.