Deep Learning Deformation Initialization for Rapid Groupwise Registration of Inhomogeneous Image Populations

Groupwise image registration tackles biases that can potentially arise from inappropriate template selection. It typically involves simultaneous registration of a cohort of images to a common space that is not specified a priori. Existing groupwise registration methods are computationally complex and are only effective for image populations without large anatomical variations. In this paper, we propose a deep learning framework to rapidly estimate large deformations between images to significantly reduce structural variability. Specifically, we employ a multi-level graph coarsening method to agglomerate similar images into clusters, each represented by an exemplar image. We then use a deep learning framework to predict the initial deformations between images. Warping with the estimated deformations brings the images closer in the image manifold and their alignment can be further refined using conventional groupwise registration algorithms. We evaluated the effectiveness of our method in groupwise registration of MR brain images and compared it against state-of-the-art groupwise registration methods. Experimental results indicate that deformation initialization enables groupwise registration to converge significantly faster with competitive accuracy, therefore facilitates large-scale imaging studies.

Groupwise registration methods do not require a pre-specified reference image, but instead automatically determine the hidden common space in an unbiased manner. Groupwise registration techniques typically simultaneously align a cohort of images to a common space (Sabuncu et al., 2009;Spiclin et al., 2012;Wachinger and Navab, 2013). For example, in Joshi et al. (2004) an initial group center is defined by the average of all affineregistered images. The group center is iteratively updated with the average of images registered to it. While this method mitigates bias, it leads to registration inaccuracy as the initial group center is fuzzy. This limitation was addressed in Wu et al. (2011) by constructing a "Sharp-Mean" group center by weighted averaging of the registered images. ABSORB (atlas-building by self organized registration and bundling) (Jia et al., 2010) is a groupwise registration algorithm that warps the images based on their neighboring images. However, ABSORB does not consider the whole image distribution and takes into account only the immediate neighbors of an image. HUGS (hierarchical unbiased graph shrinkage) (Ying et al., 2014) models the image distribution using a graph and formulates groupwise registration as a dynamic graph shrinkage problem where images, represented as nodes, are warped along graph edges. Yet another groupwise registration strategy is by constructing a minimal spanning tree with a root node that gives a minimum overall edge length to all other nodes. The image deformation is estimated by composing all the transformations along the path from a leaf node to the root node (Hamm et al., 2009).
The aforementioned methods assume a single common space and are not designed to deal with heterogeneous populations with large anatomical variations. An inhomogeneous population with large deformations is better represented using multiple group centers and directly warping the images to a single group center is ineffective and inaccurate (Sabuncu et al., 2008;Liao et al., 2012). As a remedy, the population is typically divided into multiple homogeneous subgroups with an atlas constructed for each subgroup for registration (Sabuncu et al., 2009;Ribbens et al., 2010). For example, Wang et al. (2010) cluster the population into subgroups and perform groupwise registration within each subgroup. The center images of the subgroups are then registered using a pyramidal hierarchy. While effective, these methods are computationally expensive and not scalable to large datasets.
In this paper, we present a novel deformation initialization framework to reduce anatomical variations prior to groupwise registration. This removes large structural variations in an inhomogeneous image population so that conventional groupwise registration algorithms can be applied more effectively and accurately. Our initialization framework is formulated as a two-step process: (i) graph coarsening and (ii) deep learning deformation prediction. In the first step, the images are represented using a graph and are clustered via iterative graph coarsening. In the second step, deep learning is employed to estimate the deformations between images in the population according to the hierarchical structure resulting from graph coarsening.

METHODS
Given a diverse dataset of MR brain images with large intersubject variability, our objectives are to (i) reduce the anatomical variability in a dataset such that the images can be simultaneously registered to a single latent common space and (ii) speed up groupwise registration so that it is scalable to large-scale datasets. To achieve these objectives, we will employ deep learning for predicting large deformations to reduce structural variations so that the images are close enough to be registered efficiently and accurately to a common space.

Multi-Level Graph Coarsening Based Image Clustering
We propose to use multi-level graph coarsening for image clustering. Graph coarsening is used in multi-level graph partitioning to construct smaller graphs by hierarchically combining neighboring vertices (Xiao et al., 2013;Safro et al., 2015). That is, for an h-level coarsening, we have where G h denotes the size of the graph at level h. Graph coarsening can generally be accomplished by either (i) contraction or (ii) algebraic multigrid (AMG) (Safro et al., 2015) scheme. In the current work, we use AMG graph coarsening (Ruge and Stüben, 1987;Rakai et al., 2012) to split the dataset into image clusters. Let G 0 = V 0 , E 0 be the original graph consisting of a vertex set V 0 = {v i | i = 1, . . . , N 0 }, representing images I = {I i | i = 1, . . . , N 0 }, and an edge set E 0 = e ij | i, j = 1, . . . , N 0 , representing the similarity between the images. Each edge e ij is defined for images I i and I j and is calculated via normalized cross correlation (NCC) as where x 0 is a voxel location in the brain region andĪ i and I j are mean intensity values of images I i and I j , respectively. If registration needs to be performed across modalities, information theoretic measures, such as mutual information, can be used. The fine graph G 0 is progressively coarsened with the coarsened graph at level l is denoted as G l = V l c , E l c . The coarsening algorithm is detailed below: Step 1: The connection of vertex j with respect to vertex i is considered strong if for given ρ ∈ (0, 1]: where N l is the total number of vertices at level l. Note that this criterion is not symmetric with respect to i and j. With ρ = 1, the connection of the vertex with maximum similarity with vertex i is considered strong. This results in a large number of clusters, each with few images. On the other hand, if the value of ρ is too small, then we get too few clusters. We set ρ = 0.95 so that connections with above 95% of the maximum similarity value are considered strong.
Step 2: The desirability ψ i of a vertex to be selected as a coarse vertex is computed as the total number of strong connections with respect to the vertex. The vertex with the largest desirability is designated as a coarse vertex and all vertices that are strongly connected with respect to this coarse vertex are designated as fine vertices.
Step 3: The desirability values of vertices strongly connected with respect to fine vertices are increased by 1. The desirability values of vertices strongly connected with respect to coarse vertices are decreased by 1.
Step 4: Steps 2 and 3 are repeated until all the vertices are designated as either coarse or fine.
Step 5: Steps 1 − 4 are repeated with the coarse vertices with l ← l + 1 until we get stable graphs (i.e., when the size of the two consecutive graphs is same G l = G l−1 ).
The coarsening process is illustrated in Figure 1. Eventually, images are hierarchically grouped into clusters, where at each level the cluster exemplars are represented by coarse vertices and cluster members are represented by fine vertices. Cluster exemplars at the highest level will be used for deformation initialization.

Deep Learning Based Registration
The cluster exemplars obtained at the highest level (h) of graph coarsening, I h = J j | j = 1, . . . , N h c , will be used to deform all the images in the dataset. Registration is performed using a convolutional neural network (CNN) (Fan et al., 2018) with these exemplars regarded as fixed templates. The CNN is based on U-Net (Ronneberger et al., 2015), with additional convolutional layers at same levels of contracting and expansive paths to learn high-level features that are helpful in predicting the deformation fields (Figure 2). More specifically, the network consists of (i) 3 × 3 × 3 convolutional layers followed by ReLU and batch normalization, (ii) 2 × 2 × 2 max pooling layers, (iii) 2 × 2 × 2 deconvolutional layers, (iv) 1 × 1 × 1 final convolutional layers, and (v) 3 × 3 × 3 convolutional layers added between the contracting and expansive paths. In addition, a loss function is added in each layer to ensure that the parameters of the frontal convolutional layers are updated. This strategy helps to avoid over-fitting caused by the more frequent parameter update of the later convolutional layers. The registration network takes the overlapping 64 × 64 × 64 patches as input, and outputs 24 × 24 × 24 deformation field patches. In order to obtain a deformation field that is equal in size to the input image, we extract the predicted deformation field patches with a step size of 24 without overlap. A CNN is associated with each template. To train the CNNs, we first select the template which is most similar to all other templates, based on the following criterion: The CNN is trained using dual-guidance: (i) coarse guidance from deformation fields estimated using an existing registration method and (ii) fine guidance using image dissimilarity betweeñ J and the warped subject images. The latter ensures that the training does not completely depend on the guidance from ground-truth deformation fields estimated from the existing registration method. We used diffeomorphic Demons (Vercauteren et al., 2009) to estimate the ground-truth deformation fields. Our learning model is therefore semisupervised with loss function consisting of two components: (i) the Euclidean distance between the predicted and the groundtruth deformation fields (loss u ) and (ii) the sum of squared intensity difference betweenJ and the subject image warped using the predicted deformation field (loss SSD ). As shown in Figure 2, the deformation field is predicted at three different resolution levels, therefore loss u is comprised of the loss functions computed at each level i.e., loss u = loss high u + loss mid u + loss low u . The two components of the total loss function were dynamically balanced during the training stage (loss total = α * loss u + β * loss SSD ). Initially the first component is given a higher weight α to converge quickly and then the prediction is refined by giving more weight β to the second component of the loss function. At each epoch, the sum of the weights of the two components was equal to 1. We trained the network for 10 epochs, which we found enough for convergence.
We used a 75 : 25 train-test split of the dataset (excluding the templates) and trained the network using the ADAM optimizer with a learning rate of 1 × 10 −2 . Once the network was trained with respect toJ, the networks for the other templates were trained using transfer learning by initializing the weights with those of the network trained with respect toJ and updating the weights using the ADAM optimizer with an overall learning rate of 1 × 10 −7 . We kept a small learning rate as all the CNNs have a common task domain. Transfer learning allows the training of the CNNs to be expedited.
Once the networks have been trained, each of them is used to register the images to the templates, producing a set of deformation fields U = u ij | i = 1, . . . , N 0 , j = 1, . . . , N h c . Our goal is to warp each image to a hidden common space using the average deformation computed with respect to the templates. To achieve this, we first invert the deformation fields as The deformation field for an image I i is computed as Frontiers in Neuroinformatics | www.frontiersin.org  and is used to warp the image I i via I i ′ = I i •ū i (see Figure 3). This process is repeated for all the images, producing a set of warped I ′ = I i ′ | i = 1, . . . , N 0 . The alignment of the images in I ′ can be improved using groupwise registration algorithms. Since the differences between the images are smaller, they can be brought to a common space more efficiently in a smaller amount of time.

Evaluation of Registration Performance
The efficacy of our method was evaluated both qualitatively and quantitatively, in comparison with SharpMean (Wu et al., 2011), ABSORB (Jia et al., 2010), and GroupMean (Joshi et al., 2004). With initialization, the methods are denoted as iSharpMean, iABSORB, and iGroupMean.
The experiments were conducted by combining all the T1 weighted MR images from LONI LPBA40 (Shattuck et al., 2008) and IXI 1 datasets. As LONI LPBA40 has 40 images and IXI has 30 images, in our dataset we have a total of 70 images that we registered jointly. All the images have 184 slices of 220 × 220 pixels with isotropic voxel size of 1mm 3 . The age range for LPBA40 is 29.20 ± 6.30 years and IXI is 20-54 years. The union of two datasets ensures that the images exhibit large inter-subject variability characterized by the presence of different   age groups (young adults and elderly). Figure 4 shows some typical images from this dataset, indicating significant intersubject differences. All the images were histogram matched and affine registered using ANTs (Avants et al., 2008). The image which is most similar to the rest of the images in the dataset is used as template for affine registration. Also, for training the deep learning based registration network, we had a total of 69 images (excluding the template), which were divided using a 75 : 25 train-test split. This means that 52 images were used for training and the remaining 17 images were used for testing.
For qualitative assessment, we used checkerboard images to simultaneously display two images so that structural boundaries can be compared. In the ideal situation where two images are perfectly aligned, the checkerboard image will be seamless. Figure 5A shows the axial-view checkerboard image of two randomly selected images before initialization, showing apparent misalignments especially in the lateral ventricular region. In contrast, Figure 5B shows that deformation initialization reduces variability across images. Figure 6 shows the 3D surface renderings of the group mean images given by different methods. It can be seen that groupwise registration with initialization improves sharpness of the group mean images of SharpMean, ABSORB, and GroupMean.
For quantitative evaluation, we computed the NCCs between all the warped images with respect to the group mean image generated by each method. The results, shown in Table 1, indicate that deformation initialization moves the images closer and groupwise registration with initialization yields higher mean NCC values (statistically significant with paired t-tests, p < 0.05), than without initialization.
Evaluation was also performed based on Dice ratio of hippocampus and brain tissue segmentation, i.e., cerebrospinal fluid (CSF), gray matter (GM), and white matter (WM). The Dice ratio (D) is given by where V 1 is the volume of a segmented tissue or hippocampus in the subject image domain and V 2 is the volume of a segmented tissue or hippocampus in the reference image domain. The reference image for hippocampus and brain tissues is obtained in the common space, respectively by majority voting based on the hippocampus and tissue segmentation of all the warped images. The results for different brain tissues are summarized in  Figure 7A for CSF, GM, and WM. The Dice ratios for hippocampus are summarized in Table 4. The groupwise registration methods with initialization show improved Dice ratio (statistically significant with p < 0.05) as compared to no initialization.
We computed the 95th percentile of Hausdorff distance for performance evaluation. The Hausdorff distance (HD) is given by HD(R, S) = max (max where R and S are the 3D point sets of the boundaries of the tissue segmentations or hippocampus of the reference image and subject image, respectively. d(r, s) is the Euclidean distance between two finite point sets. We reported the 95th percentile of HD since it is less sensitive to outliers. Table 3 summarizes the results for different brain tissue types. The overall value yielded by deformation initialization is 1.37 (±0.306) mm thus confirming its usefulness in the reduction of anatomical variability. In addition, deformation initialization improves groupwise registration. Box plots are shown in Figure 7B for evaluation. Table 4 summarizes the 95th percentile of HD for hippocampus. We can see that the initialized groupwise registration significantly decreased the Hausdorff distance as compared to without initialization. Table 5 summarizes the computational times of all the methods along with the number of iterations needed for convergence. Groupwise registration methods with initialization are faster and converge very quickly, compared with no initialization. More specifically, SharpMean took around 18 h, whereas iSharpMean converged within 3 h with comparable accuracy. iABSORB achieved results comparable to ABSORB and requires 21 h less. iGroupMean took just half an hour  Statistically significant improvements (p < 0.05) with or without initialization are marked in bold.
to converge, compared with 13.5 h taken by GroupMean.
These results indicate that initialization improves registration accuracy by reducing anatomical variability and is hence important for detection of subtle changes associated with aging and disorders.

Significance of Graph Coarsening
To investigate the impact of graph coarsening on template selection, we performed two experiments.
In the first experiment, instead of utilizing graph coarsening, we used randomly selected templates for deformation initialization. The number of selected templates was kept consistent with that given by graph coarsening. It can be observed from Table 6 that the accuracy decreases in comparison with initialization using graph coarsening. This demonstrates the importance of taking into consideration the image distribution in template selection.
In the second experiment, we evaluated the effects of the number of templates. Using a single template (i.e.,J), although giving good alignment (Table 6), will affect subsequent population analysis [e.g., voxel-based morphometry (VBM)] with bias toward the selected template and neglecting intersubject variation. Moreover, if the selected template image is an outlier, population analysis can be severely affected. Graph coarsening takes into account inter-subject heterogeneity and determines multiple images that are representative of image subpopulations. The higher Dice ratios given by single template case is partially due to the greater image sharpness when no averaging is performed.

Generalizability
To assess generalizability, we conducted two experiments. In the first experiment, we trained the registration network with the LONI LPBA40 dataset and tested it with the IXI dataset. In the second experiment, we trained the registration network with both LONI LPBA40 and IXI datasets and tested it on the IXI dataset. Figure 8 shows the Dice ratios for 78 ROIs (see Table 7) of the IXI dataset. The overall Dice ratio achieved in first and second experiment is 73.07 (±9.91) and 74.29 (±9.84) %, respectively p > 0.05 ,     Statistically significant improvements (p < 0.05) as compared to graph coarsening are marked in bold.

CONCLUSION
In this paper, we presented an effective and efficient deformation initialization method for groupwise registration of images with large anatomical differences. Deformation initialization decreases structural discrepancies and brings the images closer to the common space. The results validated that deformation initialization improves alignment accuracy and significantly reduces computation times.

AUTHOR CONTRIBUTIONS
SA implemented the code, performed experiments, and prepared the manuscript draft. JF contributed in code implementation. PD and XC helped in designing the experiments. JF, PD, and XC participated in idea discussion. P-TY contributed in algorithm development and critical revision of the manuscript. DS designed the project, led the team, and revised the manuscript.