An Optical Flow Based Left-Invariant Metric for Natural Gradient Descent in Affine Image Registration

Accurate spatial alignment is essential for any population neuroimaging study, and affine (12 parameter linear/translation) or rigid (6 parameter rotation/translation) alignments play a major role. Here we consider intensity based alignment of neuroimages using gradient based optimization, which is a problem that continues to be important in many other areas of medical imaging and computer vision in general. A key challenge is robustness. Optimization often fails when transformations have components with different characteristic scales, such as linear versus translation parameters. Hand tuning or other scaling approaches have been used, but efficient automatic methods are essential for generalizing to new imaging modalities, to specimens of different sizes, and to big datasets where manual approaches are not feasible. To address this we develop a left invariant metric on these two matrix groups, based on the norm squared of optical flow induced on a template image. This metric is used in a natural gradient descent algorithm, where gradients (covectors) are converted to perturbations (vectors) by applying the inverse of the metric to define a search direction in which to update parameters. Using a publicly available magnetic resonance neuroimage database, we show that this approach outperforms several other gradient descent optimization strategies. Due to left invariance, our metric needs to only be computed once during optimization, and can therefore be implemented with negligible computation time.


INTRODUCTION
Modern neuroimaging techniques are providing a detailed examination of the nervous system at unprecedented scale. Human studies such as the Alzheimer's Disease Neuroimaging Initiative [1] or Open Access Series of Imaging Studies [2] are providing hundreds of publicly accessible three dimensional brain images at the millimeter scale to the neuroscience, medical imaging, and computer vision communities. Consortiums such as the BRAIN Initiative Cell Census Network are making petabytes of neuroimaging data available from animal models at the micron and submicron scale [3]. Extracting insight from these massive databases remains a challenge, and establishing common coordinate systems for reporting results is an important step to enable synchronization between different laboratories, between imaging modalities, and across populations [4][5][6].
This standardization is enabled by image registration techniques, where optimal spatial transformations are calculated to maximize similarity between new observations and images in standard coordinates. There are many approaches to implementing these techniques. One framework involves building alignments based on point sets including landmarks [7][8][9], curves [10], and surfaces [11,12]. Another framework involves voxel based imaging data [13][14][15][16][17]. Additionally hybrid approaches are often used that can combine the strengths of point based and voxel based techniques [18][19][20]. There are many more thorough reviews of these techniques, including comparisons in [21].
In this work we focus on voxel intensity based image registration with gradient based optimization. In these approaches robustness to parameter selection is essential. When optimizing over parameters, differences in scales must be accounted for during gradient descent before updating parameters in the negative gradient direction. For example, simple elastix [22] defines an estimate scales routine. Simple elastix is based on the simple ITK wrapper [23] of the powerful ITK library [24,25]. Simple ITK hard-codes relative scales in manner that tends to give good results for human neuroimages at the millimeter scale. However, efficient automatic methods are essential for generalizing to new imaging modalities, to specimens of different sizes, and to big datasets where manual approaches are not feasible.
Here we address this gap by designing a metric for natural gradient descent [26]. In differential geometry, vectors can represent perturbations of parameters, while covectors represent linear maps taking vectors to the real numbers. In gradient based optimization, gradients are covectors, and interpreting them as a perturbation to update parameters is not well formulated. For example, when optimizing over position with parameters with units of "meter", gradients have units of "per meter", and should not be added to a quantity with units of "meter". Using a Riemannian metric (inner product between vectors in a given tangent space), vectors can be converted to covectors and vice versa. Applying this conversion to gradients in optimization before updating is known as natural gradient descent.
In this work we consider affine and rigid transformations, which contain a linear map and a translation. These two components have different units: the linear part is unitless, and the translation part has units of length. This makes choosing stepsizes in gradient based optimization critical. Often orders of magnitude difference in scales means registration algorithms will not converge without laborious and non-reproducible parameter tuning. We introduce a metric for natural gradient descent based on the L 2 norm of optical flow, and we show that it is left invariant, invariant to image padding, and related to Gauss-Newton optimization. We demonstrate the advantages of this approach over more standard methods on a brain image registration dataset, and discuss our results in the context of related methods.

METHODS
In this section we define our coordinate systems and optical flow metric for the 12 parameter affine and 6 parameter rigid case, enumerate several properties, and summarize our validation experiments. We use lowercase letters to denote scalars, boldface lowercase letters to denote vectors, and boldface uppercase letters to denote matrices. Exceptions are I, J which by convention will denote images, and g which by convention will denote a metric tensor.

The Optical Flow Metric for Affine Transformations
We consider registration from an atlas image I: X ⊂ R 3 → R m , to a target image J: X → R n . Often in medical imaging m n 1 (for grayscale images), but it is often 3 (for red, green, blue images), or can be other values.
We work with vectors and affine transforms in homogeneous coordinates, so a point in R 3 is written as x (x 0 , x 1 , x 2 , 1) T , and an affine transform is written as A a 00 a 01 a 02 b 0 a 10 a 11 a 12 b 1 a 20 a 21 a 22 b 2 0 0 0 1 Images are transformed via the group action A · I(x) I(A −1 x), which is implemented computationally through trilinear interpolation. We use a coordinate chart based on lexicographic ordering of the above components, φ: affine matrix → R 12 with A1(a 00 , a 01 , a 01 , b 0 , a 10 , . . . , b 2 ) T . We use the chart induced basis for the tangent space, whereby the standard basis vectors e i are pushed forward, E i Dφ −1 A e i . In this parameterization the push forward is independent of A. For example, the first two basis vectors are given by Tangent vectors can be thought of as small perturbations to the coordinates δa, and we write δA Dφ −1 A δa. A perturbation A1A + ϵδA induces optical flow on our template image via We define our inner product g A between two tangent vectors δA, δB at the point A to be given by the L 2 inner product of the optical flow, scaled by the determinant of A, denoted |A|: This metric is left invariant, as can be seen by making a change of variables in the integral, y A −1 x, x Ay, dx |A|dy: Since A −1 δA is the push forward of the vector δA tangent to the point A by the map A −1 , this implies that this choice of metric is left invariant. An important benefit of this property is that during an optimization procedure we can compute it at identity once, and apply a simple linear transformation to compute it at other locations A.
We compute g id in coordinates as a 12 × 12 matrix via [g id ] ij g id (E i , E j ).

The Optical Flow Metric for Rigid Transformations
We use zyx Euler angles for parameterizing the rotation group, i.e., where · denotes matrix multiplication. We let φ be the associated At identity (all parameters equal 0), the chart induced basis for the tangent space is: A standard result shows that the chart induced basis at a point A is We can compute g id in coordinates as a 6 × 6 matrix via [g id ] ij g id (E i , E j ) using (3).

Converting Covectors to Vectors
For natural gradient descent we convert derivatives of the loss function with respect to the coordinates (which are covectors), to perturbations in the coordinates (which are vectors), by applying the Y map, which we define here. If dA is a covector at the point A with coordinates da, and δB is a vector at the point A with coordinates δa, then we can write the action of dA on δB in coordinates as The lift from a covector to vector using the Y map is defined implicitly through the relationship In a given coordinate system this amounts to solving a linear system of equations (i.e. inverting the matrix

Computing the Metric at A
We compute g id by integrating over the image once at the start of optimization, and then use left invariance to define a simple transformation for computing g A . To do this we need to express the push forward of a vector with A −1 as a matrix multiplication operation. This is computed via where N is 12 for general affine transforms and 6 for rigid transforms. The components of these matrices can be easily computed from their action on basis vectors. We can then write, using multiplication of N × N, matrices:

Invariance to Padding
In a typical situation I is an image of a head in air, and air has a signal 0. In this case we can add any amount of zero padding to our image without changing the metric because DI is zero in the padded regions. This may seem trivial, but it is not respected by choices made in other software, such as the "automatic scales estimation" routine in Section 2.6 below.

Relationship to Gauss Newton
If the loss function being considered is sum of square error, then this approach is roughly equivalent to Gauss Newton optimization. The minor differences are constant scale factors, numerics of interpolation, and potentially regions over which error is summed. This suggests that choosing a step of order 1 is reasonable for this cost function.

Independence to Coordinate Origin
Traditionally the positioning of a coordinate origin has been critical to the success of affine registration. If the origin is far from the imaging data (e.g., the corner of the image), then the transformation is very sensitive to changes in the linear parameters. Because shifting the coordinate origin is equivalent to applying an affine transformation, and our metric is left invariant, our approach is invariant to the coordinate origin. This is illustrated in our experimental results.

Mapping Experiments
We implement a gradient based affine image registration algorithm by minimizing sum of square error, and mutual information loss functions. We use 100 randomly sampled pairs of images from the LPBA40 dataset [5] in native space. This dataset consists brain images from 40 healthy volunteers, with 20 male and 20 female, between 19.3 and 39.5 years old. Images were 3D Spoiled Gradient Echo MRI volumes from a GE 1.5T system, with 1.5 mm coronal slice spacing, and 0.86 mm in plane resolution (38 subjects) or 0.78 mm (2 subjects). This dataset has been used for multiple image registration validation studies, most notably [21] which compared 14 different nonrigid registration methods. An example image pair is shown in Figures 1A,B, which corresponds to our first randomly selected pair.
We compare the natural gradient descent approach described here to three different gradient descent approaches. In the "vanilla" method we update the linear and translation jointly with no scaling, and in the "alternating" method we update them every other iteration (i.e., we update the linear part on iteration 0, the translation part on iteration one, the linear part on iteration 2, etc.).
We also compare to one other approach, "automatic scales estimation" used in simple elastix [22]. In this approach, the derivative of the transformation's output with respect to each parameter is taken, yielding a 3 × 12 matrix of partial derivatives at each pixel. Next the sum of squares is computed down each row, yielding a 1 × 12 vector at each pixel. Last the quantity is averaged over all pixels. This gives a set of scale factors which are used to normalize the components of the gradient before updating parameters. These scale factors are generally only computed once. We note that this is related to our approach with three modifications: 1) we set I(x) x in Eq. 3, 2) we consider the diagonal elements of g only, and 3) we define the components of g A to be equal to those of g id .
In all cases, gradient update steps are performed using a golden section linesearch with a maximum of 10 steps. The minimum in a given search direction is bracketed by a step size of 0, and a step size that would increase the loss. The latter is found by increasing the stepsize found from the previous iteration by a the golden ratio (roughly 1.6) until the loss function increases. Occasionally the step size is found from the golden section search to be 0 within machine precision, in which case we set it to 1 × 10 -10 to initialize the procedure at the next gradient descent iteration. Note that in this case optimization will stop, except in the alternating method where it may continue. For each gradient descent method we consider 12 parameter affine registration (for sum of square error and mutual information loss), and 6 parameter rigid registration (sum of square error only). We also consider placements of the coordinate origin in the image center, half way to the corner, and at the corner. For one case we consider a finer grained sampling of the image origin as discussed in the results section.
In each case we compute the loss function at each step of optimization for 50 steps, and report normalized results that have been shifted and scaled to lie in the range 0-1. The minimum for the best method is assigned the value 0, and the initial value of the loss function (equal for all methods) is assigned the value 1. We show these curves directly, and also show the average tied rank at each iteration. The tied rank for method X is defined as 1 plus the number of other methods X does better than, plus half of the number of other methods it performs equal to. This gives a number between 1 (worst) and 4 (best).
Algorithms are implemented in pytorch and gradients with respect to parameters are computed automatically. We use double precision arithmetic to deal with a large dynamic range of step sizes. For metric computation derivatives DI are computed on a voxel grid via centered differences, and integrals are computed by summing over voxels and multiplying by the voxel volume. Algorithms are run on a NVIDIA GeForce RTX 2080 Ti Rev. A GPU device with 11GB of memory.

Methods Summary
In the methods section we derived a metric based on the L 2 norm of optical flow of a template image, for the 12 parameter affine and 6 parameter rigid case, and proposed a corresponding natural gradient optimization algorithm. These correspond to 12 × 12 or 6 × 6 (respectively) positive definite matrices that depend on an atlas image.
We test our approach using 100 randomly selected pairs of neuroimages from the LPBA 40 dataset [5]. We consider four methods (natural gradient descent, "vanilla" gradient descent, alternating minimization, and automatic scales estimation), 3 placements of coordinate origins (center, half way, and corner), 2 transformation groups (affine and rigid), and 2 loss functions (sum of square error, and mutual information used in the 12 parameter case only).
Our first randomly selected pair of images is shown in Figures  1A,B. We also show the initial error (difference between images)  in Figure 1C, and the error after alignment for the affine Figure 1D and rigid Figure 1E case.

Metric Tensors
For the example images in Figure 1, the metric tensor at identity for the 12 parameter affine registration is shown in Figure 2. The three plots show results when the origin is at the center, half way to the corner, and at the corner. Even with the origin in the center, the tensor has significant off diagonal elements, including negative values, suggesting that simply scaling gradient components is suboptimal. With the origin at the corner, there are stronger off diagonal elements, and a much larger difference between the linear and translation parts. For the example images in Figure 1, the metric tensor at identity for the 6 parameter rigid registration is shown in Figure 3. Similar trends are observed, with even stronger negative elements.

12 Parameter Affine Results With Sum of Square Error
Normalized sum of square error is reported as a function of optimization step in Figure 4 for the 12 parameter affine registrations. The median for each method is shown as a solid line, the 25th and 75th percentile are shown as a filled area, and all 100 curves are shown in pale colors. One observes that the natural gradient descent approach converges significantly faster than the other methods. Considering different coordinate origins has a large effect for the alternative methods and no effect for the natural gradient method. The alternative methods often fail to converge within 50 iterations, with the "vanilla" method failing even with a centered coordinate origin. This illustrates the important lack of robustness in this field.
The bottom right plot shows that the natural gradient method quickly achieves an average rank close to 4 (best), while the vanilla method quickly achieves a rank of 1 (worst). The alternating minimization approach performs better in the long term, while the automatic scales estimation approach performs better in the short term. Relative to the other methods our approach performs worse with a coordinate origin in the center and slightly better for a coordinate origin in the corner, though the effect is minor.

12 Parameter Affine Results With Mutual Information
Mutual information is reported as a function of optimization step in Figure 5 for the 12 parameter affine registrations. The median for each method is shown as a solid line, the 25th and 75th percentile are shown as a filled area, and all 100 curves are shown in pale colors. Trends are very similar to that seen for sum of square error, with the natural gradient descent method converging to the best even more quickly than for sum of square error.

6 Parameter Rigid Results With Sum of Square Error
Normalized sum of square error is reported as a function of optimization step in Figure 6 for the 6 parameter rigid registrations. The median for each method is shown as a solid line, the 25th and 75th percentile are shown as a filled area, and all 100 curves are shown in pale colors. One observes that the natural gradient descent approach converges significantly faster than the other methods for every origin placement except the center. Again, considering different coordinate origins has a large effect for the alternative methods and no effect for the natural gradient method. The bottom right plot shows that the natural gradient method quickly achieves an average rank close to 4 (best) for every case except a centered origin, where it tends to perform roughly equivalent to the alternating minimization scheme. The vanilla method quickly achieves a rank of 1 (worst). The alternating minimization approach performs better in the long term, while the automatic scales estimation approach performs better in the short term, although the order changes much sooner than in the 12 parameter case. Relative to the other methods, our approach performs worse with a coordinate origin in the center, and slightly better for a coordinate origin in the corner.
Because of the similarity between the alternating and natural gradient methods when the origin is centered, we consider comparing these methods as a function of origin position. We consider fraction between center and corner of 0, 0.1, 0.2, 0.3, 0.4, 0.5, and 1. For each iteration, and for each distance from the center, we use a sign test statistic by averaging across patients. Because this amounts to 350 statistical tests, we control familywise error rate using permutation testing [27]. We randomly flip the sign of each patient 100,000 times, and calculate corrected p values using maximum statistics. These results are shown in Figure 7. We see that the natural gradient method performs statistically significantly better than the alternating optimization method across all iterations when the origin is shifted by 0.2 or more, and for the first few iterations when the origin is shifted by 0.1 or less. This indicates that even small uncertainties in positioning of the origin, 3.6 cm in this case, can degrade the performance of traditional approaches. Shifts of this size are reasonable considering that the location of the brain's center is typically unknown until after this type of registration is performed. Our results show that the natural gradient method proposed here outperforms other gradient based minimization schemes in almost every case examined, with virtually no computational overhead. This result holds for both 12 parameter affine transformations and 6 parameter rigid transformations, for sum of square error and for mutual information objective functions. The method does not depend on the placement of the coordinate origin, and can be easily implemented within any registration framework to improve convergence speed and reduce the number of hyperparameters that need to be tuned or estimated.
Another interesting finding is that the automatic scales estimation approach, while sophisticated, tends to perform worse than the simpler alternating minimization approach. This is particularly true when the origin is centered, which is the setting recommended by simple elastix [22].
There are few studies of natural gradient descent applied to image registration in the literature. A search on google scholar for "natural gradient" image registration returns only two relevant results. The work [28] uses a Fisher information metric for deformable registration and adopts various sampling strategies to define a random process from which it can be computed. The work [29] uses a similar strategy for deformable registration, but considers pairwise registration in a population and therefore computes a Fisher Information metric by taking an expectation across the population. Both rely on some form of stochasticity, and neither of these consider or exploit invariance with respect to a transformation group. In [15], which considers nonlinear registration (Large Deformation Diffeomorphic Image Mapping, LDDMM), a distinction between covectors and vectors in the space of smooth vector fields is made, and gradient descent is performed in the latter by applying a smoothing operation which is equivalent to the inverse of a metric. Other deformable registration approaches with regularization may do this implicitly. This right invariant metric used in LDDMM does not depend on the images being registered, and is based on spatial smoothness which is not relevant in the affine registration case. In related work [30,31] developed strategies to learn a metric for deformable image registration, but considered these choices from an accuracy point of view rather than an optimization point of view. In this deformable registration setting, Gauss Newton optimization was used in [32], which is similar to our approach as described in Section 2.5.2.
The trend in computer vision and deep learning is to use stochastic gradient descent methods for optimization, and ITK and its derivatives includes several of these approaches for image registration. For example, the work of [33][34][35] is used for adaptive stepsize estimation. Other adaptive step size approaches have been developed and applied to deformable image registration as well [36]. These approaches do not consider different scale factors for different parameters. The well known ADAM optimization procedure [37] does update step sizes on a per parameter basis, but does not allow for mixing of parameters. The importance of this mixing is evident from the significant off diagonal elements seen in Figures 2, 3.
A modern trend in image registration is to use deep learning based approaches that replace optimization procedures considered here with prediction using a single forward pass of a deep network. These tools include Voxelmorph [38], Quicksilver [39], and others. While our contribution here does not apply directly to these methods, it could potentially be used to accelerate the training stage. An investigation into this potential will be the subject of future work.
Finally, the simple registration tasks considered here are often only one step in a long pipeline or joint optimization procedure. For example, in our work studying serial section images, 3D affine registration is performed jointly with nonlinear alignment and 2D rigid registration on each slice [40,41]. The approach developed here leads to a reduction by half in the number of parameters that need to be selected manually, allowing our pipeline to generalize to new datasets more effectively. This work demonstrates that straightforward applications of principles from differential geometry can accelerate research in neuroimaging.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: Laboratory of Neuroimaging LPBA40 atlas repository https://www.loni.usc.edu/research/atlases.

AUTHOR CONTRIBUTIONS
DT conceived of the study, developed the methods, carried out the analysis, and wrote the manuscript.