Multi-Template Mesiotemporal Lobe Segmentation: Effects of Surface and Volume Feature Modeling

Numerous neurological disorders are associated with atrophy of mesiotemporal lobe structures, including the hippocampus (HP), amygdala (AM), and entorhinal cortex (EC). Accurate segmentation of these structures is, therefore, necessary for understanding the disease process and patient management. Recent multiple-template segmentation algorithms have shown excellent performance in HP segmentation. Purely surface-based methods precisely describe structural boundary but their performance likely depends on a large template library, as segmentation suffers when the boundaries of template and individual MRI are not well aligned while volume-based methods are less dependent. So far only few algorithms attempted segmentation of entire mesiotemporal structures including the parahippocampus. We compared performance of surface- and volume-based approaches in segmenting the three mesiotemporal structures and assess the effects of different environments (i.e., size of templates, under pathology). We also proposed an algorithm that combined surface- with volume-derived similarity measures for optimal template selection. To further improve the method, we introduced two new modules: (1) a non-linear registration that is driven by volume-based intensities and features sampled on deformable template surfaces; (2) a shape averaging based on regional weighting using multi-scale global-to-local icosahedron sampling. Compared to manual segmentations, our approach, namely HybridMulti showed high accuracy in 40 healthy controls (mean Dice index for HP/AM/EC = 89.7/89.3/82.9%) and 135 patients with temporal lobe epilepsy (88.7/89.0/82.6%). This accuracy was comparable across two different datasets of 1.5T and 3T MRI. It resulted in the best performance among tested multi-template methods that were either based on volume or surface data alone in terms of accuracy and sensitivity to detect atrophy related to epilepsy. Moreover, unlike purely surface-based multi-template segmentation, HybridMulti could maintain accurate performance even with a 50% template library size.

Manual MTL volumetry is a labor-intensive task with high demands on neuroanatomical expertise. Although existing automatic segmentation algorithms produce excellent segmentation results for HP and AM in healthy controls (Collins and Pruessner, 2010), their performance in TLE is challenged by the combined effects of atrophy and positional abnormalities (Kim et al., 2012a). Only a relatively small number of studies have attempted segmentation of the entire MTL regions including parahippocampal gyrus (PHG) (Heckemann et al., 2006;Keihaninejad et al., 2012). A study (Hu et al., 2014) specifically segmented the EC, a PHG subregion considered a core epileptogenic zone in TLE (Bernasconi et al., 2003) with suboptimal accuracy (Dice index=73%), likely due to challenges imposed by its complex and variable shape.
Volume-based multi-template and label fusion approaches have been designed to account for shape complexity and anatomical variability by selecting a subset of templates from a large library that best describes the target structure (Collins and Pruessner, 2010;Khan et al., 2011). More recently, our previously proposed surface-based SurfMulti method automatically segmented HP using vertex-wise texture and shape sampling (Kim et al., 2012b), demonstrating improved performances compared to purely volumetric techniques (Collins and Pruessner, 2010). However, performance of purely surfacebased approaches likely depends on the availability of a large library, as it may be negatively impacted when the boundaries of the template and individual MRI are not well aligned. The label fusion in volume-based approaches has become sophisticated using local weighted averaging (Artaechevarria et al., 2009;Coupé et al., 2011;Eskildsen et al., 2012;Wang et al., 2013;Awate and Whitaker, 2014). These approaches have demonstrated the improvement of segmentation.
MICCAI Grand Challenge on Multiatlas Labeling (Landman and Warfield, 2012) systemically evaluated various multitemplate approaches for the segmentation of numerous brain structures but the parahippocampal gyrus. A total of 25 algorithms that were trained by 15 atlases were tested on 20 images. The performance for the hippocampus and the amygdala ranged 82-87 and 75-83% in mean Dice similarity index, respectively. Among the methods that were evaluated, the ones that displayed higher accuracy were the joint label fusion technique that used a joint probability of selected atlases to correct for the bias due to the inclusion of similar atlases in the template library or the training-set  and the Non-Local STAPLE algorithm that combined Staple method with the non-local means estimator (Asman and Landman, 2013).
The current work aimed at segmenting simultaneously HP, AM, and EC using a large template library (n = 175) which included shape and volume variants in relation to TLE (n = 135). We tested well-established volume-based and surface-based approaches as well as looked for a possibility of the combined approach. The proposed algorithm, HybridMulti, combined surface-based with volume-based similarity measures for optimal template selection. The SurfMulti was based on the linear alignment between the template and individual MRI. Volumebased approaches (Asman and Landman, 2013;Wang et al., 2013) rely also on the accuracy of the linear and non-linear registration. To improve alignment, we introduced a non-linear registration step that incorporates a novel hybrid cost function based on surface and volume. Our algorithm furthermore included a new multi-level feature weighting for shape averaging. We compared MTL segmentation of HybridMulti to our previous SurfMulti (Kim et al., 2012b) and two volume-based approaches with/without local weighted averaging (Collins and Pruessner, 2010;Wang et al., 2013); evaluations also took into account the influence of template library size on segmentation performance.

METHODS
HybridMulti includes a "template library construction" where the algorithm learns image features using a training-set and an "automatic segmentation" step where the algorithm segments MTL structures for an individual test MRI (Figure 1). Training set consists of MR images and manual labels of controls and patients ( Figure 1A). Labels are converted into surface meshes using spherical harmonics and point distribution model (SPHARM-PDM) that ensure shape-inherent point-wise correspondences across subjects (Styner et al., 2004(Styner et al., , 2006b. Each surface is mapped onto its corresponding MRI. In the beginning of the segmentation step, the pair of each template image and its MTL surface are mapped on the test image. As the test image does not have its own surface, the surface features extracted on the test image are from the surface of each template. By comparing the features extracted from each template and those from the test image, Surface-with volumederived similarity measures for optimal template selection are then computed to select an optimal subset n a (Figure 1B-1). Next, a non-linear registration that is driven by volume-based intensities and features sampled on evolving template surfaces is performed to improve alignment between each template in the subset n a and the individual MRI ( Figure 1B-2). The motivation of using this hybrid registration was to improve the boundary fitting by weighting the features extracted using deformable surfaces as well as to use a consistent similarity measurement in all the steps. After choosing a smaller subset n b , templates are then averaged using adaptive weighting combined with local averaging, which creates the final segmentation ( Figure 1B-3,4). The test image's features are updated during the series of the steps including template selection, non-linear registration and weighted averaging as the image and the surface deform. In this manner, the similarity of the deformable surface and the target MTL border is expected to increase and the surface gets a similar shape to the true MTL boundary.
Template Library Construction ( Figure 1A) Prior to the subsequent procedures, all MR images in the training-set and the test-set are spatially normalized by registering them into MNI ICBM 152 space. We create a template library that aggregates surface-based regional texture models of HP, AM, and EC as a joint representation of the three MTL structures.
Manually delineated labels of each MTL structure [linearly registered to MNI ICBM-152 space (Collins et al., 1994)] are converted into surface meshes and parameterized using the spherical harmonics and uniform icosahedron-subdivision model (SPHARM-PDM) that guarantees shape-inherent vertexwise correspondence across subjects (Styner et al., 2006a). MTL surfaces are treated as one concatenated surface, S MTL = [S HP , S AM , S EC ].
Each surface S MTL is mapped to its corresponding MRI. At a given surface vertex v, we define three spherical neighborhoods of 3, 5, and 7 mm radius. These spheres are subdivided into an inner region (IR) and outer region (OR) with respect to the surface boundary, where we compute the following texture features (Kim et al., 2012b): i) Normalized intensity (NI): the ratio between mean intensity and intensity standard deviation for each of IR/OR to capture regional tissue homogeneity. We defined NI IR, i = µ IR, i / SD IR, i and NI OR, i = µ OR, j / SD OR, i. ; ii) Relative intensity (RI): the ratio of mean intensity between IR and OR to assess the contrast between IR and OR voxels. RI was defined as RI i = 2 × (µ OR, iµ IR i ) / (µ OR, i + µ IR, i ); iii) Intensity gradient (IG): the 1st derivative of intensity along x-, y-, and zdirections to capture edge information was summarized into the magnitude as IG = g 2 x + g 2 y + g 2 z = ∂I ∂x + ∂I ∂y + ∂I ∂z . [x y z] is a voxel location and I is an image.
These texture features comprises a set of "true" feature vectors (3 normalized intensity + 3 relatively intensity + 3 gradients = 9 features), F v,j extracted at v-th vertex on the j th (1 . . . j . . . N) surface template. Previously we demonstrated that each feature almost equally contributed to the segmentation accuracy and observed the optimal result using all the features. Notably, we did not use the shape features proposed in our previous surfacebased framework (Kim et al., 2012b), which was used to constrain the shape deformation in the Automatic segmentation step. The deformation in the current study is instead governed directly by a volume-based non-linear registration (see section Boundary-Weighted Non-linear Registration of Template Subset to Test MRI).

Initial Template Subset Selection
From the template library, we first select a subset of candidates that are most similar to the test image. To that end, we compute the hybrid similarity O total that combined surface-based (O surface ) and volume-based (O volume ) similarity term between each template j and the test MRI i using: w surface is a weighting constant. The surface-based similarity O surface is defined as: O surface is calculated across all surface vertices v. It represents a normalized similarity between true features extracted from the j th (1 . . . j . . . N) template (F v,j ) and estimated features extracted from the test MRI i (F v,i ). O volume can be any similarity function including the cross-correlation or the normalized mutual information (NMI) that quantifies statistical intensity distribution dependency of two images A and B (Studholme et al., 1999). The computation of cross-correlation is generally faster while the NMI is more robust in similarity of multi-modal images compared to each other. For computational efficiency, we compute O volume within a mask defined by dilating the current template label three times. The number of selected templates (n a ) was empirically determined to maximize O total (see section Parameter Selection).

Boundary-Weighted Non-linear Registration of Template Subset to Test MRI
Each template MRI is non-linearly registered to the test MRI to increase shape similarity. To estimate the deformation field from a template T to the test MRI I, a "conventional" non-linear registration iteratively matches intensity features by maximizing a volume-based similarity function O vol, reg . Accordingly, the deformation field d is estimated as: O smooth is a smoothness term to constrain the estimated deformation. We employed a type of freeform deformation models defined in Collins et al. (1995). To improve the registration accuracy, we increase the weight of voxels on and nearby the target boundary by incorporating a similarity measure derived from the template surface evolving during the registration with the original volume similarity. Let S MTL, T be the true template surface on the original MRI and S MTL, S an estimated template mapped onto the test MRI. We define S MTL, S by deforming S MTL, T using the deformation field estimated at the current iteration. A surface-based feature similarity measure between S MTL, T and S MTL, S is defined as: where v is a vertex on surfaces S; F v is the relative intensity defined in 2.1. Therefore, O surf , reg is a correlation coefficient between feature F v,T extracted on S MTL, T and feature F v ,Ŝ extracted on S MTL,Ŝ . To estimate the deformation field, we redefine the Equation (3) as: O vol, reg is the correlation coefficient over a volume of interest (here, a geometric union of all MTL template labels in the library, subsequently dilated 5 times for more extensive spatial coverage) as in Collins and Pruessner (2010). A larger weight w surf , reg moves S MTL, S more rapidly to areas presenting with feature characteristics similar to those on the surface of the template image. Finally, Equation (5) is optimized using a derivativefree 3D Nelder-Mead Simplex approach (Lagarias et al., 1998) as also known as the simplex method, is a commonly applied approach. This method is applied to non-linear optimization problems for which derivatives may not be known and is robust against the local minima problem. This function has been used as the standard optimization method in the non-linear registration algorithm (Collins et al., 1995) we adopted in the current paper.

Subset Restriction and Global Weighed Averaging
The non-linear registration in the previous section (Boundaryweighted Non-linear Registration of Template Subset to Test MRI) is applied to decrease shape variability and to increase similarity between the template-subset and test image. From the initially selected n a template-subset (n a < N), we choose an even smaller subset of the n b most similar templates (n b < n a < N) based on Equation (1), increasing computational efficiency in subsequent steps. We determine n b empirically, which will be evaluated in the section Parameter Optimization. Optimal global weights for these n b templates are calculated using the similarity function Equation (2) as in Kim et al. (2012b). Let w S and w F be n b × 1 weight vectors for optimal surfaces and features. We then define S as the average surface of the n b template-subset as: Analogously, we define the weighted mean and SD of features at a given vertex v i by: Similarity from Equation (2) can be formulated for the templatesubset n b : Frontiers in Neuroinformatics | www.frontiersin.org F v, s is the estimated feature-set computed on the averaged surface S mapped on the test image. In the above formulas, weights are determined by maximizing the similarity between the n b template-subset and test image.
We initialized all components of w S and w F to 1/n. The cost function O subset is optimized using the multivariate derivativefree Nelder-Mead approach (Lagarias et al., 1998).

Multi-Level Local Weighted Averaging
To incorporate a local weighting to Equations (5-9), the resulting surface S in Equation (10) is resampled through icosahedronsubdivision (Styner et al., 2006b), first at the coarsest level l = l 0 . We determine weights at each sampling vertex, and interpolate these weights to vertices at the next finer level l 1 . Let w S l be a n b m weight matrix: m is the number of vertices at level l.
We compute w' S l , (a n b V vector) by interpolating w S, l to . For interpolation, we use the Fast Spherical Linear Interpolation (Shoemake, 1985). We define the locally weighted average surface as: The similarity function at the level l was defined as: (12) To achieve the final segmentation of all three MTL structures, we optimized w S l using the Nelder-Mead method while increasing subdivision level l=[l | l 0 , l 1 ,..., l max ]. The algorithm stops when Equation (11) stops increasing or l reaches preset l max to prevent from an extensive computation. The proposed multilevel approach using different subdivisions is mainly for coarseto-fine spatial fitting and the use of this strategy avoids the introduction of a constraint term preventing from local minima while the surface shape gets finer. In the current study, we set the coarsest level (l 0 = 2) where 42 equally distributed vertices are sampled; the finest level l max is determined empirically (See section MRI Acquisition).

Subjects
Our training-set included 40 healthy controls (18 men; mean ± SD age = 33 ± 12 years) and 135 drug-resistant TLE patients (61 men; mean ± SD age = 37 ± 11 years). TLE diagnosis and lateralization of the side of the seizure focus into left TLE (n = 65) and right TLE (n = 70) were determined by a comprehensive evaluation including video-EEG recordings and MRI. The Ethics Committee of the Montreal Neurological Institute and Hospital approved the study and written informed consent was obtained from all participants.
We also acquired 3T T1-weighted images on Siemens Trio Tim scanner using a 32-channel phased-array head coil. T1weighted images were acquired using 3DMPRAGE with 1 mm isotropic voxels (TR = 3,000 ms, TE = 4.32 ms, TI = 1,500 ms, flip angle = 7 • , matrix size = 336×384, FOV = 201 × 229 mm). This data was used to evaluate whether the algorithm consistently selected the same or similar parameter values for different dataset. The 3T dataset included 39 healthy controls and 84 drugresistant TLE patients who were further classified into left TLE (n = 38) and right TLE (n = 46).

Evaluation Metrics
To quantify the accuracy of automated segmentations, we computed the Dice similarity index:D = 2xv(M ∩ A)/(v(M) + v(A)), where M/A are the voxels comprising manual/automated labels; "M n A" are voxels in the intersection of M and A; v (·) is the volume operator.

Parameter Optimization
Based on maximal Dice overlap index between automated and manual labels, the following parameters were chosen empirically: weight of surface-based similarity w surface to select the optimal subset as in Equation (1); weight of surface-based similarity w surf , reg used in non-linear registration; size of initial template-subset n a ; size of final template-subset n b ; and finest subdivision l max in local weighting. We validated HybridMulti using a three-fold cross-validation where we subdivided our data into 3 sets with an almost equal sized sample (n = 58,58,59) and merged two sets among them to create a training-set and used the remaining set as a test-set while we balanced the proportion of controls (∼25%) and patients (∼75%) per set. The optimal parameters that resulted in most accurate segmentation were selected for each training-set. We segmented the test-set based on their corresponding training-set and the parameters. We repeated this process three times while all the three sets were tested.

Performance at Each Segmentation Stage
Segmentation accuracy was evaluated at the following stages: i) initial n a template-subset selection; ii) non-linear registration; iii) FIGURE 2 | Parameter optimization. All parameters were selected resulting in the best accuracy. The accuracy was measured using mean Dice index based on the three mesiotemporal structures and on three different test-sets (black, red, green) using a three-fold cross validation.
final n b template-subset selection; iv) global and local weighted averaging. We compared accuracy at each stage to that of the previous stage using paired t-tests.

Comparison With State-of-the-Art Multi-Template Approaches
We compared Dice indices between HybridMulti, and SurfMulti (Kim et al., 2012b), or a volume-based multitemplate approach (VolMulti) based on non-weighted averaging (Collins and Pruessner, 2010) or a volume-based approach (JointFusion) based on local-weighted averaging  in controls and each patient group using Student's t-tests. The parameters for each algorithm were selected empirically (VolMulti: size of subset = 15; JointFusion: search area r s = 3 x 3 x 3, patch size r p = 3 x 3 x 3, β = 2) which resulted in the best accuracy using a leave-one-out approach.

Detection of Mesiotemporal Atrophy Related to the Epileptic Focus
We assessed the ability of each automatic algorithm to detect each structure's atrophy in TLE groups relative to controls by computing Cohen's d ([mean volume controls-mean volume TLE] / pooled SD) that measures the effect size of a betweengroup difference, and calculated the significance of the observed effect using t-tests.

Impact of Template Library Size on Segmentation Accuracy
Keeping proportions of controls and patients constant, we randomly selected 40 subjects as a test-set. We then created the template library by selecting randomly from the rest of data, with various sizes: n = 88 (1/2), n = 58 (1/3), n = 44 (1/4), and n = 35 (1/5) of its original size. We repeated this process 20 times to avoid a possible bias. We evaluated automated segmentation accuracy at these smaller template library sizes.
Significances of all statistical tests were adjusted for multiple comparisons using Bonferroni-correction.

Parameter Selection
The parameters resulting in the best segmentation accuracy were selected at very similar values between the 3 test-sets when using a three-fold cross validation. The proposed HybridMulti achieved maximal accuracy with the following parameters: w surface = 3.1, w surf,reg = 1.1, n a = 17, and n b = 8 (average between the 3 test-sets; Figure 2). Use of the cross-correlation or NMI as the similarity function did not make a difference in segmentation accuracy. We thus used the cross-correlation as it was faster to compute. We also found that the local weighting using the finest subdivision l max larger than 5 (producing 252 sampling vertices) maintained the segmentation accuracy without a further improvement. Thus, we chose l max = 5 as a larger l max increased the computational time. JointFusion yielded best results with the following parameters: beta = 0.5; rp = 3; rs = 3. SurfMulti used n = 10 for the optimal subset whereas VolMulti used n = 14. All the algorithms were tested on a same computing environment (Linux workstation, 1 CPU, 2.30 Ghz, 8 GB RAM). Average computation times per individual hemisphere were 20 or 25 min for HybridMulti (O volume = cross-correlation or NMI, respectively; step-wise: initial subset selection: 1 min; non-linear registration: 10 [cross-correlation] or 15 [NMI] min; smaller subset selection: 0.5 min; global weighting: 3 min; Local weighting: 5.5 min); 15 min VolMulti; 15 min JointFusion; 13 min SurfMulti.
When performing the same evaluation on 3T dataset, we found the parameters yielding the maximal accuracy were selected at very similar values: w surface = 3.2, w surf ,reg = 1.2, n a = 17, n b = 8, and l max = 6.

Segmentation Accuracy in Different Steps
Accuracy of HybridMulti was improved gradually from the initial selection step and the highest accuracy was achieved at the final local weighted averaging (Figure 3).
Highest improvement was found at the boundary-weighted non-linear registration step for all structures (mean Dice = +4.8%, p < 0.0001). Moreover, the proposed non-linear registration that included a surface-term outperformed the original volume-based registration (Collins et al., 1995) (+2.5%, p < 0.001). Inclusion of local weighted averaging also significantly improved segmentation of EC (0.7%) and (HP: 0.3%) compared to the global weighting (p < 0.05).

Performance Comparison Between Algorithms
For all MTL structures, HybridMulti consistently outperformed SurfMulti and VolMulti in patients and controls (p < 0.001, Table 1), which was equally significant for 1.5T and 3T data ( Table 2). HybridMulti also showed a superior accuracy in TLE patients compared to JointFusion as higher Dice indices  were found in HP and EC ipsilaterally and in AM and EC contralaterally (p < 0.05). HybridMulti also segmented EC in healthy controls more accurately than JointFusion (p < 0.001). This pattern of difference between HybridMulti and JointFusion was similar in 3T data ( Table 2). For the 3T data, even using a smaller dataset, we found that all the methods resulted in accuracy comparable to the larger 1.5T dataset, with generally decreased SDs. A separate test that segmented 3T dataset using the 1.5T training-set showed the result where we found overall a slight drop down in the accuracy and a larger SD (Controls: HP = 89.5 ± 2.4; AM = 89.0 ± 2.9; EC = 82.8 ± 4.4; TLE-ipsilateral: HP = 88.5 ± 2.8; AM = 89.1 ± 3.2; EC = 82.5 ± 4.9; TLE-contralateral: HP = 89.2 ± 2.6; AM = 89.1 ± 2.8; EC = 82.5 ± 5.2) compared to when using a smaller-set of the same field strength training data. This suggests that using a lower field training-set to segment a higher field strength data results in slightly decreased accuracy due to a different tissue-contrast. Examples for 1.5T are shown in Figure 4 and those for 3T in Supplementary Figure 1.

Ability of Automated Methods to Detect Atrophy Related to the Epileptic Focus
Group-wise comparisons identified hippocampal atrophy ipsilateral to the seizure focus in TLE patients irrespective of the method, i.e., manual or automated (p < 0.05, Table 3). The effect sizes of atrophy detected using algorithms were all large (Cohen's d > 0.8). HybridMulti and JointFusion, nevertheless, detected an effect size of atrophy closest to manual volumetry (Cohen's d: manual = 1.67; HybridMulti = 1.57; JointFusion = 1.56).
Manual and HybridMulti segmentation also detected a large effect size of ipsilateral EC atrophy, which was significant compared to controls (t > 3.2, p < 0.05).

Impact of Template Library Size on Segmentation Accuracy
Reducing the template library size from N (n = 175) to N/5 (n = 35) showed that the accuracy of EC segmentation declined fastest compared to HP and AM, consistently in all algorithms tested. Size of the library had a lower influence on segmentation accuracy of HybridMulti, and volumebased approaches (JointFusion, VolMulti) than SurfMulti. Indeed linear model analysis of an interaction term between "segmentation method" and "size of the library" revealed a faster decline in Dice index for SurfMulti than for the other three methods (p < 0.001). HybridMulti and JointFusion, on the other hand, resulted in a similar accuracy when reducing the template library size from N to N/4 across all MTL structures (mean Dice decrease < 1%, p < 0.1, Figure 5). In HP and EC, reducing the library size from N/4 to N/5 influenced the accuracy  more significantly for HybridMulti than JointFusion (p < 0.01). However, the accuracy of HybridMulti was higher than that of JointFusion in all structures (mean Dice difference-HP: 0.3%; AM: 0.1%; E: 1%).

DISCUSSION AND CONCLUSION
We propose HybridMulti, an algorithm that combines surfaceand volume-based similarity to automatically segment key regions in the mesiotemporal lobe (i.e., HP, AM, and EC). In controls and TLE patients alike, segmentation accuracy was excellent, with Dice indices above 88% for HP and AM and above 82% for EC. In particular, the proposed method outperformed previous multi-template approaches in pathological MTL structures, as its overlap to manual delineation and its sensitivity to detect atrophy were superior. Reducing template library showed that our method is reliable in even case of a small size of training-set. Our algorithm was compared to three recently proposed multi-template approaches: volume-based approaches-JointFusion , VolMulti (Collins and Pruessner, 2010), and a purely surface-based framework-SurfMulti (Kim et al., 2012b). Improved segmentation accuracy of HybridMulti relative to these algorithms likely results from modeling both volume-and surface-derived features to select the optimal template subset and to improve the alignment between these templates and the test MRI prior to surface-shape FIGURE 5 | Impact of template library size on automated segmentations. Reducing the template library size from N (n = 175) to N/5 (n = 35) showed that the accuracy of EC segmentation declined fastest compared to HP and AM, consistently in all algorithms tested. Size of the library had a lower influence on segmentation accuracy of HybridMulti, and volume-based approaches (JointFusion, VolMulti) than SurfMulti.
averaging. Noticeably, our approach did not only sequentially apply a volumetric non-linear registration prior to the surfacebased segmentation; instead, surface features were integrated with volume data-term into a unified cost function governing the non-linear registration, an approach yielding additional increases in accuracy.
In addition to absolute gain in segmentation accuracy, the proposed HybridMulti algorithm demonstrated robust segmentation for our two separate data-sets when the size of the template library was reduced, an important challenge for purely surface-based approaches as shown in our analysis. Indeed, volume-based approaches were inclined to maintain its original accuracy at the largest template library when reducing the size of the library. At the smallest size that was tested in our study (n = 35), the accuracy of JointFusion and HybridMulti was almost equal in all MTL structures. This informs us to an interesting aspect of feature modeling where local features modeled nearby the structure's boundary may be individually very specific and become powerful with construction of a large training-set. On the other hand, features collected within a "relatively large" volume of interest may include redundant information in a large database but may provide supplementary characteristics of the target structure in case of using a limited size of template library.
In our hybrid approach, tuning of the weight between surfaceand volume-features according to the size of a given template library can possibly improve the segmentation accuracy.
Our EC segmentation in the current work (>82%) outperformed a previous study that reported a Dice index of 73% (Hu et al., 2014), and another study that segmented the whole parahippocampal gyrus with a similar degree of accuracy (Heckemann et al., 2006). The performance of HybridMulti was also superior to JointFusion and SurfMulti in the current evaluation. Nevertheless, our EC segmentation accuracy was still lower than that of HP and AM, which approached 90%. It is likely that intensity-based segmentation is challenged by the highly variable morphology of the collateral sulcus that defines the border of EC. Also, the posterior end of EC is defined with an external anatomical landmark. Use of a smaller size of template library also showed a faster decline of accuracy in EC than other MTL structures. In the literature (Bernasconi et al., 2001;Pruessner et al., 2002), multiple landmarks were borrowed to address for lack of intensity contrast when defining the border of EC. For example, the medial and lateral boundaries, which meet the same GM structures such as the subiculum of the hippocampus and the collateral sulcus, cannot be defined by the tissue contrast but by landmarks such as a location with a high angular shape. A human expert may intuitively identify such landmarks whereas the features used in our algorithm do not necessarily take into account them. The suboptimal modeling of these landmarks in our approach is likely the source of inaccuracy in segmentation. This faster decline in accuracy was consistently observed in all algorithms tested. Future works might, therefore, benefit from the incorporation of sulco-gyral shape patterns such as sulcal depth, curvature or spatial relationship with surrounding structures other than HP and AM.
A new multi-scale weighting strategy improved EC and HP segmentation. In particular, the improvement of EC segmentation was higher. This was in line with a previous finding that such a technique mainly improve the segmentation of structures presenting highly variable morphology (Artaechevarria et al., 2009).
The proposed algorithm and JointFusion detected largest effect sizes of atrophy in HP ipsilateral to the epileptic focus and resulted in the most sensitivity to detect hippocampal atrophy among algorithms. Only HybridMulti identified EC atrophy among algorithm even if the accuracy yet is to reach human expert's exquisiteness. Our results suggest that the proposed approach may have the potential for clinical utility in the presurgical evaluation of temporal lobe epilepsy.
Varying the parameters for HybridMulti (i.e., the weights for surface-term in the similarity measure and the registration, and the number of templates in the subset) yielded different segmentation accuracy. We determined these parameters in empirical fashion for optimal segmentation performance. We observed that almost same parameter setting were determined for achievement of the best results on both 1.5T and 3T. In a further analysis, we found that these parameters did not differ between segmentation of the three MTL structures. This suggests that the parameters optimized in our study, albeit done empirically, may be generally applicable to segmentation of other datasets or other brain structures. A more thorough analysis is demanded to establish the generalization of the parameters.
For 3T dataset, all the methods resulted in accuracy comparable to the larger 1.5T dataset, with generally decreased SDs. This likely explains that reliable segmentation can be achieved on 3T images where the higher tissue contrast and clearer structural boundaries seen.
As the initial selection was not optimal and we did not like to miss templates which can be potentially useful, we defined a relatively lager subset whereas we set a smaller sample in the subsequent selection with a deformable registration. Our empirical selection of parameters indeed found better segmentation performance was obtained using a larger subset in the initial selection (best performance at n = 17) and a smaller set in the latter selection (n = 8). The vertexwise correspondence between individual surface templates defined through SPHARM-PDM ensures the same topology across templates. When we averaged the template shapes, we performed a vertex-wise averaging method that averages the location of a given vertex of the correspondence between templates. The integrity of the topology was not corrupted after this averaging as the same observation is found in a similar process of shape averaging such as in construction of cortical surface template (Styner et al., 2004;Lyttelton et al., 2007).
To determine the number of templates with the best performance, it would be ideal if we observed a plateau occurring after the continuous hiking in Dice index value from the minimum number of templates to test with (Figure 5). However, no plateau with an on-going climbing pattern was found in EC, which make difficult to determine when the best performance takes place. The best performance might have been identified if we tested with more templates. This is our limitation as collecting a sufficiently large dataset is often a longterm procedure in the inpatient epilepsy monitoring unit. Thus, it was unrealistic for us to include more data in the study. Alternatively, the very slow increase in Dice index observed at the test with 90+ templates likely explains the increase of the templates would not gain a very significant improvement of the current method. There have been studies dealing with the size of the template library using statistical models (Awate et al., 2012;Awate and Whitaker, 2014). We did not explore the possible selection of too many similar templates in the subset. A previous study  investigated this using a joint label fusion technique that address for the covariance of the image appearance between any pair of two templates in the training-set. Generalization of the proposed method across different subcortical structures (e.g., ventricles, striatum, or thalamic nucleus) would be also interesting to enable their morphometry analysis, in particular with regard to size, shape, and variability. We are also working on to extend our current framework to segmentation of the subregions of MTL structures such as hippocampal subfields. The deep learning algorithm using convolutional neural networks (CNN) has been more widely applied in recent works for the medical image segmentation (Kamnitsas et al., 2017;Bao and Chung, 2018;Dolz et al., 2018). Augmentation of our relatively large set of our MRI data and manual annotations could meet the requirement for the training of the CNNs, which can be a proper future extension of our work. We are currently taking steps to make our tools available, including obtaining proper institutional ethics approval, with the plan to ultimately upload the software and training set to a public domain, such as the Neuroimaging Informatics Tools and Resources Clearinghouse (http://www. nitrc.org/).

AUTHOR CONTRIBUTIONS
HK implemented the study design and the algorithm. Performed the evaluation. Wrote and edited the draft. BC tested and retested the data using a conventional method to compare. AB provided the MRI data and patient clinical data. Edited the manuscript. NB manually labeled the MTL structures edited the manuscript.