Stable Anatomy Detection in Multimodal Imaging Through Sparse Group Regularization: A Comparative Study of Iron Accumulation in the Aging Brain

Multimodal neuroimaging provides a rich source of data for identifying brain regions associated with disease progression and aging. However, present studies still typically analyze modalities separately or aggregate voxel-wise measurements and analyses to the structural level, thus reducing statistical power. As a central example, previous works have used two quantitative MRI parameters—R2* and quantitative susceptibility (QS)—to study changes in iron associated with aging in healthy and multiple sclerosis subjects, but failed to simultaneously account for both. In this article, we propose a unified framework that combines information from multiple imaging modalities and regularizes estimates for increased interpretability, generalizability, and stability. Our work focuses on joint region detection problems where overlap between effect supports across modalities is encouraged but not strictly enforced. To achieve this, we combine L1 (lasso), total variation (TV), and L2 group lasso penalties. While the TV penalty encourages geometric regularization by controlling estimate variability and support boundary geometry, the group lasso penalty accounts for similarities in the support between imaging modalities. We address the computational difficulty in this regularization scheme with an alternating direction method of multipliers (ADMM) optimizer. In a neuroimaging application, we compare our method against independent sparse and joint sparse models using a dataset of R2* and QS maps derived from MRI scans of 113 healthy controls: our method produces clinically-interpretable regions where specific iron changes are associated with healthy aging. Together with results across multiple simulation studies, we conclude that our approach identifies regions that are more strongly associated with the variable of interest (e.g., age), more accurate, and more stable with respect to training data variability. 
This work makes progress toward a stable and interpretable multimodal imaging analysis framework for studying disease-related changes in brain structure and can be extended for classification and disease prediction tasks.


INTRODUCTION
Following progressive developments in healthcare, the subpopulation of individuals over 60 years old has become the fastest growing worldwide (lr1, 2015). Thanks also in part to rapid advancements in neuroimaging and data storage technology, it is not surprising that much work in recent years has been devoted to studying structural changes in the brain during natural aging and disease progression (e.g., Alzheimer's and Parkinson's diseases, Wyss-Coray, 2016; and multiple sclerosis, Elkady et al., 2017), as well as differences between healthy and non-normative subjects, particularly in the elderly population.
The naturally high dimensionality of neuroimaging data and their inherent spatial correlation structures present significant challenges to the above goals. Implications of these challenges are apparent in the drawbacks (Davatzikos, 2004) of voxel-based analyses (Ashburner and Friston, 2000) and motivate many of the dimension reduction and regularization techniques commonly applied to neuroimaging data. These methods include pattern identification and discrimination (Fan et al., 2007), which rely on sample sizes not feasible in small neuroimaging studies, and sparse regularization for feature extraction or selection (Batmanghelich et al., 2011; Sabuncu and Van Leemput, 2012; He et al., 2016; Yu et al., 2019; Su et al., 2020). Despite successful applications of the latter to medical data (Krishnapuram et al., 2005; Zou and Hastie, 2005; Ryali et al., 2010), these approaches are unstable with respect to selected or extracted features. In the presence of spatial correlation structures, such as in neuroimaging data where each voxel carries similar information to its neighbors, sparsity comes at the severe detriment of interpretability. To address this, approaches to shape regularization have been developed, notably total variation (TV) penalties, which penalize spatial gradients to encourage a model to apply similar weights to neighboring voxels.
Nonetheless, incorporating multimodal imaging data into analytic techniques remains a difficult problem in general. Previous neuroimaging studies have demonstrated that the simultaneous analysis of multiple imaging modalities can improve statistical power (Elkady et al., 2017), although methods for efficiently doing so are limited. As a result, most studies either analyze different imaging modes independently or aggregate voxel-wise observations to the structural level (Cherubini et al., 2009). Both approaches are inefficient from the perspective of statistical power and estimate interpretability.
As a focal example and application for this article, we consider specific iron changes in deep gray matter (DGM), which have been histologically associated with healthy aging in the brain (Hallgren and Sourander, 1958). Gradient-echo magnetic resonance imaging (MRI) is used extensively in vivo due to its sensitivity to such changes (Peters, 2002). Transverse relaxation rate (R2*) and quantitative susceptibility (QS), both voxel-wise measures derived from gradient-echo MRI, are typically used for this purpose (Haacke et al., 2010;Acosta-Cabronero et al., 2016). Several studies have demonstrated a positive association between iron levels and age in healthy controls using quantitative MRI methods, although they consider only a single MRI measure or anatomical structure (Cherubini et al., 2009;Haacke et al., 2010;Acosta-Cabronero et al., 2016). The combined use of R2* and QS has only recently been introduced, for the purpose of improving statistical power in delineating iron and non-iron changes in studies of multiple sclerosis (Elkady et al., 2017). This earlier study, however, only combines modalities in an initial feature selection stage rather than in a single model.
Related works have similarly investigated, with the previously discussed methodological limitations, the complex associations between QS and R2* measures, iron, and myelin in brain structures. Through a biochemical and histological analysis of postmortem brains using Pearson correlation (e.g., between myelin and R2*) and linear mixed effect regression (e.g., of R2* on iron and myelin intensities) on region- or structurally-aggregated measures, Hametner et al. (2018) showed that R2* is only weakly sensitive to myelin in white matter regions that do not contain iron, but is strongly sensitive to iron in the basal ganglia and white matter regions that do contain iron. The authors further demonstrated that QS also has mixed sensitivities to iron and myelin in white matter. Taege et al. (2019) applied region of interest-based (i.e., structurally-aggregated) regression analyses using R2* and QS to investigate deep gray matter microstructure in the context of multiple sclerosis. Mangeat et al. (2015) used two-dimensional independent components analysis of magnetization transfer, cortical thickness, B0 orientation, and R2* to extract information about the fraction of myelin in white matter. Bergsland et al. (2018) used QS and diffusion metrics to perform separate univariate statistical analyses of thalamic white matter tracts and used non-parametric correlation analysis to combine these measures and quantify their association with clinical disease metrics.
While all the previous works provide useful insight into the structure and composition of the brain, the use of region-based analyses relies on the typically incorrect assumption of intra-region homogeneity. Furthermore, these works generally do not jointly incorporate multiple imaging modes as model covariates. These are general limitations in the application of traditional statistical methods to high-dimensional neuroimaging data. In this work, we aim to address these limitations in the context of region-detection problems.
In this article, we propose a unified, multimodal framework for joint region detection associated with a numerical variable of interest such as age. We formulate this problem in terms of sparse regression with L1 (lasso), TV, and L2 group lasso penalties, with the last of these combining information across multiple modalities. These penalties have been widely used in the development of robust methods for medical data (He et al., 2016; Yu et al., 2019). Together, the penalties regularize spatial effect estimates for better interpretability, both within and between modalities. This work is concerned primarily with joint region detection, such as in our focal R2*-QS-age problem, where overlap between effect supports across modalities is encouraged but not strictly enforced. To address the computational difficulty in implementing these penalties together, we propose an optimization procedure using the alternating direction method of multipliers (ADMM) algorithm.
Our work addresses the need for multimodal neuroimaging techniques for joint region detection that can provide interpretable estimates accounting for similarities in spatial effect supports. In our focal example, it is critical to account for both R2* and QS since, voxel-wise, both are positively associated with iron accumulation (Langkammer et al., 2012): the increased statistical power of multimodal MRI analysis has previously allowed the detection of subtle pathological changes in neurological diseases such as multiple sclerosis (Elkady et al., 2017) and will similarly improve the detection of aging-related neurodegenerative effects. Generally, our approach emphasizes the interpretability and stability of support estimates in terms of spatial smoothness and variability (with respect to perturbations in training data), respectively. Our estimators are much less affected by spatial correlation structures that would otherwise result in severe under-segmentation in other sparse analyses. Furthermore, our work easily accommodates multiple anatomical regions simultaneously: this contrasts with traditional voxel-based analyses, which are generally known to have unsuitable performance when considering multiple regions due to their disregard for spatial correlation (Ashburner and Friston, 2000). Within the limited literature addressing multimodal data (Michel et al., 2011; Gramfort et al., 2013), our formulation using sparse regularization is novel.

Sparse Regression and Image Regularization
We first present more detail on the components of our proposed framework and define the two other models against which the proposed method will be compared. For notational simplicity, we ignore intercept terms in the following subsections and only consider two modalities, although our setup generalizes immediately.
Let X1 and X2 be n × p matrices corresponding to the two imaging modalities (e.g., R2* and QS) after column-wise standardization. These matrices have i-th rows x1i^⊤ and x2i^⊤, respectively, corresponding to the vectorized, observed image data for the i-th subject. In general, p can be smaller than the number of voxels in the image due to the application of a mask (held constant across subjects). Let β1 and β2 denote the corresponding voxel effects to be estimated, and let y be an observed length-n vector of a continuous variable corresponding to the n subjects. Loosely, we are interested in associations of X1 and X2 with y, where p ≫ n.

Sparse Regression
The traditional (least squares) linear regression problem in this setting takes the form (β̂1, β̂2) = arg min_{β1, β2} ‖y − X1β1 − X2β2‖2^2. This estimator is unstable and has extremely high variability, attributable to overfitting in this high-dimensional setting. Furthermore, it typically yields estimates in which every entry is non-zero, making it useless for region identification. To address both problems, we introduce a sparsity assumption on β = (β1^⊤, β2^⊤)^⊤ by adding an L1 (lasso) penalty to the previous problem, as an approximation of the L0 penalty used in (NP-hard) best subset selection (Tibshirani, 1996; Huo and Ni, 2007). The lasso penalty is widely established in both statistical theory and application as a method for variable selection (Tibshirani, 1996; Tibshirani et al., 2005; Simon et al., 2013): loosely speaking, when applied in 3D imaging contexts, this penalty identifies voxels of interest by shrinking entries of β̂ corresponding to "non-significant" voxels to zero.
The sparse regression problem becomes

(β̂1, β̂2) = arg min_{β1, β2} ‖y − X1β1 − X2β2‖2^2 + λ1(‖β1‖1 + ‖β2‖1),   (1)

where the hyperparameter λ1 controls the balance between model fidelity and model sparsity. We use the same hyperparameter λ1 for both β1 and β2 since X1 and X2 are standardized. While the sparse regression problem in (1) is typically encountered in models for predicting y, that is not our primary interest here. We are instead concerned with associations of X1 and X2 with y: in a neuroimaging context, we wish to identify regions with MRI measures X that vary together with age y. In particular, the unpenalized least squares model maximizes the squared sample correlation between yi and x1i^⊤β1 + x2i^⊤β2, but requires further regularization to become practically usable.
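Before the TV and group penalties are introduced, problem (1) alone can already be solved with a simple proximal gradient (ISTA) iteration. The following minimal sketch on synthetic data is illustrative only; the optimizer we actually use for the full problem is the ADMM scheme described later, and all names here are ours:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for 0.5 * ||y - X b||_2^2 + lam * ||b||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y))                        # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return b

# Tiny two-modality example: X = [X1, X2], with overlapping true supports.
rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, 2 * p))
beta_true = np.zeros(2 * p)
beta_true[[0, 1, p, p + 1]] = 2.0  # same two voxels active in both "modalities"
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat = lasso_ista(X, y, lam=10.0)
# The lasso zeroes out most truly inactive coefficients while keeping the
# active ones, which is the region-identification behavior relied on above.
```
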

TV Regularization
While the previous sparsity assumption solves initial problems in estimating β, it does not suitably address estimate interpretability. In neuroimaging contexts, since measures from nearby voxels tend to be correlated, β̂ will select isolated voxels rather than contiguous, compact regions that are more amenable to interpretation (Kandel et al., 2013; Dubois et al., 2014; Eickenberg et al., 2015). To address this, we add a TV penalty TV(β) = ‖Dβ‖1, where D is the 3-dimensional differential operator, penalizing the spatial gradients of the effect estimates. The resulting joint sparse problem is

(β̂1, β̂2) = arg min_{β1, β2} ‖y − X1β1 − X2β2‖2^2 + λ1(‖β1‖1 + ‖β2‖1) + λ2(TV(β1) + TV(β2)),   (2)

where the hyperparameter λ2 controls the strength of the geometric regularization.

Sparse Group Lasso Regularization
The TV penalty in (2) addresses the interpretability of the β1 and β2 estimates separately rather than jointly. The latter is a natural consideration in neuroimaging when interest lies in the joint support, e.g., regions that are associated with age for both R2* and QS modalities, or when prior knowledge suggests that the effect supports should overlap. We introduce an additional sparse group lasso regularization term to this end. Group lasso methods apply a penalty to pre-specified groups of variables and are widely used. To encourage overlap in the two estimate supports, we consider p groups of size 2, each composed of the voxel-wise measures at the same location. The sparse group lasso penalty (Simon et al., 2013) in this case is given by

λ3 Σ_{j=1}^{p} ‖(β1j, β2j)‖2,

where β1j and β2j denote the effects at the j-th voxel for the two modalities. In contrast, the standard group lasso (Yuan and Lin, 2006) applies only the L2 penalty to each group, without the accompanying L1 term, forcing all estimates in a group to be either jointly zero or jointly non-zero. This is an undesirably hard constraint in our setting. The resulting penalized problem and our proposed estimator are given by

(β̂1, β̂2) = arg min_{β1, β2} ‖y − X1β1 − X2β2‖2^2 + λ1(‖β1‖1 + ‖β2‖1) + λ2(TV(β1) + TV(β2)) + λ3 Σ_{j=1}^{p} ‖(β1j, β2j)‖2.   (3)
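As a concrete illustration, the combined lasso and voxel-paired group term can be evaluated directly from two effect vectors. The function and variable names below are ours, not from the paper:

```python
import numpy as np

def sparse_group_penalty(beta1, beta2, lam1, lam3):
    """Lasso plus voxel-paired group lasso penalty.

    Group j holds the two modality effects (beta1[j], beta2[j]) at the
    same voxel, so the L2 group term encourages the two supports to
    overlap without forcing them to coincide exactly.
    """
    l1_term = lam1 * (np.abs(beta1).sum() + np.abs(beta2).sum())
    group_norms = np.sqrt(beta1 ** 2 + beta2 ** 2)  # L2 norm of each voxel pair
    return l1_term + lam3 * group_norms.sum()

# Three voxels: inactive, active in one modality, active in both.
beta1 = np.array([0.0, 1.0, -2.0])
beta2 = np.array([0.0, 0.0, 2.0])
penalty = sparse_group_penalty(beta1, beta2, lam1=1.0, lam3=0.5)
```

Note that a voxel active in only one modality (the second group above) still contributes to the group term, so overlap is encouraged rather than strictly enforced.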

Optimization
The optimization problem in (3) is difficult to solve due to the non-differentiability of the L1 norm appearing in the lasso and TV penalties, to which standard gradient-based methods do not apply. Various numerical methods have been explored for settings with these two penalties, including for logistic regression (Dohmatob et al., 2014).
In this work, we apply the ADMM algorithm (Boyd et al., 2011), an efficient convex optimization scheme amenable to parallel computing. Its computational benefits rely on splitting a given problem into two convex sub-problems. In the present case, denoting β = (β1^⊤, β2^⊤)^⊤, (3) can be written as

arg min_{β, α} f(β) + g(α) subject to Aβ = α,

where f(β) = ‖y − X1β1 − X2β2‖2^2 is the smooth model-fidelity term, g collects the non-smooth lasso, TV, and group lasso penalties applied to the auxiliary variable α, A = (I, D^⊤)^⊤ stacks the identity with the spatial gradient, and D is the 3-dimensional differential operator (Rohr, 1997).
The β-update step minimizes a smooth function (noting that α and the scaled dual variable η are held fixed) and can be performed using standard first-order methods. The α-update step, on the other hand, features a non-smooth objective function containing both L2 and L1 penalties on α. This update admits a closed-form solution using the soft-thresholding operator S_λ(x) = sgn(x)(|x| − λ)+. For a termination criterion, we follow the suggestion of Boyd et al. (2011) based on the problem's primal residuals r_primal^(t) = α^(t) − Aβ^(t) and dual residuals r_dual^(t), stopping when their norms fall below positive primal and dual tolerances ε_primal and ε_dual, respectively.
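The element-wise soft-thresholding operator S_λ, together with its block analogue for the L2 group terms, covers the proximal pieces of the α-update. A minimal sketch of both follows; how they compose in the full update depends on the exact splitting, so treat this as illustrative:

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lam(x) = sgn(x) * (|x| - lam)_+, applied element-wise.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def block_soft_threshold(g, lam):
    # Group analogue (prox of lam * ||.||_2): shrink the whole block
    # toward zero, returning exactly zero when ||g||_2 <= lam.
    norm = np.linalg.norm(g)
    if norm <= lam:
        return np.zeros_like(g)
    return (1.0 - lam / norm) * g

shrunk = soft_threshold(np.array([-3.0, 0.5, 2.0]), 1.0)   # element-wise shrinkage
block = block_soft_threshold(np.array([3.0, 4.0]), 2.5)    # scales by 1 - 2.5/5
```

The block operator is what lets a whole voxel pair (β1j, β2j) be zeroed jointly, while the scalar operator still permits within-pair sparsity.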

Models and Hyperparameter Tuning
In all subsequent analyses, we compare performance between three models. The first is an independent sparse (IS) model, composed of one model for each imaging modality, each fit with lasso and TV regularization. For this model, "predictions" for y are taken as averages across the independent models. Second, we consider a joint sparse (JS) model, given in (2), including all imaging modalities together with lasso and TV penalties. Third, we consider our proposed sparse group lasso (SGL) method in (3), including all imaging modalities and lasso, TV, and sparse group lasso regularization. Generally, three hyperparameters λ1, λ2, and λ3 require tuning. We reparameterize these in terms of λ, r1,2, and r3 as (λ1, λ2, λ3) = λ(r1,2(1 − r3), (1 − r1,2)(1 − r3), r3) and use the Bayesian information criterion (BIC) (Fan and Tang, 2013), calculated on the full training dataset, as a performance criterion to choose optimal hyperparameter values. In all analyses, results using the generalized information criterion (GIC) (Fan and Tang, 2013) are similar.
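The reparameterization of the three penalty weights can be written out directly; by construction they always sum to the overall level λ, so λ sets the total amount of regularization while the ratios allocate it (`r12` below stands for r1,2):

```python
def penalty_weights(lam, r12, r3):
    """Map (lam, r12, r3) to (lam1, lam2, lam3).

    r3 in [0, 1] splits weight between the group term (lam3) and the
    lasso/TV pair; r12 in [0, 1] then splits the remainder between the
    lasso (lam1) and TV (lam2) terms.
    """
    lam1 = lam * r12 * (1.0 - r3)
    lam2 = lam * (1.0 - r12) * (1.0 - r3)
    lam3 = lam * r3
    return lam1, lam2, lam3

lam1, lam2, lam3 = penalty_weights(lam=1.0, r12=0.6, r3=0.25)
```

Tuning then reduces to a search over (λ, r1,2, r3) scored by BIC on the full training dataset.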

Synthetic Data and Simulation Study
Because ground truth for the support regions is not available in real-world imaging, we use synthetic data to more comprehensively investigate the performance of our proposed method. Our setup generalizes that in Zhang et al. (2018) by including a second imaging modality and inter-modality correlation. In all simulation studies, we use training datasets of size n = 100, which is comparable to the n = 113 used in our real-world MRI application.
Imaging data for the first mode is generated with size 32 × 32 × 8. The true support is represented by four disjoint 8 × 8 × 4 regions (which we call "blocks"). Imaging data for the second mode is generated similarly; however, to simulate (partial) overlap in the support between the two modes, we allow the block size for the second mode to vary uniformly at random over {6, 7, 8, 9, 10}^2 × {4, 5}. Example supports are provided in Figure 1.
Background noise is independently generated from a N(0, I) distribution. Voxel measures within a block are correlated only with voxels from the same block and from the corresponding block in the other mode. This inter-voxel correlation is controlled by a spatial coherence parameter ρ, with correlation decreasing in the L0 distance between voxels, and is further multiplied by an inter-mode correlation parameter η for measures in different modes.
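A rough sketch of the support construction is below. The four corner placements are our assumption for illustration (the paper specifies four disjoint 8×8×4 blocks for the first mode and uniformly varying block sizes for the second), and the correlated-noise generation is omitted:

```python
import numpy as np

rng = np.random.default_rng(42)
shape = (32, 32, 8)

def block_support(corners, sizes, shape):
    """Binary support mask with one rectangular block per corner."""
    mask = np.zeros(shape, dtype=bool)
    for (cx, cy, cz), (sx, sy, sz) in zip(corners, sizes):
        mask[cx:cx + sx, cy:cy + sy, cz:cz + sz] = True
    return mask

# Mode 1: four disjoint 8x8x4 blocks (corner placement is illustrative).
corners = [(2, 2, 1), (2, 20, 1), (20, 2, 1), (20, 20, 1)]
support1 = block_support(corners, [(8, 8, 4)] * 4, shape)

# Mode 2: same corners, but block sizes drawn from {6,...,10}^2 x {4,5},
# giving partial (rather than exact) overlap with the mode-1 support.
sizes2 = [(rng.integers(6, 11), rng.integers(6, 11), rng.integers(4, 6))
          for _ in corners]
support2 = block_support(corners, sizes2, shape)
```
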
We consider two small-sample (n = 100, p = 32 × 32 × 8 × 2 = 16384) simulation settings: a "low correlation" (ρ = 0.5, η = 0.2) and a "high correlation" setting (ρ = 0.8, η = 0.5). In each case, we run two analyses. The first examines overall performance averaged over 10 simulations, each using independently-generated true supports. The second examines estimate stability over five independent datasets generated using the same true support (i.e., for 10 possible inter-simulation comparisons). From here on, we refer to these two analyses as simulation study #1 and simulation study #2, respectively.
Intuitively, simulation study #1 assesses average performance across a variety of ground truths. On the other hand, simulation study #2 assesses estimate stability across different training datasets for a single, fixed ground truth.

Neuroimaging Data, Pre-processing, and Analyses
The neuroimaging data used in this work is from an in-house study of multiple sclerosis and controls (Walsh et al., 2014; Elkady et al., 2017), similar to that used in Zhang et al. (2018) but composed of only n = 113 control scans (mean age 40.25 with SD 10.91, minimum 21.8, and maximum 65.1), of which 40 were obtained from male subjects. Our primary interest is in identifying regions of the brain in which variability in iron levels is associated with age. We focus on four subcortical deep gray matter (DGM) structures: the caudate, putamen, thalamus, and globus pallidus, shown in Figure 2 (Zhang et al., 2017, 2018). We use the R2* and QS modalities, as both are known to be highly sensitive to changes in non-heme iron (Wang and Liu, 2015).

FIGURE 2 | Deep gray matter (DGM) structures considered in this study. The DGM mask applied to the data is shown in gray in all sub-panels. The caudate (blue), putamen (red), thalamus (green), and globus pallidus (purple) are highlighted in each sub-panel.
Prior to analysis, the MRI data was pre-processed and aligned with an in-house unbiased template using ANTs (ANTS, 2011), built from T1w, R2*, and QS data from 10 healthy controls. Pre-processing involved intra-subject alignment of the R2* and QS maps with the T1w map. Non-linear registration in the template space was accomplished by applying SyN (ANTS, 2011) to the multimodal MRI data. Observation row vectors xi^⊤ were obtained by taking voxels within a DGM mask manually traced on the atlas, shown in Figure 2. Data matrix columns were standardized before analysis.

Evaluation Methodology
For simulation study #1, we report prediction mean absolute error (MAE) on the training set (n = 100) and both MAE and R 2 on the independent testing set (n = 500). Since the true support is known for synthetic data, we also report Dice scores between the estimated and true supports (called "full Dice") as well as between the estimated and true joint supports (called "joint Dice"). Here, "joint" refers to the region of overlap in the supports of the two imaging modes.
For simulation study #2, we additionally report Dice scores between estimated supports obtained from different training datasets. This performance measure quantifies the stability of the estimated support with respect to training set variability.
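All Dice-based measures in both simulation studies reduce to the same computation on binary support masks; a minimal sketch:

```python
import numpy as np

def dice(a, b):
    """Dice score 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both supports empty: treat as perfect agreement
    return 2.0 * (a & b).sum() / denom

estimated = np.array([1, 1, 0, 0, 1], dtype=bool)
truth = np.array([1, 0, 0, 1, 1], dtype=bool)
score = dice(estimated, truth)  # 2 * 2 / (3 + 3)
```

Stability is then quantified by averaging dice(support_i, support_j) over all pairs of supports estimated from different training datasets, and "joint Dice" applies the same function to the intersections of the two modality supports.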
In the neuroimaging analysis, we consider a 23-fold cross-validation approach due to the small sample size. We report training and testing set MAE, testing set R2, and pairwise Dice scores between the supports estimated using each training dataset.

Simulation Studies
Results for simulation study #1 are provided in Table 1. Figure 3 gives example visualizations of the estimated and true supports for one simulation. These results are similar between the high and low correlation settings. Full and joint Dice scores are significantly higher for the proposed SGL model, with 1.4-4.9 times lower variability across simulations, suggesting more accurate and stable support estimation when using the proposed SGL method. Figure 3 supports these conclusions and illustrates how the SGL estimator more clearly and accurately identifies the joint support: in contrast, IS and JS estimates suggest higher false-positive voxel selection rates. These points are visually evident through the differences in the size of the joint region (shown in purple) detected by the SGL and the IS/JS estimates, the noise present in the IS (and, to a lesser extent, JS) estimates unique to one imaging modality (indicated in red and cyan), and the notable deterioration of IS/JS estimate quality when the true effect size is small. While not the primary goal, the proposed SGL model retains excellent association with the response, shown in testing set MAE and R2: as hypothesized, this suggests that the proposed SGL model generalizes better to new data and that the estimated relationship (in β̂) between voxel-wise measures and the response is less affected by over-fitting to the training dataset, relative to the IS and JS models.

Results for simulation study #2 are provided in Table 2. Figure 4 gives example visualizations of the estimated supports (obtained from different training datasets) and the true support for one simulation. These results agree with the conclusions of simulation study #1: the proposed SGL method more accurately and stably selects the true support (either full or joint) while retaining high association with the response. Despite a more complex true support, Figure 4 demonstrates the stability of SGL estimates across independent training sets. This contrasts with IS and JS estimates, which again show higher false-positive voxel selection rates (visually, with red, cyan, or purple outside of the gray true support) and struggle to identify the joint support (visually, with estimates in purple), particularly when the true signal is weak.

Neuroimaging Data Analysis
Results for the neuroimaging analysis are given in Table 3. Examples of the estimated supports are provided for one training fold in Figure 5 and for multiple folds in Figure 6. While no ground truth is available, these results support the simulation studies' conclusions. Figure 5 illustrates the expected behavior of typical sparse methods in real-world settings: both the IS and JS estimates show many small, non-contiguous regions of selected voxels (although the JS estimates are somewhat smoother). More importantly, the IS and JS estimates suggest little to no joint support (in purple), which is clearly not consistent with findings from previous neuroimaging studies. In contrast, the SGL estimates show a substantial joint support attributable to the proposed sparse group penalty. Furthermore, this region is reasonably smooth, amenable to clinical interpretation, and stable across the (nearly identical) training folds (shown in Figure 6), unlike the IS and JS estimates. This conclusion regarding the stability of SGL estimates is supported in Table 3 by the higher Dice scores (both full and joint) between estimates obtained from different training folds. Numerically, all estimated models retain a comparable and moderate association between predicted and true age (shown by validation set MAE and R 2 in Table 3).

Summary and Key Findings
Our simulation studies and analysis of a real-world neuroimaging dataset consistently supported our proposed SGL approach over both IS and JS models. In the simulation studies, SGL estimates of the full support gave notably higher (estimate-truth) Dice scores with lower variation across simulations. This result was strengthened when considering joint support detection and can be attributed to our inclusion of a sparse group lasso penalty. Figures 3, 4 illustrate the high false-positive rate and instability of IS and JS estimates and the difficulty they have in obtaining the joint support, especially for weaker signals. SGL estimates do not suffer in these ways and still recover the joint support even if the signal is weaker. A second set of simulation studies over independent datasets with the same ground truth similarly confirmed that our proposed method provides more stable estimates.
The neuroimaging analysis highlighted a major disadvantage of typical sparse methods (e.g., IS and JS), namely, their tendency to select a small number of voxels in non-contiguous regions, shown clearly in Figure 5. Furthermore, neither IS nor JS estimates suggested the presence of a joint support region, inconsistent with previous neuroimaging studies (Peters, 2002; Langkammer et al., 2012). These estimates again showed high instability, even with minimal variation in the training dataset during 23-fold cross-validation (with total sample size n = 113). In contrast, our SGL method estimated a substantial joint support that was reasonably smooth, compact, and stable across training folds.
In both the simulation and neuroimaging analyses, all methods demonstrate a difference between the training and testing (or validation) set MAE, as expected. We attribute variation in this difference between the IS/JS and SGL models to the effect of overfitting when using the former methods, which is emphasized in the simulation studies (with about 82-90% of truly non-predictive voxels, much greater than the 32-50% estimated by the SGL model in the neuroimaging analysis). We suspect this variation is lessened in the neuroimaging analysis due to the naturally more complex data for which our simulation is a simple imitation.

Related Work and Impact
Existing approaches to multimodal neuroimaging data analysis commonly use techniques requiring a large sample size (Fan et al., 2007), analyze modes separately (Cherubini et al., 2009), aggregate voxel-wise measures to the structural level, or combine modalities in only an initial feature selection stage (Elkady et al., 2017). These approaches are either impractical or come at the cost of statistical power.
In this article, we proposed a unified, multimodal framework for joint region detection associated with a numerical variable of interest. Our work addresses the above limitations in existing neuroimaging analytic approaches as well as disadvantages of methods that employ sparse regularization. We are primarily interested in settings where overlap in the effect supports between modes is encouraged (rather than enforced) due to prior knowledge or a need for estimate interpretability. Our proposed method combines lasso, TV, and sparse group lasso penalties to this end. Our primary motivating example concerns the association of R2* and QS MRI maps with age: in particular, previous works have shown that both modalities need to be considered simultaneously to delineate the complex association between iron and myelin levels (Hametner et al., 2018). An understanding of iron accumulation in healthy aging can give insights into the association between brain iron levels and neurological diseases such as multiple sclerosis (Elkady et al., 2019). In this article, we considered the problem of identifying deep gray matter regions where variation in R2* and QS measures is jointly associated with aging in healthy controls. Although the main focus of this study was the detection of specific iron changes in deep gray matter, our work may be extended to delineate iron and myelin changes in myelin-rich white matter tissue.
The neuroimaging literature demonstrates a need for methods that can simultaneously accommodate multiple modalities and yield interpretable results in joint region detection problems, such as in our central R2*-QS-age example. Methods employing sparse regularization suffer due to spatial correlation in imaging data and tend to select a non-compact set of disjoint voxels. Image-based (e.g., TV) penalties help address this, but do not consider relationships between modalities. In practice, this may be problematic when it is expected that relevant brain regions are similar across modalities.
Our introduction of a sparse group lasso penalty is novel in this neuroimaging context and addresses the above gap in the current literature. We have shown that our SGL model is capable of obtaining estimates that are compact, interpretable, and stable, both within and between imaging modalities. Our work here is a first step in more general approaches to the multimodal analysis of age-or disease-related alterations in brain structure and function.

Limitations and Future Developments
In this article, we only considered a continuous response due to its wide applicability and our primary motivation to study age-related variation in DGM iron levels. However, our approach is readily applied to other tasks by modification of the objective function in (3). For example, Zhang et al. (2018) considers a discriminative anatomy detection problem based on logistic regression using a single MRI modality. This could be developed to include multiple modalities and a sparse group penalty for joint discriminative anatomy detection. To focus on our approach to multiple imaging modalities in this work, we also did not consider subject-level scalar predictors zi (e.g., demographic or clinical variables). By changing the y − X1β1 − X2β2 portion of the objective function in (3) to y − Zγ − X1β1 − X2β2, where Z has row vectors zi^⊤, scalar covariates can be included with only straightforward changes to the ADMM algorithm. In both cases, our work here is a general framework that supports other standard tools employed in regression analysis.
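The covariate extension described above only changes the smooth model-fidelity term, which can be seen by stacking the designs; a minimal numpy sketch (dimensions and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 5, 2
X1 = rng.standard_normal((n, p))
X2 = rng.standard_normal((n, p))
Z = rng.standard_normal((n, q))   # e.g., demographic or clinical covariates
y = rng.standard_normal(n)

def fidelity(y, Z, X1, X2, gamma, beta1, beta2):
    """Squared-error term ||y - Z gamma - X1 beta1 - X2 beta2||_2^2."""
    r = y - Z @ gamma - X1 @ beta1 - X2 @ beta2
    return float(r @ r)

# Stacking the designs shows the beta-update machinery is unchanged:
# the smooth ADMM sub-problem simply sees a wider design matrix.
W = np.hstack([Z, X1, X2])
theta = rng.standard_normal(q + 2 * p)
r_split = fidelity(y, Z, X1, X2, theta[:q], theta[q:q + p], theta[q + p:])
r_stacked = float(((y - W @ theta) ** 2).sum())
```

The penalties would then be applied only to the β portion of the stacked coefficient vector, leaving γ unpenalized.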

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://github.com/ pietrosa/qsr2-sim.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Health Research Ethics Board, University of Alberta. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
MP performed the analyses presented in this article, designed the simulation study, and was responsible for the text, tables, and figures in this work. LZ authored the optimization code used for model-fitting, provided some initial analytic neuroimaging results (not presented), and authored a previous version of this manuscript. PS was heavily involved in subject recruitment, acquired all neuroimaging data used in this study, and provided continued support with neuroimaging data handling and all pre-processing tasks. AE contributed to writing the introduction of the manuscript. AHW designed the image acquisition protocol, funded data collection, and led initial project discussions. LK provided methodological development, modified the general ADMM algorithm for this problem, and provided senior project supervision. DC provided methodological support and senior supervision, motivated this work, and contributed to previous versions of the manuscript. All authors contributed to manuscript editing.