Predicting Brain Age at Slice Level: Convolutional Neural Networks and Consequences for Interpretability

Problem: Chronological aging in later life is associated with brain degeneration processes and increased risk for disease such as stroke and dementia. With a worldwide tendency of aging populations and increased longevity, mental health, and psychiatric research have paid increasing attention to understanding brain-related changes of aging. Recent findings suggest there is a brain age gap (a difference between chronological age and brain age predicted by brain imaging indices); the magnitude of the gap may indicate early onset of brain aging processes and disease. Artificial intelligence has allowed for a narrowing of the gap in chronological and predicted brain age. However, the factors that drive model predictions of brain age are still unknown, and there is not much about these factors that can be gleaned from the black-box nature of machine learning models. The goal of the present study was to test a brain age regression approach that is more amenable to interpretation by researchers and clinicians. Methods: Using convolutional neural networks we trained multiple regressor models to predict brain age based on single slices of magnetic resonance imaging, which included gray matter- or white matter-segmented inputs. We evaluated the trained models in all brain image slices to generate a final prediction of brain age. Unlike whole-brain approaches to classification, the slice-level predictions allows for the identification of which brain slices and associated regions have the largest difference between chronological and neuroimaging-derived brain age. We also evaluated how model predictions were influenced by slice index and plane, participant age and sex, and MRI data collection site. Results: The results show, first, that the specific slice used for prediction affects prediction error (i.e., difference between chronological age and neuroimaging-derived brain age); second, the MRI site-stratified separation of training and test sets removed site effects and also minimized sex effects; third, the choice of MRI slice plane influences the overall error of the model. Conclusion: Compared to whole brain-based predictive models of neuroimaging-derived brain age, slice-based approach improves the interpretability and therefore the reliability of the prediction of brain age using MRI data.


INTRODUCTION
Brain age prediction involves estimating chronological age based on information typically gleaned from neuroimaging data. The prediction may be referred to as the biological or neuroanatomical age of the brain. Although brain age can be computed from other approaches, such as the epigenetic clock from brain tissue (1), in this paper we use brain age as a synonym for neuroimaging-derived brain age. The difference between the predicted age and the actual chronological age is called brain age gap, which has been associated with a number of lifestyle factors (2) [e.g., tobacco and alcohol consumption (3), obesity (4), diabetes, schooling, physical activity (5), higher mortality risk (6), lower fluid intelligence, psychiatric disorders (7), and neurological diseases (8)].
Recent advances in machine learning, specifically on deep convolutional neural networks, have gradually improved brain age prediction by lowering prediction error (9). However, brain age prediction methods still receive criticism due to the lack of interpretability (10). The criticism stems from the limited information about what the model uses to predict brain age, and which regions might bias findings. Hidden biases and poor generalization are a recurrent theme in machine learning and deep learning research (11), including its medical imaging applications (12). Thus to fulfill the promise of translational research, AI needs to establish reliable and reproducible prediction methods, and to generate models that are more amenable to clinical interpretation (10). Identification of clinical neural markers and association with clinical and behavioral data may render AI applications more meaningful (10,13,14).
In this article, we report on a model developed for the PAC-2019 brain age prediction competition. Our goal was to generate competitive predictions using meaningful neuroanatomical information. We developed a deep learning framework whose predictions draw on features from every single slice of brain imaging combined with average or linear regression models. The resulting model associates each slice with an independent age prediction for the same patient, allowing researchers to scrutinize the areas of the brain responsible for the overall brain-age gap. Our hypothesis was that our approach would help understand the behavior of brain age prediction at each part of the brain. We also believe that this method, alongside other approaches that try to move away from single predictions of brain age (16), may help us get a comprehensive picture of the parts and characteristics of the aging brain that inform prediction. Such picture should allow for identification of diverse, slice-level, and eventually voxel-level, neuroanatomical traits of age-related diseases.

BACKGROUND
The known patterns of brain development associated with aging, such as a decline in gray matter volume (17), are readily identifiable by magnetic resonance imaging (MRI). These images are extensively used for diagnostic and research of disorders associated with brain tissue loss, such as Alzheimer's disease, Parkinsonian dementias, and Frontotemporal lobe degeneration (18). More recently, machine learning techniques have been used to draw on the rich MR images to predict the brain age of healthy people (19), and the aging processes of neurodegenerative disorders (20). The mismatch in chronological and brain age has been investigated in schizophrenia (8), bipolar disorder (21), and in association with factors associated with mortality risk, physical and mental fitness, and biological health (6).
Recent advances in deep learning models, specifically Convolutional Neural Networks (CNNs) achieve state-of-theart performance in computer vision tasks (22), while requiring little to no prior hand-engineering of data. CNN architectures using 3D convolutions have been used to predict brain age with segmented GM and white matter (WM), and raw T1-weighted MRI scans (10,23). The use of 3D convolutions allows the model to take in whole-volume information for convolutional filtering operations, which, given enough data, learn feature detection and extraction. CNN models provide highly accurate predictions for regression and classification tasks on multiple medical imaging datasets (22,24). CNN models have remarkable predictive power, but the results are typically difficult to interpret. Whereas manual feature selection in classic machine learning simplifies interpretation of the model's results, CNNs require further processing steps to interpret the model's decision processes due to the use of less processed data (15). Examples of interpretationseeking mechanisms include saliency maps (25) and activation mappings (26), which aim to identify the regions in an image that are responsible for assisting model predictions, thus allowing for some visualization of key input features. These maps trace network outputs back to the input image voxels through the computation of their partial derivatives. For example, regression activation mapping applied to age prediction models on newborn structural MRI generated brain maps of rapid growth during early development (27). Saliency maps have three key limitations. First, they depend on human validation, a time consuming task that also entails the potential for confirmation bias. a Second, these methods can produce results that are independent of model and data, and thus inadequate for model debugging and inspection (28). Finally, additional techniques must be used for combining individual subject saliency maps of into populationlevel visualizations (29).

METHOD
We tested several models and found that the ones with a RESNET18 architecture had a good trade-off between size and prediction error (30). In order to use it in our context with the dimensions of our input, we modified it in three simple ways: (1) the input size had one or two channels, depending on the experiment, (2) the kernel size from the average pool was changed from 7 to 4, (3) the final fully connected layer was changed from 512 to 1,024. The code to build this architecture as well as the steps to reproduce all experiments are available on GitHub (see data availability statement). The input was one brain slice with segmented GM in the first channel and white matter in the second channel. The segmentation was provided by PHOTON-AI 1 and we made no adjustments to it apart from scaling. The output of each model was a brain age estimation for a single segmented slice (GM, WM, or both) from the structural MRI. We illustrate our framework in Figure 1. The framework relies on three different, simultaneously-run models that are combined by three linear regression models and a final average. Each model is trained independently to predict brain age from a single MRI slice and draws on different MRI slice orientation (coronal, sagittal, or axial) as input. Each of these models used the validation set to generate error estimates for each slice and each volume. We then used the error estimates to determine the importance of each slice for the model. All three views were 1 https://photon-ai.com/ combined to locate specific regions in the brain responsible for a particular classification based on the contribution of each slice for the brain age prediction, which provides an additional source of interpretation.
The final age prediction for a single individual was calculated as follows: where x ∈ a, s, c represents the axial, sagittal, and coronal views, a x is the age vector for each slice for the subject, M x is one of the CNN models, L x is the linear regression model, e x is the error of model M x in the validation set, and S x is the total number of slices for an orientation x. Our rationale was that each model's contribution was inversely proportional to the error in the validation set with a weighted average. The influence of each slice for the final prediction is weighted by the linear regression model. This slice-level rationale can also be applied to understand the independent contributions of gray and white matter to the brain age estimate. While using gray or white matter alone causes the model to lose predictive power, it improves interpretability by estimating the independent gray and white matter contributions to the age prediction. All segmented MRI scans for the dataset provided by the competition were shaped (121, 145, 121). To reduce the amount of empty space on the corners of the input, we removed 20% of the image corners, resulting in a (72, 88, 72) image. The input for the model is thus shaped (batch, c, 72, 88) where c = 1 when using either gray or white matter alone or c = 2 when using both types of brain tissue. For the coronal view, we zeropad the image so that the (72, 72) slice also becomes (72, 88) to keep the consistency across all models. The participants from the PHOTON-AI dataset included healthy individuals from a wide age range, males and females, and from 17 different centers. We included basic demographic information about the sample in Table 1.
We trained the CNN models using an Adam Optimizer set with the learning rate at 6e − 4 and weight decay of 6e − 4. The training also used a sigmoid learning rate rampup for 20 epochs followed by a cosine rampdown until a total of 100 epochs. The batch size was set to 64. We conducted a data augmentation with Elastic Transform (31) with an α range between [28,30], σ with a range of [3.5, 4.0], and p = 0.3, Some examples of the augmentation procedure can be seen in Figure 2. All data augmentation procedures were implemented in the medicaltorch framework 2 .

RESULTS
For the competition, we achieved a mean absolute error of 4.44 years on the test set, with a Spearman correlation of −0.25 between the age estimates and chronological age. The model that won the competition achieved 2.90. Due to time restrictions, we employed axial slice predictions only, combining gray and white matter. In what follows, we present the results for the combined gray and white matter models for each separate orientation, the combined predictions; we also present how predictions improve interpretability and decrease model errors. The results of this article are based on predictions made on the competition's validation set. We did not have access to the test set's ground truth at any point during our experiments, prior to nor after the competition. Instead of using the validation set to train the linear regression, we applied the regression to the training set, in order to avoid circular analysis. However, we believe this can limit the accuracy that otherwise would be achievable with the linear regression model. We leave the comparison of using the validation dataset and reusing the training set to train the linear regression model for future work with more data available, as we restricted ourselves to use data exclusively from the challenge for this study.

Combined Gray and White Matter
Our approach used slices for both gray and white matter in individual channels. The use of the two tissues simultaneously may have sacrificed obtaining more fine-grained information about brain aging from the each independent tissue, but it was done in favor of feeding additional data to the model. Gray and white matter were concatenated on the first channel, resulting in an input of (2, 72, 88). We then trained and evaluated the model, and then assessed the effects of age, sex, and site on its predictions. In the sections, we explain the key findings from our experimental analysis.

Models for Different Views of the Brain Have Different Errors
We trained three independent models, taking the input from either axial, coronal, or sagittal views (Note: for the competition, in the interest of time, we used only the axial orientation information). After training, we estimated the error for each slice. The estimate is used to gauge how much each slice contributes to the prediction of brain age. We identified a pattern, most present in axial and coronal models, but to some extent also visible in the sagittal. With more distal slices, the average prediction error seems to be higher. This could arguably be attributed to several reasons: (1) differences of brain matter across regions of the brain, (2) more age-related changes in some regions than others, and (3) the tendency of noise from the scanner to be concentrated on the extremities of the image (and this tendency is fairly visible in Figure 3).
The three models afforded different final prediction errors ( Table 2, which we will further discuss in section 4.1.2. The models also resulted in different estimates for the contribution of slices for predicting brain age. Figure 3 shows the error variation for slices for the axial, coronal and sagittal slice models. We hypothesize that the differences in mean error may stem from: intrinsic and extrinsic factors that contributed to a  poorer segmentation (and therefore an input of lesser quality); the randomness of the modeling process; and from sample heterogeneity for each of the regions with respect to age.

Final Predictions
Using Equations (1) and (2), we generated the predictions for the final dataset. After pre-training each of three models (one for each view), we evaluated every slice in the dataset and generated a dataset of predictions with a row for each individual and a column for model predictions of every slice. We trained the linear regression for the generated dataset using scikit-learn 3 library. Table 2 shows the results for the three models. The sagittal slice prediction showed the lowest error, and it outperformed the outputs combined using Equation (2). We argue that this discrepancy in the error is due to the lack of another validation set for proper out-of-sample error estimation. Additionally, the 3 https://scikit-learn.org/ differences present in a model trained with this dataset may not accurately translate to other data sources. To ensure that the sagittal slice prediction is actually superior to the axial and coronal slice predictions, the model needs to be further validated and trained using yet another dataset.

Age Effects
Brain age prediction methods usually perform better around the mean chronological age of the dataset. Research shows that models tend to overestimate brain age for participants younger than the mean, and underestimate it for participants older than the mean (32,33).
The behavior of our model's prediction error changed with respect to age, as shown in Figure 4. Not surprisingly, we found the same pattern of over and underestimation of age reported in the literature. One major implication of this type of error is the possibility of inserting biases when trained models are applied to external test sets that have a different age distribution than the training set. Thus, brain age prediction models should

Site Effects
The dataset included contributions from 17 different sites. To investigate any biases associated with the different sites, we extracted the age and predicted age for each of our models and each of the 17 sites. We ran a two-tailed dependent ttest to compare age and predicted age among sites and found statistically significant differences for just 6 out of the 17 centers. The data are presented in Table 3 and Figure 5. In four out of the six significantly different centers, all views had significant differences simultaneously, thus suggesting that the models tend to operate in similar manner, and that an actual superiority of the sagittal slice model needs to be further investigated.

Sex Effects
Previous papers suggest that sex can play a role in biasing brain age prediction models. For that reason, we assessed how sex influenced our predictions. We executed a two-tailed paired ttest for age and model predictions for both males and females independently and found no significant differences (p < 0.03).
We ran a two-tailed unpaired t-test for males and females to see whether predictions or age was significantly different between groups, also with negative findings (p < 0.03). Table 4 summarizes our tests. we applied a two-tailed dependent t-test to compare group means, the test showed no significant differences.

Voxel-WiseLevel Brain Age Predictions
We investigated whether a voxel-level model could benefit from the three slice predictions. For each axial slice, information was gathered from each intersecting coronal and sagittal slice, and the combined with the axial prediction to generate an average for each voxel. We show an example of its use in Figure 6. We believe there is a multitude of applications and benefits that come with this sort of approach, such as: (1) the exploration of distinct brain  aging patterns in different regions of the brain; these patterns do not need to be bounded by anatomically or functionally defined regions, (2) the investigation of region-specific brain aging in neuropsychiatric disorders that present a brain age gap, and (3) the identification of voxel-or region-specific prediction biases (e.g., regions that consistently present higher error). But this approach led to unstable predictions, possibly because the model was not being trained using voxel-level information. There were high frequency changes in predictions for neighboring slices; it is expected that neighboring slices be the opposite. This problem has been documented in deep learning patch-based prediction methods (35). A possible simple but sub-optimal FIGURE 6 | Lightbox view of axial slices age predictions following the voxel-level approach. The value for σ indicates the amount of gaussian spatial smoothing applied to the predictions. Images on a range from 20 (red) to 60 (yellow). solution is to use gaussian spatial smoothing to remove the highfrequency changes, as we demonstrate in Figure 6. We contend that future approaches may aim to develop means to mitigate the artifacts that come "stitching" slice-level predictions into voxel-level ones. Such a successful approach could improve the current model.

Independent Gray and White Matter
To estimate the contribution of gray and white matter tissue to the prediction, we created another model (for axial slices) to test age predictions using gray matter and white matter separately. Based on the results in Figure 7, the regions that are best predictors of brain age are similar between gray and white matter. The model using solely white matter presented higher errors in most slices. Previous studies have shown that gray matter is a better predictor of brain age than white matter (9).

DISCUSSION AND CONCLUSION
Identifying the most predictive regions of the brain for brain age models may help interpreting results, but may also introduce model biases that are unrelated to a neurological condition. Our study presents a model that attempted to balance accuracy and interpretability of the results. Our model provides a level of scrutability for the decision-making process, and can thus help researchers and clinicians understand its limitations. Given the number of perspectives on interpretability for machine learning (36, 37), we must clarify exactly what we mean by interpretability. Although the interpretability or explainability are commonly used to refer to strategies that explain model predictions, we use it more broadly to define that the final predicted brain age can be attributed to specific slices or regions of the brain. In case additional explanation is needed for a single slice, the usual strategies, carrying the limitations we discussed in section 2, should be applied. We believe that for research purposes, knowing the influence of each part of the brain in the final prediction is imperative to guide and interpret findings of research using the brain age gap.
As a whole, neuropsychiatric research on the brain age gap has mostly focused on associating the difference between the chronological and predicted age to clinical populations. Moreover, most of this research was conducted on the assumption that the brain age gap actually encompasses a larger brain-wide phenomenon responsible for accelerated brain aging.
The body of work on the prediction of brain age at specific regions is growing, two contributions of which are of note in comparison to our contribution. First, a preprint paper (37) uses a similar approach to ours, but instead of using slices, they rely on 3d patches that are later combined with averages or linear regression. Second, a recently published approach (38) uses slice-level predictions, but instead of combining it with linear regression, uses a recurrent neural network for the job.
Attention-based, especially Transformers, models, originally purposed for text-based modeling (39), have recently shown to offer substantial improvement on classification accuracy for image-based problems (40). However, at the time we developed our approach, attention models were more popular for textbased than image-based tasks. More importantly, we were not aware of any work with neuroimaging data that had shown substantial improvements using those approaches. Indeed, future work should focus on comparing the explanations we provide in terms of slides to the attention maps generated by attentionbased models.
Our experiments corroborate some findings from the field. First, we showed that age effects are significant and should need to be accounted for in predictive models of brain age. Second, our results suggest that proper training and test splits that keep site data proportional may mitigate site effects. Third, gray matter seems to be more predictive of age than white matter. Interestingly, our model had similar performance in both male and female sex, although sex is not explicitly used by the model and no separate models were trained.
Since the competition dataset was preprocessed by segmenting gray and white matter, future work should look at less processed data to try to replicate these results. Differences between the unprocessed and segmented inputs might help us understand the extent that possible segmentation errors may influence the behavior of models of brain age.

LIMITATIONS
The convolutional neural network fails to combine information from different regions of the brain due to its 2D nature. Although aggregated at later stages of our framework, 3D CNNs might be able to capture patterns that our proposed method misses.
We also found that the aggregated information from each slice prediction to form a voxel-level age prediction to be noisy enough to be unusable. Adjacent voxels had usually the same error but occasionally, while changing the slice index, the prediction had a drastic change. This behavior can probably be attributable to several issues, such as lack of regularization for more stable predictions or possible unidentified problems with the segmentation maps.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
PB, LT, MM, and NE designed the study. PB implemented the framework and ran experiments. FM and AB supervised the implementation and engineering of the work. AB and BF helped interpreting the findings and provided neuroimaging-related insights. All authors contributed to writing the manuscript.