Learning from pseudo-labels: deep networks improve consistency in longitudinal brain volume estimation

Introduction Brain atrophy is a critical biomarker of disease progression and treatment response in neurodegenerative diseases such as multiple sclerosis (MS). Confounding factors such as inconsistent imaging acquisitions hamper the accurate measurement of brain atrophy in the clinic. This study aims to develop and validate a robust deep learning model to overcome these challenges; and to evaluate its impact on the measurement of disease progression. Methods Voxel-wise pseudo-atrophy labels were generated using SIENA, a widely adopted tool for the measurement of brain atrophy in MS. Deformation maps were produced for 195 pairs of longitudinal 3D T1 scans from patients with MS. A 3D U-Net, namely DeepBVC, was specifically developed overcome common variances in resolution, signal-to-noise ratio and contrast ratio between baseline and follow up scans. The performance of DeepBVC was compared against SIENA using McLaren test-retest dataset and 233 in-house MS subjects with MRI from multiple time points. Clinical evaluation included disability assessment with the Expanded Disability Status Scale (EDSS) and traditional imaging metrics such as lesion burden. Results For 3 subjects in test-retest experiments, the median percent brain volume change (PBVC) for DeepBVC and SIENA was 0.105 vs. 0.198% (subject 1), 0.061 vs. 0.084% (subject 2), 0.104 vs. 0.408% (subject 3). For testing consistency across multiple time points in individual MS subjects, the mean (± standard deviation) PBVC difference of DeepBVC and SIENA were 0.028% (± 0.145%) and 0.031% (±0.154%), respectively. The linear correlation with baseline T2 lesion volume were r = −0.288 (p < 0.05) and r = −0.249 (p < 0.05) for DeepBVC and SIENA, respectively. There was no significant correlation of disability progression with PBVC as estimated by either method (p = 0.86, p = 0.84). Discussion DeepBVC is a deep learning powered brain volume change estimation method for assessing brain atrophy used T1-weighted images. Compared to SIENA, DeepBVC demonstrates superior performance in reproducibility and in the context of common clinical scan variances such as imaging contrast, voxel resolution, random bias field, and signal-to-noise ratio. Enhanced measurement robustness, automation, and processing speed of DeepBVC indicate its potential for utilisation in both research and clinical environments for monitoring disease progression and, potentially, evaluating treatment effectiveness.

Introduction: Brain atrophy is a critical biomarker of disease progression and treatment response in neurodegenerative diseases such as multiple sclerosis (MS). Confounding factors such as inconsistent imaging acquisitions hamper the accurate measurement of brain atrophy in the clinic. This study aims to develop and validate a robust deep learning model to overcome these challenges; and to evaluate its impact on the measurement of disease progression.
Methods: Voxel-wise pseudo-atrophy labels were generated using SIENA, a widely adopted tool for the measurement of brain atrophy in MS. Deformation maps were produced for pairs of longitudinal D T scans from patients with MS. A D U-Net, namely DeepBVC, was specifically developed overcome common variances in resolution, signal-to-noise ratio and contrast ratio between baseline and follow up scans. The performance of DeepBVC was compared against SIENA using McLaren test-retest dataset and in-house MS subjects with MRI from multiple time points. Clinical evaluation included disability assessment with the Expanded Disability Status Scale (EDSS) and traditional imaging metrics such as lesion burden.
Results: For subjects in test-retest experiments, the median percent brain volume change (PBVC) for DeepBVC and SIENA was .
Discussion: DeepBVC is a deep learning powered brain volume change estimation method for assessing brain atrophy used T -weighted images. Compared to SIENA, DeepBVC demonstrates superior performance in reproducibility and in the context of common clinical scan variances such as imaging contrast, voxel resolution, random bias field, and signal-to-noise ratio. Enhanced measurement robustness, automation, and processing speed of DeepBVC indicate its potential for utilisation in both research and clinical environments for monitoring disease progression and, potentially, evaluating treatment e ectiveness. KEYWORDS brain volume change, longitudinal analysis, multiple sclerosis, atrophy measurement, deep learning, MRI

. Introduction
Brain atrophy is a clinically relevant biomarker of disease progression in patients with multiple sclerosis (MS) that reflects irreversible tissue damage due to neuro-axonal destruction, demyelination and gliosis (Bermel and Bakshi, 2006). Accelerated brain tissue loss can be detected in MS cohorts compared to a healthy control population (De Stefano et al., 2016). Clinically, brain atrophy is a key predictor of future disease worsening and cognitive impairment in patients with MS (Popescu et al., 2013;De Stefano et al., 2014;Jacobsen et al., 2014); and has been used frequently in MS clinical trials as a secondary measure of treatment efficacy (Filippi et al., 2004;Cadavid et al., 2017).
The incorporation of brain atrophy into monitoring paradigms for individual patients requires significantly improved accuracy, precision and robustness. Over the past two decades, numerous methods (Hajnal et al., 1995;Rudick et al., 1999;Collins et al., 2001;Bermel et al., 2003;Friston, 2003;Horsfield et al., 2003;Prados et al., 2015;Smeets et al., 2016) have been developed for quantifying longitudinal brain volume change (BVC). Boundary Shift Integral (BSI) (Freeborough and Fox, 1997), gBSI (Prados et al., 2015), NeuroSTREAM (Dwyer et al., 2017), Jacobian integration (Nakamura et al., 2014), and SIENA (Smith et al., 2002) use different measurements to track the movement of the brain boundary. IPCA (Chen et al., 2004) uses an iterative principal component analysis method to find the outliers reflecting between-scan differences. MSmetrix (Smeets et al., 2016) adopts nonrigid registration and Jacobian integration of deformation fields to produce atrophy measures. Similarly, Quantitative Anatomical Regional Change (QUARC) (Holland et al., 2011) utilises non-rigid registration, but directly calculates hexahedral volumes to yield the fractional volume change. FreeSurfer-longitudinal (FS) (Reuter et al., 2012) is a segmentation-based method that performs tissue segmentation at each time point. Among those methods, SIENA (Smith et al., 2002) is arguably the most widely used algorithm in MS clinical trials.
Although continuous efforts have been made to develop new methods that improve the accuracy and reliability of BVC quantification, longitudinal MRI measurement is susceptible to inconsistency of imaging acquisition at baseline and followup (Lee et al., 2019;Medawar et al., 2021), particularly in routine clinical practise where scanner upgrades and protocol variations are normal. Common examples of inconsistencies that impact the reliability of quantitative BVC measurement, regardless of methodology, include image contrast differences (Preboske et al., 2006), intensity non-uniformity (Takao et al., 2010), noise, or different resolutions and voxel spacing (Vrenken et al., 2013). Additionally, there are few comparative studies that assess the performance of newer (versus older) measurement methods, such as SIENA and Boundary Shift Integral (Freeborough and Fox, 1997), respectively, in the presence of such acquisition inconsistencies.
However, several approaches have been proposed to ameliorate the influence of acquisition inconsistencies and thereby improve the accuracy of BVC measurement algorithms.
The first approach aims to reduce scan inhomogeneity during pre-processing (Smith et al., 2002;Learned-miller and Ahammad, 2004;Lewis and Fox, 2004;Vovk et al., 2004;Vemuri et al., 2005;Duffy et al., 2018;Higaki et al., 2019). By removing the bias field from the longitudinal input scans, these methods aim to reduce the variance in BVC estimates. Several other data harmonisation methods (Dewey et al., 2019(Dewey et al., , 2020Beer et al., 2020;Garcia-Dias et al., 2020;Liu et al., 2021) aim to improve the qualitative and quantitative consistency of differently acquired MRI scans. In practise, these methods assign one scan as a reference and process images in the second scan to narrow the differences attributable to protocol inconsistency. Although many of these methods aim to improve the quantitative utility of MRI in long-term or multisite studies, most are not specifically designed for consistent BVC measurement. Rather, they focus on harmonisation of images to a reference image or providing segmentation masks with greater consistency (as determined by Dice, Coefficient of Joint Variation or similar) with those derived from a reference image in a test-retest setting. Therefore, the effectiveness of these methods for producing consistent BVC measurements is not directly measured.
The second approach focuses on correcting BVC estimates during post-processing. Lee et al. (2019) estimates the fixed effect of scanner changes with a linear model and accounts for this factor during measurement of BVC. Sinnecker et al. (2022) estimates an additive fixed corrective term for scanner change by comparing BVC rates during scanner change and no scanner change for healthy control subjects. These methods are subject to scannerspecific variation and require group level test results to estimate the correction factor.
The addition of both pre-processing and post-processing steps to BVC measurement algorithms increase overall computational complexity and processing time. Additionally, while these methods may result in qualitative and quantitative improvements, most are confined to the research domain and their integration with (and effectiveness in) clinical workflows is not clear.
To improve the reliability of measurements while also considering computational cost, we introduce DeepBVC, a deep learning-based approach that addresses the impact of protocol inconsistency in longitudinal brain volume measurements. Extensive experimentation demonstrates its effectiveness in clinical settings, providing superior performance for potential large-scale clinical applications. Our DeepBVC combines a deep neural network with data augmentation to provide fully automated and robust BVC assessment. A deep neural network offers generalisation to common brain atrophy patterns; and comprehensive data augmentation (Shorten and Khoshgoftaar, 2019) provides robustness and mitigates protocol and other acquisition-related inconsistencies. Specifically, the deep neural network module is used to estimate shift at the brain boundary from baseline to follow-up; and the data augmentation module synthesises images that contain common image distortions and acquisition differences during the training stage of the deep neural network. We also propose a novel training regime for DeepBVC. In general, supervised neural network training requires a large-scale dataset with accurate labels. However, it is not practical to acquire accurate sub-voxel level atrophy estimations from images. Existing BVC measurement algorithms contain known or unknown bias and random variation factors for different scans (Thanellas and Pollari, 2010). Therefore, we used the brain .
/fnins. . boundary shift produced by SIENA as the pseudo-label in training, noting precedents for the use of software-generated annotations as the label for tasks such as brain parcellation (Henschel et al., 2020). We then utilised the inherent regularisation properties of convolutional neural networks to tackle the underlying label noise (Goodfellow et al., 2014;Zhang H. et al., 2017;Zhang S. et al., 2017).
In summary, we have developed DeepBVC, a deep learning based method, for longitudinal BVC assessment with the following contributions: • Data augmentation is introduced to cope with inconsistent imaging acquisitions that are largely unavoidable in clinical settings. • Our deep learning model is trained with pseudo-labels from SIENA and the impact of label noise is ameliorated by a regularisation method. • Experiments in two datasets demonstrate that DeepBVC has better accuracy, robustness, reliability, and consistency compared with SIENA, even though the intermediate outputs from SIENA were used for training.

. Materials and methods
This section is organised as follows: We describe the data used in this study in Section 2.1 and the data preprocessing in Section 2.2. In Section 2.3, we illustrate the details of our method, including the model structure, training details, and the integration our model into the pipeline of BVC estimation. Next, we introduce the settings and metrics of evaluation experiments in Section 2.4.

. . Data acquisition
We use two data sources for this study: an in-house dataset (MS Clinical Dataset) comprising clinical data and matched MRI scans from patients with relapsing remitting multiple sclerosis (RRMS) subjects; and a public test-retest dataset (Maclaren testretest dataset) (Maclaren et al., 2014) comprising scans from three healthy control subjects. The study was approved by the University of Sydney and followed the tenets of the Declaration of Helsinki.

. . . MS Clinical Dataset
Written informed consent was obtained from all participants. In total, 2,457 T1-weighted MRI exams from 648 patients diagnosed with RRMS were included in this study (Table 1). All patients were recruited from the MS Clinic based at the Brain and Mind Centre, University of Sydney; and clinical MRI exams were acquired with one of three 3T MRI scanners (Table 2) between 2010 and 2020. Longitudinal scans for each patient were acquired with the same scanner using a consistent protocol. MRI acquisition parameters are summarised in Table 2.

. . . Test-retest data
We use the Maclaren test-retest dataset (Maclaren et al., 2014) to test measurement reliability. The dataset comprises 120 T1-weighted scans from three healthy subjects aged 26-31 years, acquired with a GE MR750 3T scanner. Each subject was scanned twice on 20 different days within a 31-day period. The acquisition protocol (TE: 3 ms, TI: 0.4 s, TR: 7.3 ms, 1.2 mm slice thickness) followed the recommendations of the Alzheimer's Disease Neuroimaging Initiative (ADNI) (Jack et al., 2008). . . Data pre-processing . . . Format conversion and data selection For the MS Clinical Dataset, images were originally stored in DICOM format. All acquired DICOM data were converted to NIFTI format by dcm2nii (Li et al., 2016). The quality of all MRI data was visually assessed by experienced neuroimaging analysts at the Sydney Neuroimaging Analysis Centre (Sydney, Australia). Images that failed quality assessment (incomplete brain coverage, severe imaging artifacts) were excluded from further study. N4 Bias Field Correction (Lowekamp et al., 2013) was applied to remove the bias field from all scans meeting quality criteria.

. . . Brain and skull segmentation
Brain and skull segmentation was performed with BET (Jenkinson et al., 2005) for all eligible scans and the masks were manually refined by experienced neuroimaging analysts. Skull stripping was undertaken to generate images only containing brain tissues for subsequent analysis.
. . . SIENA analysis . . . . Co-registration For the longitudinal scan pairs of each subject, T1-weighted scans at baseline and follow-up were aligned using two-step affine registration (Freeborough and Fox, 1997;Smith et al., 2002). First, the skull masks were used to optimise the scale and skew; then, the brain images were used to optimise image translation and rotation. All registration results were manually checked by trained neuroimaging analysts, and poorly aligned pairs excluded from further analysis.
. . . . Brain edge point segmentation After brain alignment, FAST (Zhang et al., 2001) was used to segment longitudinal scans into the principal brain tissue compartments: white matter (WM), grey matter (GM), and cerebrospinal fluid (CSF). As whole brain volume change includes changes for both WM and GM, the probabilistic maps from FAST were binarised (using a threshold of 0.5) and defined the union of WM and GM as the foreground segmentation, and the remainder of the image as background. Consequently, brain edge points were defined by the edges of the foreground mask.

. . . . Change analysis
Voxel-wise atrophy/growth was estimated for all brain edge points from baseline to follow-up time points; and the mean edge point motion converted into PBVC (a single number) with a selfcalibration step.

. . Model . . . Model structure
We used a 3D-Unet (Çiçek et al., 2016) as the backbone network for feature extraction, followed by a single convolutional layer as the prediction head. The main blocks of the networks use residual convolutional layers (He et al., 2016) with group normalisation (Wu and He, 2018). The model inputs comprise a pair of baseline and follow-up T1 images. The label is a voxelwise estimation of brain boundary shift produced from SIENA. Because SIENA estimates the brain boundary shift for each voxel location in the input, the label and the model output are 3-D tensors with the same dimensions as the inputs. During both training and inference, the model output is masked with the brain boundary segmentation and only the outputs at the boundary locations are retained (outputs are set to 0 for non-boundary locations).

. . . Training details
An overview of our method is shown in Figure 1. For each iteration, eight scan pairs were randomly selected from all training scan pairs. Each pair was randomly cropped into a pair of patches of size 64 × 64 × 64 due to GPU memory constraints. The patches were then fed into the model and the loss calculated accordingly to update the model weights.
We used mean squared error (MSE) as the loss function L MSE to evaluate the model's deformation prediction when compared with the pseudo-labels from SIENA (edge point motion). Furthermore, to render the model insensitive to differences in imaging quality during training, we adopted a consistency regularisation loss. Specifically, the loss minimised the difference between the predictions derived from the original images and the augmented images, such that brain volume differences were maintained in the context of an isolated change in imaging acquisition conditions.
Formally, for a data point x, the regularisation loss term was defined as: where Augment(x) is a stochastic transformation, whose effect is not identical for each training sample, but rather follows a distribution. As the regularisation term requires the underlying brain volume change to remain constant, spatial transformations that deformed the original brains were not permitted. Therefore, we included both random spacing anisotropy re-sampling and random contrast alterations as possible augmentation steps. For clarity, only one type of augmentation was randomly selected for each sample. Finally, the overall loss L was defined as: where we set λ = 0.01. The final loss function was optimised using the Adam optimiser (Kingma and Ba, 2014) with an initial learning rate of 0.001 that was reduced by a factor of 0.5 when the loss did not drop for three consecutive epochs. The model was trained for 2,500 iterations per epoch and for 50 epochs in total. Model optimisation was performed with two NVIDIA V100 GPU cards on an NVIDIA DGX-1 station.

. . . Training data
To train the model, we collected 195 pairs of scans (one pair per subject) from the MS Clinical Dataset (Section 2.1.1, Figure 2), from which 70% (137 pairs) were randomly selected for training and the remaining 30% (58 pairs) for validation. Two additional, independent imaging datasets were used for testing as described in Table 1. Testing datasets did not overlap with training or validation datasets; and all 195 subjects involved in model development were excluded from evaluation experiments.

. . Experimental setup
Five experiments were undertaken to evaluate the performance of DeepBVC with respect to test-retest consistency, .

FIGURE
An overview of our method. Given the original inputs (a pair of baseline and follow-up scans, and the mask of baseline brain edge points), new inputs are generated by augmentation (including re-sampling and random contrast) without spatial transformation. As a result, the newly generated inputs share the same labels as the original inputs. Both the generated and original inputs are fed into the same network. Then, the output of the original inputs is compared to the label to obtain the loss L MSE . Similarly, the output of the original inputs is compared to that of the generated inputs to yield the regularisation term L reg . Finally, the total loss is defined as in Equation .
multi-step longitudinal consistency, protocol-agnostic testretest consistency, relevance to T2 lesion and correlation with disability.

. . . Consistency with test-retest data
The Maclaren test-retest data used for this experiment is described in Section 2.1.2. We grouped the data into baseline follow-up pairs as follows: the two scans from the same day and subject were used as a longitudinal pair with no atrophy. We used 60 pairs in total (20 pairs per subject).
For this experiment, we assumed that there would be no atrophy because the pairs were scanned back-to-back (i.e. the brain volume difference should be 0%). We ran SIENA and DeepBVC methods to measure the brain volume difference for each pair. We then grouped the results by subject and report each method's mean and standard deviation. We also report the mean absolute error for the results of each subject, where the error was acquired by comparing the BVC measurements against the 0% BVC.

. . . Influence of the protocol inconsistency
To test the influence of various acquisition protocol inconsistencies, we used the Maclaren test-retest data (Section 2.1.2), coupled with imaging processing techniques to synthesise new image pairs with protocol pseudo-inconsistencies, as described below and shown in Figure 3. We then compared the experimental results with those derived from the original test-retest data. Experiments followed the design described in Section 2.4.1, with the addition of a pre-processing step to include synthesised images as follows.
. . . . Random contrast adjustments Image intensities were adjusted by the parameter γ . Each voxel intensity x was updated as where x min is the minimal voxel value in the original brain image, and x range is the difference between the maximal and minimal voxel value in the scan, excluding the background.

. . . . Random bias field
The bias field was generated from a linear combination of smoothly varying basis functions, as described in Van Leemput et al. (1999). In practise, we observed that intensity inhomogeneity was more likely to occur along one of three axes (sagittal, coronal, and axial). Therefore, we synthesised the random bias field along one of the (randomly selected) axes.
. . . . Random spatial anisotropy Spacing anisotropy was introduced by down sampling an image along an axis and then re-sampling back to its original spacing. In our experiment, we simultaneously performed this random transformation along all three axes (sagittal, coronal, and axial).

FIGURE
Baseline data for the MS Clinical Dataset. Non-overlapping constraints were applied between the subjects included in training and testing, but overlapping was permitted for data involved in the three testing experiments.

. . . . Gaussian noise
The noise for each voxel was sampled from a normal distribution with random parameters, and added to the original image.
To systematically explore the impact of protocol inconsistency, we analysed the performance of both DeepBVC and SIENA for different levels of the four types of inconsistency, controlled by different parameters during synthesis. For protocol inconsistency in Gaussian noise, spatial resolution anisotropy and bias field, the value of the parameters is positively correlated to the degree of inconsistency. For the inconsistency in image contrast, the measurement follows a different relationship (Equation 3), namely image contrast is controlled by γ . The larger the difference between γ and 1, the higher the inconsistency between synthesised and original images.

. . . Multi-step consistency with three time points
Inspired by the experiments of Smith et al. (2002), we included data from three time points in our analysis, enabling both single-step and a multi-step measurement strategies. For the single step approach, we estimated brain atrophy between the first (t0) and last (t2) time points, while for the multi-step approach, we combined the estimated intermediate atrophy between the t0 and the second timepoint (t1), and t1 and t2. A smaller difference between the single-step and multi-step approaches indicates a more consistent measurement.
For these experiments, we use data selected from the MS Clinical Dataset (Section 2.1.1), restricting inclusion to those subjects with three available brain scans with an interval of at least . /fnins. .

FIGURE
Examples of synthesised scans for the test-retest dataset. All scans are shown in sagittal, axial, and coronal planes. The upper-left images are slices from the original scan; and the following two columns are synthesised from the original scan. Each row (from top to bottom) represents a di erent acquisition artifact: contrast, bias field, spacing anisotropy, and Gaussian noise. The images on the left are less distorted than those shown on the right.
1 year between successive time points. In total, 233 subjects (three scans each subject) were used for this experiment.

. . . Brain T lesion volume and brain volume change
To further investigate the predictions of our method and their clinical impact, we analysed the correlation between brain volume change and baseline lesion volume as the rate of brain volume loss has been found to correlate with baseline T2 lesion volume (Tedeschi et al., 2005). We assumed that improved accuracy of BVC measurement would enhance this correlation. For this experiment, scans were selected from the MS Clinical Dataset (Section 2.1.1), based on availability of T1-w and FLAIR sequences at baseline and T1-w images at follow up (3-4 years subsequent to the baseline time point). The baseline lesion volume was derived from an in-house deep learning lesion segmentation algorithm  followed by manual review (and, if required, correction) of lesion masks by trained neuroimaging analysts at the Sydney Neuroimaging Analysis Centre. We report linear correlation and partial correlation (controlled for age and disease duration) for DeepBVC and SIENA. SPSS (Field, 2013) was used for this analysis.

. . . Disability and brain volume change
Sustained progression of the expanded disability status scale (EDSS) score has reported to correlate with higher rates of brain atrophy (Rudick et al., 2000;Bermel and Bakshi, 2006). To investigate the clinical relevance of the BVC, we therefore compared the annualised BVC with EDSS progression over 1-3 years in subjects with both T1-w scans and clinical data available at . /fnins. .

FIGURE
Box plots for SIENA and DeepBVC on test-retest data. Results are grouped and shown for each subject. The median brain volume di erence is reported above each corresponding box plot.
two time points with an interval of at least 12 months. Subjects were firstly grouped into EDSS progressors and non-progressors, where EDSS progression was determined by three strata as previously described by Kalincik et al. (2015) and confirmed over 3 months. BVC was then determined by both DeepBVC and SIENA and reported for each group. We also analysed BVC for matched subjects from each group, using 1-to-multiple propensity score matching based on age and disease duration.

. Results
For each validation experiment involving the MS Clinical data, we included all subjects that met the relevant inclusion criteria (Sections 2.3, 2.4 and Figure 2). The demographic and clinical characteristics of the subjects eligible for training and each validation are listed in Table 1. Among 2,457 scans from 648 subjects, 134 scans from 94 subjects were first excluded from all experiments due to poor imaging quality. For multi-step consistency experiments (Section 2.4.3), 233 eligible subjects (77% female) with three available scan timepoints were used. At the time of baseline imaging, mean age and disease duration was 41.5 and 10.4 years, respectively; and mean EDSS was 1.6, in this group. For the lesion experiment, 120 pairs of scans were available. In this group, 81% of the patients were female, with a mean age and disease duration of 40.3 and 8.7 years, respectively; and a mean EDSS of 1.9. The remaining subjects (195 subjects/scan pairs, 74% female) were used for training. In this group, mean age and disease duration was 41.6 and 9.4 years, respectively; and mean EDSS was 2.1. The numbers are reported as average (± standard deviation) across all the sessions for each subject.

. . Consistency with test-retest data
We illustrate the performance of SIENA and DeepBVC in Figure 4 and Table 3. The subject-wise means and standard deviations of PBVC measured by DeepBVC were smaller than by SIENA (Table 3). For all three subjects, the BVC measured by DeepBVC was less dispersed and was closer to 0 ( Figure 4). The median PBVC for DeepBVC smaller than the equivalent plot for SIENA for all subjects (0.105 vs. 0.198, 0.061 vs. 0.084, 0.104 vs. 0.408, respectively). One outlier with a large BVC was found for DeepBVC (subject 1) and another for SIENA (subject 2).

. . Influence of the protocol inconsistency
The expected PBVC for both methods was 0 for pairs of identical scans.
For contrast inconsistencies (Figure 5), the box plot for DeepBVC showed a much lower distribution of errors than SIENA; and, unlike SIENA, no significant increase as gamma Frontiers in Neuroscience frontiersin.org . /fnins. .

FIGURE
Brain volume di erence estimation for stratified protocol inconsistency. For random contrast, the imaging quality is unchanged when the parameter is . The larger the di erence between the parameter and , the lower the image quality and more inconsistent the protocol. The x-values are parameters determining the image processing step and the output image quality. For random Gaussian noise, bias field and anisotropy, parameters are negatively correlated to the quality of the output image (the higher the value, the lower the image quality, the more inconsistent the protocol).
The box plots at the left-most of each figure represent the imaging quality is unchanged.
Frontiers in Neuroscience frontiersin.org . /fnins. . changed (up to ±0.5). Furthermore, the median brain volume difference of DeepBVC was lower than SIENA by one order of magnitude. Finally, the variance in the error was similar for both methods when λ was 0.75, 1.25, and 1.5. However, when λ = 0.5, the errors for DeepBVC showed a wider distribution than SIENA.
For bias field inconsistencies, the medians and interquartile ranges for DeepBVC were higher than SIENA, though brain volume difference was low for both techniques (range 0.00-0.29 and 0.00-0.17, respectively).
For spatial resolution anisotropy, the differences measured by SIENA became greater as the inconsistency level increased. As shown in Figure 5, the median difference gradually increased from 0.89% (γ = 2) to 2.8% (γ = 5). For DeepBVC, the median difference remained low at the different levels of inconsistency tested. For the highest level of inconsistency, the median difference measured by SIENA reached 2.8%, while DeepBVC remained as low as 0.29%. The error distribution of DeepBVC was less scattered and closer to 0 than the equivalent of SIENA for all levels of inconsistency tested.
For Gaussian noise, the error distribution of DeepBVC was less dispersed and closer to 0 than the equivalent of SIENA when the Gaussian noise level was 0.2, 0.3 and 0.4. The error distribution of DeepBVC was wider than SIENA and the median was 0.3% for both methods when Gaussian noise level was 0.1.
We report the difference in single-step and two-step BVC measurements for the two methods in Table 4 and Figure 6.
When comparing the direct and indirect measurement for each method using Bland-Altman plots, the points are scattered randomly above and below 0 for both SIENA and DeepBVC ( Figure 6). The 1.96 SD and −1.96 SD for SIENA is 0.27 and −0.38, respectively (approximating the values reported by Smith et al., 2002), whereas the 1.96 SD and −1.96 SD for DeepBVC is 0.26 and −0.31, respectively. DeepBVC was less likely than SIENA to generate an output that suggested brain volume growth (biologically less likely) over time; and there was no obvious bias between two methods. For points with large atrophy (brain loss > 2%), most points are between SD 1.96 = 0.53 and −SD 1.96 = −0.45 with only one point outside this range. The di erence between direct measurements (t → t ) and two-step measurement (t → t and t → t ).
. . Brain T lesion volume and brain volume change We report the correlation between the annualised brain atrophy rate and the total lesion volume at baseline. As shown in Figure 7, our method had a slightly stronger linear correlation with baseline lesion volume (r s = −0.288, p < 0.05) than PBVC-SIENA (r s = −0.249, p < 0.05). A similar trend was observed on partial correlation controlled for age and disease duration (Table 5), with r s = −0.373(p < 0.05) for the deep learning model, and r s = −0.339(p < 0.05) for SIENA.

. . Disability and brain volume change
The annualised BVC distribution is shown in Figure 8 for subjects with and without sustained EDSS progression. For both DeepBVC and SIENA, the average annualised BVC is slightly larger for the group with sustained EDSS progression (for DeepBVC p = 0.86, for SIENA p = 0.84). The annualised BVC for matched subjects with and without sustained EDSS progression is shown in Figure 9. For both DeepBVC and SIENA, the average annualised BVC for the subjects with progression was larger (for DeepBVC p = 0.31, for SIENA p = 0.35). For both experiments, there was no significant correlation of disability progression with BVC as estimated by either method.

. Discussion
Deep learning with pseudo labels has been previously used in the field of neuroimaging. For example, FastSurfer's deep learning algorithms were trained on outputs generated by the conventional neuroimaging pipelines that underpin Freesurfer (Henschel et al., 2020). The reliability, sensitivity, and time efficiency of FastSurfer is proven to be superior to FreeSurfer (Fischl, 2012). While FastSurfer and DeepBVC share the concept of using pseudo-labels for training, there are two significant differences. First, the output of FastSurfer is a segmentation mask and is generated by classification, while the output of our method is produced by regression. Second, FastSurfer focuses on designing a novel deep learning network to improve efficiency, whereas our method uses a simple but effective network and regularisation technique to reduce the impact of noise in the pseudo-label.
The data augmentation and consistency regularisation of our method are only used during training; therefore the efficiency of the model is not affected by those techniques. Data augmentation .
/fnins. . improves the model's generalisability to unseen data (Zhou et al., 2022). In our case, augmentation simulates inconsistent protocols during the acquisition of scan pairs. Therefore, the DeepBVC model is adapted to those types of protocol inconsistency and potentially generalises well to similar data. Though we only included random contrast and spacing anisotropy in training, the model demonstrated improved reliability on test-retest data with noise-related inconsistencies. Consistency regularisation renders the predictions invariant to noise applied to the input, and is widely used in semi-supervised learning and learning from noisy labels (Sajjadi et al., 2016;Clark et al., 2018;Miyato et al., 2018). Specifically, the incorporation of regularisation into our model maintains an identical brain volume change between scan pairs acquired with either a consistent protocol or a (synthetically generated) inconsistent protocol. These two techniques enable the model to learn clean predictions from noisy pseudo-labels; and improve the estimation reliability, especially for longitudinal scan pairs with an inconsistent acquisition protocol.
In general, smaller BVC on the test-retest data indicates better reliability. As in Table 3, for each of the three subjects, our method estimated a smaller test-retest brain volume difference (range 0.066-0.126%) and smaller standard deviations than SIENA (range 0.118-0.351%) across sessions (p = 0.03, p = 0.08, p < 0.001 for subject 1, 2, and 3, respectively). The reliability of DeepBVC also appears to be less sensitive to subject-related factors, based on the results of subjects 1 and 3 from the test-retest dataset. For example, the mean BVC estimated by SIENA for subject 3 was greater than three times that of subject 1, whereas estimation by our method was less than two times. Based on these observations, the reliability of DeepBVC is superior to SIENA for the test-retest data with a consistent protocol. We simulated a limited array of protocol inconsistencies commonly observed in clinical practise in the test-retest dataset. Specifically, we investigated the impact of contrast variance, which may be introduced by changes in head coil or sequence parameters such as TE (Constable et al., 1992); bias field, which may relate to spatial variance in coil sensitivity and the interaction between the scanner and the subject (Kim et al., 2011); and image resolution, which is determined by scanner/sequence settings. DeepBVC demonstrated superior or at least equivalent performance when compared to SIENA in all scenarios other than inconsistency in the context of bias field. Among the four types of inconsistency tested, contrast and spacing anisotropy variance had the greatest impact on SIENA measurements, followed by Gaussian noise and bias field. DeepBVC showed significantly better performance in the context of contrast and spacing anisotropy pseudo-inconsistencies, despite the lack of spacing anisotropy variation and far less extreme contrast variation in training data augmentation, indicating generalisabilty of the method to unseen scenarios. Surprisingly, Gaussian noise and random bias field did not significantly impact the measurement for SIENA. DeepBVC demonstrates better reliability against most levels of Gaussian noise. However, potentially reflecting the fact that SIENA .

FIGURE
The correlation between annualised brain atrophy rate (over -years) and baseline lesion volume.
relies on edge-enhanced image profiles, which are robust to local intensity differences between images. For DeepBVC, input images are pre-processed with voxel intensity normalisation and bias field alters the distribution of the input voxel intensities after the normalisation step. Although SIENA estimates are more robust to random bias field fluctuations, the largest median BVC for . /fnins. . DeepBVC was only 0.29%, a fraction of the magnitude of error introduced into SIENA estimates by other types of inconsistency.
In general, DeepBVC is therefore more reliable and less sensitive to protocol inconsistency. Using patient data from three time points, a smaller difference between one-step and multi-step measurements indicates better consistency. The application of both SIENA and DeepBVC in this experimental paradigm yielded a small difference (≈0.03%) between one-step (t 0 → t 2 ) and two-step (t 0 → t 1 and t 1 → t 2 ) measurements, indicating high consistency for both methods, marginally in favour of DeepBVC. The difference between one-step and multi-step measurements can also reveal systematic errors; a smaller difference therefore indicates that less effort is required for calibration for studies that involve multiple (≥3) data timepoints. For both SIENA and DeepBVC, the differences between the two measurements randomly scattered above and below 0 in the Bland Altman comparison (Figure 6), suggesting that there were no significant accumulative or systematic measurement errors. For both SIENA and DeepBVC, a high agreement between the direct and indirect measurements was observed (+1.96 SD = 0.27 and −1.96 SD = −0.38 for SIENA; +1.96 SD = 0.26 and −1.96 SD = −0.31 for DeepBVC), especially for the subjects with large mean atrophy.
Correlation experiments (Figure 7) illustrated that DeepBVC estimates of BVC had a marginally stronger linear correlation with baseline brain lesion volume (DeepBVC: r = 0.288; SIENA: r = 0.249), an advantage that was retained when confounding variables (age and disease duration) were controlled (DeepBVC: r = 0.373; SIENA: r = 0.339). These findings suggest fewer subjects may be required to power group-level studies that use our tool to estimate BVC as an endpoint.
EDSS experiments demonstrate that a higher annualised BVC was observed amongst EDSS progressors compared with nonprogressors group, but the difference was not significant for either DeepBVC (p = 0.86) or SIENA (p = 0.84). Similarly, analysis of propensity score matched subjects showed higher, but not significant, annualised BVC amongst EDSS progressors for both methods (DeepBVC: p = 0.31; SIENA: p = 0.35) The magnitude of group level differences in annualised BVC between EDSS progressors and non-progressors wereless than in previous reports (Rudick et al., 2000;Bermel and Bakshi, 2006). Patient populations in these earlier studies differed from the modern MS cohort, in which the majority of patients are treated with high efficacy disease modifying agents (that are known to reduce BVC), studied in the present work. Additionally, these studies employed different measures to determine brain volume (change), such as brain parenchymal fraction and normalised whole brain grey-matter volume.

. . Study limitations and future directions
While our experiments suggest that DeepBVC more consistently and reliably estimates BVC than the classical tool, SIENA, in several scenarios, there are a number of limitations.
First, model training requires pseudo-labels from SIENA. While the use of pseudo-labels generates improved performance, the overall framework and concept follow the principles of the classical method, namely the requirements for co-registration of baseline and follow-up scans, segmentation of the brain to find edge points, and a calibration step for the final volume change estimate. As a consequence, our experimental results may be confounded by errors propagated from each pre-processing step. For example, the registration step potentially changes the scale and skew of the brain image, which can in turn affect the final BVC estimation. Additionally, lesion inpainting changes the image context and affects brain edge point segmentation, impacting the edge locations at which voxel-wise atrophy/growth is subsequently estimated. Future pipeline optimisation, in which each step is integrated as a component of the learning process, may mitigate this cascading effect and enhance the performance of deep learning based solutions for BVC.
Second, we simulated four protocol inconsistencies and independently tested their impact on BVC estimates using DeepBVC. In real-world clinical imaging acquisitions, protocol inconsistencies are more complex, overlapping and generated by different and, at times, multiple sources. Decomposing these inconsistencies into isolated categories is challenging. Similarly, the availability of inconsistently acquired scans, acquired back-to-back on the same subject, would be required to complete the test-retest study in the absence of simulated data.
Third, it is challenging to validate the actual measurement accuracy of tools such as SIENA or DeepBVC, because the ground truth BVC is unknown, other than for test-retest subjects in whom brain volume should essentially remain static when scans are acquired back to back, thereby avoiding changes in hydration or diurnal factors that could impact brain volume. Although DeepBVC did not inappropriately detect BVC in the test-retest cohort, this does not necessarily demonstrate capacity to accurately determine atrophy estimates in cases with true brain tissue volume differences. In this regard, BVC estimates from SIENA are noisy: they can be used to produce pseudo-labels for the training phase, but they cannot be used as ground truth during the validation phase, particularly without rigorous manual quality control checks. We therefore resorted to validating DeepBVC using proxy techniques (Section 2.4) in the absence of 'clean' labels for testing model performance. While these methods were comprehensive and approximated real world imaging scenarios, multiple timeconsuming steps in the experimental pipeline hindered rapid model development.
Lastly, the limitations of this study also include the lack of a "true" multi-centre dataset. While the study integrated data from three different scanners, as depicted in Table 2, all images were obtained from a single MS clinic. This approach, to a certain extent, normalised the scan quality while prescribing the MRI exams. Future investigations may find value in validating DeepBVC using a variety of scanners from different clinical centers.

FIGURE
The violin plots of annualised BVC for two groups: sustained EDSS progression and no sustained EDSS progression.

FIGURE
The violin plots of annualised BVC for matched subjects from two groups: sustained EDSS progression and no sustained EDSS progression. The matching is one-to-multiple propensity score matching on age and disease duration.

. Conclusions
We demonstrate that a deep learning model trained for whole BVC estimation with pseudo-labels derived from SIENA can achieve better performance in terms of consistency, invariance to protocol change, and correlation between BVC and baseline lesion volume in a cohort of subjects with MS. Brain atrophy is a common endpoint in MS clinical trials that will become more relevant as neuroprotective and pro-reparative therapies are developed. Similarly, there is a need for robust monitoring of brain atrophy in neurodegenerative disease, the imperative for which has been heightened by the recent advent of disease modifying therapies in this patient population. DeepBVC is a fast and robust method for estimating brain atrophy that may have particular application in both clinical trials and precision medicine.

Data availability statement
The in-house datasets presented in this article are availability upon request in written and relevant institutional approval.

Ethics statement
The studies involving human participants were reviewed and approved by the University of Sydney. The patients/participants provided their written informed consent to participate in this study.

Author contributions
GZ was responsible for processing data, conducting experiments, and drafting the paper. DW was responsible for processing data, conducting experiments, and revising the paper. MC was responsible for discussing the idea and implementation and revising the paper. LB and WO were responsible for discussing the idea and experimental design. KK was responsible for preparing and processing the data. MB and CW were responsible for discussing the idea and revising the paper. All authors contributed to the article and approved the submitted version.

Conflict of interest
GZ, DW, KK, and CW are a part-time employee at the Sydney Neuroimaging Analysis Centre. MB has received institutional support for research, speaking and/or participation in advisory boards for Biogen, Merck, Novartis, Roche, and Sanofi Genzyme, and is a research consultant to RxPx and research director for the Sydney Neuroimaging Analysis Centre.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.