Longitudinal Prediction of Infant MR Images With Multi-Contrast Perceptual Adversarial Learning

The infant brain undergoes a remarkable period of neural development that is crucial for the development of cognitive and behavioral capacities (Hasegawa et al., 2018). Longitudinal magnetic resonance imaging (MRI) is able to characterize the developmental trajectories and is critical in neuroimaging studies of early brain development. However, missing data at different time points is an unavoidable occurrence in longitudinal studies owing to participant attrition and scan failure. Compared to dropping incomplete data, data imputation is considered a better solution to address such missing data in order to preserve all available samples. In this paper, we adapt generative adversarial networks (GAN) to a new application: longitudinal image prediction of structural MRI in the first year of life. In contrast to existing medical image-to-image translation applications of GANs, where inputs and outputs share very similar anatomical structure, our task is more challenging because brain size, shape, and tissue contrast vary significantly between the input data and the predicted data. Several improvements over existing GAN approaches are proposed to address these challenges in our task. To enhance the realism, crispness, and accuracy of the predicted images, we incorporate both a traditional voxel-wise reconstruction loss and a perceptual loss term into the adversarial learning scheme. Because tissue contrast changes differently in T1w and T2w MR images over the first year of life, we incorporate multi-contrast images, leading to our proposed 3D multi-contrast perceptual adversarial network (MPGAN). Extensive evaluations are performed to assess the quality and fidelity of the predicted images, including qualitative and quantitative assessments of the image appearance, as well as quantitative assessment on two segmentation tasks. Our experimental results show that our MPGAN is an effective solution for longitudinal MR image data imputation in the infant brain.
We further apply our predicted/imputed images to two practical tasks, a regression task and a classification task, in order to highlight the enhanced task-related performance following image imputation. The results show that the model performance in both tasks is improved by including the additional imputed data, demonstrating the usability of the predicted images generated from our approach.


INTRODUCTION
The early postnatal period (neonate to one year of age) is a period of dynamic and rapid brain development with dramatic appearance changes in magnetic resonance images (MRI). This period has been associated with early atypical developmental trajectories in neurodevelopmental disorders, such as autism spectrum disorder (ASD) and schizophrenia (Hazlett et al., 2017; Gilmore et al., 2018). Longitudinal MRI allows the quantification of developmental trajectories over time and plays a critical role in neuroimaging studies of early brain development (Gilmore et al., 2012). However, missing data points are a common issue in longitudinal studies due to MRI scan failure, scheduling issues, or general participant attrition (Laird, 1988). Discarding those study participants with incomplete data significantly reduces the sample size and may even lead to unacceptable levels of bias (Matta et al., 2018). One solution is to interpolate/extrapolate the missing data from the data that is available, a process called data imputation. Such data imputation can be performed either at the image level or at the measurement level.
Low-rank matrix completion is a commonly proposed approach for measurement-level imputation; for example, Thung et al. (2016) employed it to impute missing volumetric features. A series of machine learning based approaches have also been proposed in this field; for instance, Meng et al. (2017) proposed Dynamically-Assembled Regression Forests (DARF) to predict cortical thickness maps at missing time points. Rekik et al. (2016) developed a 4D varifold-based learning framework to predict the cortical shape at later time points in the first year of life using the cortical surface shape at birth. Additionally, many variants of geodesic models (Fishbaugh et al., 2013, 2014; Fletcher, 2013; Singh et al., 2013a) were proposed for longitudinal shape imputation and regression. Compared with measurement-level methods, image-level methods directly predict the image appearance at a missing time point. In Niethammer et al. (2011) and Singh et al. (2013b), geodesic models were used for longitudinal regression of image appearance. Rekik et al. (2015) proposed a sparse patch-based metamorphosis learning framework for regression of MRI appearance and anatomical structures, with promising yet limited results.
In this paper, we focus on image-level approaches for infant longitudinal MRI prediction and treat it as an image synthesis problem, i.e., synthesizing/predicting a missing MR image from an existing image of the same subject at a later or earlier time point. Recently, generative adversarial networks (GANs) have shown great potential in generating visually realistic images, both for natural image synthesis, e.g., image-to-image translation (Liu et al., 2017; Yi et al., 2017; Zhu et al., 2017; Huang et al., 2018; Xiong et al., 2019; Emami et al., 2021), generating new plausible samples (Goodfellow et al., 2014; Zhang et al., 2019), and generating photographs of human faces (Karras et al., 2017), and for medical image synthesis, e.g., cross-modality synthesis (MR-to-CT: Nie et al., 2017; Wolterink et al., 2017; Jin et al., 2019; MR-to-PET: Pan et al., 2018; PET-to-MR: Choi and Lee, 2018; CT-to-PET: Ben-Cohen et al., 2017; Bi et al., 2017; Armanious et al., 2020; 3T-to-7T: Qu et al., 2019), cross-site synthesis (Zhao et al., 2019), and multi-contrast MRI synthesis (Dar et al., 2019; Yang et al., 2020). Recently, GANs have also been applied to longitudinal MR image prediction. For example, Xia et al. (2019) proposed a conditional GAN, conditioned on age and health state (status of Alzheimer's disease), to predict brain aging trajectories. In Bowles et al. (2018) and Ravi et al. (2019), a GAN is used to predict Alzheimer's-related brain degeneration from existing MR images, where biological constraints associated with disease progression are integrated into the framework. These longitudinal prediction approaches are limited to 2D T1w MRI and are hard to generalize to 3D; they were also designed for adult brain images related to Alzheimer's disease. We also note that GAN architectures with a perceptual loss have been used in a few medical image applications. For example, Armanious et al. (2020) applied a GAN with perceptual loss to PET-CT translation, MR motion correction, and PET denoising. Dar et al. (2019) used a GAN with a VGGNet-based perceptual loss for a multi-contrast MRI synthesis task (mapping between T1w and T2w MRI). Due to the nature of their tasks, these works focus only on single-modality 2D data. They also utilized VGGNet for the perceptual loss computation; since VGGNet is a model pretrained on 2D natural images, it may not be appropriate for medical image tasks.
In this work, we propose a novel GAN adaptation for a new application: the longitudinal prediction of infant MR images in the first year of life. Since human brain size and shape change rapidly in the first year of life, 2D methods are not suitable for our task. Thus, we present a fully 3D approach for the prediction of infant MR images. In addition, because of the myelination process, the infant brain shows a dramatic change in tissue contrast and anatomical structure, which further complicates prediction. While generative adversarial networks can produce images with realistic textures by enforcing the outputs of the generator to be close to the real data distribution, they cannot ensure consistency between the outputs and the desired ground-truth images, so the appearance of a predicted image may differ from the ground-truth image. To handle the large variation in appearance, we add a voxel-wise reconstruction constraint, i.e., an L1 loss, to explicitly guide the generator to produce images that match ground-truth images at the voxel level. Although global structures can be well-preserved by harnessing the L1 loss, it often results in an over-smoothed output (Pathak et al., 2016). Hence, to alleviate this issue, we also enhance our GAN with a perceptual loss term to maintain appearance consistency at the feature level. We propose to utilize Model Genesis (Zhou et al., 2019), a model pre-trained on 3D medical images, for this feature extraction. Finally, in order to tackle the reduced tissue contrast during the first year of life, particularly at about 6 months of age, we propose a multi-contrast framework, so that the complementary information of different contrasts (T1w and T2w images) can be exploited. The source code of our method will be released to the public upon acceptance of this manuscript at https://github.com/liying-peng/MPGAN.
Our main contributions are summarized as follows:
• To the best of our knowledge, this is the first application of deep generative methods for longitudinal prediction of structural MRI in the first year of life.
• Unlike previous 2D-based methods, our method performs 3D MRI prediction, where the volumetric spatial information is fully considered.
• To predict sharp, realistic, and accurate images, we adopt a GAN with adversarial, voxel-wise reconstruction, and perceptual losses. The perceptual loss is computed via features extracted from an application-specific model that has been pre-trained on 3D medical images.
• To leverage complementary information from multi-contrast data, we propose a novel multi-contrast framework to jointly predict T1w and T2w images.
• Extensive experiments demonstrate the effectiveness of our approach for longitudinal MRI prediction and imputation in the developing infant brain. We show that using these imputed MR images to expand the training data in two practical machine learning tasks improves model performance.
The remainder of this paper is organized as follows. In section 2, we introduce the experimental datasets and describe the methodological details of our proposed approach. Experimental results are presented in section 3 and discussed in section 4. Conclusions are presented in section 5.

METHODS
In this section, we first introduce a brief background of generative adversarial networks. Subsequently, we formulate our problem and then define our objective functions. Finally, our network architectures are discussed.

Generative Adversarial Network
The Generative Adversarial Network (GAN) is a generative deep learning model that was proposed by Goodfellow et al. (2014). For image-to-image translation tasks, its aim is to learn a mapping from an input image x to a target image y, i.e., x → y. It consists of two separate neural networks, a generator G and a discriminator D. In the training stage, these two networks compete with each other: a) G attempts to fool D by generating a fake image G(x) that looks similar to a real target image y, and b) D aims to distinguish between the real image y and the fake image G(x). As the two networks face off, G generates increasingly realistic images that get closer to the real data distribution and D becomes more skilled at differentiating images. Eventually, the algorithm converges to a Nash equilibrium (Nash, 1950). This two-player minimax game is formulated as min_G max_D L_adv(G, D), where the adversarial loss L_adv can be defined as

L_adv(G, D) = E_y[log D(y)] + E_x[log(1 − D(G(x)))].  (1)
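As an illustration of this objective, the value of L_adv can be evaluated directly from discriminator outputs on a batch of real and generated images. The following NumPy sketch is ours (the function name and the eps-smoothing are not part of the paper's released code); it simply evaluates Equation (1):

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-8):
    """Evaluate L_adv = E[log D(y)] + E[log(1 - D(G(x)))] over a batch.

    d_real: discriminator outputs D(y) on real images, values in (0, 1)
    d_fake: discriminator outputs D(G(x)) on generated images
    eps:    small constant to avoid log(0) (our numerical convenience)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
```

D maximizes this quantity while G minimizes it; a discriminator that scores real images near 1 and fakes near 0 drives the value toward 0 from below, whereas an undecided discriminator (all outputs 0.5) yields 2·log(0.5).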

Objective Design
We consider two settings in this work: (1) a single-input-single-output setting when using single-contrast images and (2) a multi-input-multi-output setting when using multiple contrasts jointly. In the former setting, suppose {x_i, y_i}_{i=1}^N is a series of paired instances, where x_i is a T1w or a T2w image at age a_1, y_i is the corresponding T1w or T2w image at age a_2, and N is the number of paired subjects in the training set. Our goal is to learn the mapping G : x → y. In the latter setting, assume x_i^{T1} and x_i^{T2} indicate the T1w and T2w images at age a_1, and y_i^{T1} and y_i^{T2} stand for the corresponding T1w and T2w images at age a_2. The aim is then to learn two mapping functions: G^{T1} : {x^{T1}, x^{T2}} → y^{T1} and G^{T2} : {x^{T1}, x^{T2}} → y^{T2}.

Adversarial Loss
In the single-input-single-output setting, in order to learn the mapping G : x → y, we can employ the adversarial loss function of the original GAN (see Equation 1). In the multi-input-multi-output setting, the basic idea is the same as for the original GAN, but here we define two generators, i.e., G^{T1} : {x^{T1}, x^{T2}} → y^{T1} and G^{T2} : {x^{T1}, x^{T2}} → y^{T2}. G^{T1} and G^{T2} aim at generating fake T1w and T2w images that look similar to real images, respectively. We also define two discriminators D^{T1} and D^{T2}, where the intention of D^{T1} is to differentiate the real T1w image y^{T1} from the generated T1w image G^{T1}(x^{T1}, x^{T2}). Similarly, D^{T2} attempts to distinguish between y^{T2} and G^{T2}(x^{T1}, x^{T2}). With respect to generator G^{T1} and its discriminator D^{T1}, the adversarial loss can be formulated as

L_adv(G^{T1}, D^{T1}) = E_{y^{T1}}[log D^{T1}(y^{T1})] + E_{x^{T1}, x^{T2}}[log(1 − D^{T1}(G^{T1}(x^{T1}, x^{T2})))].  (2)

The adversarial loss L_adv(G^{T2}, D^{T2}) can be expressed similarly.

Voxel-Wise Reconstruction Loss
While the adversarial loss can push the generated output images closer to the real data distribution, it cannot ensure consistency between the outputs and the desired ground-truth images, so a predicted image may not share the details of its corresponding ground-truth image. To deal with this problem, we further constrain the generator with a voxel-wise reconstruction loss. Here we choose a traditional L1 loss, as recommended in Zhao et al. (2015), which directly penalizes the voxel-wise differences between the two images. For the single-input-single-output setting, the voxel-wise reconstruction loss is given by

L_vr(G) = E_{x,y}[ ||y − G(x)||_1 ].  (3)

For the multi-input-multi-output setting, the voxel-wise reconstruction loss is expressed as

L_vr(G^{T1}, G^{T2}) = E[ ||y^{T1} − G^{T1}(x^{T1}, x^{T2})||_1 + ||y^{T2} − G^{T2}(x^{T1}, x^{T2})||_1 ].  (4)
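A minimal NumPy sketch of the voxel-wise reconstruction terms follows; these are illustrative stand-ins for the TensorFlow losses, and the function names are ours:

```python
import numpy as np

def l1_loss(y, y_hat):
    """Single-contrast L_vr: mean absolute voxel difference ||y - G(x)||_1
    (normalized by the number of voxels)."""
    return np.mean(np.abs(y - y_hat))

def l1_loss_multi(y_t1, y_t2, g_t1, g_t2):
    """Multi-contrast L_vr: sum of the T1w and T2w L1 terms, where g_t1
    and g_t2 are the outputs of the two generators on the same input pair."""
    return l1_loss(y_t1, g_t1) + l1_loss(y_t2, g_t2)
```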

Perceptual Loss
Although the voxel-wise reconstruction loss enforces voxel-wise consistency between the real and generated images, it prefers an over-smoothed solution (Pathak et al., 2016). In other words, this loss commonly leads to outputs with well-preserved low-frequency information, e.g., global structures, at the expense of high-frequency crispness. To alleviate this problem, we add a perceptual loss (Johnson et al., 2016) to the generator, which results in sharper images. The perceptual loss calculates the difference between two images in feature space instead of voxel space. Thus, it forces the generated images to be perceptually similar to the real images, rather than matching intensities exactly at the voxel level. Suppose φ_m(x) is the output from the m-th layer of a feature extractor φ when processing the image x. For the single-input-single-output setting, the perceptual loss can be written as

L_p(G) = E_{x,y}[ ||φ_m(y) − φ_m(G(x))||_1 ].  (5)

For the multi-input-multi-output setting, the perceptual loss is formulated as

L_p(G^{T1}, G^{T2}) = E[ ||φ_m(y^{T1}) − φ_m(G^{T1}(x^{T1}, x^{T2}))||_1 + ||φ_m(y^{T2}) − φ_m(G^{T2}(x^{T1}, x^{T2}))||_1 ].  (6)

Overall objective: By combining the above loss functions, we can define the final objective in the single-input-single-output setting as

L(G, D) = L_adv(G, D) + α L_vr(G) + β L_p(G).  (7)

Similarly, we can define the total objective in the multi-input-multi-output setting as

L(G^{T1}, G^{T2}, D^{T1}, D^{T2}) = L_adv(G^{T1}, D^{T1}) + L_adv(G^{T2}, D^{T2}) + α L_vr(G^{T1}, G^{T2}) + β L_p(G^{T1}, G^{T2}),  (8)

where α and β are the coefficients that weight the loss contributions.

Perceptual Adversarial Network
Figure 1A illustrates the architecture of the perceptual adversarial network (PGAN) that is designed for the single-input-single-output setting. It consists of a generator G, a discriminator D, and a feature extractor φ. We utilize a traditional 3D-Unet (Çiçek et al., 2016) as the generator. The 3D-Unet is an end-to-end convolutional neural network that was originally developed for medical image segmentation. It includes an analysis path (encoder) and a synthesis path (decoder). The encoder part contains four convolutional layers, each of which includes two repeated 3 × 3 × 3 convolution operations, followed by a 2 × 2 × 2 max pooling for downsampling (except for the last layer). The decoder part is basically the same as the encoder part, but it replaces all downsampling with upsampling. Skip-connections are built between the layers of the encoder and their counterparts in the decoder. Our discriminator D contains four stride-2 convolutional layers, with 64, 128, 256, and 512 channels, respectively. Its output layer is a stride-1 convolutional layer with one channel, followed by a sigmoid activation function. Instance normalization (Ulyanov et al., 2016) is applied to the convolutional layers in both generator and discriminator. For natural images, pretrained VGG networks are often adopted as the feature extractors. However, for medical image tasks, feature extractors based on pretrained VGG-Nets have the following limitations: (1) 3D medical images would have to be reformulated into a 2D format to fit VGG-Nets (2D networks), leading to the loss of rich 3D anatomical information. (2) Perceptual differences between natural projection images (such as photos) and 3D tomographic medical images are not captured. To overcome these limitations, we employ an existing application-specific model as our feature extractor φ, specifically Model Genesis (Zhou et al., 2019), which is built directly from 3D medical images. Note that Model Genesis is a U-Net style network and here we only use its encoder part for feature extraction. The details of PGAN are shown in Table 1.
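To make the perceptual term concrete, the following NumPy sketch computes an L1 distance in feature space. Here `toy_phi` is a hypothetical stand-in for the Model Genesis encoder (simple 3D average pooling), used purely for illustration; the real φ is a trained 3D network:

```python
import numpy as np

def perceptual_loss(feature_extractor, y, y_hat, m=1):
    """Sketch of L_p = E[||phi_m(y) - phi_m(G(x))||_1], where phi_m is the
    m-th layer of a fixed feature extractor. `feature_extractor(x, m)` is
    a stand-in for the Model Genesis encoder used in the paper."""
    f_real = feature_extractor(y, m)
    f_fake = feature_extractor(y_hat, m)
    return np.mean(np.abs(f_real - f_fake))

def toy_phi(x, m=1):
    """Toy feature extractor: m rounds of 2x average pooling on a cubic
    volume. Illustrative only; it is NOT the Model Genesis encoder."""
    for _ in range(m):
        x = x.reshape(x.shape[0] // 2, 2,
                      x.shape[1] // 2, 2,
                      x.shape[2] // 2, 2).mean(axis=(1, 3, 5))
    return x
```

Because the comparison happens on features rather than voxels, small spatial rearrangements that preserve perceptual content are penalized less than they would be by the voxel-wise L1 term.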

Multi-Contrast Perceptual Adversarial Network
As shown in Figure 1C, the multi-contrast perceptual adversarial network (MPGAN) contains two generators G^{T1} and G^{T2}, two discriminators D^{T1} and D^{T2}, and one feature extractor φ. The feature extractor φ and the architectures of D^{T1} and D^{T2} are the same as for PGAN. G^{T1} and G^{T2} are both based on 3D-Unets that utilize a shared encoder and two independent decoders with skip-connections. The shared encoder learns complementary information from both T1w and T2w images, and skip connections are used to transfer this information from the shared encoder to the different decoders. We combine T1w and T2w images before feeding them into the generators by applying a channel-wise concatenation.
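The shared-encoder/two-decoder topology can be sketched as follows. Every operation here is a toy placeholder (the real generators are 3D-Unets with skip-connections); the sketch only shows how the channel-concatenated T1w/T2w input passes through one shared encoding that feeds two contrast-specific heads:

```python
import numpy as np

def shared_encoder(x_t1, x_t2):
    """Toy shared encoder: channel-wise concatenation of the T1w and T2w
    volumes followed by one nonlinearity. In MPGAN this stage is a 3D-Unet
    encoder shared by both generators."""
    return np.tanh(np.concatenate([x_t1, x_t2], axis=0))

def decode_t1(h):
    """Toy T1w decoder head (MPGAN uses a full 3D-Unet decoder with skips)."""
    return 0.5 * h.sum(axis=0, keepdims=True)

def decode_t2(h):
    """Toy T2w decoder head with weights independent of decode_t1."""
    return -0.5 * h.sum(axis=0, keepdims=True)

def mpgan_forward(x_t1, x_t2):
    """One forward pass of G_T1 and G_T2 through the shared encoding."""
    h = shared_encoder(x_t1, x_t2)
    return decode_t1(h), decode_t2(h)
```

The design choice this illustrates is that both predicted contrasts are computed from the same latent features, so information present in only one input contrast can influence both outputs.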

Materials
The data used in this work was collected from the "Infant Brain Imaging Study" (IBIS) database (https://www.ibis-network.org) and the raw MR images are available on NDA (https://nda.nih.gov). All MR images were clinically evaluated by an expert neuroradiologist (RCM) and subjects with visible clinical pathology were excluded from the study. Data collection sites had study protocols approved by their Institutional Review Boards (IRB), and all enrolled subjects had informed consent provided by their parent/guardian. MR imaging parameters are as follows: (1) 3T Siemens Tim Trio at 4 sites; (2) T1w MRI: TR/TE = 2,400/3.16 ms, 256 × 256 × 160 matrix, 1 mm³ resolution; (3) T2w MRI: TR/TE = 3,200/499 ms, same matrix and resolution as T1w. A series of preprocessing steps was applied, i.e., ICBM alignment, bias correction, geometry correction, skull stripping (see Hazlett et al., 2017 for details), and intensity normalization to the range (−1, 1). For our main dataset, which is used for longitudinal prediction, a total of 289 subjects with two complete scans at 6 and 12 months were selected. The dataset was split into three sets: a training set (231 subjects), a validation set (29 subjects), and a test set (29 subjects). We also built two additional datasets to evaluate the applicability of our predicted/imputed images. In the first application, we aim at classifying subject image data into groups based on the Autism Diagnostic Observation Schedule social affect calibrated severity score (ADOS-SA-CSS). Thus, only those subjects with valid ADOS-SA-CSS measures were employed. In addition, we reduced the size of the typically developing group for group size balancing. This resulted in 77 subjects with complete scans at 6 and 12 months and 103 subjects with scans at either 6 or 12 months. In the second application, we estimated a subject's gestational age (GA) at birth from their MRI data.
Only subjects with known GA were selected, resulting in 134 subjects with complete scan pairs at 6 and 12 months, as well as 76 subjects with scans at either 6 or 12 months. In both applications, we employ the imputed datasets as additional training data. No imputed images are used in the testing datasets.

Implementation Details
Our experiments were performed on a Lambda Labs GPU server with four NVIDIA TITAN RTX GPUs, each with 24 GB of on-card memory. All networks were implemented in TensorFlow and trained via Adam optimization (Kingma and Ba, 2014). The batch size was set to 1. The learning rate was initially set to 2e-4 for the first 44 epochs and decayed every 22 epochs with a base of 0.5 for an additional 176 epochs. The trade-off parameters α and β in Equations (7) and (8) were set to 25, and φ_1(x) was used for computation of the perceptual loss, based on a grid search. The details of the grid search are shown in the Supplementary Materials. Two longitudinal prediction tasks were performed in this work, i.e., prediction of 6-month images from 12-month images and prediction of 12-month images from 6-month images.
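For clarity, one reading of this schedule (constant for the first 44 epochs, then halved every 22 epochs over the remaining 176, i.e., 220 epochs total) can be written as follows; whether the first halving occurs exactly at epoch 44 is our assumption, as the text does not fully specify it:

```python
def learning_rate(epoch, base_lr=2e-4, warm_epochs=44, decay_every=22, decay=0.5):
    """One possible reading of the paper's schedule: base_lr for the first
    `warm_epochs` epochs, then multiplied by `decay` every `decay_every`
    epochs (first decay applied at epoch `warm_epochs`)."""
    if epoch < warm_epochs:
        return base_lr
    return base_lr * decay ** ((epoch - warm_epochs) // decay_every + 1)
```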

Alternative Networks for Comparison
In this paper, we also trained five additional networks for comparison: (1) CycleGAN: a 3D extension of the original CycleGAN (Zhu et al., 2017). (2) Unet(L_vr): 3D-Unet (Çiçek et al., 2016) trained with L_vr. (3) Unet(L_vr + L_p): 3D-Unet trained with both L_vr and L_p. (4) GAN: original GAN (Goodfellow et al., 2014). (5) GAN+L_vr: original GAN with an additional L_vr term. To enable fair comparisons, we implemented these networks with parameters optimized the same way as for our proposed methods. Further, the 3D-Unet was used as the backbone of the Unet variants, i.e., (2) and (3), and it was also used as the generator of CycleGAN, the GANs, and our methods. The discriminators for (1), (4), and (5) are the same as for our models.

Evaluation via Appearance Based Metrics
In this section, images predicted by different methods are evaluated both in qualitative and quantitative fashion, focusing on the image appearance. In addition, we conducted a human perceptual study, where the participants were required to rate the predicted images based on visual realism and closeness to the ground truth images.

Qualitative Results
The qualitative results for the different methods are given in Figure 2 (6-to-12 months prediction task) and Figure 3 (12-to-6 months prediction task). The following findings were obtained from both tasks. (1) The images predicted by Unet(L_vr) and GAN+L_vr are globally consistent with the ground-truth images, but they appear overly smoothed, resulting in poor visual quality. (2) Unet(L_vr + L_p) outperforms Unet(L_vr), with the resulting images showing more high-frequency details. This indicates that adding the perceptual loss L_p to the training process helps the model produce sharper details. However, the visual quality of the images generated by Unet(L_vr + L_p) is still unsatisfactory due to an unrealistic textured appearance. (3) GAN produces the least anatomically accurate images, albeit with sharp details. This may be because GAN is trained without any additional constraints to enforce appearance consistency between ground-truth and generated images. (4) Our PGAN and MPGAN show superior performance compared with the other methods. They produce more realistic images with sharp and refined details from a visual perspective. (5) Compared to PGAN, MPGAN predicts finer details, especially for T2w images. This implies that multi-contrast learning can further improve image quality by combining complementary information from T1w and T2w images.

Quantitative Results
The development of optimal evaluation metrics for generated images is a challenging problem. Recently, a learning-based metric, the Learned Perceptual Image Patch Similarity (LPIPS), was proposed (Zhang et al., 2018) to assess the similarity between two images; it has shown superior performance compared to traditional metrics. In this section, all of the methods are quantitatively compared based on LPIPS, as shown in Figure 4. Note that LPIPS is a "similarity distance" calculated between the ground-truth image and the predicted image, and a lower value reflects a higher similarity. One can see that our PGAN and MPGAN give a notable improvement in LPIPS compared to the other approaches, for both the 6-to-12 months and 12-to-6 months prediction tasks. Specifically, MPGAN achieves the best performance. Paired t-tests showed statistically significant improvements (p < 0.05) of MPGAN over all other methods.

Human Perceptual Study
We performed a perceptual study based on 116 sets of images, including 29 sets of 6-month T1w images, 29 sets of 6-month T2w images, 29 sets of 12-month T1w images, and 29 sets of 12-month T2w images. For each image set, the ground-truth image and the predicted images of seven different methods were shown to human raters for visual assessment. We asked 22 human raters (6 radiologists, 5 neuroscientists, 3 biomedical researchers, and 8 computer scientists with a medical imaging background) to rate the image quality of the predicted images using a 7-point score, with 7 being the most realistic and closest to the ground-truth image (ties are allowed). All the images were shown initially in a random order and presented in axial, coronal, and sagittal views. The visualization order was continuously updated by sorting according to the current scores. The results of the perceptual study are shown in Table 2. Of all the studied methods, our MPGAN achieves the highest quality score across the different images, with statistical significance in a Wilcoxon signed-rank test (p < 0.05 vs. other methods). The second-best performance is yielded by PGAN (p < 0.05 vs. other methods). While MPGAN and PGAN are close for T1w image prediction, MPGAN outperforms PGAN by a large margin for predicting T2w images (both 6 and 12 months), demonstrating the benefits of the multi-contrast architecture.

Evaluation on Segmentation Task
In this section, we assess the quality of the predicted images in two segmentation tasks. We conducted subcortical and tissue segmentation on both predicted and ground-truth images at 12 months using an existing multi-atlas segmentation method (Wang et al., 2014). For the tissue segmentation task, the brain was segmented into four types of tissue, i.e., white matter, cortical gray matter, deep gray matter, and cerebrospinal fluid (CSF). For the subcortical segmentation task, 12 subcortical structure labels were computed: left and right hemispheric caudate, putamen, pallidum, thalamus, amygdala, and hippocampus. Examples of tissue and subcortical segmentation results are shown in Figures 5, 6, respectively. Our quantitative evaluation is based on the following three metrics that measure the similarity between two segmentation results (S_1 and S_2): relative absolute volume difference (AVD, in %), average symmetric surface distance (ASD, in mm), and Dice coefficient. The relative absolute volume difference is

AVD = |V_{S_1} − V_{S_2}| / V_{S_2} × 100%,

where V_{S_1} is the volume of S_1 and V_{S_2} is defined similarly. Suppose B_{S_1} and B_{S_2} are the borders of S_1 and S_2, respectively. The average symmetric surface distance (ASD) (Van Ginneken et al., 2007) is the mean of the closest distances from voxels on B_{S_1} to B_{S_2} and from voxels on B_{S_2} to B_{S_1}, respectively. It can be defined as

ASD = ( Σ_{b∈B_{S_1}} d(b, B_{S_2}) + Σ_{b∈B_{S_2}} d(b, B_{S_1}) ) / (|B_{S_1}| + |B_{S_2}|),

where d(b, B) denotes the distance from a border voxel b to the closest voxel on border B. The Dice coefficient evaluates the spatial overlap between two segmentation results, which is defined as

Dice = 2 |S_1 ∩ S_2| / (|S_1| + |S_2|).

In order to obtain an overall evaluation criterion, we also combine the multiple metrics into a single fused score (FS). We follow Van Ginneken et al. (2007) for the fused score, and thus use the TanimotoError as a measure of overlap instead of the Dice coefficient when calculating FS. FS is formulated as

FS = 1/3 ( AVD / refAVD + ASD / refASD + TanimotoError / refTanimotoError ),

where TanimotoError is

TanimotoError = (1 − |S_1 ∩ S_2| / |S_1 ∪ S_2|) × 100%.

As in Van Ginneken et al. (2007), refAVD, refASD, and refTanimotoError are set to 5.6%, 0.27 mm, and 15.8%, respectively, based on the manual segmentation variance among human experts.
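For reference, the volume- and overlap-based metrics above can be sketched in NumPy. This is illustrative only: the surface-distance computation needed for ASD is omitted, and the normalized-average form of the fused score is our reading of Van Ginneken et al. (2007), stated here as an assumption:

```python
import numpy as np

def avd(s1, s2):
    """Relative absolute volume difference (%), with s2 as the reference."""
    v1, v2 = np.count_nonzero(s1), np.count_nonzero(s2)
    return abs(v1 - v2) / v2 * 100.0

def dice(s1, s2):
    """Dice coefficient: 2|S1 ∩ S2| / (|S1| + |S2|)."""
    s1, s2 = s1.astype(bool), s2.astype(bool)
    return 2.0 * np.logical_and(s1, s2).sum() / (s1.sum() + s2.sum())

def tanimoto_error(s1, s2):
    """TanimotoError (%): (1 - |S1 ∩ S2| / |S1 ∪ S2|) * 100."""
    s1, s2 = s1.astype(bool), s2.astype(bool)
    inter = np.logical_and(s1, s2).sum()
    union = np.logical_or(s1, s2).sum()
    return (1.0 - inter / union) * 100.0

def fused_score(avd_v, asd_v, te_v, ref=(5.6, 0.27, 15.8)):
    """Fused score as we read it: average of the three error metrics, each
    normalized by its inter-rater reference value (lower is better)."""
    return (avd_v / ref[0] + asd_v / ref[1] + te_v / ref[2]) / 3.0
```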

Segmentation Consistency Analysis
In this section, we aim at assessing the quality of the predicted images by evaluating how well the automatic segmentations of the predicted images match those of the ground-truth images. The intuition is that if two images are segmented by the same algorithm, the more similar the two images are, the more similar their segmentation results should be. The comparison results on the subcortical segmentation task are presented in Table 3. We observe that, with respect to AVD, our PGAN and MPGAN significantly outperform all the other methods (p < 0.05), and MPGAN achieves the best performance. We can also see that, for ASD and Dice, there are no significant differences among Unet(L_vr + L_p), PGAN, and MPGAN, but our PGAN and MPGAN show superior performance (p < 0.05) compared to the remaining four methods. While the images generated by Unet(L_vr + L_p) are visually of lower quality owing to blurred and unrealistic details (see Figures 2, 3), in this segmentation analysis Unet(L_vr + L_p) performs at an acceptable level for ASD and the Dice coefficient. A possible explanation is that the segmentation algorithm we applied is, to some extent, robust to image quality for this subcortical segmentation task. As for the fused score, our PGAN and MPGAN methods significantly outperform the other methods, and MPGAN achieves the best score. Table 4 lists the comparison results on the brain tissue segmentation task. The results confirm the statistically significantly better performance of our proposed methods (both PGAN and MPGAN) vs. the other methods with respect to all metrics. The results clearly demonstrate the effectiveness of our proposed methods. Furthermore, MPGAN offers superior performance to PGAN in this task, which indicates that MPGAN benefits from using complementary information from multi-contrast data when performing full brain tissue segmentation.

Segmentation Accuracy Analysis
In this section, we computed the AVD, ASD, and Dice coefficient between the reference manual segmentation ("Ref") and the automatic segmentation of the ground-truth images or the predicted images. The values of these metrics here reflect the segmentation accuracy for the given image. Our goal is to evaluate the predicted images by comparing their segmentation accuracy with that of the ground-truth images. Intuitively, an image that is more similar to the ground-truth image should have a segmentation accuracy closer to that of the ground-truth image. The segmentation accuracy comparison results on the subcortical segmentation task are shown in Table 5. We observe that the images predicted by MPGAN achieve the AVD, Dice coefficient, and fused score closest to those of the ground-truth images. Table 6 shows the segmentation accuracy comparison results on the tissue segmentation task. Similar to the subcortical segmentation task, MPGAN outperforms the other methods across most metrics (except AVD).

Efficacy of Data Imputation
The issue of missing scans is a common, practical problem in longitudinal studies. Subjects with incomplete scans cannot be used as training samples for machine learning applications, nor for statistical methods that require complete data. Thus, the training size is significantly reduced by these missing scans. Intuitively, using a larger training set is expected to improve performance, because adding training samples brings more information and increases the diversity of the dataset. In this section, we use our method to predict missing subject scans for machine learning tasks. After completing the data, these subjects can be added to the training set for methods requiring complete longitudinal subject data. While increasing the number of training samples via such imputation can improve model performance, the imputed data has to be of high quality and must represent the longitudinal data distribution appropriately; poorly imputed data is expected to reduce model performance. Here, we investigated whether adding our imputed data to increase the size of the training set is beneficial to two practical tasks:
1. Classification of the severity group according to the social affect (SA) calibrated severity score (CSS) of the Autism Diagnostic Observation Schedule (ADOS, second edition) (Lord et al., 2012) at 24 months of age from prior longitudinal image data at 6 and 12 months.
2. Regression of the gestational age at birth from later longitudinal image data at 6 and 12 months.
We compared the results of two settings: "non-imputed" and "imputed." For the "non-imputed" setting, the classifier/regressor was trained with only real image pairs, i.e., both 6-month and 12-month images are real. For the "imputed" setting, in addition to the real training pairs used in the first setting, the classifier/regressor was also trained on "mixed" image pairs, i.e., real 6-month and predicted 12-month images, or predicted 6-month and real 12-month images. No imputed data was employed in the testing set. Thus, any differences in performance would stem from the additional inclusion of the imputed/generated datasets in the training set.
In our experiments, the real image pairs were divided into 4 folds and a 4-fold cross-validation was employed to evaluate the performance. Each time, one fold was used for testing and the other three for training. For the "imputed" setting, an additional set of "mixed" image pairs was included in all training folds of the cross-validation scheme. Since MPGAN performed the best in the previous experiments, here we only employed MPGAN for image imputation. The Extreme Gradient Boosting (Xgboost) algorithm (Chen and Guestrin, 2016) was applied in both the classification and regression tasks, implemented using the scikit-learn Python libraries (Raschka, 2015). Instead of directly feeding the raw images into the Xgboost model, which would result in an extremely high-dimensional feature space, we employed the features extracted by the Model Genesis encoder (the 3D deep learning model previously used for the perceptual loss in section 2.4.1) as our inputs.
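The cross-validation scheme above, with imputed pairs confined to the training folds, can be sketched as follows. This is a hedged illustration, not the paper's code: the feature matrices and labels are random placeholders for the encoder features, and scikit-learn's GradientBoostingClassifier stands in for Xgboost so that the sketch is self-contained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Placeholder features: one row per subject, standing in for the
# concatenated 6-month and 12-month encoder feature vectors.
X_real = rng.normal(size=(77, 16))       # subjects with both scans real
y_real = rng.integers(0, 3, size=77)     # three severity groups
X_mixed = rng.normal(size=(103, 16))     # pairs completed by imputation
y_mixed = rng.integers(0, 3, size=103)

kf = KFold(n_splits=4, shuffle=True, random_state=0)
accs = []
for train_idx, test_idx in kf.split(X_real):
    # "Imputed" setting: mixed pairs are appended to the training
    # folds only; each test fold contains exclusively real pairs.
    X_tr = np.vstack([X_real[train_idx], X_mixed])
    y_tr = np.concatenate([y_real[train_idx], y_mixed])
    clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_real[test_idx])
    accs.append(balanced_accuracy_score(y_real[test_idx], pred))
```

Because the test folds never contain imputed data, any difference between this setting and a run trained on `X_real[train_idx]` alone is attributable to the imputed training pairs.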

ADOS-SA-CSS Group Classification at 24 Months With Imputed Data
Our goal in this experiment was to classify subject image data into one of three social affect severity groups (typical: score 1-2, low: score 3-4, moderate-to-high: score 5-10; Hus et al., 2014) at 24 months of age using 6- and 12-month MR image pairs. The ADOS-SA-CSS is a calibrated score that was developed to capture the severity of symptoms in social affect in children with ASD. We selected the ADOS-SA-CSS as the prediction measure instead of the other calibrated ADOS scores (Restricted and Repetitive Behavior, RRB, and total severity score) because the ADOS-SA-CSS was observed to have a smoother distribution in our sample, and because prior work indicates wide-scale associations with atypical social behavior in ASD (Sato and Uono, 2019).
We assessed whether data imputation improves the classification performance via the F1 score, AUC score, and balanced accuracy metrics. Before data imputation, 77 subjects with complete scans at 6 and 12 months can be used for training. After data imputation, an additional 103 subjects can be added to the training set. Table 7 summarizes the cross-validated results in the "non-imputed" and "imputed" settings. We found that adding imputed data to the training process can improve the model performance: the F1 score increased from 0.590 to 0.694, the AUC score from 0.655 to 0.671, and the balanced accuracy from 0.594 to 0.699. We show the confusion matrices of the "non-imputed" and "imputed" settings in Figure 7. The number of correctly classified samples clearly improves across all groups, and especially in the moderate-to-high ADOS-SA-CSS group.

Regression of Gestational Age at Birth With Imputed Data
In this part, we employed the 6- and 12-month MR images to regress the gestational age at birth (39.03 ± 1.50 weeks). The mean absolute error (MAE) and relative error (RE) of the regressed GA were used as metrics. The 4-fold cross-validated results of the "non-imputed" and "imputed" settings are shown in Table 8. Before data imputation, 134 subjects with complete scans at 6 and 12 months are used for training. After data imputation, an additional 76 subjects are added to the training set. Incorporating imputed data into the training process offers a slight improvement in regression performance, though this improvement is of smaller magnitude than in the ADOS-SA-CSS classification.
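The two regression metrics can be written down directly; note that the exact normalization of RE is not stated in this excerpt, so the definition below (absolute error divided by the true GA, averaged over subjects) is one common convention, assumed for illustration:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error of the regressed gestational age, in weeks."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def relative_error(y_true, y_pred):
    """Assumed RE: absolute error relative to the true GA, averaged."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / y_true))
```

For example, predictions of 39 and 40 weeks against true GAs of 40 and 38 weeks give an MAE of 1.5 weeks.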

DISCUSSION
In this paper, we present a novel adaptation of GAN to a new application: the longitudinal prediction of infant MR images. We validated and compared our technique with five alternative networks from multiple perspectives: qualitative and quantitative assessments of the image appearance, as well as quantitative assessment on two segmentation tasks. Our method showed consistently superior performance in these evaluations, indicating its effectiveness. The images predicted by our method were then used to expand the training set in the ADOS-SA-CSS group classification and gestational age regression experiments. These experiments show that the imputed data brings a performance boost, highlighting the potential of our image prediction method when applied to a practical task.
Looking at Figures 2, 3, one can see that the 3D-Unet trained with only the voxel-wise reconstruction loss (L_vr) does a good job of keeping low-frequency information, and thus global structures are well-preserved. However, the images it produces appear blurry and significantly lack high-frequency details, which is also reflected quantitatively in it achieving the worst LPIPS scores in Figure 4. We found that integrating the perceptual loss L_p into the training can effectively alleviate this problem, as Unet(L_vr + L_p) predicts sharper images than Unet(L_vr). Nevertheless, the images generated by Unet(L_vr + L_p) still have an unrealistic appearance from a visual perspective. Using an adversarial training scheme, our proposed PGAN employs a voxel-wise reconstruction loss, a perceptual loss, and an adversarial loss jointly to produce sharper and more realistic images. This results in a statistically significant improvement in the quantitative assessment (see Figure 4). A likely reason lies in the adversarial learning strategy: the discriminator is optimized to differentiate real and fake images, so in order to fool the discriminator, the generator has to push its output distribution closer to the distribution of real data. As a result, the outputs of the generator are visually realistic. As T1w and T2w images encompass rich information that is different and complementary to each other, we further propose a multi-contrast version, called MPGAN, which produces even finer details and achieves a better quantitative score than PGAN. In particular, we observe a loss of cortical contrast information in the T2w images for most of the evaluated methods, while it appears well-preserved for MPGAN.
In summary, the combination of the voxel-wise reconstruction loss, the perceptual loss, the adversarial loss, and the use of multi-contrast information allows our MPGAN to produce realistic images with accurate details where both low- and high-frequency information is well-preserved.
We also report analyses of the segmentation consistency and accuracy for the different methods in both the subcortical and tissue segmentation tasks. As shown in Tables 3, 4, our MPGAN achieves the best segmentation consistency across most of the used metrics, with differences significant at the p < 0.05 level. The second-best performance is yielded by our single-contrast PGAN method. As shown in Tables 5, 6, our MPGAN results in the segmentation accuracy closest to that of the ground-truth images across most of the considered metrics. We also found that, with respect to ASD and the Dice coefficient, there is no significant difference between PGAN and MPGAN in both the segmentation consistency and accuracy analyses on the subcortical segmentation task. However, MPGAN clearly performs better than PGAN on the tissue segmentation task. This may be attributed to the fact that subcortical segmentation is relatively simple owing to the consistent shape of subcortical structures, compared to the folded, complex cortex assessed in the tissue segmentation task.
To investigate the applicability of our predicted/imputed images, we employed our predicted image data for data imputation in two practical tasks, i.e., one on ADOS-SA-CSS group classification and the other on regression of gestational age. The results show that the model performance can be boosted with the help of imputed data for both tasks. We did not employ the imputed data for testing purposes, so all gains in classification/regression are due to the inclusion of the imputed data. This finding indicates that the predicted data was sufficiently close to the true data that it provided valuable information to the training process.
It is further noteworthy that any image data prediction is biased by the training data. Thus, we expect our method to potentially perform poorly for brain images with atypical morphometry or neuropathology that are unknown to the trained model as such data was not included in the training. This indicates the necessity to develop additional safeguards to ensure that the input data and the trained model are appropriately matched. In the results presented here, we apply our methods in a fairly narrow subject population (typically developing children and children at familial risk for ASD) and all MRI data was inspected by a neuroradiologist for the presence of visible neuropathology.
While our method shows very promising results, it is not without limitations. For one, the current approach needs paired longitudinal data of the same subject for training. A further computational limitation is that we directly feed 3D data into our networks, which requires large amounts of memory and thus a high-performance GPU server.

CONCLUSION
This paper introduces a novel multi-contrast perceptual generative network (MPGAN) for longitudinal prediction of infant MRI data. To the best of our knowledge, this is the first time that deep generative methods are applied for longitudinal prediction of structural MRI in the first year of life. Our approach improves the realism, sharpness, and accuracy of predicted images by merging the adversarial learning scheme with the voxel-wise reconstruction loss and the perceptual loss, as well as taking the multi-contrast information into account. In our qualitative and quantitative assessments, our method yielded a better performance than the alternative approaches studied in this work.
Longitudinal data is crucial to capture appropriate developmental trajectories in studies of the first year of life. Missing data is a major issue, and our proposed method achieves highly promising results in imputing such missing data for training data augmentation in classification or regression tasks. The improvement in performance when classifying subjects into categories of severity of social affect symptoms from image data alone is quite impressive.
Our future work will focus on extending the paired approach at consistent time points (here 6 and 12 months of age) to a time regression based approach to overcome our current limitation of discrete time points and model imputation along the full first year of life. Furthermore, additional experiments to quantify the value of adding real data (i.e., acquiring additional subjects) vs. adding imputed data (i.e., imputing incomplete data as performed here) will need to be performed.

DATA AVAILABILITY STATEMENT
Code: The source code of our MPGAN method is available as open source at https://github.com/liying-peng/MPGAN. Data: All raw MRI datasets and associated demographic information employed in this study are available at NIH/NDA: https://nda.nih.gov/edit_collection.html?id=19.

ETHICS STATEMENT
This study was reviewed and approved by the Institutional Review Boards (IRB) of all data collection sites (University of North Carolina, University of Washington, Children's Hospital of Pennsylvania, Washington University). Written informed consent was provided by the parent/guardian of all enrolled subjects. The data used in this work is collected from the Infant Brain Imaging Study (IBIS) database (https://www.ibisnetwork.org).

AUTHOR CONTRIBUTIONS
LP, LL, YL, Y-wC, GG, and MAS contributed to methodological development. LP, MAS, HH, CB, RG, MDS, and JG contributed to the study design. ACE, SD, AME, RM, KB, RS, HH, and JP contributed to the image data collection. LP, ZM, RV, SK, and MAS contributed to the experiment data collection. LP and MAS contributed to the statistical analysis. LP and MAS wrote the first draft of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

FUNDING
This study was supported by grants from the Major Scientific Project of Zhejiang Lab (No. 2018DG0ZX01), the National Institutes of Health (R01-HD055741, T32-HD040127, U54-HD079124, U54-HD086984, R01-EB021391, and P50-HD103573), Autism Speaks, and the Simons Foundation (140209). MDS was supported by NIH career development award K12-HD001441, and JG by K01-MH122779. The sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication.