# ARTIFICIAL INTELLIGENCE FOR MEDICAL IMAGE ANALYSIS OF NEUROIMAGING DATA

EDITED BY: Nianyin Zeng, Siyang Zuo, Guoyan Zheng, Yangming Ou and Tong Tong

PUBLISHED IN: Frontiers in Neuroscience, Frontiers in Computational Neuroscience and Frontiers in Psychiatry

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-826-0 DOI 10.3389/978-2-88963-826-0

### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# ARTIFICIAL INTELLIGENCE FOR MEDICAL IMAGE ANALYSIS OF NEUROIMAGING DATA

Topic Editors: Nianyin Zeng, Xiamen University, China; Siyang Zuo, Tianjin University, China; Guoyan Zheng, Shanghai Jiao Tong University, China; Yangming Ou, Harvard Medical School, United States; Tong Tong, Independent researcher, China

Citation: Zeng, N., Zuo, S., Zheng, G., Ou, Y., Tong, T., eds. (2020). Artificial Intelligence for Medical Image Analysis of NeuroImaging Data. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-826-0

# Table of Contents

*Editorial: Artificial Intelligence for Medical Image Analysis of Neuroimaging Data*

Nianyin Zeng, Siyang Zuo, Guoyan Zheng, Yangming Ou and Tong Tong

*Cerebral Micro-Bleeding Detection Based on Densely Connected Neural Network*

Shuihua Wang, Chaosheng Tang, Junding Sun and Yudong Zhang


Jiahang Xu, Fangyang Jiao, Yechong Huang, Xinzhe Luo, Qian Xu, Ling Li, Xueling Liu, Chuantao Zuo, Ping Wu and Xiahai Zhuang

*Brain White Matter Hyperintensity Lesion Characterization in T2 Fluid-Attenuated Inversion Recovery Magnetic Resonance Images: Shape, Texture, and Potential Growth*

Chih-Ying Gwo, David C. Zhu and Rong Zhang

*Nested Dilation Networks for Brain Tumor Segmentation Based on Magnetic Resonance Imaging*

Liansheng Wang, Shuxin Wang, Rongzhen Chen, Xiaobo Qu, Yiping Chen, Shaohui Huang and Changhua Liu

*Diagnosis of Alzheimer's Disease via Multi-Modality 3D Convolutional Neural Network*

Yechong Huang, Jiahang Xu, Yuncheng Zhou, Tong Tong, Xiahai Zhuang and the Alzheimer's Disease Neuroimaging Initiative (ADNI)

*Supervised Brain Tumor Segmentation Based on Gradient and Context-Sensitive Features*

Junting Zhao, Zhaopeng Meng, Leyi Wei, Changming Sun, Quan Zou and Ran Su

*Prediction and Classification of Alzheimer's Disease Based on Combined Features From Apolipoprotein-E Genotype, Cerebrospinal Fluid, MR, and FDG-PET Imaging Biomarkers*

Yubraj Gupta, Ramesh Kumar Lama, Goo-Rak Kwon and the Alzheimer's Disease Neuroimaging Initiative


Chuanlu Lin, Yi Wang, Tianfu Wang and Dong Ni


# Editorial: Artificial Intelligence for Medical Image Analysis of Neuroimaging Data

#### Nianyin Zeng<sup>1</sup> \*, Siyang Zuo<sup>2</sup> , Guoyan Zheng<sup>3</sup> , Yangming Ou<sup>4</sup> and Tong Tong<sup>5</sup>

*<sup>1</sup> Department of Instrumental and Electrical Engineering, Xiamen University, Xiamen, China, <sup>2</sup> Key Laboratory of Mechanism Theory and Equipment Design, Ministry of Education, Tianjin University, Tianjin, China, <sup>3</sup> School of Biomedical Engineering, Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China, <sup>4</sup> Department of Radiology, and Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States, <sup>5</sup> College of Physics and Information Engineering, Fuzhou University, Fuzhou, China*

Keywords: medical image analysis, artificial intelligence, machine learning, pattern recognition, computational intelligence

### **Editorial on the Research Topic**

### **Artificial Intelligence for Medical Image Analysis of Neuroimaging Data**

With the development of advanced medical imaging techniques, a huge number of medical images are produced in healthcare institutes and hospitals. In particular, there is growing research interest in multidisciplinary approaches to investigating brain structure and function in living humans and animals. To better interpret brain images, there is an increasing demand for artificial intelligence methods, such as machine learning, expert systems, robotics and perception, and evolutionary computation, that automatically exploit useful information beyond visual features. Brain images themselves exhibit several distinguishing features that add to the difficulty of their analysis. In recent years, there have been many new research achievements in each aspect of artificial intelligence for brain image analysis. This Research Topic sought original contributions that address the challenges of artificial intelligence for brain image analysis and welcomed researchers in this field to share their experiences and new research achievements.

We were pleased to receive many submissions presenting the authors' latest research results on artificial intelligence methods for medical image analysis. After rigorous review, 19 papers were accepted from a total of 29 submissions. The contributions came from different countries and regions, including China, the United Kingdom, the United States, Germany, South Korea, Denmark, Canada, and more.

Here, a brief introduction of the 19 accepted papers is given. We refer the readers to the papers in this topic and the references therein for more details. Lin W. et al. established a deep learning approach based on convolutional neural networks (CNN) to accurately predict MCI-to-AD conversion with magnetic resonance imaging (MRI) data. Kazeminejad and Sotero introduced a new biomarker extraction pipeline for Autism Spectrum Disorder that relies on the use of graph-theoretical metrics of fMRI-based functional connectivity to inform a support vector machine. Bi et al. proposed an advanced method, namely an evolutionary weighted random support vector machine cluster, for analysis of Alzheimer's disease. Ladefoged et al. focused on the problem of attenuation correction of PET/MRI in pediatric brain tumor patients based on a deep learning method. Livne et al. established a U-Net deep learning framework for high-performance vessel segmentation in patients with cerebrovascular disease. Wang, Sun et al. proposed a 14-layer convolutional neural network for the identification of multiple sclerosis. Huang C. et al. developed a new fusion method based on the combination of the shuffled frog leaping algorithm and a pulse coupled neural network for the fusion of SPECT images and CT images to improve the quality of fused brain images. Xin et al. utilized a deep learning method to find differences between the brains of men and women. Zhang Y. et al. proposed an improved wavelet threshold for image de-noising. Lin C. et al. proposed a novel low-rank method for the simultaneous recovery and segmentation of pathological MR brain images. Zhang Z. et al. developed a multiscale time-series model for the diagnosis of brain diseases. Gupta et al. proposed a novel machine learning-based framework to discriminate subjects with AD or MCI, utilizing a combination of four different biomarkers. Zhao et al. proposed a supervised brain tumor segmentation method based on gradient and context-sensitive features. Huang Y. et al. developed a multi-modality 3D convolutional neural network for the diagnosis of Alzheimer's disease. Wang L. et al. presented the use of Nested Dilation Networks for brain tumor segmentation. Gwo et al. developed a method to characterize and quantify the shape, texture, and potential growth of white matter hyperintensity lesions. Xu et al. introduced a fully automatic framework for Parkinson's disease diagnosis. Wang, Xie et al. proposed an AlexNet transfer learning model for alcoholism identification. Wang, Tang et al. developed a densely connected neural network for analysis of cerebral micro-bleeding.

Edited and reviewed by: *Vince D. Calhoun, Georgia State University, United States*

> \*Correspondence: *Nianyin Zeng zny@xmu.edu.cn*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *24 November 2019* Accepted: *17 April 2020* Published: *21 May 2020*

#### Citation:

*Zeng N, Zuo S, Zheng G, Ou Y and Tong T (2020) Editorial: Artificial Intelligence for Medical Image Analysis of Neuroimaging Data. Front. Neurosci. 14:480. doi: 10.3389/fnins.2020.00480*

Finally, we hope that this Research Topic will attract more research attention to artificial intelligence methods for medical image analysis. We thank the reviewers for their efforts to guarantee the high quality of this collection. We also thank all of the authors who have contributed.

## AUTHOR CONTRIBUTIONS

NZ wrote the editorial. SZ, GZ, YO, and TT edited the editorial.

### FUNDING

This work was supported in part by the International Science and Technology Cooperation Project of Fujian Province of China under Grant 2019I0003, in part by the UK-China Industry Academia Partnership Programme under Grant UK-CIAPP-276, in part by the Korea Foundation for Advanced Studies, in part by the Fundamental Research Funds for the Central Universities under Grant 20720190009, in part by The Open Fund of Provincial Key Laboratory of Eco-Industrial Green Technology-Wuyi University, in part by the Open Fund of Engineering Research Center of Big Data Application in Private Health Medicine of Fujian Province University under Grant KF2020002, and in part by the Harvard Medical School/Boston Children's Hospital Faculty Development Award (YO), and the St Baldrick Research Scholar Award (YO).

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Zeng, Zuo, Zheng, Ou and Tong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Cerebral Micro-Bleeding Detection Based on Densely Connected Neural Network

#### Shuihua Wang<sup>1</sup> \* † , Chaosheng Tang<sup>1</sup> † , Junding Sun<sup>1</sup> \* and Yudong Zhang<sup>1,2</sup> \* †

<sup>1</sup> School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China, <sup>2</sup> Department of Informatics, University of Leicester, Leicester, United Kingdom

#### Edited by:

Nianyin Zeng, Xiamen University, China

#### Reviewed by:

Haiou Liu, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany; Zhi Yong Zeng, Fujian Normal University, China

#### \*Correspondence:

Shuihua Wang shuihuawang@ieee.org Junding Sun sunjd@hpu.edu.cn Yudong Zhang yudongzhang@ieee.org

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

Received: 14 February 2019 Accepted: 12 April 2019 Published: 17 May 2019

### Citation:

Wang S, Tang C, Sun J and Zhang Y (2019) Cerebral Micro-Bleeding Detection Based on Densely Connected Neural Network. Front. Neurosci. 13:422. doi: 10.3389/fnins.2019.00422

Cerebral micro-bleedings (CMBs) are small chronic brain hemorrhages that have many side effects. For example, CMBs can result in long-term disability, neurologic dysfunction, cognitive impairment, and side effects from other medications and treatments. It is therefore essential to detect CMBs at an early stage so that treatment can begin promptly. Because labeled samples are limited, it is hard to train a classifier to high accuracy. We therefore proposed employing a densely connected neural network (DenseNet) as the basic algorithm for transfer learning to detect CMBs. To generate the subsamples for training and testing, we used a sliding window to cover the whole original images from left to right and from top to bottom. The target value of each subsample was decided by its central pixel. To address the data imbalance, a cost matrix was also employed. The new model achieved a classification accuracy of 97.71%, which is better than that of state-of-the-art methods.

Keywords: DenseNet, CMB detection, transfer learning, cost matrix, deep learning

## INTRODUCTION

Cerebral micro-bleedings (CMBs) are small chronic brain hemorrhages that can be caused by structural abnormalities of the small vessels of the brain. According to recent research reports, CMBs can also have other common causes, including high blood pressure, head trauma, aneurysm, blood vessel abnormalities, liver disease, blood or bleeding disorders, and brain tumors (Martinez-Ramirez et al., 2014). They can also be caused by unusual etiologies, such as cocaine abuse, posterior reversible encephalopathy, brain radiation therapy, intravascular lymphomatosis, thrombotic thrombocytopenic purpura, moyamoya disease, infective endocarditis, sickle cell anemia/β-thalassemia, proliferating angio-endotheliomatosis, cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL), genetic syndromes, or obstructive sleep apnea (Noorbakhsh-Sabet et al., 2017). Patients suffering from CMBs can show symptoms in which the functions controlled by the bleeding area are impaired, as well as a rise in intracranial pressure caused by the mass pressing on the brain. Because their symptoms and signs resemble those of subarachnoid hemorrhages, CMBs can easily be overlooked unless the patient has more obvious symptoms, such as a headache followed by vomiting. These symptoms can worsen gradually or occur suddenly, depending on the distribution and intensity of the CMBs. CMBs can result in cognitive impairment, neurologic dysfunction, and long-term disability. CMBs can also induce side effects from medication or treatments. Worse, death is possible and can happen quickly. Therefore, the early and prompt diagnosis of CMBs is essential for timely medical treatment.

Due to the paramagnetic susceptibility of hemosiderin (Allen et al., 2000), CMBs can be visualized by T2<sup>∗</sup>-gradient recalled echo (GRE) imaging or susceptibility weighted imaging (SWI). Traditionally, CMBs are interpreted manually after imaging, based on criteria including shape, diameter, and signal characteristics. However, the criteria varied between studies (Cordonnier et al., 2007) until Greenberg et al. (2009) published a consensus on standard criteria for CMB identification. Manual detection also involves human intervention, which can introduce bias. Moreover, it is labor intensive and hard to reproduce, and mimics are difficult to exclude, which can lead to misdiagnosis.

Therefore, the development of automatic CMB detection is important for the accurate detection and early diagnosis of CMBs. Thanks to advances in imaging technologies, many computer-vision-aided systems have been developed for automatic CMB detection. For example, Ateeq et al. (2018) proposed a system based on an ensemble classifier. Their system consisted of three steps: first the brain was extracted, then the initial candidates were detected by filtering and thresholding, and finally feature extraction and a classification model were used to remove the false alarms. Fazlollahi et al. (2015) proposed using a multi-scale Laplacian of Gaussian (msLoG) technique to detect the potential CMB candidates, followed by extracting a set of 3-dimensional Radon and Hessian based shape descriptors within each bounding box to train a cascade of binary random forests. Barnes et al. (2011) proposed a statistical thresholding algorithm to recognize the potential hypo-intensities. Then, a supervised classification model based on the support vector machine was employed to distinguish true CMBs from other marked hypo-intensities. van den Heuvel et al. (2016) proposed an automatic detection system for microbleeds in MRIs of patients with trauma, based on twelve features related to the dark and spherical appearance of CMBs and a random forest classifier. Bian et al. (2013) proposed a 2D fast radial symmetry transform (RST) based method to roughly detect the possible CMBs. Then, 3D region growing on the possible CMBs was utilized to exclude the falsely identified CMBs. Ghafaryasl et al. (2012) proposed a computer-aided system based on the following three steps: skull-stripping, initial candidate selection, and reduction of false positives (FPs) by a two-layer classifier. Zhang et al. (2017) proposed voxel-wise detection based on a single-hidden-layer feed-forward neural network with scaled conjugate gradient. Chen (2017) proposed a seven-layer deep neural network based on the sparse autoencoder for voxel detection of CMBs. Seghier et al. (2011) proposed a system named MIDAS for automatic CMB detection.

All of the above methods have made considerable progress in CMB detection. However, their detection accuracy and robustness still need improvement.

Therefore, in this paper, we employed SWI for CMB imaging because SWI provides high resolution, as reported in Haacke et al. (2009), and is among the most sensitive techniques for visualizing CMBs. Because the amount of labeled CMB images is typically very limited, and expert knowledge is needed to recognize representative characteristics in medical images, it is hard to train a classifier from scratch to high detection accuracy. We therefore used DenseNet as the basic algorithm for transfer learning. In summary, we proposed using transfer learning of DenseNet for CMB detection based on the collected images, which means that we reuse the knowledge DenseNet obtained from training on related tasks for CMB detection.

The remainder of this paper is organized as follows: the "Materials and Methods" section describes the method used in this research; the "Transfer Learning" section explains why we employed transfer learning; the "CMB Detection Based on the Transfer Learning and DenseNet" section describes the research materials used in this paper, including the training set and test set, and also presents the experimental results; and finally, the "Discussion" section provides the conclusion and discussion.

### MATERIALS AND METHODS

In recent years, deep learning (DL) has achieved great progress in object recognition (Tong et al., 2019; Xie et al., 2019), prediction (Yan et al., 2018; Hao et al., 2019), speech analysis (Cummins et al., 2018), noise reduction (Islam et al., 2018), monitoring (Li et al., 2018; Wang et al., 2018), medicine (Raja et al., 2015; Safdar et al., 2018), recommendation systems (Zhang and Liu, 2014), biometrics (Xing et al., 2017), and so on. DL models consist of multiple cascaded layers of nonlinear processing units that extract features, each layer taking the output of the previous layer as its input. To explore the potential of DL, many researchers tried to make networks deeper and wider. However, such networks suffer from the exploding or vanishing gradient problem (VGP). Therefore, a number of different DL structures were proposed. For example, AlexNet, the winner of the ImageNet Large Scale Visual Recognition Competition (ILSVRC) 2012, was proposed by Krizhevsky et al. (2012); it has a structure similar to LeNet but adds max pooling and ReLU non-linearity. VGGNet, proposed by Karen Simonyan (2015), won second place in ILSVRC 2014 and used deeper networks (19 layers) than AlexNet. GoogLeNet, the winner of ILSVRC 2014, provides a deeper and wider network that incorporates 1 × 1 convolutions to reduce the dimensionality of the feature maps before expensive convolutions, together with parallel paths with different receptive field sizes to capture sparse patterns of correlations in the feature map stacks. ResNet, the winner of ILSVRC 2015, offers a 152-layer network that introduces shortcut connections that skip at least two layers (He et al., 2016). Huang et al. (2017) proposed DenseNet, in which each layer takes the outputs of all previous layers as input in a feed-forward fashion, giving L(L+1)/2 connections for L layers, whereas traditional convolution networks with L layers provide L connections. According to Huang et al. (2017), DenseNet can outperform the state-of-the-art ResNet structure on the ImageNet classification task.
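The connection counts quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the enumeration assumes that layer l of a dense block receives one direct connection from the block input and one from each of its l − 1 predecessors.

```python
def plain_connections(num_layers: int) -> int:
    """A traditional feed-forward stack has one connection per layer."""
    return num_layers


def dense_connections(num_layers: int) -> int:
    """Closed form quoted in the text: L(L+1)/2 direct connections."""
    return num_layers * (num_layers + 1) // 2


def dense_connections_by_enumeration(num_layers: int) -> int:
    """Count connections layer by layer (assumption: layer l has l inputs)."""
    return sum(layer for layer in range(1, num_layers + 1))


for L in (5, 10, 100):
    print(L, plain_connections(L), dense_connections(L))
```

For a five-layer block this gives 15 direct connections instead of 5, which is the sense in which the network is "densely" connected.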

Considering the outstanding performance of DenseNet, we proposed employing DenseNet for cerebral microbleed detection in this paper. The details of DenseNet are introduced below. Before describing DenseNet, however, we first introduce the traditional convolutional neural network (CNN) and then point out the differences between a CNN and DenseNet.

### Traditional Convolution Neural Network

The traditional CNN usually includes convolution layers, ReLU layers, pooling layers, fully connected layers, and a softmax layer (Zeng et al., 2014, 2016a,c, 2017a,b). The functions of the different layers are introduced as follows:

The convolution layer is the core component of a CNN. The feature maps are generated by convolving the input with different kernels. **Figure 1** shows a toy example of the convolution operation.
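As a rough companion to the toy example, the following pure-Python sketch slides a kernel over a small image without padding. The input and kernel values are hypothetical and are not taken from Figure 1. As in most CNN libraries, the kernel is not flipped, so strictly speaking the operation is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) without padding, stride 1."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Weighted sum of the image patch under the kernel.
            acc = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 3 x 3 input and 2 x 2 kernel.
toy_image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
toy_kernel = [[1, 0],
              [0, -1]]
feature_map = conv2d_valid(toy_image, toy_kernel)
print(feature_map)
```

Each kernel produces one feature map; a convolution layer with several kernels simply repeats this computation once per kernel.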

Following the convolution layer is a non-linear activation function, named ReLU, whose purpose is to introduce non-linearity into the network. The mathematical expression of ReLU is shown in Eq. 1:

$$f(\mathbf{x}) = \mathbf{x}^+ = \max(0, \mathbf{x}) \tag{1}$$

The pooling layer downsamples the feature maps spatially to reduce the number of parameters, the memory footprint, and the computational cost of the network. The pooling function operates on each feature map independently. The main approaches are max pooling (Eq. 2) and average pooling (Eq. 3):

$$a\_j = \max\_{i \in R\_j} (M\_i) \tag{2}$$

$$a\_j = \frac{1}{|R\_j|} \sum\_{i \in R\_j} M\_i \tag{3}$$

in which R<sub>j</sub> stands for the j-th pooling region, M<sub>i</sub> for the value of the feature map at position i within that region, and |R<sub>j</sub>| for the number of elements within the pooling region.
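Eqs. 1 to 3 can be illustrated on a small feature map. The sketch below is not the authors' implementation; the 4 × 4 input and the choice of non-overlapping 2 × 2 pooling regions are assumptions made for the example.

```python
def relu(feature_map):
    """Eq. 1 applied element-wise: max(0, x)."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(region):
    """Eq. 2: largest value in the pooling region."""
    return max(region)


def avg_pool(region):
    """Eq. 3: mean of the values in the pooling region."""
    return sum(region) / len(region)


def pool2d(feature_map, size, reduce_fn):
    """Apply `reduce_fn` to non-overlapping size x size regions."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_fn(region))
        pooled.append(row)
    return pooled


# Hypothetical feature map containing negative responses.
raw = [[1, -2, 3, 0],
       [4, 5, -6, 7],
       [0, 1, 2, 3],
       [-1, -1, 8, -4]]
activated = relu(raw)
print(pool2d(activated, 2, max_pool))
print(pool2d(activated, 2, avg_pool))
```

Both pooling variants reduce the 4 × 4 map to 2 × 2; max pooling keeps the strongest response in each region, while average pooling smooths it.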

Fully connected layers calculate the class scores, which are stored in a volume of size 1 × 1 × n. Here, n is the number of categories, and each element holds the score of one class.

Every neuron of a fully connected layer is connected to all the neurons in the preceding layer.

### Structure Revision of the CNN

In the traditional CNN, all layers are connected sequentially, as in Eq. 4:

$$\mathbf{x}\_{l} = H\_{l}(\mathbf{x}\_{l-1}) \tag{4}$$

However, as the network becomes deeper and wider, it may suffer from exploding or vanishing gradients. Therefore, researchers proposed different network structures to overcome this problem. For example, ResNet introduced shortcut connections, and the equation is reformulated as Eq. 5:

$$\mathbf{x}\_{l} = H\_{l}(\mathbf{x}\_{l-1}) + \mathbf{x}\_{l-1} \tag{5}$$

Instead of summing the output feature maps of a layer with its incoming feature maps, DenseNet concatenates them. The expression is reformulated as Eq. 6:

$$\mathbf{x}\_{l} = H\_{l}(\left[\mathbf{x}\_{0}, \mathbf{x}\_{1}, \mathbf{x}\_{2}, \dots, \mathbf{x}\_{l-1}\right])\tag{6}$$

in which l is the layer index, H<sub>l</sub> stands for a non-linear operation, and x<sub>l</sub> stands for the output of the l-th layer.
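The difference between Eqs. 4, 5, and 6 is easiest to see when the three connection patterns are written side by side. The following sketch uses scalars instead of feature maps, and the layer operations (`double`, `sum_all`) are hypothetical stand-ins for H<sub>l</sub>; only the wiring is meant to be representative.

```python
def plain_forward(x0, layers):
    """Eq. 4: each layer sees only the previous output."""
    x = x0
    for h in layers:
        x = h(x)
    return x


def residual_forward(x0, layers):
    """Eq. 5: the layer output is added to its own input."""
    x = x0
    for h in layers:
        x = h(x) + x
    return x


def dense_forward(x0, layers):
    """Eq. 6: each layer sees the list of all earlier outputs."""
    features = [x0]
    for h in layers:
        features.append(h(list(features)))
    return features


def double(x):
    """Toy stand-in for H_l acting on a single input."""
    return 2 * x


def sum_all(features):
    """Toy stand-in for H_l acting on concatenated inputs."""
    return sum(features)


print(plain_forward(1, [double] * 3))
print(residual_forward(1, [double] * 3))
print(dense_forward(1, [sum_all] * 3))
```

In the dense variant, the list passed to each layer grows by one entry per layer, which mirrors the concatenation in Eq. 6.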

### DenseNet

As expressed in Eq. 6, DenseNet introduces direct connections from any layer to all subsequent layers. In other words, the l-th layer receives the feature maps of all preceding layers. However, concatenation is not feasible if the size of the feature maps changes, whereas down-sampling, which changes the size of the feature maps, is needed in the network. To make down-sampling possible within DenseNet, the network is divided into multiple densely connected dense blocks. The layers between the blocks are named transition layers and perform batch normalization, convolution, and pooling, as shown in **Figure 2**. **Figure 2** depicts a DenseBlock in which the number of layers is 5 and the growth rate is k. Each layer receives feature maps from all earlier layers.

Each operation H<sub>l</sub> generates k feature maps, where k is defined as the growth rate. Therefore, the l-th layer receives k<sub>0</sub> + k(l − 1) input feature maps, where k<sub>0</sub> is the number of channels in the input layer. As each layer typically has a large number of inputs, a 1 × 1 convolution is employed as a bottleneck layer before the 3 × 3 convolution layer to reduce the number of feature maps and improve computational efficiency.

FIGURE 5 | CMB samples.
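The growth of the number of input feature maps inside a dense block can be tabulated directly from the expression k<sub>0</sub> + k(l − 1). The values k<sub>0</sub> = 64 and k = 32 below are illustrative assumptions, not settings reported in this paper.

```python
def layer_input_channels(k0: int, k: int, layer: int) -> int:
    """Feature maps entering layer `layer` (1-indexed): k0 + k * (layer - 1)."""
    return k0 + k * (layer - 1)


def block_output_channels(k0: int, k: int, num_layers: int) -> int:
    """Feature maps leaving the block after all layers are concatenated."""
    return k0 + k * num_layers


# Hypothetical five-layer block, matching the layer count shown in Figure 2.
K0, GROWTH_RATE, NUM_LAYERS = 64, 32, 5
inputs_per_layer = [layer_input_channels(K0, GROWTH_RATE, l)
                    for l in range(1, NUM_LAYERS + 1)]
print(inputs_per_layer)
print(block_output_channels(K0, GROWTH_RATE, NUM_LAYERS))
```

The input width grows linearly with depth, which is why a 1 × 1 bottleneck in front of each 3 × 3 convolution helps to keep the computation manageable.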

To further improve model compactness, the number of feature maps is reduced by the transition layer. For example, if a dense block generates m feature maps and the compression factor is set as θ ∈ (0, 1], then the number of feature maps is reduced to ⌊θm⌋ by the following transition layer. If θ = 1, the number of feature maps remains unchanged. **Figure 3** shows the structure of DenseNet, which is composed of three DenseBlocks, an input layer, and transition layers. The cropped samples are used as the input, and the final layer indicates whether a sample is CMB or non-CMB.
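The effect of the compression factor can be sketched as follows. The function name and the example values of m and θ are hypothetical; only the rule ⌊θm⌋ comes from the text.

```python
import math


def compressed_channels(m: int, theta: float) -> int:
    """Feature maps kept by a transition layer: floor(theta * m)."""
    if not 0 < theta <= 1:
        raise ValueError("theta must lie in (0, 1]")
    return math.floor(theta * m)


print(compressed_channels(224, 0.5))
print(compressed_channels(224, 1.0))
```

With θ = 0.5, each transition layer halves the number of feature maps handed to the next dense block; with θ = 1 the transition layer leaves the count unchanged.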

### TRANSFER LEARNING

DenseNet has been widely applied in medical research. For example, Gottapu and Dagli (2018) proposed using DenseNet for anatomical brain segmentation. Khened et al. (2019) proposed cardiac segmentation based on fully convolutional multi-scale residual DenseNets. Wang H. et al. (2019) offered a system for recognition of mild cognitive impairment (MCI) and Alzheimer's disease (AD), based on an ensemble of 3D densely connected convolutional networks. Considering the limited amount of labeled training samples, retraining the whole DenseNet from scratch is unlikely to reach a high classification accuracy. Therefore, in this paper, we proposed transfer learning, which means freezing the earlier layers and retraining the later layers of DenseNet for the CMB detection task. The structure used here is DenseNet 201.

In order to make the pretrained DenseNet 201 suitable for CMB detection, which is a binary classification of CMB or non-CMB, the fully connected (FC) layer with 1000 neurons was replaced by a new FC layer with 2 neurons. The structure of the remaining part of DenseNet 201 was unchanged.

TABLE 1 | Dividing of the dataset for training and testing.
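The bookkeeping behind this kind of transfer learning can be sketched without any deep learning framework. In the example below, the `Layer` record, the layer names, the channel counts, and the number of frozen layers are all hypothetical; the paper does not state which layers were frozen, so `num_frozen` is purely an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    out_features: int
    trainable: bool = True


def adapt_for_binary_task(pretrained, num_frozen, num_classes=2):
    """Freeze the first `num_frozen` layers and swap the final FC layer."""
    adapted = []
    # Keep every layer except the original classification head.
    for index, layer in enumerate(pretrained[:-1]):
        adapted.append(Layer(layer.name, layer.out_features,
                             trainable=index >= num_frozen))
    # New head: one output neuron per class (CMB, non-CMB).
    adapted.append(Layer("fc_new", num_classes, trainable=True))
    return adapted


# Illustrative stand-in for a pretrained network with a 1000-way head.
pretrained_stack = [
    Layer("input_conv", 64),
    Layer("dense_block_1", 256),
    Layer("dense_block_2", 512),
    Layer("dense_block_3", 1792),
    Layer("dense_block_4", 1920),
    Layer("fc_1000", 1000),
]
cmb_model = adapt_for_binary_task(pretrained_stack, num_frozen=3)
for layer in cmb_model:
    print(layer.name, layer.out_features, layer.trainable)
```

In a real framework the same idea amounts to disabling gradient updates for the early layers and attaching a freshly initialized two-neuron layer in place of the 1000-neuron one.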

### CMB DETECTION BASED ON THE TRANSFER LEARNING AND DENSENET

### Materials

The subjects in this research were ten healthy controls and ten patients with CADASIL. Twenty 3D volumetric images were obtained from the 20 subjects. The software Syngo MR B17 was then used to reconstruct the 3D volumetric images. The size of each 3D volumetric image was uniformly set as 364 × 448 × 48.

To mark the CMBs manually, we employed three neuroradiologists with more than twenty years' experience. The rules were set as follows: (1) blood vessels were first excluded by tracking the neighboring slices; (2) lesions should be smaller than 10 mm in diameter. The potential CMBs were labeled as either "possible" or "definite"; all other voxels were regarded as non-CMB voxels. In case of disagreement, the majority opinion was followed.
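The majority rule among the three raters can be written in a couple of lines. This sketch assumes that each rater's decision for a voxel has already been reduced to a yes/no vote (CMB or not), which is a simplification; the text does not say how "possible" and "definite" labels were combined.

```python
def majority_vote(votes):
    """Return True when more than half of the raters voted for CMB."""
    positives = sum(1 for vote in votes if vote)
    return positives * 2 > len(votes)


# Hypothetical votes of three raters for two voxels.
print(majority_vote([True, True, False]))
print(majority_vote([False, True, False]))
```

With an odd number of raters and a binary vote, a tie cannot occur, so every voxel receives a well-defined label.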

The sample images were generated from the original images. We applied a sliding window of size 61 × 61 to the original image. The border pixels were discarded because they contain fat and skull. All the pixels located within a sliding window were used as one input, and the label of the pixel at the center of the sliding window was used as the target value. That is, if the central pixel is true (1), the target value is 1; otherwise, the target label is set to 0. This is expressed in Eqs 7 and 8:

$$I = W(p) \tag{7}$$

$$Ou = \begin{cases} 1, & \text{central pixel } p \text{ is true (CMB)}\\ 0, & \text{central pixel } p \text{ is false (non-CMB)} \end{cases} \tag{8}$$

where I stands for the cropped sample image generated via the sliding window, p represents the central pixel, W(p) denotes the pixels that are centered on pixel p and located inside the sliding window, and Ou is the label value. **Figures 4**, **5** show samples of CMB-centered and non-CMB-centered images.
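The patch generation in Eqs 7 and 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_patches` is hypothetical, one 2D slice is handled at a time, and "discarding the border" is interpreted as skipping every pixel whose window would not fit inside the image.

```python
def extract_patches(image, labels, window=61):
    """Yield (patch, target) pairs for every pixel whose window fits in the image.

    image  : 2D list of intensities (one slice)
    labels : 2D list of 0/1 ground-truth values, same shape as image
    """
    half = window // 2
    rows, cols = len(image), len(image[0])
    # Stride 1, left to right and top to bottom; border pixels are skipped
    # (assumption: a pixel is "border" when its window leaves the image).
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            # Eq. 7: I = W(p), the window centred on pixel p = (r, c).
            patch = [row[c - half:c + half + 1]
                     for row in image[r - half:r + half + 1]]
            # Eq. 8: the target is the label of the central pixel.
            target = 1 if labels[r][c] else 0
            yield patch, target
```

With a stride of 1, one patch is produced per admissible central pixel, which is why the number of non-CMB samples is so much larger than the number of CMB samples.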

The sliding window covered the image from left to right and top to bottom with a stride of 1. In total, we obtained 68,847 CMB voxels and 56,582,536 non-CMB voxels. The training and test sets were divided as shown in **Table 1**. We randomly selected 10,000 images of each category for testing, and the remaining images were used for training.

DenseNet requires inputs of size 224 × 224 × 3, so we padded the images with zeros to reach this size. A preprocessed image sample is shown in **Figure 6**. **Figure 7** shows the flowchart of the DenseNet, including the number of feature maps generated by each layer.
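A minimal sketch of the zero-padding step follows. The text does not say where the 61 × 61 patch is placed inside the 224 × 224 canvas or how the three channels are filled, so two assumptions are made here: the patch is centred, and the single-channel patch is copied into all three channels. The function name `pad_to_densenet_input` is hypothetical.

```python
def pad_to_densenet_input(patch, size=224, channels=3):
    """Zero-pad a 2D patch to size x size and replicate it over the channels."""
    top = (size - len(patch)) // 2        # assumption: patch is centred
    left = (size - len(patch[0])) // 2
    canvas = [[0] * size for _ in range(size)]
    for r, row in enumerate(patch):
        # Overwrite the central region of the zero canvas with the patch row.
        canvas[top + r][left:left + len(row)] = row
    # Return in height x width x channels layout.
    return [[[v] * channels for v in row] for row in canvas]
```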

From **Table 1**, we can see that the non-CMB training samples greatly outnumber the CMB training samples, which biases the classifier toward the non-CMB class. This makes it difficult to control false positives and false negatives, and in particular makes it hard for the model to find the CMB samples. To overcome this challenge, we introduced a cost matrix (Wu and Kim, 2010). The cost ratio ct was set to 961 via Eq. 9:

$$ct = \mathcal{N}\_{non-CMB} / \mathcal{N}\_{CMB} \tag{9}$$

where N<sub>non−CMB</sub> is the number of non-CMB training samples and N<sub>CMB</sub> is the number of CMB training samples. We employed the cost matrix instead of oversampling or downsampling mainly because we are more concerned about false positives and false negatives; it is therefore better to address the imbalanced learning problem through the cost matrix than to force a balanced data distribution.
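The value of 961 can be checked with a short calculation. The sketch below assumes that the training counts equal the total counts minus the 10,000 test samples drawn from each class, which is how the text describes the split; the variable and function names are illustrative only.

```python
N_CMB_TOTAL = 68_847
N_NON_CMB_TOTAL = 56_582_536
N_TEST_PER_CLASS = 10_000


def cost_ratio(n_non_cmb_train, n_cmb_train):
    """Eq. 9: ratio of non-CMB to CMB training samples."""
    return n_non_cmb_train / n_cmb_train


# Assumed training counts: everything that was not drawn for testing.
n_cmb_train = N_CMB_TOTAL - N_TEST_PER_CLASS
n_non_cmb_train = N_NON_CMB_TOTAL - N_TEST_PER_CLASS
ct = cost_ratio(n_non_cmb_train, n_cmb_train)
```

Under these assumptions `ct` evaluates to roughly 961.35, which rounds to the cost ratio quoted in the text.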

### Experiment Design

The goal of this research is to identify the input image as either CMB or non-CMB. To achieve this goal, we used DenseNet 201 as the base model for transfer learning, given the excellent performance of DenseNet on the ImageNet classification task. Section "Materials" described the materials used in this research. From the original images, we created 68,847 CMB subsamples and 56,582,536 non-CMB subsamples. 10,000 samples of each category were randomly selected as test samples, and the remaining subsamples were used for training. To overcome the problem of data imbalance, we used a cost matrix, which reflects our greater concern about false positives and false negatives. The experiments were carried out in Matlab on the Windows 10 operating system with a 2.88 GHz processor and 8 GB of memory. The following experiments were carried out: (1) CMB detection based on DenseNet. The measurements used here include accuracy, sensitivity, and specificity. The definitions of the measurements can be found in Zhang et al. (2018b). (2) Different cutting points of transfer learning. (3) To show the performance of the proposed method, we compared it with other state-of-the-art work. Considering the measurements provided in other research, we only used sensitivity, specificity, and accuracy.

TABLE 2 | Confusion matrix of detected CMB and Non-CMB.

To better illustrate DenseNet, we added a flowchart showing the feature map size and the learnable weights of each layer. Only the width of each feature map is noted, since the height equals the width.

### CMB Detection Result Based on DenseNet

The rebuilt network was composed of four DenseBlocks, one input layer, three transition layers, one fully connected layer with two neurons, a softmax layer and a classification layer, as described in **Figure 8**.

**Table 2** provides the detection result. 9777 CMBs and 9764 non-CMBs were correctly detected; 236 non-CMBs were incorrectly detected as CMBs, and 223 CMBs were wrongly detected as non-CMBs. The sensitivity was 97.78%, the specificity was 97.64%, the accuracy was 97.71%, and the precision was 97.65%. These measurements are averages over 10 runs, as shown in **Table 3**.

### Comparison to the Different Cases of Transfer Learning

In order to achieve the best performance of transfer learning, different cutting points for transfer learning were designed as shown in **Figure 8**. Due to the limited subjects, we mainly focused on retraining the later layers of DenseNet. Therefore, in case A, the DenseNet 201 except for the last three layers, was used as the feature extractor for this research, and we retrained the newly added three layers.

In cases B, C, and D, we included extra layers for retraining. For example, case B retrained DenseBlock 4, the batch normalization layer, the average pooling layer, the fully connected (FC) layer, the softmax layer, and the classification layer. This was implemented by setting the learning rate to 0 for the earlier layers and setting the learning rate factor to 10 for the layers to be retrained. **Table 4** illustrates the comparison results.

TABLE 3 | Measurement values of CMB detection based on transfer learning of DenseNet (Unit: %).

TABLE 4 | Comparison of different cases of transfer learning (Unit: %).

From **Table 4**, we can see that Case A performed slightly better than the other three cases in terms of sensitivity and accuracy. Since medical research focuses more on sensitivity and accuracy than on the other two measures, we consider that Case A provided the best performance among all the cases. **Figure 9** shows the error bars of the measurement values. In terms of storage, all four cases take about the same RAM when the precomputation method is not used.

TABLE 5 | Comparison to the state of art methods.

### Comparison to the State of Art Work

In order to validate our proposed method, we compared different state of the art methods, including traditional machine learning methods and DL methods.

In **Table 5**, we compared our method with a single-hidden-layer feed-forward neural network (SLFN) + (leaky) rectified linear unit, a 4-layer sparse encoder, a 7-layer sparse encoder, CNNs with different numbers of layers, a naive Bayes classifier (NBC), and so on. Our proposed method offers the best performance. DenseNet works as a logical extension of ResNet but provides more compact models and makes full use of the features.

**Figure 10** shows a bar chart comparing the state-of-the-art methods. Our proposed method performs slightly better than the current best method and substantially better than the traditional naive Bayes classifier (NBC).

### DISCUSSION

In this paper, we employed DenseNet to detect CMBs in patients with CADASIL. DenseNet was proposed by Huang et al. (2017) and outperformed other DL methods on the ImageNet classification task because of its model compactness and full use of features. DenseNet is quite similar to ResNet; however, instead of summation, DenseNet concatenates all feature maps from previous layers, which encourages feature reuse, alleviates the VGP, and decreases the number of parameters.

Therefore, in this paper, we used DenseNet for CMB detection on the assumption that CMB detection is similar to ImageNet classification. However, because of the data imbalance, we used a cost matrix to avoid biasing the model toward non-CMB; a model trained on the imbalanced dataset would find it hard to detect CMBs. Although there are other methods for handling data imbalance, such as oversampling and downsampling, we are more concerned about false negatives and false positives. Therefore, instead of forcing the data into a balanced distribution, we employed the cost matrix. To find the best cutting point, we tested different cases of transfer learning; the results are shown in **Table 4**, although the differences are not pronounced. On the other hand, training fewer layers can save time and decrease the computation cost if the precomputation strategy is adopted.

In the future, we will try to collect more samples and test more structures for CMB detection. Meanwhile, the long-term training cost is very high (Zeng et al., 2018); it is therefore necessary to optimize the algorithm to make training faster (Zeng et al., 2016b,d). We will consider other precomputation and optimization methods (Zhang et al., 2018a).

### DATA AVAILABILITY

The datasets for this manuscript are not publicly available because of the privacy of the subjects. Requests to access the datasets should be directed to shuihuawang@ieee.org.

### ETHICS STATEMENT

This research was approved by Institutional Review Board (IRB) of the First Affiliated Hospital of Nanjing Medical University. We obtained written informed consent from the participants of this study.

### AUTHOR CONTRIBUTIONS

SW proposed the study and wrote the draft. CT and JS designed the model and interpreted the results. CT and YZ analyzed the data. SW and YZ acquired the preprocessed data. All authors gave critical revision and consent for this submission.

### FUNDING

This project was financially supported by Natural Science Foundation of Jiangsu Province (No. BK20180727).





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Tang, Sun and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Alcoholism Identification Based on an AlexNet Transfer Learning Model

Shui-Hua Wang<sup>1,2,3†</sup>, Shipeng Xie<sup>4†</sup>, Xianqing Chen<sup>5†</sup>, David S. Guttery<sup>3†</sup>, Chaosheng Tang<sup>1</sup>\*, Junding Sun<sup>1</sup>\* and Yu-Dong Zhang<sup>1,3,6</sup>\*<sup>†</sup>

*<sup>1</sup> School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China, <sup>2</sup> School of Architecture Building and Civil Engineering, Loughborough University, Loughborough, United Kingdom, <sup>3</sup> Department of Informatics, University of Leicester, Leicester, United Kingdom, <sup>4</sup> College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China, <sup>5</sup> Department of Electrical Engineering, College of Engineering, Zhejiang Normal University, Jinhua, China, <sup>6</sup> Guangxi Key Laboratory of Manufacturing System and Advanced Manufacturing Technology, Guilin, China*

Aim: This paper proposes a novel alcoholism identification approach that can assist radiologists in patient diagnosis.

Edited by: *Nianyin Zeng, Xiamen University, China*

### Reviewed by:

*Liansheng Wang, Xiamen University, China Chen Yang, Hangzhou Dianzi University, China*

#### \*Correspondence:

*Chaosheng Tang tcs@hpu.edu.cn Junding Sun sunjd@hpu.edu.cn Yu-Dong Zhang yudongzhang@ieee.org*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Neurodegeneration, a section of the journal Frontiers in Psychiatry*

Received: *14 February 2019* Accepted: *21 March 2019* Published: *11 April 2019*

#### Citation:

*Wang S-H, Xie S, Chen X, Guttery DS, Tang C, Sun J and Zhang Y-D (2019) Alcoholism Identification Based on an AlexNet Transfer Learning Model. Front. Psychiatry 10:205. doi: 10.3389/fpsyt.2019.00205*

Method: AlexNet was used as the basic transfer learning model. The global learning rate was small, at 10<sup>−4</sup>, and the number of training epochs was 10. The learning rate factor of the replaced layers was 10 times larger than that of the transferred layers. We tested five different replacement configurations of transfer learning.

Results: The experiment shows that the best performance was achieved by replacing the final fully connected layer. Our method yielded a sensitivity of 97.44 ± 1.15%, a specificity of 97.41 ± 1.51%, a precision of 97.34 ± 1.49%, an accuracy of 97.42 ± 0.95%, and an F1 score of 97.37 ± 0.97% on the test set.

Conclusion: This method can assist radiologists in their routine alcoholism screening of brain magnetic resonance images.

Keywords: alcoholism, transfer learning, AlexNet, data augmentation, convolutional neural network, dropout, local response normalization, magnetic resonance imaging

### INTRODUCTION

Alcoholism (1) was previously composed of two types: alcohol abuse and alcohol dependence. According to current terminology, alcoholism differs from "harmful drinking" (2), which is an occasional pattern of drinking that contributes to increasing levels of alcohol-related ill-health. Today, it is defined depending on more than one of the following conditions: alcohol is strongly desired, usage results in social problems, drinking large amounts over a long time period, difficulty in reducing alcohol consumption, and usage resulting in non-fulfillment of everyday responsibilities.

Alcoholism affects all parts of the body, but it particularly affects the brain. The gray matter and white matter of alcoholic subjects are smaller than those of age-matched controls (3), and this shrinkage can be observed using magnetic resonance imaging (MRI). However, neuroradiological diagnosis using MR images is a laborious process, and it is difficult to detect minor alterations in the brains of alcoholic patients. Therefore, development of a computer vision-based automatic smart alcoholism identification program is highly desirable to assist doctors in making a diagnosis.

Within the last decade, studies have developed several promising alcoholism detection methods. Hou (4) put forward a novel algorithm called predator-prey adaptive-inertia chaotic particle swarm optimization (PAC-PSO), and applied it to identify alcoholism in MR brain images. Lima (5) proposed to use Haar wavelet transform (HWT) to extract features from brain images, and the authors used HWT to detect alcoholic patients. Macdonald (6) developed a logistic regression (LR) system. Qian (7) employed the cat swarm optimization (CSO) and obtained excellent results in the diagnosis of alcoholism. Han (8) used wavelet Renyi entropy (WRE) to generate a new biomarker; whereas Chen (9) used a support vector machine, which was trained using a genetic algorithm (SVM-GA) approach. Jenitta and Ravindran (10) proposed a local mesh vector co-occurrence pattern (LMCoP) feature for assisting diagnosis.

Recently, deep learning has attracted attention in many computer vision fields, e.g., synthesizing visual speech (11), liver cancer detection (12), brain abnormality detection (13), etc. As a result, studies are now focused on using deep learning techniques for alcoholism detection. Compared to manual feature extraction methods (14–18), deep learning can "learn" the features of alcoholism. For example, Lv (19) established a deep convolutional neural network (CNN) containing seven layers. Their experiments found that their model obtained promising results, and the stochastic pooling provided better performance than max pooling and average pooling. Moreover, Sangaiah (20) developed a ten-layer deep artificial neural network (i.e., three fully-connected layers and seven conv layers), which integrated advanced techniques, such as dropout and batch normalization, into their neural network.

Transfer learning (TL) is a new pattern recognition problem-solver (21–23). TL attempts to transfer knowledge learned from one or more source tasks (e.g., the ImageNet dataset) and uses it to improve learning in a related target task (24). From a practical perspective, the advantages of TL over plain deep learning are: (i) TL uses a pretrained model as a starting point; (ii) fine-tuning a pretrained model is usually easier and faster than training a randomly-initialized deep neural network.

The contribution of this paper is that we may be the first to apply transfer learning in this field of alcoholism identification. We used AlexNet as the basic transfer learning model and tested different transfer configurations. Further, the experiments showed that the performance (sensitivity, specificity, precision, accuracy, and F1 score) of our model is >97%, which is superior to state-of-the-art approaches. We also validated the effectiveness of using data augmentation which further improves the performance of our model.

### DATA PREPROCESSING

### Datasets

This study was approved by the ethical committee of Henan Polytechnic University. Three hundred seventy-nine slices were obtained, comprising 188 alcoholic brain images and 191 non-alcoholic brain images. We divided the dataset into three parts: a training set containing 80 alcoholic brain images and 80 non-alcoholic brain images; a validation set containing 30 alcoholic brain images and 30 non-alcoholic brain images; and a test set containing 78 alcoholic brain images and 81 non-alcoholic brain images. The division is shown in **Table 1**.

TABLE 1 | Dataset division into training, validation, and test sets.


TABLE 2 | Data augmentation.


### Data Augmentation

To improve the performance of deep learning, data augmentation (DA) (25) was introduced. Because our deep neural network model has many parameters, we needed to provide a proportionally large number of sample images to achieve optimal performance. For each original image, we generated a horizontally flipped image. Then, for both the original and the horizontally flipped images, we applied the following five DA techniques: (i) noise injection, (ii) scaling, (iii) random translation, (iv) image rotation, and (v) gamma correction. Each of those methods produced 30 new images.

Gaussian noise with zero mean and a variance of 0.01 was applied to every image. Scaling was used with a scaling factor of 0.7–1.3 in steps of 0.02. Random translation was applied with a random shift within [−40, 40] pixels. Image rotation was employed with a rotation angle varying from −30° to 30° in steps of 2°. Gamma correction was applied with a gamma value varying from 0.4 to 1.6 in steps of 0.04.

The DA result is shown in **Table 2**. Each original image yielded (1 + 30 × 5) × 2 = 302 images, including itself. After DA, the training set had 24,160 alcoholism brain images and 24,160 healthy brain images. Altogether, the new training set consisted of a balanced 160 × 302 = 48,320 samples.
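The augmentation counts can be reproduced with a small bookkeeping sketch. One assumption is needed to arrive at exactly 30 images per deterministic technique: each inclusive parameter grid (scale, angle, gamma) has 31 values, and the identity setting (scale 1, angle 0°, gamma 1) is dropped because it would only reproduce the input. The helper `grid` and the variable names are illustrative.

```python
def grid(start, stop, step):
    """Inclusive grid of evenly spaced values, rounded to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + k * step, 2) for k in range(n + 1)]


# Parameter grids from the text; the identity setting is removed on the
# assumption that it would only duplicate the input image.
scales = [s for s in grid(0.7, 1.3, 0.02) if s != 1.0]
angles = [a for a in grid(-30, 30, 2) if a != 0]
gammas = [g for g in grid(0.4, 1.6, 0.04) if g != 1.0]

per_technique = 30                                   # images per DA technique
images_per_original = (1 + per_technique * 5) * 2    # original + flipped copy
training_total = 160 * images_per_original           # 80 + 80 training images
```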

### METHODOLOGY

### Fundamentals of Transfer Learning

The core idea of transfer learning (TL) is shown in **Figure 1**. The core is to use a relatively complex and successful pretrained model, trained on a large data source, e.g., ImageNet, which is a large visual database developed for visual object recognition research (26). It contains more than 14,000,000 hand-annotated images, and at least one million images are provided with bounding boxes. ImageNet contains more than 20,000 categories (27). Usually, pretrained models are trained on a subset of ImageNet with 1,000 categories. Then we "transferred" the learnt knowledge to a relatively simple task (e.g., classifying alcoholism and non-alcoholism in this study) with a small amount of private data.

Two attributes are important to help the transfer (28): (i) The success of the pretrained model spares the user from tedious hyperparameter tuning for new tasks; (ii) The early layers in pretrained models can be regarded as feature extractors that help to extract low-level features, such as edges, tints, shades, and textures.

Traditional TL only retrains the new layers (29). In this study, we initially used the pretrained model, and then re-trained the whole structure of the neural network. Importantly, the global learning rate is fixed, and the transferred layers will have a low factor, while newly-added layers will have a high factor.

### AlexNet

AlexNet competed in the ImageNet challenge (30) in 2012 and achieved a top-5 error of only 15.3%, more than 10.8 percentage points lower than that of the runner-up, which used a shallow neural network. The original AlexNet was run on two graphical processing units (GPUs). Nowadays, researchers tend to use only one GPU to implement AlexNet. **Figure 2** illustrates the structure of AlexNet. This study only counts layers associated with learnable weights. Hence, AlexNet contains five conv layers (CL) and three fully-connected layers (FCL), totaling eight layers.

The details of the learnable weights and biases of AlexNet are shown in **Table 3**. The total number of weights and biases in AlexNet is 60,954,656 + 10,568 = 60,965,224. In Matlab, each variable is stored in single-precision floating-point format, taking four bytes. Hence, in total we needed to allocate about 233 MB.
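The totals above can be cross-checked with a short parameter count. The layer shapes below are those of the standard two-group AlexNet and are assumed here to match the shapes listed in **Table 3**; the names `LAYERS` and `count_parameters` are illustrative.

```python
# (name, weight shape) for the standard two-group AlexNet (assumed shapes).
# Conv layers: height x width x input channels per group x number of filters.
LAYERS = [
    ("CL1", (11, 11, 3, 96)),
    ("CL2", (5, 5, 48, 256)),
    ("CL3", (3, 3, 256, 384)),
    ("CL4", (3, 3, 192, 384)),
    ("CL5", (3, 3, 192, 256)),
    ("FCL6", (9216, 4096)),
    ("FCL7", (4096, 4096)),
    ("FCL8", (4096, 1000)),
]


def count_parameters(layers):
    """Return (total weights, total biases) over all learnable layers."""
    weights = biases = 0
    for _, shape in layers:
        n = 1
        for d in shape:
            n *= d
        weights += n
        biases += shape[-1]   # one bias per output channel or neuron
    return weights, biases


weights, biases = count_parameters(LAYERS)
size_mib = (weights + biases) * 4 / 2**20   # 4 bytes per single-precision value
```

Under these assumed shapes the count reproduces 60,954,656 weights and 10,568 biases, and the storage comes to about 233 when expressed in units of 2<sup>20</sup> bytes.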

### Common Layers in AlexNet

Compared to traditional neural networks, there are several advanced techniques used in AlexNet. First, CLs contain a set of learnable filters. For example, suppose the user has a 3D input with a size of P<sub>W</sub> × P<sub>H</sub> × D and a 3D filter with a size of Q<sub>W</sub> × Q<sub>H</sub> × D. The size of the output activation map is then S<sub>W</sub> × S<sub>H</sub>. The values of S<sub>W</sub> and S<sub>H</sub> can be obtained by

$$S\_W = 1 + \frac{P\_W - Q\_W + 2\beta}{\mu} \tag{1}$$

$$S\_H = 1 + \frac{P\_H - Q\_H + 2\beta}{\mu} \tag{2}$$

where µ is the stride size and β represents the margin (i.e., the padding). Commonly, there may be T filters. One filter generates one 2D feature map, and T filters yield an activation map with a size of S<sub>W</sub> × S<sub>H</sub> × T. An illustration of the convolutional procedure is shown in **Figure 3**. The "feature learning" in the filters can be regarded as a replacement for the "feature extraction" in traditional machine learning.
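Eqs 1 and 2 are the same formula applied along the width and the height, so a single helper covers both. The sketch below is illustrative; the function name is hypothetical, and the example values in the usage note (a 227-pixel input with an 11-pixel filter and a stride of 4) are the commonly quoted first-layer settings of AlexNet rather than figures taken from this paper.

```python
def conv_output_size(p, q, stride, padding):
    """Eqs 1 and 2: output length along one axis.

    p: input size, q: filter size, stride: step size, padding: margin.
    """
    span = p - q + 2 * padding
    if span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return 1 + span // stride
```

For instance, `conv_output_size(227, 11, 4, 0)` gives 55, and a 5 × 5 filter with stride 1 and a margin of 2 leaves the spatial size unchanged.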

Second, the rectified linear unit (ReLU) function was employed to replace the traditional sigmoid function S(x) as the activation function (31). The reason is that the sigmoid function may encounter the gradient vanishing problem in deep neural network models.

$$S(x) = \frac{1}{1 + \exp(-x)} \tag{3}$$

Therefore, the ReLU was proposed and defined as follows:

$$\text{ReLU}(x) = \max(0, x) \tag{4}$$

The gradient of ReLU is always one when the input is larger than or equal to zero. It has been reported that deep neural networks with ReLU as the activation function converge about six times faster than those with traditional activation functions. Therefore, the ReLU function greatly accelerates the training procedure.
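The contrast between the two activation functions can be made concrete by comparing their gradients. The sketch below is only an illustration; the function names are ours, and the ReLU gradient at exactly zero follows the convention stated in the text (one for inputs greater than or equal to zero).

```python
import math


def sigmoid(x):
    """Eq. 3."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: S(x) * (1 - S(x)), never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    """Eq. 4."""
    return max(0.0, x)


def relu_grad(x):
    # Convention used in the text: gradient 1 for x >= 0, otherwise 0.
    return 1.0 if x >= 0 else 0.0
```

Because the sigmoid gradient is at most 0.25, the product of such gradients across ten stacked layers is already below 10<sup>−6</sup>, whereas the ReLU gradient stays at one for active units.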

Third, a pooling operation is implemented with two advantages: (i) It can reduce the size of the feature map, and thus reduce the computation burden. (ii) It ensures that the representation becomes invariant to small translations of the input. Max pooling (MP) is a common technique that chooses the maximum value within a 2 × 2 region of interest. **Figure 4** shows a toy example of MP, with a stride of 2 and a kernel size of 2.
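A toy version of the pooling step, written with plain nested lists, is sketched below. The function name `max_pool` and the 4 × 4 input in the usage note are illustrative and are not the values shown in the figure.

```python
def max_pool(feature_map, kernel=2, stride=2):
    """Max pooling over a 2D feature map given as a list of rows."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            # Maximum over the kernel x kernel region starting at (r, c).
            max(feature_map[r + i][c + j]
                for i in range(kernel) for j in range(kernel))
            for c in range(0, cols - kernel + 1, stride)
        ]
        for r in range(0, rows - kernel + 1, stride)
    ]
```

Applied to `[[1, 3, 2, 4], [5, 6, 7, 8], [3, 2, 1, 0], [1, 2, 3, 4]]`, the function returns `[[6, 8], [3, 4]]`, halving each spatial dimension.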

The fourth improvement is "local response normalization (LRN)." Krizhevsky et al. (26) proposed LRN in order to aid generalization. Suppose that a<sub>i</sub> represents a neuron computed by applying kernel i and the ReLU non-linearity; then the response-normalized neuron b<sub>i</sub> is expressed as:

$$b\_i = \frac{a\_i}{\left(m + \alpha \sum\_{s=\max(0, i-z/2)}^{\min(Z-1, i+z/2)} a\_s^2\right)^\beta} \tag{5}$$

where z is the window channel size, controlling the number of channels used for normalization of each element, and Z is the total number of kernels in that layer. The hyperparameters are set as: β = 0.75, α = 10<sup>−4</sup>, m = 1, and z = 5.
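Eq. 5 can be sketched for a single spatial position as follows. The function takes the activations of all Z kernels at that position as a list; the name `local_response_norm` is hypothetical, and z/2 is interpreted as integer division, which is an assumption for odd z.

```python
def local_response_norm(a, z=5, alpha=1e-4, beta=0.75, m=1.0):
    """Apply Eq. 5 to the activations `a` of all Z kernels at one position."""
    Z = len(a)
    half = z // 2   # assumption: z/2 in Eq. 5 is taken as integer division
    out = []
    for i in range(Z):
        lo, hi = max(0, i - half), min(Z - 1, i + half)
        # Sum of squared activations over the neighbouring kernels.
        energy = sum(a[s] ** 2 for s in range(lo, hi + 1))
        out.append(a[i] / (m + alpha * energy) ** beta)
    return out
```

With the default hyperparameters, kernels near the edge of the channel range are normalized over fewer neighbours than kernels in the middle.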

TABLE 3 | Learnable layers in AlexNet.


Fifth, the fully connected layers (FCLs) have connections to all activations in the previous layer, so they can be modeled as multiplying the input by a weight matrix and then adding a bias vector. The last fully-connected layer has as many artificial neurons as the total number of classes C. Therefore, each neuron in the last FCL represents the score of the corresponding class, as shown in **Figure 5**.

Sixth, the softmax layer (SL) uses the multiclass generalization of logistic regression (32), also known as the softmax function. The SL is commonly connected after the final FCL. From the perspective of the activation function, the sigmoid/ReLU function works in a single-input single-output mode, while the SL works in a multiple-input multiple-output mode, as shown in **Figure 6**. As a toy example, suppose the final SL receives four inputs with values of (1, 2, 3, 4); then the softmax layer outputs approximately [0.032, 0.087, 0.237, 0.644].
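The toy example with the four scores 1, 2, 3, and 4 can be reproduced with a few lines of Python. This is a generic sketch of the softmax function, not the authors' Matlab code; subtracting the maximum score before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to one."""
    shift = max(scores)                        # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1, 2, 3, 4])
```

Each output depends on all four inputs through the shared denominator, which is what distinguishes the softmax layer from element-wise activations such as the sigmoid or ReLU.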

Suppose that T(f) symbolizes the prior class probability of class f, and T(h|f) means the conditional probability of sample h given class f. Then we can conclude that the likelihood of sample h belonging to class f is

$$T(f \mid h) = \frac{T(h \mid f) \times T(f)}{\sum\_{i=1}^{F} T(h \mid i) \times T(i)} \tag{6}$$

Here F stands for the total number of classes. Let Ω<sub>f</sub> be defined as

$$
\Omega\_f = \ln \left[ T(h \mid f) \times T(f) \right] \tag{7}
$$

Since exp(Ω<sub>f</sub>) = T(h | f) × T(f), substituting into Eq. 6 gives

$$T(f \mid h) = \frac{\exp\left(\Omega\_f(h)\right)}{\sum\_{i=1}^{F} \exp\left(\Omega\_i(h)\right)}\tag{8}$$

Finally, a dropout technique is used, since training a big neural network is too expensive. Dropout deactivates neurons at random with a dropout probability (P<sub>D</sub>) of 0.5. During the training phase, the dropped-out neurons take part in neither the forward nor the backward pass. During the test phase, all neurons are used, but with outputs multiplied by P<sub>D</sub> of 0.5 (33).

Dropout can be regarded as taking a geometric mean of the predictive distributions generated by exponentially many small-size dropout neural networks. **Figure 7A** shows a plain neural network with (2, 4, 8, 10) neurons at each layer, and **Figure 7B** shows the corresponding dropout neural network with P<sub>D</sub> of 0.5, where only (1, 2, 4, 5) neurons remain active at each layer.
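The training and test behaviour of dropout can be sketched for one layer's activations as below. The function name and the `rng` parameter are hypothetical. Note one interpretation made here: at test time the outputs are scaled by the retention probability 1 − P<sub>D</sub>, which coincides with P<sub>D</sub> only because P<sub>D</sub> = 0.5 in this study.

```python
import random


def dropout_forward(activations, p_drop=0.5, training=True, rng=random):
    """Classic (non-inverted) dropout applied to a list of activations."""
    keep = 1.0 - p_drop
    if training:
        # Each neuron is dropped independently with probability p_drop.
        return [a if rng.random() < keep else 0.0 for a in activations]
    # Test phase: all neurons are used, scaled by the retention probability
    # (equal to p_drop only when p_drop is 0.5).
    return [a * keep for a in activations]
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the training-phase mask reproducible.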

### Transfer AlexNet to Alcoholism Identification

First, we needed to modify the structure. The last FCL was revised, since the original FCLs were developed to classify 1,000 categories. Twenty randomly selected classes were listed as: scale, barber chair, lorikeet, miniature poodle, Maltese dog, tabby, beer bottle, desktop computer, bow tie, trombone, crash helmet, cucumber, mailbox, pomegranate, Appenzeller, muzzle, snow leopard, mountain bike, padlock, diamondback. We observed that none of them are related to the brain image. Hence, we could not directly apply AlexNet as the feature extractor. Therefore, fine-tuning was necessary.

Since the number of output neurons in the original AlexNet (1,000) is not equal to the number of classes in our task (2), we needed to revise the corresponding softmax layer and classification layer. The revision is shown in **Table 4**. In our transfer learning scheme, we used a new randomly-initialized fully connected layer with two neurons, a softmax layer, and a new classification layer with only two classes (alcoholism and non-alcoholism).

Next, we set the training options. Three subtleties were checked before training. First, the total number of training epochs should be small for transfer learning. In this study, we set the number of training epochs to 10. Second, the global learning rate was set to a small value of 10<sup>−4</sup> to slow learning down, since the early parts of the neural network were pretrained. Third, the learning rate of the new layers was 10 times that of the transferred layers, since the transferred layers had pretrained weights/biases whereas the new layers had randomly initialized weights/biases.

Finally, we varied the number of transferred layers and tested different settings. AlexNet consists of five conv layers (CL1, CL2, CL3, CL4, and CL5) and three fully-connected layers (FCL6, FCL7, and FCL8). In total, we tested five different transfer learning settings, as shown in **Figure 8**. For example, Setting A means that the layers from the first layer to layer A are transferred directly with a learning rate of 10<sup>−4</sup> × 1 = 10<sup>−4</sup>. The later layers, from layer A to the last layer, are randomly initialized with a learning rate of 10<sup>−4</sup> × 10 = 10<sup>−3</sup>.
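The per-layer learning rates implied by a setting can be sketched as a simple lookup. The mapping from the setting letters to cut points is defined in the figure and is not reproduced here, so the hypothetical helper below takes the name of the first randomly initialized layer directly; only the case of replacing FCL8 is tied to a statement in the text.

```python
GLOBAL_LR = 1e-4
LAYERS = ["CL1", "CL2", "CL3", "CL4", "CL5", "FCL6", "FCL7", "FCL8"]


def layer_learning_rates(first_new_layer, transferred_factor=1, new_factor=10):
    """Map each layer to its effective learning rate.

    `first_new_layer` names the first randomly initialized layer; every
    layer before it keeps its pretrained weights and the lower factor.
    """
    cut = LAYERS.index(first_new_layer)
    return {
        name: GLOBAL_LR * (transferred_factor if k < cut else new_factor)
        for k, name in enumerate(LAYERS)
    }


# Replacing only the final fully-connected layer.
rates = layer_learning_rates("FCL8")
```

In this example the seven transferred layers are updated with a rate of 10<sup>−4</sup>, while the new FCL8 is updated with a rate of 10<sup>−3</sup>.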

### Implementation and Measure

We ran the experiment many times. Each time, the training-validation-test division was set at random again. The training procedure stopped when either the algorithm reached the maximum epoch or the validation performance decreased over a preset number of training epochs. We repeatedly tuned the hyperparameters and selected the optimal hyperparameters based on the validation set. After the hyperparameters were fixed, we ran the final model on the test set for 10 runs. The test set confusion matrix across all runs was recorded, and the following five measures were calculated: sensitivity (SEN), specificity (SPC), precision (PRC), accuracy (ACC), and F1 score. Assuming that TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively, the five measures are defined as:

$$\text{SEN} = \frac{TP}{TP + FN} \tag{9}$$

$$\text{SPC} = \frac{TN}{TN + FP} \tag{10}$$

$$\text{PRC} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{11}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

F<sub>1</sub> considers both the precision and the sensitivity to compute the score (34). That is, the F<sub>1</sub> score is the harmonic mean of the previous two measures: precision and sensitivity.

$$F\_1 = \left(\frac{SEN^{-1} + PRC^{-1}}{2}\right)^{-1} \tag{13}$$

Using simple mathematical knowledge, we can obtain:

$$\begin{array}{l} F\_1 = 2 / \left( \frac{T P + FN}{T P} + \frac{T P + FP}{T P} \right) \\ = 2 / \left( \frac{2 T P + FP + FN}{T P} \right) \\ = \frac{2 \times T P}{2 \times T P + FP + FN} \end{array} \tag{14}$$
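The five measures in Eqs 9 to 14 can be computed directly from the four confusion-matrix counts. The sketch below is illustrative; the function name is ours, and the counts in the usage note are hypothetical values chosen to match the size of the test set (78 alcoholic and 81 non-alcoholic images), not results from the paper.

```python
def classification_measures(tp, tn, fp, fn):
    """Return SEN, SPC, PRC, ACC, and F1 (Eqs 9 to 13) as fractions."""
    sen = tp / (tp + fn)
    spc = tn / (tn + fp)
    prc = tp / (tp + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 / (1 / sen + 1 / prc)   # Eq. 13: harmonic mean of SEN and PRC
    return {"SEN": sen, "SPC": spc, "PRC": prc, "ACC": acc, "F1": f1}


# Hypothetical counts for one run on a 78 + 81 image test set.
m = classification_measures(tp=76, tn=79, fp=2, fn=2)
```

The harmonic-mean form used in the code and the closed form 2TP / (2TP + FP + FN) of Eq. 14 give the same value, which offers a quick consistency check.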

Then, the average and standard deviation (SD) of all five measures over the 10 runs on the test set were calculated and used for comparison. For ease of understanding, the pseudocode of our experiment is listed in **Table 5**. The first block splits the dataset into non-test and test sets. In the second block, the non-test set is split randomly into training and validation sets. The performance of the retrained AlexNet model is recorded and used to select the optimal transfer learning setting S<sup>∗</sup>. In the final block, the performance on the test set of the AlexNet retrained using setting S<sup>∗</sup> is recorded and output.

### RESULTS

### Data Augmentation Results

**Figure 9** shows the horizontally flipped image. Here, vertical flipping was not carried out because it can be seen as a combination of horizontal flipping with a 180-degree rotation.

TABLE 5 | Pseudocode of our experiment.

```
[NonTest, Test] = split(Dataset),

for S = [A, B, C, D, E]
    for i = 1:10
        [train(i), valid(i)] = split(NonTest),
        Model(S, i) = TrainNetwork(AlexNet, train(i), valid(i), Setting = S),
        PerfValid(S, i) = Predict(Model(S, i), valid(i)),
    end
    PerfValid(S) = mean(PerfValid(S, i)),
end
S* = argmax[PerfValid(S)],

for i = 1:10
    [train(i), valid(i)] = split(NonTest),
    Model(S*, i) = TrainNetwork(AlexNet, train(i), valid(i), Setting = S*),
    PerfTest(S*, i) = Predict(Model(S*, i), Test),
end
PerfTest(S*) = mean(PerfTest(S*, i)),
Output PerfTest(S*),
```
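The selection protocol in Table 5 can also be sketched in Python. This is only an illustration under stated assumptions: `train_and_score` is a hypothetical callable standing in for retraining AlexNet and scoring it, the 20% validation fraction is an assumed value, and a single scalar score replaces the five measures.

```python
import random
from statistics import mean, stdev


def split(items, valid_fraction, rng):
    """Randomly split items into (train, valid)."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def run_protocol(non_test, test, settings, train_and_score, runs=10, seed=0):
    """Pick the best setting on validation data, then report test performance.

    train_and_score(setting, train, valid, eval_set) is a hypothetical callable
    that trains a model and returns one scalar score on eval_set.
    """
    rng = random.Random(seed)
    # Second block of Table 5: mean validation score per setting.
    valid_perf = {}
    for setting in settings:
        scores = []
        for _ in range(runs):
            train, valid = split(non_test, 0.2, rng)  # 0.2 is an assumed fraction
            scores.append(train_and_score(setting, train, valid, valid))
        valid_perf[setting] = mean(scores)
    best = max(valid_perf, key=valid_perf.get)
    # Final block of Table 5: retrain with the best setting, score on the test set.
    test_scores = []
    for _ in range(runs):
        train, valid = split(non_test, 0.2, rng)
        test_scores.append(train_and_score(best, train, valid, test))
    return best, mean(test_scores), stdev(test_scores)
```

The test set is never used while choosing the setting, which is the point of separating the second and final blocks.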

FIGURE 9 | Data augmentation by horizontal flipping. (A) Original image. (B) Flipped image.
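The reason for omitting vertical flipping can be verified on a toy array. The sketch below uses nested Python lists as a stand-in image; the helper names are hypothetical.

```python
def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the order of the rows (top to bottom)."""
    return [row[:] for row in img[::-1]]


def rotate_180(img):
    """Rotate by 180 degrees: reverse the row order and each row."""
    return [row[::-1] for row in img[::-1]]


image = [[1, 2, 3],
         [4, 5, 6]]

# A horizontal flip followed by a 180-degree rotation equals a vertical flip.
combined = rotate_180(flip_horizontal(image))
```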

**Figure 10** shows the data augmentation results of five different techniques: (a) noise injection; (b) scaling; (c) random translation; (d) image rotation; (e) Gamma correction. Due to the page limit, the data augmentation results on the flipped image are not shown.

### Comparison of Setting of TL

In this experiment, we compared five different TL settings on the validation set. The results of Setting A are shown in **Table 6**, where the last row shows the mean and standard deviation. The results of Setting E are shown in **Table 7**. Due to the page limit, we only show the final results of Settings B, C, and D in **Table 8**.

Here, it can be seen from **Table 8** that Setting E, i.e., replacing FCL8, achieves the best performance among all five settings with respect to all measures. The reasons may be that (i) we expanded a relatively small dataset to a large training set using data augmentation; and (ii) our data are dissimilar to the original 1,000-category dataset. The first fact helps retraining avoid overfitting, and the second suggests that it is more practical to initialize most of the layers with weights from a pretrained model than to freeze those layers. For clarity, we plotted the error bars in **Figure 11**.

TABLE 7 | Ten runs of validation performance of transfer learning using Setting E.

TABLE 8 | Comparison of different settings.

*Bold means the best.*

### Analysis of Optimized TL Setting

The structure of the optimal transfer learning model (Setting E) is listed in **Table 9**. Compared to the traditional AlexNet model, the numbers of weights and biases in FCL8 were reduced from 4,096,000 to 8,192 and from 1,000 to 2, respectively, because our classification task has only two categories. Thus, the total number of weights in the network decreased slightly, from 60,954,656 to 56,866,848.

Nevertheless, we can observe that FCL6 and FCL7 still account for most of the weights and biases. For example, FCL6 holds 37,748,736/56,866,848 = 66.38% of the total weights in this optimal model, and FCL7 holds 16,777,216/56,866,848 = 29.50%. Altogether, the fully connected layers comprise 95.90% of the total weights. This is the main limitation of our method. One remedy is to replace the fully connected layers with 1 × 1 conv layers. Another is to choose small-size transfer learning models, such as SqueezeNet, ResNet, GoogleNet, etc.
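The weight shares quoted above follow directly from the layer shapes reported in Table 9. The short sketch below recomputes them; the variable names are illustrative only.

```python
# (name, number of weights, number of biases) for the learnable layers of the
# retrained model, using the layer shapes reported in Table 9.
layers = [
    ("CL1", 11 * 11 * 3 * 96, 96),
    ("CL2", 5 * 5 * 48 * 256, 256),
    ("CL3", 3 * 3 * 256 * 384, 384),
    ("CL4", 3 * 3 * 192 * 384, 384),
    ("CL5", 3 * 3 * 192 * 256, 256),
    ("FCL6", 4096 * 9216, 4096),
    ("FCL7", 4096 * 4096, 4096),
    ("FCL8", 2 * 4096, 2),  # two output categories instead of 1,000
]

total_weights = sum(weights for _name, weights, _biases in layers)
total_biases = sum(biases for _name, _weights, biases in layers)

# Percentage of all weights held by each layer.
weight_share = {name: 100.0 * weights / total_weights
                for name, weights, _biases in layers}
fcl_share = sum(weight_share[name] for name in ("FCL6", "FCL7", "FCL8"))
```

Replacing the 1,000-way FCL8 of the original AlexNet (4,096,000 weights) with the two-way layer accounts for the whole difference between the two totals.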

### Effect of Data Augmentation

This experiment compared the performance of runs with data augmentation (DA) against runs without it. The transfer learning configuration was set to Setting E. All other parameters and network structures were the same as in the previous experiments. The performance of the 10 runs without DA is shown in **Table 10**. The results in terms of all measures are equal to or slightly above 95%.

The comparison of using DA against not using DA is shown in **Table 11**. We can discern that DA indeed enhances the classification performance. The reason is that having a large dataset is crucial for good performance. The alcoholism image dataset is commonly of small size, and its size can be augmented to the order of tens of thousands (48,320 in this study). AlexNet can make full use of all its parameters with a big dataset. Without using DA, overfitting is likely to occur in the transferred model.

### Results of Proposed Method

In this experiment, we chose Setting E (replace the final block) as shown in **Figure 8**. Here, the retrained neural network was tested on the test set. The results over all 10 runs on the test set are listed in **Table 12** with details of sensitivity, specificity, precision, accuracy, and the F1 score of each run. Setting E yielded a sensitivity of 97.44 ± 1.15%, a specificity of 97.41 ± 1.51%, a precision of 97.34 ± 1.49%, an accuracy of 97.42 ± 0.95%, and an F1 score of 97.37 ± 0.97%. Comparing **Table 12** with **Table 7**, we can see that the mean value of test performance is slightly worse than that of the validation performance, but the standard deviation of the test performance is much smaller than that of the validation performance.

TABLE 9 | Learnable layers in optimal transfer learning model.

| Name | Weights | Weights (%) | Biases | Biases (%) |
| --- | --- | --- | --- | --- |
| CL1 (Ours) | 11 × 11 × 3 × 96 = 34,848 | 0.06 | 1 × 1 × 96 = 96 | 1.00 |
| CL2 (Ours) | 5 × 5 × 48 × 256 = 307,200 | 0.54 | 1 × 1 × 256 = 256 | 2.68 |
| CL3 (Ours) | 3 × 3 × 256 × 384 = 884,736 | 1.56 | 1 × 1 × 384 = 384 | 4.01 |
| CL4 (Ours) | 3 × 3 × 192 × 384 = 663,552 | 1.17 | 1 × 1 × 384 = 384 | 4.01 |
| CL5 (Ours) | 3 × 3 × 192 × 256 = 442,368 | 0.78 | 1 × 1 × 256 = 256 | 2.68 |
| FCL6 (Ours) | 4096 × 9216 = 37,748,736 | 66.38 | 4096 × 1 = 4,096 | 42.80 |
| FCL7 (Ours) | 4096 × 4096 = 16,777,216 | 29.50 | 4096 × 1 = 4,096 | 42.80 |
| FCL8 (AlexNet) | 1000 × 4096 = 4,096,000 | | 1000 × 1 = 1,000 | |
| FCL8 (Ours) | 2 × 4096 = 8,192 | 0.01 | 2 × 1 = 2 | 0.02 |
| CL Subtotal (AlexNet) | 2,332,704 | | 1,376 | |
| CL Subtotal (Ours) | 2,332,704 | 4.10 | 1,376 | 14.38 |
| FCL Subtotal (AlexNet) | 58,621,952 | | 9,192 | |
| FCL Subtotal (Ours) | 54,534,144 | 95.90 | 8,194 | 85.62 |
| Total (AlexNet) | 60,954,656 | | 10,568 | |
| Total (Ours) | 56,866,848 | 100 | 9,570 | 100 |

### Comparison to Alcoholism Identification Approaches

The proposed transfer learning approach was compared with seven state-of-the-art approaches: PAC-PSO (4), HWT (5), LR (6), CSO (7), WRE (8), SVM-GA (9), and LMCoP (10). The comparison results are shown in **Table 13**. The corresponding bar plot is shown in **Figure 12**. We can observe that our AlexNet transfer learning model improves on the next best approach by more than 3%.

The reason is that the proposed model does not need manually designed features; instead, it uses the features learned by a pretrained model as initialization and fine-tunes them on the augmented training set. This has two advantages. First, development is quite fast and can take <1 day. Second, the features can be fine-tuned to be more appropriate to this alcoholism classification task than manually designed features.

Bio-inspired algorithms may help retrain our AlexNet model. Particle swarm optimization (PSO) (35–37) and other methods will be tested. Cloud computing (38) in particular can be integrated into our method to help diagnose remote patients.

TABLE 10 | Ten runs without using data augmentation (Setting E).


TABLE 11 | Effect of using data augmentation technique.


TABLE 12 | Ten runs of proposed method on the test set (Setting E).


TABLE 13 | Comparison with state-of-the-art approaches.


### CONCLUSIONS

In this study, we proposed an AlexNet-based transfer learning method and applied it to the alcoholism identification task. This paper may be the first to use transfer learning in the field of alcoholism identification. The results showed that the proposed approach achieved promising results with a sensitivity of 97.44 ± 1.15%, a specificity of 97.41 ± 1.51%, a precision of 97.34 ± 1.49%, an accuracy of 97.42 ± 0.95%, and an F1 score of 97.37 ± 0.97%.

Future studies may include the following points: (i) other deeper transfer learning models, such as ResNet, DenseNet, GoogleNet, SqueezeNet, etc., should be tested; (ii) other data augmentation techniques should be attempted, since our dataset is currently small and data augmentation may have a distinct effect on improving the performance; (iii) how to set the learning rate factor of each individual layer in the whole neural network remains a challenge and needs to be solved; (iv) this method is ready to run on a larger dataset and can assist radiologists in their routine screening of brain MR images.

### DATA AVAILABILITY

The datasets for this manuscript are not publicly available because we need approval from our affiliations. Requests to access the datasets should be directed to yudongzhang@ieee.org.

### AUTHOR CONTRIBUTIONS

S-HW and Y-DZ conceived the study. SX and XC designed the model. CT, JS, and DG analyzed the data. S-HW, XC, and Y-DZ acquired the preprocessed data. SX and CT wrote the draft. S-HW, CT, JS, and Y-DZ interpreted the results. DG provided English revision of this paper. All authors provided critical revision and consent for this submission.

### FUNDING

The authors are grateful for the financial support of the Zhejiang Provincial Natural Science Foundation of China (LY17F010003, Y18F010018), the National key research and development plan (2017YFB1103202), the Open Fund of Guangxi Key Laboratory of Manufacturing System and Advanced Manufacturing Technology (17-259-05-011K), the Natural Science Foundation of China (61602250, U1711263, U1811264), and the Henan Key Research and Development Project (182102310629).

### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Xie, Chen, Guttery, Tang, Sun and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Fully Automatic Framework for Parkinson's Disease Diagnosis by Multi-Modality Images

Jiahang Xu1,2, Fangyang Jiao<sup>3</sup> , Yechong Huang<sup>1</sup> , Xinzhe Luo<sup>1</sup> , Qian Xu<sup>4</sup> , Ling Li<sup>5</sup> , Xueling Liu<sup>6</sup> , Chuantao Zuo<sup>5</sup> , Ping Wu<sup>5</sup> \* and Xiahai Zhuang1,2 \*

<sup>1</sup> School of Data Science, Fudan University, Shanghai, China, <sup>2</sup> Fudan-Xinzailing Joint Research Center for Big Data, Fudan University, Shanghai, China, <sup>3</sup> Department of Nuclear Medicine, Daping Hospital, Army Medical University, Chongqing, China, <sup>4</sup> Department of Nuclear Medicine, North Huashan Hospital, Fudan University, Shanghai, China, <sup>5</sup> PET Center, Huashan Hospital, Fudan University, Shanghai, China, <sup>6</sup> Department of Radiology, Huashan Hospital, Fudan University, Shanghai, China

Background: Parkinson's disease (PD) is a prevalent long-term neurodegenerative disease. Though the criteria of PD diagnosis are relatively well defined, current diagnostic procedures using medical images are labor-intensive and expertise-demanding. Hence, highly integrated automatic diagnostic algorithms are desirable.

### Edited by:

Siyang Zuo, Tianjin University, China

#### Reviewed by:

Delia Cabrera DeBuc, University of Miami, United States Danny J. J. Wang, University of Southern California, United States

#### \*Correspondence:

Xiahai Zhuang zxh@fudan.edu.cn Ping Wu wupingpet@fudan.edu.cn

#### Specialty section:

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

Received: 14 February 2019 Accepted: 05 August 2019 Published: 23 August 2019

#### Citation:

Xu J, Jiao F, Huang Y, Luo X, Xu Q, Li L, Liu X, Zuo C, Wu P and Zhuang X (2019) A Fully Automatic Framework for Parkinson's Disease Diagnosis by Multi-Modality Images. Front. Neurosci. 13:874. doi: 10.3389/fnins.2019.00874

Methods: In this work, we propose an end-to-end multi-modality diagnostic framework, including segmentation, registration, feature extraction and machine learning, to analyze the features of the striatum for PD diagnosis. Multi-modality images, including T1-weighted MRI and <sup>11</sup>C-CFT PET, are integrated into the proposed framework. The reliability of this method is validated on a dataset with paired images from 49 PD subjects and 18 normal (NL) subjects.

Results: We obtained a promising diagnostic accuracy in the PD/NL classification task. Meanwhile, several comparative experiments were conducted to validate the performance of the proposed framework.

Conclusion: We demonstrated that (1) the automatic segmentation provides accurate results for the diagnostic framework, (2) the method combining multi-modality images generates a better prediction accuracy than the method with single-modality PET images, and (3) the volume of the striatum is proved to be irrelevant to PD diagnosis.

#### Keywords: Parkinson's disease, multi-modality, image classification, U-Net, striatum

### INTRODUCTION

Parkinson's disease (PD) is the second-most prevalent long-term neurodegenerative disease and is characterized by bradykinesia, rigidity and rest tremor (Postuma et al., 2015). At present, PD is responsible for about 346,000 deaths per year and is thus one of the major concerns of the neuroscience community (Roth et al., 2018). The diagnosis of PD mainly refers to the Movement Disorder Society Clinical Diagnostic Criteria for Parkinson's disease (MDS-PD Criteria) (Postuma et al., 2015). According to the MDS-PD criteria, the motor symptoms of PD are linked with the loss of dopaminergic neurons, which mainly affects the anatomical regions of the striatum (SARs). Therefore, SARs, which include the caudate nucleus, the putamen and the pallidum, are commonly explored (Strafella et al., 2017).

Functional neuroimaging of the presynaptic dopaminergic system is highlighted in the MDS-PD criteria (Liu et al., 2018). Positron-emission tomography (PET) is one of the neuroimaging modalities that indicate the regional activity of tissues. Accordingly, PET tracers have been developed to observe the activity of the dopamine transporter (DAT) in the early stage of PD, such as <sup>11</sup>C-CFT, a biomarker of the presynaptic dopaminergic system with high sensitivity (Kazumata et al., 1998; Ilgin et al., 1999; Wang et al., 2013). However, due to the low resolution of PET images, the anatomical and structural information about the brain that PET can provide is limited. Therefore, structural neuroimaging methods, such as T1-weighted magnetic resonance imaging (T1-MRI), are introduced to assist the multi-modality diagnosis of PD (Ibarretxe-Bilbao et al., 2011). Bu et al. (2018) studied the subtypes of multiple system atrophy (MSA) using T1-MRI and <sup>11</sup>C-CFT PET. Huang et al. (2019) combined these two modalities with <sup>18</sup>F-FDG PET to study Rapid Eye Movement (REM) Sleep Behavior Disorder. In both studies, T1-MR images were registered to PET images to identify the regions of interest (ROIs) in the PET images.

Recently, researchers have attempted to improve the accuracy of diagnostic methods with the help of machine learning algorithms, among which the support vector machine (SVM) has been widely used. Long et al. (2012) used SVM to distinguish early PD patients from NL subjects utilizing resting-state functional MRI, and obtained an accuracy of 86.96%. Haller et al. (2012) used SVM and reached an accuracy of 97% when classifying PD from other atypical forms of Parkinsonism by combining Diffusion Tensor Imaging (DTI) and <sup>123</sup>I ioflupane Single-Photon Emission Computed Tomography (SPECT). These works combining multi-modality imaging have demonstrated the reliability of artificial intelligence (AI)-assisted PD diagnosis; however, to the best of our knowledge, few reported works include <sup>11</sup>C-CFT PET.

In this work, we proposed an end-to-end multi-modality diagnostic framework for PD combining T1-MR and <sup>11</sup>C-CFT PET images. In the framework, MR images were segmented by a U-Net (Ronneberger et al., 2015; Wong et al., 2018). The resulting segmentation was then used to locate the SARs of the PET images by registration. Finally, features extracted from these SARs were used to diagnose PD. Our main contributions include:


### METHODOLOGY

The proposed framework is shown in **Figure 1**. It contains three major steps: (1) segmentation, (2) registration, and (3) feature extraction and prediction. In the first two steps, MRI-assisted PET segmentation is performed by MRI segmentation and MRI-PET registration, and in the subsequent step, only information of PET images is considered for PD diagnosis.

### Striatum Segmentation via Deep Neural Network

To obtain the fine structure of the brain tissues, a 3D deep neural network, i.e., U-Net (Ronneberger et al., 2015; Wong et al., 2018), is implemented to segment the striatum in the MR images. The obtained segmentation is used as a reference for SAR localization and extraction in the subsequent procedures.

**Figure 2** shows the network architecture for the segmentation, which outputs a mask indicating the segmented labels of the input image. The network further incorporates the idea of deep supervision introduced by Mehta and Sivaswamy (2017) for faster training convergence. Specifically, the network comprises encoding and decoding paths. The encoding path captures contextual information by residual blocks and max-pooling operations at different resolutions, while the decoding path sequentially recovers the spatial resolution and object boundaries. Besides, skip connections between the upsampled feature maps in the decoder and the corresponding feature maps in the encoder are employed for the combination of local and contextual information. Moreover, the deep supervision scheme is adopted to allow more direct backpropagation to the hidden layers for faster convergence. A final 1 × 1 × 1 convolution layer with a softmax function produces the segmentation probabilities. Gaussian blurring and dropout operations are adopted to avoid overfitting. A loss function is defined to handle the relatively small anatomical structures of labels for accurate segmentation, i.e.,

$$L = w\_{D} L\_{Dice} + w\_{C} L\_{Cross},\tag{1}$$

where w<sub>D</sub> and w<sub>C</sub> denote the weights of L<sub>Dice</sub> and L<sub>Cross</sub>, respectively; L<sub>Dice</sub> denotes the Dice-related loss, and L<sub>Cross</sub> denotes the cross-entropy. They are, respectively, given by

$$L\_{Dice} = E\_i \left[ \left( -\ln Dice\_i \right)^{\gamma} \right],\tag{2}$$

with

$$Dice\_i = \frac{2\left(\sum\_{\mathbf{x}} \delta\_{il}\left(\mathbf{x}\right) \cdot p\_i\left(\mathbf{x}\right)\right)}{\sum\_{\mathbf{x}} \left(\delta\_{il}\left(\mathbf{x}\right) + p\_i\left(\mathbf{x}\right)\right)},\tag{3}$$

and

$$L\_{Cross} = E\_{\mathbf{x}} \left[ -\ln p\_l \left( \mathbf{x} \right) \right]. \tag{4}$$

In Eq. (3), δ<sub>il</sub>(x) is the Kronecker delta, which equals 1 if the label i equals the ground-truth label l(x) at the voxel position x, and 0 otherwise; p<sub>i</sub>(x) is the probability of voxel x being labeled as i. In our implementation, we chose w<sub>D</sub> = 0.8, w<sub>C</sub> = 0.2 and γ = 0.3 for the loss function and pretrained the model using an Adam optimizer with a learning rate of 1 × 10<sup>−3</sup> for 10 epochs (Kingma and Ba, 2014). Due to computational limitations, an ROI of 96 × 96 × 96 voxels, which contains the whole structure of the SARs, was cropped from each MR image.
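To make Eqs. (1)-(4) concrete, the following minimal sketch evaluates the combined loss on flattened voxel lists in plain Python. It is an illustration under assumptions rather than the implementation used in this work: the function name, the small constant `eps` added for numerical stability, and the choice to average the Dice term over all classes are hypothetical, and a practical version would operate on tensors in a deep learning library.

```python
import math


def combined_loss(probs, labels, w_dice=0.8, w_cross=0.2, gamma=0.3, eps=1e-7):
    """Weighted sum of the Dice-related loss and the cross-entropy (Eq. 1).

    probs  : one list of class probabilities per voxel
    labels : ground-truth class index per voxel
    eps    : assumed stabilizer that avoids log(0) and division by zero
    """
    n_classes = len(probs[0])
    dice_terms = []
    for i in range(n_classes):
        # Numerator and denominator of Eq. (3) for class i.
        overlap = sum(p[i] for p, l in zip(probs, labels) if l == i)
        denom = sum((1.0 if l == i else 0.0) + p[i] for p, l in zip(probs, labels))
        dice_i = (2.0 * overlap + eps) / (denom + eps)
        # Clamp at zero so rounding error cannot produce a negative base.
        dice_terms.append(max(0.0, -math.log(dice_i)) ** gamma)
    l_dice = sum(dice_terms) / n_classes                     # Eq. (2)
    l_cross = sum(-math.log(max(p[l], eps))
                  for p, l in zip(probs, labels)) / len(labels)   # Eq. (4)
    return w_dice * l_dice + w_cross * l_cross               # Eq. (1)


labels = [0, 1]
perfect = combined_loss([[1.0, 0.0], [0.0, 1.0]], labels)
uncertain = combined_loss([[0.6, 0.4], [0.3, 0.7]], labels)
```

With γ < 1, the term (−ln Dice)<sup>γ</sup> has a steep slope near Dice = 1, so the penalty for a small structure that is almost, but not perfectly, segmented does not vanish too quickly.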

We employ T1-MR brain images from the Alzheimer's Disease Neuroimaging Initiative (ADNI<sup>1</sup> ) database for pretraining, because the clinical dataset used in this work is far too small for training the U-Net. Note that information related to AD and data from other modalities are not used in this study; we solely employ the 1,859 brain T1-MR images to assist the U-Net training. The ADNI MR images are segmented by the multi-atlas label propagation with the expectation-maximization (MALP-EM<sup>2</sup> ) framework (Ledig et al., 2015). The manual segmentation of the caudate nucleus, the putamen and the pallidum is chosen as the gold standard in the pretraining stage.

### Combining Two-Modality Images via Image Registration

We propose to combine two-modality images for the automatic diagnosis of PD, where T1-MRI provides the morphological information of SARs, and <sup>11</sup>C-CFT offers pathological information related to PD. The extraction of the SAR information from the MR images is achieved by the DNN segmentation method, as described in Section "Striatum Segmentation via Deep Neural Network." With this information, one can extract the shape or substructure features from each of the anatomical regions. For the combination of the two-modality images, we propose to use image registration, which propagates the anatomical and structural information of SARs in the MRI to the PET.

<sup>1</sup> adni.loni.usc.edu

<sup>2</sup>https://biomedia.doc.ic.ac.uk/software/malp-em/

The registration in the multi-modality diagnostic framework is achieved via the zxhproj<sup>3</sup> platform (Zhuang et al., 2011). Firstly, the image with prior label information is registered to the target PET image. The resulting transformation is then used to propagate the prior label information to the PET, which results in the automatic localization of the SARs for the target PET image. Since the MRI and PET images are from the same subject at the same acquisition session, we propose to use a rigid registration. By registration, the caudate nucleus, the putamen and the pallidum, as well as the parieto-occipital regions are labeled.

For comparison, we also built a single-modality diagnostic framework using solely PET images. To keep the diagnosis fully automated, the anatomical information in the PET images is obtained via the same registration method used for the multi-modality scheme. In this scenario, the image with prior label information is a pre-labeled PET template, and the registration between the template and the target PET is achieved via a rigid registration followed by an affine registration.

### Feature Extraction and Prediction

To extract adequate features from the SARs, the caudate nucleus and the putamen are further divided into three substructures using a k-means algorithm (Ng et al., 2006). After clustering, statistics of image intensity are calculated to represent the feature information in each region, including the maximum, minimum, median, 1st and 3rd quartiles, and mean of PET intensity. Several studies characterize radioactive uptake by the striatal-to-occipital ratio (SOR), as the parieto-occipital region is widely considered to lack CFT uptake (Ma et al., 2002; Carbon et al., 2004; Huang et al., 2007). In this work, the SOR, which is defined as (striatum − occipital)/occipital, is calculated with each kind of intensity value. Meanwhile, the volumes of the six anatomical SARs are included in the feature set. In all, 90 features are generated (for a list of specific features, see **Supplementary Table S1**).
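The intensity statistics and the SOR described above can be sketched as follows. Several details are assumptions made for illustration: the function names are hypothetical, the quartile convention is the default of Python's `statistics.quantiles`, and each regional statistic is paired with the same statistic of the parieto-occipital reference region, which the text does not specify. The feature count at the end is one reading that is consistent with the reported total of 90.

```python
from statistics import mean, median, quantiles


def intensity_statistics(values):
    """Six intensity statistics of one region (list of voxel intensities)."""
    q1, _q2, q3 = quantiles(values, n=4)  # quartile convention is an assumption
    return {"max": max(values), "min": min(values), "median": median(values),
            "q1": q1, "q3": q3, "mean": mean(values)}


def striatal_to_occipital_ratio(striatal_value, occipital_value):
    """SOR = (striatum - occipital) / occipital."""
    return (striatal_value - occipital_value) / occipital_value


def sor_features(region_values, occipital_values):
    """SOR for each statistic, pairing like with like (an assumed convention)."""
    region = intensity_statistics(region_values)
    reference = intensity_statistics(occipital_values)
    return {name: striatal_to_occipital_ratio(region[name], reference[name])
            for name in region}


# One plausible breakdown of the 90 features: on each side, three caudate
# substructures, three putamen substructures and the pallidum give 14 regions
# with 6 statistics each, plus the volumes of the 6 anatomical SARs.
n_regions = 2 * (3 + 3 + 1)
n_features = n_regions * 6 + 6
```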

After feature extraction, a t-test is performed to assess the significance of each feature. With the significance level set to α = 0.01, features with a p-value less than 0.01 are considered statistically significant. Only significant features are used as inputs to the machine learning models.

Subsequently, an SVM classifier is trained to classify the subjects (Haller et al., 2012; Long et al., 2012). Furthermore, to estimate the generalization ability and stability of the method, the leave-n-out cross-validation strategy is employed to evaluate the performance of the models. In addition, we implement the random forest algorithm to calculate the importance of the features (Gray et al., 2013).
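The leave-n-out strategy can be sketched as below. This is an illustration only: `fit_predict` is a hypothetical callable standing in for SVM training and prediction, the toy nearest-mean classifier is a stand-in and not an SVM, and the optional `max_splits` cap is an assumption, since the number of possible test sets grows combinatorially with n and the text does not state how splits were enumerated or sampled.

```python
from itertools import combinations


def leave_n_out_accuracy(features, labels, n, fit_predict, max_splits=None):
    """Average accuracy over leave-n-out splits.

    fit_predict(train_x, train_y, test_x) is a hypothetical callable that
    trains a classifier and returns predicted labels for test_x.
    """
    indices = range(len(labels))
    accuracies = []
    for k, test_idx in enumerate(combinations(indices, n)):
        if max_splits is not None and k >= max_splits:
            break
        held_out = set(test_idx)
        train_x = [features[i] for i in indices if i not in held_out]
        train_y = [labels[i] for i in indices if i not in held_out]
        test_x = [features[i] for i in test_idx]
        predicted = fit_predict(train_x, train_y, test_x)
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / n)
    return sum(accuracies) / len(accuracies)


def nearest_mean_classifier(train_x, train_y, test_x):
    """Toy stand-in classifier (not an SVM): pick the class with the closest mean."""
    means = {}
    for label in set(train_y):
        values = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return [min(means, key=lambda c: abs(x - means[c])) for x in test_x]


# Hypothetical one-dimensional features for six subjects in two classes.
toy_features = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
toy_labels = [0, 0, 0, 1, 1, 1]
acc_n1 = leave_n_out_accuracy(toy_features, toy_labels, 1, nearest_mean_classifier)
acc_n2 = leave_n_out_accuracy(toy_features, toy_labels, 2, nearest_mean_classifier)
```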

### EXPERIMENTS

This section is organized as follows. Section "Data Acquisition" describes the data used in this work; Section "Evaluation of Automatic Striatum Segmentation" validates the reliability of the automatic segmentation method; Section "Advantages of Multi-Modality Images" investigates the advantages of combining multi-modality images; and Section "Efficacy of Volume Features" explores the efficacy of the volume features of SARs for the diagnosis of PD.

### Data Acquisition

The data used in this study were collected from the Department of Neurology, Huashan Hospital, Fudan University. They contain paired <sup>11</sup>C-CFT PET and T1-MR images of PD patients and healthy participants. MR images were acquired by a 3.0-T MR scanner (Discovery™ MR750, GE Healthcare, Milwaukee, WI, United States). Each MR image was visually inspected to rule out motion artifacts (Bu et al., 2018; Huang et al., 2019). PET images were acquired by a Siemens Biograph 64 PET/CT scanner (Siemens, Munich, Germany) in three-dimensional (3D) mode. A CT transmission scan was first performed for attenuation correction. Static emission data were acquired 60 min after the intravenous injection of 370 MBq of <sup>11</sup>C-CFT and lasted for 15 min. All subjects were scanned in a supine position in a dimly lit and noise-free environment (Bu et al., 2018; Huang et al., 2019). The synchronous MRI data were acquired using a T1-weighted 3D inversion recovery spoiled gradient recalled acquisition (IR-SPGR) with the following parameters: TE/TR = 2.8/6.6 ms, inversion time = 400 ms, flip angle = 15°, matrix = 256 × 256 × 170, field-of-view = 24 cm, and slice thickness = 1 mm. The MR and PET acquisitions for each subject were no more than 3 months apart.

Forty-nine patients with PD and 18 age-matched normal control (NL) subjects were recruited. All subjects were screened and clinically examined by a senior investigator of movement disorders before entering the study and were followed up for at least 1 year. The diagnosis of PD was made referring to the MDS-PD Criteria. The Unified Parkinson's Disease Rating Scale (UPDRS) and Hoehn and Yahr scale (HY) were assessed after the cessation of oral anti-parkinsonian medications (if used) for at least 12 h. The following exclusion criteria were used for the NL subjects' recruitment: (1) testing positive on the REM Sleep Behavior Disorder Single-Question Screen (Postuma et al., 2012), (2) a history of neurological or psychiatric illness, (3) a prior exposure to neuroleptic agents or drugs, (4) an abnormal neurological examination. The data are summarized in **Table 1**. In this study, gender proportion differences between groups could be ignored, as previous studies have shown no significant difference in DAT bindings between genders (Eshuis et al., 2009). The research was approved by the Ethics Committee of Huashan Hospital. All subjects or legally responsible relatives signed written informed consent in accordance with the Declaration of Helsinki before the study.

TABLE 1 | Summary for the studied dataset.

For gender, the expression means Male/Female, and for age and Unified Parkinson's Disease Rating Scale, the expression means mean ± standard deviation.

<sup>3</sup>http://www.sdspeople.fudan.edu.cn/zhuangxiahai/0/zxhproj/

After data acquisition, both sides of the caudate nucleus, the putamen and the pallidum in each MR image were manually labeled by an experienced clinician from the Department of Neurology, Huashan Hospital. To ensure the quality of the segmentation results, the boundaries of these anatomical structures were double-checked by another clinician from the same department.

### Evaluation of Automatic Striatum Segmentation

To test the performance of the segmentation network, three-fold cross-validation was performed. The whole dataset was split into three disjoint parts, and the model was fine-tuned for 5 epochs on the union of every two disjoint subsets. **Table 2** illustrates the average Dice Similarity Coefficient (DSC) of each anatomical region, and **Figure 3** provides a visualization of the segmentation results of five example cases. One can find that the left pallidum (colored goldenrod in **Figure 3**) is worst segmented with the maximal standard deviation while the right putamen (colored olive drab in **Figure 3**) is best segmented with the minimal standard deviation.
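For reference, the Dice Similarity Coefficient of one anatomical region can be computed from two label maps as in the sketch below. The function name, the flattened-list representation, the label values, and the convention of returning 1.0 when the region is absent from both maps are assumptions made for illustration.

```python
def dice_similarity(pred_labels, true_labels, region):
    """DSC = 2 |A ∩ B| / (|A| + |B|) for the voxels labeled `region`."""
    pred = [label == region for label in pred_labels]
    true = [label == region for label in true_labels]
    intersection = sum(p and t for p, t in zip(pred, true))
    size_sum = sum(pred) + sum(true)
    # Assumed convention: two empty masks are treated as a perfect match.
    return 2.0 * intersection / size_sum if size_sum else 1.0


# Hypothetical flattened label maps (0 = background, 1 and 2 = two regions).
automatic = [0, 1, 1, 2, 2, 2]
manual = [0, 1, 2, 2, 2, 0]
dsc_region_1 = dice_similarity(automatic, manual, region=1)
dsc_region_2 = dice_similarity(automatic, manual, region=2)
```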

**Figure 4** shows the average accuracy (ACC) and the number of wrong predictions under leave-n-out cross-validation for the two segmentation methods, i.e., automatic and manual. Both accuracies reached 100% when n = 1, and the accuracies and numbers of wrong predictions of the two experiments showed no significant difference in a pairwise t-test (p-value = 0.1017). Furthermore, when training the classifier on features from manually segmented images and testing it on features from automatically segmented images, we still obtained 100% accuracy. All results indicate that the automatic segmentation provides accurate results for the proposed diagnostic framework of PD.

### Advantages of Multi-Modality Images

To evaluate the influence of multi-modality images, the single-modality method using solely PET images was compared. In the multi-modality scheme, the MR images provide accurate anatomical and structural SAR information of the subject. By contrast, in the single-modality method this information is obtained by registering the PET images to a pre-labeled Automated Anatomical Labeling (AAL) PET template. We conducted the rest of the pipeline in the same way for the two methods.

**Figure 5** shows the results of the comparative experiments with leave-n-out cross-validation. The results demonstrate that with the assistance of MR images, the performance of the multi-modality group is better than that of the single-modality PET group in the PD/NL task. When n = 1, the accuracy of the multi-modality group reached 100% in the PD/NL task, while the accuracy of the single-modality PET group was 98.51%.

FIGURE 4 | The results for the leave-n-out cross-validation of the classification with automatic segmentation and manual segmentation. Panel (A) presents the average ACC, and panel (B) presents the average number of wrong predictions. The horizontal axes in the two panels represent the number of subjects in the test set, i.e., the n in the leave-n-out cross-validation.

FIGURE 5 | The results for the leave-n-out cross-validation of the classification by the multi-modality diagnostic method and the single-modality method. Panel (A) presents the average ACC, and panel (B) presents the average number of wrong predictions. The horizontal axes in the two panels represent the number of subjects in the test set, i.e., the n in the leave-n-out cross-validation.

The importance values were calculated by the random forest algorithm.

To test the uniformity of the classifiers based on the different groups, we also trained the classifier using features of multi-modality images and tested it using features of single-modality PET, and the accuracy dropped to 88.05%, with 8 subjects misclassified.

TABLE 4 | Feature importance of groups with manual segmentation results and automatic segmentation results.


The importance values were calculated by the random forest algorithm.

### Efficacy of Volume Features

In the feature extraction step, t-tests were performed to evaluate the significance of all features, and the results indicated that the volume features are not statistically significant at α = 0.01 (see **Supplementary Table S1** for more details). To further evaluate the effect of the volume of SARs, we compared the importance of different features based on groups with and without volume, as **Table 3** shows. One can see that the importance values of the two groups are similar. Hence, the effect of volume on the model is negligible. Note that the volume is calculated based on the original MR images without downsampling.

### DISCUSSION AND CONCLUSION

In this work, we proposed a fully automatic framework for PD diagnosis. This method utilized two modalities, i.e., <sup>11</sup>C-CFT PET and T1-MR imaging, performed MRI-assisted PET segmentation, selected features and employed SVM to give the predictions. To validate the performance of the framework, we applied the proposed method on the clinical data from Huashan Hospital.

One of the major differences between the proposed method and the traditional methods is that the SARs are located according to the labels of the automatic segmentation by U-Net. To evaluate the performance of the U-Net, we calculated the DSCs between automatic and manual segmentation. In addition, we compared the proposed pipeline, whose SARs were located according to the automatic segmentation, to the method whose SARs were manually segmented. The leave-n-out experiment shows the two methods performed comparably,

FIGURE 6 | The comparison of gold standard and wrongly placed SARs of the wrongly predicted subject. Panel (A) shows the segmentation result in gold standard, and panel (B) shows the segmentation in the wrongly predicted subject. Images are in sagittal plane and have the same cursor position.


indicating that the automatic segmentation could provide accurate results for the proposed diagnostic framework of PD. The feature importance of the two groups is further compared in **Table 4**, which indicates that the minimum has lower importance than the first five features. Given that the striatum has a higher uptake value than its adjacent areas, voxels with minimal intensity are more likely to appear on the edge of the SARs. Therefore, inaccurate delineation of the anatomical boundary, a potential result of automatic segmentation, is unlikely to cause a significant decline in the performance of the overall diagnostic framework.
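The agreement between automatic and manual segmentation is summarized above by the Dice similarity coefficient (DSC). As a minimal sketch, assuming each segmentation is available as a flat sequence of 0/1 voxel labels (the function name and the toy masks below are hypothetical and are not taken from the study), the DSC could be computed as follows:

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks (flat 0/1 sequences)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labeled as foreground in both masks.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumed convention: two empty masks are treated as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example (hypothetical labels, not data from the study).
manual = [0, 1, 1, 1, 0, 0]
automatic = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(manual, automatic))
```

A DSC of 1 indicates identical masks and 0 indicates no overlap; in practice the masks would be flattened 3D label volumes for each SAR.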

An alternative way to locate SARs for subsequent feature extraction is to apply a pre-labeled PET template by registration. In Section "Advantages of Multi-Modality Images," the AAL PET template was used as the PET template and was registered to PET images for the localization of SARs. Experiments show that, although this single-modality PET approach gives favorable predictions, its diagnostic capability is lower than that of the proposed multi-modality framework. This is because the localization of the SARs plays a significant role in the diagnostic framework, and the additional structural information from MR images can better locate SARs. **Figure 6** demonstrates that the single-modality PET approach might be affected by the erroneous delineation of the SARs. The error could be attributed to the inter-subject variations in brain structures that are ignored when defining SARs from a PET template.

To test the uniformity of the classifiers based on different segmentation approaches, we trained a classifier using features of manually segmented multi-modality images and tested it using features of the other methods. When testing with features of the multi-modality automatic segmentation method, we still obtained 100% accuracy, indicating that the features of manual and automatic segmentation are highly consistent. However, testing with the single-modality method resulted in an accuracy of 88.05%. The lower accuracy might be explained by the lack of adequate extracted features due to the falsely located SARs. Hence, compared with the multi-modality group, the single-modality PET group naturally needs more feature engineering and better-designed algorithms.

In the feature extraction, features of volume were rejected according to the t-test. A possible explanation is that the volume of SARs does not change significantly with the progression of PD, as concluded from the literature (Ibarretxe-Bilbao et al., 2011). **Figure 7** shows the heatmaps of feature distribution on the SARs, displaying the influence of each subregion on the classification in the PD/NL task. The difference in influence is expressed by the color scale. The region most relevant to the separation of PD/NL is localized in the middle and rear of the putamen, followed by the pallidum, while the caudate nucleus shows the least significance for this task.

Several future studies could be explored based on the current pipeline. Firstly, the classifiers can be trained with a Parkinsonian disorders (PDS) dataset to classify PD and atypical PDS, such as MSA and Progressive Supranuclear Palsy (PSP), which has important clinical value but is highly challenging. Secondly, this framework currently contains only medical imaging information; other information, such as age, gender, motor ratings and other biomarkers, is not included and may further improve the diagnostic accuracy. Future research could incorporate additional multimodal data for better disease prediction. Finally, the sample size in this work is relatively small, and a larger dataset is expected to validate our experimental results and improve the performance of the framework.

To conclude, we proposed a fully automatic framework combining the two modalities for PD diagnosis. This framework obtained a promising diagnostic accuracy in the PD/NL task. In addition, this work also emphasized the high value of the <sup>11</sup>C-CFT PET in the PD diagnosis.

### DATA AVAILABILITY

The datasets generated for this study are available on request to the corresponding author.

## ETHICS STATEMENT

All data were collected from the Department of Neurology, Huashan Hospital, Fudan University, and the study was approved by the Ethics Committee of Huashan Hospital. All subjects or a legally responsible relative gave written informed consent before the study.

## AUTHOR CONTRIBUTIONS

XZ is the principal investigator of this work, designed the experiments, and supervised and revised the manuscript. PW co-investigated the research and revised the manuscript. CZ co-investigated the research. JX led the implementation and experiments and wrote the manuscript. YH co-led the work and wrote the manuscript. XiL provided support for the coding, experiments, and manuscript writing. FJ, QX, and LL collected the data. XuL collected the data and segmented the striatum of the subjects.

### FUNDING

This work was funded by the Science and Technology Commission of Shanghai Municipality Grant (17JC1401600), and the National Natural Science Foundation of China (NSFC) Grant (61971142). Collection of the PD data was supported by the National Natural Science Foundation of China (81671239 and 81361120393).

### ACKNOWLEDGMENTS


We thank Lei Li and Yuncheng Zhou, whose dedicated comments helped us to modify the organization of the manuscript and improve its quality.

### REFERENCES


### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2019.00874/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xu, Jiao, Huang, Luo, Xu, Li, Liu, Zuo, Wu and Zhuang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Brain White Matter Hyperintensity Lesion Characterization in T<sub>2</sub> Fluid-Attenuated Inversion Recovery Magnetic Resonance Images: Shape, Texture, and Potential Growth

Chih-Ying Gwo<sup>1</sup>\*<sup>†</sup>, David C. Zhu<sup>2†</sup> and Rong Zhang<sup>3†</sup>

*<sup>1</sup> Department of Information Management, Chien Hsin University of Science and Technology, Zhongli District, Taiwan, <sup>2</sup> Department of Radiology and Psychology, and Cognitive Imaging Research Center, Michigan State University, East Lansing, MI, United States, <sup>3</sup> Department of Neurology and Neurotherapeutics, Department of Internal Medicine, University of Texas Southwestern Medical Center and Institute for Exercise and Environmental Medicine, Texas Health Presbyterian Hospital Dallas, Dallas, TX, United States*

### Edited by:

*Nianyin Zeng, Xiamen University, China*

### Reviewed by:

*Xia-an Bi, Hunan Normal University, China Yuan Yang, Northwestern University, United States*

> \*Correspondence: *Chih-Ying Gwo ericgwo@uch.edu.tw*

*†These authors have contributed equally to this work as co-authors*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *31 January 2019* Accepted: *27 March 2019* Published: *16 April 2019*

#### Citation:

*Gwo C-Y, Zhu DC and Zhang R (2019) Brain White Matter Hyperintensity Lesion Characterization in T<sub>2</sub> Fluid-Attenuated Inversion Recovery Magnetic Resonance Images: Shape, Texture, and Potential Growth. Front. Neurosci. 13:353. doi: 10.3389/fnins.2019.00353*

Prior methods for characterizing age-related white matter hyperintensity (WMH) lesions on T<sub>2</sub> fluid-attenuated inversion recovery (FLAIR) magnetic resonance images (MRI) have mainly been limited to understanding the sizes, and occasionally the locations, of WMH lesions. Systematic morphological characterization has been missing. In this work, we proposed methods to fill this knowledge gap. We developed a proof-of-concept method to characterize and quantify the shape (based on Zernike transformation) and texture (based on fuzzy logic) of WMH lesions. We also developed a multi-dimensional feature vector approach to cluster WMH lesions into distinctive groups based on their shape and then texture features. We then developed an approach to calculate the potential growth index (PGI) of WMH lesions based on the image intensity distributions at the edge of the WMH lesions using a region-growing algorithm. High-quality T<sub>2</sub> FLAIR images containing clearly identifiable WMH lesions of various sizes from six cognitively normal older adults were used in our method development. Analyses of Variance (ANOVAs) showed significant differences in PGI among WMH group clusters in terms of either the shape (*P* = 1.06 × 10<sup>−2</sup>) or the texture (*P* < 1 × 10<sup>−20</sup>) features. In conclusion, we propose a systematic framework with which the shape and texture features of WMH lesions can be quantified and may be used to predict lesion growth in older adults.

Keywords: brain T2 FLAIR hyperintensity, shape, texture, potential growth, morphology

## INTRODUCTION

The presence of white matter hyperintensities (WMH) on T<sub>2</sub> fluid-attenuated inversion recovery (FLAIR) magnetic resonance images (MRI) is common in older adults over 65 years old, with a prevalence rate of ∼60–80% in the general population (De Leeuw et al., 2001; Wen and Sachdev, 2004). WMH lesions are even more extensive in those with vascular or Alzheimer's disease (AD) type of dementia when compared with cognitively normal older adults, suggesting a role in dementia pathogenesis and neurocognitive dysfunction (Bombois et al., 2007; Kloppenborg et al., 2014; Lee et al., 2016). WMH is also frequently observed in patients with multiple sclerosis (MS) (Loizou et al., 2015; Newton et al., 2017). Qualitative and quantitative WMH characterization has been used as a biomarker to assist cerebrovascular and neurodegenerative disease diagnosis and to assess treatment effects (Wardlaw et al., 2013). The pathogenic mechanisms of WMH are not well understood and have been attributed to cerebral small vessel disease (CSVD), white matter demyelination, or both, indicating brain white matter lesions (Greenberg, 2006; Wardlaw et al., 2013). Furthermore, periventricular and subcortical deep WMHs may have different pathogenic mechanisms (Schmidt et al., 2011; Poels et al., 2012; Tseng et al., 2013).

The most commonly used methods for WMH quantification in brain aging, vascular, and AD type of dementia are to measure its regional or total volume (i.e., the sum of WMH voxel size) within the whole brain based on image tissue segmentation algorithms (DeCarli et al., 2005; Wardlaw et al., 2013). This method, however, neglects entirely the typological or morphological features of WMH lesions which may have important clinical significance as demonstrated in recent studies in patients with MS (Loizou et al., 2015; Newton et al., 2017).

WMH shape is a basic morphological feature which can be derived from T<sub>2</sub> FLAIR images after tissue segmentation. Shape feature extraction, recognition, and classification can be implemented either in the original or the transformed image space (Khotanzad and Hong, 1990; Mikolajczyk et al., 2003; Carmichael and Hebert, 2004; Tahmasbi et al., 2011). Current shape classification methods include mainly the following: (1) one-dimensional function shape representation (Kauppinen et al., 1995; Yadav et al., 2007; Zhang and Lu), (2) polygonal approximation (ShuiHua and ShuangYuan), (3) spatial interrelation feature (Sebastian et al., 2004; Guru and Nagendraswamy, 2007; Bauckhage and Tsotsos), (4) moments (Mukundan, 2004; Celebi and Aslandogan; Taubin and Cooper), (5) scale-space methods (Zhang and Lu, 2003; Kpalma and Ronsin, 2006), and (6) shape transform domains (Chen and Bui, 1999; Zhang and Lu). These methods for shape classification may be suitable for specific applications in various fields but have major limitations for shape characterization of brain lesions. For example, method (1) is highly sensitive to noise, and inaccurate boundary definition can cause large errors; method (2) can only represent the object appearance but not all shape features; method (3) may be used to describe the general appearance of an object, but is limited by the orientation and size of the object; method (4) contains redundant information in the image feature vectors, so the original image cannot be uniquely reconstructed; method (5) is limited to shapes which have shallow concavities/convexities; and method (6) requires the definition of the shape contour starting point derived from other methods. In order to characterize the complex shapes of brain lesions, a technique needs to be invariant to the orientation of a lesion, be resistant to image noise and be able to define a one-to-one relationship between feature vector and shape. 
In this regard, Zernike transformation can satisfy these criteria (Khotanzad and Hong, 1990). Similar to Fourier analysis, shape features of an object captured on MRI can be represented by the coefficients of shape function of the Zernike polynomial expansion (i.e., Zernike transform), referred to as Zernike moments (ZMs) (Zernike, 1934). In this study, we applied Zernike transformation to extract WMH shape features for pattern recognition and classification in cognitively normal older adults.

Image texture is another morphological feature which can be categorized through modeling (Chen et al., 1989), structure (Chow and Rahman, 2007), transformation (Tsai and Hsiao, 2001), and statistics-based methods (Haralick et al., 1973; Iivarinen et al., 1996). Model-based and structure-based methods work best for repeating texture patterns but are not suitable for irregular texture patterns such as those in brain lesion images. The transformation-based method works best in identifying subregions with known characteristics, but does not work well on unknown and potentially complicated patterns such as those in brain lesion images. Statistics-based methods describe texture through the distribution of, and relationships between, gray-level values in an image. These statistics-based methods can normally describe objects better than the structure-based and transformation-based methods because they are invariant to the orientation and size of an object and are robust to noise inside the object (Castellano et al., 2004). Since WMH lesions often have various sizes, orientations, and locations, and are manifested across multiple image slices, a statistics-based method is likely to be the best choice to accommodate these complexities. Therefore, we adopted a statistical method based on fuzzy logic to construct the image intensity histogram of WMH lesions for texture feature extraction.

Finally, we hypothesized that, as potential imaging biomarkers of brain aging and CSVD, the size, shape, and image texture of WMH lesions may change with time (Sachdev et al., 2007; Godin et al., 2011), which may reflect the progression of the underlying pathological process. In this regard, recent studies have shown that the immediate surrounding areas of clearly defined WMH lesions may be at risk for further tissue damage and conversion to lesions (Maillard et al., 2014; Promjunyakul et al., 2016). These areas are classified as WMH penumbras (Maillard et al., 2014). To characterize WMH lesions as well as their penumbras, we developed a seed-based region-growing algorithm to characterize WMH boundaries and to explore the potential growth of WMH lesions. We defined this specific WMH boundary characteristic as the potential growth index (PGI). To explore whether the shape and texture characterization techniques can potentially be used to predict lesion growth, we assessed whether different shape and texture patterns are related to PGI.

### METHODS AND RESULTS

### MRI Acquisition

Full-brain 2D T<sub>2</sub> FLAIR images were collected on a Philips Achieva 3T scanner (Philips Healthcare, Best, the Netherlands) with the following parameters: axial, time of echo (TE) = 125 ms, time of repetition (TR) = 11 s, time of inversion (TI) = 2,800 ms, field of view (FOV) = 23 cm × 23 cm, slice thickness = 2.5 mm, number of slices = 64 with no gaps, acquisition matrix size = 352 × 212, and reconstructed matrix size = 512 × 512. All subjects signed informed consent approved by the Institutional Review Boards of the UT Southwestern Medical Center and Texas Health Presbyterian Hospital of Dallas. Six T<sub>2</sub> FLAIR brain image datasets (two male, four female, 75 ± 4 years old and normal cognition), which contained clearly identifiable white matter hyperintensity (WMH) lesions of various sizes, were selected from a healthy aging study we published previously (Tarumi et al., 2014).

### T<sub>2</sub> FLAIR Image Segmentation

T<sub>2</sub> FLAIR WMH regions were segmented on each 2D image through the lesion prediction algorithm (LPA) implemented in the Lesion Segmentation Toolbox (LST) version 2.0.12 for Statistical Parametric Mapping (SPM12). In LPA, the algorithm is trained using a logistic regression model on T<sub>2</sub> FLAIR brain images from 53 MS patients with severe lesion patterns. LPA was also validated in other patient populations such as older adults with diabetes (Maldjian et al., 2013). The fit of a new T<sub>2</sub> FLAIR brain image to this model provides an estimate of lesion probability for each voxel in the image. In this study, we used a threshold of 0.5, as suggested by LST, on the obtained lesion probability maps to identify WMH regions. The segmentation accuracy was further verified through visual inspection. **Figure 1** shows an example of the segmentation.

### Lesion Size Distribution

WMH binary masks generated from 2D T<sub>2</sub> FLAIR images (**Figure 1**) were used to obtain the WMH size distribution. To minimize artifacts, only those masks with more than 10 connected WMH voxels (voxel size: 0.45 mm × 0.45 mm) on an image were considered probable lesions and were used for further characterization, which resulted in a total of 993 WMH lesions. When each lesion was fitted within a square, the lesions ranged in size from 6 × 6 to 176 × 176 voxels. The lesion size distributions of the six subjects are shown in **Figure 2**. Of note, more than 93% of these lesions are ≤ 60 × 60, and only about 1.5% are ≥ 120 × 120 voxels.
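The size filter described above can be illustrated with a small connected-component sketch. The source does not state which connectivity rule was used, so 8-connectivity is an assumption here, and the function names and the toy mask are hypothetical:

```python
from collections import deque


def lesion_components(mask, min_voxels=11):
    """Connected components of a 2D binary mask with at least `min_voxels` voxels.

    `min_voxels=11` mirrors "more than 10 connected voxels".
    8-connectivity is an assumption; the source does not specify it.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first search over the component containing (r0, c0).
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            voxels = []
            while queue:
                r, c = queue.popleft()
                voxels.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < rows and 0 <= cc < cols
                                and mask[rr][cc] and not seen[rr][cc]):
                            seen[rr][cc] = True
                            queue.append((rr, cc))
            if len(voxels) >= min_voxels:
                components.append(voxels)
    return components


def bounding_square_side(voxels):
    """Side length (in voxels) of the smallest axis-aligned square enclosing a lesion."""
    rs = [r for r, _ in voxels]
    cs = [c for _, c in voxels]
    return max(max(rs) - min(rs), max(cs) - min(cs)) + 1


# Toy mask: one 3 x 4 block (12 voxels, kept) and a 2-voxel speck (discarded).
mask = [[0] * 8 for _ in range(8)]
for r in range(1, 4):
    for c in range(1, 5):
        mask[r][c] = 1
mask[6][6] = mask[6][7] = 1

lesions = lesion_components(mask)
print(len(lesions), [bounding_square_side(v) for v in lesions])
```

The bounding-square side corresponds to the "6 × 6 to 176 × 176" size description, under the assumption that the square is axis-aligned.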

### WMH Shape Feature Extraction and Classification in 2D

### WMH Shape Feature Extraction Using Zernike Transformation

Zernike transformation has been used extensively in imaging shape feature extraction and pattern recognition (Papakostas et al., 2007; Wee and Paramesran, 2007). The coefficients of Zernike polynomial expansion of an object are referred to as Zernike moments (ZMs) which are used to represent the shape features of analyzed objects. In this study, Zernike polynomials were expressed in polar coordinates defined on a unit disc, which are a complete set of orthogonal basis functions (Papakostas et al., 2007; Wee and Paramesran, 2007). The lower-order ZMs describe global contour and gross shape features, and the higher-order ZMs describe regional and fine topological details of a shape (Gwo and Wei, 2016). Of note, the magnitudes of ZMs are not only rotational invariant but also robust to small perturbations on the contour of a shape image (Teh and Chin, 1988).

For a 2D image object (a WMH lesion image segmented from a T<sub>2</sub> FLAIR image in this work), using polar coordinates, the complex Zernike moment of order n with repetition m can be represented as the inner product of a shape function f(r, θ) with the basis function of Zernike polynomials, V<sub>nm</sub>(r, θ), specifically as

$$Z_{nm} = \frac{n+1}{\pi} \int_0^{2\pi} \int_0^1 f\left(r, \theta\right) V_{nm}^{*}\left(r, \theta\right) r \, dr \, d\theta, \qquad |r| \le 1, \tag{1}$$

where V<sup>∗</sup><sub>nm</sub>(r, θ) denotes the complex conjugate of V<sub>nm</sub>(r, θ). The basis function of the Zernike polynomials is given by

$$V_{nm}\left(r,\theta\right) = R_{nm}\left(r\right)e^{im\theta}, \quad i = \sqrt{-1} \tag{2}$$

where the radial polynomial, R<sub>nm</sub>(r), is defined as follows:

$$R_{nm}(r) = \sum_{k=0}^{\frac{n-|m|}{2}} (-1)^k \frac{(n-k)!}{k! \left(\frac{n+|m|}{2} - k\right)! \left(\frac{n-|m|}{2} - k\right)!} r^{n-2k} \tag{3}$$

where 0 ≤ |m| ≤ n, n − |m| is an even integer, and n ≥ 0.
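To make Equations (2) and (3) concrete, the following sketch evaluates the radial polynomial and the complex basis function directly from their definitions. The function names are hypothetical, and the code is an illustration rather than the implementation used in the study:

```python
import cmath
from math import factorial


def radial_polynomial(n, m, r):
    """Radial polynomial R_nm(r) of Equation (3)."""
    m = abs(m)
    if m > n or (n - m) % 2:
        raise ValueError("require |m| <= n and n - |m| even")
    total = 0.0
    for k in range((n - m) // 2 + 1):
        coeff = ((-1) ** k * factorial(n - k)
                 / (factorial(k)
                    * factorial((n + m) // 2 - k)
                    * factorial((n - m) // 2 - k)))
        total += coeff * r ** (n - 2 * k)
    return total


def zernike_basis(n, m, r, theta):
    """Complex basis function V_nm(r, theta) of Equation (2)."""
    return radial_polynomial(n, m, r) * cmath.exp(1j * m * theta)


# R_20(r) = 2r^2 - 1 and R_40(r) = 6r^4 - 6r^2 + 1, evaluated at r = 0.5.
print(radial_polynomial(2, 0, 0.5), radial_polynomial(4, 0, 0.5))
```

Because V<sub>n,−m</sub> is the complex conjugate of V<sub>n,m</sub>, the two basis functions have the same magnitude, which is why only non-negative repetitions need to be kept when magnitudes are used as features.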

Since the shape features represented by ZMs at orders higher than six are usually too small (small ZM magnitude) to be detected reliably by human eyes (Charman, 2005), the maximum Zernike transformation order was set to five in this study (**Figure 3**). In Zernike transformation, |V<sub>n,+m</sub>(r, θ)| = |V<sub>n,−m</sub>(r, θ)| and |Z<sub>n,+m</sub>| = |Z<sub>n,−m</sub>|. The number of distinctive ZM magnitudes for an expansion up to order n is computed as follows:

$$\begin{cases} \left(\frac{n+2}{2}\right)^2 & \text{if order } n \text{ is even} \\ \frac{(n+3)(n+1)}{4} & \text{if order } n \text{ is odd} \end{cases} \tag{4}$$
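As a quick check of Equation (4), the sketch below compares the closed-form count with a direct enumeration of the valid (n, m) pairs with m ≥ 0 and n − m even. The helper names are hypothetical:

```python
def count_by_formula(n):
    """Number of distinctive ZM magnitudes up to order n, from Equation (4)."""
    if n % 2 == 0:
        return ((n + 2) // 2) ** 2
    return (n + 3) * (n + 1) // 4


def count_by_enumeration(n_max):
    """Count pairs (n, m) with 0 <= m <= n <= n_max and n - m even."""
    return sum(1 for n in range(n_max + 1)
               for m in range(n + 1) if (n - m) % 2 == 0)


def count_complex_moments(n_max):
    """Count all complex moments, i.e., both signs of m are included."""
    return sum(1 for n in range(n_max + 1)
               for m in range(-n, n + 1) if (n - m) % 2 == 0)


print(count_by_formula(5), count_complex_moments(5))  # 12 21
```

For a maximum order of five this reproduces the 21 complex coefficients and 12 distinctive magnitudes used in this study, and for maximum orders of 1 and 10 it gives 2 and 36, respectively.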

Shape feature extraction procedures based on Zernike transformation are illustrated in **Figure 4**. In this illustration, we chose three WMH masks, two with a similar shape but different sizes, and one with both a different shape and a different size. To simplify computation, these image masks with different sizes were first scaled to the same size of 60 × 60 voxels so that the ZM magnitudes can be compared on the same scale. Each Z<sub>nm</sub> was calculated using Equations (1–3). The calculation resulted in 21 complex ZM coefficients with maximum order n = 5. Because the shape features are based on the ZM magnitudes, which are rotationally invariant, only 12 distinct coefficients were needed to describe each WMH shape (mask) generated by tissue segmentation. As shown in the right column of **Figure 4**, the two lesion images (a) and (b) with a similar shape have similar ZM magnitudes at all 12 coefficients. In contrast, the ZM magnitudes of the WMH lesion with a different shape differ from those of the other two.
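The rotational invariance of the ZM magnitudes can be illustrated with a discrete approximation of Equation (1). This is a minimal sketch under stated assumptions: the square mask is mapped so that it lies entirely inside the unit disc, the integral is replaced by a sum over pixel centers, and a small 8 × 8 toy mask is used instead of the 60 × 60 masks of the study. All function names are hypothetical.

```python
import cmath
from math import atan2, factorial, hypot, pi, sqrt


def radial_polynomial(n, m, r):
    """Radial polynomial R_nm(r) of Equation (3), for m >= 0 and n - m even."""
    return sum((-1) ** k * factorial(n - k)
               / (factorial(k) * factorial((n + m) // 2 - k)
                  * factorial((n - m) // 2 - k))
               * r ** (n - 2 * k)
               for k in range((n - m) // 2 + 1))


def zernike_magnitudes(mask, max_order=5):
    """|Z_nm| for m >= 0, approximating Equation (1) by a sum over pixel centers."""
    size = len(mask)
    scale = size * sqrt(2.0)          # assumed mapping: whole square inside the unit disc
    pixel_area = (2.0 / scale) ** 2
    magnitudes = {}
    for n in range(max_order + 1):
        for m in range(n % 2, n + 1, 2):      # m >= 0 with n - m even
            z = 0j
            for row in range(size):
                for col in range(size):
                    if mask[row][col]:
                        x = (2 * col + 1 - size) / scale
                        y = (2 * row + 1 - size) / scale
                        r, theta = hypot(x, y), atan2(y, x)
                        # Conjugate basis function V*_nm = R_nm * exp(-i m theta).
                        z += radial_polynomial(n, m, r) * cmath.exp(-1j * m * theta)
            magnitudes[(n, m)] = abs(z) * (n + 1) / pi * pixel_area
    return magnitudes


def rotate_90(mask):
    """Rotate a square mask by 90 degrees (exact on the pixel grid)."""
    return [list(row) for row in zip(*mask[::-1])]


# Toy L-shaped mask (hypothetical, not a lesion from the study).
mask = [[0] * 8 for _ in range(8)]
for row in range(2, 6):
    mask[row][2] = 1
for col in range(2, 5):
    mask[5][col] = 1

base = zernike_magnitudes(mask)
rotated = zernike_magnitudes(rotate_90(mask))
print(len(base), max(abs(base[key] - rotated[key]) for key in base))
```

A 90° rotation is used because it maps the pixel grid onto itself, so the two magnitude vectors should agree up to floating-point rounding; for arbitrary angles, interpolation would introduce small differences.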

### WMH Shape Classification

We then classified the lesion images into different clusters (groups) based on the similarity of their shape features. Current common clustering algorithms, such as the K-means clustering algorithm,

lesion prediction algorithm (LPA) showing in red, and a WMH binary mask after tissue segmentation, which was used in shape feature extraction.

require a data-specific a priori selection of the number of clusters (Zhao, 2012). For instance, if the number of clusters is too small, WMH lesion images with noticeably different shapes may be grouped inappropriately into the same cluster. On the other hand, if the number of clusters is too large, lesion images with trivial differences may be assigned to different clusters, confounding potential clinical significance. Finding the appropriate number of clusters using model simulation is one way to resolve this dilemma (Zhao, 2012). However, this procedure has to be carried out for all choices of shape feature dimensions. To simplify the procedure, cluster characteristic indices based on the sum of within-cluster dispersions [W<sub>k</sub> in Equation (5)] or its variants have been proposed (Ball and Hall, 1965; Calinski and Harabasz, 1974; Xu, 1997; Tibshirani et al., 2001). To better understand how the number of clusters and the feature dimension derived from the Zernike transform influence W<sub>k</sub>, we plotted W<sub>k</sub> as a function of the number of clusters and the feature dimension, which is equivalent to the number of distinctive ZM magnitudes (**Figure 5**). The ZMs for WMH shapes were calculated for maximum orders from 1 to 10, generating 2- to 36-dimensional feature vectors (Equation 4). Euclidean distance was then calculated to assess the similarity between the feature vectors. The K-means clustering algorithm was applied for grouping. W<sub>k</sub> was calculated for 2 to 20 clusters at each feature dimension. W<sub>k</sub>, in general, decreases as the number of clusters increases but increases as the number of feature dimensions increases (**Figure 5**). For a specified feature dimension, a better grouping result is likely achieved at a lower W<sub>k</sub> value by finding a local minimum. 
However, in some feature dimensions, a local minimum cannot be found even after W<sub>k</sub> decreases to a nearly constant value. For example, when two is selected as the feature dimension, the W<sub>k</sub> value remains small even at a small number of clusters because two-dimensional feature vectors only represent

FIGURE 3 | (A) The 21 basis functions of Zernike polynomials, *Vnm* (*r*, θ), with order *n* ≤ 5, are illustrated. The polynomials have a radial range of [−1, 1] (|*Vnm* (*r*, θ)| ≤ 1), shown by the color bar on the left column; (B) 12 distinctive magnitude images, which are rotationally invariant, are shown, corresponding to the polynomials in (A).

The images are normalized to the same size of 60 × 60 voxels shown in the middle column. The magnitudes of 12 Zernike moments coefficients based on ZM orders ≤5 of Zernike polynomial expansion are shown in the right column for comparison.

the gross global contour and thus cannot differentiate shapes with enough details. Therefore, selecting an appropriate feature dimension is also crucial and will be discussed later. Nevertheless, once a feature dimension is selected [which was selected to be 12 in this study based on our exploration of data features (**Figures 4**, **5**)], the optimal number of clusters can be determined based on the estimation of cluster characteristics discussed below (Desgraupes, 2013).

The distance between two points in a shape feature vector space can be calculated based on Euclidean distance. The overall distance of all points in a cluster to their mean indicates the compactness of a cluster, or within-cluster dispersion. To determine the optimal number of shape clusters for WMH shape classification, we then employed the gap statistic method proposed by Tibshirani et al. (2001). This method estimates the optimal number of clusters by comparing the logarithm of the sum of within-cluster dispersions of a set of clusters to that from reference datasets created by sampling uniformly at random over the range of the original dataset. The sum of all within-cluster dispersions decreases gradually as the number of clusters increases but becomes nearly constant at some point, as demonstrated in **Figure 5**. This is the so-called "elbow" phenomenon, which has been used to find the optimal number of clusters (Tibshirani et al., 2001). The algorithm used to estimate the optimal number of WMH shape clusters based on the gap statistic is presented below:

1. Group the shape vectors by varying the number of shape clusters k = 1, 2, . . . , N (N is the pre-defined maximum number of clusters to evaluate), and compute the sum of the within-cluster dispersions W<sub>k</sub> for each choice of k:

$$W_k = \sum_{r=1}^{k} \sum_{\mathbf{x}_i \in C_r} (\mathbf{x}_i - \bar{\mathbf{x}}_r)^2 \tag{5}$$

where x<sub>i</sub> is a data point, C<sub>r</sub> denotes cluster r, and x̄<sub>r</sub> is the vector mean of C<sub>r</sub>.

2. Generate reference datasets (total number = B) by sampling uniformly at random within the range of the original dataset in every dimension. Although better statistical randomness is likely achieved with a large B, the choice of B is bounded by computational demand. For each reference dataset b, we can generate k clusters and calculate the sum of the within-cluster dispersions W<sub>kb</sub> for each k based on Equation (5) above, where b = 1, 2, . . . , B; k = 1, 2, . . . , N. The gap statistic for each k is calculated as below:

$$\text{Gap}\left(k\right) = \frac{1}{B} \sum_{b=1}^{B} \log \left(W_{kb}\right) - \log \left(W_k\right) \tag{6}$$

3. Let l = (1/B) Σ<sub>b</sub> log(W<sub>kb</sub>) and compute the standard deviation

$$sd_k = \left[\frac{1}{B} \sum_{b=1}^{B} \left(\log\left(W_{kb}\right) - l\right)^2\right]^{1/2} \tag{7}$$

Let s<sub>k</sub> = sd<sub>k</sub> √(1 + 1/B). Choose the optimal number of shape clusters k<sub>opt</sub> by Equation (8):

$$k_{opt} = \text{smallest } k \text{ such that } \text{Gap}\left(k\right) \ge \text{Gap}\left(k+1\right) - s_{k+1} \tag{8}$$

In the gap statistic procedure above, N is a pre-selected number of shape clusters such that k<sub>opt</sub> can be determined in the range of [1, N]. B is selected such that the value of sd<sub>k</sub> converges. In this study, N and B were set to 20 and 10, respectively.
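The quantities in Equations (5)–(8) can be sketched as below. This is a minimal illustration that assumes the clustering itself (e.g., K-means on the data and on each reference dataset) has already been run elsewhere; the function names and the numbers in the example are hypothetical and are not results from this study.

```python
from math import log, sqrt


def within_cluster_dispersion(points, labels):
    """W_k of Equation (5): sum of squared Euclidean distances to cluster means."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dim))
    return total


def gap_and_error(w_k, w_kb):
    """Gap(k) of Equation (6) and s_k = sd_k * sqrt(1 + 1/B) from Equation (7).

    w_k  : dispersion of the real data clustered into k groups (must be > 0)
    w_kb : list of B dispersions from the reference datasets for the same k
    """
    logs = [log(w) for w in w_kb]
    b = len(logs)
    l_bar = sum(logs) / b
    gap = l_bar - log(w_k)
    sd_k = sqrt(sum((value - l_bar) ** 2 for value in logs) / b)
    return gap, sd_k * sqrt(1.0 + 1.0 / b)


def optimal_k(gaps, errors):
    """Equation (8): smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.

    gaps[0] and errors[0] correspond to k = 1.
    """
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return i + 1
    # Assumed fallback when the criterion is never met within [1, N].
    return len(gaps)


# Toy example with two obvious clusters (hypothetical values).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(within_cluster_dispersion(points, [0, 0, 1, 1]))
print(optimal_k([0.10, 0.50, 0.45, 0.40], [0.02, 0.02, 0.02, 0.02]))
```

Separating the dispersion, gap, and selection steps keeps the sketch independent of any particular clustering library; with N = 20 and B = 10 as in this study, `gap_and_error` would be called 20 times, each with 10 reference dispersions.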

For the WMH shape datasets in this study, based on the previous discussion and **Figure 5**, the "elbow" phenomenon was sufficiently noticeable when the feature dimension was set to 12. At this feature dimension, only ZM magnitudes corresponding to ZM orders of n = 0 to 5 were used in clustering [cf. Equation (4)]. The Gap values were calculated and displayed in **Figure 6**; the optimal number of shape clusters was selected to be six according to Equation (8).

**Figure 7** shows the WMH shape classification results using the K-means algorithm with six clusters and a feature dimension of 12. Distinct shape differences are visible among the six clusters.

### WMH Texture Feature and Classification

### Texture Feature Extraction

Image texture characterizes the voxel signal intensity distribution patterns in a WMH region. Statistics-based methods quantify the distribution and relationships of voxel signal values in an image region. These methods often provide better discrimination indexes than structure and spectral transformation based methods (Castellano et al., 2004).

WMH lesions often have various sizes, orientations and locations, and manifest across multiple image slices. In this study, the distributions of WMH lesion size measured in the number of voxels are presented in **Figure 8**.

Most lesions are small, with 48.64% of lesions ≤40 voxels (8 mm<sup>2</sup> on a slice). Therefore, a robust texture analysis method needs to satisfy three requirements: (1) texture features should be independent of lesion orientation and location; (2) texture features should be able to quantify small lesions; and (3) texture characterization needs to go beyond a single image slice. A statistics-based method for WMH texture feature extraction is described next.

Since WMH lesions manifest across multiple slices, we use the term "WMH3D" to emphasize the 3D perspective. Specifically, if a WMH lesion image in a slice connects directly above, below, or diagonally to another WMH lesion image in an adjacent slice, we treat these lesion images as belonging to the same lesion, called a "WMH3D," for texture analysis. This treatment also reduces the chance of false positives in lesion identification. To characterize the voxel signal intensity distribution, potential voxel spike noise, which is often seen in images, needs to be removed first. This can be accomplished by limiting the voxel intensities to within three standard deviations above or below the mean value. To reduce the slice variation in signal intensity, a min-max normalization was applied to a WMH3D to normalize its voxel intensity based on the equation,

$$s\left(x, y, z\right) = \frac{f\left(x, y, z\right) - \operatorname{gMin}}{\operatorname{gMax} - \operatorname{gMin}}\tag{9}$$

where f(x, y, z) is the intensity of voxel (x, y) at the zth slice, s ∈ [0, 1], gMax = Max(voxel intensities of WMH3D), and gMin = Min(voxel intensities of WMH3D).
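The spike suppression and the min-max normalization of Equation (9) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `normalize_wmh3d` is hypothetical, the population standard deviation is assumed, and gMin and gMax are assumed to be taken after the spikes have been clipped (the text does not specify either choice).

```python
from statistics import mean, pstdev


def normalize_wmh3d(intensities, n_sd=3.0):
    """Clip spike noise, then min-max scale voxel intensities to [0, 1].

    intensities: flat list of the voxel intensities of one WMH3D.
    Assumptions: population SD; gMin/gMax computed after clipping.
    """
    mu = mean(intensities)
    sd = pstdev(intensities)
    lower, upper = mu - n_sd * sd, mu + n_sd * sd
    # Limit every voxel to within n_sd standard deviations of the mean.
    clipped = [min(max(v, lower), upper) for v in intensities]
    g_min, g_max = min(clipped), max(clipped)
    if g_max == g_min:
        # Flat lesion: avoid division by zero in Equation (9).
        return [0.0 for _ in clipped]
    return [(v - g_min) / (g_max - g_min) for v in clipped]
```

For a lesion with 20 voxels of intensity 100 and a single spike of 1,000, the spike is pulled back to the upper bound before scaling, so it maps to 1 and the remaining voxels map to 0.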

For feature extraction, the normalized data were quantized into one of the pre-selected bins to create a histogram that represents the voxel intensity distribution of a WMH3D. To minimize the interference of image noise with the frequency histogram, we propose a fuzzy logic method (Gwo and Wei, 2013) to allocate voxel intensity values to each of the pre-selected bins. Specifically, a normalized voxel intensity s is split proportionally into two values, called fuzzy values, which are assigned to the two neighboring bins according to its position relative to the bin centers (**Figure 9**). The fuzzy logic method is able to characterize not only the local image signal intensity distribution of a lesion but also its global distribution, producing different histogram skewness based on the intensity mean value.

FIGURE 5 | The within-cluster dispersion *W<sub>k</sub>* as a function of the number of shape clusters and feature dimensions. The feature dimensions are the number of distinctive magnitudes of the ZMs. *W<sub>k</sub>*, in general, tends to decrease with the number of clusters but increase with the number of feature dimensions.

The fuzzy logic functions used for assigning voxels to the frequency histogram are presented in Equation (10). The fuzzy value v[j] at bin j is calculated as:

$$\begin{cases} v\left[0\right] = 1 & \text{if } s \le \frac{1}{2n} \\ v\left[j-1\right] = \frac{2j+1}{2} - s \times n, \quad v\left[j\right] = s \times n - \frac{2j-1}{2} & \text{if } \frac{2j-1}{2n} < s \le \frac{2j+1}{2n} \\ v\left[j\right] = \frac{2j+3}{2} - s \times n, \quad v\left[j+1\right] = s \times n - \frac{2j+1}{2} & \text{if } \frac{2j+1}{2n} < s \le \frac{2j+3}{2n} \\ v\left[n-1\right] = 1 & \text{if } s \ge 1 - \frac{1}{2n} \end{cases} \tag{10}$$

FIGURE 7 | … images in each cluster and the six normalized lesion images closest to the cluster mean in each cluster are shown. All lesion images shown in the figure were normalized to the size of 60 × 60 voxels.

FIGURE 8 | The distributions of WMH lesion size measured in number of voxels from the six subjects. A number shown in a lesion size bin (the horizontal axis) represents a lesion size range. For example, lesion size bin of 50 represents the lesion size range of 40 × 40 to 50 × 50 voxels. Frequency (the vertical axis) counts the number of lesion sizes at the lesion size bins.

where n = the total number of bins, and j = 0, ..., n-1. To choose a proper number of bins, there are two considerations: (1) When the number of bins increases, the accumulated fuzzy values in some bins become sparse, especially for small size lesions. Sparsity is problematic for any statistical analysis method (Hughes, 1968). The amount of data needed to obtain a reliable statistical result grows exponentially with the number of bins (Hughes, 1968); (2) conversely, if the number of bins is too small, image features may not be differentiated effectively. In this study, to facilitate WMH texture feature classification discussed below, we selected five bins for texture feature extraction.

Since the sizes of WMH lesions vary over a wide range (**Figure 8**), the image intensity frequency histograms need to be further normalized before they can be compared. Herein, each histogram is normalized to have a total cumulative frequency of 1. For example, for the WMH lesion shown in the first row of **Figure 10**, the original five-bin histogram over 814 voxels produces a texture feature vector of (266.6, 240.2, 153.9, 125.3, 28.0). To allow comparison with WMH lesions of different sizes, this vector was divided by 814 to give the normalized distribution (0.3275, 0.2951, 0.1891, 0.1539, 0.0343).
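The fuzzy allocation of Equation (10) and the subsequent size normalization can be sketched as below. The function names are hypothetical, and the code assumes that bin centers lie at (2j + 1)/(2n) and that each intensity is shared linearly between the two nearest centers, which is one reading of the linear fuzzy functions described above.

```python
def fuzzy_histogram(values, n_bins=5):
    """Accumulate linear fuzzy memberships of normalized intensities in [0, 1]."""
    hist = [0.0] * n_bins
    for s in values:
        pos = s * n_bins - 0.5        # position measured in bin-center units
        if pos <= 0:                  # s <= 1/(2n): entirely in the first bin
            hist[0] += 1.0
        elif pos >= n_bins - 1:       # s >= 1 - 1/(2n): entirely in the last bin
            hist[-1] += 1.0
        else:
            j = int(pos)              # index of the lower neighboring bin
            frac = pos - j            # share given to the upper neighbor
            hist[j] += 1.0 - frac
            hist[j + 1] += frac
    return hist


def normalize_histogram(hist):
    """Scale a histogram so that its entries sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]
```

Each voxel contributes a total of exactly 1 to the histogram, so the raw entries sum to the lesion size in voxels; dividing by that sum gives feature vectors that can be compared across lesions of different sizes.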

### WMH Texture Feature Classification

Texture feature classification of individual WMH lesion images was conducted using a feature vector clustering method similar to that discussed above in the section "WMH Shape Classification." Of note, the texture feature vector is based on the histogram constructed above using the fuzzy logic method. The influences of different texture feature dimensions (i.e., the number of bins used to construct the intensity histogram) and numbers of clusters on texture feature classification were explored using the same strategy discussed above for WMH shape feature classification. Based on prior work (Shapiro and Stockman, 2001), Manhattan distance is preferable to Euclidean distance for assessing the similarity between feature vectors represented as histograms. Thus, Manhattan distance was used to assess the similarity between the texture feature vectors in our work. The sum of within-cluster dispersion W<sub>k</sub> was calculated with the cluster number from 2 to 20 and the feature dimension from 3 to 15. As illustrated in **Figure 11**, W<sub>k</sub> tends to decrease as the number of clusters increases. A noticeable "elbow" phenomenon was seen for a wide range of texture feature dimensions from 3 to 15.

FIGURE 10 | WMH texture feature extraction procedures: (A) source WMH lesion images, (B) WMH lesion mask images, and (C) WMH texture quantized images using fuzzy logic method.

The gap statistic discussed above was applied to determine the optimal number of texture feature clusters for pattern recognition based on the K-means algorithm for grouping (Hartigan and Wong, 1979). **Figure 12** shows that five is the optimal number of clusters.
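As an illustration of how Manhattan distance enters the clustering, the sketch below assigns histogram feature vectors to their nearest cluster center and sums the distances to those centers. The helper names are hypothetical, and the dispersion computed here (sum of distances to the assigned center) is a simplifying assumption; the definition of W<sub>k</sub> used in the gap statistic procedure may differ.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def assign_clusters(vectors, centers):
    """Label each vector with the index of its nearest center (L1 distance)."""
    return [
        min(range(len(centers)), key=lambda c: manhattan(x, centers[c]))
        for x in vectors
    ]


def within_cluster_dispersion(vectors, centers, labels):
    """Sum of L1 distances from each vector to its assigned center (assumed form)."""
    return sum(manhattan(x, centers[k]) for x, k in zip(vectors, labels))
```

Repeating the assignment for increasing numbers of centers and plotting the resulting dispersion is what produces an "elbow" curve of the kind described for **Figure 11**.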

**Figure 13** shows the texture classification results, demonstrating five unique clusters.

### WMH Potential Growth Index in 2D

We developed a seed-based region-growing algorithm to characterize WMH boundary conditions in order to explore potential growth of WMH lesions (Maillard et al., 2014; Promjunyakul et al., 2016). We hypothesized that the area of potential growth of WMH lesions has signal intensity similar to that of WMH lesions and is located around the boundary of WMH lesions. With a pre-defined signal intensity threshold, calculated from the extreme values in the WMH3D [Equation (9)], we can use a seed-based region-growing algorithm to find the "potential growth" voxels around the WMH boundary. The region-growing algorithm is initiated by selecting the WMH mask boundary voxels as the growing seeds. At each growing seed voxel, the eight connected neighbor voxels, defined as A<sub>8</sub>(x, y) in Equation (11) below, are examined iteratively until no more voxels meet a given criterion in signal intensity.

$$A_8\left(x, y\right) = \left\{ \begin{array}{l} \left(x-1, y-1\right), \left(x, y-1\right), \left(x+1, y-1\right), \left(x-1, y\right), \\ \left(x+1, y\right), \left(x-1, y+1\right), \left(x, y+1\right), \left(x+1, y+1\right) \end{array} \right\} \tag{11}$$

FIGURE 11 | The within-cluster dispersion *W<sub>k</sub>* as a function of the number of texture clusters and feature dimension. Note that a noticeable "elbow" phenomenon is present for a wide range of texture feature dimensions from 3 to 15.

In the study, the stopping criterion used for iterative seed growing is determined by comparing a neighboring voxel intensity with the highest and lowest signal intensities, gMax and gMin, of a WMH3D. If the difference between the voxel intensity and gMax is less than a threshold, as defined below in Equation (12), the corresponding voxel is added to a growth voxel set, R<sub>g</sub>, and appended to the boundary seed voxel list S<sub>l</sub> for further searching. The pseudo-code of the seed-based region-growing algorithm is presented in **Figure 14A**. Note that M<sub>k</sub> is the set of voxels in a WMH lesion mask, and f(p) is the signal intensity at neighbor voxel p of a lesion boundary seed voxel, as defined in Equation (11).

$$\text{Threshold} = \gamma \times \left(\operatorname{gMax} - \operatorname{gMin}\right) \tag{12}$$

where γ is the threshold control coefficient. The choice of γ represents the user-defined steepness of the edge around the WMH boundaries. Of note, if the value of γ is too large, the potential growth region would spread around all boundaries of the WMH lesions regardless of lesion shape or texture features. In this study, we chose γ = 1.02 to demonstrate the presence of potential growth regions of WMH lesions using the seed-based growing algorithm (**Figure 14B**).
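The region-growing step can be sketched in Python as follows. This is an illustrative reading of the description above, not the pseudo-code of **Figure 14A**: the function name `grow_region` and the data layout are hypothetical, gMax and gMin are taken over the mask voxels of a single slice (the text takes them over the whole WMH3D), and every mask voxel is used as a seed, which is equivalent to seeding from the boundary because interior voxels have no neighbors outside the mask.

```python
from collections import deque

# Offsets of the eight connected neighbors, as in Equation (11).
NEIGHBORS_8 = [(-1, -1), (0, -1), (1, -1), (-1, 0),
               (1, 0), (-1, 1), (0, 1), (1, 1)]


def grow_region(image, mask, gamma):
    """Return the set of "potential growth" voxels around a lesion mask.

    image: 2D list indexed as image[y][x]; mask: set of (x, y) lesion voxels.
    Assumption: gMax and gMin are computed from the mask voxels of this slice.
    """
    height, width = len(image), len(image[0])
    g_max = max(image[y][x] for x, y in mask)
    g_min = min(image[y][x] for x, y in mask)
    threshold = gamma * (g_max - g_min)      # Equation (12)

    seeds = deque(mask)                      # seed list (S_l in the text)
    growth = set()                           # growth voxel set (R_g in the text)
    while seeds:
        x, y = seeds.popleft()
        for dx, dy in NEIGHBORS_8:
            qx, qy = x + dx, y + dy
            if not (0 <= qx < width and 0 <= qy < height):
                continue                     # neighbor falls outside the image
            q = (qx, qy)
            if q in mask or q in growth:
                continue                     # already lesion or already grown
            if g_max - image[qy][qx] < threshold:
                growth.add(q)
                seeds.append(q)              # keep searching from the new voxel
    return growth
```

Because newly accepted voxels are appended to the seed list, growth can extend several voxels away from the mask as long as each step satisfies the intensity criterion.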

FIGURE 13 | The WMH lesion images from six subjects were classified into five clusters based on their texture features. The six lesion images closest to their cluster means based on the Manhattan distance are shown for each cluster.

After all "potential growth" voxels are found, the potential growth index (PGI) of a WMH lesion can be calculated. To calculate this index, the set of four-connected voxels [Equation (13)] of each voxel (x, y) on the mask boundary is first applied to successively generate l layers of apparent masks surrounding the lesion, each with a thickness of one voxel.

$$A_4\left(x, y\right) = \left\{ \begin{array}{l} \left(x, y-1\right), \left(x-1, y\right), \\ \left(x+1, y\right), \left(x, y+1\right) \end{array} \right\} \tag{13}$$

The pseudo-code of the algorithm for generating l layers around a WMH lesion is presented in **Figure 14C**. The notations S<sub>l</sub> and M<sub>k</sub> are the same as in **Figure 14A**; l is the number of layers to be generated, and the voxels of the ith layer are kept in the E[i] list.

These apparent layer masks are used to identify the relative location of a growth voxel. A growth voxel in an outer layer of these masks carries more weight in its contribution to the potential growth index. Specifically, the weight w<sub>i</sub> of the ith layer, with l layers in total, is given by the following equation:

$$w_i = \frac{i}{\sum_{j=1}^{l} j} \tag{14}$$

Once the number of growth voxels in each layer has been counted, the potential growth index P<sub>g</sub> of each WMH lesion is calculated as follows:

$$P_g = \frac{\sum_{i=1}^{l} GV_i w_i}{V_l} \tag{15}$$

where GV<sub>i</sub> is the number of "growth voxels" found in the ith layer, and V<sub>l</sub> is the total number of voxels in all l layers for a WMH.

To demonstrate the potential application, all lesion images were evaluated for their potential growth indices with l set to three (**Figure 14D**).
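The layer generation and Equations (14) and (15) can be sketched together as follows. The function names and the explicit image bounds (`width`, `height`) are assumptions made for illustration, and the layers are grown from the whole mask rather than from an explicit boundary list, which yields the same one-voxel-thick rings.

```python
# Offsets of the four connected neighbors, as in Equation (13).
NEIGHBORS_4 = [(0, -1), (-1, 0), (1, 0), (0, 1)]


def build_layers(mask, n_layers, width, height):
    """Generate n_layers one-voxel-thick layers around a lesion mask."""
    covered = set(mask)            # voxels already in the mask or in a layer
    frontier = set(mask)
    layers = []
    for _ in range(n_layers):
        layer = set()
        for x, y in frontier:
            for dx, dy in NEIGHBORS_4:
                q = (x + dx, y + dy)
                inside = 0 <= q[0] < width and 0 <= q[1] < height
                if inside and q not in covered:
                    layer.add(q)
        covered |= layer
        layers.append(layer)       # plays the role of the E[i] list
        frontier = layer
    return layers


def potential_growth_index(layers, growth):
    """Weighted fraction of growth voxels over all layer voxels, Equation (15)."""
    n = len(layers)
    denom = n * (n + 1) / 2                                  # sum of 1..l
    total_voxels = sum(len(layer) for layer in layers)       # V_l
    weighted = sum(
        len(layer & growth) * (i / denom)                    # GV_i * w_i
        for i, layer in enumerate(layers, start=1)
    )
    return weighted / total_voxels
```

With l = 3 the weights are 1/6, 1/3, and 1/2, so a growth voxel in the outermost layer counts three times as much as one in the innermost layer. For a single-voxel mask, the three layers contain 4, 8, and 12 voxels; one growth voxel in each layer gives P<sub>g</sub> = 1/24.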

### The Relationship Between Potential Growth Index and WMH Shape and Texture Features

The relationship between PGI and WMH shape and texture features was investigated in the study. The K-means algorithm is the most commonly used clustering algorithm in unsupervised learning due to its simplicity and efficiency (Hung et al., 2005), and thus is appropriate for this proof-of-concept development. However, different initial cluster seeds in the K-means algorithm can generate different clustering results. To demonstrate the applicability of the K-means algorithm, we performed 1,000 trials with randomly selected initial cluster seeds from the feature vectors of shape and then texture of the lesions (a total of 993 lesions) to examine the clustering results. For the shape and texture clusters classified as shown in **Figures 7**, **13** above, one-way analyses of variance (ANOVAs) were performed to evaluate whether there were significant differences in potential growth index among the shape or the texture clusters generated in each trial.
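The test statistic behind a one-way ANOVA can be sketched as below, with each group holding the PGI values of the lesions in one cluster. The function name is hypothetical and the sketch stops at the F statistic; obtaining a P value additionally requires the F distribution, which a statistics library routine such as `scipy.stats.f_oneway` provides.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of groups of numbers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For example, the groups [1, 2, 3], [2, 3, 4], and [5, 6, 7] give a between-group sum of squares of 26 and a within-group sum of squares of 6, so F = (26/2)/(6/6) = 13.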

FIGURE 14 | (A) The pseudo-code of seed-based region-growing algorithm of WMH lesions; (B) a WMH lesion mask and the potential growth voxels marked in red color which are identified using the seed-based region-growing algorithm. (C) The pseudo-code of layer generating algorithm for WMH lesions; (D) three one-voxel-thick layers surrounding the WMH lesion, which are used to locate a growth voxel.

Significant growth index differences for all trials were found among both shape (P = 2.04 × 10<sup>−10</sup> to P = 1.06 × 10<sup>−2</sup>) and texture (P < 1 × 10<sup>−40</sup> to P < 1 × 10<sup>−20</sup>) clusters. **Table 1** shows the most conservative results.

### DISCUSSION AND CONCLUSION

In this study, we have developed innovative proof-of-concept methods to quantitatively characterize the shape (based on Zernike transformation) and texture (based on fuzzy logic) of WMH lesions. A multi-dimension feature vector approach based on these new features was used to cluster WMH lesions into distinctive groups to assess whether these features can potentially be used as image biomarkers. We have also developed an approach to calculate the potential growth index (PGI) of WMH lesions using a region-growing algorithm along the WMH boundaries. From preliminary data analyses of six subjects with a total of 993 lesions, we observed significant differences in PGI among the clustered WMH groups in terms of either the shape or the texture features. These findings, even though only a proof of concept, suggest that the shape and texture features of WMH can potentially be used as new imaging biomarkers to predict lesion growth in brain aging, vascular dementia, or AD.

This work demonstrates the feasibility and potential usefulness of our methods. However, there are several limitations, which are beyond the scope of this study to address completely. In WMH lesion segmentation, we adopted the mid-range point of 0.5 of the lesion probability map as the cut-off threshold suggested by the authors of the LST Toolbox. This threshold appears logical for a wide range of populations. However, different thresholds might be suitable for different study populations. Maldjian et al. (2013) suggested using a 0.25 threshold in their study on older adults with diabetes. A change of segmentation threshold may introduce small changes in the quantification of lesion sizes, which is not the focus of our work. The effects of changing the segmentation threshold on characterizations of the WMH shape, texture, and potential growth should be further studied.

*Results presented are the most conservative case from 1,000 trials with randomly selected initial cluster seeds from the feature vectors of the shape and texture of the white matter lesions.*

In shape feature extraction, all images were proportionally scaled to the same size of 60 × 60 voxels. This scaling procedure blurred the shape contours of small-size images and lost the contour details of large-size images. This one-size-fits-all scaling treatment can lead to quantification inaccuracy at higher orders of ZM. Nevertheless, we have observed that high ZM orders are not required to represent primary WMH shape features. In this study, we limited the feature characteristics to ZM orders ≤5. Thus, the scaling factor used in this study should have minimal effects on the shape feature extraction results. When there is a large number of WMH lesions, a more appropriate procedure in shape analysis can be applied to reduce the influence of this image scaling issue on shape feature extraction. Specifically, WMH lesion images can first be divided into several groups based on size and then scaled appropriately based on their corresponding size groups. Shape feature analyses can then be carried out in each size group. It should also be mentioned that image shape feature extraction using the Zernike transform is, in theory, independent of the sizes of the images to be analyzed (Teague, 1980). The purpose of image scaling in this study was to improve computational efficiency.

In texture analysis, a linear fuzzy logic method was proposed to quantize the distribution of voxel signal intensity in a lesion image. This approach is robust in handling the potential quantization error due to imaging noise (Gwo and Wei, 2013). We have chosen a linear approach in fuzzy logic and a number of bins that appeared to work well on our data. However, we have not devised a method to systematically obtain an optimal bin number or type of linear or non-linear fuzzy logic function, which needs to be investigated in studies with large sample sizes. For both shape and texture analyses, we have selected the feature dimensions that appeared reasonable for the dataset of this study. However, selecting appropriate feature dimensions and cluster numbers is still a challenging problem in the field of pattern recognition (Steinbach et al., 2004). A common approach is data-driven trial and error. For a large dataset, supervised machine learning via an artificial neural network might lead to identification of optimized feature dimensions as well as the number of group clusters (Raschka, 2015).

PGI was developed to explore the possibility of predicting WMH progression by quantification of image characteristics of WMH penumbras (Maillard et al., 2014). To do this, multiple layers surrounding a lesion mask were used to calculate PGI with a linear weighted function based on the layer locations of the "growth voxels." The choice of a linear weighted function is consistent with the probable locations of WMH lesion development found in recent studies (Maillard et al., 2014; Promjunyakul et al., 2016). In our study, we used three layers surrounding the WMH lesions to demonstrate the potential growth. A large dataset with repeated measures in longitudinal studies is needed to identify a more appropriate number of layers and devise an optimal weighting function.

We are fully aware that WMH lesion growth is likely affected by multiple factors besides the shape, texture, and PGI. In this regard, the potential effects of anatomical locations of WMH on its progression rate have been investigated in prior works (DeCarli et al., 2005; Wardlaw et al., 2013). Identification of other key contributors to WMH growth and the underlying biological mechanisms is warranted in future studies.

The objective of this paper is to formulate concepts and to demonstrate the feasibility of the methods used to analyze the WMH shape, texture, and potential growth. To accomplish this objective, we selected high-quality T<sub>2</sub> FLAIR images which contain a large number of lesions with various sizes from six subjects as sample cases to develop our theoretical framework. While only a small number of subjects was used in this study, a relatively large number of lesions (a total of 993) was used in our development and analyses. Nevertheless, the algorithms and parameters used for texture feature extraction and potential growth index estimation in this work were chosen empirically by trial and error, or were optimized based on the relatively small dataset. The algorithms and parameters used in this work need to be optimized based on larger datasets covering various types of lesions in future studies. Currently, we are working on the application of our methods to over 500 subjects with more than 2 years of brain imaging data from the ADNI (Alzheimer's Disease Neuroimaging Initiative) database.

Lastly, because 2D T<sub>2</sub> FLAIR images are widely available in clinical practice and research, we decided to develop our concept in 2D T<sub>2</sub> FLAIR first. We have also begun to expand our work to 3D imaging to capture the lesion shape, texture, and potential growth in all spatial directions, which benefits from the recent development in 3D high-resolution T<sub>2</sub> FLAIR acquisition techniques (Wiggermann et al., 2016).

In summary, our work demonstrated the concept and feasibility of quantitatively characterizing the shape and texture features of WMH lesions observed on T<sub>2</sub> FLAIR images, and showed that these features are related to the potential growth index of white matter lesions. Future studies of large datasets and longitudinal studies based on the systematic framework proposed in this study are warranted to further optimize the algorithms and parameters used for white matter lesion shape and texture feature extraction and classification as well as PGI estimation. Furthermore, our approaches for image feature extraction and classification can potentially be generalized to other types of brain lesions and imaging modalities.

### AUTHOR CONTRIBUTIONS

C-YG conducted experiments, wrote code to analyze the data, interpreted the data, and wrote the manuscript. DZ prepared brain images and lesion segmentation. DZ and RZ interpreted the data, participated in the scientific discussions, and provided critical insights. All authors reviewed the manuscript and approved it for publication.

### REFERENCES


along a continuum of injury in the aging brain. Stroke 45, 1721–1726. doi: 10.1161/STROKEAHA.113.004084


Raschka, S. (2015). Python Machine Learning. Birmingham: Packt Publishing.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Gwo, Zhu and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Nested Dilation Networks for Brain Tumor Segmentation Based on Magnetic Resonance Imaging

Liansheng Wang<sup>1,2</sup>, Shuxin Wang<sup>2</sup>, Rongzhen Chen<sup>2</sup>, Xiaobo Qu<sup>3</sup>, Yiping Chen<sup>1,2</sup>, Shaohui Huang<sup>2</sup>\* and Changhua Liu<sup>4</sup>\*

*<sup>1</sup> Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, Xiamen, China, <sup>2</sup> Department of Computer Science, School of Information Science and Engineering, Xiamen University, Xiamen, China, <sup>3</sup> Department of Electronic Science, Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, School of Electronic Science and Engineering (National Model Microelectronics College), Xiamen University, Xiamen, China, <sup>4</sup> Department of Medical Imaging, Chenggong Hospital Affiliated to Xiamen University, Xiamen, China*

Aim: Brain tumors are among the most fatal cancers worldwide. Diagnosing and manually segmenting tumors are time-consuming clinical tasks, and success strongly depends on the doctor's experience. Automatic quantitative analysis and accurate segmentation of brain tumors are greatly needed for cancer diagnosis.

#### Edited by:

*Tong Tong, Independent Researcher, Fuzhou, China*

#### Reviewed by:

*Xiaohua Qian, Shanghai Jiao Tong University, China Dong Ni, Shenzhen University, China*

#### \*Correspondence:

*Shaohui Huang hsh@xmu.edu.cn Changhua Liu liuxingc@126.com*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *15 January 2019* Accepted: *11 March 2019* Published: *05 April 2019*

#### Citation:

*Wang L, Wang S, Chen R, Qu X, Chen Y, Huang S and Liu C (2019) Nested Dilation Networks for Brain Tumor Segmentation Based on Magnetic Resonance Imaging. Front. Neurosci. 13:285. doi: 10.3389/fnins.2019.00285*

Methods: This paper presents an advanced three-dimensional multimodal segmentation algorithm called nested dilation networks (NDNs). It is inspired by the U-Net architecture, a convolutional neural network (CNN) developed for biomedical image segmentation, and is modified to achieve better performance for brain tumor segmentation. Thus, we propose residual blocks nested with dilations (RnD) in the encoding part to enrich the low-level features and use squeeze-and-excitation (SE) blocks in both the encoding and decoding parts to boost significant features. To prove the reliability of the network structure, we compare our results with those of the standard U-Net and its transmutation networks. Different loss functions are considered to cope with class imbalance problems to maximize the brain tumor segmentation results. A cascade training strategy is employed to run NDNs for coarse-to-fine tumor segmentation. This strategy decomposes the multiclass segmentation problem into three binary segmentation problems and trains each task sequentially. Various augmentation techniques are utilized to increase the diversity of the data to avoid overfitting.

Results: This approach achieves Dice similarity scores of 0.6652, 0.5880, and 0.6682 for edema, non-enhancing tumors, and enhancing tumors, respectively, in which the Dice loss is used for single-pass training. After cascade training, the Dice similarity scores rise to 0.7043, 0.5889, and 0.7206, respectively.

Conclusion: Experiments show that the proposed deep learning algorithm outperforms other U-Net transmutation networks for brain tumor segmentation. Moreover, applying cascade training to NDNs facilitates better performance than other methods. The findings of this study provide considerable insight into the automatic and accurate segmentation of brain tumors.

Keywords: brain tumor segmentation, nested dilation networks, residual blocks nested with dilations, squeeze-and-excitation blocks, coarse-to-fine

### 1. INTRODUCTION

Brain tumors are one of the deadliest cancers worldwide. Gliomas are the most common primary craniocerebral tumor and are caused by the carcinogenesis of glial cells in the brain and spinal cord (Bauer et al., 2013). In pathology, gliomas can be classified as low-grade or high-grade according to the malignant degree of the tumor cells (Cho and Park, 2017; Wang et al., 2018b). Low-grade gliomas are mainly represented by low-speed cell division and proliferation, whereas high-grade gliomas are characterized by rapid cell division and proliferation accompanied by angiogenesis, hypoxia, and necrosis (Gerlee and Nelander, 2012; Bogdańska et al., 2017). Although significant advances have been made in healthcare so far, the vast majority of gliomas are incurable, except for a small number of low-grade gliomas, which can be completely resected surgically. Gliomas can be further divided into different tumor sub-regions according to the severity of the tumor cells, such as edemas, non-enhancing tumors, and enhancing tumors. Magnetic resonance imaging (MRI) is the most frequently used and most effective noninvasive auxiliary diagnostic tool (Wen et al., 2010; Yang et al., 2018), providing a reference for the formulation of treatment programs (Mazzara et al., 2004). Brain tumors are usually imaged with different MRI modalities, and these images are interpreted by image analysis methods (Bauer et al., 2013). The MRI sequence usually includes four different modalities: T1-weighted, T2-weighted, post-contrast T1-weighted, and fluid-attenuated inversion-recovery (FLAIR). Different MRI modalities are employed for different diagnosis tasks in clinical diagnosis and treatment. However, it is still a daunting task for clinicians to diagnose diseases with MRI, because there is a wide variation in the size, shape, regularity, location, and heterogeneous appearance of brain tumors (Dong et al., 2017). Therefore, automatic quantitative analysis and accurate segmentation of brain tumors are greatly needed clinically to help doctors make accurate diagnoses.

CNNs have become a prominent deep learning method and have been used to make a series of breakthroughs in different tasks, including computer vision (Krizhevsky et al., 2012; Long et al., 2015; Ren et al., 2015). The success of CNNs is credited to their ability to independently learn deep features instead of relying on manual features. With historical opportunities provided by a strong calculation capability and large numbers of annotations, the development of CNNs has been explosive. The original LeNet5 (LeCun et al., 1998) was proposed in 1998 with five layers, establishing the modern structure of CNNs. Krizhevsky et al. (2012) presented a classical CNN structure called "AlexNet" and made a historic breakthrough. The great success of AlexNet stimulated new research on CNNs. ZFNet (Zeiler and Fergus, 2014), VGGNet (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), and ResNet (He et al., 2016) were successively presented with more layers and better performances. Huang et al. (2017) used a more radical dense connection mechanism to maximize the flow of information. Hu et al. (2017) proposed an SE network that modeled the interdependencies between feature channels, adaptively learning important information. All of these CNN studies made it possible to apply neural networks to medical image processing.

Recent reports have shown that CNNs outperform previous state-of-the-art methods in medical image analysis (Li et al., 2017; Lin et al., 2018). MRI-based brain tumor segmentation is a task that still requires extensive attention. Extant methods for automatic brain tumor segmentation are diverse. DeepMedic (Kamnitsas et al., 2016b) was designed as a dual-pathway three-dimensional (3D) network with 11 layers, to simultaneously process images at different scales and combine the results with fully connected layers. Kamnitsas et al. (2016a) and Castillo et al. (2017) further improved the architecture of DeepMedic by adding residual connections and parallel pathways. U-Net (Ronneberger et al., 2015) was proposed to train an end-to-end network with few images for the accurate segmentation of biomedical images. Many architectures similar to U-Net have been widely adopted for brain tumor segmentation. Kayalibay et al. (2017) and Isensee et al. (2017) employed deep supervision by combining segmentation layers from different levels in the localization pathway. Iqbal et al. (2018) increased the number of U-Net layers and trained the network with the Dice loss. Le and Pham (2018) used the U-Net architecture to extract features and put them into an ExtraTrees classifier. Zhao et al. (2018) integrated a fully convolutional neural network (FCNN) and a conditional random field (CRF) and trained three models using two-dimensional (2D) image patches obtained from axial, coronal, and sagittal views. A voting-based fusion strategy was used to obtain segmentation results. To deal with the class imbalance problem, Wang et al. (2017) proposed a triple-cascaded framework for brain tumor segmentation. Three similar networks were used to sequentially segment the entire tumor (all lesions, including edema, non-enhancing tumors, and enhancing tumors), the tumor core (all lesions except edema), and the enhancing tumor core. Zhou et al. (2018) drew upon coarse-to-fine medical image segmentation methods and proposed a single multitask CNN that could learn correlations between different categories. Partial model parameters can be shared when different tasks are being trained according to different sets of training data to utilize the underlying correlation among classes.

We propose a CNN-based 3D segmentation algorithm, the NDN, which can handle multimodal images. Instead of simple convolution layers, residual blocks are stacked in the U-Net architecture to simplify optimization. The SE blocks used in NDNs fuse the global information and adaptively learn important information from each channel. A new block, the residual block nested with dilations (RnD), enlarges the receptive fields and avoids the gridding effect. RnD blocks can enrich information in shallow layers by using dilated convolutions while retaining detailed information during the rapid expansion of receptive fields by using residual connections. The cascade training strategy is adopted to train three tasks individually to deal with the class imbalance problem.

### 2. MATERIALS AND METHODS

This section describes the proposed NDNs algorithm for detailed brain tumor segmentation, including the data preprocessing, network architecture, training strategy, and post-processing methods. We also concisely describe the experimental design.

### 2.1. Data Acquisition and Preprocessing

### 2.1.1. Data Acquisition

Most of the data used in this work were downloaded from the Medical Segmentation Decathlon (MSD) organized by the 21st Annual Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2018. A small number of low-grade glioma cases were obtained from MICCAI's Multimodal Brain Tumor Segmentation (BraTS) Challenge of the same year. These are used to test the stability of the proposed algorithm. The images for each patient comprise four scanning sequences: T1-weighted, T2-weighted, post-contrast T1-weighted, and FLAIR. Every scan is aligned to the same anatomical template space and interpolated to 1 × 1 × 1 mm<sup>3</sup> with an image size of 240 × 240 × 155 voxels. The purpose of the study is to segment brain tumors (i.e., gliomas) into three different classes: edemas, non-enhancing tumors, and enhancing tumors. All data are labeled and verified by an expert human rater. Efforts were made to mimic the accuracy required for clinical use.

### 2.1.2. Data Preprocessing and Augmentation

Training an effective neural network typically requires thousands or even tens of thousands of samples. However, the quantity of available medical images is usually well short of that. To avoid overfitting, more training data need to be generated from the limited images and annotations. Our method applies the following data augmentation techniques to make reasonable changes to the images: flipping along the x-, y-, or z-axis with a probability of 50%; rotating the images by an angle between −15° and 15°; applying gamma correction with the gamma value varied randomly from 0.4 to 1.6; and applying elastic distortion. **Figure 1** shows the data augmentation.
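As an illustration, two of the augmentations listed above (random flipping and gamma correction) can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, each axis is assumed to be flipped independently, the gamma value is assumed to be drawn uniformly, and rotation and elastic distortion are omitted.

```python
import numpy as np

def random_flip(volume, label, rng, p=0.5):
    """Flip image and label together along each spatial axis with probability p."""
    for axis in range(3):
        if rng.random() < p:  # assumption: axes are flipped independently
            volume = np.flip(volume, axis=axis)
            label = np.flip(label, axis=axis)
    return volume, label

def random_gamma(volume, rng, low=0.4, high=1.6):
    """Gamma correction with gamma drawn uniformly from [low, high]."""
    gamma = rng.uniform(low, high)
    vmin, vmax = volume.min(), volume.max()
    scaled = (volume - vmin) / (vmax - vmin + 1e-8)  # rescale to [0, 1]
    return scaled ** gamma * (vmax - vmin) + vmin    # map back to the input range

rng = np.random.default_rng(0)
image = rng.random((8, 8, 4))              # toy single-modality volume (X, Y, Z)
mask = (image > 0.5).astype(np.uint8)      # toy annotation
aug_image, aug_mask = random_flip(image, mask, rng)
aug_image = random_gamma(aug_image, rng)
```

The label volume is flipped with the image so that annotations stay aligned, whereas gamma correction changes intensities only and therefore leaves the label untouched.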

Images from multiple modalities may have varying intensity ranges. When the intensity values are not standardized, it is detrimental to the training of the neural network. Normalization is critical to allow images from different modalities to be trained with one algorithm. In our study, each modality is normalized individually by subtracting the mean from the value for each patient and dividing it by the standard deviation. The useless black borders in the images along the x- and y-axes are also removed. On the z-axis, we note that the head and tail of the image slices are uninformative. Therefore, 70% of the slices used for the network input are captured from the middle.
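The per-modality normalization and the central-slice selection described above can be sketched as follows. The sketch assumes that the statistics are computed over the whole volume of one patient and that the retained 70% of slices is centered on the z-axis; both function names are hypothetical.

```python
import numpy as np

def normalize_modality(volume):
    """Z-score one modality of one patient: subtract the mean, divide by the std."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)

def central_slices(volume, fraction=0.7):
    """Keep the central `fraction` of slices along the last (z) axis."""
    depth = volume.shape[-1]
    keep = int(round(depth * fraction))
    start = (depth - keep) // 2
    return volume[..., start:start + keep]

rng = np.random.default_rng(0)
scan = rng.normal(loc=300.0, scale=50.0, size=(24, 24, 155))  # toy volume, 155 slices
normalized = normalize_modality(scan)
cropped = central_slices(normalized)
```

Applying `normalize_modality` separately to each of the four sequences puts them on a comparable intensity scale before they are stacked as input channels.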

### 2.2. Residual Blocks

He et al. (2016) reformulated the layers as residual blocks and achieved outstanding results in the 2015 ImageNet competition. Instead of simply stacking convolution layers to fit a desired underlying mapping, they added an identity mapping, which is easier to optimize. The residual blocks depicted in **Figure 2A** are realized by a shortcut connection and an element-wise addition, performed channel-by-channel on the input and output feature maps of the blocks. The operating principle of the residual blocks can be defined as

$$\mathbf{y} = F(\mathbf{x}, W_i) + \mathbf{x},\tag{1}$$

where x and y are the input and output vectors of the relevant layers, and F(x, W<sub>i</sub>) is the mapping function for the residual path. The output of F(x, W<sub>i</sub>) should have the same dimensions as x. Otherwise, a linear mapping can be applied on the shortcut connection (**Figure 2B**). The identity shortcut does not add parameters or computations to the network, but it greatly increases the training speed of the model and improves the training effect.
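Equation (1) and the linear mapping on the shortcut can be illustrated with a small NumPy sketch. Here the residual path F is reduced to two channel-mixing matrix products with a ReLU in between, which stand in for the convolution layers purely for illustration; the function and parameter names are hypothetical.

```python
import numpy as np

def residual_block(x, w1, w2, projection=None):
    """Compute y = F(x, W_i) + x on a (X, Y, Z, C) feature map.

    F is simplified to two channel-mixing layers with a ReLU in between.
    If F changes the channel count, `projection` maps the shortcut linearly.
    """
    residual = np.maximum(x @ w1, 0.0) @ w2                  # residual path F(x, W_i)
    shortcut = x if projection is None else x @ projection  # identity or linear map
    return residual + shortcut

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 4, 8))
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
y_same = residual_block(x, w1, w2)                 # identity shortcut

w2_wide = rng.normal(size=(8, 16)) * 0.1
proj = rng.normal(size=(8, 16)) * 0.1
y_wide = residual_block(x, w1, w2_wide, projection=proj)  # projected shortcut
```

When all residual weights are zero the block returns its input unchanged, which is the property that makes the identity mapping easy to optimize.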

The standard convolutional layers of a U-Net are replaced by the residual structure shown in **Figure 2A**. The residual path comprises two convolution layers with a kernel size of 3, followed by a batch normalization (BN) operation (Ioffe and Szegedy, 2015) and a rectified linear unit (ReLU). The input and output of the residual path are added element by element. The results of the residual blocks are fed directly into subsequent network layers.

### 2.3. SE Blocks

Much recent research has aimed to strengthen the learning power of CNNs and to improve their performance. Hu et al. (2017) introduced SE blocks to enhance the representations of features produced by a convolutional network. SE blocks embed the global spatial information into the channel vector by encoding the channel dependencies with fully connected operations. This allows the network to pay different amounts of attention to each channel according to the importance of the feature maps. **Figure 3A** illustrates the structure of SE blocks. The features are first passed through a squeeze operation, achieved by a global average pooling layer, to aggregate global information per channel for the whole image. Then, the outputs are fed into an excitation operation to get the final weights for each channel. The excitation operation is achieved by using two fully connected layers: one with ReLU activation and another with a sigmoid. Finally, the weight vectors are reshaped to (1, 1, 1, C), where C is the number of feature maps, and are applied to each feature map by multiplication. Like an attention mechanism, the SE blocks use these weights to emphasize useful features and suppress useless ones.

SE blocks have a simple structure and can be used directly in existing state-of-the-art architectures. We draw on experience with the attention mechanism and nest SE blocks in the architecture to help the network focus on important feature maps. As shown in **Figure 3B**, feature maps with size (X, Y, Z, C) are fed into SE blocks. The blocks then generate an importance coefficient for each channel and produce outputs that are weighted per channel and have the same size as the inputs.
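The squeeze, excitation, and rescaling steps can be sketched as a NumPy forward pass. This is a minimal sketch under stated assumptions: the hidden width of the first fully connected layer (`reduced`) is a hypothetical choice that the text does not specify, and the weights are random rather than learned.

```python
import numpy as np

def se_block(features, w1, b1, w2, b2):
    """Squeeze-and-excitation forward pass on a (X, Y, Z, C) feature map."""
    squeezed = features.mean(axis=(0, 1, 2))              # squeeze: global average pool -> (C,)
    hidden = np.maximum(squeezed @ w1 + b1, 0.0)          # first FC layer with ReLU
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # second FC layer with sigmoid -> (C,)
    return features * weights.reshape(1, 1, 1, -1)        # rescale every channel

rng = np.random.default_rng(0)
C, reduced = 8, 2                                         # `reduced` is an assumed bottleneck width
features = rng.normal(size=(4, 4, 4, C))
w1, b1 = rng.normal(size=(C, reduced)), np.zeros(reduced)
w2, b2 = rng.normal(size=(reduced, C)), np.zeros(C)
out = se_block(features, w1, b1, w2, b2)
```

Because the sigmoid outputs lie between 0 and 1, each channel is attenuated by its own coefficient while the output keeps the shape of the input.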

### 2.4. RnD Blocks

The traditional up-sampling and down-sampling structures lead to a loss of internal structure, and the information of small objects cannot be reconstructed. To solve this problem, Yu and Koltun (2015) presented a model with dilated convolutions that can increase the receptive fields without reducing the resolution or increasing the parameters. Chen et al. (2014, 2017, 2018) used dilated convolutions in their networks and achieved good performance for dense prediction tasks. However, standard dilated convolution causes a gridding issue that harms small objects. Wang et al. (2018a) proposed a hybrid dilated convolution (HDC) framework, which can not only expand receptive fields but also mitigate the gridding issue. Implementing the HDC framework requires two conditions to be met. First, the dilation rates of a group of dilated convolutions should not have a common divisor > 1. The maximum distance between two nonzero values is defined as follows:

$$M_i = \max[M_{i+1} - 2r_i, M_{i+1} - 2(M_{i+1} - r_i), r_i], \tag{2}$$

where r<sub>i</sub> is the dilation rate in layer i, and M<sub>i</sub> is the maximum distance between two nonzero values in layer i. The second condition requires satisfying M<sub>i</sub> < K, where K is the kernel size.
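The two conditions can be checked numerically for a candidate set of dilation rates. The sketch below evaluates Equation (2) backwards from the last layer. The base case M<sub>n</sub> = r<sub>n</sub> and the choice to test the second-layer value M<sub>2</sub> against K follow the usual statement of the HDC framework and are assumptions here, since this section does not spell them out; the function names are hypothetical.

```python
from functools import reduce
from math import gcd

def max_gap(rates):
    """Return [M_1, ..., M_n] from Eq. (2), assuming the base case M_n = r_n."""
    m = [0] * len(rates)
    m[-1] = rates[-1]
    for i in range(len(rates) - 2, -1, -1):
        nxt, r = m[i + 1], rates[i]
        m[i] = max(nxt - 2 * r, nxt - 2 * (nxt - r), r)
    return m

def no_common_divisor(rates):
    """First condition: the rates share no common divisor greater than 1."""
    return reduce(gcd, rates) == 1

def satisfies_hdc(rates, kernel_size):
    """Both conditions, with the second one tested on M_2 (an assumption)."""
    return no_common_divisor(rates) and max_gap(rates)[1] <= kernel_size

print(max_gap([1, 2, 5]), satisfies_hdc([1, 2, 5], 3))
print(max_gap([2, 4, 8]), satisfies_hdc([2, 4, 8], 3))
```

For the rates (1, 2, 5) with K = 3, the recursion gives M<sub>2</sub> = 2, so both conditions hold; the rates (2, 4, 8) fail the first condition because they share the divisor 2.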

The standard U-Net architecture does not get enough semantic information in the shallow layers because of the limited receptive fields. This is harmful to feature fusion in the first few cross-layer connections. To resolve this issue and avoid the gridding effect, we draw on an idea from the HDC framework. RnD blocks (**Figure 4**) are built to enlarge receptive fields in the first two layers of the network. This new type of block can obtain more extensive local information via 3 convolution layers with different dilation rates (e.g., 1, 2, 5). The kernel size is 3 for all dilated convolutions, which are followed by a ReLU activation. The residual connection in RnD blocks helps retain information and fill details during the rapid expansion of receptive fields.
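The gridding effect can be made concrete by counting which input positions contribute to one output position after a stack of dilated convolutions. The following sketch does this along a single axis for kernel size 3; it is an illustration of the receptive-field argument only, with hypothetical function names, and does not model the residual connection.

```python
def covered_offsets(rates, kernel_size=3):
    """Input offsets (along one axis) that reach one output position."""
    half = kernel_size // 2
    offsets = {0}
    for r in rates:
        # each layer samples its input at -r, 0, +r for kernel size 3
        offsets = {o + k * r for o in offsets for k in range(-half, half + 1)}
    return offsets

def coverage(rates, kernel_size=3):
    """Return (number of covered positions, width of the receptive field)."""
    offsets = covered_offsets(rates, kernel_size)
    span = max(offsets) - min(offsets) + 1
    return len(offsets), span

hybrid = coverage([1, 2, 5])   # rates used as the example in the text
uniform = coverage([2, 2, 2])  # a single repeated rate, for contrast
```

With rates (1, 2, 5) the receptive field spans 17 positions and all 17 are covered, whereas three layers with rate 2 span 13 positions but touch only the 7 even offsets, which is the gridding pattern that the RnD block is designed to avoid.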

### 2.5. NDNs

The structure of our proposed NDNs is shown in **Figure 5**. The architecture is inspired by U-Net, which is a stable encoder–decoder network designed for limited data training, especially for medical images. Here, we carefully modify the standard U-Net to make it perform better for the brain tumor segmentation task. First, we use 3D convolution layers rather than 2D ones to accommodate images from multiple modalities. The classic encoder–decoder structure that fuses the lower features in the shallow layers and higher features in the deep layers is retained to ensure the stability of the proposed network. The architecture comprises three max-pooling layers to capture context and three up-sampling layers to enable precise localization. To obtain sufficiently large receptive fields, the first two encoder modules adopt RnD blocks to enrich the low-level features. Each is followed by an SE block and a max-pooling layer. In the decoder part, each module comprises a stack of residual blocks, an SE block, and an up-sampling layer. BN is employed immediately after each convolution and before activation. As shown in **Figure 5**, the network can obtain rich information to boost essential features and achieve a stable effect.

### 2.6. Cascade Training

The cascade strategy trains a separate model for each category in sequence and has shown good results. Coarse-to-fine medical image segmentation is becoming increasingly popular because of the class imbalance problem. Cascaded models decompose complex problems into simple ones and capitalize on the hierarchical structure of tumor sub-regions. A single model is trained repeatedly to segment substructures of brain tumors hierarchically and sequentially. Each stage is handled as a binary segmentation problem. The first task is to segment the entire tumor, including edemas, enhancing tumors, and non-enhancing tumors. These three classes are merged into one foreground class, so the task becomes a binary segmentation problem. Then, NDNs are trained to crop the target. After the first stage of training, the entire tumor region is segmented in the 3D volumes of a patient. A cuboid sub-region, based on the entire tumor, is used as input to the network to segment the enhancing and non-enhancing tumors together. Similarly, the third stage differentiates enhancing tumors from non-enhancing ones by using the cuboid sub-region produced by the second stage as input.

In the training, the input of the network is generated based on the ground truth, as shown in **Figure 6A**. In the testing, the results of the previous stage are extended by 32 pixels on the x- and y-axes, and 8 slices on the z-axis as the input for the next stage. The process is described in **Figure 6B**. Finally, we integrate the three binary segmentation tasks to obtain the final segmentation results of multiple classes. Cascade training offers a way to adaptively alleviate the class imbalance problem of brain tumor segmentation.
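The test-time cropping and the final integration step can be sketched as follows. The sketch assumes that the margin is added on each side of the bounding box and clipped to the volume, that each stage outputs a non-empty boolean mask, and that the three masks are merged by nesting; the function names and the integer label values are hypothetical.

```python
import numpy as np

def extended_bbox(mask, margin_xy=32, margin_z=8):
    """Bounding box of a non-empty binary mask, padded and clipped to the volume."""
    coords = np.argwhere(mask)
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1
    margin = np.array([margin_xy, margin_xy, margin_z])
    lo = np.maximum(lo - margin, 0)
    hi = np.minimum(hi + margin, mask.shape)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))

def merge_cascade(whole, core, enhancing):
    """Combine three nested boolean masks into one label map.

    Assumed labels: 0 background, 1 edema, 2 non-enhancing, 3 enhancing.
    """
    labels = np.zeros(whole.shape, dtype=np.uint8)
    labels[whole] = 1
    labels[whole & core] = 2
    labels[whole & core & enhancing] = 3
    return labels

stage1 = np.zeros((240, 240, 155), dtype=bool)   # toy whole-tumor prediction
stage1[100:140, 90:150, 70:80] = True
box = extended_bbox(stage1)
crop = stage1[box]                               # input region for the next stage
```

Restricting later stages to the cropped cuboid removes most background voxels, which is how the cascade reduces the imbalance between the small tumor sub-regions and the rest of the volume.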

### 2.7. Post-processing

Post-processing is further used to improve the segmentation results of NDNs. During data processing, we noticed that the brain tumor of every patient formed a single connected component in the 3D volume. Thus, isolated small clusters should be removed from the results. More specifically, connected-component analysis is performed to retain the largest region and remove the smaller clusters to better fit the ground truth. Moreover, some patients are observed to have benign tumors, which means that the gliomas only comprise edemas and non-enhancing tumors. In these cases, some small clusters are erroneously classified as enhancing tumors, which harms the segmentation results. To deal with this issue, we impose volumetric constraints by removing enhancing tumor clusters in the segmentation that are smaller than a predefined threshold.
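Both post-processing rules rely on labeling connected components. A dependency-light sketch is given below; it assumes 6-connectivity, and the volume threshold passed to `remove_small` is a free parameter because the predefined value is not stated in the text. All function names are hypothetical.

```python
from collections import deque
import numpy as np

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def label_components(mask):
    """Label 6-connected components of a 3D binary mask; returns (labels, sizes)."""
    labels = np.zeros(mask.shape, dtype=np.int32)
    sizes = []
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue                      # voxel already belongs to a component
        current = len(sizes) + 1
        labels[seed] = current
        queue, count = deque([seed]), 0
        while queue:                      # breadth-first flood fill
            x, y, z = queue.popleft()
            count += 1
            for dx, dy, dz in STEPS:
                n = (x + dx, y + dy, z + dz)
                inside = all(0 <= n[i] < mask.shape[i] for i in range(3))
                if inside and mask[n] and not labels[n]:
                    labels[n] = current
                    queue.append(n)
        sizes.append(count)
    return labels, sizes

def keep_largest(mask):
    """Retain only the largest connected component."""
    labels, sizes = label_components(mask)
    if not sizes:
        return mask.astype(bool)
    return labels == int(np.argmax(sizes)) + 1

def remove_small(mask, min_voxels):
    """Drop components with fewer than `min_voxels` voxels."""
    labels, sizes = label_components(mask)
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_voxels]
    return np.isin(labels, keep)

pred = np.zeros((6, 6, 6), dtype=bool)
pred[0:2, 0:2, 0:2] = True    # main cluster, 8 voxels
pred[4, 4, 4] = True          # isolated false positive
cleaned = keep_largest(pred)
```

For full-size volumes, a library routine such as `scipy.ndimage.label` would be a faster way to obtain the same labeling.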

### 2.8. Dice Similarity Score

In our work, the Dice similarity score is calculated for quantitative evaluation. This performance metric measures the similarity between the ground truth and predicted results. The Dice similarity score is defined as follows:

$$DSC = \frac{2TP}{\left(FP + 2TP + FN\right)},\tag{3}$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.
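Equation (3) translates directly into code for a pair of binary masks. In this minimal sketch the function name is hypothetical, and returning 1.0 when both masks are empty is a convention chosen here rather than something stated in the text.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity score of Eq. (3) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)     # true positives
    fp = np.sum(pred & ~truth)    # false positives
    fn = np.sum(~pred & truth)    # false negatives
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 1.0  # assumed value for two empty masks

truth = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
score = dice_score(pred, truth)   # TP = 2, FP = 1, FN = 1
```

In this toy case the score is 2·2 / (1 + 2·2 + 1) = 2/3.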

MSD and BraTS 2018 provide three different tumor regions that can be described as edemas, enhancing tumors, and non-enhancing tumors. The Dice similarity scores are calculated for each tumor region to evaluate the segmentation results, and the scores are compared with those of other methods.

### 2.9. Experimental Design

We conduct three groups of experiments according to different requirements, which we describe in this section.

Experiment 1: We explored the effects of different network structures on brain tumor segmentation. Ronneberger et al. (2015) developed a U-Net architecture based on the fully convolutional network (FCN) (Long et al., 2015), which can work with very few training images and yield more precise segmentation. Some new architectures derived from U-Net have appeared and have been applied to the field of medical image processing. In Experiment 1, the standard Conv + BN + ReLU module in U-Net was replaced by frequently used blocks, such as residual blocks and dense blocks separately for comparison with the proposed NDNs.

Experiment 2: Different loss functions were attempted with NDNs to improve segmentation results. The loss function quantifies the amount by which the predicted value deviates from the actual value. Choosing a suitable loss function benefits both the training process and the results of brain tumor segmentation. In Experiment 2, different loss functions were applied to the brain tumor segmentation task: the categorical cross-entropy loss, Dice loss, and focal loss. The Dice similarity scores are calculated for each task.

Experiment 3: The proposed method was compared with other state-of-the-art methods. We implemented several previously published algorithms and trained the networks with the same datasets. The brain tumors comprise edemas, enhancing tumors, and non-enhancing tumors with very different volumes, resulting in an imbalanced number of samples in each class. This category imbalance problem impairs the performance of a deep network. In Experiment 3, a cascade strategy was used to train NDNs, which decomposed a multiclass classification problem into multiple binary classification problems. The segmentation results of the cascaded NDNs were compared with several state-of-the-art methods according to the Dice similarity score.

### 2.10. Implementation Details

All networks were implemented in Keras (Chollet et al., 2015) 2.1.2 using the TensorFlow (Abadi et al., 2016) 1.4.0 backend. Adaptive moment estimation (Adam; Kingma and Ba, 2014) was used as the optimizer with an initial learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.00001. Training was run on an NVIDIA 1080 Ti GPU with CUDA 8.0 for 300 epochs. We did not use dropout (Hinton et al., 2012) but rather L2 regularization and BN for the whole network structure. We cropped 96 × 96 × 48 patches close to the ground truth from the images and annotations as inputs. All networks were trained from scratch with a batch size of 4.

### 3. EXPERIMENTS AND RESULTS

In this section, we explain the advantages of the proposed algorithm with regard to brain tumor segmentation. The Dice similarity score is adopted as the evaluation criterion for each model. Edemas, non-enhancing tumors, and enhancing tumors were trained together with single NDNs in Experiments 1 and 2 for the sake of fairness. In Experiment 3, however, the cascade training strategy was used to train NDNs for each class, which was then compared with the state-of-the-art methods.

### 3.1. Experiment 1

To prove the effectiveness of the NDNs structure, different U-Net-like networks were trained with the same brain tumor dataset for comparison. A traditional 3D U-Net with three down-sampling layers and three symmetric up-sampling layers was trained first. It consisted of two convolution layers used repeatedly with a kernel size of 3, similar to the standard 2D U-Net structure presented by Ronneberger et al. (2015). The filter number was doubled at the end of each down-sampling layer and halved after each up-sampling layer. Then, the repeated convolution layers were replaced by residual blocks and dense blocks to be trained. ResNet18, ResNet50, and ResNet101 were each employed as an encoder path, and the decoder path was consistent with the expanding path in 3D U-Net. For the dense U-Net, dense blocks were used as substitutes for the two repeated convolution layers, and each dense block had four densely connected convolution layers. Finally, we studied the effect of the NDNs architecture with SE blocks or RnD blocks only. **Table 1** lists the Dice similarity scores calculated for brain tumor segmentation with these networks, and **Figure 7** presents the boxplots for each class. Note that all networks were trained with the Dice loss in Experiment 1.

TABLE 1 | Comparison of different U-Net-like architectures: (A) standard 3D U-Net; (B) U-Net with residual blocks; (C) U-Net with dense blocks; (D) NDNs without SE blocks; (E) NDNs without RnD blocks; and (F) NDNs network.

*The bold values mean the best result (the highest dice value) for each class.*

represent results predicted by each U-Net-like network. The organizers provided the ground truth images. (A-C) list three samples for different patients.

We achieved better results for non-enhancing tumor segmentation and enhancing tumor segmentation with NDNs than with the other U-Net-like architectures. According to **Table 1**, the non-enhancing tumor results segmented by NDNs are about 2.6% better than U-Net with ResNet101, and the enhancing tumor segmentation results are at least 2.0% better than the other methods. However, the proposed algorithm lacked the ability to segment the edema part with a result of 0.6652, which is worse than the other U-Net-like algorithms. **Figure 8** presents the ground truth and prediction results for different U-Net-like architectures from different perspectives.

TABLE 2 | Comparison with different losses: (A) categorical cross-entropy; (B) weighted categorical cross-entropy loss; (C) focal loss; and (D) Dice loss.

*The bold values mean the best result (the highest dice value) for each class.*

### 3.2. Experiment 2

Class imbalance is a severe issue in medical image segmentation and needs to be carefully tackled. The data provided by MSD and BraTS 2018 are heavily imbalanced, especially the classes of the non-enhancing tumors and enhancing tumors. To alleviate the class imbalance, we use a Dice loss function. We also explore the effects of other loss functions on NDNs for comparison. The categorical cross-entropy is used as a base loss function:

$$\mathrm{Crossentropy}(p, q) = -\frac{1}{N} \sum_{x,y,z} \sum_{k} p_{x,y,z}^{k} \log q_{x,y,z}^{k},\tag{4}$$

where p<sup>k</sup><sub>x,y,z</sub> and q<sup>k</sup><sub>x,y,z</sub> correspond to the ground truth and predicted results for class k, and N is the total number of samples. Based on previous experience, the class imbalance can be addressed by associating different weights with individual classes. Therefore, the weighted categorical cross-entropy is also used:

$$\text{W\_Crossentropy}(p,q) = -\frac{1}{N} \sum_{x,y,z} \sum_{k} w^{k} p_{x,y,z}^{k} \log q_{x,y,z}^{k},\tag{5}$$

where w<sup>k</sup> is the weight for class k. Here the weights for the background, edema, non-enhancing tumors, and enhancing tumors are defined as (1, 1, 2, 1), respectively. The focal loss function described by Lin et al. (2017) for dense object detection is a modified version of binary cross-entropy and is aimed toward low-confidence labels. We adopt a multiclass focal loss for the segmentation task:

$$\mathrm{Focal}(p,q) = -\frac{\sum_{x,y,z} \sum_{k} p_{x,y,z}^{k} (1 - q_{x,y,z}^{k})^{\gamma} \log q_{x,y,z}^{k}}{\sum_{x,y,z} \sum_{k} p_{x,y,z}^{k}},\tag{6}$$

where (1 − q<sup>k</sup><sub>x,y,z</sub>)<sup>γ</sup> is a modulating factor and the value of γ is set to 2.0 in our algorithm. Finally, our proposed model is trained with the following Dice loss to segment different parts of the brain tumors:

$$\mathrm{Dice}(p,q) = 1 - \frac{1}{N} \frac{2\sum_{x,y,z} \sum_{k} p_{x,y,z}^{k} \cdot q_{x,y,z}^{k}}{\sum_{x,y,z} \sum_{k} p_{x,y,z}^{k} + \sum_{x,y,z} \sum_{k} q_{x,y,z}^{k}} . \tag{7}$$
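The loss functions of Equations (4)–(7) can be compared on toy inputs with a NumPy sketch. Here `p` holds one-hot ground truth and `q` predicted probabilities, both flattened to shape (N, K) with one row per voxel. The small constant `EPS` is added for numerical stability, and the 1/N factor of Equation (7) is dropped, which corresponds to treating the input as a single sample (N = 1); both choices, like the function names, are assumptions of this sketch.

```python
import numpy as np

EPS = 1e-7  # numerical guard, not part of Eqs. (4)-(7)

def cross_entropy(p, q, class_weights=None):
    """Eq. (4); with `class_weights` it becomes the weighted form of Eq. (5)."""
    k = p.shape[1]
    w = np.ones(k) if class_weights is None else np.asarray(class_weights, dtype=float)
    return float(-np.sum(w * p * np.log(q + EPS)) / p.shape[0])

def focal_loss(p, q, gamma=2.0):
    """Eq. (6): cross-entropy modulated by (1 - q)^gamma."""
    return float(-np.sum(p * (1.0 - q) ** gamma * np.log(q + EPS)) / np.sum(p))

def dice_loss(p, q):
    """Eq. (7) with the 1/N factor taken as 1 (single-sample assumption)."""
    return float(1.0 - 2.0 * np.sum(p * q) / (np.sum(p) + np.sum(q) + EPS))

p = np.eye(4)                         # four voxels, one per class (one-hot)
q_uniform = np.full((4, 4), 0.25)     # maximally uncertain prediction
losses = {
    "ce": cross_entropy(p, q_uniform),
    "weighted_ce": cross_entropy(p, q_uniform, class_weights=(1, 1, 2, 1)),
    "focal": focal_loss(p, q_uniform),
    "dice": dice_loss(p, q_uniform),
}
```

For the uniform prediction, the focal loss equals the cross-entropy scaled by 0.75², and the weights (1, 1, 2, 1) raise the cross-entropy because errors on the non-enhancing class count twice.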

The Dice similarity scores for the different loss functions used in NDNs are presented in **Table 2** and **Figure 9**. We obtain final scores of 0.6652, 0.5880, and 0.6682 for edemas, non-enhancing tumors, and enhancing tumors, respectively, using the Dice loss. Normal loss functions like the categorical cross-entropy may achieve good results for balanced datasets, but datasets with a massive imbalance among classes require special attention. We avoid weighted categorical cross-entropy as much as possible, because it needs additional hyperparameters that may introduce another difficult problem for network optimization. The results show that the focal loss may be good for binary classification problems to solve intra-class imbalance. However, it is less helpful for inter-class imbalance. The Dice loss is calculated based on the Dice coefficient and can deal with situations with large amounts of class imbalance. **Figure 10** shows the ground truth and prediction results for the different loss functions used in NDNs.

TABLE 3 | Comparison of methods for the same dataset: (A) Isensee et al. (2017); (B) Iqbal et al. (2018); (C) Wang et al. (2017); (D) Zhou et al. (2018); and (E) our proposed method training with the cascade strategy.

*The bold values mean the best result (the highest dice value) for each class.*

### 3.3. Experiment 3

We reproduced several state-of-the-art methods for brain tumor segmentation for comparison with our algorithm. Isensee et al. (2017) achieved a high Dice score in the BraTS 2017 Challenge by using a U-Net-like architecture. They employed deep supervision in the localization pathway to integrate segmentation layers at different levels of the network and combined them via element-wise summation to form the final network output. Iqbal et al. (2018) adopted SE blocks at the end of the decoder part and fused their output with the output of the encoder blocks. These two methods were chosen for comparison because they have similarities with our network structure. Wang et al. (2017) proposed a triple-cascaded framework to segment the entire tumor, tumor core, and enhancing tumor core sequentially. They used dilated convolutions after the down-sampling layers and set the dilation parameter from 1 to 3. Zhou et al. (2018) presented a single multitask CNN that can learn the correlations between different categories. These two methods used a cascade or cascade-like training strategy similar to our training process, and they both obtained high Dice scores in the brain tumor segmentation task. In this experiment, a multiclass segmentation problem was decomposed into three binary segmentation problems by repeated training of NDNs with the coarse-to-fine method, as in Wang et al. (2017). **Table 3** and **Figure 11** present the quantitative evaluation according to the Dice similarity scores for the same datasets.

**Table 3** indicates that the Dice similarity scores of our proposed method are 0.7043, 0.5889, and 0.7206 for edemas, non-enhancing tumors, and enhancing tumors, respectively, which are higher than those of all comparison methods for every class. Moreover, the results are 3.9 and 5.2% higher for edemas and enhancing tumors than when the three classes are trained together, and the results for the non-enhancing tumors do not worsen. These results indicate that the cascade training strategy can improve the accuracy of brain tumor segmentation. **Figure 12** shows the ground truth and prediction results for different state-of-the-art methods.

### 4. DISCUSSION

### 4.1. Competitive Segmentation Results

U-Net increases the number of up-sampling and skip connections compared with FCN, which can supplement the semantic information with more location information. The U-Net architecture has received increasing attention recently and has been shown to be a stable algorithm for many segmentation tasks. Despite its great success, however, U-Net still has limitations for some specialized tasks.

We found that stacking residual blocks instead of simple convolution layers can improve the brain tumor segmentation performance. This is because residual blocks can fuse receptive fields of different sizes and ease the training of the networks. Attention mechanisms have shown their utility for many computer vision tasks. SE blocks work as an attention mechanism that can explore the relationship between channels to suppress useless information and enhance useful information by fusing global information. They can help a network notice essential features and make correct decisions. Nesting the SE blocks into our base structure causes the corresponding Dice similarity scores of the edemas, non-enhancing tumors, and enhancing tumors to reach 0.6725, 0.536, and 0.638, respectively. To solve the problem of insufficient receptive fields and to simultaneously avoid the gridding issue, we add RnD blocks to the network. By learning from the HDC framework, RnD blocks can enlarge receptive fields by using dilated convolutions with different dilation rates. Based on this, our method obtains results of 0.6652, 0.5880, and 0.6682, respectively.

FIGURE 10 | Brain tumor segmentation results predicted by NDNs with different loss functions. The rows represent three samples from different patients, and the columns represent results predicted by NDNs with different losses. Organizers provided ground truth images. (A-C) list three samples for different patients.

FIGURE 11 | Boxplots for each method in Table 3. Dice similarity scores for (A) edemas, (B) non-enhancing tumors, and (C) enhancing tumors. The symbol "×" marks the mean.

An extreme imbalance between categories affects the segmentation results, especially for edemas, and needs to be addressed. Non-enhancing tumors usually have smaller regions than the other two classes, as shown in **Figure 14**, which will have a negative effect on the segmentation results. In order to alleviate the class imbalance, two measures are taken. First, different losses are employed by NDNs to determine the best performance, and a Dice loss function is eventually selected. Moreover, we borrowed the cascade training strategy adopted by many state-of-the-art methods for brain tumor segmentation. Cascade training can balance the quantitative differences among different classes to some extent. The final results obtained by our proposed method were 0.7043 for edema, 0.5889 for non-enhancing tumors, and 0.7206 for enhancing tumors. The experimental results are shown in **Figure 13**. These reasonable results are attributed to both the network structure and training strategies.

### 4.2. Limitations

This study is limited by the class imbalance problem, even though some measures have been taken to alleviate it. Some small regions in brain tumors, such as non-enhancing tumors, could not be predicted very well. For example, in the two samples in **Figure 14**, only 8.5% of the entire tumor is non-enhancing in sample A and 2.25% in sample B. This huge category imbalance leads to inaccurate segmentation results of 0.279 and 0.402 for non-enhancing tumors in samples A and B, respectively. The class imbalance problem remains a challenge that should be addressed in the future.

## 5. CONCLUSION

Clinical applications of computer-aided systems have gained a great deal of research attention. Accurate brain tumor segmentation is a tedious but vital task for clinicians because of the various sizes and shapes of tumors. Quantitative analysis of brain tumors is critical to relieve pressure on doctors and obtain more accurate segmentation results. We developed a new deep learning framework based on U-Net, NDNs, for segmenting brain tumors. Our results showed that NDNs can extract discriminative features of edemas, non-enhancing tumors, and enhancing tumors by obtaining large receptive fields and integrating channel information. Compared with other state-of-the-art methods, NDNs obtained higher Dice similarity scores. The proposed method makes it possible to generate accurate segmentation results for brain tumors without manual intervention and provides considerable insight into the application of computer-aided systems to clinical tasks.

## DATA AVAILABILITY

The datasets for this study can be found at Medical Segmentation Decathlon, and the low-grade data can be downloaded from MICCAI BraTS 2018.

## AUTHOR'S NOTE

All data are made available online with a permissive copyright license (CC-BY-SA 4.0), allowing for data to be shared, distributed and improved upon.

## AUTHOR CONTRIBUTIONS

LW, SW, and RC conceptualized the algorithm design. LW, SW, and SH designed the study. LW, SW, RC, SH, and CL collected the data. LW, SW, SH, and CL analyzed the data. LW, SW, and CL interpreted the data. LW, SW, and SH sourced the literature. LW, SW, RC, and SH wrote the draft. LW, SH, XQ, YC, and CL edited the manuscript. LW and CL acquired the funding and supervised the whole study.

## FUNDING

This work was partially supported by the National Natural Science Foundation of China (Grant No. 61671399, 61601392, 61571380), National Key R&D Program of China (Grant No. 2017YFC0108703), and Fundamental Research Funds for the Central Universities (Grant No. 20720180056).

### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Wang, Chen, Qu, Chen, Huang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Diagnosis of Alzheimer's Disease via Multi-Modality 3D Convolutional Neural Network

#### Edited by:

Bradley J. MacIntosh, Sunnybrook Research Institute (SRI), Canada

#### Reviewed by:

Jingyun Chen, New York University School of Medicine, United States Veena A. Nair, University of Wisconsin–Madison, United States

#### \*Correspondence:

Xiahai Zhuang zxh@fudan.edu.cn

†Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how\_to\_apply/ADNI\_Acknowledgement\_List.pdf

#### Specialty section:

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

Received: 15 January 2019 Accepted: 02 May 2019 Published: 31 May 2019

#### Citation:

Huang Y, Xu J, Zhou Y, Tong T, Zhuang X and the Alzheimer's Disease Neuroimaging Initiative (ADNI) (2019) Diagnosis of Alzheimer's Disease via Multi-Modality 3D Convolutional Neural Network. Front. Neurosci. 13:509. doi: 10.3389/fnins.2019.00509

Yechong Huang<sup>1</sup>, Jiahang Xu<sup>1</sup>, Yuncheng Zhou<sup>1</sup>, Tong Tong<sup>2</sup>, Xiahai Zhuang<sup>1</sup>\* and the Alzheimer's Disease Neuroimaging Initiative (ADNI)†

<sup>1</sup> School of Data Science, Fudan University, Shanghai, China, <sup>2</sup> Fujian Provincial Key Laboratory of Medical Instrument and Pharmaceutical Technology, Fuzhou, China

Alzheimer's disease (AD) is one of the most common neurodegenerative diseases. In the last decade, studies on AD diagnosis have attached great significance to artificial intelligence-based diagnostic algorithms. Among the diverse modalities of imaging data, T1-weighted MR and FDG-PET are widely used for this task. In this paper, we propose a convolutional neural network (CNN) to integrate the multi-modality information included in both T1-MR and FDG-PET images of the hippocampal area for the diagnosis of AD. Unlike traditional machine learning algorithms, this method does not require manually extracted features; instead, it utilizes 3D image-processing CNNs to learn features for the diagnosis or prognosis of AD. To test the performance of the proposed network, we trained the classifier with paired T1-MR and FDG-PET images in the ADNI datasets, including 731 cognitively unimpaired (labeled as CN) subjects, 647 subjects with AD, 441 subjects with stable mild cognitive impairment (sMCI) and 326 subjects with progressive mild cognitive impairment (pMCI). We obtained accuracies of 90.10% for the CN vs. AD task, 87.46% for the CN vs. pMCI task, and 76.90% for the sMCI vs. pMCI task. The proposed framework yields state-of-the-art performance. Finally, the results demonstrate that (1) segmentation is not a prerequisite when using a CNN for the classification, and (2) the combination of imaging data from two modalities generates better results.

Keywords: Alzheimer's disease, multi-modality, image classification, CNN, deep learning, hippocampal

### INTRODUCTION

Aging of the global population results in an increasing number of people with dementia. Recent studies indicate that 50 million people are living with dementia (Patterson, 2018), of whom 60–70% have Alzheimer's Disease (AD) (World Health Organization, 2012). Known as one of the most common neurodegenerative diseases, AD can result in severe cognitive impairment and behavioral issues.

Mild cognitive impairment (MCI) is a neurological disorder, which may occur as a transitional stage between normal aging and the preclinical phase of dementia. MCI causes cognitive impairments with a minimal impact on instrumental activities of daily life (Petersen et al., 1999, 2018). MCI is a heterogeneous group and can be classified according to its various clinical outcomes (Huang et al., 2003). In this work, we partitioned MCI into progressive MCI (pMCI) and stable MCI (sMCI), which are retrospective diagnostic terms based on the clinical follow-up according to the DSM-5 criteria (American Psychiatric Association, 2013). The term pMCI refers to MCI patients who develop dementia in a 36-month follow-up, while sMCI is assigned to MCI patients when they do not convert. Distinguishing between pMCI and sMCI plays an important role in the early diagnosis of dementia, which can assist clinicians in proposing effective therapeutic interventions for the disease process (Samper-González et al., 2018).

With the progression of MCI and AD, the structure and metabolic rate of the brain change accordingly. The phenotypes include the shrinkage of cerebral cortices and hippocampi, the enlargement of ventricles, and the change of regional glucose uptake. These changes can be quantified with the help of medical imaging techniques such as magnetic resonance (MR) and positron-emission tomography (PET) (Correa et al., 2009). For instance, T1-weighted magnetic resonance imaging (T1-MRI) provides high-resolution information on brain structure, making it possible to accurately measure structural metrics like thickness, volume and shape. Meanwhile, 18-Fluoro-DeoxyGlucose PET (<sup>18</sup>F-FDG-PET or FDG-PET) indicates the regional cerebral metabolic rate of glucose, making it possible to evaluate the metabolic activity of the tissues. Other tracers, such as <sup>11</sup>C-PiB and <sup>18</sup>F-THK, are also widely used in AD diagnosis (Jack et al., 2008b; Harada et al., 2013), as they are sensitive to the pathology of AD as well. By analyzing these medical images, one can obtain important references to assist the diagnosis and prediction of AD (Desikan et al., 2009).

This work aims at distinguishing AD or potential AD patients from cognitively unimpaired (labeled as CN) subjects accurately and automatically, using medical images of the hippocampal area and recent techniques in deep learning, as this facilitates a fast preclinical diagnosis. The method is further extended to the classification between sMCI and pMCI so that an early diagnosis of dementia would be possible. Data of two modalities were used, i.e., T1-MRI and <sup>18</sup>F-FDG-PET, as they provide complementary information.

Numerous studies have been published on diagnosing AD using these two modalities. Using T1-MRI, Sorensen et al. segmented the brains and extracted features of thickness and volumetry in the selected regions of interest (ROIs) (Sorensen et al., 2017). A linear discriminant analysis (LDA) was used to classify AD, MCI, and CN. Cárdenas-Peña et al. (2017) implemented the kernel metric learning method in the classification. Another popular machine learning method is the random forest. Lebedeva et al. (2017) extracted the structural features of MRI and used the mini-mental state examination (MMSE) as a cognitive measure. Ardekani et al. (2017) took the hippocampal volumetric integrity of MRI and neuropsychological scores as the selected features. Both studies used the random forest. As for <sup>18</sup>F-FDG-PET, Silveira and Marques (2010) proposed a boosting learning method that used a mixture of simple classifiers to perform voxel-wise feature selection. Cabral and Silveira (2013) used favorite class ensembles to form ensembles of support vector machines (SVM) and random forests.

In addition to the single modality classifications, taking both T1-MRI and <sup>18</sup>F-FDG-PET into consideration is also a major concern for research on AD diagnosis. Gray et al. (2013) took regional MRI volumes, PET intensities, cerebrospinal fluid (CSF) biomarkers and genetic information as features and implemented random-forest based classification. Additionally, Zhang et al. (2011) conducted a classification based on MRI, PET, and CSF biomarkers. Moreover, other imaging modalities or PET tracers can be considered, as Rondina et al. (2018) used T1-MRI, <sup>18</sup>F-FDG-PET and rCBF-SPECT as the imaging modalities while Wang et al. (2016) used <sup>18</sup>F-FDG and <sup>18</sup>F-florbetapir as tracers of PET.

The studies mentioned above mostly follow three basic steps in the diagnosis algorithms, namely segmentation, feature extraction and classification. During segmentation, data are manually or automatically partitioned into multiple segments based on anatomy or physiology. In this way, the ROIs are well-defined, making it possible to extract features from them. Finally, these features will be fed to the classification step so that the classifiers are able to learn useful diagnostic information and propose predictions for given test subjects. Among them, segmentation plays an important role as it is used to measure the structural metrics in the feature extraction step. However, it is hard to obtain a segmentation automatically and accurately, which leads to a low efficiency. As a result, we proposed an end-to-end diagnosis without segmentation in the following work. What is more, though highly reliable and explainable, these steps could be integrated weakly, as different platforms are used in different steps of these algorithms. The above considerations lead to our attempt to use a neural network in AD diagnosis.

Benefiting from the rapid development of computer science and the accumulation of clinical data, deep learning has recently become a popular and useful method in the field of medical imaging. The general applications of deep learning in medical imaging are mainly feature extraction, image classification, object detection, segmentation and registration (Litjens et al., 2017). Among the deep learning networks, convolutional neural networks (CNNs) are common choices. Hosseini-Asl et al. (2016) built a 3D-CNN based on a 3D convolutional auto-encoder, which takes functional MRI (fMRI) images as input and gives the prediction for the AD vs. MCI vs. CN task, while Sarraf and Tofighi (2016) used a CNN structured like LeNet-5 to classify AD from CN based on fMRI. Liu et al. (2018) built a T1-MRI and FDG-PET based cascaded CNN, which utilized a 3D CNN to extract features and adopted another 2D CNN to combine multi-modality features for task-specific classification. Previous studies showed promising potential for AD diagnosis, and thus we propose to use a deep learning framework in our work to complete the feature extraction and classification steps simultaneously.

In this work, we propose a multi-modality AD classifier. It takes both MR and PET images of the hippocampal area as the inputs, and provides predictions in the CN vs. AD task, the CN vs. pMCI task and the sMCI vs. pMCI task. The main contributions of our work are listed below:

(1) We show that segmentation of the key substructures, such as hippocampi, is not a prerequisite in CNN-based classification.


### MATERIALS AND METHODS

Studies of biomarkers for AD diagnosis are of great interest in the research field. Among these biomarkers, the shrinkage of the hippocampi is the best-established MRI biomarker to stage the progression of AD (Jack et al., 2011a), and by now the only MRI biomarker qualified for the enrichment of clinical trials (Hill et al., 2014). Therefore, the hippocampi are the most studied structures for MRI-based AD diagnosis, and the hippocampal area is chosen to be the ROI of MRI in this work. As for PET images, published studies indicated that AD may cause the decline of <sup>18</sup>F-FDG uptake in both hippocampi and cortices (Mosconi et al., 2006; Mosconi et al., 2008; Jack et al., 2011b). Hence, when dealing with PET images, we tried different ROIs, i.e., containing only hippocampi, and containing both hippocampi and cortices.

### Image Acquisition

Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database<sup>1</sup>. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). In this work, we used the T1-MRI and the FDG-PET from the baseline and follow-up visits in ADNI, as these two modalities have the greatest number of images. The details of the data acquisition are described on the ADNI website (Jack et al., 2008a). We generated two datasets in this work. The Segmented dataset, containing MR images and corresponding segmentation results, was chosen to verify the effect of the segmentation, and the Paired dataset, containing MR and PET images, to verify the effect of multi-modality images.

In the Segmented dataset, we picked 2861 T1-MR images, including AD and cognitively unimpaired subjects. Basic information of the Segmented dataset is summarized in **Table 1**. All images in the Segmented dataset were segmented using multi-atlas label propagation with the expectation-maximization (MALP-EM) framework<sup>2</sup> (Ledig et al., 2015). MALP-EM is a framework for the fully automatic segmentation of MR brain images. The approach is based on multi-atlas label fusion and intensity-based label refinement, using an expectation-maximization (EM) algorithm. Through the MALP-EM framework, we obtained 138 anatomical regions with fixed boundaries, including the hippocampi of interest.

TABLE 1 | Summary of the studied subjects from the Segmented dataset.

TABLE 2 | Summary of the studied subjects from the Paired dataset.

As for the Paired dataset, we used the following steps to generate it. For the same subject, we paired the MRI with the PET with (a) closest acquisition dates, (b) within 1 year since the MRI scan, and (c) at the time of the scan with the same diagnosis as the MRI. Among the acquired data, the MCI subjects were classified into pMCI and sMCI according to the DSM-5 criteria, that is, MCI should be defined as pMCI if it develops into AD within 3 years, or be defined as sMCI if it does not. Subjects without follow-up data for more than 3 years were ignored. Finally, we acquired 647 AD, 767 MCI (326 pMCI and 441 sMCI) and 731 cognitively unimpaired subjects over 1211 ADNI participants. All the information for these subjects is summarized in **Table 2**.
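The pairing and labeling rules above can be illustrated with a minimal sketch. The record layout (dictionaries with `subject`, `date` and `diagnosis` keys), the function names, and the reading of rule (b) as an absolute gap of at most 365 days are all assumptions made for illustration; they are not taken from the ADNI tooling or from the authors' code.

```python
from datetime import date

def pair_scans(mri_scans, pet_scans, max_gap_days=365):
    """Pair each MRI with the closest-in-time PET scan of the same subject.

    Every scan is a dict with the hypothetical keys 'subject', 'date'
    and 'diagnosis'. MRIs without an admissible PET are dropped.
    """
    pairs = []
    for mri in mri_scans:
        def gap(pet):
            return abs((pet["date"] - mri["date"]).days)
        candidates = [
            pet for pet in pet_scans
            if pet["subject"] == mri["subject"]
            and pet["diagnosis"] == mri["diagnosis"]  # rule (c): same diagnosis
            and gap(pet) <= max_gap_days              # rule (b): within 1 year (assumed symmetric)
        ]
        if candidates:
            pairs.append((mri, min(candidates, key=gap)))  # rule (a): closest date
    return pairs

def label_mci(months_to_ad, followup_months, window=36):
    """Return 'pMCI', 'sMCI', or None when follow-up is too short to decide."""
    if months_to_ad is not None and months_to_ad <= window:
        return "pMCI"   # converted to AD within the 3-year window
    if followup_months >= window:
        return "sMCI"   # observed for the full window without converting in it
    return None         # insufficient follow-up: excluded
```

Under these assumptions, an MCI subject who converts 24 months after baseline is labeled pMCI, while one followed for only 12 months without conversion is excluded.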

### Data Processing

The pre-processing of images was implemented with zxhtools<sup>3</sup> (Zhuang et al., 2011). In this work, MR images were reoriented and resampled to a size of 221 × 257 × 221 voxels with a 1 mm isotropic spacing using zxhreg and zxhtransform from zxhtools. Furthermore, in the Paired dataset, each PET image was rigid-registered to its respective MR image for subsequent processing.

The hippocampal area was selected to be the region of interest (ROI) because of its great significance in AD diagnosis. In addition, due to limited computation ability, we cropped the ROI centered in the hippocampi. For the Segmented dataset, which includes the segmentation results, we directly calculated the center of the hippocampi as it has been shown in the segmentation results. For the Paired dataset, we acquired the central points of the MR images as follows. First, we randomly chose one MR image from the Paired dataset as a template. Then we registered the images from the Segmented dataset to the template image by affine-registration, thus calculating the average indices of the center in the template image. After that, we registered the template image to other MR images in the Paired dataset using affine-transformation and used the corresponding affine matrix to determine the center for each MR image. Finally, each PET image was rigid-registered to a respective MR image for the identification of the hippocampi's center. After the registration, PET images were transformed into a uniform isotropic spacing of 1 mm.

<sup>1</sup>http://adni.loni.usc.edu

<sup>2</sup>https://biomedia.doc.ic.ac.uk/software/malp-em/

<sup>3</sup>http://www.sdspeople.fudan.edu.cn/zhuangxiahai/0/zxhproj/
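The center-propagation step can be sketched as follows: hippocampal centers from segmented images are mapped into the template, averaged, and the average is then mapped into each target image. The sketch assumes that each registration is available as a 4 × 4 homogeneous matrix acting on voxel indices; this convention and the helper names are hypothetical and do not reflect the zxhtools interface.

```python
def apply_affine(matrix, point):
    """Map a 3D point through a 4x4 homogeneous affine (nested lists, row-major)."""
    x, y, z = point
    homogeneous = (x, y, z, 1.0)
    return tuple(
        sum(matrix[row][col] * homogeneous[col] for col in range(4))
        for row in range(3)
    )

def mean_point(points):
    """Component-wise average of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))

def template_center(centers, to_template):
    """Average of segmented centers after mapping each into template space."""
    mapped = [apply_affine(m, c) for m, c in zip(to_template, centers)]
    return mean_point(mapped)
```

The center for an unsegmented MR image would then be `apply_affine(template_to_target, center_in_template)`, where `template_to_target` stands for the (assumed) matrix returned by the affine registration of the template to that image.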

After the centers of the ROIs were located, we dilated and cropped the ROIs to a region of size 96 × 96 × 48 in voxels from the center of hippocampi for MR images (see the red rectangles in **Figures 1A–C**). In the experiment on the Segmented dataset, we processed the cropped ROI and corresponding labels in three different ways. Three slightly different groups were obtained: ImageOnly, MaskedImage and Mask. The ImageOnly group contains raw MR images and maintains all the imaging information of the hippocampi and surrounding areas. The MaskedImage group is made up of MR images masked by binary labels; it considers both the original images and the segmentation results for the hippocampi as the inputs. The Mask group is made up of binary hippocampi segmentation labels, only indicating information about the shape and volume of the hippocampi. By comparing the classification performance using these three groups, it can be judged whether the segmentation results have an important effect on AD diagnosis. The information for the three groups from the Segmented dataset is shown in **Figures 1D–F**. When it comes to the Paired dataset, we used two different methods to generate the patches of PET images. The group generated using the first method is called the Small Reception Field (SmallRF) group, which has the same reception field as the ROI of MR images with 1 mm isotropic spacing. The group generated using the second method is called the Big Reception Field (BigRF) group, which has the same orientation and ROI center but has a 2 mm isotropic spacing for each dimension, thus having a larger reception field but a lower spatial resolution. The information for the two groups from the Paired dataset is shown in **Figures 1H,I**, and a sample of the original PET image is shown in **Figure 1G**.
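A minimal sketch of the cropping geometry is given below. It computes the index range of a 96 × 96 × 48 ROI around a given center and the physical extent covered at 1 mm and 2 mm spacing. Shifting the window so that it stays inside the image is an assumption of this sketch; the text does not say how boundary cases were handled.

```python
ROI_SHAPE = (96, 96, 48)  # ROI size in voxels, as stated in the text

def crop_bounds(center, image_shape, roi_shape=ROI_SHAPE):
    """Return per-axis (start, stop) indices of an ROI centered at `center`.

    The window is shifted, not shrunk, when it would cross the image
    border (an assumption of this sketch).
    """
    bounds = []
    for c, size, roi in zip(center, image_shape, roi_shape):
        if roi > size:
            raise ValueError("ROI is larger than the image along one axis")
        start = int(round(c)) - roi // 2
        start = max(0, min(start, size - roi))
        bounds.append((start, start + roi))
    return bounds

def field_of_view_mm(roi_shape, spacing_mm):
    """Physical extent of the ROI for an isotropic voxel spacing."""
    return tuple(n * spacing_mm for n in roi_shape)

small_rf = field_of_view_mm(ROI_SHAPE, 1)  # SmallRF: 96 x 96 x 48 mm
big_rf = field_of_view_mm(ROI_SHAPE, 2)    # BigRF: 192 x 192 x 96 mm
```

With the same voxel grid, doubling the spacing doubles the extent along each axis, so the BigRF patch covers 2 × 2 × 2 = 8 times the physical volume of the SmallRF patch.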

After the data processing, the datasets were randomly split into training sets, validation sets, and testing sets according to the patient IDs to ensure that all images of the same patient only appear in one set. Finally, 70% of a dataset was used as the training set, 10% as the validation set, and 20% as the testing set by random sampling. Details of these subsets are shown in **Supplementary Tables S1** and **S2**.
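A patient-level split of this kind can be sketched as below. The mapping from image ID to patient ID, the function name and the fixed seed are hypothetical; the sketch also applies the 70/10/20 proportions to patients rather than to images, so the image-level proportions are only approximate when patients contribute different numbers of scans.

```python
import random

def split_by_patient(image_to_patient, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split image IDs into train/val/test so that no patient spans two sets."""
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)  # reproducible shuffle of patient IDs
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),  # remainder goes to the test set
    }
    return {
        name: sorted(img for img, pid in image_to_patient.items() if pid in members)
        for name, members in groups.items()
    }
```

Splitting on patient IDs rather than on images avoids leakage: scans of one person taken at different visits are highly correlated, and letting them fall on both sides of the split would inflate the test accuracy.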

### Methodology

A convolutional neural network (LeCun et al., 1995) is a deep feedforward neural network composed of multi-layer artificial neurons, with excellent performance in large-scale image processing. Unlike traditional methods, which use manually extracted features of radiological images, CNNs learn general features automatically. A CNN is trained with the backpropagation algorithm and usually consists of multiple convolutional layers, pooling layers and fully connected layers, connecting to the output units through fully connected layers or other kinds of layers. Compared to other deep feedforward networks, CNNs have fewer connections and fewer parameters, because the convolution kernel is shared among pixels, and are therefore easier to train and more popular.

With CNNs prospering in the field of computer vision, a number of attempts have been made to improve the original network structure to achieve better accuracy. VGG (Simonyan and Zisserman, 2014) is a neural network based on AlexNet (Krizhevsky et al., 2012), and it achieved a 7.3% top-5 error rate in the 2014 ILSVRC competition (Russakovsky et al., 2015). VGGs deepen the network relative to AlexNet by adding more convolutional layers and pooling layers. Unlike earlier CNNs, VGGs use very deep convolutional stacks for large-scale image classification, which yields significantly more accurate architectures that perform well even when used as a part of relatively simple pipelines. In this work, we built our network with reference to the structure of VGG.

### EXPERIMENTS

In the Section "Data Type Analysis", we determined the proper types of data and ROIs through two experiments. In the Section "Multi-Modality AD Classifier", we constructed a set of VGG-like multi-modality AD classifiers, which considers both T1-MRI and FDG-PET data as inputs and provides predictions. In the Section "Classification of sMCI vs. pMCI and CN vs. pMCI Tasks", we trained and tested our networks with the pMCI and sMCI data. Finally, in the Section "Comparison With Other Methods" we compared our proposed method with state-of-the-art methods.

### Implementation Details

All the networks mentioned above were programmed based on TensorFlow (Abadi et al., 2016). Training procedures of the networks were conducted on a personal computer with a Nvidia GTX1080Ti GPU. During the training, batch normalization (Ioffe and Szegedy, 2015) was deployed in the convolutional layers and dropout (Hinton et al., 2015) was deployed in fully connected layers to avoid overfitting. To accelerate the training process and to avoid local minima, we used an ADAM optimizer (Kingma and Ba, 2014) to train. The batch size was set to 16 when we trained single modality networks and to eight when we trained multi-modality networks. The number of epochs was set to 150, though the loss would generally converge after 30 epochs. Each training epoch took several minutes. During training, the parameters of the networks were saved every 10 epochs. The resulting models were tested using the validation data set. The accuracies and receiver operating characteristic (ROC) curves of the classification on the validation data were then calculated, and the model with the best accuracy was chosen to be the final classifier.
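The model-selection rule described above (save every 10 epochs, keep the checkpoint with the best validation accuracy) can be sketched as follows. The function names are hypothetical, and breaking ties in favor of the earlier epoch is an assumption, since the text does not specify a tie-breaking rule.

```python
def checkpoint_epochs(total_epochs=150, every=10):
    """Epochs at which the network parameters are saved."""
    return list(range(every, total_epochs + 1, every))

def select_best(val_accuracy_by_epoch):
    """Return (epoch, accuracy) of the checkpoint with the best validation accuracy.

    Items are visited in increasing epoch order and max() keeps the first
    maximum, so ties go to the earlier checkpoint (an assumption).
    """
    return max(sorted(val_accuracy_by_epoch.items()), key=lambda item: item[1])
```

With 150 epochs and a saving interval of 10, this yields 15 candidate checkpoints per training run.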

### Data Type Analysis

In order to determine the proper data type for network training, we designed two experiments and evaluated the classification performances of models when they were fed with different data types.


All the models mentioned above were trained in the same network, as shown in **Figure 2**. The input resolution is 96 × 96 × 48 in voxels, and the network contains eight convolutional layers, five max-pooling layers, and three fully connected layers. The output was given through a softmax layer.

### The Influence of Segmentation

As mentioned above, segmentation plays an important role in traditional classification methods. However, segmentation is also known to be time-consuming. Additionally, CNN can extract useful features directly from raw images, as CNNs show a strong ability to locate key points in object detection tasks for natural images (Ren et al., 2015; He et al., 2017).

To verify the effect of segmentation, we segmented the AD and cognitively unimpaired subjects of T1-MR images with the MALP-EM algorithm (Ledig et al., 2015) and obtained the Segmented datasets, including 2861 subjects and containing both MR images and the corresponding segmentation. In our assumption, segmentation can indicate the shapes, volumes, textures and relative locations of hippocampal areas. Therefore, the data obtained from the subjects formed three different groups, as shown in **Figures 1D–F**. The ImageOnly group contains raw MR images only; the Mask group is made up of binary hippocampal segmentation labels and the MaskedImage group is made up of MR images masked by the binary labels.

For each model trained from these groups, accuracy and AUC were evaluated, as listed in **Table 3**. Among all the three models, the model trained by the Mask group provided a favorable prediction, though inferior to those trained by the ImageOnly and the MaskedImage groups. The results indicate that segmentation results do contain information needed for the classification; however, segmentation is not necessary for the classification task, since a CNN is able to learn useful features without labeling the voxels. In addition, features from the region outside the hippocampi also provide further information to separate AD patients from normal ones.

TABLE 3 | Summary of the models trained from the Mask, MaskedImage, and ImageOnly groups for CN vs. AD task.

The Segmented dataset was used. <sup>1</sup>ACC, SEN, SPE, AUC denote accuracy, sensitivity, specificity and area under curve, respectively. When testing, the numbers of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) subjects were counted, as ACC = (TP+TN)/(TP+TN+FP+FN), SEN = TP/(TP+FN), SPE = TN/(TN+FP). AUC is obtained through calculating the area under the receiver operating characteristic (ROC) curve. For all four metrics, the values are between 0 and 100%, the higher, the better.
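The metrics defined in the note to Table 3 can be computed with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the AUC is obtained through the pairwise-ranking formulation, which equals the area under the ROC curve but is not necessarily how the authors computed it.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),  # fraction of positive subjects detected
        "SPE": tn / (tn + fp),  # fraction of negative subjects rejected
    }

def auc(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise ranking.

    Counts the fraction of (positive, negative) pairs in which the positive
    subject receives the higher score; ties count as one half.
    """
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in scores_positive
        for n in scores_negative
    )
    return wins / (len(scores_positive) * len(scores_negative))
```

For example, with TP = 40, TN = 45, FP = 5 and FN = 10, the sketch gives ACC = 85%, SEN = 80% and SPE = 90%.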

### ROI Determination for PET Images

Due to the limitation of GPU RAM and its computational ability, it was difficult to consider the entire image as the network input. Our proposed network only considered a region of 96 × 96 × 48 in voxels, which was still about 2.94 times the input size of the original VGG (224 × 224 pixels × 3 channels). Hence, the selection of the ROI was of great importance, as only the features in the ROI were considered. As for the MR images, the selection of the ROI was of little doubt, because the hippocampal area has long been the main concern of AD research (Jack et al., 2011b; Hill et al., 2014). However, the ROIs of PET images varied, as studies also attached great significance to metabolic changes in cortices, e.g., temporal lobes (Mosconi et al., 2006, 2008).

To verify the effects of cortices on the classification, we generated two groups from all PET images in the Paired dataset, the SmallRF and the BigRF groups, as shown in **Figures 1H,I**. The SmallRF group uses exactly the same reception field as the MRI ROI; the images in the BigRF group cover eight times the volume of the images in the SmallRF group but have a lower spatial resolution.

Two models were trained using these two groups, and their performance was evaluated with the metrics listed in **Table 4**. The results showed that the two models behaved similarly. A likely explanation is that although the SmallRF group has a higher spatial resolution, the BigRF group contains more features. Furthermore, in terms of multi-modality classification tasks, the SmallRF group might be better, because PET images in the SmallRF group were voxel-wise aligned with the paired MR images, which could help better locate the spatial features. Therefore, we chose the same ROI for both MR and PET images in the following experiments (see the red rectangles in **Figures 1A–C**).

### Multi-Modality AD Classifier

The information a classifier can obtain, by using a single modality, is limited, as one medical imaging method can only profile one or several aspects of AD pathological changes, which is far from being complete. For example, T1-MR images provide a high-resolution brain structure but give little information about the functional information of the brain. Meanwhile, FDG-PET images are fuzzy but are better in revealing the metabolic activity of glucose in the brain. In order to take as much information of the brain as possible, we introduced a classification framework to integrate multi-modality information.

TABLE 4 | Summary of the models trained from the SmallRF and the BigRF groups for CN vs. AD task.


The Paired dataset was used.

fnins-13-00509 May 31, 2019 Time: 17:10 # 7

To prepare the dataset, we first matched MR with PET images and transformed them into the same world coordinates. After that, paired images of MR and PET were aligned by rigid registration to ensure that the voxels of the same indices in the paired images represent the same part of the brain. After the paired images were cropped with reference to the center point of MR images, the Paired dataset was obtained.

To implement the multi-modality classifier, we proposed two different network architectures, as shown in **Figure 3**. In **Figure 3A**, MR and PET images were used as two parallel channels, in which paired images were stacked into 4D images. In these 4D images, the first three dimensions represent the three spatial dimensions, and the fourth one represents the channels. In **Figure 3B**, MR and PET images have separate entrances: each is convolved in its own VGG-11, and the extracted features are concatenated. This network was trained with two strategies, denoted by B1 and B2. B1 was to train the model with weights shared for the convolutional layers. Meanwhile, B2 was to update the weights of the two VGG-11s separately.
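To make the difference between the three strategies concrete, the sketch below counts the convolutional parameters under each of them. The channel widths are those of the standard 2D VGG-11 and the kernels are assumed to be 3 × 3 × 3; neither is stated in the text, and batch-normalization and fully connected parameters are ignored, so the numbers are illustrative only.

```python
# Assumed channel widths of the eight convolutional layers (standard VGG-11).
VGG11_WIDTHS = (64, 128, 256, 256, 512, 512, 512, 512)

def conv_params(in_channels, widths=VGG11_WIDTHS, kernel=3):
    """Weights plus biases of a stack of 3D convolutions with cubic kernels."""
    total = 0
    for out_channels in widths:
        total += kernel ** 3 * in_channels * out_channels + out_channels
        in_channels = out_channels
    return total

params_a = conv_params(in_channels=2)       # Figure 3A: MR and PET stacked as two channels
params_b1 = conv_params(in_channels=1)      # Figure 3B, B1: one stack shared by both modalities
params_b2 = 2 * conv_params(in_channels=1)  # Figure 3B, B2: two independently updated stacks
```

Under these assumptions, architecture A differs from B1 only in the first layer, which sees one extra input channel, whereas B2 doubles the number of convolutional parameters because each modality learns its own kernels.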

We trained five models based on the Paired dataset, that is, two single modality models (for MRI and PET respectively), and three multi-modality models (A, B1, and B2). The results are shown in **Table 5** and **Figure 4A**. As shown in **Table 5**, multi-modality classifiers had better performance than single modality classifiers. Additionally, among the three multi-modality models, the model trained with strategy B1 had the highest accuracy and sensitivity, while the model trained with strategy B2 had the highest specificity and AUC.


TABLE 5 | Summary of the models trained from single modality protocols and

The Paired dataset was used. The best results were indicated in bold.

### Classification of sMCI vs. pMCI and CN vs. pMCI Tasks

Simply classifying AD patients from normal controls is relatively easy but of little significance, as the development of AD can be observed easily by the behaviors of the patients. In addition, there are a lot of alternative indicators in clinical diagnosis. Therefore, the prediction of AD seems to be more meaningful, as one of the main concerns is telling pMCI from sMCI and normal individuals. As pMCI would progress to AD while the other two would not, identifying pMCI could give a prediction of the development of MCI, and thus have high reference value and clinical meaning.

According to Lin et al. (2018), the models that were trained by the CN vs. AD training set performed better than the models trained by the sMCI vs. pMCI training set in the sMCI vs. pMCI task. Therefore, we trained models with the CN vs. AD training set and tested the models with the CN vs. pMCI testing set and the sMCI vs. pMCI testing set, with the results shown in **Table 6** and **Figures 4B,C**. Though B1 performed slightly better in CN vs. AD task, B2 was superior in CN vs. pMCI and sMCI vs. pMCI tasks. These results indicate that features of MRI and PET tend to be more consistent when dementia is highly developed, since convolutional kernels of model B1 shared the weight, while those of B2 did not.

### Comparison With Other Methods

In this part, we compared our method with those that were used in previous literature. We first compared our method with state-of-the-art research using 3D CNN-based multi-modality models as well (Lin et al., 2018). Liu et al. (2015) proposed a multi-modality cascaded CNN. They used the patch-based information of a whole brain to train or test their models and they integrated the information from the two modalities by concatenating the feature maps (Liu et al., 2015). **Table 7** shows the results of the method in comparison to our work. Note that our models used the data from multiple facilities and that our models only used the hippocampal area as the input. These would influence the behavior of our method.

TABLE 6 | Summary of the models trained from three multi-modality protocols for CN vs. AD.

The best results were indicated in bold.

TABLE 7 | Comparison of our proposed method and Liu's multi-modality method.

<sup>1</sup>Using B1 protocol, the CN vs. AD training set and the CN vs. AD testing set. <sup>2</sup>Using B2 protocol, the CN vs. pMCI training set and the CN vs. pMCI testing set. <sup>3</sup>Using B1 protocol, the CN vs. AD training set but the CN vs. pMCI testing set. See Table 9 for reference. The best results were indicated in bold.

TABLE 8 | Comparison of our proposed method and published AD diagnosis methods.

<sup>1</sup>Using B1 protocol, the CN vs. AD training set and the CN vs. AD testing set. <sup>2</sup>Using B2 protocol, the CN vs. sMCI training set and the CN vs. sMCI testing set. <sup>3</sup>Using B2 protocol, the CN vs. AD training set but the CN vs. sMCI testing set. See Table 9 for reference. The best results were indicated in bold.

TABLE 9 | Comparison of the performance of models trained from the CN vs. AD training set and the tasks' own training set.

The Paired dataset was used.

Moreover, Lin et al. (2018) chose to reduce the amount of input by slicing the data (in different directions) instead of cropping the hippocampi out as we did. Tong et al. (2017) used non-linear graph fusion to join the features of different modalities. In Zu et al.'s (2016) study, the feature selection from multiple modalities was treated as different learning tasks. Liu et al. (2015) used stacked autoencoders (SAE) with a masking training strategy. Jie et al. (2015) used a manifold regularized multitask feature learning method to preserve both the relations among modalities of data and the distribution in each modality.

Li et al. (2014) used a deep learning framework to predict the missing data. **Table 8** compares the previous multi-modality models with our proposed models. Among all the results listed below, our results are favorable in the CN vs. AD task and are the best in the sMCI vs. pMCI task.

### DISCUSSION

In this work, we proposed a VGG-like framework, with several instances, to implement a T1-MRI and FDG-PET based multimodality AD diagnosing system. The ROI of MRI was selected to be the hippocampal area, as it is the most frequently studied and is thought to be of the highest clinical value. Through the experiments, we showed that segmentation is not necessary for a CNN-based diagnosing system, which differs from traditional machine learning based methods. However, registration is still needed, as the images we used were taken from different facilities and had different spacings and orientations. Although models obtained from the SmallRF and BigRF groups had similar performances, the ROI of PET was chosen to be the same as that of the MRI, because the ROI of SmallRF was aligned voxel-wise with the ROI of the paired MRI. In short, only hippocampal areas were used as ROIs in our proposed methods, which is the main difference between our study and previous studies. Thus, we constructed a deeper neural network and fed it with medical images of higher resolution, as we supposed that the hippocampal area itself can serve as a favorable reference in AD diagnosis.

Once the ROI was selected, we introduced a multi-modality method to the classifier. Two networks and three types of models were proposed, as listed in **Table 6**. Among these three types of models, the model trained using strategy B1, in which the MR and PET images were input separately to the convolutional layers but shared the same convolutional kernels, performed the best in the CN vs. AD task. One possible explanation is that MR and PET images have some common features, and sharing weights helped the model to extract these features during the training process. Furthermore, we used the proposed networks to train CN vs. pMCI and sMCI vs. pMCI classifiers, both of which indicated the potential of preclinical diagnosis using our proposed methods.

We also followed the lead of Lin et al. (2018) and used the model trained on CN vs. AD subjects to distinguish sMCI from pMCI. The results were better than those of the model trained on sMCI and pMCI subjects themselves, as shown in **Table 9**. This is reasonable because the features of sMCI and pMCI are close to each other in the feature space and are difficult to differentiate, while those of CN and AD are widely spread, making the classification a lot easier. The same conclusion can be obtained by testing the CN vs. AD model on the CN vs. pMCI dataset. Specifically, when the CN vs. AD model was used, the accuracy reached 76.90% for sMCI vs. pMCI and 87.46% for CN vs. pMCI, which was about 5% higher than the accuracy obtained using their own models. These results are also better than those of Lin et al. (2018).

As for future work, we only used two modalities (T1-MRI and FDG-PET) as inputs in this study. However, new modalities can easily be added to the proposed networks. Imaging modalities of interest include T2-MRI (Rombouts et al., 2005), <sup>11</sup>C-PIB-PET (Zhang et al., 2014), and other PET agents such as amyloid protein imaging (Glenner and Wong, 1984). Also, the features extracted by a CNN are hard for human beings to comprehend, while methods such as attention mechanisms (Jetley et al., 2018) are able to visualize and analyze the activation maps of the model. Future work in this direction could improve the classification performance and help discover new medical imaging biomarkers.

## CONCLUSION

To conclude, we have proposed a multi-modality CNN-based classifier for AD diagnosis and prognosis. A VGG backbone, which is deeper than those used in most similar studies, has been used and explored. The accuracy of the models reached 90.10% for the CN vs. AD task, 87.46% for the CN vs. pMCI task and 76.90% for the sMCI vs. pMCI task. Our work also indicates that the hippocampal area with no segmentation can be chosen as the input.

## ETHICS STATEMENT

Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). Since its launch more than a decade ago, the landmark public-private partnership has made major contributions to AD research, enabling the sharing of data between researchers around the world. For up-to-date information, see www.adni-info.org.

## AUTHOR CONTRIBUTIONS

XZ is the corresponding author. XZ proposed the idea, supervised and managed the research, and revised the manuscript. YH led the implementations and experiments, and wrote the manuscript. JX co-led the work and wrote the manuscript. YZ supported the coding and experiments, and wrote the manuscript. TT co-investigated the work and revised the manuscript.

## FUNDING

This work was supported by the Science and Technology Commission of Shanghai Municipality (17JC1401600) and the Key Projects of Technological Department in Fujian Province (321192016Y0069201615).

### ACKNOWLEDGMENTS


Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd. and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2019.00509/full#supplementary-material

### REFERENCES

He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. arXiv

Hill, D. L. G., Schwarz, A. J., Isaac, M., Pani, L., Vamvakas, S., Hemmings, R., et al. (2014). Coalition Against Major Diseases/European Medicines Agency biomarker qualification of hippocampal volume for enrichment of clinical trials in predementia stages of Alzheimer's disease. Alzheimer's Dement. 10, 421–429. doi: 10.1016/j.jalz.2013.07.003



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Huang, Xu, Zhou, Tong, Zhuang and the Alzheimer's Disease Neuroimaging Initiative (ADNI). This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Supervised Brain Tumor Segmentation Based on Gradient and Context-Sensitive Features

Junting Zhao<sup>1</sup>, Zhaopeng Meng<sup>1,2</sup>, Leyi Wei<sup>3</sup>, Changming Sun<sup>4</sup>, Quan Zou<sup>5</sup>\* and Ran Su<sup>1</sup>\*

*<sup>1</sup> School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China, <sup>2</sup> Tianjin University of Traditional Chinese Medicine, Tianjin, China, <sup>3</sup> School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China, <sup>4</sup> CSIRO Data61, Sydney, NSW, Australia, <sup>5</sup> Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China*

Gliomas have the highest mortality rate and prevalence among the primary brain tumors. In this study, we proposed a supervised brain tumor segmentation method which detects diverse tumoral structures of both high grade gliomas and low grade gliomas in magnetic resonance imaging (MRI) images based on two types of features: gradient features and context-sensitive features. Two-dimensional and three-dimensional gradient information was fully utilized to capture the gradient change. Furthermore, we proposed a circular context-sensitive feature which captures context information effectively. These features, 62 in total, were compressed and optimized with an mRMR algorithm, and a random forest was used to classify voxels based on the compact feature set. To overcome the class-imbalance problem of MRI data, our model was trained on a class-balanced region of interest dataset. We evaluated the proposed method on the 2015 Brain Tumor Segmentation Challenge database, and the experimental results show a competitive performance.

#### Edited by:

*Nianyin Zeng, Xiamen University, China*

#### Reviewed by:

*Haiyang Yu, Nanyang Technological University, Singapore Lin Gu, National Institute of Informatics, Japan*

#### \*Correspondence:

*Quan Zou zouquan@nclab.net Ran Su ran.su@tju.edu.cn*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *15 January 2019* Accepted: *07 February 2019* Published: *14 March 2019*

#### Citation:

*Zhao J, Meng Z, Wei L, Sun C, Zou Q and Su R (2019) Supervised Brain Tumor Segmentation Based on Gradient and Context-Sensitive Features. Front. Neurosci. 13:144. doi: 10.3389/fnins.2019.00144*

Keywords: brain tumor segmentation, gradient, context-sensitive, random forest, mRMR, class-imbalanced

## 1. INTRODUCTION

Gliomas, the most common brain tumors in adults, have the highest mortality rate and prevalence among the primary brain tumors (DeAngelis, 2001). They can be classified into high grade gliomas (HGG) and low grade gliomas (LGG). HGG is more aggressive and infiltrative than LGG, so patients with HGG have a shorter life expectancy (Louis et al., 2007). Magnetic resonance imaging (MRI) with multiple sequences, such as T2-weighted fluid attenuated inversion recovery (Flair), T1-weighted (T1), T1-weighted contrast-enhanced (T1c), and T2-weighted (T2), provides detailed and valuable information about the brain, and is therefore commonly used to diagnose brain diseases, plan treatment strategies, and monitor tumor progression (Bauer et al., 2013; Zeng et al., 2018a). However, gliomas are difficult to localize in MRI for several reasons: they can invade almost anywhere in the brain, with various shapes and sizes and heterogeneous growth patterns (Zhao et al., 2018); their appearance in the images is similar to that of other diseases such as stroke or inflammation; and they are entangled with surrounding tissues, which makes their boundaries diffuse and blurry (Goetz et al., 2016). Furthermore, unlike X-ray computed tomography (CT) scans, MRI voxel intensities are not on a uniform scale, so the same tumor can have different gray values, especially when the scans are obtained at different institutions (Sapra et al., 2013). Manual segmentation requires expertise, and manually labeling each voxel is laborious and time-consuming (Gordillo et al., 2013). Moreover, intra- and inter-rater variabilities of 20% and 28%, respectively, have been reported for manual segmentation of brain tumors (Mazzara et al., 2004; Goetz et al., 2016). For these reasons, automatic methods that offer high accuracy and require less time than manual segmentation are in high demand.

In this paper, our goal is to propose an automatic method to detect three different regions of interest (ROI) in brain MRI: the complete tumor, the tumor core, and the enhancing tumor. Our main contributions can be summarized as follows:


The paper is organized as follows: we give a brief literature review of related work in section 2. The methods are then described in detail in section 3. We give the experimental results in section 4, followed by the conclusions in section 5.

### 2. RELATED WORK

Numerous methods of brain tumor detection and segmentation, including semi-automatic and fully automatic techniques, have been proposed (Tang et al., 2017). These segmentation techniques can be roughly divided into four categories: threshold-based techniques, region-based techniques, model-based techniques, and pixel/voxel classification techniques.

The threshold-based techniques, region-based techniques, and pixel classification techniques are commonly used for two-dimensional image segmentation (Vijayakumar and Gharpure, 2011). Model-based techniques and voxel classification methods are usually used for three-dimensional image segmentation. We will review the four types of methods in the following subsections.

### 2.1. Threshold-Based Techniques

Thresholding is a simple and computationally efficient approach to brain tumor segmentation because only intensity values need to be considered. The objects in the image are classified by comparing their intensities with one or more intensity threshold values (Gordillo et al., 2013). The Otsu algorithm (Otsu, 1979), Bernsen algorithm (Bernsen, 1986), and Niblack algorithm (Niblack, 1986) are simple and commonly used algorithms.

Gibbs et al. proposed an unsupervised approach using a global threshold to segment the ROI for the tumor extraction task from MRI images (Gibbs et al., 1996). Stadlbauer et al. used the Gaussian distribution of intensity values as the threshold to segment tumors in brain T2-weighted MRI (Stadlbauer et al., 2004). However, threshold-based algorithms are not suitable when the information in the image is too complex. They are also limited to extracting enhanced tumor areas.
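As an illustration of global thresholding, the following minimal sketch implements the between-class variance criterion of the Otsu algorithm on a list of integer gray levels. The function name `otsu_threshold`, the assumption of 256 gray levels, and the toy intensities are illustrative only and are not taken from the cited studies.

```python
def otsu_threshold(values, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    values: iterable of integer gray levels in [0, levels).
    Values <= t form the background class, values > t the foreground class.
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of samples in the background class
    sum0 = 0  # intensity sum of the background class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the criterion is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Toy bimodal data: dark background near 20, bright region near 200.
toy = [18, 20, 22, 19, 21, 20, 198, 200, 202, 201]
t = otsu_threshold(toy)
mask = [v > t for v in toy]  # True marks the bright class
```

A single global threshold of this kind separates the two modes of the toy data, but it has no notion of spatial context, which is one reason such methods struggle on complex images.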

### 2.2. Region-Based Techniques

Region-based methods divide an image into several regions that have homogeneity properties according to a predefined criterion (Adams and Bischof, 1994). Region growing and watershed methods are the most commonly used region-based methods for brain tumor segmentation.

Ho et al. proposed a region competition method which modulates the propagation term with a signed local statistical force to reach a stable state (Ho et al., 2002). Salman et al. compared seeded region growing and active contours against experts' manual segmentations (Salman et al., 2005). Sato et al. proposed a Sobel gradient magnitude-based region growing algorithm which solves the partial volume effect problem (Sato et al., 2000). Deng et al. proposed a region growing method based on the gradients and variances along and inside the boundary curve (Deng et al., 2010).

Letteboer et al. and Dam et al. described multi-scale watershed segmentation (Letteboer et al., 2001; Dam et al., 2004). Letteboer et al. proposed a semi-automatic multi-scale watershed algorithm for brain tumor segmentation in MR images (Letteboer et al., 2001). Region-based techniques are commonly used in brain tumor segmentation. However, region-based segmentation suffers from over-segmentation, and marker extraction is considerably difficult when using marker-based watershed segmentation. Li and Wan addressed these problems by proposing an improved watershed segmentation method with an optimal scale based on ordered dither halftone and mutual information (Li and Wan, 2010).
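To make the idea of seeded region growing concrete, here is a minimal two-dimensional sketch. The homogeneity criterion (absolute difference from the seed intensity within a tolerance `tol`), the 4-connectivity, and the function name `region_grow` are assumptions chosen for illustration; the cited methods use more elaborate criteria.

```python
from collections import deque


def region_grow(image, seed, tol):
    """Grow a 4-connected region from `seed` on a 2D list of gray values.

    A pixel joins the region when its intensity differs from the seed
    intensity by at most `tol`. Returns a set of (row, col) coordinates.
    """
    rows, cols = len(image), len(image[0])
    seed_val = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_val) <= tol:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


toy = [
    [10, 11, 50, 52],
    [12, 10, 51, 90],
    [11, 13, 12, 91],
]
dark = region_grow(toy, seed=(0, 0), tol=5)
```

The result depends strongly on the seed position and the tolerance, which reflects the sensitivity to initialization that region-based methods are known for.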

### 2.3. Model-Based Techniques

Model-based segmentation techniques can be divided into parametric deformable and geometric deformable approaches. There are a number of studies on image segmentation based on active contours, which is a popular parametric deformable method (Boscolo et al., 2002; Amini et al., 2004). Snake is one of the most commonly used geometric deformable algorithms for brain tumor segmentation. Luo et al. proposed a deformable model to segment brain tumors (Luo et al., 2003). This method combined the adaptive balloon force and the gradient vector flow (GVF) force to increase the GVF snake's capture range and convergence speed. Ho et al. proposed a new region competition method for automatic 3D brain tumor segmentation based on level-set snakes, which overcomes the difficulty in initialization and the missing boundary problem by modulating the propagation term with a signed local statistical force (Ho et al., 2002).

### 2.4. Pixel/Voxel Classification Techniques

Voxel-based classification usually uses voxel attributes for each voxel in the image such as gray level and color information. In brain tumor segmentation, voxel-based techniques are classified as unsupervised classifiers and supervised classifiers to cluster each voxel in the feature space (Gordillo et al., 2013).

Juang and Wu proposed a color-converted segmentation approach with the K-means clustering technique for MRI, which converts the input gray-level MRI image into a color space image and labels the image by cluster indices (Juang and Wu, 2010). Selvakumar et al. implemented a voxel classification method which combined K-means clustering and fuzzy C-means (FCM) segmentation (Selvakumar et al., 2012). Vasuda and Satheesh improved the conventional FCM by implementing data compression, including quantization and aggregation, to significantly reduce the dimensionality of the input (Vasuda and Satheesh, 2010). Compared with the conventional FCM, the modified FCM has a higher convergence rate. Ji et al. proposed a modified possibilistic FCM clustering of MRI utilizing local contextual information to impose local spatial continuity, reduce noise, and resolve classification ambiguity (Ji et al., 2011). Autoencoders were used in the work of Vaidhya et al. and Zeng et al. for brain tumor segmentation and other imaging tasks (Vaidhya et al., 2015; Zeng et al., 2018b). Zhang et al. proposed a hidden Markov random field model and the expectation-maximization algorithm for brain segmentation on MRI (Zhang et al., 2001).

For the voxel-classification MRI processing techniques, a proper description of voxels is required as a criterion to accurately classify each voxel. In previous studies, Zulpe et al. used gray-level co-occurrence matrix (GLCM) textural features to detect brain tumors (Zulpe and Pawar, 2012), and context-sensitive features were used in Meier et al.'s study to classify tumors and non-tumors (Meier et al., 2014). Meanwhile, a feature selection algorithm also needs to be well designed to select a compact set of features and reduce the computational cost (Zou et al., 2016a,b; Su et al., 2018), considering the huge data size of MRI. In our study, a set of informative features and an efficient feature selection algorithm were proposed. The experimental results have demonstrated that promising brain tumor segmentation performance can be achieved using the proposed method.

### 3. METHODOLOGY

In this paper, we extracted various types of features from the brain MRI and used them for classification. An mRMR feature selection method was used to reduce the feature dimension and select the best feature set. The whole pipeline is depicted in **Figure 1**. Firstly, the MRI sequences were pre-processed with smoothing and normalization operations. Secondly, we extracted two types of features: gradient-based features and context-sensitive features. Thirdly, we used an mRMR feature selection method to select the optimal feature set with minimal redundancy and maximal relevance. We explain the whole process in detail below.

### 3.1. Data

We used the training data of BraTS 2015 as our training and test data (Menze et al., 2015). It provides four sequences: T1, T1c, T2, and Flair. The image data contain 220 HGG (anaplastic astrocytomas and glioblastoma multiforme tumors) MR scans and 54 LGG (histological diagnosis: astrocytomas or oligoastrocytomas) cases. The "ground truth" consists of manual annotations with labels 0 to 4, where the four types of tumoral structures are labeled as follows: "necrotic (or fluid-filled) core" is labeled 1, "edema" is labeled 2, "non-enhancing (solid) core" is labeled 3, and "enhancing core" is labeled 4. The normal tissue is labeled 0. We evaluated our work within three regions: complete tumor (which contains necrotic core, edema, non-enhancing core, and enhancing core), tumor core (which contains necrotic core, non-enhancing core, and enhancing core), and enhancing tumor.

### 3.2. Pre-processing

We carried out smoothing and normalization on the MRI sequences to reduce the impact of image noise and to enhance image quality for further processing. As for smoothing, we chose the Gaussian filter which has been widely used in image processing and computer vision for noise suppression (Bergholm, 1987; Deng and Cahill, 1993; Kharrat et al., 2009; Zeng et al., 2017).
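As a minimal sketch of Gaussian smoothing, the code below builds a normalized one-dimensional Gaussian kernel and applies it to a signal; a volume can be smoothed by applying the same pass along each axis in turn, since the Gaussian filter is separable. The values of `sigma` and `radius` and the border handling (edge replication) are assumptions of this sketch, as the text does not state the filter parameters.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_1d(signal, kernel):
    """Filter `signal` with `kernel`, replicating the edge values at the borders."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp the index at the borders
            acc += w * signal[j]
        out.append(acc)
    return out


kernel = gaussian_kernel(sigma=1.0, radius=1)
spike = [0.0, 0.0, 10.0, 0.0, 0.0]  # isolated noisy sample
smoothed_spike = smooth_1d(spike, kernel)
```

The isolated spike is spread over its neighbors and its peak is lowered, which is the noise suppression effect that motivates the filter.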

MRI sequences are sensitive to acquisition conditions such as MR protocols, MR scanners, and MR adjustments (Sled et al., 1998). Even for the same tissue acquired under the same conditions, there will be a variation because MRI intensities do not have a tissue-specific value. In order to eliminate the impact of this variation on further intensity-based image processing, we normalized the smoothed values to the range from 0 to 1. The normalization was calculated as in Equation (1)

$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$

where X<sup>∗</sup> and X are the normalized and raw gray values, respectively; X<sub>max</sub> is the maximal gray value, and X<sub>min</sub> is the minimal gray value.
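Equation (1) can be sketched in a few lines. The function name `min_max_normalize` and the handling of a constant input (returning zeros) are assumptions of this sketch; the text does not specify how that degenerate case is treated.

```python
def min_max_normalize(values):
    """Rescale gray values to the range [0, 1] following Equation (1)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Constant input: Equation (1) is undefined. Returning zeros is
        # an assumption of this sketch.
        return [0.0 for _ in values]
    span = x_max - x_min
    return [(x - x_min) / span for x in values]


smoothed = [120.0, 300.0, 480.0, 840.0]  # toy smoothed gray values
normalized = min_max_normalize(smoothed)
```

Because the mapping depends only on the extreme values of each scan, a single outlier voxel can compress the remaining intensities, which is worth keeping in mind when applying it to real data.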

### 3.3. Gradient Based Features

The gradient value represents the rate of change in the direction of the largest possible intensity change. In our study, we used the central difference gradient as the gradient operator. For each voxel p, the derivative in a given direction is half the difference between the two voxels adjacent to p in that direction. Here we calculated two sets of gradient-based features within the ROI. The first set, which we named Gradient2D, contains the gradient along each coordinate plane. The Gradient2D of one image in each coordinate plane has two components: the x-derivative and the y-derivative. In Equation (2), we take the x-derivative as an example (I is the input image). The second set, Gradient3D, is based on the three-dimensional gradient magnitude. We further divide Gradient3D into five subsets, GM, rMean, rVar, seqMean, and seqVar, which are shown in **Table 1**.

The GM feature consists of the three-dimensional gradient magnitude, which is calculated based on Equation (3). G<sub>x</sub> is the directional gradient along the x-axis, G<sub>y</sub> is the directional gradient along the y-axis and G<sub>z</sub> is the directional gradient along the z-axis. In our study, we also used the central difference gradient as the gradient operator to extract the GM feature for each respective MRI image sequence. The operator is given in Equation (2).

TABLE 1 | The five feature subsets in Gradient3D.


$$\frac{dI}{dx} = \frac{I(x + 1) - I(x - 1)}{2} \tag{2}$$

$$\mathrm{GM}(G_x, G_y, G_z) = \sqrt{G_x^2 + G_y^2 + G_z^2} \tag{3}$$
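Equations (2) and (3) can be sketched as follows for a single interior voxel of a volume stored as a nested list `vol[z][y][x]`. The function names and the axis ordering are assumptions of this sketch, and border voxels are deliberately left out because the text does not say how they are handled.

```python
import math


def central_difference(vol, z, y, x, axis):
    """Equation (2) along one axis (0 = z, 1 = y, 2 = x) at an interior voxel.

    Border voxels need separate handling: a negative index would silently
    wrap around in a Python list.
    """
    step = [0, 0, 0]
    step[axis] = 1
    dz, dy, dx = step
    ahead = vol[z + dz][y + dy][x + dx]
    behind = vol[z - dz][y - dy][x - dx]
    return (ahead - behind) / 2.0


def gradient_magnitude(vol, z, y, x):
    """Equation (3): Euclidean norm of the three directional gradients."""
    g_z, g_y, g_x = (central_difference(vol, z, y, x, axis) for axis in range(3))
    return math.sqrt(g_x ** 2 + g_y ** 2 + g_z ** 2)


# Toy volume with a linear ramp I = 2x + 3y + 6z, indexed as vol[z][y][x].
vol = [[[2 * x + 3 * y + 6 * z for x in range(3)]
        for y in range(3)]
       for z in range(3)]
gm_center = gradient_magnitude(vol, 1, 1, 1)
```

For the linear ramp, the directional gradients are 2, 3, and 6, so the magnitude at the center voxel is 7. Array libraries provide vectorized equivalents, for example `numpy.gradient`, which also uses central differences at interior points.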

We further extracted the rMean and rVar features by calculating the mean and variance of the GM feature over cube-shaped neighborhoods of sizes 3<sup>3</sup>, 5<sup>3</sup>, and 7<sup>3</sup> for each GM feature of each respective sequence. Meanwhile, we extracted seqMean and seqVar by calculating the mean and variance of the GM features over the sequences in cube-shaped neighborhoods.
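A minimal sketch of the rMean and rVar computation at one voxel is given below. It assumes a gradient magnitude map stored as a nested list `gm[z][y][x]`, uses the population variance, and clips the cube at the volume border; these three choices, as well as the function name `cube_stats`, are assumptions, since the text does not specify them.

```python
from statistics import fmean, pvariance


def cube_stats(gm, z, y, x, size):
    """Mean and population variance of `gm` over a size^3 cube centered on (z, y, x).

    The cube is clipped at the volume border (a choice of this sketch).
    """
    h = size // 2
    vals = [
        gm[k][j][i]
        for k in range(max(z - h, 0), min(z + h + 1, len(gm)))
        for j in range(max(y - h, 0), min(y + h + 1, len(gm[0])))
        for i in range(max(x - h, 0), min(x + h + 1, len(gm[0][0])))
    ]
    return fmean(vals), pvariance(vals)


# Toy GM map: uniform value 2.0 with one strong response at the center.
gm = [[[2.0] * 3 for _ in range(3)] for _ in range(3)]
gm[1][1][1] = 29.0
r_mean, r_var = cube_stats(gm, 1, 1, 1, size=3)
```

Calling the same function with `size` set to 3, 5, and 7 would give the three neighborhood scales mentioned above for each voxel.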

### 3.4. Circular Context-Sensitive Feature

Meier et al. proposed context-sensitive features for brain tumor segmentation which extract in-plane ray features by calculating the histogram using intensity values from T1 and Flair-weighted images after atlas-normalization (Meier et al., 2014). The rationale of this method is that the intensity range of T1 and Flair-weighted modalities is larger than that of the healthy tissues. Based on this method, we proposed a circular context-sensitive (CCS) feature to capture more details in various sizes.

In context-sensitive features, every voxel sends out four rays with radius r ∈ {10, 20}, oriented at ang ∈ {0, π/2, π, 3π/2}. In order to obtain more information and extract features at multiple scales, we made several improvements to the original context-sensitive features. Firstly, instead of utilizing only voxel information in the horizontal or vertical directions, we used denser rays, evenly distributed on a circle, to sweep all the orientations. The directions are calculated using the following equation:

$$\begin{aligned} \mathrm{ang} &= \beta + n \ast \beta_{\theta}, \quad n \in N, \\ \beta_{\theta} &= 2\pi / N_{\beta} \end{aligned} \tag{4}$$

where β is the initial angle, β<sub>θ</sub> is the step size rotating around the center point, and N<sub>β</sub> is the total number of directions. Secondly, in order to capture context features at all the scales, we used a continuous radius to cover the neighboring voxel information as much as possible. The radius r is defined as:

$$\begin{aligned} r &= r_0 + n \ast r_{\theta}, \quad n \in N^{*}, \\ r_{\theta} &= (r_{\max} - r_{\min}) / N_r \end{aligned} \tag{5}$$

where r<sub>0</sub> is the initial radius, r<sub>θ</sub> is the step size moving toward the outermost circle, r<sub>min</sub> and r<sub>max</sub> are the minimum and maximum of the radius, and N<sub>r</sub> is the total number of rays. We show the comparison between the original context-sensitive features and the circular context-sensitive features in **Figure 2**. The original context-sensitive feature sends out four rays in four directions. The CCS features, however, send out rays along all the orientations and with all the radii, which is supposed to capture rich context information. In our studies, we used 8 rays evenly distributed on 45 circles ranging from 10 to 20 with even numbers to capture the context-sensitive features.
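The sampling pattern defined by Equations (4) and (5) can be sketched as a list of in-plane offsets around a center voxel. This sketch takes the initial radius to be the minimum radius and lets the index start at 0 so that both the innermost and outermost circles are included; the parameter values (8 directions, radii 10 to 20 in steps of 2) follow one reading of the text, and the function name `ccs_offsets` is hypothetical. How the sampled intensities are turned into feature values is not covered here.

```python
import math


def ccs_offsets(n_dirs, r_min, r_max, n_radii, beta=0.0):
    """In-plane sample offsets (dx, dy) for rays on concentric circles.

    Directions follow Equation (4): ang = beta + n * (2 * pi / n_dirs).
    Radii follow Equation (5): r = r_min + n * (r_max - r_min) / n_radii,
    with r_min used as the initial radius (an assumption of this sketch).
    """
    step_ang = 2.0 * math.pi / n_dirs
    step_r = (r_max - r_min) / n_radii
    offsets = []
    for k in range(n_radii + 1):  # k = 0..n_radii includes both end circles
        r = r_min + k * step_r
        for n in range(n_dirs):
            ang = beta + n * step_ang
            # Round to the nearest voxel position on the image grid.
            offsets.append((round(r * math.cos(ang)), round(r * math.sin(ang))))
    return offsets


# Illustrative parameters: 8 directions, radii 10, 12, ..., 20.
offsets = ccs_offsets(n_dirs=8, r_min=10, r_max=20, n_radii=5)
```

Adding each offset to the coordinates of a center voxel gives the positions whose intensities would be read along the rays.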

In MRI, the slice thickness varies considerably, which affects the feature extraction results, so the features are considered only for the T1 and Flair-weighted in-plane images. Our CCS feature extraction is summarized as follows:


In summary, we extracted 12 Gradient2D features, 34 Gradient3D features, 4 context-sensitive features, and 12 CCS features, as shown in **Table 2**. In the next section, we will show the voxel-based classification and feature selection to label different regions of the brain MR images.

FIGURE 2 | The original context-sensitive features and the circular context-sensitive (CCS) features. (A) shows the original context-sensitive features. (B) shows the CCS features. In this example, the center voxel sends out 18 rays of length *r* + *n* ∗ *r*<sub>θ</sub> with an angle β + *n* ∗ β<sub>θ</sub>. The CCS can fully extract context information instead of only voxel information in horizontal or vertical directions.

TABLE 2 | Number of features in each feature group.


### 3.5. Feature Selection Based on mRMR

We classified the brain MRI images into five categories: normal tissue, edema, non-enhancing core, necrotic core, and enhancing core. Random forest (RF) (Breiman, 2001) is an ensemble learning method for classification, regression, and other tasks, and has been widely used in image analysis (Jin et al., 2014; Liu et al., 2015). It consists of a multitude of decision trees and aggregates the votes of the individual trees. Random forests are able to handle multiclass problems, and they provide a probabilistic output instead of hard label separations. However, due to the large volume of MRI images, direct classification based on all the extracted features (as shown in **Table 2**) is time-consuming. A proper feature selection algorithm will greatly reduce the computational cost and increase the efficiency.

Minimal Redundancy Maximal Relevance (mRMR), proposed by Peng et al., selects features that have minimum redundancy with each other and maximum relevance to the target class (Peng et al., 2005). We use Equation (6) to search for features which have the maximum relevance, and use the minimal redundancy condition in Equation (7) to select mutually exclusive features. Equation (8) gives the mRMR features. x<sub>i</sub> represents the i-th feature in feature set S with target class c, I denotes mutual information, and Φ denotes the combination of D and R.

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) \tag{6}$$

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j) \tag{7}$$

$$\max \Phi(D, R), \quad \Phi = D - R \tag{8}$$
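The criterion in Equations (6) to (8) is commonly optimized with a greedy incremental search, which the sketch below illustrates for discrete features. The function names, the use of empirical mutual information on discrete values, and the toy data are assumptions of this sketch; continuous MRI features would first need to be discretized, and the procedure actually used in this study may differ in its details.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Empirical I(a; b) in nats for two equally long discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (pa[x] * pb[y]))
        for (x, y), c in pab.items()
    )


def mrmr_select(features, target, k):
    """Greedily pick k feature indices by relevance minus mean redundancy."""
    k = min(k, len(features))
    relevance = [mutual_information(f, target) for f in features]
    # Start with the single most relevant feature.
    selected = [max(range(len(features)), key=lambda j: relevance[j])]
    while len(selected) < k:
        best_j, best_score = None, -math.inf
        for j in range(len(features)):
            if j in selected:
                continue
            redundancy = sum(
                mutual_information(features[j], features[s]) for s in selected
            ) / len(selected)
            score = relevance[j] - redundancy  # incremental form of D - R
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Toy data: feature 1 duplicates feature 0, feature 2 adds new information.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
target = [x + 2 * y for x, y in zip(a, b)]
chosen = mrmr_select([a, list(a), b], target, k=2)
```

In the toy data all three features are equally relevant to the target, but the duplicate is fully redundant with the first pick, so the search prefers the feature that carries new information.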

In order to avoid over-learning and under-learning, we evaluated our RF classifier by 5-fold cross validation with the measurements in section 3.7. In detail, we divided the subcases into 5 roughly equal parts. For each k = 1, 2, ..., 5, we fit the RF model to the other 4 parts, and predicted the kth part with the fitted RF model. The final outcome equals the mean of the results of the 5 folds. In our study, we used the mRMR feature selection method to select the minimal feature set which reduces the computational cost without performance degradation. The raw 62-dimensional feature set is shown in **Table 2**. The mRMR feature selection details are as follows:


### 3.6. Solution to the Class-Imbalance Problem

The "BraTS" data is seriously unbalanced, with less than 1% of voxels being tumor voxels. Training on them would result in problems such as higher mis-classification rate for the minority class data. Thus, we carried out three steps to overcome the class-imbalance problem.

1. Detect the boundary of the ROI: We moved a plane along the direction of each axis until it reached a voxel with a nonzero label. The position of the plane at that point was used as the boundary in that direction.
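The plane sweep in step 1 can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the label volume is a nested list indexed as `labels[z][y][x]`, that 0 means background, and that the sweep is run from both ends of each axis. The name `roi_bounds` is hypothetical.

```python
def roi_bounds(labels):
    """Return inclusive (low, high) bounds of nonzero voxels along z, y, x.

    labels: nested list indexed as labels[z][y][x]; 0 is assumed background.
    Returns None when the volume contains no labelled voxel.
    """
    nz, ny, nx = len(labels), len(labels[0]), len(labels[0][0])

    def plane_has_label(axis, i):
        # True if the plane at index i, perpendicular to axis, hits a label
        if axis == 0:
            return any(labels[i][y][x] for y in range(ny) for x in range(nx))
        if axis == 1:
            return any(labels[z][i][x] for z in range(nz) for x in range(nx))
        return any(labels[z][y][i] for z in range(nz) for y in range(ny))

    bounds = []
    for axis, size in enumerate((nz, ny, nx)):
        # sweep forward until the first plane that touches a labelled voxel
        low = next((i for i in range(size) if plane_has_label(axis, i)), None)
        if low is None:
            return None
        # sweep backward from the far end for the opposite boundary
        high = next(i for i in reversed(range(size)) if plane_has_label(axis, i))
        bounds.append((low, high))
    return tuple(bounds)
```

Cropping the volume to these bounds discards most background voxels before training, which is one way to reduce the imbalance described above.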


TABLE 3 | Performance of feature sets with different dimensions.

corresponding values across all the *f*. If equal values appeared, we took another metric, the size of the feature set, into account. Then we summed all the rankings of each *f* and plotted the curve.


### 3.7. Performance Measurements

To show the performance of our segmentation approach, we use Dice, positive predictive value (PPV), sensitivity, and specificity to evaluate HGG and LGG tumor region segmentation. TP is the number of "true positives," where a "true positive" is the event that the test makes a positive prediction and the subject has a positive ground truth. FP is the number of "false positives," where a "false positive" is the event that the test makes a positive prediction and the subject has a negative ground truth. FN is the number of "false negatives," and TN is the number of "true negatives."

1. Dice

$$\text{Dice} \quad = \frac{TP}{((TP + FP) + (TP + FN))/2}$$

The Dice score normalizes the number of true positives to the average size of the two segmented areas. It is identical to the F-score (the harmonic mean of precision and recall) and can be transformed monotonically to the Jaccard score.

2. Positive predictive value (PPV)

$$\text{PPV} \quad = \frac{TP}{TP + FP}$$

The PPV is the proportion of positive predictions that are true positives.

3. Sensitivity

$$\text{Sensitivity} \quad = \frac{TP}{TP + FN}$$

Sensitivity measures the proportion of actual positives that are correctly identified as such.

4. Specificity

$$\text{Specificity} \quad = \frac{TN}{FP + TN}$$

Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such.
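The four measurements can be computed directly from the confusion counts. The following sketch assumes two flattened binary masks of equal length (prediction and ground truth) for one tumor region; the function names are hypothetical, and returning `nan` for an empty denominator is a choice made here, not something the text specifies.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two equal-length binary sequences."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Return Dice, PPV, sensitivity and specificity as a dict."""
    tp, fp, fn, tn = confusion_counts(pred, truth)

    def ratio(num, den):
        # undefined ratios (empty denominator) are reported as NaN
        return num / den if den else float("nan")

    return {
        # TP / (((TP + FP) + (TP + FN)) / 2) rewritten as 2TP / (2TP + FP + FN)
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
    }
```

Because tumor voxels are rare, specificity is usually close to 1 even for poor segmentations, which is why Dice, PPV, and sensitivity are reported alongside it.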

### 4. EXPERIMENTAL RESULTS

In our experiments, we extracted four groups of features after pre-processing: Gradient2D features, Gradient3D features, context-sensitive features, and CCS features, 62 features in total (as shown in **Table 2**). The Gradient3D set contains five subsets: GM, rMean, rVar, seqMean, and seqVar. We used the mRMR feature selection method to select a compact set of features and built the random forest classifier.




First, we built a random forest classifier for each of the top-f feature sets ranked by mRMR, where f denotes the dimension of the feature set and f = 62 − 5n, n = 0, 1, 2, ..., 12. Second, we compared the performance of the feature sets with different f. The best feature dimension f<sub>m</sub> is the f with the best performance, and we recorded the f<sub>m</sub>-dimensional feature set as the optimal feature set. Below, we first present the performance comparisons between different values of f and then the comparisons among different feature groups. Next, we compare our results with the method of Meier et al. (2014). Lastly, we show the segmentation results marked with different colors.

### 4.1. Comparison Between Feature Sets With Different Dimensions

In this step, we trained our model with different numbers of features, as shown in **Table 3**. For the n-th training run, we selected the top f features ranked by the mRMR feature selection method according to their relevance and redundancy, where f = 62 − 5n, n = 0, 1, 2, ..., 12. We used random forest as our classifier and set the number of trees in the forest to 100. For every n, we evaluated the model with the measurements in section 3.7 within three regions: complete tumor, tumor core, and enhancing tumor. The HGG&LGG, HGG, and LGG MR scan subsets were tested.

As shown in **Table 3**, the difference between the results of two adjacent experiments is very small, and it is difficult to tell which dimension f has the best performance. To obtain an intuitive feature selection outcome, we provide an overall ranking of the performances for each f and show the results in **Figure 3**. It can be seen that the 22-dimensional feature set achieves the best performance among all the tested feature sets.
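One way to read the overall ranking is as a rank sum: every f is ranked on each measurement, and the ranks are added up. The sketch below is an interpretation under stated assumptions, not the authors' procedure: higher metric values are assumed to be better, rank 1 is the best, and ties are assumed to be broken in favor of the smaller feature set. The function name and the scores in the usage note are hypothetical.

```python
def candidate_dimensions(total=62, step=5, runs=13):
    """Feature-set sizes f = total - step * n for n = 0, ..., runs - 1."""
    return [total - step * n for n in range(runs)]


def overall_ranking(scores):
    """Sum per-metric ranks for each feature dimension f.

    scores: dict mapping f to a list of metric values (higher is better).
    Returns a dict mapping f to its summed rank; a lower sum is better.
    """
    dims = sorted(scores)
    n_metrics = len(scores[dims[0]])
    totals = {f: 0 for f in dims}
    for m in range(n_metrics):
        # best value first; equal values are ordered by feature-set size
        ordered = sorted(dims, key=lambda f: (-scores[f][m], f))
        for rank, f in enumerate(ordered, start=1):
            totals[f] += rank
    return totals
```

For example, with made-up scores `{62: [0.80, 0.70], 22: [0.80, 0.75], 2: [0.50, 0.40]}`, the tie on the first metric goes to the 22-dimensional set, which also ends up with the lowest rank sum.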

### 4.2. Comparison Between Different Feature Groups

In our study, we tested four different types of features. To learn which feature group is more informative for classification, we trained the model with each feature group and compared their performances in **Table 4**. The table shows that the Gradient3D group performed far better than the other groups. It obtained a high Dice score (0.91) for complete tumor in HGG and a high PPV (0.92) for complete tumor in the HGG&LGG datasets. The CCS features performed slightly better than the context-sensitive features. However, no single group achieved better performance than the full set of 62 features, which shows that each group is useful for classification and that integrating all four groups is more helpful.

### 4.3. Comparison With Other Method

We compared our method with the method proposed by Meier et al. (2014). They extracted appearance-sensitive and context-sensitive features and also used random forest as a classifier.

As shown in **Table 5**, Meier et al. evaluated the classifier with three ROIs: complete tumor, tumor core, and enhancing tumor. However, the LGG performance was not mentioned. In our experiments, we trained the HGG&LGG, HGG, and LGG models and tested our models with three ROIs. As shown in **Table 3**, we had better performance, especially for complete tumor. Compared to Meier et al.'s method, we not only extracted the context-sensitive features but also improved on them by proposing the circular context-sensitive features, which consider multiple scales and multiple directions.

FIGURE 4 | Examples of the brain tumor segmentation results using the proposed method. Rows 1, 3, and 5 are the axial, sagittal, and coronal slices of the ground truth. Rows 2, 4, and 6 are the axial, sagittal, and coronal slices of our results. The labels of the tumor structure: enhancing tumor (green), tumor core (green and red), complete tumor (green, red, and yellow).

### 4.4. Performance of Brain Tumor Segmentation

In **Figure 4**, the axial, sagittal, and coronal slices of the ground truth are shown in rows 1, 3, and 5, respectively. The corresponding slices of the segmentation results are shown in rows 2, 4, and 6, respectively. As shown in **Figure 4**, our segmentation results are consistent with the ground truth, and the method performs well in the axial, sagittal, and coronal directions.

### 5. CONCLUSIONS

In our study, we proposed a supervised brain tumor segmentation method for MRI scans. We extracted four feature groups, namely the Gradient2D set, the Gradient3D set, context-sensitive features, and circular context-sensitive features, 62 features in total. Then we selected the most informative feature set based on the mRMR algorithm and used it to build the random forest in order to distinguish different regions of brain tumors. We presented performance comparisons among feature sets of different dimensions, comparisons among different feature subgroups, and comparisons with other tumor segmentation approaches. The results show that the proposed method is competitive in segmenting brain tumors.

### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: http://braintumorsegmentation.org/.

### AUTHOR CONTRIBUTIONS

JZ conducted the experiments. ZM, LW, and CS participated in writing the manuscript. QZ and RS designed the experiments and edited the manuscript.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhao, Meng, Wei, Sun, Zou and Su. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Prediction and Classification of Alzheimer's Disease Based on Combined Features From Apolipoprotein-E Genotype, Cerebrospinal Fluid, MR, and FDG-PET Imaging Biomarkers

Yubraj Gupta, Ramesh Kumar Lama, Goo-Rak Kwon\* and the Alzheimer's Disease Neuroimaging Initiative†

*Department of Information and Communication Engineering, Chosun University, Gwangju, South Korea*

Edited by:

*Siyang Zuo, Tianjin University, China*

#### Reviewed by:

*Alle Meije Wink, VU University Medical Center, Netherlands Mohammad M. Herzallah, Al-Quds University, Palestine*

> \*Correspondence: *Goo-Rak Kwon grkwon@chosun.ac.kr*

*†For information on the Alzheimer's Disease Neuroimaging Initiative please see the Acknowledgments section*

> Received: *14 January 2019* Accepted: *01 October 2019* Published: *16 October 2019*

#### Citation:

*Gupta Y, Lama RK, Kwon G-R and the Alzheimer's Disease Neuroimaging Initiative (2019) Prediction and Classification of Alzheimer's Disease Based on Combined Features From Apolipoprotein-E Genotype, Cerebrospinal Fluid, MR, and FDG-PET Imaging Biomarkers. Front. Comput. Neurosci. 13:72. doi: 10.3389/fncom.2019.00072*

Alzheimer's disease (AD), including its mild cognitive impairment (MCI) phase that may or may not progress into AD, is the most common form of dementia. It is extremely important to correctly identify patients during the MCI stage because this is the phase where AD may or may not develop. Thus, it is crucial to predict outcomes during this phase. Thus far, many researchers have used only a single modality of biomarker for the diagnosis of AD or MCI. However, recent studies show that a combination of different biomarkers may provide complementary information for the diagnosis and increase the classification accuracy in distinguishing between different groups. In this paper, we propose a novel machine learning-based framework to discriminate subjects with AD or MCI utilizing a combination of four different biomarkers: fluorodeoxyglucose positron emission tomography (FDG-PET), structural magnetic resonance imaging (sMRI), cerebrospinal fluid (CSF) protein levels, and Apolipoprotein-E (APOE) genotype. The Alzheimer's Disease Neuroimaging Initiative (ADNI) baseline dataset was used in this study. In total, there were 158 subjects for whom all four modalities of biomarker were available. Of the 158 subjects, 38 subjects were in the AD group, 82 subjects were in MCI groups (including 46 in MCIc [MCI converted; conversion to AD within 24 months of time period], and 36 in MCIs [MCI stable; no conversion to AD within 24 months of time period]), and the remaining 38 subjects were in the healthy control (HC) group.
For each image, we extracted 246 regions of interest (as features) using the Brainnetome template image and the NiftyReg toolbox, and we then combined these features with three CSF and two APOE genotype features obtained from the ADNI website for each subject using an early fusion technique. Here, a different kernel-based multiclass support vector machine (SVM) classifier with a grid-search method was applied. Before passing the obtained features to the classifier, we used the truncated singular value decomposition (Truncated SVD) dimensionality reduction technique to reduce the high-dimensional features to a lower-dimensional representation. As a result, our combined method achieved an area under the receiver operating characteristic (AU-ROC) curve of 98.33, 93.59, 96.83, 94.64, 96.43, and 95.24% for AD vs. HC, MCIs vs. MCIc, AD vs. MCIs, AD vs. MCIc, HC vs. MCIc, and HC vs. MCIs subjects, respectively, which are high relative to single-modality results and other state-of-the-art approaches. Moreover, the combined multimodal method improved the classification performance over unimodal classification.

Keywords: Alzheimer's disease, MCIs (MCI stable), MCIc (MCI converted), sMRI, FDG-PET, CSF, apolipoprotein-E (APOE) genotype, support vector machine

### INTRODUCTION

Alzheimer's disease (AD) is an age-related neurodegenerative disorder that is commonly seen in the aging population. Its prevalence is expected to increase greatly in the coming years as it affects one out of nine people over the age of 65 years (Bain et al., 2008). AD involves progressive cognitive impairment, commonly associated with early memory loss, leading patients to require assistance for activities of self-care during its advanced stages. AD is characterized by the accumulation of amyloid-beta (Aβ) peptide in amyloid plaques in the extracellular brain parenchyma and by intra-neuronal neurofibrillary tangles caused by the abnormal phosphorylation of the tau protein (De Leon et al., 2007). Amyloid deposits and tangles are necessary for the postmortem diagnosis of AD. From a clinical perspective, a prediction of AD dementia within a foreseeable time period, i.e., within 1–2 years, is much more pertinent than a prediction of AD dementia in the distant future, e.g., in 10–20 years. Individuals classified as being at "short-term risk" can receive more active treatment and counseling. Mild cognitive impairment (MCI) is a prodromal (predementia) stage of AD, and recent studies have shown that individuals with amnestic MCI tend to progress to probable AD at a rate of ∼10–15% per year (Braak and Braak, 1995; Braak et al., 1998). Thus, accurate diagnosis of AD, and especially MCI, is of great importance for prompt treatment and a likely delay of the progression of the disease. MCI patients who do not progress to AD either develop another form of dementia, retain a stable condition, or revert to a non-demented state. Therefore, predicting which MCI patients will develop AD in the short term and which will remain stable is extremely relevant to future treatments and is complicated by the fact that both AD and MCI affect the same structures of the brain.
In subjects with MCI, the effects of cerebral amyloidosis and hippocampal atrophy on the progression to AD dementia differ; e.g., the risk profile is linear with hippocampal atrophy but reaches a ceiling with higher values for cerebral amyloidosis (Jack et al., 2010). In subsequent investigations, biomarkers of neural injury appeared to best predict AD dementia from MCI subjects, particularly at shorter time intervals (1–2 years) (Dickerson, 2013). This demonstrates the great importance of developing a sensitive biomarker that can detect and monitor early changes in the brain. The ability to diagnose and classify AD or MCI at an early stage allows clinicians to make more informed decisions at a later period regarding clinical interventions or treatment planning, thus having a great impact on reducing the cost of long-term care.

Over the past several years, several classification methods have been implemented to overcome these problems using only a single modality of biomarkers. For example, many high-dimensional classification techniques use only sMR images for the classification of AD and MCI. sMRI captures disease-related structural patterns by measuring the loss of brain volume and decreases in cortical thickness (Davatzikos et al., 2008; Cuingnet et al., 2011; Salvatore et al., 2015; Beheshti et al., 2016, 2017; Jha et al., 2017; Lama et al., 2017; Long et al., 2017) for the early prediction of AD and MCI. A number of studies, covering volume of interest, region of interest (ROI), shape analysis, and voxel-based morphometry, have reported that the amount of atrophy in several sMRI brain regions, such as the entorhinal cortex, hippocampus, parahippocampal gyrus, cingulate, and medial temporal cortex (Cuingnet et al., 2011; Moradi et al., 2015; Beheshti et al., 2016; Gupta et al., 2019), is sensitive to disease progression and to the prediction of MCI conversion. In addition to sMRI, another important and thoroughly established neuroimaging modality for the diagnosis of neurodegenerative dementia (AD or MCI) is 18F-FDG-PET, which mainly measures hypometabolism, reflecting neuronal dysfunction (Minoshima et al., 1997; Foster et al., 2007; Li et al., 2008; Förster et al., 2012; Nozadi et al., 2018; Samper-González et al., 2018). Using FDG-PET images, some recent studies have reported that reduced glucose metabolism (hypometabolism) occurs in the posterior cingulate cortex, precuneus, and posterior parietal temporal association cortex (Förster et al., 2012), and that it usually precedes cortical atrophy (Minoshima et al., 1997; Li et al., 2008) and clinical cognitive symptoms in AD patients.
Besides these neuroimaging biomarkers, there are also some biochemical (blood-protein level) and genetic (gene-protein level) biomarkers for the diagnosis of AD and MCI subjects. Biochemical changes in the brain are reflected in the cerebrospinal fluid (CSF) (Chiam et al., 2014; Zetterberg and Burnham, 2019). Decreased CSF levels of amyloid-beta (Aβ) 1 to 42 peptide (Aβ1–42; a marker of amyloid mis-metabolism) (Blennow, 2004; Shaw et al., 2009; Frölich et al., 2017) and elevations of total tau (t-tau) and tau hyperphosphorylated at threonine 181 (p-tau181p) (markers of axonal damage and neurofibrillary tangles) (Andreasen et al., 1998; Anoop et al., 2010; Fjell et al., 2010) are considered to be the best-established CSF predictive biomarkers of AD dementia in patients with MCI. Recent studies have shown that alterations in genetic polymorphisms also play a vital role in AD and MCI patients (Gatz et al., 2006; Spampinato et al., 2011; Dixon et al., 2014). Perhaps the most commonly considered polymorphism in cognitive and neurodegenerative aging is apolipoprotein E (APOE; rs7412; rs429358). It is involved in lipid transfer, cell metabolism, repair of neuronal injury due to oxidative stress, amyloid-beta peptide accumulation, and the aging process. APOE is encoded by a gene on chromosome 19 with three alleles (ε2, ε3, and ε4), and it is expressed in the central nervous system in astrocytes and neurons. The APOE ε4 allele has been consistently linked to cognitive decline in normal aging as well as in MCI and AD dementia patients (Luciano et al., 2009; Brainerd et al., 2011; Alzheimer's Disease Neuroimaging Initiative et al., 2016; Sapkota et al., 2017). APOE ε4 is also reported to be the strongest genetic risk factor, conferring a 2- to 3-fold increased risk of AD and lowering the age of onset of AD. All of these studies use only a single modality of biomarkers, and the performance of their proposed algorithms is low compared to recently published multimodal methods (Zhang et al., 2011; Suk et al., 2014; Ritter et al., 2015; Frölich et al., 2017; Li et al., 2017; Gupta et al., 2019). The multimodal studies suggest that classification performance improves when different modalities of biomarkers are combined into one form, because different biomarkers offer complementary information (or capture disease information from different perspectives) that is useful for the early classification of AD and MCI patients.

Recently, Jack et al. (2016, 2018) proposed the A/T/N system, as shown in **Table 1**, in which seven major AD biomarkers are divided into three binary categories based on the nature of the pathophysiology that each subject exhibits.

Based on the above system, we propose to combine four different modalities of biomarkers for each patient: fluorodeoxyglucose positron emission tomography (FDG-PET), structural magnetic resonance imaging (sMRI), cerebrospinal fluid (CSF) protein levels, and the apolipoprotein E (APOE) genotype. Over the past few years, several techniques have been proposed that use a combination of two or three different biomarker modalities, such as the combination of MRI and CSF biomarkers (Vemuri et al., 2009; Fjell et al., 2010; Davatzikos et al., 2011); MRI and FDG-PET biomarkers (Chetelat et al., 2007; Li et al., 2008, 2017; Shaffer et al., 2013); MRI, FDG-PET, and CSF (Walhovd et al., 2010; Zhang et al., 2011; Shaffer et al., 2013; Ahmed et al., 2014; Ritter et al., 2015); and MRI, FDG-PET, and APOE (Young et al., 2013). Although these published approaches have utilized a combination of different types of biomarkers to develop neuroimaging biomarkers for AD, the above methods may be limited. They have used brain atrophy from a few manually extracted regions as a feature for sMRI and PET images to classify different groups. However, using only a small number of brain regions as features from any imaging modality may not reflect the spatiotemporal pattern of structural and physiological abnormalities in its entirety (Fan et al., 2008). Furthermore, merely increasing the number of combined biomarkers did not lead to an increase in predictive power. As Heister et al. (2011) explained, a combination of impaired learning ability with medial temporal atrophy was associated with the greatest risk of developing AD in a group of MCI patients.

#### TABLE 1 | A/T/N biomarker grouping.

In this study, we propose a novel approach for the early detection of AD among other groups and for differentiating the most similar clinical entities, MCIs and MCIc, by combining biomarkers from two imaging modalities (sMRI and FDG-PET) with CSF (biochemical protein level) and APOE genotype biomarkers obtained from each patient. Because, under the A/T/N system, each modality of biomarkers offers different complementary information that is useful for the early classification of AD and MCI subjects, we used all four modalities for the early prediction of AD and MCI. Using an early fusion method, we combined the measurements from the four biomarkers (sMRI, FDG-PET, CSF, and APOE) to discriminate between AD and HC, MCIc and MCIs, AD and MCIs, AD and MCIc, HC and MCIs, and HC and MCIc. We compare classification performance for different groups using typical measures of gray matter atrophy (from sMR images), the average intensity of each region (from FDG-PET images), t-tau, p-tau181p, and Aβ42 scores (from the biochemical level), and ε3/ε4 and ε4/ε4 values from the APOE genotype biomarker. To distinguish between these groups, we used a different kernel-based multiclass SVM classifier with a 10-fold stratified cross-validation technique that helps to find the optimal hyperparameters for this classifier. Our experimental results show that combining the measurements from the four biomarker modalities yields much better performance for all classification groups than using the best individual modality.

### MATERIALS AND METHODS

### Participants

Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/ADNI). The ADNI was launched in 2003 as a public-private partnership led by Principal Investigator, Michael W. Weiner, MD. The primary goal of the ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For up-to-date information, see https://www.adni-info.org.

In total, we included 158 different subjects from the ADNI database. The included subjects were African-American, Asian, and white people living in America, were between 50 and 89 years of age, and spoke either Spanish or English. Patients taking specific psychoactive medications at the time of scanning were excluded from the study. The general inclusion/exclusion criteria were as follows. HC subjects had a Clinical Dementia Rating (CDR) (Morris, 1993) of 0 and a Mini-Mental State Examination (MMSE) score between 24 and 30 (inclusive), and were non-MCI, non-depressed, and non-demented. MCI subjects had a CDR level of 0.5, MMSE scores between 24 and 30 (inclusive), a slight memory complaint, objective memory loss measured by education-adjusted scores on the Wechsler Memory Scale Logical Memory II (Elwood, 1991), an absence of significant levels of impairment in other cognitive domains, essentially preserved activities of daily living, and an absence of dementia. AD patients had MMSE scores between 20 and 26 and a CDR level of 0.5 or 1.0, and met the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related Disorders Association (NINCDS/ADRDA) criteria for probable AD. We selected all subjects for whom all four modalities of biomarkers were available. The four obtained biomarkers were 1.5-T T1-weighted sMRI, FDG-PET, CSF measures of three protein levels (t-tau, p-tau181p, and Aβ42), and APOE genotype. Of the 158 subjects, 38 subjects were in the AD group (MMSE ≤ 24) and 82 subjects were in the MCI group (46 with MCIc [converted to AD within 24 months of the time-period] and 36 with MCIs [patients who did not convert to AD within 24 months of the time-period]) (MMSE ≤ 28). The remaining 38 subjects were healthy controls (MMSE ≤ 30).

**Table 2** shows the neuropsychological and demographic information for the 158 subjects. To test for statistically significant differences in demographic and clinical features, Student's t-test was applied to the age data, where the significance level was set to 0.05. No significant differences were found between any groups. In all groups, the number of male subjects was higher than the number of female subjects. Compared to the other groups, the HC group had higher scores on the MMSE. Healthy subjects had significantly lower Geriatric Depression Scale (GDS) scores than the other groups. The Functional Activities Questionnaire (FAQ) score was higher for the AD group than for the other groups.

### MRI and FDG-PET Datasets

### MRI Protocol

Structural MRI scans were acquired from all data centers using Philips, GE, and Siemens scanners. Since the acquisition protocols were different for each scanner, an image normalization step was performed by the ADNI. The imaging sequence was a 3-dimensional sagittal magnetization-prepared rapid gradient-echo (MPRAGE) sequence. This sequence was repeated consecutively to increase the likelihood of obtaining at least one MPRAGE image of decent quality. Image corrections, applied to each image by the ADNI, involved calibration, geometric distortion correction, and reduction of intensity non-uniformity. More details concerning the sMRI images are available on the ADNI homepage (http://adni.loni.usc.edu/methods/mritool/mri-analysis/). We used 1.5-T T1-weighted sMRI images from the ADNI website. Briefly, raw (NIfTI) sMRI scans were downloaded from the ADNI website. All scans had a resolution of 176 × 256 × 256 with 1 mm spacing.

TABLE 2 | Demographical and neuropsychological characteristics of the studied sample.

\**Values are presented as mean* ± *standard deviation (SD).*

### FDG-PET Protocol

The FDG-PET dataset was acquired from the ADNI website. A detailed explanation of the FDG-PET image acquisition is available on the ADNI homepage (http://adni.loni.usc.edu/pet-analysis-method/pet-analysis/). Briefly, FDG-PET images were acquired from 30 to 60 min post-injection. First, images were averaged and then spatially aligned. Next, these images were interpolated to a standard voxel size, and intensity normalization was then performed. Finally, images were smoothed to a common resolution of 8 mm full width at half maximum (FWHM). We first downloaded the FDG-PET images in the Digital Imaging and Communications in Medicine (DICOM) format. In the second step, we used the dcm2nii (Li et al., 2016) converter to convert the DICOM images into the NIfTI format. All scans had a resolution of 160 × 160 × 96 with 1.5 mm spacing.

### CSF and APOE Genotype

### CSF

We downloaded the required CSF biomarker values for each selected MRI and FDG-PET image from the ADNI website. A brief description regarding the collection procedure is available on the ADNI website. As the manual describes, a 20-ml volume was obtained from each subject using a lumbar puncture with a 24 or 25 gauge atraumatic needle around the time of their baseline scans. Subsequently, all samples were stored on dry ice on the same day and later they were sent to the University of Pennsylvania AD Biomarker Fluid Bank Laboratory where the levels of proteins (Aβ42, total tau, and phosphorylated tau) were measured and recorded. In this study, the three protein levels, Aβ42, t-tau, and p-tau181p, were used as features.

### APOE Genotype

The APOE genotype is known to affect the risk of developing sporadic AD in carriers. There are three types of the APOE gene, called alleles: APOE2, APOE3, and APOE4. Everyone has two copies of the gene, and their combination (ε2/ε2, ε2/ε3, ε2/ε4, ε3/ε3, ε3/ε4, or ε4/ε4) determines the APOE genotype score. The APOE (ε2) allele is the rarest form of APOE, and carrying even one copy appears to reduce the risk of developing AD by up to 40%. APOE (ε3) is the most common allele and does not seem to influence risk. The APOE (ε4) allele is present in ∼10–15% of people; having one copy of ε4 (ε3/ε4) can increase the risk of having AD by 2–3 times, while having two copies (ε4/ε4) can increase the risk by 12 times. The APOE genotype of each subject was recorded as a pair of numbers representing which two alleles were present in the blood. The APOE genotype was obtained from a 10-ml blood sample taken at the time of the scan and sent immediately to the University of Pennsylvania AD Biomarker Fluid Bank Laboratory for analysis. The APOE genotype value was available for all subjects for whom we had imaging data.
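To make the "pair of numbers" representation concrete, the sketch below enumerates the six possible genotypes and turns an allele pair into simple numeric features. The encoding is purely illustrative: the text does not specify how the two APOE features are coded, so the indicator names `is_e3e4` and `is_e4e4` and the ε4 allele count are assumptions, as is the function name `apoe_features`.

```python
from itertools import combinations_with_replacement

ALLELES = (2, 3, 4)  # epsilon-2, epsilon-3, epsilon-4

# the six unordered genotypes: (2,2), (2,3), (2,4), (3,3), (3,4), (4,4)
GENOTYPES = list(combinations_with_replacement(ALLELES, 2))


def apoe_features(allele_pair):
    """Encode an allele pair such as (3, 4) as numeric features.

    Hypothetical encoding: two indicators for the e3/e4 and e4/e4
    genotypes plus the number of e4 copies carried.
    """
    a, b = sorted(allele_pair)
    if a not in ALLELES or b not in ALLELES:
        raise ValueError(f"unknown APOE alleles: {allele_pair!r}")
    return {
        "e4_count": int(a == 4) + int(b == 4),
        "is_e3e4": int((a, b) == (3, 4)),
        "is_e4e4": int((a, b) == (4, 4)),
    }
```

Sorting the pair first makes the encoding independent of the order in which the two alleles were recorded.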

### Overview of Proposed Framework

The proposed framework consists of three processing stages: feature extraction and fusion of multiple features into a single form using an early fusion technique, optimal feature subset selection using the truncated SVD dimensionality reduction method, and classification. **Figure 1** illustrates the block diagram of the proposed framework. The set of participants was randomly split into training and testing sets in a 75:25 ratio before being passed to the kernel-based multiclass SVM classifier. During the training stage, gray matter atrophy measures (from sMR images) and the average intensity of each region (from FDG-PET images) were automatically extracted using the NiftyReg toolbox, while the t-tau, p-tau181p, and Aβ42 CSF scores (from the biochemical level) and the (ε3/ε4, ε4/ε4) values of the APOE genotype biomarker were downloaded from the ADNI website. We used the random tree embedding method (Geurts et al., 2006; Moosmann et al., 2008) to transform low-dimensional data into a higher-dimensional state, to make sure that the complementary information found across all modalities is still used when classifying AD subjects. In addition, we used an early fusion technique to combine the different features into one form before passing them to the feature selection process. A feature selection technique using truncated SVD was then employed to select the optimal subset of features from the full set, including the sMRI, FDG-PET, CSF, and APOE features, to train the classifiers to distinguish between AD and HC, MCIc and MCIs, AD and MCIs, AD and MCIc, HC and MCIs, and HC and MCIc groups. In the testing stage, the remaining 25% of the dataset was passed to the kernel-based multiclass SVM classifier to measure the performance of our proposed method.
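Two of the steps above, early fusion and the 75:25 split, are simple enough to sketch without any library. The code below is an illustration under stated assumptions, not the authors' pipeline: early fusion is taken to mean per-subject concatenation of the modality feature vectors, the fused length of 246 + 246 + 3 + 2 = 497 is inferred from the feature counts given in the text, and the random tree embedding, truncated SVD, and SVM stages are not covered. The function names are hypothetical.

```python
import random


def early_fusion(*blocks):
    """Concatenate per-subject feature vectors from several modalities.

    Each block is a list with one feature vector per subject; all blocks
    must list the subjects in the same order.
    """
    n_subjects = len(blocks[0])
    if any(len(block) != n_subjects for block in blocks):
        raise ValueError("all modalities must cover the same subjects")
    return [
        [value for block in blocks for value in block[i]]
        for i in range(n_subjects)
    ]


def split_train_test(n_subjects, test_fraction=0.25, seed=0):
    """Randomly split subject indices into (train, test) index lists."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_subjects * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])
```

For a binary task such as AD vs. HC with 38 + 38 subjects, this split yields 57 training and 19 testing subjects.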

### Image Analysis and Feature Extraction

Image preprocessing was performed for all sMR and FDG-PET images. First, we performed anterior commissure (AC)–posterior commissure (PC) correction for all subjects. Afterward, we applied N4 bias field correction using the ANTs toolbox (Tustison et al., 2010) to correct the intensity inhomogeneity of each image. In our pipeline, skull stripping was not necessary because the images were already preprocessed, which reduced the total number of pre-processing steps required for the original images. The high-dimensional data from the images were then preserved for the feature extraction step. For sMR images, we first aligned them to the MNI152 T1-weighted standard image using the SPM12 toolbox (Ashburner and Friston, 2000) in Matlab 2018b. To parcellate the whole brain into anatomical regions and to quantify the features of each region of interest (ROI) in each sMR image, we used the NiftyReg toolbox (Modat et al., 2010) with the 2-mm Brainnetome atlas template image (Fan et al., 2016), which is already segmented into 246 regions (210 cortical and 36 subcortical). NiftyReg is an open-source automated registration toolkit that performs fast diffeomorphic non-rigid registration. After the registration process, we obtained a subject-labeled image based on the 2-mm Brainnetome atlas template with 246 segmented regions. For each of the 246 ROIs in the labeled sMR images, we computed the volume of gray matter tissue in that ROI and used it as a feature. For the FDG-PET images, the first step was to register each FDG-PET image to its corresponding sMRI T1-weighted image using the reg\_aladin command from the NiftyReg software. Once the FDG-PET images were registered with their respective MR images, we again used the NiftyReg toolbox for non-rigid registration between the processed FDG-PET image and the 2-mm Brainnetome atlas template image. After registration, we obtained 246 segmented regions for each FDG-PET image. We then computed the average intensity of each ROI and used it as a feature for classification. **Figure 2** shows the pipeline for the extraction of 246 regions from the sMRI and FDG-PET images.

Therefore, for each subject, we obtained 246 ROI features from the sMRI image, another 246 features from the FDG-PET image, three features from the CSF biomarkers, and two feature values from the APOE genotype.
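The two ROI-level measurements can be illustrated with a short NumPy sketch. It assumes that the atlas labels, a gray matter map, and the registered PET image are already available as arrays on the same voxel grid; the function name `roi_features`, the interpretation of the gray matter map as a per-voxel tissue fraction, and the toy arrays are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def roi_features(label_img, gm_map, pet_img, voxel_volume_mm3, n_rois=246):
    """Per-ROI gray matter volume and mean PET intensity (illustrative)."""
    gm_volume = np.zeros(n_rois)
    mean_pet = np.zeros(n_rois)
    for roi in range(1, n_rois + 1):  # label 0 is treated as background
        mask = label_img == roi
        if mask.any():
            # Sum of gray matter fractions times the volume of one voxel.
            gm_volume[roi - 1] = gm_map[mask].sum() * voxel_volume_mm3
            # Average tracer intensity inside the region.
            mean_pet[roi - 1] = pet_img[mask].mean()
    return gm_volume, mean_pet

# Toy 2 x 3 "image" with three regions; 2-mm isotropic voxels give 8 mm^3.
labels = np.array([[0, 1, 1], [2, 2, 3]])
gm = np.full(labels.shape, 0.5)
pet = np.arange(6, dtype=float).reshape(2, 3)
volumes, intensities = roi_features(labels, gm, pet, 8.0, n_rois=3)
```

With the 246-region atlas, each call would yield the 246 volume features and 246 intensity features described above.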

### Combining Multimodality of Biomarkers

After assessing the performance of each individual modality, we combined different modalities in order to study possible improvements in classification performance. We used a general framework based on early fusion (straightforward feature concatenation), which concatenates complementary information from different biomarker modalities into a single feature vector; a kernel-based multiclass SVM classifier was then trained on that single feature vector. In this context, various authors have combined sMRI-based features with features calculated from FDG-PET, DTI, and fMRI (Zhang et al., 2011, 2012; Young et al., 2013; Schouten et al., 2016; Bron et al., 2017; Bouts et al., 2018) for early classification of AD subjects. In our case, we combined four biomarker modalities (sMRI, FDG-PET, CSF, and APOE) into one form using the early fusion technique for the early classification of AD and MCI subjects. The APOE and CSF features are of much lower dimension than the features extracted from sMRI and FDG-PET. If a classification algorithm is trained on such combined high- and low-dimensional features, it may produce prediction models that effectively ignore the low-dimensional features. To overcome this problem, we transformed the low-dimensional features into a high-dimensional state using the random tree embedding method, which ensures that the complementary information found across all modalities is still used when classifying the groups. This step is followed for every classification problem. **Figure 3** shows the early fusion pipeline, in which the 1st (APOE), 2nd (CSF), 3rd (sMRI), and 4th (FDG-PET) feature sets are concatenated before being passed on. We assessed the classification performance for individual and combined modalities by calculating the AUC for each group.
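One possible reading of this step is sketched below with Scikit-learn's `RandomTreesEmbedding`. The data are synthetic, and the embedding settings (`n_estimators=10`, `max_depth=3`), the helper name `embed`, and the choice to embed only the APOE and CSF blocks before concatenation are assumptions made for illustration; the text does not give these details.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(0)
n_subjects = 76

# Synthetic stand-ins for the four modalities (shapes follow the text).
apoe = rng.integers(0, 2, size=(n_subjects, 2)).astype(float)
csf = rng.normal(size=(n_subjects, 3))
smri = rng.normal(size=(n_subjects, 246))
pet = rng.normal(size=(n_subjects, 246))

def embed(features, seed):
    """Map low-dimensional features to a sparse one-hot leaf encoding."""
    embedder = RandomTreesEmbedding(
        n_estimators=10, max_depth=3, random_state=seed
    )
    return embedder.fit_transform(features).toarray()

apoe_hd = embed(apoe, seed=0)
csf_hd = embed(csf, seed=1)

# Early fusion: concatenate in the order APOE, CSF, sMRI, FDG-PET.
fused = np.hstack([apoe_hd, csf_hd, smri, pet])
```

Each embedded row contains exactly one active leaf per tree, so the three CSF values become a much wider binary vector before concatenation.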

### Feature Selection

With the help of automated feature extraction methods, we extracted 246 ROIs from each sMRI and FDG-PET image. In neuroimaging analysis, the number of features per subject is very high relative to the number of patients, a phenomenon normally referred to as the curse of dimensionality. Because of the computational difficulties of high-dimensional data, dealing with many features is challenging and may result in overfitting. Feature selection is a helpful additional stage prior to classification, which reduces the dimensionality of the feature set by keeping informative features and omitting uninformative ones. This step speeds up classification by decreasing the computational time for the training and testing datasets and improves classification accuracy. First, we normalized the extracted features using the standard scaler function from the Scikit-learn library (0.19.2) (Pedregosa et al., 2011), which transforms the dataset so that each feature has a mean of 0 and a variance of 1, to reduce the redundancy and dependency of the data. After that, we performed a high-dimensional transformation of the data using random tree embedding (Geurts et al., 2006; Moosmann et al., 2008) from the Scikit-learn library (0.19.2) (Pedregosa et al., 2011), followed by dimensionality reduction using the truncated singular value decomposition (SVD) method. Random tree embedding is based on decision tree ensemble learning (Brown, 2016) and performs an unsupervised transformation of the data. It uses a forest of completely random trees that encodes the data by the indices of the leaves in which a data sample ends up. This index is then encoded in a one-of-k manner, which maps the data into a very high-dimensional state that might be beneficial for the classification process. The mapping process is completely unsupervised and very efficient.

non-linear registration between the (sMRI and FDG-PET) image with 2 mm Brainnetome template image. The above pipeline shows that we have successfully extracted 246 ROIs from each sMRI and FDG-PET image.

After mapping the dataset into the very high-dimensional state, we applied the truncated SVD function for dimensionality reduction, which keeps only the most important components of the complete feature set. Truncated SVD is similar to principal component analysis (PCA) but differs in that it works on the sample matrix X directly instead of on its covariance matrix. When the column-wise (per-feature) means of X are subtracted from the feature values, truncated SVD on the resulting matrix is equivalent to PCA. Truncated SVD implements a variant of SVD that only calculates the k largest singular values, where k is a user-specified parameter. Mathematically, the truncated SVD can be applied to the training data X, which produces a low-rank approximation of X:

$$X \approx X_k = U_k \Sigma_k V_k^T \tag{1}$$

After this process, U<sub>k</sub>Σ<sub>k</sub><sup>T</sup> is the transformed training set with k features. To transform a test set X, we multiply it by V<sub>k</sub>:

$$X' = X V_k \tag{2}$$

In this way, we can perform the truncated SVD method on the training and testing dataset.
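Equations (1) and (2) can be checked numerically with NumPy. The sketch below uses random matrices and an arbitrary k = 5 (both assumptions), and it computes the full SVD and then truncates it, which is a simple stand-in for the dedicated truncated SVD routine used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(57, 40))  # hypothetical training matrix
X_test = rng.normal(size=(19, 40))   # hypothetical test matrix
k = 5                                # user-specified number of components

# Full SVD, then keep the k largest singular values and vectors.
U, s, Vt = np.linalg.svd(X_train, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T

X_k = U_k @ S_k @ V_k.T        # rank-k approximation, Equation (1)
train_reduced = U_k @ S_k      # training set with k features
test_reduced = X_test @ V_k    # Equation (2) applied to the test set
```

Because the columns of V<sub>k</sub> are orthonormal, projecting the training matrix onto V<sub>k</sub> gives the same result as U<sub>k</sub>Σ<sub>k</sub>, which is why the same projection can be reused for unseen test data.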

### Classification

### Support Vector Machine

SVM (Cortes and Vapnik, 1995) is a supervised learning method that works by finding the hyperplane that best separates two groups of data. It is trained on training data in an n-dimensional feature space, after which test subjects are classified according to their position in that space. It has been used in several neuroimaging areas (Cui et al., 2011; Zhang et al., 2011; Young et al., 2013; Collij et al., 2016) and is regarded as one of the most powerful machine learning tools in recent neuroscience research. In a 2D space, a line can discriminate linearly separable data. The equation of a line is y = ax + b. By renaming x as x<sub>1</sub> and y as x<sub>2</sub>, the equation becomes ax<sub>1</sub> − x<sub>2</sub> + b = 0. If we define x = (x<sub>1</sub>, x<sub>2</sub>) and w = (a, −1), we get w·x + b = 0, which is the equation of a hyperplane. The linearly separable output with the hyperplane equation has the following form:

$$f\left(\mathbf{y}\right) = \mathbf{z}^T \phi(\mathbf{y}) + b \tag{3}$$

Where y is an input vector, z<sup>T</sup> is a hyperplane parameter, and ϕ(y) is a function used to map the feature vector y into a higher-dimensional space. If the parameters z and b are scaled by the same quantity, the decision hyperplane given by Equation (3) remains unchanged. Therefore, in order to make any decision surface (hyperplane) correspond to a unique pair (z, b), the following constraint is introduced:

$$\min_{i} \left| \mathbf{z}^T \phi(\mathbf{y}_i) + b \right| = 1, \qquad i = 1, \ldots, N \tag{4}$$

Where y<sub>1</sub>, y<sub>2</sub>, y<sub>3</sub>, ..., y<sub>N</sub> are the given training points. The hyperplanes satisfying Equation (4) are known as canonical hyperplanes. A given hyperplane (or decision surface) is described by the equation:

$$\mathbf{z}^T \phi(\mathbf{y}) + b = 0, \text{ which is the same as } \mathbf{z}^T \phi(\mathbf{y}) = 0 \text{ (which has more dimensions)} \tag{5}$$

For a vector x that does not belong to the hyperplane, the following equation is satisfied (Cortes and Vapnik, 1995; Madevska-Bogdanova et al., 2004; Cui et al., 2011):

$$\mathbf{z}^T \phi(\mathbf{x}) + b = \pm s \left\| \mathbf{z} \right\| \tag{6}$$

Where s is the distance of the point x from the given hyperplane, and the sign indicates on which side of the hyperplane the vector x lies. Therefore, the output f(y) (or predictive value) of the SVM is directly proportional to the norm of the vector z and to the distance s(x) from the chosen hyperplane. In our study, we used a kernel support vector machine, which solves non-linear problems with a linear classifier by transforming linearly non-separable data into linearly separable data. The idea behind this concept is that data that are not linearly separable in n-dimensional space might be linearly separable in a higher m-dimensional space. Mathematically, the kernel is written as,

$$K\left(a,b\right) = \left\langle F\left(a\right), F\left(b\right) \right\rangle \tag{7}$$

Where K is a kernel function and a, b are inputs in n-dimensional space. F is a mapping function from n-dimensional to m-dimensional space (i.e., m > n). In our case, we used three different kinds of kernel function, defined as follows:

• Polynomial type: It represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing the learning of non-linear models. A polynomial kernel is defined as:

$$K\left(\mathbf{x},\mathbf{y}\right) = \left(\mathbf{x} \cdot \mathbf{y}\right)^{d} \tag{8}$$

Where x and y are vectors in the input space. d is the kernel parameter.

• Gaussian radial basis type: Radial basis functions most commonly take a Gaussian form, represented by:

$$K\left(\mathbf{x},\mathbf{y}\right) = \exp\left(-\frac{\left\|\mathbf{x}-\mathbf{y}\right\|^2}{2\sigma^2}\right)\tag{9}$$

Where x and y are the two input samples, represented as feature vectors in the input space. ‖x − y‖<sup>2</sup> may be seen as the squared Euclidean distance between the two feature vectors. σ is a kernel parameter.

• Sigmoid type: It comes from the neural networks field, where the bipolar sigmoid function is often used as an activation function for an artificial neuron. And, it is represented by;

$$K\left(\mathbf{x},\mathbf{y}\right) = \tanh\left(\alpha\,\mathbf{x}^T\mathbf{y} + c\right)\tag{10}$$

Where x and y are vectors in the input space, and α and c are the kernel parameters.
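The three kernels in Equations (8)–(10) can be written directly in plain Python, as in the sketch below. The function names and the example vectors are illustrative choices. The helper `degree2_feature_map` is a hypothetical explicit mapping F for two-dimensional inputs, included only to show Equation (7) in action: for d = 2 the polynomial kernel equals an inner product in a three-dimensional space.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d):
    """Equation (8): (x . y)^d."""
    return dot(x, y) ** d

def rbf_kernel(x, y, sigma):
    """Equation (9): exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, alpha, c):
    """Equation (10): tanh(alpha * x^T y + c)."""
    return math.tanh(alpha * dot(x, y) + c)

def degree2_feature_map(x):
    """Explicit map F with <F(x), F(y)> = (x . y)^2 for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
k_poly = polynomial_kernel(x, y, d=2)
k_explicit = dot(degree2_feature_map(x), degree2_feature_map(y))
k_rbf = rbf_kernel(x, y, sigma=1.0)
k_sig = sigmoid_kernel(x, y, alpha=0.5, c=0.0)
```

Note that Scikit-learn parameterizes the Gaussian kernel as exp(−γ‖x − y‖²), so γ corresponds to 1/(2σ²) in Equation (9).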

For our study, we used kernel-based multiclass SVMs from the Scikit-learn 0.19.2 library (Pedregosa et al., 2011), which internally uses LIBSVM (Chang and Lin, 2011) to handle all computations. The hyperparameters of a kernel-based SVM must be tuned, because they directly control the behavior of the training algorithm and have a significant impact on the performance of the trained model. Two hyperparameters were optimized. The regularization constant C controls how strongly misclassified training samples are penalized: for high C values, the hyperplane classifies the training data more accurately, whereas for low C values the optimizer looks for a larger-margin separating hyperplane at the cost of misclassifying more data points. The Gaussian kernel width γ describes the reach of a single training sample: for high γ values, only samples close to the separation line are taken into account, whereas for low γ values samples far from the separation line are also considered when computing it. Both parameters were optimized using a grid search with ten-fold stratified cross-validation (CV) on the training dataset. This validation technique helps to ensure that the trained model captures most of the patterns in the training dataset. CV works by randomly dividing the training dataset into 10 parts, one of which is held out as a validation set while the remaining nine are used as a training set. In this study, ten-fold stratified cross-validation was performed 100 times to obtain more accurate results, and we computed the arithmetic mean of the 100 repetitions as the final result. Note that, because different features have different scales, we linearly scaled each training feature to a range between 0 and 1; the same scaling was then applied to the test dataset. As the number of selected features is small, the RBF kernel performed better than the other kernels in our case.
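A tuning loop of this kind might look as follows in Scikit-learn. This is a sketch on synthetic data rather than the authors' script: the grid values, the seed, and `n_repeats=2` (instead of the 100 repetitions described above, to keep the example fast) are assumptions. Placing `MinMaxScaler` inside the pipeline means the 0 to 1 scaling is learned from the training folds only and then reused on held-out data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class training data (hypothetical stand-in).
y_train = np.array([0] * 30 + [1] * 30)
X_train = rng.normal(size=(60, 8)) + y_train[:, None] * 1.5

# Scale each feature to [0, 1], then fit an RBF-kernel SVM.
pipeline = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Assumed grid for illustration; the study's exact grid is not listed here.
param_grid = {
    "svc__C": [1, 3, 5, 7, 9],
    "svc__gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1.0],
}

# Ten-fold stratified CV, repeated (2 repeats here for speed).
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```

After fitting, `search.best_estimator_` holds the pipeline refitted on the full training set with the selected C and γ, ready to be evaluated once on the held-out test set.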

### Measuring the Classification Performance

To assess the classification performance of each group, we applied two methods: (i) ROC-AUC analysis and (ii) statistical analysis using Cohen's kappa.

#### ROC-AUC Analysis

The ROC curve is a fundamental graph in the evaluation of diagnostic tests and is often used in biomedical research to assess the performance of classification and prediction models for decision support, prognosis, and diagnosis. ROC analysis examines how accurately a proposed model separates positive and negative cases, here distinguishing AD patients from the other groups. It is particularly useful for assessing predictive models because it records the trade-off between specificity and sensitivity over the whole range of decision thresholds. In a ROC curve, the true positive rate (the sensitivity) is plotted as a function of the false positive rate (1 − specificity) for different cut-off values. Each point on the ROC curve represents a sensitivity/specificity pair that corresponds to a specific decision threshold. The curve is generally drawn in a square box for convenience, with both axes running from 0 to 1. The area under the curve (AUC) is an effective joint measure of sensitivity and specificity for assessing the inherent validity of a diagnostic test. The AUC shows how well a factor can differentiate between two diagnostic groups (diseased/normal). A result with perfect discrimination has a ROC curve passing through 100% sensitivity and 100% specificity. Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test, as suggested by Greiner et al. (2000). The AUC is commonly used to summarize the performance of binary classifiers, that is, classifiers with two possible output classes. An AUC of one signifies a perfect score, and an AUC of 0.5 represents a meaningless test.
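The construction of the ROC curve and its area can be made concrete with a small plain-Python sketch. The labels and scores below are hypothetical, and the helper names `roc_points` and `auc` are chosen for this illustration; the area is computed with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Sweep the decision threshold and collect (FPR, TPR) pairs."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for t in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        tpr.append(tp / n_pos)  # sensitivity
        fpr.append(fp / n_neg)  # 1 - specificity
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )

# Hypothetical labels (1 = diseased) and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
area = auc(fpr, tpr)

# A classifier that ranks every positive above every negative.
perfect_area = auc(*roc_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

In the first example one negative subject is scored above one positive subject, which lowers the area to 0.75, whereas the perfectly ranked example reaches an area of 1.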

The AUC plot provides two parameters:


Moreover, classification accuracy measures the effectiveness of predicting the true class label, but in our case, it should be noted that the number of subjects was not the same in each group, so only calculating accuracy may result in a misleading estimation of the performance. Therefore, four more performance metrics have been calculated, namely specificity, sensitivity, precision, and F1-score. We have reported the accuracy, specificity, sensitivity, precision, and F1-score values corresponding to the ideal point of the ROC curve.

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \tag{11}$$

$$F1\text{-}score = 2 \times \frac{precision \times recall}{precision + recall} \tag{12}$$

where,

$$Precision = \frac{TP}{TP + FP}; \quad Recall = Sensitivity = \frac{TP}{TP + FN} \tag{13}$$

TABLE 3 | Obtained best CV score for six classification groups.


With TP, FP, TN, and FN denoting true positive, false positive, true negative, and false negative, respectively. Specificity (the true negative rate) is a measure for those not in the class, i.e., it is the percentage of those not in the class that were found not to be in the class. Precision [also termed positive predictive value (PPV)] is the fraction of relevant instances among the retrieved instances, and the F1-score (also called F-score or F-measure) is a measure of a test's accuracy that combines precision and recall. In our case, we repeated the procedure 100 times; the reported AUC-ROC, accuracy, sensitivity, specificity, precision, and F1-score are the averages over the 100 repetitions of the 10-fold stratified cross-validation procedure. We followed this method for every classification group.
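The five measures follow directly from the four confusion-matrix counts, as in the plain-Python sketch below. The counts are hypothetical and the function name `binary_metrics` is chosen for illustration; specificity is computed as TN / (TN + FP), the standard definition of the true negative rate.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, precision, and F1-score."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # Equation (11)
    sensitivity = tp / (tp + fn)                # recall, Equation (13)
    specificity = tn / (tn + fp)                # true negative rate
    precision = tp / (tp + fp)                  # Equation (13)
    # Equation (12): harmonic mean of precision and recall.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts for a 20-subject test set.
m = binary_metrics(tp=9, fp=1, tn=8, fn=2)
```

With unequal group sizes, sensitivity and specificity can differ noticeably even when accuracy looks high, which is why all five values are reported together.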

#### Statistical Analysis Using Cohen's Kappa Method

Cohen's kappa statistic was computed for each classification problem. Cohen's kappa expresses the level of agreement between two annotators, or between two sets of labels in a binary classification problem, and is defined as,

$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{14}$$

where p<sub>o</sub> is the empirical probability of agreement on the label assigned to any example (the observed agreement ratio), and p<sub>e</sub> is the expected agreement when both annotators assign labels randomly. Here, p<sub>e</sub> was estimated using a per-annotator empirical prior over the class labels. The kappa statistic always lies between −1 and 1. The maximum value means complete agreement between the two sets of labels, while a value of zero or lower means agreement no better than chance.
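Equation (14) can be computed in a few lines of plain Python. The sketch below treats the true labels and the predicted labels as the two annotators; the label lists are hypothetical and the function name `cohen_kappa` is chosen for this illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    # Observed agreement ratio p_o.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement p_e from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = set(freq_a) | set(freq_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical true and predicted labels for ten subjects.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(y_true, y_pred)
```

Here 8 of 10 labels agree (p<sub>o</sub> = 0.8) and chance agreement is p<sub>e</sub> = 0.5, so kappa is 0.6, noticeably lower than the raw agreement of 0.8.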

### RESULTS

All classification experiments were performed on Ubuntu 16.04 LTS, running Python 3.6 and the Scikit-learn library version 0.19.2. In this study, there were four classes of data (AD, MCIc, MCIs, and HC), characterized using four different types of biomarker: sMRI and FDG-PET as imaging modalities, CSF as a biochemical (fluid) biomarker reflecting the formation of amyloid plaques inside the brain, and APOE genotype as a genetic feature. We validated our proposed method on six binary classification problems (AD vs. HC, MCIc vs. MCIs, AD vs. MCIc, HC vs. MCIs, HC vs. MCIc, and AD vs. MCIs). First, we extracted the features from each sMRI and FDG-PET image by using the NiftyReg registration process with the 2-mm Brainnetome atlas template image. In total, we obtained 497 features per subject: 246 ROI-based features from each of the sMRI and FDG-PET images, three feature values from the CSF data, and two features from the APOE genotype data. We applied the random tree embedding method, which transformed the low-dimensional features into a higher-dimensional state, and then applied an early fusion technique to combine the multiple features into a single form before further processing. Additionally, we applied a feature selection technique using the truncated SVD dimensionality reduction method, which selects the effective features from all 497 features and sends them to the classifier to measure the performance of identifying each group. We used a kernel-based multiclass SVM from the Scikit-learn library (0.19.2) as the classifier.

In order to attain unbiased estimates of performance, the set of participants was randomly split into two groups in a 75:25 ratio as training and testing sets, respectively.

Finding suitable values for the hyperparameters (C and γ) on the training set is difficult, and their values influence the classification result. The parameter C trades off the misclassification of training samples against the simplicity of the decision surface: a small C value makes the decision surface smooth, while a high C value aims to classify all training samples correctly. The γ value determines how much influence a single training sample has: the larger γ is, the closer other samples must be to be affected. The best value of a hyperparameter for a given problem cannot be known in advance, and well-chosen values reduce the risk of overfitting and underfitting. Therefore, to find the optimal hyperparameter values for the kernel-based SVM, we used a grid search (an exhaustive search over the specified parameter values for an estimator) with ten-fold stratified cross-validation on the training set. The grid search was performed over the ranges C = 1 to 9 and γ = 1e-4 to 1. For each method, the optimized hyperparameter values were used to train the classifier on the training set, and the performance of the resulting classifier was then evaluated on the remaining 25% of the data in the testing dataset, which was not used during training. The optimized hyperparameter values (C and γ) and their best CV accuracy are shown in **Table 3**. **Figure 4** plots the classifier's CV accuracy with respect to C and γ for the AD vs. HC and MCIc vs. MCIs groups, showing the impact of different C and γ values on the model. The best hyperparameter combination found for AD vs. HC was C = 7, γ = 0.00316227766017, and for MCIs vs. MCIc it was C = 9, γ = 0.001; these values were chosen automatically from the given ranges by the grid search with ten-fold CV. In this way, we obtained unbiased estimates of the performance for each classification problem. In our experiment, the number of subjects was not the same in each group, so accuracy alone does not allow the performance of the different classification experiments to be compared. We therefore considered five measures: for each group, we calculated the accuracy, sensitivity, specificity, precision, and F1-score. **Table 4** shows the classification results for AD vs. HC, MCIc vs. MCIs, AD vs. MCIs, AD vs. MCIc, HC vs. MCIc, and HC vs. MCIs.
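The reported γ = 0.00316227766017 equals 10<sup>−2.5</sup>, which would be consistent with a logarithmically spaced γ grid. The sketch below builds one such grid in plain Python and checks that both reported γ values lie on it. The number of grid points (nine, in half-decade steps) and the integer C values are assumptions inferred from the reported optima, not settings stated in the text.

```python
import math

# Assumed grid: nine log-spaced gamma values from 1e-4 to 1 (half-decade
# steps) and integer C values from 1 to 9. These are illustrative guesses.
gamma_grid = [10.0 ** (-4.0 + 0.5 * i) for i in range(9)]
c_grid = list(range(1, 10))

# Gamma values reported as optimal for AD vs. HC and MCIs vs. MCIc.
reported_gammas = [0.00316227766017, 0.001]
on_grid = [
    any(math.isclose(g, candidate, rel_tol=1e-9) for candidate in gamma_grid)
    for g in reported_gammas
]
n_combinations = len(gamma_grid) * len(c_grid)
```

Under these assumptions the search would evaluate 81 (C, γ) pairs, each scored by ten-fold stratified cross-validation.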

We conducted the AD vs. HC experiment using the extracted APOE, CSF, FDG-PET, and sMRI features, and the classification outcome is shown in **Table 4**. For AD vs. HC classification, we had 38 AD and 38 HC subjects, and sMRI performed better than the other individual biomarker modalities. The early fusion technique that we used to combine features from different modalities resulted in an AUC of 98.33%, an accuracy of 98.42%, a sensitivity of 100%, a specificity of 96.47%, a precision of 97.89%, and an F1-score of 98.42%. Furthermore, Cohen's kappa was 0.93 for the combined method, which is very close to 1.

For the MCIs vs. MCIc classification problem, 82 subjects were included: 46 in the MCIc group and the remaining 36 in the MCIs group. **Table 4** shows the computed performance measures for this classification problem. Compared to the other classification problems, MCIs vs. MCIc is difficult because both groups show similar brain structure, with only slight differences. For this group, the APOE genotype performed better than the other individual biomarker modalities. Our proposed method performed even better than the best individual biomarker for this group, achieving an AUC of 93.59%, with 94.86% accuracy, 100% sensitivity, 88.71% specificity, 89.62% precision, and an F1-score of 91.67%. For MCIs vs. MCIc, Cohen's kappa was 0.86, which is better than those of the single modalities. Our proposed method therefore performed very well when classifying this group.

For the AD vs. MCIs group, there were 38 AD and 36 MCIs subjects. First, we extracted the features from each subject, and then we combined both imaging (PET and MRI) feature sets with the other two (CSF and APOE genotype) feature sets to measure the performance of AD vs. MCIs classification. **Table 4** shows the results of passing the obtained features to the kernel-based multiclass SVM classifier. As can be seen from **Table 4**, combining all four biomarker modalities to distinguish between AD and MCIs achieved good results compared to single-modality biomarkers. For this classification problem, our proposed method achieved 96.65% accuracy with a Cohen's kappa of 0.91.

For the AD vs. MCIc group, there were 38 AD and 46 MCIc subjects. We trained kernel-based multiclass SVM classifiers using the dimensionality-reduced features from truncated SVD to measure the performance for the AD vs. MCIc group. The best performance was attained using a combination of the four feature modalities, i.e., sMRI, FDG-PET, APOE, and CSF, which had an accuracy of 92.26%, a sensitivity of 91.67%, a specificity of 92.86%, and an AUC of 94.64%, with a Cohen's kappa of 0.84.

For the HC vs. MCIc distinction, our proposed method achieved 96.43% AUC, 94.13% accuracy, 94.75% sensitivity, 100% specificity and precision, and an F1-score of 96.72%. **Table 4** shows the classification performance for HC vs. MCIc. In this case, the obtained Cohen's kappa was 0.88, which is near the maximum agreement value of 1.

For the HC vs. MCIs classification problem, 74 subjects were included: 36 in the MCIs group and the remaining 38 in the HC group. **Table 4** shows the results of passing the obtained features to the kernel-based multiclass SVM classifier. As can be seen from **Table 4**, combining all four biomarker modalities to distinguish between HC and MCIs achieved good results compared to single-modality biomarkers. For this classification problem, our proposed method achieved an AUC of 95.24% and an accuracy of 95.65%, with a Cohen's kappa of 0.90.

Therefore, for all classification groups, our proposed method outperformed the single-modality biomarkers, with improvements ranging from 1 to 5%, and it also achieved a higher level of agreement (Cohen's kappa) than the single-modality methods for all six classification groups. For the AD vs. MCIs, AD vs. MCIc, HC vs. MCIs, and HC vs. MCIc groups, the CSF biomarkers performed very well compared to the other individual biomarker modalities, and the AUCs achieved with CSF for these groups were 94.17, 89.58, 94.05, and 92.5%.

TABLE 4 | Classification results for AD vs. HC, MCIs vs. MCIc, AD vs. MCIs, AD vs. MCIc, HC vs. MCIc, and HC vs. MCIs groups.


**Figure 5** shows the Cohen's kappa scores for the six classification problems, AD vs. HC, MCIs vs. MCIc, AD vs. MCIs, AD vs. MCIc, HC vs. MCIs, and HC vs. MCIc. From this graph, we can see that our proposed method achieved a good level of agreement for the different classification groups.

**Figure 6** shows the ROC curves for AD vs. HC, MCIs vs. MCIc, AD vs. MCIs, AD vs. MCIc, HC vs. MCIs, and HC vs. MCIc. The total area under the ROC curve is a single index for measuring the performance of a test. The larger the AUC, the better the overall performance of the diagnostic test in correctly identifying diseased and non-diseased subjects. For AD vs. HC, our proposed model achieved 98.33% AUC, showing that it performed well when distinguishing positive from negative cases. For MCIs vs. MCIc, our proposed model correctly distinguished converted patients from stable patients with an AUC of 93.59%, which is a strong result for this difficult group. Likewise, for AD vs. MCIs, AD vs. MCIc, HC vs. MCIs, and HC vs. MCIc, our proposed model achieved AUCs of 96.83, 94.64, 95.24, and 96.43%. Overall, for all classification problems, our proposed model performed well, and its probabilities for the positive classes are well separated from those of the negative classes.

### DISCUSSION

In this experiment, we proposed a novel technique to fuse data from multiple modalities for the classification of AD from different groups, using a kernel-based multiclass SVM method. Earlier studies aimed only at the AD vs. HC classification. In this paper, we studied six binary classification problems: AD vs. HC, MCIs vs. MCIc, AD vs. MCIs, AD vs. MCIc, HC vs. MCIs, and HC vs. MCIc. More importantly, we combined not only sMRI and FDG-PET images but also the CSF (biochemical) and APOE genotype (genetic) values. Our experimental results show that each modality (sMRI, FDG-PET, CSF, and APOE) is indispensable in achieving a good combination and good classification accuracy.

Some studies (Zhang et al., 2011, 2012; Young et al., 2013) have used a small number of features extracted from automatic or manual segmentation processes for the classification of AD from different groups. Their proposed models achieved good performance for AD vs. HC; however, for MCIc vs. MCIs, their performance was poor.

FIGURE 5 | Cohen's kappa score for AD vs. HC, MCIs vs. MCIc, AD vs. MCIc, AD vs. MCIs, HC vs. MCIs, and HC vs. MCIc (the kappa value obtained in each experiment is shown by a different solid colored line). This Cohen's kappa plot shows that combined features outperform the single modality features in all experiments.

Therefore, in our study, we tried to extract as many ROIs as possible from the two imaging modalities using the 2-mm Brainnetome template image. To the best of our knowledge, this is the first experiment in which 246 ROIs were extracted from all 158 subjects and all features were used in the classification of AD and MCI subjects.

FIGURE 6 | ROC curves for each experiment. The curve corresponding to the best performance of the combined fusion method is shown by the green dashed line and is compared with the curves of the single-modality features. The combined features outperform the single-modality features in all experiments: (A) AD vs. HC, (B) MCIs vs. MCIc, (C) AD vs. MCIs, (D) AD vs. MCIc, (E) HC vs. MCIc, and (F) HC vs. MCIs.

Furthermore, we fused the features from these two imaging modalities with three CSF and two APOE genotype features offered by the ADNI website, using an early fusion technique, to distinguish AD from the other groups. Moreover, compared with other studies (Walhovd et al., 2010; Davatzikos et al., 2011; Zhang et al., 2011; Beheshti et al., 2017; Li et al., 2017; Long et al., 2017), we use a more advanced segmented template image for feature extraction from both imaging modalities with the NiftyReg registration toolbox. As can be seen from **Table 4**, single-modality biomarkers (sMRI and APOE genotype) achieved good performance for the AD vs. HC and MCIs vs. MCIc groups (using all 246 extracted features as well as the two APOE genotype features from each subject) when compared with previously reported results (Zhang et al., 2011; Young et al., 2013). Likewise, **Table 4** shows that the CSF biomarkers outperformed the other individual biomarkers, with AUCs of 94.17, 89.58, 94.05, and 92.5% for AD vs. MCIs, AD vs. MCIc, HC vs. MCIc, and HC vs. MCIs, respectively. Moreover, many studies have shown that different modalities of biomarkers contain complementary information for the discrimination of AD and MCI subjects.

Here, we quantitatively measure the discrimination agreement between any two different classification groups using the kappa index. For the combined features, the obtained levels of agreement for AD vs. HC, AD vs. MCIs, and AD vs. MCIc are 0.93, 0.91, and 0.84, respectively. Likewise, for HC vs. MCIs and HC vs. MCIc, the obtained levels of agreement are 0.90 and 0.88, and for MCIs vs. MCIc the level of agreement is 0.86. All of these scores were achieved using a 10-fold stratified CV method on the combined dataset. These results indicate that the combined features for AD vs. HC yield the highest level of agreement, compared both with the other groups and with the individual performance of each modality.
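As an illustration of how a kappa index of this kind is obtained, the following minimal sketch computes Cohen's kappa from a confusion matrix. The function name `cohens_kappa` and the example counts are hypothetical and are not taken from the study.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(row_tot[i] * col_tot[i] for i in range(n)) / (total * total)
    # Undefined (division by zero) if chance agreement equals 1.
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class confusion matrix (illustrative counts only).
example = [[45, 5],
           [5, 45]]
kappa = cohens_kappa(example)
print(round(kappa, 2))
```

Here the observed agreement is 0.9 and the chance agreement is 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.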

Recently, many studies have been published using a single modality of biomarkers (Chetelat et al., 2007; Fjell et al., 2010; Chen and Ishwaran, 2012; Beheshti et al., 2016; Jha et al., 2017; Lama et al., 2017; Long et al., 2017), including sMRI, FDG-PET, CSF, and APOE. Most of these studies used biomarkers from sMRI, because it is practically difficult to obtain biomarkers from all modalities for the same patients for various reasons, including the availability of imaging equipment, cost, lack of patient consent, and patient death in longitudinal studies. Previously proposed models using a single modality have achieved good performance for AD vs. HC classification, whereas for MCIs vs. MCIc their classification accuracy is very low compared to our proposed multimodal technique. Here, we have performed an experiment to assess the classification performance using features from every single modality independently, as well as with the combination of multimodal biomarkers. A kernel-based multiclass SVM classifier was utilized, and the comparison of the obtained single-modality results with the multimodal classification results is shown in **Figure 7**. In terms of accuracy and AUC, the classification performance using features from CSF is generally better than that using genetic and imaging features, which highlights the importance of Aβ plaques as biomarkers in the classification of AD, although its performance is slightly lower than that obtained with multimodal biomarkers. In addition, we can see that for the MCIs vs. MCIc comparison, each modality of biomarker has performed well.

Different methods were used to evaluate the classification of AD using multimodal data. First, we combine all high-dimensional features from the four modalities into a single feature vector for classification of AD and MCI subjects. After that, all features were normalized (to have zero mean and unit standard deviation) before being used in the classification process. This combined multimodal method provides a straightforward way of using multimodal data. Subsequently, we passed these features to the kernel-based multiclass SVM classifier with a 10-fold stratified CV strategy as described above; the obtained results are shown in **Table 4** and **Figure 7**. As we can see in **Table 4**, our early fusion combination method consistently outperforms the individual biomarker modalities.
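The early fusion and normalization steps described above can be sketched as follows. This is a minimal illustration with toy values, not the authors' implementation; the helper names `early_fusion` and `zscore_columns` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def early_fusion(*modalities):
    """Concatenate the per-subject feature lists of several modalities."""
    n_subjects = len(modalities[0])
    assert all(len(m) == n_subjects for m in modalities)
    return [sum((list(m[s]) for m in modalities), []) for s in range(n_subjects)]


def zscore_columns(rows):
    """Scale every feature (column) to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    # Assumption: population standard deviation; constant columns are left unscaled.
    stds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in rows]


# Toy data: 3 subjects, 2 imaging features and 1 CSF feature (illustrative values).
imaging = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
csf = [[10.0], [20.0], [30.0]]

fused = early_fusion(imaging, csf)
normalized = zscore_columns(fused)
```

In a cross-validated setting, the mean and standard deviation would typically be estimated on the training folds only and then applied to the held-out fold.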

Recently, several studies have investigated neuroimaging techniques for the early detection of AD, with the main focus on MCI subjects, who may or may not convert to AD, and separating patients with AD from healthy controls using multimodal data. However, it is difficult to make direct comparisons with these state-of-the-art methods since a majority of the studies have used different validation methods and datasets, which both crucially influence the classification problem. The first study by Zhang et al. (2012) obtained an accuracy of 76.8% (sensitivity and specificity of 79 and 68%) for the classification of converters and stable MCI subjects within 24 months.

These results were achieved using a multi-kernel SVM on a longitudinal ADNI dataset. Another study (Young et al., 2013) used a Gaussian process method for classification of MCIs vs. MCIc using several modalities. They reported an accuracy of 69.9% and AUC of 79.5%. Another study by Suk et al. (2014) used shared features from two imaging modalities, MRI and PET, using a combination of hierarchical and deep Boltzmann machines for a deep learning process; their proposed method achieved 74.66% accuracy and 95.23% AUC when comparing MCI-C vs. MCI-NC. In another study (Cheng et al., 2015), the authors introduced domain transfer learning using multimodal data (i.e., MRI, CSF, and PET) with an accuracy of 79.4% for MCIs vs. MCIc with an AUC of 84.8%. In another study (Moradi et al., 2015), the authors employed a VBM analysis of gray matter as a feature, combining age and cognitive measures. They reported an AUC of 90.20% with 81.72% accuracy when comparing MCIc vs. MCIs samples. Another study (Beheshti et al., 2017) used feature ranking and a genetic algorithm (GA) for selection of optimal features for the classifier. Their method achieved an accuracy of 75%, sensitivity of 76.92%, specificity of 73.23%, and AUC of 75.08% for pMCI vs. sMCI. Liu et al. (2017) proposed combining two imaging modalities using independent component analysis and the Cox model for prediction of MCI progression. They achieved 80.8% AUC with 73.5% accuracy in comparisons of MCIc vs. MCIs. Recently, Long et al. (2017) used the FreeSurfer software to segment 3-T T1 images into many different parts and later used a multi-dimensional scaling method for feature selection before sending the selected features to the classifier. Their proposed method achieved an AUC of 93.2%, accuracy of 88.88%, sensitivity of 86.32%, and specificity of 90.91% when differentiating sMCI from pMCI using only specific amygdala features. As shown in **Table 5**, the performance of the proposed system was highly competitive when compared to the other systems reported in the literature for MCIs vs. MCIc classification.

TABLE 5 | Classification performance for the proposed method compared with published state-of-the-art methods for differentiating between MCIs vs. MCIc.

### CONCLUSION

In this study, we have proposed a novel method that shows how to extract 246 ROIs from two imaging modalities, PET and sMRI, using a Brainnetome template image, and then combine these imaging features with CSF and APOE genotype features from the same subjects. In the proposed method, we used a random tree embedding method to transform the obtained features to a higher-dimensional state, and later we used a truncated SVD dimensionality reduction method to select only the important features, which increased the classification accuracy using a kernel-based multiclass SVM classifier. The obtained experimental results show that a combination of biomarkers from all four modalities is a reliable technique for the early prediction of AD or prediction of MCI conversion, especially with regard to high-dimensional data pattern recognition. In addition, our proposed method achieved 94.86% accuracy with 93.59% AUC and a Cohen's kappa index of 0.86 when distinguishing between MCIs vs. MCIc subjects. The performance of the proposed computer-aided system was measured using 158 subjects from the ADNI dataset with a 10-fold stratified cross-validation technique. The experimental results show that the performance of the proposed approach can compete strongly with other state-of-the-art techniques using biomarkers from all four modalities mentioned in the literature.

In the future, we plan to add demographic information of the studied subjects as features to the proposed model for the classification of AD, and we will also investigate the multimodal multiclass classification of AD using AV-45 and DTI biomarkers.

### DATA AVAILABILITY STATEMENT

The datasets used in this study were acquired from the ADNI homepage; they are freely available to all researchers and scientists for experiments on Alzheimer's disease and can be downloaded from the ADNI website: http://adni.loni.usc.edu/about/contact-us/.

### AUTHOR CONTRIBUTIONS

YG and RL designed the study, collected the original imaging data from the ADNI home page, and wrote the manuscript. RL and G-RK managed and analyzed the imaging data. All authors contributed to and have approved the final manuscript.

### FUNDING

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1A4A1029769, NRF-2019R1F1A1060166).

### ACKNOWLEDGMENTS

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how\_to\_apply/ADNI\_Acknowledgement\_List.pdf. ADNI was funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie; Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

### REFERENCES


to Clinical Applications, eds H.-J. Gertz, and Th. Arendt (Vienna: Springer Vienna), 97–106. doi: 10.1007/978-3-7091-7508-8\_9


diagnostic discrimination and cognitive correlations. Neurology 73, 287–293. doi: 10.1212/WNL.0b013e3181af79e5


**Conflict of Interest:** The authors declare that data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the funder and the investigators within ADNI contributed to the data collection, but did not participate in analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Copyright © 2019 Gupta, Lama, Kwon and the Alzheimer's Disease Neuroimaging Initiative. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Diagnosis of Brain Diseases via Multi-Scale Time-Series Model

Zehua Zhang<sup>1</sup> , Junhai Xu<sup>2</sup> , Jijun Tang1,3, Quan Zou<sup>4</sup> \* and Fei Guo<sup>1</sup> \*

*<sup>1</sup> School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China, <sup>2</sup> School of Artificial Intelligence, College of Intelligence and Computing, Tianjin University, Tianjin, China, <sup>3</sup> Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States, <sup>4</sup> Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China*

Functional magnetic resonance imaging (fMRI) data and brain network analysis have been widely applied to the automated diagnosis of neural or brain diseases. The fMRI time series data not only contain specific numerical information, but also involve rich dynamic temporal information; however, previous graph theory approaches focus on local topology structure and lose contextual and global fluctuation information. Here, we propose a novel multi-scale functional connectivity for identifying brain disease via fMRI data. We calculate the discrete probability distribution of co-activity between different brain regions over various intervals. We also consider nonsynchronous information under different time dimensions in order to analyze the contextual information in the fMRI data. Therefore, our proposed method can be applied to the diagnosis of more diseases and to other fMRI data, particularly the automated diagnosis of neural or brain diseases. Finally, we adopt a Support Vector Machine (SVM) on our proposed time-series features, which can be applied to brain disease classification and even to other time-series data. Experimental results verify the effectiveness of our proposed method compared with other outstanding approaches on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and a Major Depressive Disorder (MDD) dataset. Therefore, we provide an efficient system that studies brain networks from a novel perspective.

Edited by: *Nianyin Zeng, Xiamen University, China*

Reviewed by: *Shengfeng He, South China University of Technology, China; Peng Cheng, Agency for Science, Technology and Research (A\*STAR), Singapore*

#### \*Correspondence:

*Quan Zou zouquan@nclab.net Fei Guo fguo@tju.edu.cn*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *14 January 2019* Accepted: *19 February 2019* Published: *14 March 2019*

#### Citation:

*Zhang Z, Xu J, Tang J, Zou Q and Guo F (2019) Diagnosis of Brain Diseases via Multi-Scale Time-Series Model. Front. Neurosci. 13:197. doi: 10.3389/fnins.2019.00197*

Keywords: functional magnetic resonance imaging, multi-scale time-series, Alzheimer's disease, major depressive disorder, functional connection

### 1. INTRODUCTION

The functional Magnetic Resonance Imaging (fMRI) technique provides an opportunity to quantify functional integration via measuring the correlation between intrinsic Blood-Oxygen-Level-Dependent (BOLD) signal fluctuations of distributed brain regions at rest. The BOLD signal is sensitive to spontaneous neural activity within brain regions, thus it can be used as an efficient and noninvasive way for investigating neurological disorders at the whole-brain level. Functional connectivity (FC), defined as the temporal correlation of BOLD signals in different brain regions, can exhibit how structurally segregated and functionally specialized brain regions interact with each other. Therefore, the brain network analysis using fMRI data will provide great advantages to automated diagnosis of neural diseases or brain diseases.

Some researchers model the FC information as a specific network by using graph theoretic techniques. Differences between normal and disrupted FC networks caused by pathological attacks provide important biomarkers to understand pathological underpinnings, in terms of the topological structure and connection strength. Network analysis has become an increasingly useful tool for understanding the cerebral working mechanism and mining sensitive biomarkers for neural or mental diseases. Zeng et al. (2018) proposed a new switching delayed particle swarm optimization (SDPSO) algorithm to optimize the SVM parameters. Using graph theory, brain network analysis provides an effective solution to concisely quantify the connectivity properties of brain networks, where each node denotes a particular anatomical element or a brain region, and each edge represents the relationship between a pair of nodes, such as anatomical, functional or effective connections (Friston, 2011). The anatomical connection typically corresponds to white matter tracts between pairs of brain regions. The functional connection corresponds to magnitudes of temporal correlations in activity and occurs between some pairs of anatomically unconnected regions, which may reflect linear or nonlinear interactions, as well as interactions within different time scales (Zhou et al., 2009). The effective connection represents direct or indirect causal influences of one region on another region, which may be estimated from observed perturbations, whether synchronous or asynchronous (Friston et al., 2003). As a brain network analysis approach, graph theory offers two important advantages (Tijms et al., 2013). One is that it provides quantitative measurements, which can preserve the connectivity information in the network and thus reflect the segregated and integrated nature of local brain activity. The other is that it provides a general framework for comparing heterogeneous graphs constructed from different types of data, such as anatomical and functional data.

However, these graph theory approaches have several drawbacks that must be overcome. First, graph theory features have limitations. On the one hand, common graph theory features such as edge weights, path lengths and clustering coefficients (Rubinov and Sporns, 2010; Chen et al., 2011) usually focus on local topology structure and lose global topology characteristics (Sanz-Arigita et al., 2010; Jie et al., 2018); on the other hand, each node in a brain network corresponds uniquely to a specific brain region, yet the label information of each node is mostly ignored (Jie et al., 2018). Second, the functional connectivity is more sensitive to local information than to the global topology, but some recent studies (Hutchison et al., 2013; Leonardi et al., 2013; Zeng et al., 2013, 2014; Allen et al., 2014) indicate that the FC network contains rich dynamic temporal information. To be more concrete, for each brain region, a sliding window approach is performed to generate a set of BOLD subseries, as in schizophrenia diagnosis (Damaraju et al., 2014) and other studies (Chen et al., 2016; Wee et al., 2016). Third, the raw functional data is underutilized: building a brain network from raw data may lose the temporal or contextual information. For example, Pearson's Correlation Coefficient (PCC) is the simplest and most commonly used scheme in functional connectivity estimation; it is the covariance of the two variables divided by the product of their standard deviations. Clearly, according to this definition, the PCC value is context-independent, or order-independent, in time series, and does not consider nonsynchronous information under different time dimensions.

In view of the above, the fMRI time series not only contains specific numerical information, but also involves contextual information and global fluctuation information. In this paper, we propose a novel time-series model based on Jensen-Shannon divergence for identifying brain disease via fMRI data; the flow chart is shown in **Figure 1**. First, we calculate the discrete probability distribution of co-activity between different brain regions over various intervals in multi-scale time series data. Second, the contextual information is taken into account in analyzing the correlation and causality among the fMRI data. Third, we design a novel time-series-based method to measure the similarity between the co-activity intensities of the brain functional connectivity of two objects. Finally, we adopt a Support Vector Machine (SVM) on our proposed time-series features, which can be applied to brain disease classification and even to other time-series data. Experimental results verify the effectiveness of our proposed method compared with other outstanding approaches on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and a Major Depressive Disorder (MDD) dataset. The rest of this paper is organized as follows. We start with a brief review of the datasets and pre-processing. Then, we formulate the problem and present our proposed method. Finally, experimental results are reported, followed by the conclusion of this work.

### 2. MATERIALS AND METHODS

In this section, we introduce the flow of our method. First, we pre-process the original data, remove the noise, and segment the fMRI image data using the brain region template. Next, we extract features from the perspective of the functional connection between brain regions. To overcome the shortcomings of the traditional Pearson Correlation Coefficient (PCC) method, we propose a novel framework for feature extraction of brain functional connections. Then, after feature selection, we use a classification model to predict brain disease. Finally, we discuss the parameter settings of the model.

### 2.1. Dataset

We carry out experiments on two different datasets. One is a public Alzheimer's Disease Neuroimaging Initiative database (Jack et al., 2010), and another one is a volunteer experiment of Major Depressive Disorder (Geng et al., 2018). In the data pre-processing, we deal with the raw data by a widely used software package (SPM12), and then divide one brain into 116 brain regions.

### 2.1.1. ADNI

From the Alzheimer's Disease Neuroimaging Initiative database, we employ a total of 169 subjects, including 87 Alzheimer's patients (49 females and 38 males) and 82 normal controls (46 females and 36 males). We downloaded the ADNI data from the website http://adni.loni.usc.edu/.

### 2.1.2. MDD

In the volunteer experiment, we use a total of 60 subjects, including 31 volunteers with Major Depressive Disorder (MDD) (22 females and 8 males, aged 60.5 ± 11.2 years, range 25–65 years) and 29 healthy volunteers (18 females and 11 males, aged 50.1 ± 10.6 years, range 25–65 years). The MDD subjects without comorbidity had a minimum illness duration of more than 3 months. Each participant provided written informed consent, and the study was conducted in accordance with the local Ethics Committee.

### 2.2. Pre-processing

We perform image pre-processing for the fMRI data using a standard pipeline, carried out via the statistical parametric mapping (SPM12, www.fil.ion.ucl.ac.uk/spm/software/spm12/) software package on Matlab. The pre-processing procedure includes slice timing correction, realignment, segmentation, normalization, and band-pass filtering. For a more detailed description of the pre-processing procedure, please refer to the SPM12 website.

The whole brain of each subject in fMRI space is parcellated into 116 brain regions of interest (ROIs) according to the Automated Anatomical Labeling (AAL) template. This atlas divides the brain into 78 cortical regions, 26 cerebellar regions, and 12 subcortical regions according to anatomy; details are given in Tzourio-Mazoyer et al. (2002). For each of the 116 ROIs, the mean time series is calculated by averaging the Blood-Oxygen-Level-Dependent (BOLD) signals over all voxels within the specific ROI. Many similar templates exist, such as the Brainnetome template (Fan et al., 2016) and the Harvard-Oxford template.
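The ROI averaging step can be illustrated with the following minimal sketch. It assumes the voxel time series and an atlas label per voxel are already available as plain lists (label 0 meaning background); the function name `roi_mean_time_series` and the toy data are hypothetical.

```python
def roi_mean_time_series(voxel_series, voxel_labels, n_rois):
    """Average the time series of all voxels sharing the same ROI label.

    voxel_series: list of per-voxel time series (equal length).
    voxel_labels: ROI label of each voxel, 1..n_rois (0 = background, skipped).
    Returns one mean time series per ROI (None for an ROI without voxels).
    """
    n_time = len(voxel_series[0])
    sums = [[0.0] * n_time for _ in range(n_rois)]
    counts = [0] * n_rois
    for series, label in zip(voxel_series, voxel_labels):
        if label == 0:
            continue
        counts[label - 1] += 1
        for t, value in enumerate(series):
            sums[label - 1][t] += value
    return [
        [s / counts[r] for s in sums[r]] if counts[r] else None
        for r in range(n_rois)
    ]


# Toy example: 4 voxels, 3 time points, 2 ROIs (illustrative values only).
voxels = [[1, 2, 3], [3, 4, 5], [10, 10, 10], [0, 0, 0]]
labels = [1, 1, 2, 0]
means = roi_mean_time_series(voxels, labels, n_rois=2)
```

With the AAL template, `n_rois` would be 116, giving 116 mean time series per subject.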

### 2.3. Feature Extraction

After pre-processing, the focus of our research is how to identify the location and cause of lesions. The most common method is to calculate the correlation between two brain regions through the Pearson Correlation Coefficient (PCC), and to analyze lesions by observing changes in the correlation. However, the PCC value is context-independent, or order-independent; that is, it does not consider nonsynchronous information at different time intervals. Here, we first give a basic introduction to PCC, and then elaborate on our approach.

### 2.3.1. Pearson Correlation Coefficient

Pearson's correlation coefficient (PCC) is the simplest and most commonly used scheme in functional connectivity estimation. For any two brain regions, the degree of coordination of their blood-oxygen-level-dependent fluctuations is calculated as the functional connection strength between the two regions. In the case of the AAL template, this step yields 6,670-dimensional features (one per pair among the 116 regions, 116 × 115 / 2 = 6,670). Mathematically, the PCC is the covariance of the two variables divided by the product of their standard deviations, as follows:

$$\text{PCC}_{X,Y} = \frac{\text{E}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{1}$$

Clearly, according to the formula, the value of Pearson's correlation coefficient is context-independent, or order-independent, in time series: it only compares values aligned at the same time point, so information about the time dimension or context is missing.
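The order-independence of the PCC can be checked numerically. The sketch below (toy series, hypothetical helper names) computes the PCC of two series, then applies the same random permutation of time points to both series and recomputes it; the value is unchanged up to floating-point rounding, which shows that temporal order does not enter the measure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def n_pairs(n_regions):
    """Number of unordered region pairs, i.e., the PCC feature dimension."""
    return n_regions * (n_regions - 1) // 2


# Toy BOLD-like series for two regions (illustrative values only).
x = [0.1, 0.4, 0.9, 0.3, 0.7, 0.2]
y = [0.2, 0.5, 0.8, 0.1, 0.9, 0.3]
r_original = pearson(x, y)

# Apply the same permutation of time points to both series.
order = list(range(len(x)))
random.Random(0).shuffle(order)
r_shuffled = pearson([x[i] for i in order], [y[i] for i in order])

print(n_pairs(116))  # 6670 features for the 116 AAL regions
```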

### 2.3.2. Multi-Scale Functional Connectivity of Brain Regions

We extract the discrete probability distribution of co-activity in time series data. First, we use the function φ(·) to evaluate temporal dynamic property of the time series data. In addition, we convert φ(·) to g(·), defined as follows:

$$\phi(t_{i_1,j_1}^{k_1}, t_{i_2,j_2}^{k_2}) = g(f_{\varphi}(t_{i_1,j_1}^{k_1}), f_{\varphi}(t_{i_2,j_2}^{k_2})) \tag{2}$$

where f(·) represents a mapping function that makes use of prior knowledge in order to map the original time series into another specific form, g(·) represents the function to evaluate temporal information after the mapping operation.

We utilize the prior knowledge in order to map the original multivariate time series data into another specific form, such as a mapping of numeric, state and character. The mapping function is defined as follows:

$$\begin{split} f_{\varphi}(A_k) &= f_{\varphi}(T_1^k, T_2^k, \dots, T_i^k, \dots, T_N^k) \\ &= \{ f_{\varphi}(T_1^k), f_{\varphi}(T_2^k), \dots, f_{\varphi}(T_N^k) \} \end{split} \tag{3}$$

where $A_k$ denotes the original time series data, and $\varphi$ denotes the prior knowledge.

In the multivariate time series data $A_k$, the correlation value between $T_i^k$ and $T_j^k$ is defined as follows:

$$C^{k}_{\phi(\cdot)}(i,j) = \sum_{m=1}^{M} \phi(t^{k}_{i,m}, t^{k}_{j,m}) \tag{4}$$

In addition, the correlation value between $T_i^k$ and $T_j^k$ in interval $I_t = [r_t, s_t]$ is defined as follows:

$$C^k_{\phi(\cdot)}(i, j, I_t) = \sum_{m=1}^{M} \sum_{l=r_t}^{s_t} \phi(t_{i,m}^k, t_{j,m+l}^k) \tag{5}$$

Note that, in general, $C^k_{\phi(\cdot)}(i, j, I_t) \neq C^k_{\phi(\cdot)}(j, i, I_t)$.

Generally, we explore the correlation of time series data in multiple intervals. Let $C^k_{\phi(\cdot)} \in \mathbb{R}^{N \times N \times T}$ denote the multi-scale weighted correlation coefficient in the multivariate time series data $A_k$. Here, $C^k_{\phi(\cdot)}$ is a 3-order tensor, $N$ is the number of time series, and $T$ is the number of intervals.

Next, we transform the tensor $C^k_{\phi(\cdot)}$ into a discrete probability distribution $P^k_{\phi(\cdot)}$ for analyzing co-activity in multi-scale time series data, as follows:

$$P^k_{\phi(\cdot)} = \{ p^k_{\phi(\cdot)}(i, j, I_t) | i, j \in [1, N], I_t \in I \} \tag{6}$$

where $p^k_{\phi(\cdot)}(i, j, I_t)$ represents the proportion of the correlation value between the $i$-th and the $j$-th time series based on function $\phi(\cdot)$ in interval $I_t$, defined as follows:

$$p^k_{\phi(\cdot)}(i,j,I_t) = \frac{C^k_{\phi(\cdot)}(i,j,I_t)}{\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^T C^k_{\phi(\cdot)}(i,j,I_t)}\tag{7}$$
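The interval-wise correlation (Eq. 5) and its normalization into a discrete probability distribution (Eqs. 6 and 7) can be sketched as follows. This is an illustrative reading of the formulas rather than the authors' code: the function names are hypothetical, time-point pairs whose lagged index falls beyond the end of the series are skipped (an assumption, since the boundary handling is not specified), and the example `phi` simply tests whether two binary states are both active.

```python
def interval_correlation(series_i, series_j, interval, phi):
    """Eq. 5: sum phi over all time points m and all lags l in [r_t, s_t]."""
    r, s = interval
    n_time = len(series_i)
    total = 0
    for m in range(n_time):
        for lag in range(r, s + 1):
            if m + lag < n_time:  # assumption: out-of-range pairs are skipped
                total += phi(series_i[m], series_j[m + lag])
    return total


def coactivity_distribution(series, intervals, phi):
    """Eqs. 6-7: build the N x N x T tensor C and normalize it to sum to 1."""
    n = len(series)
    tensor = [
        [[interval_correlation(series[i], series[j], interval, phi)
          for interval in intervals]
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(v for plane in tensor for row in plane for v in row)
    # Undefined (division by zero) if no pair of time points is correlated.
    return [[[v / total for v in row] for row in plane] for plane in tensor]


def both_active(a, b):
    """Example phi: 1 if both states are active, 0 otherwise."""
    return 1 if a == 1 and b == 1 else 0


# Two toy state sequences and two intervals (illustrative values only).
states = [[1, 0, 1, 0],
          [0, 1, 0, 1]]
P = coactivity_distribution(states, [(0, 0), (1, 1)], both_active)
```

In this toy example, region 0 is followed by region 1 with a lag of one time point twice, whereas the reverse happens only once, so `P[0][1][1]` (2/7) differs from `P[1][0][1]` (1/7). This illustrates the asymmetry noted after Eq. 5.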

### 2.4. Classification Model for Predicting Brain Disease

In disease prediction, the number of samples is limited, but the feature dimension is usually large, so we need to both compress the feature space to improve the accuracy and analyze the etiology with more meaningful features. We use t-test for feature selection, and then we use Support Vector Machine (SVM) as the learning model, which is described in detail as follows.

### 2.4.1. Feature Selection

We use the two-sample t-test as the feature selection method. The null hypothesis is that a given feature has the same mean in the positive and the negative samples, and we set the significance level to p = 0.05.
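A minimal sketch of such a t-test-based selection is given below. It assumes the pooled-variance (equal-variance) form of the two-sample t statistic, which the text does not specify, and it compares |t| against a critical value supplied by the caller instead of computing a p-value. For the toy data, 2.306 is the two-sided 5% critical value of the t distribution with 8 degrees of freedom. All names and values are hypothetical.

```python
import math
from statistics import fmean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    # Undefined (division by zero) if the feature is constant in both groups.
    return (fmean(a) - fmean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def select_features(pos, neg, t_critical):
    """Indices of features whose |t| exceeds the given critical value."""
    keep = []
    for f in range(len(pos[0])):
        t = pooled_t_statistic([row[f] for row in pos], [row[f] for row in neg])
        if abs(t) > t_critical:
            keep.append(f)
    return keep


# Toy data: 5 positive and 5 negative samples with 2 features each.
pos = [[5.1, 1.0], [4.9, 1.2], [5.0, 0.8], [5.2, 1.1], [4.8, 0.9]]
neg = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2], [0.8, 0.8]]

# df = 5 + 5 - 2 = 8; two-sided critical value at the 0.05 level is 2.306.
selected = select_features(pos, neg, t_critical=2.306)
```

Only the first feature, whose group means differ strongly, is retained. In practice a statistics library would be used to obtain exact p-values.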

#### 2.4.2. Support Vector Machine

We adopt the Support Vector Machine (SVM) technique developed by Cortes and Vapnik (1995) to solve the binary classification problem. Various kinds of binary classification models can also be applied to many other biomedical prediction problems (Guo et al., 2014, 2015, 2016; Ding et al., 2016a,b, 2017a,b; Liu et al., 2016; Zeng et al., 2016; Shen et al., 2017a,b; Xuan et al., 2017; Pan et al., 2018). The decision function is as follows:

$$y(A_k) = \text{sign}\left\{\sum_{i=1}^{K} \alpha_i y_i \cdot \mathcal{K}(A_k, A_i) + b\right\} \tag{8}$$

where $\mathcal{K}(A_k, A_i)$ represents our proposed novel time-series kernel function, $y_i$ is the class label of the $i$-th training sample, and $\alpha_i$ is calculated as follows:

$$\begin{aligned} \text{Maximize } \quad & \sum_{i=1}^{K} \alpha_i - \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{K} \alpha_i \alpha_j \cdot y_i y_j \cdot \mathcal{K}(A_i, A_j) \\ \text{s.t. } \quad & 0 \le \alpha_i \le C \\ & \sum_{i=1}^{K} \alpha_i y_i = 0 \end{aligned} \tag{9}$$

where C is a regularization parameter that controls the tradeoff between margin and misclassification error.
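Once the multipliers have been obtained, evaluating the decision function of Eq. 8 only requires the kernel values between the test sample and the training samples. The sketch below assumes these kernel values are precomputed; the function name and all numbers are hypothetical, and a score of exactly zero is assigned to the positive class by convention.

```python
def svm_decision(kernel_row, alphas, labels, bias):
    """Eq. 8: sign of sum_i alpha_i * y_i * K(A_k, A_i) + b.

    kernel_row: kernel values K(A_k, A_i) between the test sample and
                each training sample (precomputed).
    alphas:     dual coefficients alpha_i of the training samples.
    labels:     class labels y_i in {+1, -1}.
    """
    score = sum(a * y * k for a, y, k in zip(alphas, labels, kernel_row)) + bias
    return 1 if score >= 0 else -1


# Toy model with two support vectors (illustrative values only).
alphas = [0.5, 0.5]
labels = [1, -1]  # note: sum(alpha_i * y_i) = 0, as required by Eq. 9
bias = 0.0

pred_a = svm_decision([0.9, 0.2], alphas, labels, bias)  # closer to the positive sample
pred_b = svm_decision([0.1, 0.8], alphas, labels, bias)  # closer to the negative sample
```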

### 2.5. Model Parameter

In practice, the parameters of our method require a more detailed discussion. We describe the prior knowledge and assumptions used in our problem of Alzheimer's disease and Major Depressive Disorder diagnosis, and clarify some details. The time series data not only carry specific numerical information, but also include contextual and fluctuation trend information.

Here, due to the BOLD imaging principle, we pay more attention to the time points in a high-activity state, that is, time points with high values in the time series. We define a dynamic, or soft, threshold to decide whether a time point is active or not, thereby converting a numeric sequence into a state sequence (0/1 sequence).

For all active time points in one time series, we count the number of time points with simultaneous responses in the other time series. Moreover, we analyze the co-activity between two time series asynchronously. The asynchronous analysis provides more detail and thus more essential information, which is also supported by the higher classification accuracy in the experiments.

#### 2.5.1. Time Series Mapping

We adopt an empirical rule to define the dynamic threshold, called the three-sigma method (WalterA, 1986). This method converts a numeric sequence into a state sequence; the dynamic threshold is represented as follows:

$$th(T_i^k) = \mu(T_i^k) + \eta \cdot \sigma(T_i^k) \tag{10}$$

where

$$\mu(T_i^k) = \frac{\sum_{m=1}^{M} t_{i,m}^k}{|T_i^k|} \tag{11}$$

and

$$\sigma(T\_i^k) = \frac{\sum\_{m=1}^{M} (t\_{i,m}^k - \mu(T\_i^k))^2}{|T\_i^k| - 1} \tag{12}$$

In a multivariate time series A<sup>k</sup>, we calculate a corresponding dynamic threshold th(T<sub>i</sub><sup>k</sup>) for each set of time series T<sub>i</sub><sup>k</sup>. Then, for a set of time series T<sub>i</sub><sup>k</sup>, we convert the numeric sequence into a 0/1 sequence according to the mapping function f(·), as follows:

$$f(t\_{i,m}^k) = \begin{cases} 1, & t\_{i,m}^k \ge th(T\_i^k) \\ 0, & \text{else} \end{cases} \tag{13}$$

The magnitude of η indicates the sensitivity of our method to the active state. In our experiment, η is set to 1.
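The mapping in Equations (10)-(13) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `binarize_series` is hypothetical, and σ is taken here to be the sample standard deviation (the usual reading of the three-sigma rule), which is an assumption; if Equation (12) is read literally, without a square root, `std` would be replaced by `var`.

```python
import numpy as np

def binarize_series(series, eta=1.0):
    """Map a numeric time series to a 0/1 state sequence (Eqs. 10-13)."""
    series = np.asarray(series, dtype=float)
    mu = series.mean()                    # Eq. (11): mean of the series
    sigma = series.std(ddof=1)            # assumption: sample standard deviation
    threshold = mu + eta * sigma          # Eq. (10): dynamic threshold
    states = (series >= threshold).astype(int)   # Eq. (13): active = 1
    return states, threshold

# Toy series with one clearly active time point.
states, th = binarize_series([0.0, 1.0, 0.0, 1.0, 8.0], eta=1.0)
print(states, round(th, 3))
```

A larger `eta` raises the threshold, so fewer time points are labelled active.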

#### 2.5.2. Correlation Function φ

The correlation function represents the relationship between a pair of time points in the time series. In disease diagnosis, we only focus on co-activity, that is, both brain region i at time point m and brain region j at time point n are in active states. To be more concrete, t<sub>i,m</sub><sup>k</sup> and t<sub>j,n</sub><sup>k</sup> are greater than or equal to the thresholds th(T<sub>i</sub><sup>k</sup>) and th(T<sub>j</sub><sup>k</sup>), respectively.

$$g(f(t\_{i,m}^k), f(t\_{j,n}^k)) = \begin{cases} 1, & f(t\_{i,m}^k) = f(t\_{j,n}^k) = 1 \\ 0, & \text{else} \end{cases} \tag{14}$$

Corresponding to Formula 2 above, φ(·) in our experiment is:

$$\phi(t\_{i,m}^k, t\_{j,n}^k) = \begin{cases} 1, & t\_{i,m}^k \ge th(T\_i^k) \text{ \& } t\_{j,n}^k \ge th(T\_j^k) \\ 0, & \text{else} \end{cases} \tag{15}$$

#### 2.5.3. Interval Set I

For a collection of intervals I, we extract local information interval by interval. The more intervals the collection contains, the more detailed the extracted information, but the features then tend to be sparse and prone to over-fitting; if the collection contains too few intervals, we may lose key information. Also, for an interval I<sup>t</sup> ∈ I, if I<sup>t</sup> is close to zero, the two time points of interest are very close; if I<sup>t</sup> is far from zero, we extract long-distance asynchronous information.

In our experiments, the interval collection I is set to {[0, 0], [1, 1], [2, 2], [3, 12]}. Here, [0, 0] captures synchronous information, [1, 1] and [2, 2] capture short-distance asynchronous correlation, and [3, 12] is a loose interval for long-distance asynchronous correlation. This choice is empirical: it is fine-grained for intervals close to zero and coarse for long distances.
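To make the role of the interval collection concrete, the sketch below counts co-active pairs of time points per interval for two 0/1 state sequences. It is only one plausible reading of the text: the function name `coactivity_counts` is hypothetical, the lag is assumed to be n − m ≥ 0 (region j follows region i), both sequences are assumed to have the same length, and raw counts are returned without the normalization that the kernel definition may apply.

```python
import numpy as np

INTERVALS = [(0, 0), (1, 1), (2, 2), (3, 12)]   # interval collection I from the text

def coactivity_counts(states_i, states_j, intervals=INTERVALS):
    """Count pairs (m, n) with both regions active and lag n - m in each interval."""
    si = np.asarray(states_i, dtype=int)
    sj = np.asarray(states_j, dtype=int)
    counts = []
    for lo, hi in intervals:
        total = 0
        for lag in range(lo, hi + 1):
            if lag >= len(si):
                break                      # no pairs are left at this lag
            # phi = 1 only if region i is active at m and region j at m + lag
            total += int(np.sum(si[: len(si) - lag] * sj[lag:]))
        counts.append(total)
    return counts

print(coactivity_counts([1, 0, 0, 1, 0, 0], [1, 1, 0, 0, 0, 1]))
```

Each entry of the returned list corresponds to one interval, so the first entry measures synchronous co-activity and the last one pools all lags from 3 to 12.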

### 3. RESULTS

Our experiments consist of three parts. To demonstrate the effectiveness of our approach, we apply it to the automated diagnosis of Alzheimer's disease (AD) and Major Depressive Disorder (MDD), respectively. We evaluate classification performance using leave-one-out cross-validation (LOOCV), and adopt accuracy (ACC), sensitivity, specificity (Spe) and AUC as evaluation metrics. First, we compare the traditional PCC method with our feature extraction method on the AD and MDD data sets. Then, we compare the effects of different classifiers. Finally, we compare our approach with some recent research works.
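The evaluation protocol can be illustrated with a short NumPy sketch. It is not the authors' pipeline: `leave_one_out`, `binary_metrics` and the stand-in classifier `nearest_mean` are hypothetical names, labels are assumed to be 0 (control) and 1 (patient), and AUC is omitted because it requires continuous decision scores.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),     # assumes at least one patient
        "specificity": tn / (tn + fp),     # assumes at least one control
    }

def leave_one_out(features, labels, fit_predict):
    """LOOCV: hold out each subject once; fit_predict is a user-supplied callback."""
    features, labels = np.asarray(features), np.asarray(labels)
    preds = np.empty_like(labels)
    for k in range(len(labels)):
        train = np.arange(len(labels)) != k
        preds[k] = fit_predict(features[train], labels[train], features[k])
    return binary_metrics(labels, preds)

def nearest_mean(train_x, train_y, test_x):
    """Stand-in classifier for the demo: assign the class with the nearest mean."""
    means = {c: train_x[train_y == c].mean(axis=0) for c in np.unique(train_y)}
    return min(means, key=lambda c: np.linalg.norm(test_x - means[c]))

X = np.array([[0.0], [0.2], [0.1], [1.0], [0.9], [1.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(leave_one_out(X, y, nearest_mean))
```

Any classifier can be plugged in through the callback, which keeps the cross-validation loop independent of the model being compared.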

### 3.1. Comparison of Different Features

Here, we compare the performance of the traditional PCC method and our feature extraction method for analyzing fMRI data. Apart from feature extraction, we use the same experimental steps and parameters, including preprocessing, feature selection and classifier. The results are shown in **Table 1**.

On the Alzheimer's disease and Major Depressive Disorder databases, we compare our method with the traditional PCC method; the classification results are summarized in **Table 1**. The features extracted by our multi-scale functional connection (Multi-Scale FC) method predict brain disease clearly better than those of the traditional PCC method. On the Alzheimer's disease dataset, our method achieves the best specificity of 0.9268. Moreover, by combining PCC and our method, we achieve better results, with ACC of 0.8935 and AUC of 0.8748. On the MDD dataset, our method also achieves the best results, but here the results are actually lower when PCC and multi-scale functional connection are combined. The experimental results indicate that our approach is more effective than traditional PCC or graph theory feature-based methods. Combining different methods can yield better results, but there is also a risk of over-fitting.

### 3.2. Comparison of Different Classifiers

In this part, we use the feature extraction model in the previous step to compare the performance of different classifiers. Specifically, we compare three classifiers: random forest (RF), logistic regression (LR) and support vector machine (SVM). The results are shown in **Table 2**.


TABLE 2 | Comparison of different classifiers.


TABLE 3 | Comparison of different existing methods on ADNI.


In this part, we use our proposed multi-scale functional connection method to extract features, and compare the results of different classifiers. Among the three classifiers, SVM achieves the highest AUC on both the AD and MDD datasets as well as the best ACC on the AD dataset, so it is generally a stable classifier. In addition, RF obtains the best ACC on the MDD dataset, and LR obtains the best Spe on the AD dataset. Overall, all three classifiers achieve good accuracy, indicating that the information extracted by our method is effective and stable.

### 3.3. Comparison of Different Existing Methods

We compare our proposed method with recent outstanding studies. Baseline represents the traditional graph theory feature-based method. Moreover, the state-of-the-art methods represent three major groups of graph kernels on edge, subtree and shortest-path, respectively. These graph kernels belong to the Weisfeiler-Lehman graph kernel framework (Shervashidze et al., 2011), and are denoted as WL-edge, WL-subtree and WL-shortestpath, respectively. In addition, for Alzheimer's disease diagnosis, we also compare with the shortest-path graph kernel method (Shortest-path) (Borgwardt and Kriegel, 2006), the sliding window method (FON: 70-length sliding window with 1-step) (Chen et al., 2016) and the sub-network kernel method (SKL) (Jie et al., 2018). For the Major Depressive Disorder classification problem, we compare with the method of Geng et al. (2018).

On the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, we compare our method with seven existing methods, and the classification results are summarized in **Table 3**. Our method achieves the best accuracy of 0.8876 and the best AUC of 0.8562.

TABLE 4 | Comparison of different existing methods on MDD.


In comparison, the accuracy values for Baseline, FON, Shortest-path, WL-edge, WL-subtree, WL-shortestpath and SKL are 0.5858, 0.8580, 0.7396, 0.6272, 0.7811, and 0.6095, respectively. Also, the AUC values for these seven methods are 0.5612, 0.8195, 0.6938, 0.6084, 0.7645, and 0.5735, respectively. Compared with the best of these methods, our method improves accuracy by 0.0296 and AUC by 0.0367. The experimental results indicate that our approach is far better than traditional graph theory feature-based methods, and slightly better than the state-of-the-art graph kernel-based methods.

On the Major Depressive Disorder volunteer data, we compare our method with three existing methods, and the classification results are summarized in **Table 4**. Our method achieves the best accuracy of 0.9000 and the best AUC of 0.9295. In comparison, the accuracy values for Baseline, Shortest-path and the method of Xu et al. are 0.6167, 0.7833, and 0.8667, respectively. Also, the AUC values for these three methods are 0.6514, 0.8135, and 0.9103, respectively. Compared with the best of these methods, our method improves accuracy by 0.0333 and AUC by 0.0192. The experimental results indicate that our approach is far better than traditional graph methods, and slightly better than the current outstanding methods.

## 4. CONCLUSIONS

fMRI time series data contain not only specific numerical information but also rich dynamic temporal information. However, previous graph theory approaches focus on local topological structure and lose contextual information and global fluctuation information. Here, we propose a novel multi-scale functional connectivity method for identifying brain disease from fMRI data. We calculate the discrete probability distribution of co-activity between different brain regions at various intervals. We also consider asynchronous information across different time scales to analyze the contextual information in the fMRI data. Therefore, our proposed method can be applied to the diagnosis of more diseases and to other fMRI data, particularly for the automated diagnosis of neural or brain diseases. Experimental results verify the effectiveness of our proposed method, which provides an efficient system and a novel perspective for studying brain networks. In the future, parallel computing (Zou et al., 2017), computational intelligence (Xu et al., 2017; Zou et al., 2017) and neural networks (Song et al., 2018; Xu et al., 2018) can be considered as datasets grow.

### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here: http://adni.loni.usc.edu/. The results and code for this study can be found at https://github.com/guofei-tju/Multi-Scale-FC-Frontier-in-NeuroSci.git.

### AUTHOR CONTRIBUTIONS

FG and QZ conceived and designed the experiments. ZZ and JX performed the experiments and analyzed the data. FG and ZZ wrote the paper. FG and JT supervised the experiments and reviewed the manuscript.

### FUNDING

This work is supported by a grant from the National Science Foundation of China (NSFC 61772362), the Tianjin Research Program of Application Foundation and Advanced Technology (16JCQNJC00200), and the National Key R&D Program of China (2018YFC0910405, 2017YFC0908400).

### REFERENCES

cellular neural network approach. IEEE Trans. Med. Imaging 33, 1129–1136. doi: 10.1109/TMI.2014.2305394


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Xu, Tang, Zou and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Low-Rank Based Image Analyses for Pathological MR Image Segmentation and Recovery

Chuanlu Lin, Yi Wang\*, Tianfu Wang and Dong Ni

*National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Health Science Center, School of Biomedical Engineering, Shenzhen University, Shenzhen, China*

The presence of pathologies in magnetic resonance (MR) brain images causes challenges in various image analysis areas, such as registration, atlas construction and atlas-based segmentation. We propose a novel method for the simultaneous recovery and segmentation of pathological MR brain images. Low-rank and sparse decomposition (LSD) approaches have been widely used in this field, decomposing pathological images into (1) low-rank components as recovered images, and (2) sparse components as pathological segmentation. However, conventional LSD approaches often fail to produce recovered images reliably, due to the lack of constraint between low-rank and sparse components. To tackle this problem, we propose a transformed low-rank and structured sparse decomposition (TLS2D) method. The proposed TLS2D integrates the structured sparse constraint, LSD and image alignment into a unified scheme, which is robust for distinguishing pathological regions. Furthermore, the well recovered images can be obtained using TLS2D with the combined structured sparse and computed image saliency as the adaptive sparsity constraint. The efficacy of the proposed method is verified on synthetic and real MR brain tumor images. Experimental results demonstrate that our method can effectively provide satisfactory image recovery and tumor segmentation.

Keywords: MR brain images, image recovery, tumor segmentation, structured sparsity, low-rank, matrix decomposition

### Edited by:

*Yangming Ou, Harvard Medical School, United States*

### Reviewed by:

*Xiaoxiao Liu, CuraCloud Corporation, United States*
*Suyash P. Awate, Indian Institute of Technology Bombay, India*

\*Correspondence:

*Yi Wang onewang@szu.edu.cn*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *13 January 2019* Accepted: *21 March 2019* Published: *09 April 2019*

#### Citation:

*Lin C, Wang Y, Wang T and Ni D (2019) Low-Rank Based Image Analyses for Pathological MR Image Segmentation and Recovery. Front. Neurosci. 13:333. doi: 10.3389/fnins.2019.00333*

## 1. INTRODUCTION

Automated image computing routines (e.g., segmentation, registration, atlas construction) that can analyze magnetic resonance (MR) brain tumor scans are of essential importance for improved disease diagnosis, treatment planning and follow-up of individual patients (Iglesias and Sabuncu, 2015; Mai et al., 2015; Menze et al., 2015; Chen et al., 2018). Lately, a wave of deep learning has been taking over traditional computer aided diagnosis techniques, by learning abundant multi-level features from large training repositories for image representation and analysis (Litjens et al., 2017; Shen et al., 2017). Various architectures of deep convolutional neural networks have been developed and employed for brain tumor segmentation (Pereira et al., 2016; Havaei et al., 2017; Kamnitsas et al., 2017; Zhao et al., 2018). Despite achieving satisfactory performance, deep learning based approaches require an enormous number of labeled images to train a segmentation model. Collecting and labeling useful training samples can take a long time and is thus sometimes clinically impractical. In addition, the presence of pathologies in MR brain images causes difficulties in most other image analyses, such as image registration, atlas construction and atlas-based anatomical segmentation. The recovery of pathological regions with normal brain appearances can facilitate subsequent image computing procedures. For example, the recovered images could further be used for atlas construction and the follow-up of specific patients (Joshi et al., 2004; Liu et al., 2014; Zheng et al., 2017; Han et al., 2018). However, there is a lack of deep learning based methods developed for pathological medical image recovery. In contrast, the low-rank and sparse decomposition (LSD) (Wright et al., 2009; Candès et al., 2011) scheme, which learns normal image appearance from unlabeled population data, has been widely employed to decompose pathological MR brain images into recovered normal brain appearances and pathological regions (Liu et al., 2015; Tang et al., 2018).

Although low-rank and sparse analysis for computational brain tumor segmentation has attracted considerable attention during the last decade, several challenges remain. First, conventional LSD methods have to be computed on a series of aligned images (Otazo et al., 2015; Tang et al., 2018), because image misalignment causes undesired structural differences that interfere with the representation of the sparse component. Thus, image alignment should be conducted before or during the LSD computation; however, image alignment itself is a challenging task. Second, a specific spatial constraint should be imposed on the sparse component to enforce the structured sparsity of the tumor region in the whole image. Third, LSD methods often produce recovered images (i.e., the low-rank component) with distorted pathological regions (Liu et al., 2015), due to the lack of an effective constraint between the low-rank and sparse components. It is therefore essential to adaptively balance the low-rank and sparse components so as to reliably recover tumor regions while retaining normal brain regions.

To address the aforementioned issues, this paper presents a novel method for the simultaneous recovery and segmentation of pathological MR brain images (see **Figure 1**). Specifically, we propose a transformed low-rank and structured sparse decomposition (TLS2D) method. The proposed TLS2D integrates the structured sparsity constraint, LSD and image alignment into a unified framework, which is robust for extracting pathological regions. Furthermore, well-recovered images can be obtained using TLS2D with the combined structured sparse component and computed image saliency as the adaptive sparsity constraint. Experimental results on synthetic and real MR brain tumor images demonstrate that the proposed TLS2D can effectively extract and recover tumor regions.

### 2. METHODS

The proposed recovery and segmentation framework is shown in **Figure 2**. Our TLS2D first iteratively aligns all images and decomposes the aligned images into low-rank and structured sparse components. Then the structured sparse components are combined with the computed saliency maps to generate tumor probability maps as the adaptive sparsity constraint. The final recovery and segmentation are obtained by imposing the adaptive sparsity constraint on the TLS2D.

The following subsections briefly review classical LSD and then elaborate the proposed TLS2D and the details of our framework.

### 2.1. Review of Low-Rank and Sparse Decomposition (LSD)

Suppose we are given n previously aligned MR brain images A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n</sub> ∈ ℝ<sup>w×h</sup>, where w and h denote the width and height of the image, respectively. We vectorize each image matrix A<sub>i</sub> to form a column of A = [vec(A<sub>1</sub>), vec(A<sub>2</sub>), ..., vec(A<sub>n</sub>)] ∈ ℝ<sup>m×n</sup>, where m = w × h.
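As a small illustration of this data layout, the sketch below stacks a list of equally sized 2D images into the matrix A and recovers one image from its column. The helper names are hypothetical, and row-major flattening is an arbitrary choice made here; any fixed ordering works as long as it is used consistently.

```python
import numpy as np

def stack_images(images):
    """Build A = [vec(A_1), ..., vec(A_n)] with shape (w*h, n)."""
    return np.stack([np.asarray(img, dtype=float).ravel() for img in images], axis=1)

def unstack_column(A, i, shape):
    """Recover image i (as a 2D array of the given shape) from column i of A."""
    return A[:, i].reshape(shape)

# Four toy "images" of size 2 x 3.
images = [np.arange(6).reshape(2, 3) + k for k in range(4)]
A = stack_images(images)
print(A.shape)   # (6, 4): m = w*h rows, n columns
```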

The conventional LSD method decomposes A into a low-rank matrix L and a sparse matrix S, where L indicates the linearly correlated normal images, and S represents sparse tumor regions. The decomposition can be solved by the following convex optimization:

$$\min\_{L, \mathcal{S}} \quad \|L\|\_{\*} + \lambda \left\|\mathcal{S}\right\|\_{1} \quad \text{s.t.} \ A = L + \mathcal{S}, \tag{1}$$

where ‖L‖<sub>∗</sub> is the nuclear norm of L (i.e., the sum of its singular values), ‖S‖<sub>1</sub> is the ℓ<sub>1</sub> norm of S, and the regularizing parameter λ weights the relationship between the low-rank and sparse components. The optimization in Equation (1) can be solved by the augmented Lagrangian multiplier (ALM) method (Lin et al., 2010).
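For readers who want to experiment with Equation (1), the following is a minimal NumPy sketch of an inexact ALM solver for the classical decomposition. It is a generic illustration, not the code used in this paper: the function name `rpca_ialm`, the default λ = 1/√max(m, n), and the initialization and growth factor for µ are common choices assumed here.

```python
import numpy as np

def rpca_ialm(A, lam=None, tol=1e-7, max_iter=500, rho=1.5):
    """Solve min ||L||_* + lam * ||S||_1  s.t.  A = L + S by inexact ALM."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))            # assumed default weight
    spectral = np.linalg.norm(A, 2)
    Y = A / max(spectral, np.abs(A).max() / lam)  # Lagrange multiplier init
    mu = 1.25 / spectral
    mu_max = mu * 1e7
    S = np.zeros_like(A)
    for _ in range(max_iter):
        # L-step: singular value thresholding at level 1/mu
        U, s, Vt = np.linalg.svd(A - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # S-step: element-wise soft thresholding at level lam/mu
        T = A - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        R = A - L - S                              # constraint residual
        Y = Y + mu * R
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(R) <= tol * np.linalg.norm(A):
            break
    return L, S

# Toy test: rank-2 matrix corrupted by 5% large sparse entries.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
S0 = np.zeros((60, 60))
S0.flat[rng.choice(3600, size=180, replace=False)] = rng.choice([-10.0, 10.0], size=180)
L_hat, S_hat = rpca_ialm(L0 + S0)
print(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))
```

With image data, each column of the input would be one vectorized image, and a smaller `lam` moves more of the signal into the sparse component.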

To realize practical and reliable recovery and segmentation of pathological MR images, three issues in LSD remain to be addressed: (1) all images shall be aligned in the same spatial domain; (2) S shall be structured sparse to better represent the contiguous tumor region in the whole image; (3) as illustrated in **Figure 3**, as the parameter λ becomes smaller, the low-rank images recover tumor regions more reliably, but also exhibit more blurred appearances in originally normal regions. Therefore, the regularizing parameter λ shall differ between normal and tumor regions, so as to adaptively balance the low-rank and sparse components and reliably recover tumor regions while retaining normal brain regions.

### 2.2. Transformed Low-Rank and Structured Sparse Decomposition (TLS2D)

To tackle the issues in LSD, we propose a transformed low-rank and structured sparse decomposition. First, because the tumor region usually occupies a contiguous portion of the brain image, it is reasonable to model the tumor region using a structured sparsity norm. Inspired by the structured sparsity in Jia et al. (2012), we introduce a structured sparsity norm Ω(S) to model the tumor region, and define the low-rank and structured sparse decomposition (LS2D) as:

$$\min\_{L, \mathcal{S}} \quad \|L\|\_{\*} + \lambda \Omega\left(\mathcal{S}\right) \quad \text{s.t.} \; A = L + \mathcal{S}, \tag{2}$$

where

$$\Omega\left(\mathcal{S}\right) = \sum\_{i=1}^{n} \sum\_{g \in \mathcal{G}} \left\| \operatorname{mat}(\mathcal{S}\_{i})\_{g} \right\|\_{\infty}.\tag{3}$$

In Equation (3), S<sub>i</sub> ∈ ℝ<sup>m</sup> is the i-th column of S, and mat(S<sub>i</sub>) ∈ ℝ<sup>w×h</sup> is the matrix form of S<sub>i</sub>. We define 3 × 3 overlapping-patch groups G in mat(S<sub>i</sub>), and g ∈ G represents each 3 × 3 group. Each group overlaps 6 pixels with its neighbor group. ‖·‖<sub>∞</sub> is the ℓ<sub>∞</sub> norm (i.e., the maximum absolute value in a group g). The structured sparsity norm Ω(S) in Equation (2) constrains S to have a structured distribution and thus better represent the tumor region.
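The norm in Equation (3) can be evaluated directly, as in the sketch below. The function name is hypothetical, the groups are taken to be all stride-1 3 × 3 windows that fit inside the image (which gives the 6-pixel overlap between neighbors), and no padding is applied at the border; the border handling is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def structured_sparsity_norm(S, shape, patch=3):
    """Omega(S): sum over images and 3x3 groups of the l_inf norm of each group."""
    S = np.asarray(S, dtype=float)
    total = 0.0
    for i in range(S.shape[1]):
        image = np.abs(S[:, i].reshape(shape))               # |mat(S_i)|
        groups = sliding_window_view(image, (patch, patch))  # stride-1 windows
        total += groups.max(axis=(-2, -1)).sum()             # l_inf per group
    return float(total)

# Two 4x4 "sparse images": one interior pixel and one corner pixel.
S = np.zeros((16, 2))
S[5, 0] = 2.0     # pixel (1, 1) lies in all four 3x3 windows
S[0, 1] = -3.0    # pixel (0, 0) lies in a single window
print(structured_sparsity_norm(S, (4, 4)))
```

An isolated interior pixel is counted once for every window that contains it, whereas a contiguous blob shares windows among its pixels, which is why the norm favors spatially grouped support over scattered outliers.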

During the decomposition, the spatial mismatch between different images may cause undesired sparse noise. To alleviate the spatial mismatch, we perform image alignment in our decomposition procedure (Zheng et al., 2017). The proposed TLS2D is defined as follows:

$$\min\_{L, \mathcal{S}, \tau} \|L\|\_{\*} + \lambda \Omega\left(\mathcal{S}\right) \quad \text{s.t.} \ A \circ \tau = L + \mathcal{S}, \tag{4}$$

where τ denotes a set of n affine transformations τ<sub>1</sub>, τ<sub>2</sub>, ..., τ<sub>n</sub> that warp A to align all images; A ◦ τ = [vec(A<sub>1</sub> ◦ τ<sub>1</sub>), vec(A<sub>2</sub> ◦ τ<sub>2</sub>), ..., vec(A<sub>n</sub> ◦ τ<sub>n</sub>)] ∈ ℝ<sup>m×n</sup>.

The optimization of our TLS2D in Equation (4) is non-convex and difficult to solve directly due to the nonlinearity of τ. To tackle this issue, we iteratively linearize about the current estimate of τ according to Boyd et al. (2011) and Wang et al. (2018). Specifically, we linearize the constraint by using the local first-order Taylor approximation for each image as A ◦ (τ + Δτ) ≈ A ◦ τ + Σ<sub>i=1</sub><sup>n</sup> J<sub>i</sub>Δτ<sub>i</sub>ε<sub>i</sub>ε<sub>i</sub><sup>T</sup>, where Δτ = [Δτ<sub>1</sub>, Δτ<sub>2</sub>, ..., Δτ<sub>n</sub>] ∈ ℝ<sup>p×n</sup>, and each Δτ<sub>i</sub> ∈ ℝ<sup>p</sup> is defined by the p parameters of the transformation; J<sub>i</sub> = ∂/∂ζ vec(A<sub>i</sub> ◦ ζ)|<sub>ζ=τ<sub>i</sub></sub> ∈ ℝ<sup>m×p</sup> is the Jacobian of the image A<sub>i</sub> with respect to the transformation τ<sub>i</sub>, and {ε<sub>i</sub>} denotes the standard basis for ℝ<sup>n</sup>. Thus, Equation (4) can be relaxed into the following optimization:

$$\min\_{L, \mathcal{S}, \Delta \tau} \|L\|\_{\*} + \lambda \Omega \left( \mathcal{S} \right) \quad \text{s.t.} \ A \circ \tau + \sum\_{i=1}^{n} J\_i \Delta \tau\_i \epsilon\_i \epsilon\_i^T = L + \mathcal{S}. \tag{5}$$

Then the resulting convex programming in Equation (5) can be solved by ALM method (Lin et al., 2010). We formulate the following augmented Lagrangian function:

$$\begin{split} \mathcal{L}(L, \mathcal{S}, \Delta \tau, Y; \mu) &= \left\| L \right\|\_{\*} + \lambda \Omega \left( \mathcal{S} \right) + \left\langle Y, h(L, \mathcal{S}, \Delta \tau) \right\rangle \\ &+ \frac{\mu}{2} \left\| h(L, \mathcal{S}, \Delta \tau) \right\|\_{F}^{2}, \end{split} \tag{6}$$

where h(L, S, Δτ) = A ◦ τ + Σ<sub>i=1</sub><sup>n</sup> J<sub>i</sub>Δτ<sub>i</sub>ε<sub>i</sub>ε<sub>i</sub><sup>T</sup> − L − S; Y ∈ ℝ<sup>m×n</sup> is the Lagrangian multiplier and µ is a positive hyperparameter; ⟨·, ·⟩ denotes the matrix inner product, and ‖·‖<sub>F</sub> is the Frobenius norm. The ALM algorithm then estimates both the optimal solution and the Lagrange multiplier by iteratively solving the following four subproblems:

$$\begin{split} L^{t+1} &= \arg\min\_{L} \mathcal{L}(L, \mathcal{S}^{t}, \Delta \tau^{t}, Y^{t}; \mu^{t}), \\ \mathcal{S}^{t+1} &= \arg\min\_{\mathcal{S}} \mathcal{L}(L^{t+1}, \mathcal{S}, \Delta \tau^{t}, Y^{t}; \mu^{t}), \\ \Delta \tau^{t+1} &= \arg\min\_{\Delta \tau} \mathcal{L}(L^{t+1}, \mathcal{S}^{t+1}, \Delta \tau, Y^{t}; \mu^{t}), \\ Y^{t+1} &= Y^{t} + \mu^{t} h(L^{t+1}, \mathcal{S}^{t+1}, \Delta \tau^{t+1}), \end{split} \tag{7}$$

where superscript t denotes the iteration. In each iteration, the first problem in Equation (7) can be expressed as

$$L^{t+1} = \arg\min\_{L} \left\{ \|L\|\_{\*} + \frac{\mu^t}{2} \|H\_L - L\|\_F^2 \right\},\tag{8}$$

where H<sub>L</sub> = A ◦ τ + Σ<sub>i=1</sub><sup>n</sup> J<sub>i</sub>Δτ<sub>i</sub><sup>t</sup>ε<sub>i</sub>ε<sub>i</sub><sup>T</sup> − S<sup>t</sup> + Y<sup>t</sup>/µ<sup>t</sup>. The problem in Equation (8) has a simple closed-form solution given by the soft thresholding operator (Parikh et al., 2014). Suppose the singular value decomposition of H<sub>L</sub> is (U, Σ, V) = svd(H<sub>L</sub>); then L<sup>t+1</sup> = U 𝒮<sub>1/µ<sup>t</sup></sub>[Σ] V<sup>T</sup>, where 𝒮<sub>1/µ</sub>(x) = [x − 1/µ]<sub>+</sub> − [−x − 1/µ]<sub>+</sub> is the soft thresholding operator and [·]<sub>+</sub> = max(·, 0).
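The closed-form L-step can be written almost verbatim in NumPy. The sketch below implements the soft thresholding operator exactly as defined in the text and applies it to the singular values of H<sub>L</sub>; the function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, level):
    """[x - level]_+ - [-x - level]_+, applied element-wise."""
    return np.maximum(x - level, 0.0) - np.maximum(-x - level, 0.0)

def update_low_rank(H_L, mu):
    """L-step of Eq. (8): shrink the singular values of H_L by 1/mu."""
    U, sing, Vt = np.linalg.svd(H_L, full_matrices=False)
    return (U * soft_threshold(sing, 1.0 / mu)) @ Vt

print(soft_threshold(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), 1.0))
print(update_low_rank(np.diag([3.0, 0.5]), mu=1.0))
```

Singular values below 1/µ are set to zero, so the rank of the update drops as µ decreases.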

The second problem in Equation (7) can be rewritten as

$$\mathcal{S}^{t+1} = \arg\min\_{\mathcal{S}} \left\{ \frac{\mu^t}{2} \left\| H\_{\mathcal{S}} - \mathcal{S} \right\|\_F^2 + \lambda \Omega \left( \mathcal{S} \right) \right\},\tag{9}$$

where H<sub>S</sub> = A ◦ τ + Σ<sub>i=1</sub><sup>n</sup> J<sub>i</sub>Δτ<sub>i</sub><sup>t</sup>ε<sub>i</sub>ε<sub>i</sub><sup>T</sup> − L<sup>t+1</sup> + Y<sup>t</sup>/µ<sup>t</sup>. The problem in Equation (9) is the proximal operator associated with the structured sparsity norm, which can be calculated by solving a quadratic min-cost flow problem (Mairal et al., 2010).

Then, given the current estimates L<sup>t+1</sup> and S<sup>t+1</sup>, the solution of the third problem in Equation (7) can be calculated as

$$\Delta \tau^{t+1} = \sum\_{i=1}^{n} J\_i^\dagger (L^{t+1} + \mathcal{S}^{t+1} - A \circ \tau - Y^t / \mu^t) \epsilon\_i \epsilon\_i^T,\tag{10}$$

where J<sub>i</sub><sup>†</sup> denotes the Moore-Penrose pseudoinverse of J<sub>i</sub>. We summarize the solver for Equation (4) in Algorithm 1.
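Equation (10) amounts to one small least-squares solve per image, because multiplying by ε<sub>i</sub>ε<sub>i</sub><sup>T</sup> selects column i. The sketch below shows this update in NumPy; the function name and argument layout are hypothetical, and the Jacobians are assumed to be supplied as a list of m × p arrays computed elsewhere.

```python
import numpy as np

def update_delta_tau(jacobians, L, S, A_warped, Y, mu):
    """Eq. (10): column i of delta_tau is pinv(J_i) @ (L + S - A∘τ - Y/mu)[:, i]."""
    residual = L + S - A_warped - Y / mu
    columns = [np.linalg.pinv(J) @ residual[:, i] for i, J in enumerate(jacobians)]
    return np.stack(columns, axis=1)          # shape (p, n)

# Consistency check on synthetic data: recover a known parameter update.
rng = np.random.default_rng(1)
m, p, n = 6, 2, 3
jac = [rng.standard_normal((m, p)) for _ in range(n)]
true_step = rng.standard_normal((p, n))
A_w = rng.standard_normal((m, n))
target = A_w + np.stack([jac[i] @ true_step[:, i] for i in range(n)], axis=1)
est = update_delta_tau(jac, target, np.zeros((m, n)), A_w, np.zeros((m, n)), 1.0)
print(np.allclose(est, true_step))
```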

### 2.3. Recovery and Segmentation Framework

In our recovery and segmentation framework, we first employ the proposed TLS2D to align all MR images and simultaneously obtain low-rank and structured sparse images (see **Figure 2**). The low-rank images at this step blur the tumor region but cannot yet reliably recover the normal image appearance. To address this problem, we leverage the obtained structured sparse component to adjust the regularizing parameter λ in Equation (4), yielding an adaptive sparsity constraint.

Specifically, we compute the saliency maps of the MR images using the method of Perazzi et al. (2012). The saliency map indicates how strongly each pixel attracts human attention, with value 1 denoting the highest attention and 0 denoting no attention. Following Perazzi et al. (2012), in order to calculate the saliency of an image, we first abstract the image into perceptually homogeneous elements using the method of Achanta et al. (2012). We then employ a set of high-dimensional Gaussian filters (Adams et al., 2010) to calculate two contrast measures (i.e., the uniqueness and spatial distribution of elements), and use these two measures to predict the final saliency of each pixel. In pathological MR images, the most salient part shall be the tumor regions. We then obtain the tumor probability map of an image by computing the element-wise product of its binary structured sparse image and its corresponding saliency map, as shown in **Figure 2**.

FIGURE 5 | The structural similarity index (SSIM) between each of the original MR images and the corresponding recovered images by different methods. The "Initial" indicates the SSIM between the synthetic tumor images and the corresponding original images.


The tumor probability map indicates the probability of each pixel belonging to the tumor region. We denote the tumor probability maps by P = [vec(P<sub>1</sub>), vec(P<sub>2</sub>), ..., vec(P<sub>n</sub>)] ∈ ℝ<sup>m×n</sup>.

Finally, we use the tumor probability map to adaptively adjust the regularizing parameter λ in Equation (4). We define the adaptive TLS2D to obtain the final tumor segmentation and well recovered quasi-normal images:

$$\min\_{L, \mathcal{S}, \tau} \|L\|\_{\*} + \lambda (\mathbf{1} - P) \odot \Omega\left(\mathcal{S}\right) \quad \text{s.t.} \ A \circ \tau = L + \mathcal{S}, \tag{11}$$

where **1** ∈ ℝ<sup>m×n</sup> is the matrix whose elements all equal 1, λ(**1** − P) is the adaptive regularizing matrix, and ⊙ denotes the element-wise product. In this way, the sparsity constraints for tumor and normal regions are set differently, so that our TLS2D can reliably recover tumor regions while retaining normal regions.
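The construction of the adaptive regularizing matrix can be sketched as follows. This is only an illustration: the function name is hypothetical, the way the sparse component is binarized (a magnitude threshold, here a hypothetical `threshold` parameter) is an assumption, and the saliency map is assumed to be precomputed with values in [0, 1].

```python
import numpy as np

def adaptive_weights(sparse_component, saliency, lam, threshold=0.0):
    """Return the tumor probability map P and the weights lam * (1 - P)."""
    # Assumed binarization: a pixel belongs to the sparse image if its magnitude is nonzero.
    mask = (np.abs(sparse_component) > threshold).astype(float)
    P = mask * np.clip(saliency, 0.0, 1.0)    # element-wise product
    return P, lam * (1.0 - P)

sparse = np.array([[0.0, 2.0], [0.0, -1.0]])
saliency = np.array([[0.9, 0.8], [0.1, 0.5]])
P, W = adaptive_weights(sparse, saliency, lam=0.5)
print(P)
print(W)
```

Pixels that are both salient and present in the sparse component receive a small weight, so the decomposition is free to assign them to S, while the remaining pixels keep the full penalty λ.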

### 3. EXPERIMENTS AND RESULTS

The proposed TLS2D method was evaluated on both synthetic and real MR brain tumor images. We also extensively compared our method with state-of-the-art methods, including Robust Principal Component Analysis (RPCA) (Candès et al., 2011), Robust Alignment by Sparse and Low-rank decomposition (RASL) (Peng et al., 2012), and Spatially COnstraint LOw-Rank (SCOLOR) (Tang et al., 2018). Specifically, the RPCA method is one of the most classical and successful low-rank and sparse decomposition schemes; the RASL method considers spatial mismatch between different images and hence adds image alignment into the low-rank based decomposition procedure; the SCOLOR method imposes a spatial constraint on the sparse component to restrict its structured sparsity.

TABLE 1 | Dice values of different methods on synthetic and real MR brain tumor images.

*Best results are highlighted in bold.*

The metrics employed to quantitatively evaluate recovery and segmentation performance were the structural similarity index (SSIM) (Wang et al., 2004) and the Dice index (Chang et al., 2009), respectively. The SSIM index is a widely used metric that evaluates the similarity of two images using structural information. The SSIM of two images x and y is:

$$\text{SSIM}(x, y) = \frac{(2\mu\_{x}\mu\_{y} + c\_1)(2\sigma\_{xy} + c\_2)}{(\mu\_{x}^2 + \mu\_{y}^2 + c\_1)(\sigma\_{x}^2 + \sigma\_{y}^2 + c\_2)},\tag{12}$$

where µ<sub>x</sub> and µ<sub>y</sub> are the averages of x and y; σ<sub>x</sub><sup>2</sup> and σ<sub>y</sub><sup>2</sup> are the variances of x and y, respectively; σ<sub>xy</sub> is the covariance of x and y; c<sub>1</sub> and c<sub>2</sub> are two constants that stabilize the division. The Dice index is used for comparing the similarity of two regions, and can be calculated as:

$$\text{Dice} = \frac{2|G \cap T|}{|G| + |T|},\tag{13}$$

where T and G denote the segmented tumor region and the ground truth, respectively.
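Both metrics are straightforward to compute, as the following NumPy sketch shows. It is a simplified illustration: `global_ssim` evaluates Equation (12) once over the whole image, whereas SSIM as defined by Wang et al. (2004) is averaged over local windows, and the constants c<sub>1</sub> = (0.01·R)² and c<sub>2</sub> = (0.03·R)², with R the intensity range, are the commonly used defaults assumed here. The function names are hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Eq. (12) evaluated with whole-image statistics (single window)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def dice_index(ground_truth, segmentation):
    """Eq. (13): 2|G ∩ T| / (|G| + |T|) for binary masks."""
    g = np.asarray(ground_truth, dtype=bool)
    t = np.asarray(segmentation, dtype=bool)
    denom = g.sum() + t.sum()
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return 2.0 * np.logical_and(g, t).sum() / denom if denom else 1.0

img = np.linspace(0.0, 1.0, 16).reshape(4, 4)
print(global_ssim(img, img))
print(dice_index([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```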

### 3.1. Validation on Synthetic MR Brain Tumor Images

We first quantitatively evaluated the recovery performance of our method on synthetic tumor images. The synthetic MR brain tumor images are based on images from a public dataset LPBA40 (Shattuck et al., 2008). The LPBA40 dataset includes 40 T1-weighted MR normal brain images. Some example normal images from LPBA40 are shown in **Figure 4D**. We generated the synthetic tumor images by fusing tumor regions derived from a real MR tumor image dataset BRATS2018 (Menze et al., 2015) (see **Figure 4A**).

**Figure 4** visualizes some recovery and segmentation results obtained by our method. It can be observed that our method can reliably extract the tumor regions, and recover these regions with normal brain appearances. **Figure 5** further illustrates the quantitative SSIM values between the original MR images and the recovered images by different methods. Our TLS2D method consistently achieves the most similar image appearance to the original images from LPBA40. In addition, **Table 1** lists the Dice indices of the segmented tumor regions by different methods. Our TLS2D achieves the best segmentation performance.

### 3.2. Evaluation on Real MR Brain Tumor Images

We further evaluated the efficacy of our method on 124 real T2-weighted FLAIR MR brain tumor images from the dataset BRATS2018 (Menze et al., 2015). **Table 1** demonstrates that our TLS2D method achieves the best tumor segmentation results. **Figure 6** illustrates some example recovery and segmentation results obtained by our method. It can be seen from **Figure 6** that our method can achieve satisfactory recovery and segmentation performance. The recovered images by our method could infer the plausible brain structures, see red arrows in **Figure 6B**.

### 3.3. Application to Multi-Atlas Segmentation

The recovery of pathological regions with normal brain appearances is beneficial for other image computing tasks, such as multi-atlas segmentation (MAS). MAS registers multiple normal brain atlases to a new brain image and maps their corresponding anatomical labels onto that image to segment it. Conventional MAS methods may not perform well when images contain tumor regions, because the appearance changes induced by these regions cause difficulties in registering multiple atlases to the brain tumor image. We conducted multi-atlas segmentation based on the recovered images to demonstrate the benefit of our method on image recovery.

We used 40 T1-weighted MR images and their corresponding segmentation labels from LPBA40 (Shattuck et al., 2008) to conduct MAS. For each MAS run, we chose one image to generate a synthetic tumor image, and employed the remaining 39 images as the atlases. As shown in **Figure 7**, we then used the proposed TLS2D method to obtain the recovered image, and utilized an intensity-based non-rigid registration method (Myronenko and Song, 2010) to map the atlases to the recovered image for brain segmentation via majority-vote-based label fusion. **Figure 8** shows some MAS results obtained by using the recovered images and original images, respectively. It can be observed that the brain segmentations using our recovered images outperform those using original tumor images, especially in the regions the tumor occupied. It can also be observed from **Figure 8** that, compared to the SCOLOR method, our method produces much clearer recovered images. **Figure 9** further illustrates the average Dice indices of different brain regions of 40 segmented brain tumor images using MAS+original images, MAS+SCOLOR recovered images and MAS+our recovered images, respectively. The MAS using our recovered images consistently achieves better Dice indices than the MAS using original images and recovered images from SCOLOR, which demonstrates that our method is potentially useful for improving MAS when images contain pathological regions.
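The label fusion step can be illustrated with a small sketch. It assumes that each atlas label map has already been warped onto the target image by the registration step and flattened to a list of integer labels; the function name `majority_vote` and the tie-breaking rule (the label encountered first wins) are illustrative assumptions, not details given in the text.

```python
from collections import Counter


def majority_vote(warped_labels):
    """Fuse per-atlas label maps (flat lists of equal length), voxel by voxel."""
    n_voxels = len(warped_labels[0])
    if any(len(labels) != n_voxels for labels in warped_labels):
        raise ValueError("all label maps must have the same number of voxels")
    fused = []
    for votes in zip(*warped_labels):
        # most_common(1) returns the label with the highest count;
        # among tied labels, the one seen first is returned.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Three hypothetical atlases voting on four voxels.
atlas_labels = [
    [0, 1, 2, 2],
    [0, 1, 1, 2],
    [1, 1, 2, 0],
]
print(majority_vote(atlas_labels))
```

In the experiment described above, 39 atlases would vote at every voxel of the recovered image.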

### 4. DISCUSSION AND CONCLUSION

In this study, we have proposed a novel low-rank based method, called transformed low-rank and structured sparse decomposition (TLS2D), for the reliable recovery and segmentation of pathological MR brain images. By integrating the structured sparsity, image alignment, and adaptive spatial constraint into a unified matrix decomposition framework, our method is robust for extracting pathological regions, and is also reliable for recovering quasi-normal MR appearances. The recovered image is beneficial for subsequent image computing procedures, such as atlas-based segmentation. We have compared the proposed TLS2D method with several state-of-the-art low-rank based approaches on synthetic and real MR brain images. Regarding these compared methods, the RPCA method is a conventional low-rank and sparse decomposition method; the RASL method embeds image alignment into the LSD framework; the SCOLOR method imposes a spatial constraint on the sparse component. Experimental results show that our method consistently outperforms all compared methods, which demonstrates the contribution of the proposed transformed low-rank and structured sparse decomposition with adaptive sparse constraint on simultaneous recovery and segmentation.

Computer-aided methods that can assist clinicians in analyzing MR brain tumor scans are of essential significance for improved diagnosis, treatment planning and patient follow-up. Automated tumor segmentation is the primary research task for analyzing pathological images, and has been extensively investigated in the literature (Gordillo et al., 2013; Menze et al., 2015; Zhou et al., 2017). However, in addition to the tumor segmentation task, the presence of pathologies in MR images poses challenges in other image computing tasks, such as intensity-/feature-based image registration (Sotiras et al., 2013) and atlas-based segmentation of brain structures (Cabezas et al., 2011), due to the structure and appearance changes of pathological brain images. Thus the recovery of pathological regions with normal brain appearances is beneficial for most image computing procedures. To this end, we integrate the registration, segmentation and recovery procedures into a unified decomposition framework. The proposed TLS2D is a generic method for analyzing MR brain tumor scans. It is worth noting that although our method is able to provide recovered images with quasi-normal brain appearances, the recovered regions may have some artifacts located around the original tumor boundary, as shown in **Figure 8**. This is mainly due to the difference in sparse constraints between the inner boundary (tumor region) and the outer boundary (normal region). Even so, compared to the original pathological images, our recovered images are more similar to normal brain images, and are thus more convenient to use for other image computing tasks, such as the multi-atlas segmentation shown in section 3.3.

The tumor region usually occupies a contiguous portion of the MR brain image; thus the distribution of tumor pixels is not pixel-wise sparse but structurally sparse. This motivates us to model the tumor region using the structured sparsity norm. Because the structured sparsity norm described in Jia et al. (2012) effectively encourages the sparse component to distribute in structured patterns, and is easy to implement in the low-rank and sparse decomposition scheme, we employ this structured sparsity norm (Jia et al., 2012) to model the tumor region in this study. Note that the structured sparsity (Jia et al., 2012) could be replaced by sparsity in a different basis (e.g., a wavelet basis), but such sparsity needs to take into account the spatial connection of the sparse pixels.

The tumor segmentation performance of our method could still be improved, especially compared with state-of-the-art deep learning based segmentation models (Pereira et al., 2016; Havaei et al., 2017). However, these deep learning based methods typically require an enormous amount of high-quality labeled images to train a model for medical image segmentation. Some recent approaches (Mlynarski et al., 2018; Shah et al., 2018) proposed a mixed-supervision scheme, which employed a minority of images with high-quality per-pixel labels and a majority of images with coarse-level annotations (bounding boxes, landmarks or image-level annotations) to train the deep neural networks; even so, preparing annotations such as bounding boxes and landmarks is still laborious. Compared with deep learning based methods, our advantage is that the proposed TLS2D does not require labeled images to train a segmentation model; it extracts tumor regions by analyzing normal MR image appearances from unlabeled population data. Moreover, the segmentation results of our method can ease the image labeling procedure for clinicians. Our segmentation results could further be used as label information for the semi-supervised training of deep learning based segmentation models (Papandreou et al., 2015; Bai et al., 2017).

### AUTHOR CONTRIBUTIONS

CL, YW, TW, and DN: responsible for study design; CL: implemented the research and conducted the experiments; CL and YW: conceived the experiments, analyzed the results, wrote the main manuscript text and prepared the figures. All authors reviewed the manuscript.

### ACKNOWLEDGMENTS

The work described in this paper was supported in part by the National Natural Science Foundation of China (No. 61701312), in part by the Natural Science Foundation of Shenzhen University (No. 2018010), and in part by the Shenzhen Peacock Plan under Grant KQTD2016053112051497.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Lin, Wang, Wang and Ni. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Improved Wavelet Threshold for Image De-noising

#### Yang Zhang<sup>1</sup> \*, Weifeng Ding<sup>2</sup> , Zhifang Pan<sup>3</sup> and Jing Qin<sup>4</sup>

*<sup>1</sup> School of Information and Engineering, The First Affiliated Hospital of WenZhou Medical University, WenZhou Medical University, WenZhou, China, <sup>2</sup> The Chinese People's Liberation Army 118 Hospital, WenZhou, China, <sup>3</sup> School of Information and Engineering and Information Technology Centre, WenZhou Medical University, WenZhou, China, <sup>4</sup> School of Nursing, Hong Kong Polytechnic University, Kowloon, Hong Kong*

With the development of communication technology and network technology, as well as the rising popularity of digital electronic products, images have become an important carrier of outside information. However, images are vulnerable to noise interference during collection, transmission and storage, which decreases image quality. Therefore, image noise reduction is necessary to obtain higher-quality images. Owing to its multi-resolution analysis, decorrelation, low entropy, and flexible choice of bases, the wavelet transform has become a powerful tool in the field of image de-noising, and it has developed rapidly in applied mathematics. De-noising methods based on the wavelet transform have been proposed and have achieved good results, but shortcomings still remain. Traditional threshold functions have some deficiencies in image de-noising: a hard threshold function is discontinuous, whereas a soft threshold function causes constant deviation. To address these shortcomings, a method for removing image noise is proposed in this paper. First, the method decomposes the noisy image to obtain the wavelet coefficients. Second, the high-frequency wavelet coefficients are thresholded using the improved threshold function. Finally, the de-noised image is reconstructed from the estimated wavelet coefficients. Experimental results show that the method discussed in this paper is better than traditional hard threshold and soft threshold de-noising methods, in terms of both objective measures and subjective visual effects.

Keywords: wavelet threshold, wavelet transform, image de-noising, MSE, PSNR

## INTRODUCTION

During transmission, detection and collection, signals are contaminated to varying degrees by random noise, owing to environmental influences and the nature of the work. Thus, signal de-noising is necessary. How to filter out the noise in a real signal to obtain effective information is a current research hotspot. The wavelet transform provides local time-frequency analysis, and its de-noising results are relatively good. Thus, it is very widely applied.

In recent years, as interdisciplinary research between mathematics and other disciplines has deepened, fuzzy mathematics, mathematical morphology, intelligent optimization, neural networks, and wavelet theory and technology have been applied to image processing, and some new noise-suppression methods have emerged. In the early stage, traditional de-noising relied on low-pass filtering, which mainly includes median filtering, linear filtering and adaptive filtering.

### Edited by:

*Nianyin Zeng, Xiamen University, China*

#### Reviewed by:

*Qin Ma, The Ohio State University, United States Quan Zou, University of Electronic Science and Technology of China, China Leyi Wei, The University of Tokyo, Japan*

> \*Correspondence: *Yang Zhang zhangyang0408@wmu.edu.cn*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *14 December 2018* Accepted: *15 January 2019* Published: *08 February 2019*

#### Citation:

*Zhang Y, Ding W, Pan Z and Qin J (2019) Improved Wavelet Threshold for Image De-noising. Front. Neurosci. 13:39. doi: 10.3389/fnins.2019.00039*

During image collection, coding and transmission, all images are affected by visible or invisible noise to varying degrees. Image noise is divided into three main categories. The first is Gaussian noise, a form of electronic noise produced by the random thermal motion of electronic components in sensitive elements. The second is Poisson noise, which is produced during photoelectric conversion and is most apparent under weak light. The third is particle noise, which is produced during photography and can be found under a microscope: an image that looks smooth in a photograph displays random particles under the microscope (Auber and Kornprobst, 2006). The purpose of image processing is to perform operations on the digitized image information in order to improve the image quality or to achieve a desired effect. For example, the non-uniform sensitivity of sensitive components in photoelectric conversion, transmission errors and human factors during digitization diminish the quality of an image, which then contains various random noises. Sometimes, this random noise greatly affects the image quality. A noisy image affects not only the visual effect of the viewed image, but also subsequent image processing. Image de-noising aims to retain useful information and reduce or eliminate the interference of noise in the image. De-noising is a key link in image processing. In practical applications, this process is often used as a pre-processing step for image processing and recognition, and is the basis of subsequent high-level image processing. Thus far, studies on image de-noising have focused on this effect and have achieved great progress. However, with the emergence of new problems, people have higher standards of image quality. The traditional image noise removal algorithm is based on the spectrum distribution.
In the frequency domain, wavelet de-noising is a commonly used method to separate useful information from noise in images (Johnstone and Silverman, 2005; Othman and Qian, 2006). Other methods include the Markov field model, partial differential equations and the LP regularization method (Baske, 2011). The latter also has a drawback when regularizing noise: the convergence rate is slow in regions with minimal changes. Sinha and Dougherty (Thomas Asaki and Kevin Vixie, 2010) combined fuzzy mathematics with mathematical morphology and applied it to image processing. In recent years, the feed-forward BP neural network was proposed as a de-noising filter (Noh et al., 2011; Swami et al., 2017). The wavelet transform has also greatly contributed to image de-noising (Michal et al., 2006; Apotsos et al., 2008; Patil, 2015). The correlation coefficient method exploits the correlation between the wavelet coefficients at corresponding positions across scales: signal coefficients are correlated, whereas noise is uncorrelated or only weakly correlated across scales, which allows the noise to be removed. Noise is mainly concentrated in high frequencies, so processing the high-frequency part can achieve noise reduction. In 2006, Elad and Aharon (2006) proposed a de-noising method on the basis of sparse representation and KSVD dictionary learning. The dictionary learned by the KSVD algorithm (Oey et al., 2013) was used for image de-noising. However, the KSVD algorithm ignores the similarity of the image, and it cannot use the detailed information of the image when learning the dictionary on a single scale. At present, the popular multi-scale directional transforms mainly include the curvelet transform (Palakkal and Prabhu, 2012), the contourlet transform and the non-subsampled contourlet transform (Amisha et al., 2013).
The multi-scale transformation methods can exploit the inherent geometric features of natural image data and, relative to wavelet transforms, remarkably improve direction selectivity. The 3D block matching algorithm (BM3D) (Lebrun, 2012) is an effective de-noising method for Gaussian noise. This algorithm can preserve information such as edges and textures. BM3D comprehensively utilizes non-locality, linear transformation thresholding, Wiener filtering, and sparse representation. BM3D also reveals details of different sub-block classes and retains the basic characteristics of each sub-block. This method can improve the resolution of noisy images; however, the computational cost is very high, as each similar block needs to be computed. Pizarro et al. (2010) selected non-local constraints as fidelity terms. Under this similarity measure, the error between the noisy image and the real image was minimal. Moreover, the high-order smoothing of the de-noised image was used as a regularization term, and a non-local data smoothing model was proposed. The model was applied to the similarity between images to obtain a more general model. An unsuitable threshold can easily produce a Gibbs phenomenon (Huang et al., 2005; Chen et al., 2005). Mallat presented alternating projection (AP) for de-noising. The AP method (Mallat and Hwang, 1992; Zhu et al., 2017) obtains the modulus maxima at each scale after the signal is differentiated on each scale. Then, the non-propagating modulus maxima should restore the signal. The disadvantage of the alternating projection method is that the computational cost is very high and the iteration is prone to instability. Li proposed a novel hybrid model based on an extreme learning machine, k-nearest neighbor regression and wavelet de-noising (Li et al., 2017). Using a linear model to reduce noise leads to the loss of detail in textured images. The stationary wavelet transform (SWT) uses time invariance to achieve image de-noising (Wang et al., 2003).
Some researchers (Zou et al., 2015; Liu et al., 2017) proposed an approach that searches for candidate matching blocks along the edges, which adapts well to image details. All similar blocks form a 3D group. De-noising is performed by shrinking the coefficients of the 3D transform applied to these groups. The non-linear diffusion filtering method based on PDEs is a non-linear anisotropic de-noising method (Lee et al., 2005). A non-linear de-noising model can over-smooth images. Scholars have also studied how to improve the speed of de-noising. Non-linear diffusion techniques and PDE-based variational models are very popular in image restoration and processing. Researchers (Fazli et al., 2010; Zeng et al., 2012, 2018) proposed that a heuristic method, such as Particle Swarm Optimization (PSO), be used for complex PDE parameter tuning by minimizing the Structural SIMilarity (SSIM) measure. Tasdizen (2009) enhanced the algorithm efficiency by clustering the blocks with PCA, selecting the similarity between block features as the measurement of block similarity, and optimally estimating the parameters. Mahmoudi and Sapiro (2005) proposed to accelerate the algorithm by eliminating irrelevant neighborhoods in the weighted averaging process. Currently, many researchers have proposed combined approaches to de-noising. For example, machine learning and random walks are combined with traditional noise removal methods (Huang et al., 2006; Jieru et al., 2016; Liu et al., 2018). Zeng et al. (2017) proposed de-noising and deblurring gold immune chromatographic strip images via gradient projection algorithms.

Presently, preserving image details while removing noise has received increased attention. In this paper, we present an improved threshold for de-noising MRI images. Experimental results show that its de-noising effect is better than that of the hard and soft thresholds.

### PRINCIPLE OF WAVELET DE-NOISING METHOD

In current research, there are numerous ways to eliminate noise from images, and wavelet de-noising is very widely applied. The wavelet method for removing noise has numerous advantages: the algorithm is simple to implement, and it also de-noises particularly well. This method has therefore achieved great results in practical applications. Wavelet threshold de-noising is based on the strong decorrelation property of the wavelet transform. After the wavelet transform, the energy of the signal is concentrated in a few large wavelet coefficients, whereas the energy of the noise is not concentrated, because the noise does not share this correlation. Wavelet coefficients with large amplitudes are therefore mostly signal, whereas coefficients with small amplitudes are largely noise. The threshold is set on the basis of this property. The hard and soft threshold functions were proposed by Donoho (1995).

The hard threshold is expressed as follows:

$$\hat{w}_{j,k} = \begin{cases} w_{j,k}, & |w_{j,k}| \geq \lambda \\ 0, & |w_{j,k}| < \lambda \end{cases} \tag{1}$$

The soft threshold is calculated as follows:

$$\hat{w}_{j,k} = \begin{cases} \text{sgn}(w_{j,k})\,(|w_{j,k}| - \lambda), & |w_{j,k}| \geq \lambda \\ 0, & |w_{j,k}| < \lambda \end{cases} \tag{2}$$

The Semi-threshold function is expressed as follows:

$$\hat{w}_{j,k} = \begin{cases} 0, & |w_{j,k}| \leq \lambda_1 \\ \text{sgn}(w_{j,k})\, \dfrac{\lambda_2 (|w_{j,k}| - \lambda_1)}{\lambda_2 - \lambda_1}, & \lambda_1 < |w_{j,k}| \leq \lambda_2 \\ w_{j,k}, & |w_{j,k}| > \lambda_2 \end{cases} \tag{3}$$
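The three classical threshold functions can be sketched for a single wavelet coefficient as follows. This is an illustrative sketch rather than the authors' MATLAB code; the function names are hypothetical, and the semi-threshold is assumed to use a lower threshold λ<sub>1</sub> and an upper threshold λ<sub>2</sub> (with λ<sub>1</sub> < λ<sub>2</sub>) as the boundaries of its three cases.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def hard_threshold(w, lam):
    """Equation (1): keep coefficients with |w| >= lam, zero the rest."""
    return w if abs(w) >= lam else 0.0


def soft_threshold(w, lam):
    """Equation (2): shrink kept coefficients toward zero by lam."""
    return sign(w) * (abs(w) - lam) if abs(w) >= lam else 0.0


def semi_threshold(w, lam1, lam2):
    """Equation (3), assuming lam1 < lam2 bound the three cases."""
    a = abs(w)
    if a <= lam1:
        return 0.0
    if a <= lam2:
        return sign(w) * lam2 * (a - lam1) / (lam2 - lam1)
    return w


for w in (-3.0, -1.5, 0.5, 1.5, 3.0):
    print(w, hard_threshold(w, 2.0), soft_threshold(w, 2.0),
          semi_threshold(w, 1.0, 2.0))
```

Printing the three outputs side by side shows the jump of the hard threshold at λ and the constant offset λ that the soft threshold subtracts from every retained coefficient.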

Although the soft, hard and semi-threshold functions have achieved some results, they all still have drawbacks. The hard threshold function can better preserve boundary information; however, it is discontinuous at the threshold, so the de-noised result remains rough. Furthermore, its application has some limitations: this function only processes wavelet coefficients smaller than the threshold and does not alter wavelet coefficients larger than the threshold, even though noise is often mixed in with the wavelet coefficients greater than the threshold. Therefore, the de-noising result is relatively poor, and the resulting estimated signal produces additional oscillations. The soft threshold function has improved overall continuity, and the de-noising result is relatively smooth. However, its higher-order derivatives are difficult to handle, which causes de-noising distortion. Moreover, the soft threshold function applies a constant shrinkage to the wavelet coefficients larger than the threshold, which directly affects the degree of approximation of the reconstructed signal. The semi-threshold function not only retains large coefficients, but is also continuous. However, its computational complexity is higher, and determining its thresholds is difficult. Therefore, each traditional threshold function has its own defects and certain limitations in its application, which affect the de-noising result.

In this article, we propose a new threshold function that compensates for the deficiencies of the soft and hard thresholds. Based on the results of subjective and objective experiments, we conclude that de-noising with the improved threshold function is better than hard and soft threshold de-noising.

### IMPROVED WAVELET THRESHOLD DE-NOISING METHOD

The basic idea of threshold de-noising with hard and soft threshold functions is to remove relatively small wavelet coefficients as much as possible. When a hard threshold function is used, although it preserves the effective part of the original signal relatively well, the reconstructed signal will be very rough. When a soft threshold function is used, the reconstructed signal will easily lose useful signal content.

The key to threshold shrinkage is the choice of the threshold and of the threshold function. If the threshold is too large, details will be lost. If the threshold is too small, noise will remain. Although hard threshold de-noising is simple and easy to implement, it generates a pseudo-Gibbs phenomenon at the image boundary. In comparison with hard thresholds, soft thresholds are continuous, and the structure of the wavelet coefficients is maintained, thereby effectively reducing the pseudo-Gibbs phenomenon. However, because wavelet coefficients with an absolute value greater than the threshold are also shrunk, the image edges become blurred. To achieve improved de-noising results, we have enhanced the threshold function.

### Improved Threshold Functions

The improved threshold function is as follows:

$$\hat{w}_{j,k} = \begin{cases} w_{j,k} - \dfrac{w_{j,k}}{2} \cdot \dfrac{1}{100^{|w_{j,k}|/\lambda} - 99}, & |w_{j,k}| > \lambda \\ \dfrac{w_{j,k}}{2} \cdot \dfrac{1}{1 - \log_{100}\left(|w_{j,k}|/\lambda\right)}, & |w_{j,k}| \leq \lambda \end{cases} \tag{4}$$

The threshold function after adding an adjustment factor is as follows:

$$\hat{w}_{j,k} = \begin{cases} w_{j,k} - \dfrac{w_{j,k} \cdot m}{2} \cdot \dfrac{1}{100^{|w_{j,k}|/\lambda} - 99}, & |w_{j,k}| > \lambda \\ \dfrac{w_{j,k} \cdot m}{2} \cdot \dfrac{1}{1 - \log_{100}\left(|w_{j,k}|/\lambda\right)}, & |w_{j,k}| \leq \lambda \end{cases} \tag{5}$$

where m ∈ Z.

When |w<sub>j,k</sub>| → λ<sup>+</sup> (taking w<sub>j,k</sub> > 0), the subtracted term in the first case of Equation (4) has the limit given below, so that the estimate itself tends to λ − λ/2 = λ/2:

$$\lim_{|w_{j,k}| \to \lambda^{+}} \left( \frac{w_{j,k}}{2} \cdot \frac{1}{100^{|w_{j,k}|/\lambda} - 99} \right) = \frac{\lambda}{2} \tag{6}$$

When |w<sub>j,k</sub>| → λ<sup>−</sup> (taking w<sub>j,k</sub> > 0), the second case of Equation (4) has the limit:

$$\lim_{|w_{j,k}| \to \lambda^{-}} \left( \frac{w_{j,k}}{2} \cdot \frac{1}{1 - \log_{100}\left(|w_{j,k}|/\lambda\right)} \right) = \frac{\lambda}{2} \tag{7}$$

The two one-sided limits agree, so the threshold function is continuous at the points ±λ; it is also high-order differentiable. The second case slowly approaches zero. Here in adjusts the shape of the threshold function; m adjusts the variation of the wavelet coefficients; k determines the asymptote of the threshold function. When k = 1, the proposed threshold function approaches the hard threshold function. When k = 0, it approaches the soft threshold function. Thus, by adjusting the parameter k, the proposed threshold function can vary between the soft threshold function and the hard threshold function.

The new threshold function proposed in this paper combines the advantages of the soft and hard threshold functions. This approach enables a smooth transition of the wavelet threshold curve. It achieves the same continuity in the wavelet domain as the traditional soft threshold function, which remedies the discontinuity of the hard threshold function. Moreover, the pseudo-Gibbs phenomenon can be avoided. The new threshold function is high-order differentiable on both intervals, |w<sub>j,k</sub>| > λ and |w<sub>j,k</sub>| ≤ λ. This differentiability eliminates the oscillation generated in threshold de-noising and reduces the over-suppression of the detail coefficients. Thus, the signal after reconstruction can be made smoother.
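The improved threshold function can be sketched for a single coefficient as below. The sketch rests on a stated assumption: the exponent and the logarithm argument are read as |w|/λ, which is the reading under which both one-sided limits at λ equal λ/2 for m = 1. The function name, the default m = 1, the explicit handling of w = 0, and the overflow guard are illustrative choices, not part of the published method.

```python
import math


def improved_threshold(w, lam, m=1.0):
    """Sketch of Equations (4)/(5), reading the exponent and log argument as |w|/lam."""
    a = abs(w)
    if a == 0.0:
        return 0.0  # limit of the second case as |w| -> 0
    ratio = a / lam
    if a > lam:
        if ratio > 150.0:
            # 100**ratio would overflow a float; the correction term is negligible here.
            return w
        return w - (w * m / 2.0) / (100.0 ** ratio - 99.0)
    log100 = math.log(ratio) / math.log(100.0)
    return (w * m / 2.0) / (1.0 - log100)


lam = 2.0
for w in (0.5, 1.9999, 2.0, 2.0001, 3.0, 10.0):
    print(w, improved_threshold(w, lam))
```

With m = 1 the printed values just below and just above λ are both close to λ/2, and large coefficients are returned almost unchanged, which matches the behavior described in the text.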

### Improved Threshold Selection

The threshold is vital in image threshold de-noising, and Donoho (1995) proposed a unified threshold method.

$$
\lambda = \delta \sqrt{2 \log(M \ast N)} \tag{8}
$$

However, this method is not ideal in practical applications and causes over-segmentation (Grace et al., 2000). Analysis shows that, as the number of wavelet decomposition layers increases, the energy of the noise becomes smaller and smaller, while the energy of the image information becomes increasingly larger. Wavelet decomposition is performed in accordance with the high- and low-frequency characteristics of a wavelet. We therefore propose the following hierarchical threshold estimation.

$$
\lambda = \delta \sqrt{2 \log(M \ast N)} \ast (1 - \alpha \ast j) \tag{9}
$$

where j is the resolution scale, M × N represents the image size, and α (0 < α < 1) denotes the adjustment parameter. When we calculate the high-frequency threshold, α takes a smaller value, resulting in a slightly larger threshold. When we calculate the low-frequency threshold, α takes a larger value, resulting in a slightly smaller threshold. By adjusting the parameter α, the accuracy of the threshold estimation is finely improved.
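The two threshold rules can be sketched as follows. Several points are assumptions made for illustration: δ is treated as an estimate of the noise standard deviation, the logarithm is taken as the natural logarithm, and the scale factor in Equation (9) is read as (1 − α·j). The function names are hypothetical.

```python
import math


def universal_threshold(delta, rows, cols):
    """Equation (8): delta * sqrt(2 * log(M * N)), natural log assumed."""
    return delta * math.sqrt(2.0 * math.log(rows * cols))


def level_threshold(delta, rows, cols, alpha, j):
    """Equation (9), read here as the universal threshold scaled by (1 - alpha * j)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return universal_threshold(delta, rows, cols) * (1.0 - alpha * j)


# Hypothetical values: a 256 x 256 image with noise level delta = 0.1.
for j in (1, 2, 3):
    print(j, level_threshold(0.1, 256, 256, alpha=0.1, j=j))
```

Under this reading the threshold decreases as the scale index j grows, and a smaller α gives a larger threshold at the same scale, in line with the description above.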

### EXPERIMENT ANALYSIS

In this paper, the experimental analysis consists of two parts: objective evaluation and subjective evaluation.

### Objective Evaluation

To illustrate the effectiveness of the wavelet threshold algorithm in medical image de-noising, the traditional threshold methods and the proposed method were compared. The objective evaluation indices are the peak signal-to-noise ratio (PSNR) and the mean square error (MSE).

The PSNR is expressed as follows:

$$PSNR = 10 \ast \lg(\frac{255^2}{MSE}) \tag{10}$$

The MSE is calculated as follows:

$$MSE = \frac{1}{M \ast N} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( g(i, j) - \hat{g}(i, j) \right)^2 \tag{11}$$

where M ∗ N is the size of the image, g(i, j) denotes the original image, and ĝ(i, j) represents the restored image. Our data were obtained from the Chinese People's Liberation Army 118 Hospital. The results shown in **Table 1** compare the hard threshold method, the soft threshold method and the proposed method.
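The two objective indices can be computed with a few lines of code. The sketch below assumes 8-bit images stored as nested lists (rows of pixel values) and a peak value of 255, as in Equation (10); the function names and the toy images are illustrative and are not the data used in the experiments.

```python
import math


def mse(original, restored):
    """Equation (11): mean of squared pixel differences over an M x N image."""
    rows, cols = len(original), len(original[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            total += (original[i][j] - restored[i][j]) ** 2
    return total / (rows * cols)


def psnr(original, restored, peak=255.0):
    """Equation (10): 10 * log10(peak^2 / MSE); infinite for identical images."""
    err = mse(original, restored)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


# Toy 2 x 2 example (hypothetical pixel values).
g = [[10, 20], [30, 40]]
g_hat = [[12, 18], [33, 40]]
print(mse(g, g_hat), psnr(g, g_hat))
```

A better de-noising result corresponds to a smaller MSE and a larger PSNR.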

Through simulation experiments, the data in **Tables 1**–**3** show that the proposed method obtains a larger peak signal-to-noise ratio and a smaller mean square error. Thus, our improved wavelet method de-noises more effectively.

### Subjective Evaluation

The experiment was programmed in MATLAB 2014b. MRI brain images were used to demonstrate the effectiveness of the improved threshold function in medical image de-noising. After decomposition, the threshold was calculated using Equation (9), and the wavelet coefficients were processed with the corresponding threshold. Finally, the image was reconstructed to obtain the de-noised image. The subjective experimental results show that the method proposed in this paper achieves improved de-noising effects. De-noising was evaluated with added noise of mean 0 and variances of 0.01, 0.03, 0.05, and 0.1. The experimental results are shown in **Figures 1**–**3**.

TABLE 1 | De-noising results of different methods for MRI 1.

TABLE 2 | De-noising results of different methods for MRI 2.

TABLE 3 | De-noising results of different methods for MRI 3.
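The decompose, threshold, reconstruct procedure can be illustrated on a one-dimensional signal. This sketch is not the authors' MATLAB implementation: it uses a single-level Haar transform as a stand-in, because the wavelet basis and the number of decomposition levels are not stated here, and it applies the classical soft threshold to the detail coefficients. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """Single-level Haar transform of an even-length signal."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def soft(w, lam):
    """Classical soft threshold applied to one coefficient."""
    return math.copysign(max(abs(w) - lam, 0.0), w)


def denoise(signal, lam, shrink=soft):
    """Decompose, shrink only the high-frequency (detail) part, reconstruct."""
    approx, detail = haar_forward(signal)
    detail = [shrink(d, lam) for d in detail]
    return haar_inverse(approx, detail)


noisy = [1.0, 3.0, 5.0, 5.0]
print(denoise(noisy, 0.5))
```

Any other shrinkage rule that takes a coefficient and a threshold can be passed through the `shrink` argument, and an image version would apply the same three steps to the high-frequency sub-bands of a two-dimensional transform.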

After noise was added, the original image was almost drowned out by it. When soft and hard thresholds were used to remove the noise, considerable noise remained in the image. As the noise level increased, the images de-noised with the soft and hard thresholds appeared smoother. The proposed method removed the noise in the image, and the resulting image was relatively clear. Comparing the experiments, we suggest that the proposed method has a better effect than the hard and soft threshold methods.

### CONCLUSION

In this study, we analyzed the shortcomings of traditional hard and soft threshold functions for medical image de-noising. We proposed an improved threshold function for de-noising. The mediation factor was increased to find the best estimate of the wavelet coefficient function. The wavelet coefficients were smoothed by the soft threshold function. Thus, the image looks smooth when noise is removed via the soft threshold. Subjective and objective evaluations show that the hard threshold function performs better than the soft threshold function. However, the hard threshold produces jump points in the signal that generate additional oscillations, so the reconstructed signal is not smooth and exhibits the ringing effect. The improved threshold selection, based on the multi-layer wavelet transform, overcomes the disadvantages of the soft and hard thresholds. The experimental results showed that the proposed method effectively improves on the de-noising performance of both the soft and hard threshold functions.
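For reference, the two traditional threshold functions contrasted above can be sketched as follows. These are the standard textbook definitions; the improved threshold function proposed in this paper is defined earlier and is not reproduced here. The coefficient values and the threshold `t` are hypothetical, and whether a coefficient exactly equal to the threshold is kept or zeroed is a convention chosen here for illustration.

```python
def hard_threshold(w, t):
    """Keep a coefficient unchanged if its magnitude exceeds t, otherwise zero it."""
    return w if abs(w) > t else 0.0

def soft_threshold(w, t):
    """Zero small coefficients and shrink the remaining ones toward zero by t."""
    if abs(w) <= t:
        return 0.0
    return (abs(w) - t) if w > 0 else -(abs(w) - t)

coeffs = [-3.0, -1.0, 0.5, 1.5, 4.0]  # hypothetical wavelet coefficients
t = 1.0                               # hypothetical threshold
hard = [hard_threshold(w, t) for w in coeffs]
soft = [soft_threshold(w, t) for w in coeffs]
print(hard)
print(soft)
```

The hard threshold is discontinuous at ±t, which is the source of the jump points mentioned above, whereas the soft threshold is continuous but biases every retained coefficient by t.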

### AUTHOR CONTRIBUTIONS

YZ conceived the study, designed the model, and wrote the draft. WD provided, acquired, pre-processed, and analyzed the data. ZP and JQ provided critical revision. All authors consented to this submission.

### FUNDING

This work was supported by the Scientific Research Task of the Department of Education of Zhejiang (Y201328002 & kg2015243), the Talents stating Task of Wenzhou Medical University (QTJ11008), the Wenzhou Science and Technology Bureau (Y20150086 & 2018ZG016), and the Zhejiang Provincial Natural Science Foundation of China (LY16F030010).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Ding, Pan and Qin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Brain Differences Between Men and Women: Evidence From Deep Learning

Jiang Xin<sup>1</sup>, Yaoxue Zhang<sup>1</sup>, Yan Tang<sup>1,2</sup>\* and Yuan Yang<sup>3</sup>\*

*<sup>1</sup> School of Computer Science and Engineering, Central South University, Changsha, China, <sup>2</sup> Department of Neurology, Xiangya Hospital, Central South University, Changsha, China, <sup>3</sup> Department of Physical Therapy and Human Movement Sciences, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States*

Do men and women have different brains? Previous neuroimaging studies sought to answer this question based on morphological differences between specific brain regions, but reported conflicting results. In the present study, we use a deep learning technique to address this challenge based on a large open-access diffusion MRI database recorded from 1,065 young healthy subjects, including 490 men and 575 women. Unlike the commonly used 2D Convolutional Neural Network (CNN), we propose a 3D CNN method with a newly designed structure including three hidden layers in cascade with a linear layer and a terminal Softmax layer. The proposed 3D CNN was applied to the maps of fractional anisotropy (FA) in the whole brain as well as in specific brain regions. An entropy measure was applied to the lowest-level image features extracted from the first hidden layer to examine the difference in brain structure complexity between men and women. The results were compared with those obtained using the Support Vector Machine (SVM) and Tract-Based Spatial Statistics (TBSS). The proposed 3D CNN yielded a better classification result (93.3%) than the SVM (78.2%) on the whole-brain FA images, indicating that gender-related differences likely exist across the whole brain. Moreover, high classification accuracies were also obtained in several specific brain regions, including the left precuneus, the left postcentral gyrus, the left cingulate gyrus, the right orbital gyrus of the frontal lobe, and the left occipital thalamus in the gray matter, and the middle cerebellar peduncle, the genu of the corpus callosum, the right anterior corona radiata, the right superior corona radiata, and the left anterior limb of the internal capsule in the white matter. This study provides new insight into the structural differences between men and women, which highlights the importance of considering sex as a biological variable in brain research.

Keywords: gender difference, deep learning, neural network, diffusion MRI, entropy

### INTRODUCTION

Recent studies indicate that gender may have a substantial influence on human cognitive functions, including emotion, memory, and perception (Cahill, 2006). Men and women appear to have different ways of encoding memories, sensing emotions, recognizing faces, solving certain problems, and making decisions. Since the brain controls cognition and behavior, these gender-related functional differences may be associated with the gender-specific structure of the brain (Cosgrove et al., 2007).

#### Edited by:

*Nianyin Zeng, Xiamen University, China*

#### Reviewed by:

*Gan Huang, UCLouvain, Belgium Xia-an Bi, Hunan Normal University, China*

\*Correspondence:

*Yan Tang tangyan@csu.edu.cn Yuan Yang yuan.yang@northwestern.edu*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *13 December 2018* Accepted: *15 February 2019* Published: *08 March 2019*

#### Citation:

*Xin J, Zhang Y, Tang Y and Yang Y (2019) Brain Differences Between Men and Women: Evidence From Deep Learning. Front. Neurosci. 13:185. doi: 10.3389/fnins.2019.00185*


Diffusion tensor imaging (DTI) is an effective tool for characterizing nerve fiber architecture. By computing fractional anisotropy (FA) parameters in DTI, the anisotropy of nerve fibers can be quantitatively evaluated (Lasi et al., 2014). Differences in FA values are thought to be associated with developmental processes of axon caliber, myelination, and/or fiber organization of nerve fiber pathways. By computing FA, researchers have revealed subtle changes related to normal brain development (Westlye et al., 2009), learning (Golestani et al., 2006), and healthy aging (Kochunov et al., 2007). Nevertheless, existing studies have yet to provide consistent results on the difference in brain structure between men and women. Ingalhalikar et al. (2014) argued that men have greater intra-hemispheric connectivity, while women have greater inter-hemispheric connectivity via the corpus callosum. However, other studies reported no significant gender difference in brain structure (Raz et al., 2001; Salat et al., 2005). A recent critical opinion article suggested that more research is needed to investigate whether men and women really have different brain structures (Joel and Tarrasch, 2014).

Most existing DTI studies used group-level statistical methods such as Tract-Based Spatial Statistics (TBSS) (Thatcher et al., 2010; Mueller et al., 2011; Shiino et al., 2017). However, recent studies indicated that machine learning techniques may provide a more powerful tool for analyzing brain images (Shen et al., 2010; Lu et al., 2017; Tang et al., 2018). Notably, deep learning can extract non-linear network structure, approximate complex functions, characterize distributed representations of input data, and learn the essential features of datasets from a small number of samples (Zeng et al., 2016, 2018a; Tian et al., 2018; Wen et al., 2018). In particular, the deep convolutional neural network (CNN) uses convolution kernels to extract image features and can find characteristic spatial differences in brain images, which may promise a better result than other conventional machine learning and statistical methods (Cole et al., 2017).

In this study, we performed CNN-based analyses on the FA images and extracted the features of the hidden layers to investigate the difference between men's and women's brains. Unlike the commonly used 2D CNN model, we proposed a 3D CNN model with a new structure including three hidden layers, a linear layer, and a softmax layer. Each hidden layer comprises a convolutional layer, a batch normalization layer, and an activation layer, followed by a pooling layer. This CNN model allows the whole 3D brain image (i.e., DTI) to be used as the input to the model. The linear layer between the hidden layers and the softmax layer reduces the number of parameters and therefore helps to avoid over-fitting.

### MATERIALS AND METHODS

### MRI Data Acquisition and Preprocessing

The database used in this work is from the Human Connectome Project (HCP) (Van Essen et al., 2013). This open-access database contains data from 1,065 subjects, including 490 men and 575 women. The age range is 22 to 36 years. This database represents a relatively large sample size compared to most neuroimaging studies. Using this open-access dataset allows replication and extension of this work by other researchers.

DTI data preprocessing included format conversion, b0 image extraction, brain extraction, eddy current correction, and tensor FA calculation. The first four steps were processed with the HCP diffusion pipeline, whose outputs include the diffusion weighting (bvals), direction (bvecs), time series, brain mask, a file (grad\_dev.nii.gz) for gradient non-linearities during model fitting, and log files of EDDY processing. In the final step, we used dtifit to calculate the tensors and obtain the FA, as well as the mean diffusivity (MD), axial diffusivity (AD), and radial diffusivity (RD) values.

The original data were too large to train the model and caused a RESOURCE EXHAUSTED error during training because of insufficient GPU memory. The GPUs we used in the experiment were NVIDIA TITAN Xp cards with 12 GB of memory each. To solve the problem, we scaled the FA images to a size of [58 × 70 × 58]. This procedure may lead to a better classification result, since a smaller input image gives the CNN model a larger receptive field relative to the image. To perform the image scaling, "dipy" (http://nipy.org/dipy/) was used to read the .nii data of FA. Then "ndimage" in SciPy (http://www.scipy.org) was used to reduce the size of the data. The scaled data were written into TFRecord files (http://www.tensorflow.org) with the corresponding labels. The TFRecord file format is a simple record-oriented binary format that is widely used in TensorFlow applications to feed training data with high input efficiency. The labels were converted into one-hot format. We implemented a pipeline to read data asynchronously from the TFRecord files according to the interface specification provided by TensorFlow (Abadi et al., 2016). The pipeline included reading the TFRecord files, data decoding, data type conversion, and data reshaping.

### CNN Model

We ran the experiments on a GPU workstation with four NVIDIA TITAN Xp GPUs. The operating system of the workstation was Ubuntu 16.04. We used FSL to preprocess the data. The CNN model was designed using the open source machine learning framework TensorFlow (Abadi et al., 2016).

### Model Design

The commonly used CNN structures are based on 2D images. When a 2D CNN is used to process 3D MRI images, the original image must be mapped from different directions to obtain 2D images, which loses the spatial structure information of the image. In this study, we designed a 3D CNN with 3D convolutional kernels, which allowed us to extract 3D structural features from FA images. In addition, traditional CNN models usually use several fully connected layers to connect the hidden layers and the output layer. Fully connected layers may be prone to over-fitting in binary classification when the number of samples is limited (as in our data). To address this problem, we used a linear layer to replace the fully connected layers. The linear layer integrates the outputs of the hidden layers (i.e., a 3D matrix comprising multiple feature maps) into the inputs (i.e., a 1D vector) of the output layer, which is a softmax classifier. Moreover, we performed Batch Normalization (Ioffe and Szegedy, 2015) after each convolution operation. Batch Normalization is used to reduce the internal covariate shift problem in training the CNN model. Therefore, our designed model is a 3D "pure" CNN (3D PCNN). The architecture of the 3D PCNN model is shown in **Figure 1**. The 3D PCNN consists of three hidden layers, a linear layer, and a softmax layer. Each hidden layer contains a convolutional layer, a Batch Normalization layer, an activation layer, and a pooling layer, with several feature maps as the outputs.

### **Convolutional layer**

The convolutional layer convolves the input vector I with the convolution kernel K, represented by I ⊗ K. The shape of the input vector in our 3D PCNN model was [n, d, w, h, c], where d, w, h, and c represent the depth, width, height, and number of channels (1 for a grayscale image) of the input vector, respectively, and n is the batch size, a hyperparameter that was set to 45 (an empirical value) in this paper. In the first layer, the input size was 58 × 70 × 58 × 1, i.e., the 3D size (58 × 70 × 58) of the input image plus a single channel (grayscale image). The shape of the convolution kernel was [d<sub>k</sub>, w<sub>k</sub>, h<sub>k</sub>, c<sub>in</sub>, c<sub>out</sub>], where d<sub>k</sub>, w<sub>k</sub>, and h<sub>k</sub> represent the depth, width, and height of the convolution kernel, respectively. In all three hidden layers, the kernel size was set to 3 × 3 × 3, which means that d<sub>k</sub> = w<sub>k</sub> = h<sub>k</sub> = 3. Here c<sub>in</sub> is the number of input channels, which is equal to the number of channels of the input vector, and c<sub>out</sub> is the number of output channels. As each kernel produces one output channel, c<sub>out</sub> is equal to the number of convolution kernels and is also the number of input channels for the next hidden layer. In all convolutional layers, the stride of the kernel was set to 1 and the padding mode was set to "SAME."
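To make the effect of stride 1 with "SAME" padding concrete, the following pure-Python sketch applies a single 3 × 3 × 3 kernel to a single-channel volume. It is an illustration only, not the TensorFlow implementation used in the study: it handles one input channel and one kernel, assumes odd kernel sizes with zero padding, and computes a cross-correlation (no kernel flip), as deep learning frameworks conventionally do. The function name and the all-ones test data are hypothetical.

```python
def conv3d_same(volume, kernel):
    """Stride-1 3D cross-correlation with zero ("SAME") padding, single channel."""
    D, W, H = len(volume), len(volume[0]), len(volume[0][0])
    kd, kw, kh = len(kernel), len(kernel[0]), len(kernel[0][0])
    pd, pw, ph = kd // 2, kw // 2, kh // 2  # symmetric padding for odd kernels
    out = [[[0.0] * H for _ in range(W)] for _ in range(D)]
    for z in range(D):
        for y in range(W):
            for x in range(H):
                acc = 0.0
                for a in range(kd):
                    for b in range(kw):
                        for c in range(kh):
                            zz, yy, xx = z + a - pd, y + b - pw, x + c - ph
                            # Voxels outside the volume count as zero padding.
                            if 0 <= zz < D and 0 <= yy < W and 0 <= xx < H:
                                acc += volume[zz][yy][xx] * kernel[a][b][c]
                out[z][y][x] = acc
    return out

# Hypothetical 3 x 3 x 3 volume and kernel filled with ones.
ones_volume = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
ones_kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
result = conv3d_same(ones_volume, ones_kernel)
print(result[1][1][1], result[0][0][0])
```

The output has the same spatial size as the input, which is the defining property of "SAME" padding at stride 1; only the pooling layers reduce the spatial size.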

### **Batch normalization layer**

Batch normalization was performed after the convolutional layer. Batch normalization is a training technique that normalizes the data of each mini-batch (to zero mean and unit variance) in the hidden layers of the network. It alleviates the internal covariate shift phenomenon and speeds up CNN training. An Adam gradient descent method was used to train the model (Kingma and Ba, 2015).
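The normalization step can be sketched as below. This is a simplified illustration under stated assumptions: a flat list stands in for the activations of one channel across a mini-batch, and the scale `gamma`, shift `beta`, and stabilizer `eps` are the usual learnable or fixed parameters of batch normalization, whose values in the study are not reported.

```python
import math

def batch_normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel of a mini-batch to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one channel
normalized = batch_normalize(activations)
print(normalized)
```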

### **Activation layer**

After the batch normalization operation, an activation function was used to introduce non-linearity into the convolution result. The activation function used in the model was the rectified linear unit, ReLU (Nair and Hinton, 2010).

### **Pooling layer**

A pooling layer was added after the activation layer. Pooling layers in the CNN summarize the outputs of neighboring groups of neurons in the same kernel map (Krizhevsky et al., 2012). Max pooling was used in this layer.
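The activation and pooling steps can be illustrated with a short pure-Python sketch. The pooling window and stride are not stated in the text, so a 2 × 2 × 2 window with stride 2 is assumed here purely for illustration; partial windows at the border are pooled over the voxels that exist. The feature map values are hypothetical.

```python
def relu(volume):
    """Apply the rectified linear unit element-wise to a 3D volume."""
    return [[[max(0.0, v) for v in row] for row in plane] for plane in volume]

def max_pool3d(volume, size=2):
    """Max pooling with a cubic window and equal stride (assumed values)."""
    D, W, H = len(volume), len(volume[0]), len(volume[0][0])
    out = []
    for z in range(0, D, size):
        plane = []
        for y in range(0, W, size):
            row = []
            for x in range(0, H, size):
                row.append(max(
                    volume[zz][yy][xx]
                    for zz in range(z, min(z + size, D))
                    for yy in range(y, min(y + size, W))
                    for xx in range(x, min(x + size, H))
                ))
            plane.append(row)
        out.append(plane)
    return out

# Hypothetical 2 x 2 x 2 feature map.
feature_map = [[[-1.0, 2.0], [3.0, -4.0]], [[5.0, -6.0], [-7.0, 0.5]]]
activated = relu(feature_map)
pooled = max_pool3d(activated)
print(activated, pooled)
```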

The outputs of each hidden layer were feature maps, which were the features extracted from the input images to the hidden layer. The outputs from the previous hidden layer were the inputs to the next layer. In our model, the first hidden layer generated 32 feature maps, the second hidden layer produced 64 feature maps, and the third hidden layer yielded 128 feature maps. Finally, we integrated the last 128 feature maps into the input of the softmax layer through a linear layer, and then got the final classification results from the softmax layer.
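The size of the three convolutional layers follows directly from the kernel shape and the feature-map counts given above. The sketch below counts the kernel weights per layer (biases excluded) and, as a labeled assumption, tracks the spatial size under 2 × 2 × 2 max pooling with stride 2 and "SAME"-style rounding; the pooling stride is not reported in the text, so the spatial sizes are illustrative only.

```python
import math

def conv_weight_count(kernel, c_in, c_out):
    """Number of kernel weights for a [kd, kw, kh, c_in, c_out] convolution."""
    kd, kw, kh = kernel
    return kd * kw * kh * c_in * c_out

def pooled_shape(shape, stride=2):
    """Spatial size after pooling with the assumed stride (rounded up)."""
    return tuple(math.ceil(s / stride) for s in shape)

shape = (58, 70, 58)          # input FA volume size from the text
channels = [1, 32, 64, 128]   # input channel plus feature maps per hidden layer
summary = []
for c_in, c_out in zip(channels, channels[1:]):
    weights = conv_weight_count((3, 3, 3), c_in, c_out)
    shape = pooled_shape(shape)  # assumption: stride-2 pooling in every layer
    summary.append((c_in, c_out, weights, shape))
for row in summary:
    print(row)
```

Under these assumptions the three layers hold 864, 55,296, and 221,184 kernel weights, so most of the convolutional parameters sit in the third hidden layer.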

In our model, the input X ∈ {x<sup>(1)</sup>, x<sup>(2)</sup>, ..., x<sup>(n)</sup>}, where x<sup>(i)</sup> was the ith subject's FA image, and Y ∈ {y<sup>(1)</sup>, y<sup>(2)</sup>, ..., y<sup>(n)</sup>}, where y<sup>(i)</sup> was the ith subject's label, encoded as a one-hot vector in which [1 0] represents a man and [0 1] a woman. We used h(θ, x) to represent the proposed 3D PCNN model. Then we had:

$$
\hat{\boldsymbol{y}} = \boldsymbol{h}(\boldsymbol{\theta}, \boldsymbol{x}) \tag{1}
$$

where ŷ represents the predicted value obtained using the 3D PCNN on a sample x.
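The label encoding and the final softmax step can be sketched as follows. The one-hot convention ([1 0] for a man, [0 1] for a woman) is taken from the text; the function names and the logit values, which stand in for the output of the linear layer, are hypothetical.

```python
import math

CLASSES = ("man", "woman")

def one_hot(label):
    """Encode a label as [1, 0] for a man and [0, 1] for a woman."""
    return [1 if label == name else 0 for name in CLASSES]

def softmax(logits):
    """Convert raw scores into class probabilities (numerically stable form)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, 1.7]  # hypothetical outputs of the linear layer for one subject
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]
print(one_hot("man"), probs, predicted)
```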

### Parameters Optimization

The initial values of the weights of the convolution kernels were random values selected from a truncated normal distribution with standard deviation of 0.1. We defined a cost function to adjust these weights based on the softmax cross entropy (Dunne and Campbell, 1997):

$$J(\theta, \mathbf{x}) = -\sum_{i=1}^{n} \hat{\mathbf{y}}^{(i)} \log P\left(\hat{\mathbf{y}}^{(i)} = \mathbf{y}^{(i)} \; \middle| \; \mathbf{x} = \mathbf{x}^{(i)} \right) \tag{2}$$

As such, adjusting the weights became an optimization problem with J(θ, x) as the objective, in which a small penalty is given if the classification result is correct and a large penalty otherwise. We used the Adam gradient descent optimization algorithm (Kingma and Ba, 2015) to achieve this goal in model training. All parameters in the Adam algorithm were set to the empirical values recommended by Kingma and Ba (2015), i.e., the learning rate was α = 0.001, the exponential decay rates for the moment estimates were β<sub>1</sub> = 0.9 and β<sub>2</sub> = 0.999, and ε = 10<sup>−8</sup>.
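The two ingredients of the optimization can be sketched in plain Python. The loss below is the standard softmax cross-entropy (the negative log-probability assigned to the true class, summed over samples), which Equation (2) is assumed to denote; the update is one Adam step for a single scalar parameter with the hyperparameters listed above. The toy objective (θ − 3)² and all variable names are hypothetical and only serve to exercise the update rule.

```python
import math

def cross_entropy(one_hot_labels, probabilities):
    """Sum over samples of -log(probability assigned to the true class)."""
    return -sum(
        y * math.log(p)
        for label, probs in zip(one_hot_labels, probabilities)
        for y, p in zip(label, probs)
        if y
    )

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Confident correct predictions are penalized less than uncertain ones.
loss = cross_entropy([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]])

# Toy problem: minimize (theta - 3)^2 starting from theta = 0.
theta, m, v = 0.0, 0.0, 0.0
trajectory = []
for t in range(1, 101):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)
    trajectory.append(theta)
print(loss, trajectory[0], trajectory[-1])
```

With these settings, the first Adam step has a magnitude close to the learning rate regardless of the gradient scale, which is one reason the default α = 0.001 works across many problems.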

### Cross-Validation

To ensure independent training and testing in the cross-validation, we implemented a two-loop nested cross-validation scheme (Varoquaux et al., 2017). The process of cross-validation is shown in **Figure 2**. We divided the data set into three parts, i.e., 80% of the data as the training set for model training, 10% as the validation set for parameter selection, and 10% as the testing set for evaluating the generalization ability of the model. To reduce the random error of model training, we ran 10-fold cross-validation and took the average of the classification accuracies as the final result.
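One simple way to obtain an 80/10/10 split in every one of 10 folds is to rotate the folds, as sketched below. This is an assumed, simplified scheme and not necessarily the exact nested procedure shown in Figure 2: in fold k, one tenth of the subjects is held out for testing, the next tenth for validation, and the remaining eight tenths are used for training. The function name and the random seed are hypothetical.

```python
import random

def ten_fold_splits(n_subjects, n_folds=10, seed=0):
    """Yield (train, validation, test) index lists, rotating the held-out folds."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        val_k = (k + 1) % n_folds
        test = folds[k]
        validation = folds[val_k]
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, val_k) for i in fold]
        yield train, validation, test

splits = list(ten_fold_splits(1065))  # 1,065 subjects, as in the HCP data set
for train, validation, test in splits[:2]:
    print(len(train), len(validation), len(test))
```

Every subject appears in the test set exactly once across the 10 folds, so the averaged accuracy covers the whole data set while keeping training and testing data disjoint within each fold.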

### Features in First Hidden Layer

The CNN has the advantage that it can extract key features by itself (Zeng et al., 2018c). However, these features may be difficult to interpret since they are highly abstract. Thus, in this study, we only analyzed the features obtained in the first hidden layer, since they are the direct outputs of the convolution on the grayscale FA images. In this case, the convolution operation of the first layer is equivalent to applying a convolution-kernel-based spatial filter to the FA images. The obtained features are less abstract than those from the second and third hidden layers. There are 32 features in total in the first hidden layer. These features are the lowest-level features, which may represent the structural features of the FA images. We first computed the mean of the voxel values across all subjects in each group (men vs. women) for each feature and then evaluated their group-level difference using a two-sample t-test. In addition, we computed the entropy of each feature for each individual:

$$H = -\sum_{i=0}^{255} p_i \log p_i \tag{3}$$

where p<sub>i</sub> indicates the frequency with which pixels of value i appear in the image. The entropy of each feature likely indicates the complexity of the brain structure encoded in that feature. We also performed a two-sample t-test on the entropy results to explore the differences between men and women. A strict Bonferroni correction was applied for multiple comparisons, with a threshold of 0.05/32 = 1.56 × 10<sup>−3</sup>, to remove spurious significance.
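Equation (3) and the Bonferroni threshold can be sketched as below. Two points are assumptions: the feature values are taken to be already quantized to integer gray levels 0 to 255, since the quantization step is not described, and the logarithm base is set to 2 (entropy in bits), since Equation (3) does not specify it. The test inputs are hypothetical.

```python
import math
from collections import Counter

def image_entropy(values, base=2.0):
    """Shannon entropy of the gray-level histogram (Equation 3)."""
    counts = Counter(values)
    n = len(values)
    # Levels that never occur contribute nothing to the sum.
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

uniform = list(range(256))   # every gray level occurs once: maximal entropy
constant = [7] * 256         # a single gray level: zero entropy
bonferroni_threshold = 0.05 / 32  # 32 features tested
print(image_entropy(uniform), image_entropy(constant), bonferroni_threshold)
```

A feature map with a flat histogram reaches the maximum of 8 bits, while a constant map has zero entropy, which is the sense in which higher entropy is read as higher structural complexity.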

### Discriminative Power of Brain Regions

To determine which brain regions may play an important role in gender-related brain structural differences, we repeated the same 3D PCNN-based classification on each specific brain region. We segmented each FA image into 246 gray matter regions of interest (ROIs) according to the Human Brainnetome Atlas (Fan et al., 2016) and 48 white matter ROIs according to the ICBM-DTI-81 White-Matter Labels Atlas (Mori et al., 2005). The classification accuracy was then obtained for each ROI. A higher accuracy indicates a more important role of that ROI in gender-related differences. A map was then obtained based on the classification accuracies of the different ROIs to show their distribution in the brain.

### Comparisons With Tract Based Spatial Statistics and Support Vector Machine

To justify the effectiveness of our method, Tract-Based Spatial Statistics (TBSS) and the Support Vector Machine (SVM) were applied to our dataset for comparison, since these are two popular methods for data analysis in neuroimaging studies (Bach et al., 2014; Zeng et al., 2018b). We compared the results under the following two conditions: (1) We used the SVM as the classifier while keeping the same preprocessing procedure in order to compare its results with our 3D PCNN method. We flattened each sample from the 3D FA matrix into a vector and then fed the vector to the SVM. (2) We used TBSS to identify the brain regions that showed statistically significant gender-related differences.
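The flattening step for the SVM input can be sketched as follows. The row-major ordering is an assumption (the text does not state the order), and the small test volume is hypothetical; the feature count is simply the product of the scaled image dimensions given earlier.

```python
def flatten(volume):
    """Flatten a 3D volume (nested lists) into a 1D vector in row-major order."""
    return [v for plane in volume for row in plane for v in row]

# Hypothetical 2 x 2 x 3 volume.
small = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
vector = flatten(small)

# Length of the SVM input vector for a 58 x 70 x 58 FA image.
n_features = 58 * 70 * 58
print(vector, n_features)
```

Each subject therefore becomes a vector of 235,480 features, far more than the 1,065 available samples, and flattening discards the spatial neighborhood structure that the 3D convolution exploits.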

### RESULTS

### Classification Results on the Whole-Brain FA Images

Using our 3D PCNN method on the whole-brain FA images, we could distinguish men and women well, with a classification accuracy of 93.3%. This result is much better than that of the SVM, whose classification accuracy was only 78.2%.

For comparison, we also repeated the same analysis using MD, AD, and RD. The classification accuracy was 65.8% for MD, 69.9% for AD, and 67.8% for RD. All of these are lower than the classification accuracy obtained using FA.

### Feature Analysis in the First Hidden Layer of 3D PCNN

The two-sample t-tests on the 32 features show that 25 features had significant gender differences, including 13 features for which women had larger values and 12 features for which men had larger values (see **Figure 3**). Interestingly, men had significantly higher entropy than women for all features (see **Figure 4**).

### Classification on Each Specific ROI

TBSS could not detect any statistically significant gender-related difference in this dataset. However, using the 3D PCNN, we found gender-related differences in all ROIs in both gray and white matter, as the classification accuracies (>75%) are much higher than the chance level (50%) for all ROIs. The maps of classification accuracies for the different ROIs are shown in **Figure 5**. The detailed classification results are provided in the supplement (see **Table S1** for gray matter and **Table S2** for white matter). In the gray matter, the top 5 regions with the highest classification accuracies are the left precuneus (Brodmann area, BA 31, 87.2%), the left postcentral gyrus (BA 1/2/3 trunk region, 87.2%), the left cingulate gyrus (BA 32 subgenual area, 87.2%), the right orbital gyrus of the frontal lobe (BA 13, 87.1%), and the left occipital thalamus (86.9%). In the white matter, the top 5 regions with the highest classification accuracies are the middle cerebellar peduncle (89.7%), the genu of the corpus callosum (88.4%), the right anterior corona radiata (88.3%), the right superior corona radiata (86%), and the left anterior limb of the internal capsule (85.4%).

### DISCUSSIONS

### Classification on the Whole-Brain FA

The proposed 3D PCNN model achieved 93.3% classification accuracy on the whole-brain FA images. The high classification accuracy indicates that the proposed model can accurately find the brain structure differences between men and women, which is the basis of the subsequent feature analysis and subregional analysis. Most existing classification, regression, and other machine learning methods are shallow learning algorithms, such as the SVM, Boosting, maximum entropy, and Logistic Regression. When complex functions need to be expressed, the models obtained by these algorithms are limited by small sample sizes and limited computational resources. Thus, their generalization ability deteriorates, as demonstrated by our SVM results. The benefit of deep learning algorithms, which use multiple layers in the artificial neural network, is that they can represent complex functions with few parameters. The CNN is one of the most widely used deep learning algorithms. Compared to a method like the SVM, which is just a classifier, the 3D CNN can extract the 3D spatial structure features of the input image. By constructing the 3D PCNN model, we extracted highly abstract features from the FA images, which may improve the classification accuracy. FA is the fractional anisotropy index, which indicates the difference between one direction and the others (Feldman et al., 2010). It can reflect alterations in various tissue properties including axonal size, axonal packing density, and degree of myelination (Chung et al., 2016). In this study, we also ran the same analysis using MD, AD, and RD images for comparison. All of their results are lower than that of FA, indicating that FA is more effective than the other images for finding the structural differences between men's and women's brains.

### Feature Analysis in the First Hidden Layer of 3D PCNN

The degree of macroscopic diffusion anisotropy is often quantified by the FA (Lasi et al., 2014). Previous studies found a wider white matter skeleton in women's brains but wider gray matter regions in men's brains (Witelson et al., 1995; Zaidi, 2010; Gong et al., 2011; Menzler et al., 2011). This suggests that men appear to have more gray matter, made up of active neurons, while women may have more white matter for neuronal communication between different areas of the brain. Furthermore, a recent study using statistical analysis found that men had higher FA values than women among middle-aged to elderly people (between 44 and 77 years old) (Ritchie et al., 2018). The present study focuses on young healthy individuals aged between 22 and 36 years. The structural features extracted by the 3D PCNN reflect the brain structure differences between men and women. In the first hidden layer of the 3D PCNN model, we found 25 features that differ significantly between men and women in voxel values. Moreover, using the entropy measure, we found that men's brains likely have more complex features, as reflected by significantly higher entropy. These results indicate that gender-related differences likely exist across the whole brain, including both white and gray matter.

### Most Discriminative Brain Regions

Using FA images from each specific brain region as the input to the 3D PCNN, we found that all tested brain regions may have gender-related differences, although the TBSS analysis could not detect these differences. The brain regions with high classification accuracies include the left precuneus (Brodmann area, BA 31, 87.2%), the left postcentral gyrus (BA 1/2/3 trunk region, 87.2%), the left cingulate gyrus (BA 32 subgenual area, 87.2%), the right orbital gyrus of the frontal lobe (BA 13, 87.1%), and the left occipital thalamus (86.9%) in the gray matter, and the middle cerebellar peduncle (89.7%), the genu of the corpus callosum (88.4%), the right anterior corona radiata (88.3%), the right superior corona radiata (86%), and the left anterior limb of the internal capsule (85.4%) in the white matter.

A gender-related morphological difference in the corpus callosum has been previously reported, which may be associated with interhemispheric interaction (Sullivan et al., 2001; Luders et al., 2003; Prendergast et al., 2015). However, likely due to the limitations of the methods applied, not all previous studies have reported this difference (Abe et al., 2002). This likely resulted in the inconsistent findings across different studies. Using the 3D PCNN model, our results confirm that there is likely a morphological difference in the genu of the corpus callosum between men and women.

The middle cerebellar peduncle is the brain area connected to the pons and receives its inputs mainly from the pontine nuclei (Glickstein and Doron, 2008), which are the nuclei of the pons involved in motor activity (Wiesendanger et al., 1979). Raz et al. (2001) found a larger cerebellar volume in men than in women. The cerebellar cells release diffusible substances that promote the survival of thalamic neurons (Tracey et al., 1980; Hisanaga and Sharp, 1990). Previous studies have reported gender differences in the basic glucose metabolism in the thalamus of young subjects between the ages of 20 and 40 (Fujimoto et al., 2008). Besides the thalamus and cerebellum, the postcentral gyrus was also found in our results to be a brain region with high classification accuracy. Thus, there is very likely a gender-related difference in the cerebellar-thalamic-cortical circuitry. This difference may also be related to the reported gender differences in neurological degenerative diseases such as Parkinson's disease (Lyons et al., 1998; Dluzen and Mcdermott, 2000; Miller and Cronin-Golomb, 2010), where the pathological changes are usually found in the cerebellar-thalamic-cortical circuitry.

The findings of the current study also indicate a gender-related difference in the limbic-thalamo-cortical circuitry. The anterior corona radiata is part of the limbic-thalamo-cortical circuitry and includes thalamic projections from the internal capsule to the prefrontal cortex. White matter changes in the anterior corona radiata could result in many cognitive and emotion regulation disturbances (Drevets, 2001). The gray matter areas of the orbital gyrus of the frontal cortex and the cingulate gyrus have also been reported to be associated with the emotion regulation system (Fan et al., 2005). Thus, the gender-related difference in the limbic-thalamo-cortical circuitry may explain the gender differences in thalamic activation during the processing of emotional stimuli or unpleasant linguistic information concerning interpersonal difficulties, as demonstrated by previous fMRI studies (Lee and Kondziolka, 2005; Shirao et al., 2005).

In summary, by using the designed 3D PCNN algorithm, we confirmed that gender-related differences exist in the whole-brain FA images as well as in each specific brain region. These gender-related brain structural differences might be related to gender differences in cognition, emotional control, and neurological disorders.

### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: https://www.humanconnectome.org/.

### AUTHOR CONTRIBUTIONS

JX, YT, and YY contributed to the conception and design of the study. YT, JX, and YZ performed data analysis. YT and JX drafted the manuscript. YT and YY participated in editing the manuscript.

### FUNDING

JX and YZ are supported by 111 Project (No. B18059). YT is supported by grant 2016JJ4090 from the Natural Science Foundation of Hunan Province and grants 2017T100613

### REFERENCES


and 2016M592452 from the China Postdoctoral Science Foundation, China. YY is supported by the Dixon Translational Research Grants Initiative (PI: YY) from the Northwestern Memorial Foundation (NMF) and the Northwestern University Clinical and Translational Sciences (NUCATS) Institute.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2019.00185/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xin, Zhang, Tang and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A New Pulse Coupled Neural Network (PCNN) for Brain Medical Image Fusion Empowered by Shuffled Frog Leaping Algorithm

Chenxi Huang<sup>1</sup> , Ganxun Tian<sup>1</sup> , Yisha Lan<sup>1</sup> , Yonghong Peng<sup>2</sup> , E. Y. K. Ng<sup>3</sup> , Yongtao Hao<sup>1</sup> \*, Yongqiang Cheng<sup>4</sup> \* and Wenliang Che<sup>5</sup> \*

<sup>1</sup> Department of Computer Science and Technology, Tongji University, Shanghai, China, <sup>2</sup> Faculty of Computer Science, University of Sunderland, Sunderland, United Kingdom, <sup>3</sup> School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore, Singapore, <sup>4</sup> School of Engineering and Computer Science, University of Hull, Kingston upon Hull, United Kingdom, <sup>5</sup> Department of Cardiology, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Shanghai, China

#### Edited by:

Nianyin Zeng, Xiamen University, China

#### Reviewed by:

Ming Zeng, Xiamen University, China Cheng Wang, Huaqiao University, China Yingchun Ren, Jiaxing University, China

#### \*Correspondence:

Yongtao Hao hao0yt@163.com Yongqiang Cheng Y.Cheng@hull.ac.uk Wenliang Che chewenliang@tongji.edu.cn

#### Specialty section:

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

Received: 21 October 2018 Accepted: 25 February 2019 Published: 20 March 2019

#### Citation:

Huang C, Tian G, Lan Y, Peng Y, Ng EYK, Hao Y, Cheng Y and Che W (2019) A New Pulse Coupled Neural Network (PCNN) for Brain Medical Image Fusion Empowered by Shuffled Frog Leaping Algorithm. Front. Neurosci. 13:210. doi: 10.3389/fnins.2019.00210

Recent research has reported the application of image fusion technologies to medical images in a wide range of areas, such as the diagnosis of brain diseases, the detection of glioma and the diagnosis of Alzheimer's disease. In our study, a new fusion method based on the combination of the shuffled frog leaping algorithm (SFLA) and the pulse coupled neural network (PCNN) is proposed for the fusion of SPECT and CT images to improve the quality of fused brain images. First, the intensity-hue-saturation (IHS) transform of a SPECT image and a CT image are decomposed independently using a nonsubsampled contourlet transform (NSCT), yielding both low-frequency and high-frequency images. We then used the combined SFLA and PCNN to fuse the high-frequency sub-band images and the low-frequency images. The SFLA is used to optimize the PCNN network parameters. Finally, the fused image was produced by the reversed NSCT and reversed IHS transforms. We evaluated our algorithm in terms of standard deviation (SD), mean gradient (Ḡ), spatial frequency (SF) and information entropy (E) using three different sets of brain images. The experimental results demonstrated that the proposed fusion method significantly enhances both precision and spatial resolution.

Keywords: single-photon emission computed tomography image, computed tomography image, image fusion, pulse coupled neural network, shuffled frog leaping

### INTRODUCTION

In 1895 Röntgen obtained the first human medical image by X-ray, after which research on medical images gained momentum, laying the foundation for medical image fusion. With the development of both medical imaging technology and hardware facilities, a series of medical images with different characteristics and information were obtained, providing a key source of information for disease diagnosis. At present, clinical medical images mainly include Computed Tomography (CT) images, Magnetic Resonance Imaging (MRI) images, Single-Photon Emission Computed Tomography (SPECT) images, Dynamic Single-Photon Emission Computed Tomography (DSPECT) images and ultrasonic images, etc. (Jodoin et al., 2015; Hansen et al., 2017; Zhang J. et al., 2017). It is necessary to fuse different modes of medical images into more informative images based on fusion algorithms, in order to provide doctors with more reliable information during clinical diagnosis (Kavitha and Chellamuthu, 2014; Zeng et al., 2014). At present, medical image fusion has been applied in many areas, such as the localization of brain diseases, the detection of glioma, the diagnosis of AD (Alzheimer's disease), etc. (Huang, 1996; Singh et al., 2015; Zeng et al., 2018).

Image fusion is the synthesis of images into a new image using a specific algorithm. The space-time relativity and complementarity of information in the source images can be fully used in the process of image fusion, contributing to a more comprehensive expression of the scene (Wu et al., 2005; Choi, 2006). Conventional methods for fusing SPECT and CT images mainly include component substitution and multi-resolution analysis (Amolins et al., 2007; Huang and Du, 2008; Huang and Jiang, 2012). Component substitution mainly refers to the intensity-hue-saturation (IHS) transform, with the advantage of improving the spatial resolution of SPECT images (Huang, 1999; Rahmani et al., 2010). The limitation of transform invariance leads to difficulty in extracting both image contour and edge details. In order to solve this problem, the contourlet transform was proposed by Da et al. (2006), Zhao et al. (2012), Xin and Deng (2013). Moreover, the non-subsampled contourlet transform (NSCT) was also proposed to fully extract the directional information of the SPECT images and CT images to be fused, providing better performance in image decomposition (Da et al., 2006; Wang and Zhou, 2010; Yang et al., 2016).

The Pulse Coupled Neural Network (PCNN) was proposed by Eckhorn et al. (1989) while studying the imaging mechanisms of the visual cortex of small mammals. No training process is required in the PCNN, and useful information can be obtained from a complex background through the PCNN. Nevertheless, the PCNN has its shortcomings, such as its numerous parameters and the complicated process of setting them. Thus, novel algorithms to optimize the PCNN parameters have been introduced to improve the calculation speed of the PCNN (Huang, 2004; Huang et al., 2004; Jiang et al., 2014; Xiang et al., 2015). SFLA is a heuristic algorithm first presented by Eusuff and Lansey, which combines the advantages of the memetic algorithm and particle swarm optimization. The algorithm can search for the optimal value in a complex space with fewer parameters and offers high performance and robustness (Samuel and Asir Rajan, 2015; Sapkheyli et al., 2015; Kaur and Mehta, 2017).

In our study, a new fusion approach based on the SFLA and PCNN is proposed to address the limitations discussed above. Our proposed method not only innovatively uses SFLA optimization to effectively learn the PCNN parameters, but also produces high quality fused images. A series of contrasting experiments are discussed in view of image quality and objective evaluations.

The remaining part of the paper is organized as follows. Related work is introduced in Section "Related Works." The fusion method is proposed in Section "Materials and Methods." The experimental results are presented in Section "Results," and Section "Conclusion" concludes the paper with an outlook on future work.

## RELATED WORKS

Image fusion involves a wide range of disciplines and can be classified under the category of information fusion, where a series of methods have been presented. A novel fusion method for multi-scale images has been presented by Zhang X. et al. (2017) using the Empirical Wavelet Transform (EWT). In the proposed method, simultaneous empirical wavelet transforms (SEWT) were used for one-dimensional and two-dimensional signals, to ensure the optimal wavelets for the processed signals. A satisfying visual perception was achieved through a series of experiments, and in terms of objective evaluations, it was demonstrated that the method was superior to other traditional algorithms. However, the time consumption of the proposed method is high, mainly during the process of image decomposition, causing application difficulties in a real-time system. Noisy images should also be considered in future work, since the process of generating optimal wavelets may be affected by noise (Zeng et al., 2016b; Zhang X. et al., 2017).

Aishwarya and Thangammal (2017) also proposed a fusion method based on a supervised dictionary learning approach. During the dictionary training, in order to reduce the number of input patches, gradient information was first obtained for every patch in the training set. Second, both the information content and the edge strength were measured for each gradient patch. Finally, the patches with better focus features were selected by a selection rule to train the overcomplete dictionary. Additionally, in the process of fusion, the globally learned dictionary was used to achieve better visual quality. Nevertheless, high computational costs also exist in this approach during the process of sparse coding, and the final fusion performance may be affected by high-frequency noise (Zeng et al., 2016a; Aishwarya and Thangammal, 2017).

Moreover, an algorithm for the fusion of thermal and visual images was introduced by M. Kanmani et al. in order to obtain a single comprehensive fused image. A novel method called self tuning particle swarm optimization (STPSO) was presented to calculate the optimal weights. A weighted averaging fusion rule was also used to fuse the low-frequency and high-frequency coefficients obtained through the Dual Tree Discrete Wavelet Transform (DT-DWT) (Kanmani and Narasimhan, 2017; Zeng et al., 2017a). Xinxia Ji et al. proposed a new fusion algorithm based on an adaptive weighted method in combination with the idea of fuzzy theory. In the algorithm, a membership function with fuzzy logic variables was designed to achieve the transformation of different leveled coefficients by different weights. Experimental results indicated that the proposed algorithm outperformed existing algorithms in terms of visual quality and objective measures (Ji and Zhang, 2017; Zeng et al., 2017b).

### MATERIALS AND METHODS


### The Image Fusion Method Based on PCNN and SFLA

Algorithm 1 is an image fusion algorithm based on the PCNN and SFLA, in which SPECT and CT images are fused. In our proposed algorithm, a SPECT image is first decomposed into three components using the IHS transform: saturation S, hue H and intensity I. Component I is then decomposed into a low-frequency and a high-frequency image through NSCT decomposition. The CT image is likewise decomposed into a low-frequency and a high-frequency image through NSCT decomposition. The two low-frequency images are then fused into a new low-frequency image through the SFLA and PCNN combination fusion rules, while the two high-frequency images are fused into a new high-frequency image through the same rules. Next, the new low-frequency and new high-frequency images are fused to generate a new image with intensity I' using the reversed NSCT. Finally, the target image is obtained by using the reversed IHS transform to integrate the three components S, H and I'.

Algorithm 1: An image fusion algorithm based on PCNN and SFLA

Input: A SPECT image A and a CT image B

Output: A fused image F

Step 1: Obtain three components of image A using IHS transform; saturation S, hue H and intensity I.

Step 2: Image decomposition

(1) Decompose the component I of image A to a low-frequency image AL and high-frequency image AH through NSCT decomposition.

(2) Decompose image B to a low-frequency image BL and high-frequency image BH through NSCT decomposition.

Step 3: Image fusion

(1) Fuse the low-frequency images AL and BL to a new low-frequency image CL through the SFLA and PCNN combination fusion rules.

(2) Fuse the high-frequency images AH and BH to form a new high-frequency image CH through the SFLA and PCNN combination fusion rules.

Step 4: Inverse transform

Fuse the low-frequency image CL and high-frequency image CH to a new image with intensity I' using reversed NSCT.

Step 5: Reversed IHS transform

Through the reversed IHS transform, integrate the three components S, H and I', then obtain the target image F.

The overall method of the proposed algorithm for the fusion of a SPECT and CT image is outlined in **Figure 1**.
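Steps 1 and 5 of Algorithm 1 can be illustrated with a minimal sketch of a forward and reversed IHS transform. The paper does not state which IHS variant was used, so the linear IHS model below, the matrix `_FORWARD`, and the function names `rgb_to_ihs` and `ihs_to_rgb` are assumptions made purely for illustration.

```python
import numpy as np

# Forward matrix of a linear IHS model (an assumed variant; the paper does not specify one).
_FORWARD = np.array([
    [1 / 3, 1 / 3, 1 / 3],
    [-np.sqrt(2) / 6, -np.sqrt(2) / 6, 2 * np.sqrt(2) / 6],
    [1 / np.sqrt(2), -1 / np.sqrt(2), 0.0],
])
_INVERSE = np.linalg.inv(_FORWARD)


def rgb_to_ihs(rgb):
    """Split an (H, W, 3) RGB array into intensity, hue and saturation planes."""
    ivv = np.asarray(rgb, dtype=float) @ _FORWARD.T
    intensity, v1, v2 = ivv[..., 0], ivv[..., 1], ivv[..., 2]
    hue = np.arctan2(v2, v1)          # angle in the (v1, v2) chromatic plane
    saturation = np.hypot(v1, v2)     # radius in the (v1, v2) chromatic plane
    return intensity, hue, saturation


def ihs_to_rgb(intensity, hue, saturation):
    """Rebuild an RGB array from intensity, hue and saturation planes."""
    v1 = saturation * np.cos(hue)
    v2 = saturation * np.sin(hue)
    ivv = np.stack([intensity, v1, v2], axis=-1)
    return ivv @ _INVERSE.T
```

In this sketch, the fused intensity I' would simply be passed to `ihs_to_rgb` in place of the original intensity, while H and S are carried over unchanged from the SPECT image.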

### Decomposition Rule

In our proposed method, the SPECT image and CT image are decomposed into a low-frequency and high-frequency image using NSCT.

The non-subsampled contourlet transform (Huang, 1999; Rahmani et al., 2010) is composed of a non-subsampled pyramid filter bank (NSPFB) and a non-subsampled directional filter bank (NSDFB). The source image is decomposed into a high-frequency sub-band and a low-frequency sub-band by the NSPFB. The high-frequency sub-band is then decomposed into a sub-band for each direction by the NSDFB. The structure diagram of the two-level decomposition of NSCT is shown in **Figure 2**.

An analysis filter {H<sub>1</sub>(z), H<sub>2</sub>(z)} and a synthesis filter {G<sub>1</sub>(z), G<sub>2</sub>(z)} are used when decomposing images with NSCT, and the two filters satisfy H<sub>1</sub>(z)G<sub>1</sub>(z) + H<sub>2</sub>(z)G<sub>2</sub>(z) = 1. The source image generates low-frequency and high-frequency sub-band images when it is decomposed by the NSP. The next level of NSP decomposition is performed on the low-frequency component obtained by the previous level of decomposition. An analysis filter {U<sub>1</sub>(z), U<sub>2</sub>(z)} and a synthesis filter {V<sub>1</sub>(z), V<sub>2</sub>(z)} are contained in the design structure of the NSDFB, with the requirement U<sub>1</sub>(z)V<sub>1</sub>(z) + U<sub>2</sub>(z)V<sub>2</sub>(z) = 1. The high-pass sub-band image decomposed by the J-level NSP is decomposed by the L-level NSDFB, yielding 2<sup>n</sup> high-frequency sub-band coefficients, where n is an integer greater than 0. A fused image with clearer contours and translation invariance can be obtained through the fusion method based on NSCT (Xin and Deng, 2013).
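The idea of a non-subsampled pyramid (every sub-band keeps the size of the source image, and the sub-bands add back to the source) can be sketched with an undecimated "à trous" low-pass/high-pass split. This is a simplified stand-in and not the NSPFB or NSDFB of the paper: it has no directional filtering, and the B3-spline kernel and the names `atrous_decompose` and `atrous_reconstruct` are assumptions chosen for illustration.

```python
import numpy as np


def _smooth_axis(img, kernel, step, axis):
    """Filter along one axis with kernel taps spaced `step` pixels apart."""
    half = len(kernel) // 2
    pad_width = [(0, 0)] * img.ndim
    pad_width[axis] = (half * step, half * step)
    padded = np.pad(img, pad_width, mode="reflect")
    n = img.shape[axis]
    out = np.zeros(img.shape, dtype=float)
    for k, weight in enumerate(kernel):
        start = k * step
        out += weight * np.take(padded, np.arange(start, start + n), axis=axis)
    return out


def atrous_decompose(img, levels=2):
    """Return (low, highs): one low-frequency image and one high-frequency image per level."""
    kernel = np.array([1, 4, 6, 4, 1]) / 16.0   # assumed B3-spline low-pass kernel
    low = np.asarray(img, dtype=float)
    highs = []
    for level in range(levels):
        step = 2 ** level                        # dilate the filter instead of subsampling
        smooth = _smooth_axis(_smooth_axis(low, kernel, step, 0), kernel, step, 1)
        highs.append(low - smooth)               # detail removed at this level
        low = smooth
    return low, highs


def atrous_reconstruct(low, highs):
    """Invert atrous_decompose by summing all sub-bands."""
    return low + sum(highs)
```

Because no sub-band is subsampled, a coefficient at position (i, j) in any sub-band refers to the same pixel location in the source image, which is what allows a per-pixel fusion rule to be applied.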

### Fusion Rule

Fusion rules affect image performance, so the selection of fusion rules largely determines the quality of the final fused image. In this section, the PCNN fusion algorithm based on SFLA is introduced for low-frequency and high-frequency sub-band images decomposed by NSCT.

### Pulse Coupled Neural Network

The PCNN is a single-cortex feedback neural network model that simulates the processing mechanism of visual signals in the cerebral cortex of cats. It consists of several neurons connected to each other, where each neuron is composed of three parts: the receiving domain, the coupled linking modulation domain and the pulse generator. In image fusion using the PCNN, the M × N neurons of a two-dimensional PCNN network correspond to the M × N pixels of the two-dimensional input image, and the gray value of each pixel is taken as the external stimulus of the corresponding neuron. Initially, the internal activation of neurons is equal to the external stimulation. When the external stimulus is greater than the threshold value, a natural ignition will occur. When a neuron ignites, its threshold will increase sharply and then decay exponentially with time. When the threshold attenuates to less than the corresponding internal activation, the neuron will ignite again, and the neuron will generate a pulse sequence signal. The ignited neurons stimulate the ignition of adjacent neurons by interacting with them, thereby generating an automatic wave in the activation region that propagates outward (Ge et al., 2009).

The parameters of the PCNN affect the quality of image fusion, and most current research uses the method of regressively exploring the values of parameters, which is subjective to a certain degree. Therefore, how to reasonably set the parameters of the PCNN is the key to improving its performance. In our paper, SFLA is used to optimize the PCNN network parameters.
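The firing behavior described above can be sketched with a commonly used simplified PCNN. The exact PCNN equations are not given in the text, so the formulation below is an assumption: the 3 × 3 linking weights, the initial threshold of 1, and the function name `pcnn_firing_counts` are hypothetical, while `alpha_theta`, `beta` and `v_theta` stand for the threshold decay, linking strength and threshold amplitude parameters.

```python
import numpy as np


def pcnn_firing_counts(stimulus, alpha_theta=0.2, beta=0.1, v_theta=20.0, n_iter=50):
    """Run a simplified PCNN and return how many times each neuron fired."""
    s = np.asarray(stimulus, dtype=float)        # one neuron per pixel
    rows, cols = s.shape
    weights = np.array([[0.5, 1.0, 0.5],
                        [1.0, 0.0, 1.0],
                        [0.5, 1.0, 0.5]])         # assumed linking weights
    fired = np.zeros_like(s)                      # pulses from the previous iteration
    theta = np.ones_like(s)                       # assumed initial threshold
    counts = np.zeros_like(s)
    for _ in range(n_iter):
        # Linking input: weighted sum of the neighbours' previous pulses.
        padded = np.pad(fired, 1, mode="constant")
        link = np.zeros_like(s)
        for di in range(3):
            for dj in range(3):
                link += weights[di, dj] * padded[di:di + rows, dj:dj + cols]
        activation = s * (1.0 + beta * link)      # internal activation
        fired = (activation > theta).astype(float)
        # Threshold decays exponentially and jumps sharply after a pulse.
        theta = np.exp(-alpha_theta) * theta + v_theta * fired
        counts += fired
    return counts
```

In many PCNN-based fusion schemes the firing counts of two source sub-bands are compared pixel by pixel to decide which coefficient to keep; whether the authors use exactly this decision is not stated here.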

### Shuffled Frog Leaping Algorithm

The shuffled frog leaping algorithm is a group-based swarm search method for obtaining optimal results. The flowchart of SFLA is shown in **Figure 3**. First, the population size F, the number of sub-populations m, the maximum number of local search iterations for each sub-population N and the number of frogs in each sub-population n are defined. Second, a population is initialized, and the fitness value of each frog is calculated and sorted in descending order. A memetic algorithm is used in the process of the search, and the search is carried out in groups. All groups are then fused, and the frogs are sorted according to an established rule. Moreover, the frog population is divided based on the established rules, and the overall information exchange is achieved using this method until the number of iterations is equal to the maximum number of iterations N (Li et al., 2018).

F(x) is defined as the fitness function and Ω is the feasible domain. In each iteration, P<sub>g</sub> is the best frog of the whole population, P<sub>b</sub> is the best frog of each group and P<sub>w</sub> is the worst frog of each group. The algorithm adopts the following update strategy to carry out a local search in each group:

$$\begin{cases} S\_{j} = rand() \cdot (P\_{b} - P\_{w}), & -S\_{\max} \le S\_{j} \le S\_{\max} \\\ P\_{w,\text{new}} = P\_{w} + S\_{j} \end{cases} \tag{1}$$

where S<sub>j</sub> represents the leaping step of the frog, rand() is a random number between 0 and 1, S<sub>max</sub> is the maximum leaping step, and P<sub>w,new</sub> is the updated worst frog of the group. If P<sub>w,new</sub> ∈ Ω and F(P<sub>w,new</sub>) > F(P<sub>w</sub>), P<sub>w</sub> is replaced by P<sub>w,new</sub>; otherwise, P<sub>b</sub> in Equation (1) is replaced by P<sub>g</sub> and a second candidate P′<sub>w,new</sub> is computed. If P′<sub>w,new</sub> ∈ Ω and F(P′<sub>w,new</sub>) > F(P<sub>w</sub>), P<sub>w</sub> is replaced by P′<sub>w,new</sub>; otherwise, P<sub>w</sub> is replaced by a newly generated frog. The iteration then continues until the maximum number of iterations is reached.
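The local search of Equation (1) and the shuffling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the round-robin dealing of frogs into groups, the box-shaped feasible domain, the default maximum step, and the name `sfla_maximize` with all of its parameters are hypothetical.

```python
import numpy as np


def sfla_maximize(fitness, lower, upper, n_frogs=30, n_groups=5,
                  local_iters=10, n_shuffles=20, s_max=None, seed=0):
    """Maximize `fitness` over the box [lower, upper] with a basic SFLA."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if s_max is None:
        s_max = upper - lower                     # assumed maximum leaping step
    frogs = rng.uniform(lower, upper, size=(n_frogs, lower.size))
    scores = np.array([fitness(f) for f in frogs])

    for _ in range(n_shuffles):
        # Shuffle: sort all frogs by descending fitness, then deal them into groups.
        order = np.argsort(-scores)
        frogs, scores = frogs[order], scores[order]
        for g in range(n_groups):
            members = np.arange(g, n_frogs, n_groups)
            for _ in range(local_iters):
                ranked = members[np.argsort(-scores[members])]
                best, worst = ranked[0], ranked[-1]
                global_best = int(np.argmax(scores))
                # Leap toward the group best first, then toward the global best.
                for target in (frogs[best].copy(), frogs[global_best].copy()):
                    step = np.clip(rng.random() * (target - frogs[worst]), -s_max, s_max)
                    candidate = frogs[worst] + step
                    feasible = np.all(candidate >= lower) and np.all(candidate <= upper)
                    if feasible:
                        value = fitness(candidate)
                        if value > scores[worst]:
                            frogs[worst], scores[worst] = candidate, value
                            break
                else:
                    # Neither leap improved the worst frog: replace it with a new frog.
                    frogs[worst] = rng.uniform(lower, upper)
                    scores[worst] = fitness(frogs[worst])
    winner = int(np.argmax(scores))
    return frogs[winner], scores[winner]
```

For PCNN tuning, each frog would hold one candidate triple of PCNN parameters, and `fitness` would score the fused image produced with that triple; the fitness function itself is not specified in the text.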

### PCNN Fusion Algorithm Based on SFLA

Three parameters α<sub>θ</sub>, β and V<sub>θ</sub> in the PCNN are essential for the results of image fusion. Therefore, as shown in **Figure 4**, in our study the SFLA is used to optimize the PCNN in order to achieve the optimal solution of the PCNN parameters. Each frog is defined as a spatial solution X(α<sub>θ</sub>, β, V<sub>θ</sub>), and the optimal configuration scheme of the PCNN parameters can finally be obtained by searching for the best frog X<sub>b</sub>(α<sub>θ</sub>, β, V<sub>θ</sub>).

In our proposed method, possible configuration schemes of parameters are defined, which constitute a solution space for the parameter optimization. After generating an initial frog solution space, F frogs in the population are divided into m groups, and each group is dependent on one another. Starting from the initial solution, the frogs in each group first carry out an intraclass optimization by a local search, thereby continuously updating their own fitness values. In N iterations of local optimization, the quality of the whole frog population is optimized with the improvement of the quality of frogs in all groups. The frogs of the population are then fused and regrouped according to the established rule, and local optimization within the group is carried out until reaching the final iteration conditions. Finally, the global optimal solution of the frog population is defined as the optimal PCNN parameter configuration. The final fusion image is thus obtained using the optimal parameter configuration above.

### RESULTS

In order to verify the accuracy and preservation of the edge details in our proposed method, three sets of CT and SPECT images were fused based on our method. The results of each set were compared with four fusion methods: IHS, NSCT+FL, DWT and NSCT+PCNN. In the NSCT+FL method, images are first decomposed by NSCT to obtain high-frequency and low-frequency coefficients, and the fused image is then obtained by taking the larger of the high-frequency coefficients and the average of the low-frequency coefficients. In NSCT+PCNN, images are decomposed by NSCT and fused by the PCNN.
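The fusion rule of the NSCT+FL baseline can be sketched in a few lines. The sketch assumes that "large value" means the coefficient with the larger absolute value, which is a common convention but is not stated in the text; the function name `fuse_average_and_max_abs` is hypothetical, and the inputs are any pair of same-sized low-frequency and high-frequency sub-bands.

```python
import numpy as np


def fuse_average_and_max_abs(low_a, low_b, high_a, high_b):
    """Average the low-frequency sub-bands and keep the larger-magnitude high-frequency coefficient."""
    low_a, low_b = np.asarray(low_a, dtype=float), np.asarray(low_b, dtype=float)
    high_a, high_b = np.asarray(high_a, dtype=float), np.asarray(high_b, dtype=float)
    fused_low = 0.5 * (low_a + low_b)
    # Assumption: "large value" is interpreted as larger absolute value.
    fused_high = np.where(np.abs(high_a) >= np.abs(high_b), high_a, high_b)
    return fused_low, fused_high
```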

## Subjective Evaluations of Experimental Results

Experiments were implemented on the image database from the Whole Brain Web Site of Harvard Medical School (Johnson and Becker, 2001) which contains two groups of images including CT and SPECT images. Each group has three examples including normal brain images, glioma brain images and brain images of patients diagnosed with Alzheimer's disease. The testing images have been used in many related papers (Du et al., 2016a,b,c) and the platform is MATLAB R2018a.

A series of fusion results of SPECT and CT images, based on different methods including IHS, NSCT+FL, DWT, NSCT+PCNN, and our proposed method, are shown in **Figures 5–7**. The fusion results of a set of normal brain images are shown in **Figure 5**, the fusion results of a set of glioma brain images are presented in **Figure 6**, and those of a set of brain images of patients diagnosed with Alzheimer's disease are shown in **Figure 7**. In **Figures 5–7**, (a), (h) and (o) are source CT images; (b), (i) and (p) are source SPECT images; (c), (j) and (q) are fused images based on IHS; (d), (k) and (r) are fused images based on NSCT+FL; (e), (l) and (s) are fused images based on DWT; (f), (m) and (t) are fused images based on the combination of NSCT+PCNN; (g), (n) and (u) are fused images based on the proposed method. It can be seen that the fusion results based on our proposed method are more accurate and clearer than those based on the other methods. Our proposed method yields fused images with higher brightness and more information on the edge details.

### Objective Evaluations of Experimental Results

A set of metrics is used to compare the performance of the fusion methods including IHS, DWT, NSCT, PCNN, a combination of NSCT and the PCNN, and our proposed method. The evaluation metrics, including standard deviation (SD), mean gradient (Ḡ), spatial frequency (SF) and information entropy (E), are detailed as follows (Huang et al., 2018):

(1) Standard deviation

Standard deviation is used to evaluate the contrast of the fused image, which is defined as

$$\sigma = \sqrt{\sum\_{i=1}^{M} \sum\_{j=1}^{N} (Z(i,j) - \bar{Z})^2 / (M \times N)} \tag{2}$$

where Z(i, j) represents the pixel value of the fused image and Z¯ is the mean value of the pixel values of the image.

The SD reflects the dispersion of the image gray levels relative to the mean gray level. A higher value of SD indicates higher contrast and therefore better performance of the fused image.

(2) Mean gradient (Ḡ)

Ḡ corresponds to the ability of a fused image to represent the contrast of tiny details sensitively. It can be mathematically described as

$$\bar{G} = \frac{1}{(M-1)(N-1)} \sum\_{i=1}^{M-1} \sum\_{j=1}^{N-1} \sqrt{\left(\left(\frac{\partial Z(x\_i, y\_j)}{\partial x\_i}\right)^2 + \left(\frac{\partial Z(x\_i, y\_j)}{\partial y\_j}\right)^2\right)/2} \tag{3}$$

The fused image is clearer when the value of mean gradient is higher.

(3) Spatial frequency (SF)

Spatial frequency is the measure of the overall activity in a fused image. For an image with a gray value Z(xi, yj) at position (xi, yj), the spatial frequency is defined as

$$SF = \sqrt{RF^2 + CF^2} \tag{4}$$

where the row frequency is

$$RF = \sqrt{\frac{1}{M \times N} \sum\_{i=1}^{M} \sum\_{j=2}^{N} [Z(x\_i, y\_j) - Z(x\_i, y\_{j-1})]^2} \tag{5}$$

and the column frequency is

$$CF = \sqrt{\frac{1}{M \times N} \sum\_{i=2}^{M} \sum\_{j=1}^{N} [Z(x\_i, y\_j) - Z(x\_{i-1}, y\_j)]^2} \tag{6}$$

The higher the spatial frequency, the better the fused image quality.

(4) Information entropy (E)

Information entropy is given by

$$E = -\sum\_{i=0}^{L-1} p\_i \log\_2 p\_i \tag{7}$$

where L is the number of gray levels of the image and p<sub>i</sub> is the proportion of pixels with gray value i among all pixels. A higher value of entropy indicates more information contained in the fused image.

TABLE 1 | Performance evaluations on normal brain fused images based on different methods.

TABLE 2 | Performance evaluations on glioma brain fused images based on different methods.

TABLE 3 | Performance evaluations on fused brain images of patients diagnosed with Alzheimer's disease, based on different methods.
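The four metrics of Equations (2)–(7) can be computed with a short NumPy sketch. The partial derivatives in Equation (3) are approximated here by forward differences, which is an assumption since the discretization is not stated; the function names are hypothetical, and `information_entropy` assumes non-negative integer gray values below `levels`.

```python
import numpy as np


def standard_deviation(z):
    """Equation (2): spread of gray levels around the image mean."""
    z = np.asarray(z, dtype=float)
    return np.sqrt(np.mean((z - z.mean()) ** 2))


def mean_gradient(z):
    """Equation (3), with forward differences standing in for the partial derivatives."""
    z = np.asarray(z, dtype=float)
    dx = z[1:, :-1] - z[:-1, :-1]     # difference along the row index i
    dy = z[:-1, 1:] - z[:-1, :-1]     # difference along the column index j
    return np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2))


def spatial_frequency(z):
    """Equations (4)-(6): combined row and column frequency."""
    z = np.asarray(z, dtype=float)
    m, n = z.shape
    rf = np.sqrt(np.sum((z[:, 1:] - z[:, :-1]) ** 2) / (m * n))
    cf = np.sqrt(np.sum((z[1:, :] - z[:-1, :]) ** 2) / (m * n))
    return np.sqrt(rf ** 2 + cf ** 2)


def information_entropy(z, levels=256):
    """Equation (7): Shannon entropy of the gray-level histogram, in bits."""
    hist = np.bincount(np.asarray(z).astype(np.int64).ravel(), minlength=levels)
    p = hist / hist.sum()
    p = p[p > 0]                      # empty bins contribute nothing to the sum
    return -np.sum(p * np.log2(p))
```

All four functions take a single two-dimensional gray-level image, so the same calls can be applied to the output of each fusion method being compared.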

Experimental results on fused SPECT and CT images are shown in **Tables 1–3**. The fusion results of a set of normal brain images are shown in **Table 1**, the fusion results of a set of glioma brain images are presented in **Table 2**, and those of a set of brain images of patients diagnosed with Alzheimer's disease are shown in **Table 3**. It can be seen that, compared to other fusion methods, our proposed method generally has higher values of SD, Ḡ, SF and E. The experimental results demonstrate that the fused images obtained by our proposed method contain more abundant information and inherit detail information better, while the resolution is significantly improved.

### CONCLUSION

In this paper, a new fusion method for SPECT and CT brain images was put forward. First, NSCT was used to decompose the IHS transform of a SPECT and CT image. The fusion rule based on the regional average energy was then used for low-frequency coefficients, and the combination of SFLA and the PCNN was used for high-frequency sub-bands. Finally, the fused image was produced by reversed NSCT and reversed IHS transform. Both subjective evaluations and objective evaluations were used to analyze the quality of the fused images. The results demonstrated that the method we put forward can retain the information of source images better and reveal more details in integration. It can be seen that the proposed method is valid and effective in achieving satisfactory fusion results, leading to a wide range of applications in practice.

The paper focuses on multi-mode medical image fusion. However, there is a negative correlation between the real-time processing speed and the effectiveness of medical image fusion. How to improve the efficiency of the method while ensuring the quality of the fusion results should be considered in the future.

### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: http://www.med.harvard.edu/aanlib/.

### AUTHOR CONTRIBUTIONS

CH conceived the study. GT and CH designed the model. YC and YP analyzed the data. YL and WC wrote the draft. EN and YH interpreted the results. All authors gave critical revision and consent for this submission.

## FUNDING

This work was supported in part by the Tongji University Short-term Study Abroad Program under Grant 2018020017, National Science and Technology Support Program under Grant 2015BAF10B01, and National Natural Science Foundation of China under Grants 81670403, 81500381, and 81201069. CH acknowledges support from Tongji University for the exchange with Nanyang Technological University.





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Huang, Tian, Lan, Peng, Ng, Hao, Cheng and Che. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multiple Sclerosis Identification by 14-Layer Convolutional Neural Network With Batch Normalization, Dropout, and Stochastic Pooling

Shui-Hua Wang<sup>1,2†</sup>, Chaosheng Tang<sup>1†</sup>, Junding Sun<sup>1†</sup>, Jingyuan Yang<sup>3†</sup>, Chenxi Huang<sup>4</sup> \*, Preetha Phillips<sup>5</sup> \* and Yu-Dong Zhang<sup>1,6</sup> \* <sup>†</sup>

*<sup>1</sup> School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China, <sup>2</sup> School of Architecture Building and Civil Engineering, Loughborough University, Loughborough, United Kingdom, <sup>3</sup> The Faculty of Computer Science and Engineering, Xi'an University of Technology, Xi'an, China, <sup>4</sup> Department of Computer Science and Technology, Tongji University, Shanghai, China, <sup>5</sup> West Virginia School of Osteopathic Medicine, Lewisburg, WV, United States, <sup>6</sup> Department of Informatics, University of Leicester, Leicester, United Kingdom*

#### Edited by:

*Nianyin Zeng, Xiamen University, China*

#### Reviewed by:

*Xia-an Bi, Hunan Normal University, China Victor Chang, Xi'an Jiaotong-Liverpool University, China*

#### \*Correspondence:

*Chenxi Huang 1710051@tongji.edu.cn Preetha Phillips pphillips@osteo.wvsom.edu Yu-Dong Zhang yudongzhang@ieee.org*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience*

Received: *12 September 2018* Accepted: *19 October 2018* Published: *08 November 2018*

#### Citation:

*Wang S-H, Tang C, Sun J, Yang J, Huang C, Phillips P and Zhang Y-D (2018) Multiple Sclerosis Identification by 14-Layer Convolutional Neural Network With Batch Normalization, Dropout, and Stochastic Pooling. Front. Neurosci. 12:818. doi: 10.3389/fnins.2018.00818*

Aim: Multiple sclerosis is a severe brain and/or spinal cord disease. It may lead to a wide range of symptoms. Hence, early diagnosis and treatment are quite important.

Method: This study proposed a 14-layer convolutional neural network, combined with three advanced techniques: batch normalization, dropout, and stochastic pooling. The output of the stochastic pooling was obtained via sampling from a multinomial distribution formed from the activations of each pooling region. In addition, we used a data augmentation method to enhance the training set. In total, 10 runs were implemented with the hold-out set randomly selected for each run.

Results: The results showed that our 14-layer CNN secured a sensitivity of 98.77 ± 0.35%, a specificity of 98.76 ± 0.58%, and an accuracy of 98.77 ± 0.39%.

Conclusion: Our results were compared with CNNs using maximum pooling and average pooling. The comparison shows that stochastic pooling gives better performance than the other two pooling methods. Furthermore, we compared our proposed method with six state-of-the-art approaches, including five traditional artificial intelligence methods and one deep learning method. The comparison shows that our method is superior to all six state-of-the-art approaches.

Keywords: multiple sclerosis, deep learning, convolutional neural network, batch normalization, dropout, stochastic pooling

## INTRODUCTION

Multiple sclerosis (MS) is a condition that affects the brain and/or spinal cord (Chavoshi Tarzjani et al., 2018). It can lead to a wide range of symptoms, such as problems with balance (Shiri et al., 2018), vision, movement, and sensation (Demura et al., 2016). It has two main types: (i) relapsing remitting MS and (ii) primary progressive MS. More than eight out of every ten diagnosed MS patients are of the "relapsing remitting" type (Guillamó et al., 2018).

Wang et al. MS by 14-Layer CNN-DO-BN-SP

MS may be confused with other white matter diseases, such as neuromyelitis optica (NMO) (Lana-Peixoto et al., 2018), acute cerebral infarction (ACI) (Deguchi et al., 2018), and acute disseminated encephalomyelitis (ADEM) (Desse et al., 2018). Hence, accurate diagnosis of MS is important for patients and for their subsequent treatment. This preliminary study investigated and implemented a method that distinguishes MS from healthy controls with the help of magnetic resonance imaging (MRI).

Recently, researchers have tended to use computer vision and image processing (Zhang and Wu, 2008, 2009; Zhang et al., 2009a,b, 2010a,b) techniques to accomplish automatic MS identification tasks. For instance, Murray et al. (2010) proposed to use multiscale amplitude modulation and frequency modulation (AM-FM) to identify MS. Nayak et al. (2016) presented a novel method combining AdaBoost with random forest (ARF). Wang et al. (2016) combined biorthogonal wavelet transform (BWT) and logistic regression (LR). Wu and Lopez (2017) used four-level Haar wavelet transform (HWT). Zhang et al. (2017) proposed a novel MS identification system based on Minkowski-Bouligand Dimension (MBD).

The above methods secured promising results. Nevertheless, they need to extract features beforehand, and the effectiveness of the hand-extracted features must be validated (Chang, 2018a,b,c; Lee et al., 2018). Recently, the convolutional neural network (CNN) has attracted research interest, since it can learn features automatically in its early layers. CNN has already been applied to many fields, such as biometric identification (Das et al., 2019) and manipulation detection (Bayar and Stamm, 2018). Zhang et al. (2018) were the first to apply CNN to identify MS, and their method achieved an overall accuracy of 98.23%.

This study is based on the CNN structure of Zhang et al. (2018). We proposed two further improvements: batch normalization and stochastic pooling. In addition, we used a dynamic learning rate to accelerate convergence. The learning rate is a parameter that controls how quickly the model converges to a local minimum. A low learning rate means slow movement down the slope: it makes it unlikely that we miss the local minimum, but convergence takes a long time. Therefore, instead of using a fixed small learning rate, we set the learning rate to a large value and reduced it after every given number of epochs until convergence was achieved.
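As a minimal sketch of the step-decay idea described above, the following Python function lowers the learning rate by a fixed factor after every given number of epochs. The function name, the initial rate, the drop factor, and the drop period are illustrative assumptions; the text does not report the values actually used.

```python
def step_decay_lr(epoch, initial_lr=0.01, drop_factor=0.1, drop_period=10):
    """Return the learning rate for a 0-based epoch under step decay.

    All default values are hypothetical placeholders, not the paper's settings.
    """
    if drop_period <= 0:
        raise ValueError("drop_period must be positive")
    # The rate is multiplied by drop_factor once per completed drop_period.
    completed_drops = epoch // drop_period
    return initial_lr * (drop_factor ** completed_drops)


if __name__ == "__main__":
    for epoch in (0, 9, 10, 25):
        print(epoch, step_decay_lr(epoch))
```

Starting large and decaying in steps keeps the early epochs fast while the later, smaller steps reduce the risk of overshooting the minimum.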

The rest of this paper is organized as follows: section Data Preprocessing describes the data sources and the data preprocessing. Section Methodology illustrates the method used in our research. Section Experiments, Results, and Discussions provides the experimental results and discussion.

### DATA PREPROCESSING

### Two Sources

The dataset in this study was obtained from Zhang et al. (2018). First, MS images were obtained from the eHealth laboratory (2018). All brain lesions were identified and delineated by experienced MS neurologists, and were confirmed by radiologists. Second, 681 slices from the 26 healthy controls provided in Zhang et al. (2018) were used as healthy control images. **Table 1** shows the demographic characteristics of the two datasets.

**Figure 1A** shows the original slice, and **Figure 1B** shows the delineated result with four plaques; the areas surrounded by red lines denote the plaques. **Figures 1C,D** present two slices from healthy controls.

### Contrast Normalization

The brain slices are from two different sources; hence, the scanners may have different hardware settings (scanning sequence) and software settings (reconstruction from k-space, storage format, etc.). It is necessary to match the two sources of images in terms of gray-level intensities. This is also called contrast normalization, with the aim of achieving a consistent dynamic range across the various sources of data.

The histogram stretching (HS) method (Li et al., 2018) was chosen due to its ease of implementation. HS aims to enhance the contrast by stretching the range of intensity values of the two sources of images to the same range, providing the effect of inter-scan normalization. The contrast normalization is implemented in the following way.

TABLE 1 | Demographic characteristics of two datasets.

FIGURE 3 | Pipeline of conv layer.

FIGURE 4 | A toy example of max pooling and average pooling.

Let us assume µ is the original brain image and φ is the contrast-normalized image. The process of HS can be described as

$$\varphi(x, y) = \frac{\mu(x, y) - \mu\_{\min}}{\mu\_{\max} - \mu\_{\min}} \tag{1}$$

where (x, y) represents the coordinate of a pixel, and µ<sub>min</sub> and µ<sub>max</sub> represent the minimum and maximum intensity values of the original brain image µ:

$$\mu\_{\min} = \min\_{x} \min\_{y} \mu(x, y) \tag{2}$$

$$\mu\_{\max} = \max\_{x} \max\_{y} \mu(x, y) \tag{3}$$

We applied contrast normalization to the data from both sources and then combined them, forming a dataset of 676 + 681 = 1,357 images.
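The histogram stretching of Equations (1)-(3) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; the function name and the handling of a constant image (where the denominator would be zero) are assumptions made here.

```python
import numpy as np


def histogram_stretch(image):
    """Rescale intensities to [0, 1] following Equations (1)-(3)."""
    image = np.asarray(image, dtype=np.float64)
    mu_min = image.min()  # Equation (2)
    mu_max = image.max()  # Equation (3)
    if mu_max == mu_min:
        # Constant image: Equation (1) is undefined, so return zeros
        # (a choice made for this sketch, not taken from the source).
        return np.zeros_like(image)
    return (image - mu_min) / (mu_max - mu_min)  # Equation (1)


if __name__ == "__main__":
    toy_slice = np.array([[10, 20], [30, 50]])
    print(histogram_stretch(toy_slice))
```

Applying the same mapping to every slice from both sources places all images on a common [0, 1] intensity range.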

### METHODOLOGY

A convolutional neural network is usually composed of conv layers, pooling layers, and fully connected layers. **Figure 2** gives a toy example that consists of two conv layers, two pooling layers, and two fully connected layers. CNN can achieve comparable or even better performance than traditional AI approaches, while it does not require manual design of the features (Zeng et al., 2014, 2016a,b, 2017a,b).

TABLE 2 | Variables used in batch normalization.


### Conv Layer

The conv layers perform two-dimensional convolution along the width and height directions (Yu et al., 2018). It is worth noting that the weights in a CNN are learned by backpropagation; only at initialization are the weights assigned randomly. **Figure 3** shows the pipeline of data passing through a conv layer. Suppose there is an input with size of

$$\text{Input} : H\_I \times W\_I \times D \tag{4}$$

where H<sub>I</sub>, W<sub>I</sub>, and D represent the height, width, and number of channels of the input, respectively.

Suppose the size of filter is

$$\begin{aligned} \text{Filter } 1 &: H\_F \times W\_F \times D \\ &\;\;\vdots \\ \text{Filter } Z &: H\_F \times W\_F \times D \end{aligned} \tag{5}$$

where H<sub>F</sub> and W<sub>F</sub> are the height and width of each filter, and the number of channels of each filter should be the same as that of the input. Z denotes the number of filters. If those filters move with a stride of M and a padding of N, then the number of channels of the output activation map is Z. The output size is:

$$\text{Output}: H\_O \times W\_O \times Z \tag{6}$$

where H<sub>O</sub> and W<sub>O</sub> are the height and width of the output. Their values are:

$$H\_O = 1 + \left\lfloor \frac{2N + H\_I - H\_F}{M} \right\rfloor \tag{7}$$

$$W\_O = 1 + \left\lfloor \frac{2N + W\_I - W\_F}{M} \right\rfloor \tag{8}$$

where ⌊⌋ denotes the floor function. The outputs of a conv layer are usually passed through a non-linear activation function, which is normally chosen to be the rectified linear unit (ReLU) function.
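Equations (6)-(8) can be checked with a small helper. The sketch below is illustrative only: the function name and the example sizes (a 256 × 256 × 1 input with eight 3 × 3 filters) are hypothetical and are not taken from the network described in this paper.

```python
def conv_output_size(h_in, w_in, h_f, w_f, num_filters, stride=1, padding=0):
    """Return (H_O, W_O, Z) for a conv layer, following Equations (6)-(8)."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    if h_f > h_in + 2 * padding or w_f > w_in + 2 * padding:
        raise ValueError("filter is larger than the padded input")
    h_out = 1 + (2 * padding + h_in - h_f) // stride  # Equation (7)
    w_out = 1 + (2 * padding + w_in - w_f) // stride  # Equation (8)
    return h_out, w_out, num_filters                  # Equation (6)


if __name__ == "__main__":
    # Hypothetical example: stride M = 2 and padding N = 1.
    print(conv_output_size(256, 256, 3, 3, num_filters=8, stride=2, padding=1))
```

Integer floor division (`//`) plays the role of the floor function in Equations (7) and (8).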


### Pooling Layer

The activation map contains too many features, which can lead to overfitting and a heavy computational burden. A pooling layer is often used to implement dimension reduction. Furthermore, pooling can help to obtain invariance to translation. There are two commonly used pooling methods: average pooling (AP) and max pooling (MP).

Average pooling (Ibrahim et al., 2018) calculates the average value of the elements in each pooling region, while max pooling selects the maximum value of the pooling region. Suppose the region R<sub>j</sub> contains pixels χ<sub>i</sub>; average pooling and max pooling are then defined as:

$$\text{AP:} \; y\_{j} = \frac{1}{|R\_{j}|} \sum\_{i \in R\_{j}} \chi\_{i} \tag{9}$$

$$\text{MP:} \; y\_{j} = \max\_{i \in R\_{j}} \chi\_{i} \tag{10}$$

**Figure 4** shows the difference, where the kernel size equals 2 and stride equals 2. The max pooling finally outputs the maximum values of all four quadrants, while the average pooling outputs the average values.
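A toy version of both pooling methods, with kernel size 2 and stride 2 as in the example above, might look as follows. This is a sketch for illustration; the function name and the 4 × 4 input matrix are made up here and do not reproduce the values in the figure.

```python
import numpy as np


def pool2d(x, kernel=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D array (Equations 9 and 10)."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    x = np.asarray(x, dtype=np.float64)
    h_out = (x.shape[0] - kernel) // stride + 1
    w_out = (x.shape[1] - kernel) // stride + 1
    out = np.empty((h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            region = x[r * stride:r * stride + kernel,
                       c * stride:c * stride + kernel]
            out[r, c] = region.max() if mode == "max" else region.mean()
    return out


if __name__ == "__main__":
    # Hypothetical 4 x 4 activation map, pooled into four quadrants.
    activation = np.array([[1, 3, 2, 4],
                           [5, 7, 6, 8],
                           [9, 2, 1, 0],
                           [3, 4, 5, 6]])
    print(pool2d(activation, mode="max"))
    print(pool2d(activation, mode="avg"))
```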

### Softmax and Fully-Connected Layer

In a fully connected (FC) layer, each neuron connects to all neurons of the previous layer, which gives this layer many parameters. The fully connected layer multiplies the input by a weight matrix and adds a bias vector. Suppose layer k contains m neurons and layer (k+1) contains n neurons. The weight matrix will then be of size m × n, and the bias vector of size 1 × n. **Figure 5** shows the structure of an FC layer.

Meanwhile, a fully connected layer is often followed by a softmax function, which converts the input to a probability distribution. In this study, "softmax" denotes only the softmax function, whereas some literature adds a fully connected layer before the softmax function and refers to the two layers together as the "softmax function."
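The shapes described above (an m × n weight matrix and a 1 × n bias vector) can be illustrated with a short NumPy sketch of one FC layer followed by a softmax. The function names and the toy sizes are assumptions for illustration; the max-subtraction in the softmax is a common numerical-stability step and is not something stated in the text.

```python
import numpy as np


def fully_connected(x, weights, bias):
    """Map a (1, m) input to a (1, n) output: x @ W + b."""
    return x @ weights + bias


def softmax(z):
    """Convert scores to a probability distribution along the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # stability step (assumption)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 4, 2                      # hypothetical layer sizes
    x = rng.normal(size=(1, m))
    weights = rng.normal(size=(m, n))
    bias = np.zeros((1, n))
    probs = softmax(fully_connected(x, weights, bias))
    print(probs, probs.sum())
```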

### Dropout

A deep neural network provides strong learning ability, even for very complex functions that are hard for humans to understand. However, one problem that often occurs during the training of a deep neural network is overfitting, which means that the error on the training set is very small, but the error is large when test data are provided to the network. We refer to this as poor generalization to new data.

Dropout was proposed to overcome the problem of overfitting. Dropout works by randomly setting some neurons to zero in each forward pass. Each unit is dropped out with a fixed probability p, independently of the other units. The probability p is commonly set to 0.5. **Figure 6** shows an example of a dropout neural network, where an empty circle denotes a normal neuron and a circle with an X inside denotes a dropped-out neuron. Dropout reduces the number of active links, which makes the neural network easier to train.
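A forward pass with dropout can be sketched as below. Note one assumption that goes beyond the text: this sketch uses the common "inverted dropout" variant, which rescales the surviving activations by 1/(1 − p) during training so that no rescaling is needed at test time. The function name and arguments are hypothetical.

```python
import numpy as np


def dropout_forward(activations, p=0.5, rng=None, training=True):
    """Zero each unit independently with probability p (inverted dropout)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    activations = np.asarray(activations, dtype=np.float64)
    if not training or p == 0.0:
        # At test time every unit is kept unchanged.
        return activations.copy()
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= p
    # Rescaling keeps the expected activation unchanged (an assumption
    # of this sketch; the text does not describe the scaling used).
    return activations * keep_mask / (1.0 - p)


if __name__ == "__main__":
    layer_output = np.ones(10)
    print(dropout_forward(layer_output, p=0.5, rng=np.random.default_rng(0)))
```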

### Batch Normalization

The input distribution of each layer changes whenever the parameters of the previous layer are updated. This phenomenon, called internal covariate shift, can slow down training. To solve this problem, we employed batch normalization, which normalizes each layer's inputs over a mini-batch so that they have a consistent distribution. All the variables are listed in **Table 2**. Batch normalization can be implemented as follows:

$$\alpha^{l} = \frac{1}{m} \sum\_{i} z^{li} \tag{11}$$

$$\left(\sigma^{l}\right)^{2} = \frac{1}{m} \sum\_{i} \left( z^{li} - \alpha^{l} \right)^{2} \tag{12}$$

$$z\_{norm}^{li} = \frac{z^{li} - \alpha^{l}}{\sqrt{\left(\sigma^{l}\right)^{2} + \varepsilon}} \tag{13}$$

$$\widetilde{z}^{li} = \gamma^{l} z\_{norm}^{li} + \beta^{l} \tag{14}$$

Here, ε is employed to improve numerical stability when the mini-batch variance is very small; it is usually set to a default value of 10<sup>−5</sup>. The offset β and the scale factor γ are updated during training as learnable parameters.
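The four batch normalization steps in Equations (11)-(14) translate almost line by line into NumPy. The sketch below covers only the training-time forward pass over one mini-batch; the function name and the toy input are assumptions, and running statistics for inference are deliberately left out.

```python
import numpy as np


def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize a mini-batch z of shape (m, features), Equations (11)-(14)."""
    z = np.asarray(z, dtype=np.float64)
    alpha = z.mean(axis=0)                           # Equation (11): batch mean
    variance = ((z - alpha) ** 2).mean(axis=0)       # Equation (12): batch variance
    z_norm = (z - alpha) / np.sqrt(variance + eps)   # Equation (13)
    return gamma * z_norm + beta                     # Equation (14): scale and shift


if __name__ == "__main__":
    batch = np.array([[1.0, 2.0], [3.0, 6.0]])       # hypothetical mini-batch, m = 2
    print(batch_norm_forward(batch, gamma=1.0, beta=0.0))
```

With γ = 1 and β = 0 the output has approximately zero mean and unit variance per feature; the learnable γ and β let the network undo this normalization where that helps.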

### Stochastic Pooling

Stochastic pooling was proposed to overcome the problems caused by max pooling and average pooling. Average pooling has the drawback that all elements in the pooling region are considered, so it may down-weight strong activations when many elements are near zero. Max pooling solves this problem, but it easily overfits the training set. Hence, max pooling does not generalize well to the test set.

Instead of calculating the mean value or the max value of each pooling region, the output of stochastic pooling is obtained by sampling from a multinomial distribution formed from the activations of each pooling region R<sub>j</sub>. The procedure can be expressed as follows:

(1) Calculate the probability p<sub>i</sub> of each element χ<sub>i</sub> within the pooling region.

$$p\_i = \frac{\chi\_i}{\sum\_{k \in R\_{j}} \chi\_k} \tag{15}$$

in which k is the index of the elements within the pooling region.

(2) Pick a location l within the pooling region according to the probability p. It is determined by scanning the pooling region from left to right and top to bottom.

$$A\_{j} = \chi\_{l}, \quad l \sim P(p\_{1}, \dots, p\_{|R\_{j}|}) \tag{16}$$

TABLE 4 | Hyperparameters of Conv layers.

TABLE 5 | Hyperparameters of Fully-connected layers.

Instead of considering only the maximum value, stochastic pooling may use non-maximal activations within the pooling region. **Figure 7** shows a toy example of stochastic pooling. We first compute the probabilities for the input matrix; the roulette wheel then falls within the slice of 0.2. Hence, the location l is chosen as 2, and the output is the value at the second position.
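The two-step procedure of Equations (15) and (16) can be sketched as follows. This is an illustrative sketch under stated assumptions: activations are taken to be non-negative (for example, after ReLU), an all-zero region is mapped to zero, and the function names are hypothetical.

```python
import numpy as np


def stochastic_pool_region(region, rng):
    """Sample one activation from a pooling region (Equations 15 and 16)."""
    # ravel() scans row by row: left to right, then top to bottom.
    values = np.asarray(region, dtype=np.float64).ravel()
    if np.any(values < 0):
        raise ValueError("activations must be non-negative")
    total = values.sum()
    if total == 0.0:
        return 0.0  # all-zero region: a choice made for this sketch
    probs = values / total                        # Equation (15)
    location = rng.choice(values.size, p=probs)   # Equation (16): l ~ P(p_1, ...)
    return float(values[location])


def stochastic_pool2d(x, kernel=2, stride=2, rng=None):
    """Apply stochastic pooling to every region of a 2D activation map."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    h_out = (x.shape[0] - kernel) // stride + 1
    w_out = (x.shape[1] - kernel) // stride + 1
    out = np.empty((h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            region = x[r * stride:r * stride + kernel,
                       c * stride:c * stride + kernel]
            out[r, c] = stochastic_pool_region(region, rng)
    return out


if __name__ == "__main__":
    activation = np.array([[1.0, 2.0, 0.0, 0.0],
                           [3.0, 4.0, 0.0, 7.0],
                           [0.0, 0.0, 1.0, 1.0],
                           [0.0, 0.0, 1.0, 1.0]])
    print(stochastic_pool2d(activation, rng=np.random.default_rng(0)))
```

Because larger activations receive larger probabilities, strong responses are still favored, but non-maximal values are occasionally selected, which acts as a regularizer.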

### EXPERIMENTS, RESULTS, AND DISCUSSIONS

### Division of the Dataset

The hold-out validation method (Monteiro et al., 2016) was used to divide the dataset. In the training set, there are 350 MS images and 350 HC images. In the test set, we have 326 MS images and 331 HC images. **Table 3** presents the settings of the hold-out validation method.

The dataset is divided into two parts, a training dataset and a test dataset, without a validation dataset, as shown in **Table 3**. The validation set was omitted mainly for the following reasons. First, according to past research, the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set (Bylander, 2002; Whiting et al., 2004). Second, a validation dataset is usually needed, in addition to the training and test datasets, to tune the classification parameters so as to avoid overfitting. However, in this paper, we employed dropout to overcome the problem of overfitting, and the experimental results showed no overfitting. Therefore, a validation dataset was not used in our research.

TABLE 6 | Statistical analysis of 10 runs.

| Run | Sensitivity | Specificity | Precision | Accuracy |
| --- | --- | --- | --- | --- |
| 1 | 98.77 | 98.19 | 98.17 | 98.48 |
| 2 | 98.47 | 97.58 | 97.57 | 98.02 |
| 3 | 98.47 | 98.79 | 98.77 | 98.63 |
| 4 | 98.16 | 98.79 | 98.77 | 98.48 |
| 5 | 99.08 | 98.79 | 98.78 | 98.93 |
| 6 | 98.77 | 98.79 | 98.77 | 98.78 |
| 7 | 99.39 | 99.40 | 99.39 | 99.39 |
| 8 | 99.08 | 98.49 | 98.48 | 98.78 |
| 9 | 98.77 | 99.40 | 99.38 | 99.09 |
| 10 | 98.77 | 99.40 | 99.38 | 99.09 |
| Average | 98.77 ± 0.35 | 98.76 ± 0.58 | 98.75 ± 0.58 | 98.77 ± 0.39 |

FIGURE 11 | Confusion matrixes of each run.

### Data Augmentation Results

Deep learning usually needs a large number of samples. However, collecting biomedical data is a well-known challenge, so it is desirable to generate more data from the limited data available. Meanwhile, data augmentation has been shown to overcome overfitting and increase the accuracy of classification tasks (Wong et al., 2016; Velasco et al., 2018). Therefore, in this study, we employed five different data augmentation (DA) methods to enlarge the training set (Velasco et al., 2018). First, we used image rotation. The rotation angle θ varied from −30° to 30° in steps of 2°. The second DA method was scaling. The scaling factors varied from 0.7 to 1.3 in steps of 0.02. The third DA method was noise injection. Zero-mean Gaussian noise with a variance of 0.01 was added to the original image to generate 30 new noise-contaminated images, each from a different random draw. The fourth DA method was random translation, applied 30 times to each original image. The value of the random translation t falls within the range of [0, 15] pixels and obeys a uniform distribution. The fifth DA method was gamma correction. The gamma value r varied from 0.4 to 1.6 in steps of 0.04.

TABLE 8 | Pooling method comparison and *p*-values of signed-rank test.

*Bold means the p-values are less than 0.05.*

TABLE 9 | Comparison of the approach with and without data augmentation.

TABLE 10 | Comparison to traditional AI approaches.

TABLE 11 | Comparison to deep learning approaches.

| Approach | Sensitivity | Specificity | Precision | Accuracy |
| --- | --- | --- | --- | --- |
| CNN-DO-BN-SP (Ours) | 98.77 ± 0.35 | 98.76 ± 0.58 | 98.75 ± 0.58 | 98.77 ± 0.39 |

An original training image is presented in **Figure 1A**. **Figure 8** shows the pipeline of the data preprocessing, where the augmented training set is used to train a deep convolutional neural network model, and the trained model is then tested on the test set, with the final performance reported in **Table 6**. **Figure 9A** shows the results of image rotation. **Figure 9B** shows the image scaling results. **Figures 9C–E** show the results of noise injection, random translation, and gamma correction, respectively. As shown, one training image can generate 150 new images; thus, the data-augmented training set is 151 times the size of the original training set.
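The figure of 150 new images per training image can be reproduced by enumerating the augmentation settings. The sketch below rests on one assumption that the text does not state explicitly: the identity settings (0° rotation, scaling factor 1.0, gamma 1.0) are skipped, so that each stepped range contributes 30 images. The function name is hypothetical.

```python
def augmentation_counts():
    """Count the new images generated per original image by each DA method."""
    # Integer ranges avoid floating-point drift when stepping 0.02 or 0.04.
    rotation_angles = [a for a in range(-30, 31, 2) if a != 0]         # degrees
    scaling_factors = [k / 100 for k in range(70, 131, 2) if k != 100]
    gamma_values = [k / 100 for k in range(40, 161, 4) if k != 100]
    return {
        "rotation": len(rotation_angles),
        "scaling": len(scaling_factors),
        "noise_injection": 30,       # 30 random noise draws, as stated in the text
        "random_translation": 30,    # 30 random shifts, as stated in the text
        "gamma_correction": len(gamma_values),
    }


if __name__ == "__main__":
    counts = augmentation_counts()
    new_per_image = sum(counts.values())
    original_training_images = 350 + 350   # MS + HC training images
    print(counts)
    print(new_per_image, original_training_images * (1 + new_per_image))
```

Under this assumption, the 700 original training images grow to 700 × 151 = 105,700 images after augmentation.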

### Structure of Proposed CNN

We built a 14-layer CNN model with 11 conv layers and 3 fully-connected layers. By convention, we did not count the other layers. The hyperparameters were fine-tuned, and their values are listed in **Tables 4**, **5**. The padding values of all layers were set to "same." **Figure 10** shows the activation map of each layer. The height and width of the output of each layer shrink toward the later layers.

### Statistical Results

We used our 14-layer CNN with "DO-BN-SP." We ran the test 10 times, and each time the hold-out division was updated randomly. The results over the 10 runs are shown in **Table 6**. The averages of sensitivity, specificity, and accuracy are 98.77 ± 0.35, 98.76 ± 0.58, and 98.77 ± 0.39, respectively. The confusion matrices of all runs are shown in **Figure 11**.
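The "mean ± standard deviation" summaries can be recomputed from the per-run values in Table 6. The sketch below uses the sample standard deviation (dividing by n − 1); that choice is an assumption, made because it reproduces the reported figures for sensitivity and specificity, and the helper name is hypothetical.

```python
import statistics

# Per-run values copied from Table 6 (in percent).
SENSITIVITY = [98.77, 98.47, 98.47, 98.16, 99.08, 98.77, 99.39, 99.08, 98.77, 98.77]
SPECIFICITY = [98.19, 97.58, 98.79, 98.79, 98.79, 98.79, 99.40, 98.49, 99.40, 99.40]


def summarize(values):
    """Return (mean, sample standard deviation) of the per-run values."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    for name, values in (("Sensitivity", SENSITIVITY), ("Specificity", SPECIFICITY)):
        mean, sd = summarize(values)
        print(f"{name}: {mean:.2f} ± {sd:.2f}")
```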

### Pooling Method Comparison

In this experiment, we compared the stochastic pooling (SP) with max pooling (MP) and average pooling (AP). All the other settings are fixed and unchanged. The results of 10 runs of MP and AP are shown in **Table 7**.

We performed the Wilcoxon signed rank test (Keyhanmehr et al., 2018) between the results of SP and those of MP, and between the results of SP and those of AP. The results are listed in **Table 8**. They show that SP is significantly better than MP in terms of specificity and accuracy. Meanwhile, SP is significantly better than AP in all four measures.

In this section, the Wilcoxon signed rank test was used instead of the two-sample t-test (Jafari and Ansari-Pour, 2018) and the chi-square test (Kurt et al., 2019) for the following reason: the two-sample t-test assumes that the data come from independent random samples of normal distributions, and the same holds for the chi-square goodness-of-fit test. However, our sensitivity, specificity, precision, and accuracy data do not meet the condition of a Gaussian distribution.

### Validation of the Data Augmentation

We compared the training process with and without data augmentation to explore the augmentation strategies. The data augmentation methods include image rotation, scaling, noise injection, random translation, and gamma correction, as stated in section Data Augmentation Results. The respective performance is shown in **Table 9**. Training with data augmentation provided better performance and, in particular, reduced the standard deviation.

### Comparison to State-Of-The-Art Approaches

In this experiment, we compared our CNN-DO-BN-SP method with traditional AI methods: Multiscale AM-FM (Murray et al., 2010), ARF (Nayak et al., 2016), BWT-LR (Wang et al., 2016), 4-level HWT (Wu and Lopez, 2017), and MBD (Zhang et al., 2017). The results are presented in **Table 10**. Besides, we compared our method with a modern CNN method, viz., CNN-PReLU-DO (Zhang et al., 2018). The results are listed in **Table 11**. We can observe that our method achieved performance superior to all six state-of-the-art approaches, as shown in **Figure 12**.

Our method performs best among all seven algorithms for four reasons. (i) We used data augmentation to enhance the generalization ability of our deep neural network. (ii) The batch normalization technique was used to resolve the internal covariate shift problem. (iii) The dropout technique was used to avoid overfitting in the fully connected layers. (iv) Stochastic pooling was employed to resolve the down-weighting issue caused by average pooling and the overfitting problem caused by max pooling.

Bioinspired algorithms may help the design or initialization of our model. In the future, we shall try particle swarm optimization (PSO) (Zeng et al., 2016c,d) and other methods. The hardware of our model can be optimized using specific optimization methods (Zeng et al., 2018).

In this paper, we employed data augmentation, which has two main benefits. First, collecting biomedical data is a well-known challenge, so it is useful to generate more data from the limited data available. Second, data augmentation has been shown to overcome overfitting and increase the accuracy of classification tasks (Wong et al., 2016; Velasco et al., 2018).

### CONCLUSION

In this study, we proposed a novel fourteen-layer convolutional neural network with three advanced techniques: dropout, batch normalization, and stochastic pooling. The main contributions are listed as follows:

The results showed that our method is superior to six state-of-the-art approaches: five traditional artificial intelligence methods and one deep learning method. The detailed explanation is provided in section Comparison to State-Of-The-Art Approaches. In the future, we shall try to test other pooling variants, such as pyramid pooling. Densely connected convolutional networks will also be tested for our task. Meanwhile, we will also work on finding more ways to accelerate convergence (Liao et al., 2018).

### AUTHOR CONTRIBUTIONS

S-HW conceived the study. CT and JS designed the model. CT and Y-DZ analyzed the data. S-HW, PP, and Y-DZ acquired and preprocessed the data. JY and JS wrote the draft. CH, PP, and Y-DZ interpreted the results. All authors gave critical revision and consent for this submission.

## ACKNOWLEDGMENTS

This paper is supported by the Natural Science Foundation of China (61602250), the National key research and development plan (2017YFB1103202), the Henan Key Research and Development Project (182102310629), and the Open Fund of Guangxi Key Laboratory of Manufacturing System & Advanced Manufacturing Technology (17-259-05-011K).

### REFERENCES

hours versus 3-4.5 hours. J. Stroke Cerebrovasc. Dis. 27, 1033–1040. doi: 10.1016/j.jstrokecerebrovasdis.2017.11.009


associated with dengue virus infection. J. Neuroimmunol. 318, 53–55. doi: 10.1016/j.jneuroim.2018.02.003


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Wang, Tang, Sun, Yang, Huang, Phillips and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A U-Net Deep Learning Framework for High Performance Vessel Segmentation in Patients With Cerebrovascular Disease

Michelle Livne1,2 \*, Jana Rieger <sup>1</sup> , Orhun Utku Aydin<sup>1</sup> , Abdel Aziz Taha<sup>3</sup> , Ela Marie Akay <sup>1</sup> , Tabea Kossen1,2, Jan Sobesky 2,4, John D. Kelleher <sup>5</sup> , Kristian Hildebrand<sup>6</sup> , Dietmar Frey <sup>1</sup> and Vince I. Madai 1,2

<sup>1</sup> Predictive Modelling in Medicine Research Group, Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany, <sup>2</sup> Centre for Stroke Research Berlin, Charité - Universitätsmedizin Berlin, Berlin, Germany, <sup>3</sup> Research Studios Data Science, Research Studios Austria, Salzburg, Austria, <sup>4</sup> Department of Neurology, Johanna-Etienne Hospital Neuss, Neuss, Germany, <sup>5</sup> Information, Communication and Entertainment Institute (ICE), Dublin Institute of Technology, Dublin, Ireland, <sup>6</sup> Department VI Computer Science and Media, Beuth University of Applied Sciences, Berlin, Germany

#### Edited by:

Guoyan Zheng, University of Bern, Switzerland

#### Reviewed by:

Suyash P. Awate, Indian Institute of Technology Bombay, India He Wang, Fudan University, China Leixin Zhou, The University of Iowa, United States

#### \*Correspondence:

Michelle Livne michelle.livne@charite.de

#### Specialty section:

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

Received: 01 September 2018 Accepted: 28 January 2019 Published: 28 February 2019

#### Citation:

Livne M, Rieger J, Aydin OU, Taha AA, Akay EM, Kossen T, Sobesky J, Kelleher JD, Hildebrand K, Frey D and Madai VI (2019) A U-Net Deep Learning Framework for High Performance Vessel Segmentation in Patients With Cerebrovascular Disease. Front. Neurosci. 13:97. doi: 10.3389/fnins.2019.00097

Brain vessel status is a promising biomarker for better prevention and treatment in cerebrovascular disease. However, classic rule-based vessel segmentation algorithms need to be hand-crafted and are insufficiently validated. A specialized deep learning method, the U-net, is a promising alternative. Using labeled data from 66 patients with cerebrovascular disease, the U-net framework was optimized and evaluated with three metrics: Dice coefficient, 95% Hausdorff distance (95HD) and average Hausdorff distance (AVD). The model performance was compared with the traditional segmentation method of graph-cuts. Training and reconstruction were performed using 2D patches. A full and a reduced architecture with fewer parameters were trained. We performed both quantitative and qualitative analyses. The U-net models yielded high performance for both the full and the reduced architecture: a Dice value of ∼0.88, a 95HD of ∼47 voxels and an AVD of ∼0.4 voxels. The visual analysis revealed excellent performance in large vessels and sufficient performance in small vessels. Pathologies like cortical laminar necrosis and a rete mirabile led to limited segmentation performance in a few patients. The U-net outperformed the traditional graph-cuts method (Dice ∼0.76, 95HD ∼59, AVD ∼1.97). Our work highly encourages the development of clinically applicable segmentation tools based on deep learning. Future works should focus on improved segmentation of small vessels and methodologies to deal with specific pathologies.

Keywords: cerebrovascular disease, deep learning, medical imaging, segmentation, U-net

## INTRODUCTION

Stroke is a global disease with extreme impact on patients and healthcare systems. Approximately 15 million people suffer from an ischemic stroke each year worldwide<sup>1</sup>. A third of the patients die, making stroke a leading cause of death. Since stroke is a cerebrovascular disease, more detailed information about arterial vessel status may play a crucial role for both the prevention of stroke and the improvement of stroke therapy. It thus has potential to become a biomarker for new personalized medicine approaches for stroke prevention and treatment (Hinman et al., 2017). Considering that vessel imaging is a routine procedure in the clinical setting, vessel information could be easily integrated in the clinical workflow, if segmentations are available and processed.

<sup>1</sup>WHO EMRO Stroke, Cerebrovascular Accident | Health Topics. Available online at: http://www.emro.who.int/healthtopics/stroke-cerebrovascular-accident/index.html (Accessed July 14, 2018).

Currently, however, vessel imaging is only visually (i.e., qualitatively) assessed in the clinical routine. Technical challenges of extracting brain arteries and quantifying their parameters have prevented this information from being applied in the clinical setting. If done at all, segmentations of brain vessels are to date still done predominantly manually or semi-manually and are not quantified. Additionally, (semi-)manual vessel segmentation is very time-consuming and has proven to be fairly inaccurate owing to high interrater variability, making it unfeasible for the clinical setting (Phellan et al., 2017). Consequently, research has focused on developing faster and more accurate automatic vessel segmentation methods. Many different rule-based methods exploiting various features of vessel images, such as vessel intensity distributions, geometric models, and vessel extraction schemes, have been proposed for this purpose in the previous decades (Lesage et al., 2009; Zhao et al., 2017). These methods, however, are predominantly manually engineered, utilize hand-crafted features, and are, additionally, insufficiently validated (Lesage et al., 2009; Phellan et al., 2017). In fact, due to the lack of validation and of the necessary performance, none of the suggested methods has found any broad use in the clinical setting or in research so far. Thus, crucial information about arterial vessel status and subsequent personalized treatment recommendations are not available. The doctor on site lacks a tool to assess this information for the potential benefit of cerebrovascular disease patients.

Deep neural network architectures are a natural choice to overcome this technological roadblock (Zhao et al., 2017). They have shown tremendous success in the last 5 years for image classification and segmentation tasks in various fields (LeCun et al., 2015; Chen et al., 2016; Badrinarayanan et al., 2017; Krizhevsky et al., 2017), and particularly in neuroimaging (Zaharchuk et al., 2018). In the peer reviewed literature for arterial brain vessel segmentation, Phellan et al. (2017) explored a relatively shallow neural net in magnetic resonance images of 5 patients. While showing promising preliminary results, the small sample size and shallow net led to limited performance. Here, one of the most promising deep learning frameworks for segmentation tasks is the U-net (Ronneberger et al., 2015). It is a specialized convolutional neural net (CNN) with an encoding down-sampling path and an up-sampling decoding path similar to an autoencoder architecture. It was specifically designed for segmentation tasks and has shown high performance for the segmentation of biomedical images (Fabijanska, 2018; Huang et al., 2018; Norman et al., 2018).

In this context our central contribution is a modified U-net architecture for fully automated arterial brain vessel segmentation evaluated on a dataset of 66 magnetic resonance (MR) images of patients with cerebrovascular disease. We performed a thorough qualitative and quantitative assessment of performance, with a special focus on pathological cases. Lastly, we compared our results to a traditional standard method, the graph cut approach (Chen and Pan, 2018).

## METHODS

## Patients

We retrospectively used data from patients from the PEGASUS study that enrolled patients with steno-occlusive cerebrovascular disease [at least 70% stenosis and/or occlusion of one middle cerebral artery (MCA) or internal carotid artery (ICA)]. The study details have been published previously (Mutke et al., 2014; Martin et al., 2015). As additional test-sets to assess generalization we included patients with cerebrovascular disease from the 7UP study. Both the 7UP study as well as the PEGASUS study were carried out in accordance with the recommendations of the authorized institutional ethical review board of the Charité-Universitätsmedizin Berlin with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the authorized institutional ethical review board of the Charité-Universitätsmedizin Berlin.

Of 82 patients in total, 4 did not have vessel imaging and 6 patients were not yet processed at the time of the study. Of the 72 patients remaining, 6 were excluded due to low-quality vessel images owing to patient motion, leaving 66 patient scans available for our study. The test-sets from the 7UP study comprised 10 patients each with cerebrovascular disease (stroke in the past): one set with Time-of-Flight (TOF) angiography acquired on a different scanner with different parameters, and one set with a different angiography modality, i.e., MPRAGE angiography (Dengler et al., 2016).

### Data Accessibility

At the current time-point the imaging data cannot be made publicly accessible due to data protection, but the authors will make efforts in the future, thus this status might change. Researchers interested in the code and/or model can contact the authors and the data will be made available (either through direct communication or through reference to a public repository).

### Imaging

For the PEGASUS patients, scans were performed on a clinical 3T whole-body system (Magnetom Trio, Siemens Healthcare, Erlangen, Germany; in continuation referred to as Siemens Healthcare) using a 12-channel receive radiofrequency (RF) coil (Siemens Healthcare) tailored for head imaging.

Time-of-Flight (TOF) vessel imaging was performed with the following parameters: voxel size = (0.5 × 0.5 × 0.7) mm<sup>3</sup> ; matrix size: 312 × 384 × 127; TR/TE = 22 ms/3.86 ms; time of acquisition: 3:50 min, flip angle = 18 degrees.

For the additional 7UP test-sets, TOF imaging was performed on a clinical 3T whole-body system (Magnetom Verio, Siemens Healthcare) with a 12-channel RF receive coil (Siemens Healthcare). MPRAGE angiography was performed on a 7T whole-body system (Magnetom 7.0 T, Siemens Healthcare) with a 90 cm bore magnet (Magnex Scientific, Oxfordshire, United Kingdom), an Avanto gradient system (Siemens Healthcare) and a 1/24 channel transmit/receive coil (NovaMedical, Wakefield, MA).

The parameters were:

TOF imaging: voxel size = (0.6 × 0.6 × 0.6) mm<sup>3</sup> ; matrix size: 384 × 384 × 160; TR/TE = 24 ms/3.60 ms; time of acquisition: 5:54 min, flip angle = 18 degrees.

MPRAGE imaging: voxel size = (0.7 × 0.7 × 0.7) mm<sup>3</sup> ; matrix size: 384 × 384 × 240; TR/TE = 2,750 ms/1.81 ms; time of acquisition: 5:40 min, flip angle = 25 degrees.

### Data Postprocessing

The raw PEGASUS study TOF images were denoised using the oracle-based 3D discrete cosine transform filter (ODCT3D) implemented in MATLAB (Manjón et al., 2012). Non-uniformity correction (NUC) was performed with the mri\_nu\_correct.mni tool integrated in FreeSurfer (website: Freesurfer mri\_nu\_correct.mni)<sup>2</sup>. Corresponding whole-brain masks were automatically generated using the Brain Extraction Tool (BET) of FSL (website: BET/UserGuide-FslWiki)<sup>3</sup>. Both NUC and FSL-BET post-processing were performed with the Nipype wrapper implemented in Python<sup>4</sup>. The post-processing parameters were as follows: ODCT filter: patch size 3 × 3 × 3 voxels, search volume size 7 × 7 × 7 voxels, Rician noise model; FreeSurfer mnibias correction: iterations = 6, protocol\_iterations = 1,000, distance = 50; FSL BET: frac = 0.05.

The additional 7UP TOF and MPRAGE imaging testset image pipeline differed in these points: non-local means denoising implemented in Nipype (patch radius = 1, block radius = 5, and Rician noise model) and MPRAGE-BET parameters (frac 0.15).

### Data Labeling

For PEGASUS TOF data, ground-truth labels of the brain vessels were generated semi-manually using a standardized pipeline. Pre-labeling of the vessels was performed by a thresholded region-growing algorithm using the regiongrowingmacro module implemented in MeVisLab (website: MeVisLab)<sup>5</sup>. To tackle interrater variability in label generation, these pre-labeled data were thoroughly manually corrected by either OUA or EA (both junior raters) and then cross-checked by the other rater. These labels were subsequently checked by both VIM (9 years of experience in stroke imaging) and DF (11 years of experience in stroke imaging). Thus, each ground-truth was eventually checked by 4 independent raters, two of them senior raters. The total labeling time with this framework amounted to 60–80 min per patient.

Additional test-set label data (TOF and MPRAGE imaging) was generated using the U-net model in a first step instead of the regiongrowingmacro framework, followed by the above-described thorough manual correction steps. Images were reviewed in the final step by VIM.

<sup>5</sup>https://www.mevislab.de/ (Accessed July 14, 2018).

### Data Splitting

For U-net training, both PEGASUS TOF images and ground-truth labels were skull-stripped using the whole-brain masks. The data was split into training, validation, and test-sets with 41, 11, and 14 patient scans, respectively. For illustration of the extracted data see **Figure 1**.

### Random Patch Extraction

In order to successfully train our deep neural network, we needed to address two challenges. First, the brain slices, with 312 × 384 voxels, are very large and cannot be processed at once due to the limited GPU memory. Second, the distribution of vessels within a slice is highly skewed, as only 0.9% of brain voxels are vessels. To solve these problems we extracted 1,000 square patches per patient: 500 random patches with a vessel in the center and 500 random patches without a vessel in the center. The model was trained using 4 different patch sizes (16 × 16, 32 × 32, 64 × 64, 96 × 96 voxels), and the best patch size was later selected on the validation set as part of the optimization process. Due to computational limitations the maximal patch size was set to 96 × 96 voxels. Testing the effect of different patch sizes on the model performance can reveal important information about the relevant spatial scope for reliable vessel detection. The data was normalized patch-wise using zero-mean and unit-variance normalization.
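The balanced sampling and patch-wise normalization described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `extract_patches`, sampling per 2D slice, sampling with replacement, and the synthetic test slice are all assumptions.

```python
import numpy as np

def extract_patches(image, label, patch_size, n_per_class, rng):
    """Sample square patches from one 2D slice, balanced by center-voxel class.

    Hypothetical helper: assumes an even patch_size and samples with replacement.
    """
    half = patch_size // 2
    h, w = image.shape
    # Only centers whose patch lies fully inside the slice are eligible.
    valid = np.zeros(label.shape, dtype=bool)
    valid[half:h - half + 1, half:w - half + 1] = True

    patches, masks = [], []
    for is_vessel in (True, False):
        candidates = np.argwhere(valid & ((label > 0) == is_vessel))
        picks = candidates[rng.integers(0, len(candidates), size=n_per_class)]
        for r, c in picks:
            patch = image[r - half:r + half, c - half:c + half].astype(np.float64)
            # Patch-wise zero-mean, unit-variance normalization.
            std = patch.std()
            patch = (patch - patch.mean()) / (std if std > 0 else 1.0)
            patches.append(patch)
            masks.append(label[r - half:r + half, c - half:c + half])
    return np.stack(patches), np.stack(masks)

# Synthetic slice with the in-plane matrix size of the PEGASUS TOF images.
rng = np.random.default_rng(0)
image = rng.normal(size=(312, 384))
label = np.zeros((312, 384), dtype=np.uint8)
label[150:160, 100:300] = 1  # synthetic "vessel"
x, y = extract_patches(image, label, patch_size=96, n_per_class=5, rng=rng)
```

In this sketch the first half of the returned patches has a vessel voxel at the patch center and the second half does not, which mirrors the 500/500 split per patient described in the text.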

### Network Architecture and Training

#### Network Architecture

The U-net CNN model architecture was adapted from the framework presented by Ronneberger et al. (2015). The U-net model architecture is shown in **Figure 2**. The network is based on a convolutional neural network (CNN) and consists of an encoding and a decoding part. The contracting path, i.e., the encoding part (left side), repeatedly applies two (padded) 3 × 3 convolutional layers with stride 1, each followed by a rectified linear unit (ReLU), and a 2 × 2 max-pooling operation with stride 2 on 4 levels. A dropout layer is applied following the first convolutional layer. At each down-sampling step the dimensions of the input image are reduced by half and the number of feature channels is doubled. The bottom level includes two 3 × 3 convolutional layers without a pooling layer. The expansive path, i.e., the decoding part (right side), recovers the original dimensions of the input images by up-sampling the feature map, concatenating it with the corresponding feature channels from the contracting path, and applying two 3 × 3 convolutional layers, the first followed by a ReLU and a dropout layer and the second followed by a ReLU. The final layer is a 1 × 1 convolution that maps the feature vector to the binary prediction (i.e., vessel vs. non-vessel).

A variation of the U-net model architecture was applied, where the number of channels in each layer was consistently reduced to half. For simplicity, the additional architecture is therefore termed throughout this work as "half U-net."
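As a consistency check on the description above, the number of trainable parameters can be counted by hand. The sketch below is an illustration under stated assumptions that the text does not spell out: a single-channel input, 64 channels on the first level of the full U-net (32 for the half U-net), a bias term in every convolution, and parameter-free up-sampling followed by concatenation. Under these assumptions the count reproduces the parameter totals reported in the Results section.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out

def unet_params(base_channels, levels=4, in_channels=1, out_channels=1):
    """Count trainable parameters of the described U-net (assumptions in lead-in)."""
    widths = [base_channels * 2 ** i for i in range(levels + 1)]  # e.g. 64 ... 1024
    total = 0
    c_in = in_channels
    # Contracting path plus bottom level: two 3 x 3 convolutions per level.
    for w in widths:
        total += conv_params(c_in, w, 3) + conv_params(w, w, 3)
        c_in = w
    # Expansive path: parameter-free up-sampling, concatenation with the
    # skip connection, then two 3 x 3 convolutions.
    for w in reversed(widths[:-1]):
        total += conv_params(c_in + w, w, 3) + conv_params(w, w, 3)
        c_in = w
    # Final 1 x 1 convolution to the vessel probability map.
    total += conv_params(c_in, out_channels, 1)
    return total

print(unet_params(64))  # full U-net: 31377793
print(unet_params(32))  # half U-net: 7846081
```

Halving the channels divides the weights of almost every layer by four, which is why the half U-net has roughly a quarter of the parameters.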

<sup>2</sup>https://surfer.nmr.mgh.harvard.edu/fswiki/mri\_nu\_correct.mni (Accessed February 7, 2019).

<sup>3</sup>https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/BET/UserGuide (Accessed July 14, 2018).

<sup>4</sup>Neuroimaging in Python-Pipelines and Interfaces-Nipy Pipeline and Interfaces Package. Available online at: https://nipype.readthedocs.io/en/latest/ (Accessed July 14, 2018).

part, right side) that recovers the original dimensions of the input images. Each box corresponds to a multi-channel feature map. The dashed boxes stand for the concatenated copied feature maps from the contractive path. The arrows stand for the different operations as listed in the right legend. The number of channels is denoted on top of the box and the image dimensionality (x-y-size) is denoted on the left edge. The half U-net is constructed likewise, with the only difference given by the halved number of channels throughout the network.

The network is fed with 2D image patches and returns the 2D segmentation probability map for each given patch.

#### Model Training

The skull-stripped, denoised TOF input images and the corresponding ground-truth segmentation maps were used to train the U-net using the Keras implementation of the Adam optimizer (Kingma and Ba, 2014).

In the model, the energy function is computed by a pixel-wise sigmoid over the final feature map combined with the Dice coefficient loss function. The sigmoid is defined as

$$p(\mathbf{x}) = \frac{1}{1 + \exp\left(-a(\mathbf{x})\right)}$$

where a(x) denotes the activation in the final feature channel at the voxel position x ∈ Ω with Ω ⊂ ℤ<sup>2</sup>, and p(x) is the approximated probability of a voxel x to be a vessel. The Dice coefficient D between two binary volumes is formally defined as D = 2TP / (2TP + FP + FN), where TP is the number of true-positive voxels, FP is the number of false-positive voxels and FN is the number of false-negative voxels. Using the following derivation:

$$\begin{split} D &= \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}} \\ &= \frac{2\sum_{\mathbf{x} \in \Omega} p_{\mathbf{x}} g_{\mathbf{x}}}{2\sum_{\mathbf{x} \in \Omega} p_{\mathbf{x}} g_{\mathbf{x}} + \sum_{\mathbf{x} \in \Omega} \left(p_{\mathbf{x}}^2 - p_{\mathbf{x}} g_{\mathbf{x}}\right) + \sum_{\mathbf{x} \in \Omega} \left(g_{\mathbf{x}}^2 - p_{\mathbf{x}} g_{\mathbf{x}}\right)} \end{split}$$

the Dice coefficient can be written, with an additive smoothing factor s, as:

$$D = \frac{2\sum_{\mathbf{x} \in \Omega} p_{\mathbf{x}} g_{\mathbf{x}} + s}{\sum_{\mathbf{x} \in \Omega} p_{\mathbf{x}}^2 + \sum_{\mathbf{x} \in \Omega} g_{\mathbf{x}}^2 + s}$$

where p<sub>x</sub> ∈ P: Ω → {0, 1} is the predicted binary segmentation volume, g<sub>x</sub> ∈ G: Ω → {0, 1} is the ground-truth binary volume and s = 1 is an additive smoothing factor (i.e., Laplace smoothing). The Dice coefficient then penalizes at each position j the deviation of p<sub>j</sub> from the true label g<sub>j</sub> through the gradient:

$$\frac{\partial D}{\partial p_{j}} = 2 \left[ \frac{g_{j} \left( \sum_{\mathbf{x} \in \Omega} p_{\mathbf{x}}^2 + \sum_{\mathbf{x} \in \Omega} g_{\mathbf{x}}^2 + s \right) - p_{j} \left( 2\sum_{\mathbf{x} \in \Omega} p_{\mathbf{x}} g_{\mathbf{x}} + s \right)}{\left( \sum_{\mathbf{x} \in \Omega} p_{\mathbf{x}}^2 + \sum_{\mathbf{x} \in \Omega} g_{\mathbf{x}}^2 + s \right)^2} \right]$$

For s = 0 this reduces to the expression given by Milletari et al. (2016).

The choice of the Dice coefficient as the loss function makes it possible to handle the skewed ground-truth labels without sample weighting.
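A minimal NumPy sketch of the smoothed Dice coefficient and its gradient with respect to the predicted probabilities is given below, with a finite-difference check of the gradient obtained by the quotient rule. The function names and the random test data are assumptions for illustration. Whether training minimized 1 − D or −D is not stated in the text, so only D itself is shown.

```python
import numpy as np

def sigmoid(a):
    """Pixel-wise sigmoid over the final feature map."""
    return 1.0 / (1.0 + np.exp(-a))

def soft_dice(p, g, s=1.0):
    """Smoothed Dice coefficient between probabilities p and binary labels g."""
    return (2.0 * np.sum(p * g) + s) / (np.sum(p ** 2) + np.sum(g ** 2) + s)

def soft_dice_grad(p, g, s=1.0):
    """Gradient of soft_dice with respect to every element of p (quotient rule)."""
    num = 2.0 * np.sum(p * g) + s
    den = np.sum(p ** 2) + np.sum(g ** 2) + s
    return 2.0 * (g * den - p * num) / den ** 2

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 8))                    # hypothetical activations
g = (rng.random((8, 8)) < 0.1).astype(float)   # sparse labels, like vessels
p = sigmoid(a)

# Central finite difference at one voxel position j.
j, eps = (3, 4), 1e-6
p_plus, p_minus = p.copy(), p.copy()
p_plus[j] += eps
p_minus[j] -= eps
numeric = (soft_dice(p_plus, g) - soft_dice(p_minus, g)) / (2 * eps)
analytic = soft_dice_grad(p, g)[j]
```

Because both numerator and denominator only involve sums over foreground-weighted terms and squared values, the large number of correctly predicted background voxels contributes almost nothing, which is the property the text relies on for the skewed labels.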

A constructive initialization of the weights is necessary to ensure gradient convergence, while preventing the situation of "dead neurons," i.e., parts of the network that do not contribute to the model at all. This is particularly true for the case of deep neural networks with many convolutional layers and many different paths through the network. Here we applied a heuristic commonly used with the ReLU activation function, the Glorot uniform initializer, where the initial weights are drawn from a uniform distribution within the range [−L, L] with

$$L = \sqrt{\frac{6}{f_{\text{in}} + f_{\text{out}}}}$$

where f<sub>in</sub> is the number of input units in the weight tensor and f<sub>out</sub> is the number of output units in the weight tensor (Glorot and Bengio, 2010).
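For a convolutional kernel, the Glorot limit can be illustrated as follows. This is a sketch rather than the Keras source: the helper name `glorot_uniform` and the example kernel shape are assumptions, and the fan computation (kernel area times channel count) follows the convention Keras uses for convolution kernels.

```python
import numpy as np

def glorot_uniform(kernel_shape, rng):
    """Draw a conv kernel of shape (k_h, k_w, c_in, c_out) from U[-L, L]."""
    k_h, k_w, c_in, c_out = kernel_shape
    fan_in = k_h * k_w * c_in     # input units per output unit
    fan_out = k_h * k_w * c_out   # output units per input unit
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=kernel_shape), limit

rng = np.random.default_rng(0)
# Example: a 3 x 3 convolution from 64 to 128 feature channels.
weights, limit = glorot_uniform((3, 3, 64, 128), rng)
```

For this example layer, f<sub>in</sub> = 576 and f<sub>out</sub> = 1,152, so L = √(6/1,728) ≈ 0.059. Wider layers therefore start with smaller weights.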

The models were tuned in the validation process to optimize the hyperparameters learning-rate, batch-size, and dropout-rate in addition to the optimization of the patch-size as described above.

### Data Augmentation

TABLE 1 | Model parametrization.

Augmentation methods introduce variance to the training data, which allows the network to become invariant to certain transformations. While CNNs, and the U-net in particular, are very good at integrating spatial information, which is essential to image segmentation tasks, they are not equivariant to transformations such as scale and rotation (Goodfellow et al., 2016). Data augmentation methods like rotations and flips yield the desired invariance and robustness properties in the resulting network. In addition to flips and rotations, the data augmentation included shears as a derivative of elastic deformations, which are recommended as general best practice for convolutional neural networks (Ronneberger et al., 2015). The augmentation was applied on-the-fly on the patch level using the ImageDataGenerator function implemented in Keras.
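The key requirement of augmentation for segmentation is that image and label receive exactly the same transformation. The sketch below illustrates this with NumPy only. It is not the Keras ImageDataGenerator used in the study, and it is restricted to flips and right-angle rotations, since arbitrary rotations and shears would additionally require interpolation. The function name `augment_pair` is hypothetical.

```python
import numpy as np

def augment_pair(patch, mask, rng):
    """Apply one random flip/rotation identically to a patch and its label mask."""
    k = int(rng.integers(0, 4))                  # number of 90-degree rotations
    patch, mask = np.rot90(patch, k), np.rot90(mask, k)
    if rng.random() < 0.5:                       # horizontal flip
        patch, mask = np.fliplr(patch), np.fliplr(mask)
    if rng.random() < 0.5:                       # vertical flip
        patch, mask = np.flipud(patch), np.flipud(mask)
    return patch.copy(), mask.copy()

rng = np.random.default_rng(0)
patch = rng.normal(size=(16, 16))
mask = patch > 0.5                               # synthetic label tied to intensity
aug_patch, aug_mask = augment_pair(patch, mask, rng)
```

Because the synthetic label is defined by thresholding the patch, thresholding the augmented patch must give the augmented label, which is a simple way to verify that the two stay aligned.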

### Method Comparison

For comparison we used the graph cut implementation in the PyMaxFlow Python library, version 1.2.11 (Neila, 2018). This method is tailored for binary segmentation problems, where the combination of Markov Random Fields (MRF) with Bayesian maximum a posteriori (MAP) estimation turns the segmentation task into a graph-based minimization problem. The graph cut methodology then provides a computationally efficient solution to the minimization problem (Chen and Pan, 2018). We tuned the weights hyperparameter, representing the uniform capacity of the edges, on the validation set to determine the optimal setting. With this setting the algorithm was applied to the 14 patients of the test-set to produce segmentation images. We applied both patch-wise segmentation with a patch size of 96 and whole-slice segmentation, slice by slice.

### Performance Assessment

#### Quantitative Assessment

The model performance was assessed based on three different measures: the Dice coefficient, the 95th percentile Hausdorff distance (95HD) and the average Hausdorff distance (AVD). While the Dice coefficient serves as a common general measure for segmentation tasks, the 95HD and AVD metrics capture the boundary error of the branched and complex structure of brain vessels more accurately. In contrast to the Hausdorff distance, which is the maximum of the distance metrics, the 95HD and AVD are calculated as the 95th percentile and the average distance, respectively.

Therefore, 95HD and AVD are stable and not sensitive to outliers, which is an important quality in medical image analysis. The Dice coefficient is unitless and ranges from 0 to 1, with larger values indicating better performance, whereas 95HD and AVD represent real distances in units of voxels, so smaller values indicate better performance. The measures were calculated using the EvaluateSegmentation tool (Taha and Hanbury, 2015; Taha, 2018). We identified three final models for performance comparison: since we based our assessment on three different metrics (the Dice coefficient, the 95HD and the AVD), we chose for each metric the model that optimized it on the validation set. The performance was assessed as an average of the measures of all fully reconstructed vasculatures of the patients in the test-set as well as on the segmentations resulting from the graph cut approach.
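The three measures can be illustrated on a toy example. The sketch below is a brute-force NumPy version for small, non-empty masks only. It is not the EvaluateSegmentation tool, and the exact conventions (for instance, taking the maximum of the two directed values) are assumptions that may differ from that tool.

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient of two boolean masks."""
    tp = np.sum(pred & truth)
    return 2.0 * tp / (pred.sum() + truth.sum())

def directed_distances(a, b):
    """Distance from each foreground voxel of a to the nearest foreground voxel of b."""
    pa = np.argwhere(a).astype(float)
    pb = np.argwhere(b).astype(float)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return d.min(axis=1)

def hd95(pred, truth):
    """95th percentile Hausdorff distance in voxels (symmetric via maximum)."""
    return max(np.percentile(directed_distances(pred, truth), 95),
               np.percentile(directed_distances(truth, pred), 95))

def avd(pred, truth):
    """Average Hausdorff distance in voxels (symmetric via maximum)."""
    return max(directed_distances(pred, truth).mean(),
               directed_distances(truth, pred).mean())

# A thin "vessel" and a prediction that misses it by exactly one voxel.
truth = np.zeros((10, 10), dtype=bool)
truth[5, 2:8] = True
shifted = np.zeros((10, 10), dtype=bool)
shifted[6, 2:8] = True

print(dice(shifted, truth), hd95(shifted, truth), avd(shifted, truth))  # 0.0 1.0 1.0
```

The example shows why distance-based measures complement the Dice coefficient for thin structures: a one-voxel shift removes all overlap (Dice of 0), while 95HD and AVD still report that the prediction lies only one voxel away.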

#### Qualitative Assessment

For qualitative assessment, the predicted segmentation masks as well as the graph cut results of the 14 patients in the test-set were transformed by in-house Python code in which true positives (TP), false positives (FP), and false negatives (FN) were assigned different voxel values (true negatives (TN) remained labeled with 0). The images were then visualized by overlaying these new masks with the original scans using ITK-Snap (website: ITK-SNAP)<sup>6</sup>. By adjusting the opacity, it was possible to qualitatively assess which structures were correctly identified and which anatomical structures dominated with errors.

<sup>6</sup>http://www.itksnap.org/pmwiki/pmwiki.php (Accessed July 14, 2018).

FIGURE 3 | Exemplary patches used for training. Five pairs of random patches with increasing patch size from left to right are shown. "A" columns indicate the MRA-TOF-scans, whereas "B" columns indicate the ground truth label.

For each architecture and each model (2 architectures × 3 models = 6), VIM visually assessed the images per patient based on a predefined scheme. Large vessels were defined as all parts of the ACA and the M1, A1, and P1 segments of the three large brain arteries. All other parts were considered small vessels. The results of the visual analysis are summarized qualitatively in the results section. The scheme was the following:

Large vessels, overall impression (bad, sufficient, good); Small vessels, overall impression (bad, sufficient, good); Large vessels, errors (FP or FN dominating); Small vessels, errors (FP or FN dominating); Pathology detected (yes/no); Other tissues type segmentation errors (yes/no).
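The error maps that underlie this visual assessment can be sketched as below. The in-house code is not available, so the function name `error_map` and the specific label values (1, 2, 3) are hypothetical. Only the principle of assigning distinct values to TP, FP and FN while leaving TN at 0 is taken from the text.

```python
import numpy as np

TP, FP, FN = 1, 2, 3  # hypothetical label values; true negatives stay 0

def error_map(pred, truth):
    """Encode voxel-wise agreement between a prediction and the ground truth."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    out = np.zeros(pred.shape, dtype=np.uint8)
    out[pred & truth] = TP     # vessel predicted and present
    out[pred & ~truth] = FP    # vessel predicted but absent
    out[~pred & truth] = FN    # vessel present but missed
    return out

truth = np.array([[1, 1, 0, 0]])
pred = np.array([[1, 0, 1, 0]])
print(error_map(pred, truth))  # [[1 3 2 0]]
```

A map of this kind can be saved as a label image and overlaid on the original scan, with one color per value.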

### Generalization Assessment

All 6 models were applied on the additional 10 7UP patients with different TOF parameters as well as 10 patients with a different angiography modality (7T MPRAGE angiography). Segmentation quality was compared vs. the semimanual gold-standard labels as described above quantitatively with the EvaluateSegmentation framework using as metrics Dice, 95HD and AVD.

### Hardware

All trainings ran on a GPU workstation with 16 GB RAM, Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40GHz and a NVidia TITAN X (Pascal) GPU with 12 GB VRAM.

## RESULTS

The U-net model was trained on 81,000 extracted and augmented patches of 41 patients, validated using 11 full patient volumes and assessed for performance using the test-set of 14 full patient volumes. The U-net architecture resulted in 31,377,793 parameters, while the half U-net resulted in 7,846,081 parameters. For the U-net and half U-net, respectively, model training ran for ∼100 and 50 min, while segmentation of a previously unseen volume took about 20 and 10 s. The optimal patch-size was identified as 96 × 96 voxels for both suggested architectures. The model parametrization can be found in **Table 1**. Exemplary patches used for training can be seen in **Figure 3**.

TABLE 2 | Summary of test-set performance measures.

Detailed performance measures of the two Dice-optimized U-net model architectures and the baseline graph-cuts segmentation model as the averaged value over the 14 test patients. For the detailed models parametrization, see Table 1.

The U-net model yielded high performance in terms of the three measures that were comparable for both the full and the half architecture: The models optimized for the Dice coefficient had a Dice value of 0.89. The 95HD value was 47 voxels and the AVD models yielded results around 0.35 voxels. The detailed performance assessment of the finalized models is presented in **Table 2**. A representative overview of the visual analysis can be found in **Figure 4**.

The visual analysis of the three full models showed that consistently in all 14 patients the large vessels were segmented excellently. Only very few false positive voxels in the border zones of the vessels were present (see **Figure 5**). Small vessels were segmented well in half of the patients, in the other patients small vessels were segmented less well (see **Figure 6**). In 12 patients, we found false positive labeling of small parts of meningeal arteries present in the image or of venous structures (sinus and central veins) (see **Figures 4**, **6**). In one patient, tissue in an old infarct presented as cortical laminar necrosis with hyperintense elongated tissue against the dark cerebrospinal fluid. These parts were partially labeled as vessels (see **Figure 7A**). In another patient, a rete mirabile, a vessel network of small arteries developing due to occlusion, was present. The rete mirabile was only partially segmented (see **Figure 7B**).

Labels are shown in the first column and exemplary segmentation results are shown in the second column. The third column shows the error map, where red voxels indicate true positives, green voxels false positives, and blue voxels false negatives. Overall a high performance segmentation could be achieved. In the error maps it can be seen that false positives mainly presented as central venous structures and parts of meningeal arteries. (3D view is meant for an overall overview. Due to 3D interpolation, very small structures may appear differently in the images. This does not translate to real voxel-to-voxel differences. For direct voxel-wise comparison please use the 2D-images in Figures 5–8).

A comparison of the three models showed comparable performance and consistent artifacts as described above. There was a tendency, however, for the 95HD and AVD models to have fewer false positively labeled meningeal and venous structures than the Dice-optimized model. The visual comparison of the full and the half architecture showed comparable performance in the large vessels. We saw a tendency for slightly worse performance in smaller vessels in the half architectures. Vessel pathologies (stenosis/occlusion) were depicted in all patients and all models.

Graph cut results showed inferior performance to the U-net models with the following results (patch/whole slice): Dice 0.76/0.76, 95HD 58.8/59.2, and AVD 1.97/1.97 (detailed results can be found in **Table 2** and a visual example in **Figure 8**).

Generalization assessment showed a very good performance of the Dice-optimized models for the intra-modal comparison with 3T TOF images, with a Dice of 0.86/0.92, 95HD of 64.5/50.0, and AVD of 1.591/0.650 for the full U-net and half U-net, respectively. We found insufficient performance for the inter-modal comparison with 7T MPRAGE angiography (Dice around 0.60, 95HD around 50, and AVD around 3.5). Detailed results can be found in **Table 3**.

## DISCUSSION

In the current work we present a U-net deep learning framework for fully automated brain arterial vessel segmentation from TOF images of patients with cerebrovascular disease. Our framework demonstrated a very high quantitative performance based on three validation metrics. A lighter architecture, the half U-net, achieved comparable quantitative performance. Visual inspection showed excellent performance in large vessels and sufficient to good performance in small vessels, as well as comparable performance between the full architecture and the half U-net. Special cerebrovascular pathologies presented challenges for the network and need to be addressed in the future.

Applying a modified U-net framework as suggested by Ronneberger et al. (2015), we achieved a very high quantitative performance for the segmentation of arterial brain vessels in patients with cerebrovascular disease. To the best of our knowledge, our work is the first study to show the value of a U-net architecture for fully automated arterial brain vessel segmentation in cerebrovascular disease. Our results are therefore highly encouraging for the further development of automated clinical vessel segmentation tools for cerebrovascular disease. In contrast to the so-called "rule-based" non-neural-net attempts of the past, deep learning based networks do not require hand-crafted features or prior feature selection (Lesage et al., 2009; Zhao et al., 2017). The main reason for this is the inherent ability of U-nets to efficiently extract the relevant features in the training process. Confirming the broad consensus that deep learning based approaches constitute the new state-of-the-art in medical segmentation, the U-net architecture clearly outperformed the traditional graph-cut based segmentation method.

are shown in the first column and exemplary segmentation results are shown in the second column. The third column shows the error map, where red voxels indicate true positives, green voxels false positives, and blue voxels false negatives. Only few false positive voxels can be seen in the border zones of the vessels.

In addition to the quantitative assessment, an experienced medical professional visually assessed the quality of the segmentations. We found that the performance of the networks was also very high in the visual analysis. However, while we saw excellent performance for large vessels, the performance in smaller vessels was lower. While future networks should be improved regarding small vessel segmentation, the clinically most relevant vessels are the large vessels. Thus, we present evidence that already a relatively simple U-net architecture shows clinically highly relevant performance. Even higher performance can be expected using newer segmentation architectures, e.g., the MSnet (Shah et al., 2018). Also, when confronted with pathological cases, like cortical laminar necrosis and a rete mirabile, the performance of the network was limited. Here, specifically tailored datasets need to be incorporated in the training samples. Taken together, the high quantitative and qualitative performance of the U-net is very promising for the development of new individualized precision medicine tools for stroke and cerebrovascular disease in the clinical setting. Vessel parameters could augment predictive models in cerebrovascular disease (Feng et al., 2018; Livne et al., 2018; Nielsen et al., 2018).

Our results confirm previous works in the field of vessel segmentation. Recently, pre-prints on arXiv.org have explored deep neural net architectures for brain vessel segmentation in healthy subjects (Chen et al., 2017; Tetteh et al., 2018). The reported performance measures were comparable to our results. This confirms the advantages of deep learning approaches for vessel segmentation tasks. A current limitation, however, is the lack of a standardized labeled vessel imaging dataset. For other segmentation tasks, labeled datasets have been published in the past, usually within public competitions (website: grand-challenges)<sup>7</sup>. A big advantage of such datasets and the competition framework is that it makes models comparable. If different datasets and especially different types of labeling are used, results can be roughly compared qualitatively, but a direct quantitative comparison cannot be performed. If, however, models cannot be compared, the translation of these new methodologies into clinically usable tools is strongly hampered. Thus, the medical machine learning community needs to address this issue by providing standardized datasets of vessels both with and without pathology for segmentation tasks in the established form of competitions to allow proper benchmarking of methods. The authors of this study would happily contribute to such an effort.

<sup>7</sup>https://grand-challenge.org/All\_Challenges/ (Accessed July 14, 2018).

are shown in the first column and exemplary segmentation results are shown in the second column. The third column shows the error map, where red voxels indicate true positives, green voxels false positives and blue voxels false negatives. In 7 patients (50%) also small vessels were segmented well, with only few false negatives (B). In the other patients, the small vessels were segmented only sufficiently, with both false positives and false negatives (A).

We chose a simple architecture that closely followed the U-net suggested by Ronneberger et al. (2015). This resulted in roughly 30 million parameters. Promisingly, we found that a U-net with half of the convolution channels, coined half U-net, showed comparable performance, and yet consisted of only roughly 8 million trainable parameters. Naturally, the training of the half U-net can be done much faster, in our case in 50% of the time. This might be attributable to a limited variability of brain vessels as captured by the dataset, which allows less complex architectures to perform comparably. This is also shown by the fact that we used a simple 2D-patch approach with success. It seems that certain segmentation tasks do not necessarily need complex models and 3D approaches to reach sufficient performance. However, a systematic assessment of the necessary model complexity, particularly the number of feature channels, and of 2.5D and 3D approaches is warranted in future studies to find the optimal approach for vessel segmentation. Especially for small vessel detection, such approaches might be promising. The optimal patch-size was identified for both architectures as the largest tested value of 96 × 96 voxels. This may imply that an even larger patch-size would be more beneficial for the segmentation task. Such future optimization could potentially be done using advanced hardware or by increasing the patch-size at the expense of the batch-size.

An important part of the model training is the augmentation of the data. CNNs—and the encoder part of the U-net utilizes convolutional layers—are not equivariant to certain transformations, especially not rotations. It is thus absolutely essential to perform augmentations, especially when (relatively) few training examples are available (Ronneberger et al., 2015). The main principle of augmentation is that the newly generated data represents new information that would occur in the same domain where the original images stem from. We chose in our work the ImageDataGenerator implemented in Keras. It is a multi-purpose augmentation tool, that on one hand will generate helpful new training examples with high likelihood, but on the other hand will be naturally less specific than individually tailored augmentation strategies. Here, a highly promising approach is the application of Generative Adversarial Networks (GANs) for data augmentation, e.g., by Antoniou et al. (2017). The adversarial generative and discriminative networks ensure—if mode collapse can be avoided—that a large variety of new data is generated which all lie in the same domain as the original data. Such images would allow ideal augmentation for any segmentation task.

Generalization of our findings to other vessel segmentation tasks signifies an important implication of our work. While it is possible to achieve high performance for vessel segmentation with hand-crafted features and parameters optimized for a special case, e.g., for CT angiographies (Meijs et al., 2017), the development of such methods is time-consuming and a transfer of these results to images from other sources and other organs is hard to perform. In the case of a well-trained U-net, the convolutional layers have already learned the features necessary for the detection of vessels. Thus, it is possible to train new highly performant models for so far unseen vessel images by freezing the convolutional layers and by focusing the training on the rest of the model. This method is called "transfer learning" (Oquab et al., 2014) and requires only a few labeled datasets for each new source. Consequently, potential new tools can easily be adapted to various scanner settings, imaging modalities and even new organs, which is necessary for broad clinical adaptation and multicenter imaging studies.

part of the brain (arrow). On a transversal slice (on the right) false labeling of parts of cortical laminar necrosis can be identified as the cause (arrow). (B) The rete mirabile network of small vessels was only partially depicted (false negative labeling in blue in the error map). A rete mirabile is a relatively rare occurrence, only 3 patients of 66 in our study presented with one (2 in the training set and one in the test-set).

We assessed model performance based on three different measures: First, the Dice coefficient. Mathematically it is equivalent to the F1 measure and thus the harmonic mean of precision and recall (Taha and Hanbury, 2015). It is a widely used measure for segmentation tasks and its popularity is explained by its insensitivity to background voxels, its easy interpretability and its customizability to improve learning in hard-to-segment regions (Shah et al., 2018). Together with patch extraction, the use of the Dice coefficient allowed us to alleviate the imbalanced sample distribution in our dataset, as only 0.9% of all voxels in the brain depict vessels. However, based on theoretical considerations, the Dice coefficient is limited when assessing the validity of vessel segmentations (Taha and Hanbury, 2015). For example, since vessels are narrow and elongated, segmentation errors can quickly lead to loss of overlap. Once no overlap exists, the Dice coefficient cannot distinguish whether a segmented vessel is closer to (better segmentation) or further away from (worse segmentation) the ground truth. Here, distance-based measures are better suited (Taha and Hanbury, 2015), as they take into account the spatial position of voxels. Thus, we used two additional distance-based measures, the 95HD and the AVD. The plain Hausdorff distance was avoided due to its sensitivity to outliers. Promisingly, we saw a tendency that the models optimized by distance-based measures show improved results. It can thus be anticipated that customized loss functions incorporating distance-based measures will improve the performance of deep learning models for segmentation. Thus, future works should first systematically assess which metrics are best suited for brain vessel segmentation and then develop a customized loss function for vessel segmentation.
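The equivalence of the Dice coefficient and the F1 measure can be verified in one line, using the standard definitions precision = TP/(TP + FP) and recall = TP/(TP + FN):

$$\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \cdot \frac{\text{TP}}{\text{TP}+\text{FP}} \cdot \frac{\text{TP}}{\text{TP}+\text{FN}}}{\frac{\text{TP}}{\text{TP}+\text{FP}} + \frac{\text{TP}}{\text{TP}+\text{FN}}} = \frac{2\,\text{TP}^2}{\text{TP}\left(\text{TP}+\text{FN}\right) + \text{TP}\left(\text{TP}+\text{FP}\right)} = \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}} = D$$

True negatives do not appear anywhere in this expression, which is the formal reason for the insensitivity to background voxels mentioned above.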

A special focus of our work was the selection of the dataset. First, we used patients with pathology, in our case cerebrovascular disease. The vessels of such patients are more challenging to segment owing to stenoses and occlusions, old infarcts and small vascular networks ("rete mirabile"). Thus, our results are more representative of the clinical challenges than the results of works using the data of healthy patients. And indeed, we found that special pathological cases like cortical laminar necrosis and a rete mirabile were challenging for the network and need to be a focus of future works. In summary, our work serves as the starting point to develop new pathology-tailored models which are applicable in the clinical setting. In their training, random patch extraction should be avoided, and patch selection should be focused on the special cases identified in our study. Second, we labeled 66 patients, which is, in medical imaging, a large number of patients. This number is roughly double to triple the number of datasets used by Ronneberger et al. in their original U-net paper (Ronneberger et al., 2015). Since the U-net is tailored for use with limited data, our number of patients should allow for strong generalization and this is reflected by our high performance. Lastly, we invested a large effort into the labeling of the dataset. Every patient scan was labeled by a medical researcher and independently checked by 3 other medical researchers, 2 amongst them expert readers. It is very encouraging that two-digit numbers of high-quality labeled medical imaging datasets are sufficient to achieve very strong segmentation results with modern deep learning architectures. Labeling of such a number of patient scans is achievable in a justifiable time and opens the door for the development of high-performance models for the clinical setting for any medical segmentation task. It is to be expected that such models will soon be translated into applicable tools and will be available for research and the clinical setting.

Our study has several limitations. First, we used a monocentric dataset. Thus, imaging parameters and scanner parameters were fixed. In an intra-modal analysis, i.e., TOF-images from a different scanner with different parameters, the generalization performance was very good. In the inter-modal analysis, however, applying the models on MPRAGE-angiography images, the performance was considerably inferior. For the applicability in the clinical setting, two different strategies can be envisioned: (1) Since clinical on-site postprocessing is tied to the scanner-vendor and software, segmentation products tuned for vendor-specific sequences and parameter-ranges are possible and lack of generalization is unproblematic. (2) For development of vendor-independent pipelines, clinical segmentation algorithms need to cover a much broader range of image variability. Here, a large number of varied datasets needs to be used for training of single models or model zoos in the future. Second, we reconstructed the images on the patch level and did not perform an algorithm-based optimization of the whole reconstructed vessel tree. Here, future works can explore for example recurrent neural networks, especially architectures with long short-term memory (LSTM) layers. Applying these techniques, an increase of small vessel segmentation performance might be possible. Third, the patch size was limited due to hardware constraints, potentially reducing the performance of the network where more context is needed. Fourth, we performed an exploratory qualitative visual analysis by one medical expert. Future clinical assessments of different models should include a systematic quantitative rating by multiple medical expert readers, which exceeded the scope of the present work.

TABLE 3 | Summary of generalization assessment results.

Detailed performance measures of the two tested model architectures as the averaged value over 10 test patients each for 3T TOF and 7T MPRAGE angiography. The results for the two Dice optimized models are shown. The half U-net architecture exhibited better generalization performance.

### CONCLUSION

In conclusion, a U-net deep learning framework yielded high performance for vessel segmentation in patients with cerebrovascular disease. Future works should focus on improved segmentation of small vessels and removal of artifacts resulting from specific cerebrovascular pathologies.

### AUTHOR CONTRIBUTIONS

ML, TK, JS, JDK, KH, DF, and VM: concept and design; VM and JS: acquisition of data; ML, JR, AT, JDK, KH, and VM: model design; ML, JR, OA, EA, DF, and VM: data analysis; ML, JR, AT, TK, JS, JDK, KH, DF, and VM: data interpretation; ML, JR, OA, AT, EA, TK, JS, JDK, KH, DF, and VM: manuscript drafting and approval.

### FUNDING

This work has received funding from the German Federal Ministry of Education and Research through (1) the grant Centre for Stroke Research Berlin and (2) a Go-Bio grant for the research group PREDICTioN2020 (lead: DF). For open access publication, we acknowledge support from the German Research Foundation (DFG) via the Open Access Publication Fund of Charité - Universitätsmedizin Berlin.


**Conflict of Interest Statement:** While not related to this work, JS reports the following board memberships, consultancies, and/or payments for lectures including service on speaker's bureaus: Boehringer-Ingelheim, Sanofi, Bayer, Pfizer, and Maquet.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Livne, Rieger, Aydin, Taha, Akay, Kossen, Sobesky, Kelleher, Hildebrand, Frey and Madai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Deep Learning Based Attenuation Correction of PET/MRI in Pediatric Brain Tumor Patients: Evaluation in a Clinical Setting

Claes Nøhr Ladefoged, Lisbeth Marner, Amalie Hindsholm, Ian Law, Liselotte Højgaard and Flemming Littrup Andersen\*

Department of Clinical Physiology, Nuclear Medicine and PET, Rigshospitalet, Copenhagen, Denmark

Aim: Positron emission tomography (PET) imaging is a useful tool for assisting in correct differentiation of tumor progression from reactive changes. O-(2-18F-fluoroethyl)-L-tyrosine (FET)-PET in combination with MRI can add valuable information for clinical decision making. Acquiring FET-PET/MRI simultaneously allows for a one-stop-shop that limits the need for a second sedation or anesthesia as with PET and MRI in sequence. PET/MRI is challenged by lack of a direct measure of photon attenuation. Accepted solutions for attenuation correction (AC) might not be applicable to pediatrics. The aim of this study was to evaluate the use of the subject-specific MR-derived AC method RESOLUTE, modified to a pediatric cohort, against the performance of an MR-AC technique based on deep learning in a pediatric brain tumor cohort.

### Edited by:

Yangming Ou, Harvard Medical School, United States

#### Reviewed by:

Yi Su, Banner Alzheimer's Institute, United States Sergey M. Plis, The Mind Research Network (MRN), United States

#### \*Correspondence:

Flemming Littrup Andersen flemming.andersen@regionh.dk

#### Specialty section:

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

Received: 15 August 2018 Accepted: 13 December 2018 Published: 07 January 2019

#### Citation:

Ladefoged CN, Marner L, Hindsholm A, Law I, Højgaard L and Andersen FL (2019) Deep Learning Based Attenuation Correction of PET/MRI in Pediatric Brain Tumor Patients: Evaluation in a Clinical Setting. Front. Neurosci. 12:1005. doi: 10.3389/fnins.2018.01005

Methods: The modifications to RESOLUTE and the implementation of a deep learning method were performed using 79 pediatric patient examinations. We analyzed the 36 of these with active brain tumor area above 1 mL. We measured background (B), tumor mean and maximal activity (TMEAN, TMAX), biological tumor volume (BTV), and calculated the clinical metrics TMEAN/B and TMAX/B.

Results: Overall, we found both RESOLUTE and our DeepUTE methodologies to accurately reproduce the CT-AC clinical metrics. Regardless of age, both methods were able to obtain AC maps similar to the CT-AC, albeit with DeepUTE producing the most similar based on both quantitative metrics and visual inspection. In the patient-by-patient analysis DeepUTE was the only technique with all patients inside the predefined acceptable clinical limits. It also had a higher precision with relative %-difference to the reference CT-AC (TMAX/B mean: −0.1%, CI: [−0.8%, 0.5%], p = 0.54) compared to RESOLUTE (TMAX/B mean: 0.3%, CI: [−0.6%, 1.2%], p = 0.67) and DIXON-AC (TMAX/B mean: 5.9%, CI: [4.5%, 7.4%], p < 0.0001).

Conclusion: Overall, we found DeepUTE to be the AC method that most robustly reproduced the CT-AC clinical metrics per se, closely followed by RESOLUTE modified to pediatric cohorts. The added accuracy due to better noise handling of DeepUTE, ease of use, as well as the improved runtime makes DeepUTE the method of choice for PET/MRI attenuation correction.

Keywords: pediatric, deep learning, PET/MRI, attenuation correction, brain tumors, bone density, RESOLUTE

## INTRODUCTION

Positron emission tomography/magnetic resonance imaging (PET/MRI), combining MRI with radiolabeled amino acid analog tracers such as O-(2-18F-fluoroethyl)-L-tyrosine (FET), offers complementary information when imaging brain tumors (Watanabe et al., 1992; Buchmann et al., 2016), especially when estimating the true tumor extent in both low- and high-grade gliomas (Kracht et al., 2004; Vander Borght et al., 2006). The combined information from the two modalities can help to discriminate post-operative changes or radiation damage from true tumor relapse presenting with a contrast-enhanced region (Mullins et al., 2005; Vander Borght et al., 2006; Galldiks et al., 2015a,b). The experience with FET-PET in pediatric and adolescent patients is limited, but it has been shown that FET-PET can add valuable information for clinical decision making (Dunkl et al., 2015). For pediatric patients, there is a clear advantage of acquiring FET-PET simultaneously with conventional MRI, as it offers a one-stop-shop examination, limiting the need for a second sedation or anesthesia as with PET and MRI in sequence, and improves co-registration (Henriksen et al., 2016). The advantage of a simultaneous PET/MRI comes with the challenge of accurate attenuation correction (AC) in order for the FET-PET images to be quantitatively correct (Vander Borght et al., 2006).

The initial shortcomings of the vendor-provided AC have been solved for examinations of adult brains without abnormal anatomy to a clinically acceptable precision (Ladefoged et al., 2016), whereas MR-based brain AC methods targeted toward pediatric subjects are scarce. Traditional atlas-based methods are likely to fail, since they are based on a database of adult subjects with normal anatomy (Spick et al., 2016). A database of pediatric age-matched subjects (Bezrukov et al., 2015) is difficult to obtain and might not be sufficient to model anatomical deformations following surgical intervention. An obvious alternative, the MR-based segmentation methods, is often challenged by the fact that traditional MR sequences are not able to distinguish bone and air due to the short relaxation time in bone. However, with special sequences such as ultra-short echo time (UTE) and zero echo time (ZTE), cortical bone can have a high signal despite its very short spin-spin relaxation time (Robson et al., 2003). Unfortunately, the use of these sequences is often hampered by incorrect representation of tissues at air/tissue interfaces (Ladefoged et al., 2015; Sekine et al., 2016) that needs to be specially addressed if a bias is to be avoided. We have recently introduced a PET/MRI-AC method, RESOLUTE (Ladefoged et al., 2015), that makes use of UTE images to calculate an attenuation map with continuous bone representation, and overcomes the air/tissue interface challenges by using anatomical regional masks defined on an aligned template in MNI space. Within these masks, possible bias from the bone surrogate signal is limited. We have shown that RESOLUTE led to the same clinical diagnosis as the reference CT-AC in a challenging cohort consisting of adult post-surgical brain tumor patients with severe anatomical deformations (Ladefoged et al., 2017). A prerequisite for successful application of RESOLUTE to pediatric cohorts is that these masks should be defined on pediatric templates.

Recently, deep learning methods using convolutional neural networks have demonstrated that they are able to handle complex signals, including noise, while maintaining a high level of accuracy (Han, 2017; Liu et al., 2017; Gong et al., 2018; Leynes et al., 2018). Using this technique, it could therefore be possible to limit the air/tissue interface noise without regional masks, thereby avoiding the need for any registration, as well as benefitting from the improved inference speed usually associated with deep learning. Several techniques using deep learning for MR-AC have been proposed (Han, 2017; Liu et al., 2017; Gong et al., 2018; Leynes et al., 2018), but none have been evaluated on a challenging cohort such as pediatric patients.

The aim of this study was to adapt the original RESOLUTE method to a pediatric cohort, and to implement an MR-AC technique based on deep learning that takes the UTE images as input and returns an attenuation map without any registration steps or need for regional masks. In a pediatric brain tumor cohort, we evaluated the attenuation-corrected FET-PET images of the modified RESOLUTE method, the proposed deep learning method and the vendor-provided DIXON-AC method using CT-AC as reference standard, with the methods evaluated regionally, as well as with metrics used clinically for diagnosis and follow-up examinations.

### MATERIALS AND METHODS

### Patients

We included children with suspected brain tumor examined with FET-PET using our PET/MRI system (Siemens Biograph mMR, Siemens Healthcare, Erlangen, Germany) (Delso et al., 2011) between February 2015 and October 2017. In total, 86 FET-PET examinations of children under the age of 14 were identified. Seven examinations were removed due to missing or corrupt data, resulting in 79 scans used to develop the method (average age: 8 years, min: 2 months, max: 14 years). For evaluation of the four AC methods, we included patients with an active tumor area above 1 mL. Patients were part of a larger study of FET-PET/MRI in primary CNS tumors in children and adolescents approved by the regional ethical committee (ID: H-6-2014-095) and registered at clinicaltrials.gov (NCT03402425), and their parents gave written informed consent for participation.

### Acquisition of CT

A reference low-dose CT image (120 kVp, 36 mAs, 74 slices, 0.6 mm × 0.6 mm × 3 mm voxels) of the head was acquired using a whole-body PET/CT system (Biograph TruePoint 40 and 64, Siemens Healthcare) (Jakoby et al., 2009). The CT images were acquired either on the same day as the PET/MRI examination, or at a previous PET/MRI+CT examination with no brain-altering surgery in between. The longest time for any patient between PET/MRI and low-dose CT was 8 months.

### Acquisition of MRI

The scan protocol included two vendor-provided AC methods: a two-point DIXON-VIBE AC sequence with repetition time (TR) 2,300 ms, echo time 1 (TE1) 1.23 ms, echo time 2 (TE2) 2.46 ms, flip angle 10°, coronal orientation, 19 s acquisition time, voxel size of 2.6 mm × 2.6 mm × 3.12 mm, and a UTE AC sequence with TR/TE1/TE2 = 11.94/0.07/2.46 ms, a flip angle of 10°, axial orientation, 100 s acquisition time, software version VB20P, field of view (FOV) of 300 mm<sup>2</sup>, reconstructed on 192 × 192 × 192 matrices (1.6 mm × 1.6 mm × 1.6 mm voxels).

### Acquisition of FET-PET

Patients were positioned head first with their arms down on the fully integrated PET/MRI system. Data were acquired for 40 min immediately following injection of 3 MBq/kg (86 ± 37 MBq) FET (Langen et al., 2006) over a single bed position of 25.8 cm covering the head and neck. For the purpose of this study, the summed PET data 20–40 min after injection from the PET/MRI acquisition were reconstructed offline (E7tools, Siemens Medical Solutions, Knoxville, TN, United States) using 3D Ordinary Poisson-Ordered Subset Expectation Maximization (OP-OSEM) with 4 iterations, 21 subsets, zoom 2.5 and 5 mm Gaussian post-filtering on 344 × 344 matrices (0.8 × 0.8 × 2 mm<sup>3</sup> voxels) in line with the clinical protocol used at our institution. For all images, default random, scatter and dead time corrections were applied.

### Attenuation Correction Methods

Four methods for AC were applied to the data. First, the CT image was co-registered to the UTE TE2 image, and was used as our gold standard AC reference following conversion of Hounsfield Units as implemented on the Siemens PET/CT system. Second, vendor-provided MR-based attenuation maps were derived using the DIXON VIBE sequence (Martinez-Möller et al., 2009). Third, our recently proposed AC method, RESOLUTE, was updated to process the pediatric cohort in two areas: (1) the regional masks were re-drawn on pediatric templates in MNI space (Fonov et al., 2011) spanning the ages: 0–2 months, <1 year, 1–2, 2–4, 4–8, 8–11, and 11–14 years, and (2) the R<sub>2</sub><sup>∗</sup>-CT bone mapping was calculated for the pediatric patients by the use of a sigmoid fit rather than a polynomial (Juttukonda et al., 2015). RESOLUTE was derived for each pediatric patient, where we used 2-fold cross validation to ensure that the mapping was not performed on the same patients used to recalibrate the mapping. Lastly, we implemented an MR-AC method based on deep learning convolutional neural networks, denoted DeepUTE. The network was based on a modified version of the U-net architecture (Ronneberger et al., 2015; Çiçek et al., 2016), where the max pool operations were replaced with convolutions with stride 2 (Springenberg et al., 2014), and each convolution, initialized using the He normal initializer (He et al., 2015), is followed by a batch normalization, a rectified linear unit (ReLU) activation function, and a dropout layer with increasing fraction from 0.1–0.3 in the encoding part, and vice versa in the decoding part of the network (**Supplementary Figure 1**). The network takes as input 3D volumes consisting of 16 neighboring slices for each of the three UTE images, i.e., the two echo images and the derived R<sub>2</sub><sup>∗</sup> map (16 slices × 192 voxels × 192 voxels × 3 channels), and outputs the corresponding CT slices (16 slices × 192 voxels × 192 voxels × 1 channel).
We used the HU-converted co-registered CT image as our target. We trained the 3D-network in Keras (Chollet, 2015) with TensorFlow backend (Abadi et al., 2016) using the Adam optimizer (learning rate = 10<sup>−4</sup>) (Kingma and Ba, 2014), mean-squared-error as loss function, and a batch size of 2 for 100 epochs. Training the 35 million parameters of the network took 2 days on a Titan V (NVIDIA Corporation, Santa Clara, CA, United States) graphics processing unit. From our cohort of 79 scans, we did a 4-fold cross validation, effectively training 4 networks on approximately 60 scans each and evaluating on the remaining scans. During testing, we predicted the 3D pseudo-CT volumes around each slice, and computed the average voxel value for each of the overlapping volumes.
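The test-time averaging of overlapping 16-slice predictions can be sketched as follows. This is a minimal illustration and not the DeepUTE implementation: the function name `predict_volume`, the `predict_block` callable standing in for the trained network, the stride of one slice, and the handling of the first and last slices (which are simply covered by fewer blocks) are all assumptions.

```python
import numpy as np

def predict_volume(inputs, predict_block, depth=16):
    """Average overlapping block predictions into one pseudo-CT volume.

    inputs: array of shape (slices, rows, cols, channels).
    predict_block: callable mapping a (depth, rows, cols, channels) block
        to a (depth, rows, cols) prediction (hypothetical network stand-in).
    """
    n_slices = inputs.shape[0]
    if n_slices < depth:
        raise ValueError("volume has fewer slices than the block depth")
    total = np.zeros(inputs.shape[:3], dtype=np.float64)
    count = np.zeros(n_slices, dtype=np.float64)
    # Slide the block through the volume one slice at a time (assumed stride).
    for start in range(n_slices - depth + 1):
        block = inputs[start:start + depth]
        total[start:start + depth] += predict_block(block)
        count[start:start + depth] += 1
    # Each slice is divided by the number of blocks that covered it.
    return total / count[:, None, None]

# Toy usage: a stand-in "network" that averages the three input channels.
rng = np.random.default_rng(0)
toy_inputs = rng.random((20, 8, 8, 3))
pseudo_ct = predict_volume(toy_inputs, lambda block: block.mean(axis=-1))
print(pseudo_ct.shape)  # (20, 8, 8)
```

Averaging over all blocks that contain a given slice smooths out block-boundary effects at the cost of running the network once per block position.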

Since the CT coverage was usually less than the PET/MRI coverage, we added the DIXON-AC attenuation map outside the CT field-of-view. This was also done for the subsequently generated RESOLUTE and DeepUTE attenuation maps to allow for a fair comparison to the reference.

### Image Processing and Analysis

Image processing and analysis were performed similarly to our previous analysis of adult post-operative brain tumor FET-PET patients (Ladefoged et al., 2017). First, a background (B) region of interest was delineated in healthy appearing gray and white matter at a level above the insula in the contralateral hemisphere to the tumor. The biological tumor volume (BTV) of FET-PET was measured using a 3D auto-contour in Mirada XD software (Mirada Medical, Oxford, United Kingdom), defining tumor tissue at a threshold above 1.6 of the mean standardized uptake value (SUV) in the background ROI (Floeth et al., 2005) for each AC method separately. Extratumoral areas with high FET uptake, e.g., vascular structures, pineal body and skin, were identified on either the T1w or FET-PET image and removed from evaluation. The delineation was performed by a nuclear medicine specialist experienced in pediatric neuro-oncology (LM).
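The thresholding rule behind the BTV can be sketched in a few lines. The study used a 3D auto-contour in commercial software with manual removal of extratumoral uptake; the sketch below is only a simplified voxel-counting stand-in. The function name `biological_tumor_volume`, the `candidate_mask` argument (a hypothetical mask excluding extratumoral structures) and the toy image are assumptions; the 1.6 threshold and the 0.8 × 0.8 × 2 mm voxel size are taken from the text.

```python
import numpy as np

def biological_tumor_volume(suv, background_mask, candidate_mask,
                            voxel_mm=(0.8, 0.8, 2.0), ratio=1.6):
    """Return (BTV in mL, tumor mask) for a tumor-to-background threshold."""
    background_mean = suv[background_mask].mean()
    # Tumor voxels: uptake above ratio x mean background, inside the candidate region.
    tumor = (suv > ratio * background_mean) & candidate_mask
    voxel_ml = float(np.prod(voxel_mm)) / 1000.0  # mm^3 -> mL
    return tumor.sum() * voxel_ml, tumor

# Toy image: uniform background SUV of 1.0 with a 10 x 10 x 10 voxel "tumor" at 2.0.
suv = np.ones((20, 20, 20))
suv[5:15, 5:15, 5:15] = 2.0
background = np.zeros(suv.shape, dtype=bool)
background[0:3] = True                      # background ROI in normal tissue
candidates = np.ones(suv.shape, dtype=bool)  # nothing excluded in this toy case

btv_ml, tumor_mask = biological_tumor_volume(suv, background, candidates)
print(round(btv_ml, 2))  # 1.28
```

Because the threshold is relative to the background ROI, any AC-induced bias in the background shifts the contour, which is why the BTV was computed separately for each AC method.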

We assessed the different AC methods' ability to produce accurate FET-PET images on a patient-by-patient basis using the most commonly used semi-quantitative clinical metrics in the diagnostic workflow. We measured the biological tumor volume (BTV) and the tumor mean (TMEAN) and maximal (TMAX) activity, and calculated the ratios TMEAN/B and TMAX/B. For the BTV we analyzed the tumor contours relative to the CT-AC reference using the Jaccard similarity metric, and a measurement of shape deviations. The calculated ratios were compared to the ratios calculated with the reference CT-AC. These metrics are commonly used as a criterion to distinguish active tumor tissue from reactive changes. As described previously (Ladefoged et al., 2017), we defined acceptance criteria of ±0.05 or 5% for the TMEAN/B ratio, ±0.1 or 5% for the TMAX/B ratio, and ±2 mL or 10% for the BTV. These were based on differences in clinical practice that may be considered clinically relevant in identifying biologically active tumor tissue or treatment related change in activity (Piroth et al., 2011). The mix of both an absolute and relative cutoff reflects that larger absolute change is acceptable in large or very active tumors. For each clinical metric we calculated the mean difference, 95% confidence intervals (CI) and limits of agreement on the log-transformed data, as the data was found to have a log-normal distribution. Exponentiation was applied to these results to express the differences as ratios on the original scale and report them as percentage differences. We corrected for repeated measurements from the repeated examinations (Bland and Altman, 1999).
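A minimal sketch of the two computations described above is given here. It rests on stated assumptions: a measurement is taken to pass when either the absolute or the relative cutoff is met, the limits of agreement use the usual 1.96 × SD on the log scale, and the correction for repeated examinations is not included. The function names `within_limits` and `percent_agreement` are hypothetical.

```python
import numpy as np

def within_limits(value, reference, abs_limit, rel_limit):
    """True if the deviation meets the absolute OR the relative cutoff."""
    diff = abs(value - reference)
    return bool(diff <= abs_limit or diff <= rel_limit * abs(reference))

def to_percent(log_value):
    """Back-transform a log-ratio into a percentage difference."""
    return (np.exp(log_value) - 1.0) * 100.0

def percent_agreement(values, references):
    """Mean %-difference and 95% limits of agreement from log-ratios."""
    log_ratio = np.log(np.asarray(values, float) / np.asarray(references, float))
    mean = log_ratio.mean()
    sd = log_ratio.std(ddof=1)
    limits = (to_percent(mean - 1.96 * sd), to_percent(mean + 1.96 * sd))
    return to_percent(mean), limits

# TMAX/B criterion from the text: +/- 0.1 or 5%.
print(within_limits(2.08, 2.0, abs_limit=0.1, rel_limit=0.05))  # True (absolute)
print(within_limits(4.15, 4.0, abs_limit=0.1, rel_limit=0.05))  # True (relative)
print(within_limits(4.30, 4.0, abs_limit=0.1, rel_limit=0.05))  # False
```

Working on the log scale turns multiplicative errors into additive ones, so the back-transformed mean is a ratio to the reference rather than an absolute offset.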

### RESULTS

A total of 28 patients met the inclusion criterion of 1 mL active tumor area, 6 of whom had one or more follow-up examinations, resulting in a total of 36 examinations used for evaluation (**Supplementary Table 1**). Both RESOLUTE and DeepUTE were able to derive attenuation maps for all pediatric patients regardless of age. Ten of the 28 patients (35%) had titanium implants present. Overall, DeepUTE had improved accuracy over RESOLUTE: the Jaccard index was 0.57/0.62 in air, 0.74/0.79 in soft tissue and 0.53/0.70 in bone tissue for RESOLUTE/DeepUTE, respectively. The improved accuracy was also apparent in a direct visual comparison of the estimation of regional attenuation values in the nasal cavities, the skull base and the mastoid processes, and can be appreciated in **Figure 1**, where two patients with challenging anatomy are shown for RESOLUTE-AC, DeepUTE-AC and CT-AC, and in the relative difference PET image in **Supplementary Figure 2**. Another example of a typical patient is given in **Figure 2**. There was also a significant improvement in AC runtime, with values of 4 s for DeepUTE and ∼3 min for RESOLUTE, which, although small, improves the overall imaging workflow.

Across all pediatric patients, the Jaccard index of the tumor delineation was 0.73 ± 0.20 for DIXON-AC, 0.90 ± 0.07 for RESOLUTE and 0.92 ± 0.07 for DeepUTE. The tumor configuration did not change for any of the patients when using RESOLUTE or DeepUTE compared to CT-AC. With DIXON-AC, however, the configuration changed in 4 examinations (mean difference: 1.6 mL), and the tumor was completely missed in an additional examination (BTV with CT-AC: 2 mL).
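For reference, the Jaccard index used for the tumor delineations is the intersection over union of two binary masks. The sketch below assumes boolean NumPy masks on a common grid; the function name `jaccard`, the empty-mask convention and the toy masks are illustrative assumptions.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index |A∩B| / |A∪B| of two binary masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree
    return np.logical_and(a, b).sum() / union

# Toy example: a reference contour and the same contour shifted by one voxel.
reference = np.zeros((20, 20), dtype=bool)
reference[5:15, 5:15] = True   # 100 voxels
shifted = np.zeros_like(reference)
shifted[5:15, 6:16] = True     # 90 voxels overlap, 110 in the union

print(round(jaccard(reference, shifted), 3))  # 0.818
```

The Jaccard index J relates to the Dice coefficient D by D = 2J / (1 + J), so the two rank segmentations identically, although the Jaccard index gives lower values for partial overlap. A tumor that is missed entirely yields an index of zero.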

The comparison of the clinical metrics can be seen in **Figure 3**, together with the defined acceptable limits. Across all metrics, using DeepUTE, none of the patients were outside the acceptable limits, whereas two patients fell outside the TMAX/B limit and a single patient outside the TMEAN/B limit when using RESOLUTE. In these patients, the largest difference was a TMAX/B overestimation of 0.13 a.u. due to overestimated bone area in the skull base. In comparison, DIXON-AC gave a TMAX/B difference over the acceptable limit in 23/36 (64%) examinations, and 13/36 (36%) examinations had changes to BTV over the acceptable limit.

FIGURE 1 | Sample cases for two pediatric patients with irregular anatomy. (A) shows the T1w MPRAGE, (B) CT-AC, (C) RESOLUTE-AC, and (D) DeepUTE-AC. The top rows show a 5-year-old patient with post-operative subcutaneous soft tissue swelling in the occipital region. RESOLUTE erroneously fills in a dual bone layer on both sides of the swelling, along skin and bone. The bottom rows show a 6-year-old patient with air pockets anteriorly in the lateral ventricles that appeared after surgical intervention, and are not imaged in RESOLUTE. Also in this case RESOLUTE crafts a dual layered skull in the occipital region. For both patients, RESOLUTE is challenged in the definition of facial and skull base attenuation values. DeepUTE captures the morphology more confidently.

subtracted (A), respectively, and (F,G) shows the resulting relative difference in the PET images between RESOLUTE and DeepUTE relative to CT-AC, respectively. The improved accuracy in the nasal cavities, the skull base and the mastoid processes, leads to a clear reduction of the errors in the surrounding regions, e.g., in the medulla. It also appears that, for this patient, a small underestimation of the densities within the brain in DeepUTE leads to a small underestimation globally within the brain. The tumor delineation is shown on the sagittal view in (D–G).

The relative %-difference in the diagnostic measures was similar between RESOLUTE and DeepUTE, again with DeepUTE showing the smaller error and variation (**Table 1**). BTV measured using DeepUTE was underestimated by 2% on average (95% CI: −5 to 1%) compared to −1% (95% CI: −5 to 4%) with RESOLUTE. None of the metrics had statistically significant differences compared to the reference CT-AC. In comparison, DIXON-AC had statistically significant differences in all three clinical metrics (p < 0.001).

### DISCUSSION

Magnetic resonance imaging is the method of choice to diagnose brain tumor patients, but FET-PET can add valuable information for clinical decision making (Dunkl et al., 2015). Examining pediatric and adolescent patients on a hybrid PET/MRI can be preferred over PET/CT to reduce the number of examinations, which is especially relevant when anesthesia is required, and is important for both child and parents. A prerequisite for a confident clinical evaluation of the cohort with PET/MRI is an accurate AC. The skull shape, density, thickness, and composition change considerably during development in childhood, especially during the first three years, after which the sutures and fontanelles gradually calcify and close (Li et al., 2015). In particular, the rapid growth of skull thickness and bone density will highly influence attenuation, leading to errors in atlas-based methods that cannot account for the thin, low-density infant cranium.

In designing the clinical study, we were acutely aware of these unresolved AC issues and chose to include a separate low-dose CT acquisition. This could be performed safely in all children, although it involved moving sensitive patients to a different scanner for additional radiation exposure and, for some children, extending anesthesia. This additional stress on the patients was regarded ethically acceptable so that future use of hybrid PET/MRI in pediatric brain tumors, which could be one of the most important applications, could be performed with the best possible assessment of risk to the patient caused by quantitative inaccuracies using accepted standard metrics within the field.

standard CT-AC. The black lines indicate the acceptance criteria of TMEAN/B of ± 0.05 or 5%, TMAX/B of ± 0.1 or 5%, and BTV of ± 2 mL or 10%, respectively. Points that exceed the criteria have been colored. The ages of the children exceeding the threshold using RESOLUTE are 7, 7, and 11 years, respectively. Note the difference on the axes. The dashed gray line indicates the mean value.

We modified the already thoroughly evaluated RESOLUTE method to be applied on pediatric patients, as well as introduced an MR-AC method based on a deep learning convolutional neural network, and also included DIXON-AC. The novelty of DeepUTE does not lie in the chosen type of architecture, but rather in the data that went into training the model. This manuscript is, to the best of our knowledge, the first of its kind to train a deep learning network for MR-AC purposes on a pediatric cohort of this size. The included patients in the evaluated cohort are well suited to test the method's ability to adapt to anatomy changes across different ages.

Pediatric patients are a challenging cohort to examine due to motion, often leading to sedation or anesthesia of the patients. The patients included in this study had, as expected, a larger amount of noise in the MR images than adult patients, leading to an increased amount of noise in the bone surrogate signal. The strength of the DeepUTE method is that it is able to robustly handle this noise, which deep learning methods are known for. An example of the improved noise handling is evident in **Figure 1**, where DeepUTE better models both the thin bone and noise at the posterior part of the head.

TABLE 1 | Summary of the relative %-difference<sup>∗</sup> to the reference CT-AC of each clinical metric for the MR-AC methods.

<sup>∗</sup>Exponentiation was applied to results from analysis on log scale, and results were expressed as percentages. <sup>∗∗</sup>Indicates a statistically significant difference (p < 0.05) found by a paired t-test. CI = 95% confidence interval for mean difference. BTV is measured in mL. A single examination without BTV with DIXON-AC was left out of this analysis.

Titanium alloy clamps, which were present in 33% of the patients to fix the craniotomy, showed up as small signal voids in the MR images with a size similar to the implants seen on CT. Visual reading showed that both RESOLUTE and DeepUTE filled the signal void with a density similar to dense bone, similar to what has previously been observed (Ladefoged et al., 2017). This meant that a valid attenuation map without artifacts could be calculated in all scans using RESOLUTE and DeepUTE.

Overall, we found both RESOLUTE and our DeepUTE methodology to accurately reproduce the CT-AC clinical metrics with similar accuracy as was seen for RESOLUTE when evaluating adult FET-PET brain tumor patients (Ladefoged et al., 2017). Regardless of age, both methods were able to obtain AC maps similar to the CT-AC, albeit with DeepUTE producing the most similar based on both quantitative metrics and visual inspection. In the patient-by-patient analysis, all patients were inside the predefined acceptable clinical limits with DeepUTE, whereas three patients (7–11 years old) were outside the limits in the TMAX/B or TMEAN/B metrics when using RESOLUTE (**Figure 3**). A similar result was obtained with RESOLUTE for the adult FET-PET brain tumor patients (Ladefoged et al., 2017) where 5/68 studies exceeded the predefined limit. The errors from RESOLUTE were due to an overestimation of bone density in known "problem" areas near the skull base, but none of the errors impacted the clinical reading of the images. In comparison, DeepUTE-AC had a higher precision in the skull base for the same patients, leading to more accurate measurements. The confidence interval was narrower when using DeepUTE compared to RESOLUTE (**Table 1**). This indicates that there is a smaller variation of the errors in DeepUTE compared to RESOLUTE.

The processing in RESOLUTE was the same for all patients, except for the combination of the segmented tissue maps within regional masks, as these differ depending on the patient age. In DeepUTE, the same method was applied regardless of patient age. Further dividing the training patients into smaller groups depending on age might further reduce the variance, but requires more data, as training a deep learning network with too few patients leads to overfitting. We did not apply transfer learning in this study, as it has been shown that training a deep learning network using fewer than 30 patients is feasible (Han, 2017; Liu et al., 2017; Gong et al., 2018; Kläser et al., 2018; Leynes et al., 2018). However, using transfer learning, e.g., from a larger adult cohort, might further improve the results presented here, as the low-level information is expected to be similar between the cohorts.

In software version VB20P on the Siemens mMR, two vendor-provided solutions for AC are available, DIXON-AC and UTE-AC, both of which have been used in the published pediatric neuro-oncology PET/MRI literature (Garibotto et al., 2013; Preuss et al., 2014; Fraioli et al., 2015), although encompassing only 6 and 12 patients, respectively. This small patient sample may reflect hesitation from the clinical community to use PET/MRI routinely in this difficult patient group because of the well-documented systematic underperformance of particularly DIXON-AC (Andersen et al., 2014; Ladefoged et al., 2017), which is also apparent from our study. DIXON-AC was the only vendor-provided method capable of producing attenuation maps for the full pediatric cohort. In four patients aged 0–2 years, UTE-AC was not able to produce an attenuation map, which is why we chose to exclude UTE-AC from the comparison.

In this study, we only had 6 patients with repeat examinations. We found that the change of TMEAN/B, TMAX/B and BTV between two examinations with RESOLUTE or DeepUTE were in congruence with the change when measured with CT-AC, as none of the differences were outside the acceptable limit. A larger number of repeat examinations should confirm this.

### Limitations

We did not have pediatric data available after the software upgrade to VE11P, which adds a model-based AC method (Paulus et al., 2015; Koesters et al., 2016), but we speculate that the method would be unsuccessful for the younger pediatric cohort since the method was developed for adults.

Both RESOLUTE and DeepUTE are based on the UTE sequence, so while we expect DeepUTE to be directly transferable to any Siemens mMR, which is the case for RESOLUTE, neither method is able to produce attenuation maps from PET/MRI data from other vendors. The fundamental idea behind DeepUTE is not limited to UTE data, and retraining the network on other MR sequences such as the T1w MPRAGE or ZTE could allow for a multi-vendor method. However, it would require a large pediatric dataset across several vendors to confirm this.

Although the limits of agreement using RESOLUTE and DeepUTE are encouragingly narrow (**Table 1**), the number of patients in each age category is still small. Thus, we cannot rule out artifacts caused by other combinations of anatomy and pathology.

### CONCLUSION

The present study performed on FET-PET/MRI examinations of pediatric patients revealed that both RESOLUTE and our deep learning method DeepUTE are able to robustly produce attenuation maps similar to the reference CT-AC. The clinical metrics were within acceptable limits of the reference CT-AC, making either method suitable for imaging of pediatric brain tumor patients, a cohort that is especially challenging for atlas-based methods. For clinical use of any MR-AC map, however, we recommend visual inspection for artifacts with particular attention to areas close to the skull base, anatomically distorted tissue and metal implants. The added accuracy due to better noise handling of DeepUTE, its ease of use without the need for regional masks, and its improved runtime make DeepUTE the method of choice for PET/MRI AC. Further refinement of the deep learning method with age-specific data input is likely to improve the performance.

### AUTHOR CONTRIBUTIONS

CL designed the method, did the data analysis, and prepared the manuscript. LM and IL designed the method, aided in data analysis, and revised and approved the manuscript. AH, FA, and LH aided in data acquisition, and revised and approved the manuscript.

### FUNDING

The generous support by The Danish Childhood Cancer Foundation is highly appreciated (2014-34, 2015-48).

### ACKNOWLEDGMENTS

We highly appreciate the scanner assistance of technologists Karin Stahr and Marianne Federspiel, and of radiographers Jákup Martin Poulsen and Nadia Azizi. The John and Birthe Meyer Foundation is thanked for their generous donation of the Siemens mMR hybrid PET/MR system at Copenhagen University Hospital Rigshospitalet.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2018.01005/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ladefoged, Marner, Hindsholm, Law, Højgaard and Andersen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

fnins-12-00716 October 4, 2018 Time: 15:24 # 1

# Analysis of Progression Toward Alzheimer's Disease Based on Evolutionary Weighted Random Support Vector Machine Cluster

Xia-an Bi<sup>1</sup> \*, Qian Xu<sup>1</sup> , Xianhao Luo<sup>2</sup> , Qi Sun<sup>1</sup> and Zhigang Wang<sup>1</sup>

<sup>1</sup> College of Information Science and Engineering, Hunan Normal University, Changsha, China, <sup>2</sup> College of Mathematics and Statistics, Hunan Normal University, Changsha, China

The progression toward Alzheimer's disease (AD) can be described in four stages: healthy control (HC), early mild cognitive impairment (EMCI), late MCI (LMCI) and AD dementia. Discriminating between the different stages of AD is important for future pre-dementia treatment. However, it is still challenging to distinguish LMCI from EMCI because the imaging changes are subtle. In addition, relatively few studies have made inferences about the dynamic brain changes in the cognitive progression from EMCI to LMCI to AD. Motivated by these problems, we propose an evolutionary weighted random support vector machine cluster (EWRSVMC), in which the predictions of numerous weighted SVM classifiers are aggregated to improve the generalization performance. We validated our method in multiple binary classifications using the Alzheimer's Disease Neuroimaging Initiative dataset. Accuracies of 90% for EMCI/LMCI and 88.89% for LMCI/AD were achieved, demonstrating excellent discriminating ability. Furthermore, disease-related brain regions underlying AD progression could be identified on the basis of the amount of discriminative information. The findings of this study provide considerable insight into the neurophysiological mechanisms of AD development.

Keywords: Alzheimer's disease progression, functional connectivity, classification, disease-related brain regions, evolutionary weighted random support vector machine cluster

#### Edited by:

Nianyin Zeng, Xiamen University, China

#### Reviewed by:

Tian Wang, Huaqiao University, China; Li Xiao Yan, The Sixth Affiliated Hospital of Sun Yat-sen University, China

> \*Correspondence: Xia-an Bi bixiaan@hnu.edu.cn

#### Specialty section:

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

Received: 10 August 2018 Accepted: 19 September 2018 Published: 08 October 2018

#### Citation:

Bi X-a, Xu Q, Luo X, Sun Q and Wang Z (2018) Analysis of Progression Toward Alzheimer's Disease Based on Evolutionary Weighted Random Support Vector Machine Cluster. Front. Neurosci. 12:716. doi: 10.3389/fnins.2018.00716

### INTRODUCTION

Alzheimer's disease (AD) is a devastating neuro-cognitive disorder of the human brain (Keren-Shaul et al., 2017; Kodis et al., 2018), which is characterized by the progressive loss of cognition and memory in elderly adults (Roy et al., 2016). With the aging of the global population, the number of individuals suffering from AD will increase (Novak et al., 2017). It is predicted that more than 100 million elderly people worldwide will be affected by AD by 2050 (Cortes-Canteli et al., 2015; Branca and Oddo, 2017). Therefore, the identification of AD, and particularly of its transitional phase, namely mild cognitive impairment (MCI), has received growing attention in recent years (Cui et al., 2018). Individuals diagnosed with MCI can be further subdivided into early MCI (EMCI) and late MCI (LMCI) (Lee et al., 2017), and the distinguishing criteria for EMCI and LMCI have previously been described in the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort (Nuttall et al., 2016). At present, there is still no therapy to prevent or reverse the AD pathological process (Forster et al., 2017). It is hence important to develop a new approach that can identify different stages of AD to enhance the understanding of AD pathophysiological progression, which is helpful for preclinical AD studies.

A wide range of neuroimaging techniques can be used to image human brain function and structure, e.g., diffusion tensor imaging (DTI), magnetic resonance spectroscopy (MRS), electroencephalogram (EEG), and functional magnetic resonance imaging (fMRI) (Busato et al., 2016; Thanh Vu et al., 2017). Owing to its high temporal and spatial resolutions, fMRI, especially resting-state fMRI, has recently gained popularity in the investigation of whole-brain neural connectivity (Goense et al., 2016). As an advanced brain imaging technology, resting-state fMRI has shown great potential in providing comprehensive information for the identification of neurological diseases (Phillips, 2012; Rosa et al., 2015). Accordingly, non-invasive resting-state fMRI is well suited to unfolding the complexity of the brain connectivity network and examining the dynamic brain changes from EMCI to LMCI to AD.

Machine learning (ML) techniques have been extensively used in automatic pattern recognition based on imaging data (dos Santos Siqueira et al., 2014; Moradi et al., 2015; Wang et al., 2018; Zeng et al., 2018). In the existing literature, there has been widespread interest in using ML methods to classify different stages of AD. Nozadi et al. (2018) employed a random forest (RF) algorithm based on a whole-brain approach and achieved accuracies of 72.5 and 81.7% for 164 EMCI versus 189 LMCI and 189 LMCI versus 99 AD, respectively. Goryawala et al. (2015) reported accuracies of 73.6 and 90.1% for 114 EMCI versus 91 LMCI and 91 LMCI versus 55 AD using linear discriminant analysis (LDA). Jie et al. (2018) utilized a multi-kernel SVM and obtained an accuracy of 78.8% in classifying 56 EMCI from 43 LMCI. It is noteworthy that the discrimination between EMCI and LMCI is more challenging than that between LMCI and AD.

In order to improve the classification performance, especially for EMCI and LMCI, and to enhance the understanding of neuropathology in AD progression, a new method, the evolutionary weighted random SVM cluster (EWRSVMC), is presented in this paper to diagnose different stages of AD. The EWRSVMC combines multiple weighted SVM classifiers to make the final decision, which is expected to be more stable and robust than individual classifiers such as an artificial neural network or a decision tree. In addition, the EWRSVMC employs a method of evolution to guide feature selection and explore the optimal feature set for better classification performance. We performed experiment 1 for EMCI/LMCI classification and experiment 2 for LMCI/AD classification, yielding accuracies of 90 and 88.89%, respectively, using this new framework. Furthermore, the disease-related brain regions were ranked according to the frequencies of the corresponding optimal features, so that the top-ranked brain regions could be identified. On the one hand, several high-frequency brain regions [e.g., superior temporal gyrus (STG.R), insula (INS.L) and middle temporal gyrus (MTG.L)] appear in both experiments, which suggests that these brain regions play crucial roles in the progression of AD. On the other hand, some brain areas displayed high frequencies in only one experiment [e.g., superior frontal gyrus (SFGmed.L) and olfactory cortex (OLF.R) in experiment 1, and parahippocampal gyrus (PHG.L) and posterior cingulate gyrus (PCG.L) in experiment 2], which helps in understanding differences in disease progression. These findings are in agreement with previous studies on AD (Douaud et al., 2013; Xiang et al., 2013; Zhu et al., 2014) and provide a novel perspective on the neurophysiological mechanisms of AD progression.

### MATERIALS AND METHODS

### Subjects

The neuroimaging data used in this study came from the ADNI cohort<sup>1</sup> (Morris et al., 2014). We collected the resting-state fMRI data of 105 participants, comprising 42 EMCI patients (18 male, average age 72.34 years), 38 LMCI patients (23 male, average age 72.99 years) and 25 AD subjects (12 male, average age 74.59 years). Every participant had clinical dementia rating (CDR) scores and mini-mental state examination (MMSE) scores to ensure that the data were homologous. A chi-squared test was used for gender comparisons and a two-sample t-test was used for age, MMSE and CDR comparisons. The detailed demographic information for the patient cohorts is listed in **Table 1**.

All participants were scanned in a Siemens TRIO 3 Tesla machine using the same scanning parameters: 64 × 64 acquisition matrix; flip angle = 80°; echo time (TE)/repetition time (TR) = 30/3000 ms; pixel spacing Y/pixel spacing X = 3.3/3.3 mm; 140 image volumes; 48 axial slices; 3.313 mm slice thickness with no gap. During the scan, all participants were asked to lie still with their eyes closed, stay awake, and not think of anything in particular (Liu et al., 2018).

### Data Preprocessing

The same image preprocessing for EMCI, LMCI and AD patients was performed using the Data Processing Assistant for Resting State fMRI (DPARSF) toolbox (Dan et al., 2017). Briefly, the data were preprocessed in nine steps: (1) converting the data into NIFTI format; (2) excluding the first 10 volumes; (3) slice-timing correction; (4) realignment for head movement compensation; (5) normalization; (6) smoothing (utilizing a Gaussian kernel); (7) removing the linear trend; (8) temporal band-pass filtering; and (9) regressing out the nuisance signals.

<sup>1</sup>http://adni.loni.usc.edu/

<sup>a</sup>The P-value of the comparison between the EMCI and LMCI. <sup>b</sup>The P-value of the comparison between the LMCI and AD.

### Functional Connectivity Features

The brain is a dynamic system constructed by large-scale complex networks comprised of the connections between different brain regions (Braun et al., 2015). In this paper, we employ a popular automated anatomical labeling template (Rolls et al., 2015) to divide the cerebrum into 90 brain areas (45 in each hemisphere). A representative resting-state fMRI signal for each brain region is generated by averaging the time series of the voxels within that region. The Pearson correlation coefficient between the representative signals of each pair of brain regions is computed and treated as a proxy of functional connectivity (FC) (Noble et al., 2017). As a result, a total of 4005 (90 × 89/2) FCs are obtained for each subject and serve as predictor features for the proposed EWRSVMC algorithm.
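The feature construction described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the region signals are synthetic Gaussian noise rather than ADNI data, the 130 time points correspond to the 140 acquired volumes minus the 10 discarded ones, and the helper names `pearson` and `fc_features` are ours, not part of DPARSF or any other toolbox.

```python
import math
import random
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fc_features(region_signals):
    """Return one correlation per unordered pair of regions (upper triangle)."""
    pairs = combinations(range(len(region_signals)), 2)
    return [pearson(region_signals[i], region_signals[j]) for i, j in pairs]


# Synthetic stand-in data: 90 regions x 130 time points of Gaussian noise.
rng = random.Random(0)
signals = [[rng.gauss(0.0, 1.0) for _ in range(130)] for _ in range(90)]
features = fc_features(signals)
print(len(features))  # 90 * 89 / 2 = 4005
```

Only the upper triangle of the correlation matrix is kept because the matrix is symmetric and its diagonal is identically 1, which is where the count of 90 × 89/2 comes from.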

## The Evolutionary Weighted Random SVM Cluster

#### EWRSVMC Design

Machine learning techniques are widely used for pattern recognition (Zeng et al., 2017), and among them the SVM model has recently become increasingly popular in the analysis of neurological disease based on high-dimensional imaging data. Nevertheless, it is difficult for a single SVM classifier to achieve excellent diagnostic performance because of the noise in brain imaging data. Bi et al. (2018) put forward a random SVM cluster (RSVMC) in which multiple SVM classifiers are combined for a final decision, and which outperforms an individual SVM classifier. However, the diagnostic power of the individual classifiers in the ensemble may differ greatly. The RSVMC method ignores the fact that an individual SVM classifier with a relatively high training error is likely to vote wrongly on new samples, which is likely to degrade the discriminative ability of the ensemble. Accordingly, there is still room for improvement over the RSVMC method.

This paper presents a novel algorithm, the EWRSVMC, with two successive steps: the construction and the evolution of a weighted ensemble of SVMs. First, in order to reduce the influence of weak classifiers on the voting, the classification accuracy of each SVM classifier is calculated on the validation set and used as the weight of that classifier. The output of the EWRSVMC is a weighted average of the outputs of multiple SVMs, which can further reduce the classification error rate. Second, in order to select the most discriminative features from a large-scale feature vector, a method of evolution is introduced to dynamically eliminate redundant features and further improve the final classification performance. The idea of our proposed architecture is shown in **Figure 1**, where each row and each column of the data matrices on the left corresponds to a subject and a feature, respectively.

We denote the connectivity feature vectors as X = {x<sub>1</sub>, ..., x<sub>k</sub>, ..., x<sub>N</sub>} ∈ ℝ<sup>N × d</sup>, where N and d are the numbers of subjects and features, respectively. y<sub>i</sub> ∈ {+1, −1} is the class label representing two different states (e.g., EMCI or LMCI). The construction of the weighted random SVM cluster is performed using the following steps:


$$W\_l = \frac{T\_l^{correct}}{T\_L} \tag{1}$$

where T<sub>l</sub><sup>correct</sup> denotes the number of validation samples correctly classified by the l-th SVM classifier, and T<sub>L</sub> represents the number of validation samples.

(4) Step 4: Step 2 to step 4 are repeated n times to build a weighted ensemble of n SVM classifiers.

Following the above steps, a weighted ensemble of multiple SVM classifiers could be constructed and then an approach of evolution is applied to the ensemble classifier to guide feature selection.
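As a small illustration of Eq. 1, the weight of each base classifier is simply its accuracy on the validation samples. The sketch below assumes that the predictions of each classifier are already available as lists of labels; the labels and predictions are made up for the example, and `classifier_weight` is a hypothetical helper name.

```python
def classifier_weight(predicted, actual):
    """Eq. 1: fraction of validation samples that one classifier labels correctly."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(1 for p, y in zip(predicted, actual) if p == y)
    return correct / len(actual)


# Hypothetical validation labels and the outputs of three base classifiers.
y_val = [+1, +1, -1, -1, +1, -1, -1, +1]
val_predictions = [
    [+1, +1, -1, -1, +1, -1, -1, +1],  # all 8 correct
    [+1, -1, -1, -1, +1, -1, +1, +1],  # 6 of 8 correct
    [-1, -1, +1, -1, +1, +1, +1, -1],  # 2 of 8 correct
]
weights = [classifier_weight(p, y_val) for p in val_predictions]
print(weights)  # [1.0, 0.75, 0.25]
```

With these weights, the third classifier (accuracy below 0.5) would count as a weak classifier in the evolution step that follows.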

Specifically, the SVM classifiers whose classification accuracies are lower than 0.5 are first picked out from the weighted random SVM cluster and considered as weak classifiers. The remaining SVM classifiers are regarded as strong classifiers because of their good performance. Then the features selected by the weak classifiers are identified and the weights corresponding to the common features are accumulated. The total weight of each feature in the weak classifiers is denoted as Tw<sub>j</sub>:

$$Tw\_j = \sum\_{l=1}^{p} w\_{l,j} \tag{2}$$

where p is the number of weak classifiers, and w<sub>l,j</sub> represents the weight of the j-th feature corresponding to the l-th weak classifier.

Next, we remove the features whose total weight Tw<sub>j</sub> exceeds a certain threshold q, because these features play crucial roles in the weak classifiers and are likely to contribute little to the performance of the overall system. As a result, we obtain the remaining features with lower total weights in the weak classifiers, together with all the features determined by the strong classifiers, as an evolutionary feature set, which reduces the dimensionality of the total feature space. Finally, this evolutionary feature set is employed to rebuild a weighted random SVM cluster for further reduction of the feature dimensionality. This procedure is repeated iteratively until the preset number of evolutions is reached. The optimal EWRSVMC, i.e., the one with the highest accuracy during the evolution process, is then identified, and the features determined by this optimal EWRSVMC are considered as the optimal feature set. The feature selection procedure of the EWRSVMC is shown in **Figure 2**.
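One evolution step can be sketched as follows. Two points are assumptions on our part rather than details given in the text: the per-feature weight w<sub>l,j</sub> in Eq. 2 is taken to be the weight W<sub>l</sub> of classifier l whenever that classifier uses feature j (and 0 otherwise), and a feature that is flagged in the weak classifiers but also used by a strong classifier is kept, following the wording "all the features determined by the strong classifiers". The function name `evolve_feature_set` and the toy ensemble are hypothetical.

```python
from collections import defaultdict


def evolve_feature_set(classifiers, q, weak_cutoff=0.5):
    """One evolution step of the feature set.

    classifiers: list of (weight, feature_indices) pairs, one per base SVM.
    Assumption: w_{l,j} in Eq. 2 equals the classifier weight W_l when
    classifier l uses feature j, and 0 otherwise.
    """
    total_weak_weight = defaultdict(float)  # Tw_j in Eq. 2
    strong_features = set()
    for weight, features in classifiers:
        if weight < weak_cutoff:            # weak classifier
            for j in features:
                total_weak_weight[j] += weight
        else:                               # strong classifier
            strong_features.update(features)
    flagged = {j for j, tw in total_weak_weight.items() if tw > q}
    kept_from_weak = set(total_weak_weight) - flagged
    # Evolutionary feature set: low-weight weak features plus all strong features.
    return sorted(kept_from_weak | strong_features), sorted(flagged)


# Toy ensemble: (validation accuracy, indices of the features it was trained on).
ensemble = [
    (0.40, [0, 1, 2]),
    (0.45, [1, 2, 3]),
    (0.30, [2, 4]),
    (0.80, [0, 5]),
    (0.90, [3, 5, 6]),
]
kept, flagged = evolve_feature_set(ensemble, q=1.0)
print(kept)     # [0, 1, 3, 4, 5, 6]
print(flagged)  # [2]
```

In the toy ensemble, feature 2 appears in all three weak classifiers, so its accumulated weight (0.40 + 0.45 + 0.30) exceeds the illustrative threshold q = 1.0 and it is dropped before the next cluster is built.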

#### The Evaluation of the EWRSVMC

The EWRSVMC performs a weighted average of the outputs of multiple SVM classifiers to predict the class label of each new testing sample. Specifically, a new sample is first input into an EWRSVMC system, and each individual SVM classifier casts a vote weighted by its accuracy on the validation samples. Then the weighted votes belonging to the same predicted label are added up. Lastly, the label with the highest total vote is taken as the final predicted label of the new sample.
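The voting rule just described can be written compactly. The sketch below is a minimal illustration with made-up predictions and weights; `weighted_vote` is a hypothetical helper, and ties are resolved in favor of the label encountered first, which is our choice since the text does not specify a tie-breaking rule.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Sum each classifier's weight under the label it predicts; return the top label."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get), dict(totals)


# Three hypothetical base classifiers vote on one new sample.
label, totals = weighted_vote(predictions=[+1, -1, -1], weights=[0.9, 0.5, 0.3])
print(label)  # 1
```

Here an unweighted majority vote would return −1 (two votes against one), whereas the weighted vote returns +1 because the single accurate classifier outweighs the two weaker ones (0.9 versus 0.8).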

In this paper, we employ three metrics, i.e., accuracy, sensitivity and specificity, to evaluate the final performance of our proposed EWRSVMC. The diagnostic accuracy A<sub>c</sub> is the fraction of correctly identified samples (Schröder et al., 2015):

$$A\_c = \frac{TP + TN}{TP + FP + FN + TN} \tag{3}$$

where TP, FP, FN, and TN represent the numbers of true positives, false positives, false negatives and true negatives, respectively.

Sensitivity (Sn) stands for a proportion of actual positive samples which are correctly identified (Mondal and Pai, 2014):

$$S\_n = \frac{TP}{TP + FN} \tag{4}$$

Specificity (Sp) stands for a proportion of actual negative samples which are correctly identified (Kumar and Helenprabha, 2017):

$$S\_p = \frac{TN}{TN + FP} \tag{5}$$
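Eqs. 3 to 5 translate directly into code. The confusion-matrix counts in the example are hypothetical: the paper does not list them, and we chose a 20-sample test set with one error in each class only because it reproduces the accuracy, sensitivity and specificity reported for experiment 1.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts (Eqs. 3-5)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical 20-sample test set: 11 positives and 9 negatives, one error in each class.
acc, sn, sp = diagnostic_metrics(tp=10, fp=1, fn=1, tn=8)
print(f"{acc:.2%} {sn:.2%} {sp:.2%}")  # 90.00% 90.91% 88.89%
```

With test sets of this size, a single misclassified sample shifts the accuracy by 5 percentage points, which is worth keeping in mind when comparing methods.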

### The Application of the EWRSVMC

In the current study, we conducted multiple binary classifications, including EMCI vs. LMCI and LMCI vs. AD, to confirm the performance of our proposed EWRSVMC using 4005 FCs as the raw features. In addition to optimizing the classification accuracy, as in most existing studies, we also paid close attention to exploring and analyzing the brain alterations in patients at different cognitive stages of AD. Accordingly, another subprocedure was carried out to explore the disease-related brain regions using the optimal feature set. First, we identified the brain regions that are involved in the optimal features of the EWRSVMC with the highest classification accuracy. Then, the disease-related brain regions were sorted in descending order of their occurrence frequencies. The higher the frequency, the greater the degree of abnormality of the brain region.
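The ranking step can be illustrated with a short counting routine. It assumes that each optimal feature is stored as a pair of region labels (every FC connects two regions) and that the frequency of a region is the number of optimal FCs it takes part in; the four FCs below are invented for the example and `region_frequencies` is a hypothetical helper.

```python
from collections import Counter


def region_frequencies(optimal_fcs):
    """Count how often each brain region appears among the selected region pairs."""
    counts = Counter()
    for region_a, region_b in optimal_fcs:
        counts[region_a] += 1
        counts[region_b] += 1
    return counts.most_common()  # sorted by descending frequency


# Hypothetical optimal feature set; each FC is a pair of AAL region labels.
optimal_fcs = [
    ("ITG.R", "INS.L"),
    ("ITG.R", "MTG.L"),
    ("MTG.L", "INS.L"),
    ("ITG.R", "PHG.L"),
]
print(region_frequencies(optimal_fcs))
# [('ITG.R', 3), ('INS.L', 2), ('MTG.L', 2), ('PHG.L', 1)]
```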

#### Experiment Design

In this paper, we conducted experiment 1 for EMCI vs. LMCI classification and experiment 2 for LMCI vs. AD classification. Each experiment can be divided into four parts:

(1) Division of data sets. A 3:1 ratio is set to divide entire resting-state data set into the "training and validation" set for training the EWRSVMC and the test set for examining the generalization ability of the overall system. Furthermore, a 2:1 ratio is set to subdivide the "training and validation" set into the training set for training the SVM classifier and the validation set for obtaining the weight corresponding to the SVM classifier.

(2) Building an EWRSVMC. First, we randomly select 62 (≈ √4005) features from all 4005 features based on the training set to build a radial basis function (RBF) kernel SVM classifier. The kernel bandwidth σ and penalty parameter C for each SVM model are initially set to 3 and Inf, respectively. The number of initial base classifiers is set to 500 to obtain the weighted ensemble of SVMs. Then, we let the ensemble classifier evolve 50 times. In each evolution, we identify the features selected by the weak classifiers and remove the features whose total weight Tw<sub>j</sub> exceeds the threshold q = 7. As a result, EWRSVMCs with different numbers of evolutions are obtained.

(3) Finding out the optimal subset of features. We compute the diagnostic accuracies of the EWRSVMCs with different evolution times. The features selected by the optimal EWRSVMC having the lowest diagnostic error rate form the optimal features subset.

(4) Exploring the abnormal brain regions. We seek out the features with high discriminative ability in the optimal EWRSVMC, and then investigate the corresponding disease-related brain regions associated with these features.
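The data division in part (1) can be sketched as a two-stage index split. The rounding (integer division), the absence of stratification by class, and the fixed seed are our assumptions; the text only gives the 3:1 and 2:1 ratios. `split_indices` is a hypothetical helper name.

```python
import random


def split_indices(n_subjects, seed=0):
    """Split subject indices 3:1 into (train+validation, test), then 2:1 into (train, validation)."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    n_test = n_subjects // 4               # 3:1 -> one quarter held out for testing
    test, train_val = indices[:n_test], indices[n_test:]
    n_val = len(train_val) // 3            # 2:1 -> one third used for validation
    validation, train = train_val[:n_val], train_val[n_val:]
    return train, validation, test


# Experiment 1 has 42 EMCI + 38 LMCI = 80 subjects.
train, validation, test = split_indices(80)
print(len(train), len(validation), len(test))  # 40 20 20
```

The test subjects are set aside once, before any base classifier is built, so that they play no role in computing the classifier weights or in the feature evolution.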

### RESULTS

### The Experiment 1

We investigated the performance of classification between EMCI and LMCI in experiment 1. As described in Section "Experiment Design," we conducted 50 evolutions for the EWRSVMC. The EWRSVMC yielded a maximum accuracy of 90% in the 32nd evolution (as shown in **Figure 3**), which suggested that 32 was the optimal number of evolutions. Meanwhile, a sensitivity of 90.9% and a specificity of 88.89% were achieved based on the optimal feature set. The experimental results showed that the novel framework could significantly enhance diagnostic performance for EMCI/LMCI classification compared with some existing algorithms.

Feature selection was a crucial stage in our EWRSVMC algorithm for classifying LMCI from EMCI, and the process is shown in **Figure 4**. On the one hand, the number of removed features increased rapidly and exceeded 100 after two evolutions. Then it gradually stabilized and fluctuated around 120. On the other hand, the number of remaining features showed a trend of linear decline. There were 248 features left after the 32nd evolution, which constituted the optimal feature set and were used for the subsequent exploration of disease-related brain regions.

By counting the high-frequency FCs, we could detect the most discriminative brain regions, which are ranked in **Table 2**. The brain regions exceeding a frequency of 10 comprise the inferior temporal gyrus (ITG.R), temporal pole: middle temporal gyrus (TPOmid.L), temporal pole: superior temporal gyrus (TPOsup.R), middle temporal gyrus (MTG.L) and insula (INS.L). As seen from **Table 2**, some sub-regions of the temporal lobe showed higher frequencies than other regions, indicating that the temporal lobe made an essential contribution to the evolution from EMCI to LMCI. The locations of the brain regions were mapped in **Figure 5**, and the size of the red node represented the degree of abnormality of the corresponding brain region.

TABLE 2 | The frequencies of the most discriminative brain regions in the experiment 1.

### The Experiment 2

The classification of patients with LMCI and AD was carried out in experiment 2. Similarly, 50 evolutions were performed and the EWRSVMC reported the highest accuracy of 88.89% in the 34th evolution (see **Figure 6**), which indicated that 34 was the optimal number of evolutions in LMCI/AD classification. At the same time, the optimal EWRSVMC achieved 85.71% sensitivity and 90.9% specificity. The encouraging performance demonstrated the potential of our new framework for the diagnosis of AD dementia.

FIGURE 6 | Finding the optimal times of evolutions in the experiment 2.

The process of feature selection in LMCI/AD classification is plotted in **Figure 7**. The number of removed features showed an overall upward trend, while the number of remaining features exhibited a trend of linear decline. There were 293 features left after the 34th evolution, which formed the optimal feature set for the further analysis of progression from LMCI to AD.

We were able to explore the most discriminative brain regions by counting the high-frequency FCs. The disease-related brain regions in LMCI/AD classification are ranked in **Table 3**, and the ones exceeding a frequency of 10 were as follows: superior temporal gyrus (STG.R), parahippocampal gyrus (PHG.L), middle frontal gyrus, orbital part (ORBmid.R), calcarine fissure and surrounding cortex (CAL.R), insula (INS.L), temporal pole: middle temporal gyrus (TPOmid.R), and posterior cingulate gyrus (PCG.L). Similarly, some subregions of the temporal lobe and insula showed higher frequencies than other brain regions, suggesting that the temporal lobe and insula made greater contributions to the evolution of AD. **Figure 8** shows the locations of these brain regions.

### DISCUSSION

### Classification Effect

In this paper, we propose an advanced framework of EWRSVMC based on resting-state fMRI data to accurately classify different stages of AD. Resting-state fMRI is an effective tool for exploring the dynamical changes in the human brain because of its high temporal and spatial resolutions (Lee M.H. et al., 2016). In addition, to the best of our knowledge, no previous AD study has applied the EWRSVMC to brain imaging data. The EWRSVMC is able to efficiently perform EMCI/LMCI and LMCI/AD classifications with high accuracies of 90 and 88.89%, sensitivities of 90.9 and 85.71%, and specificities of 88.89 and 90.9%, respectively. The results of the two groups of experiments demonstrate the utility of the novel EWRSVMC algorithm for early detection of AD and the potential of resting-state fMRI for identification of the transition from EMCI to LMCI to AD.

TABLE 3 | The frequencies of the most discriminative brain regions in the experiment 2.

ML techniques have recently received growing attention in the analysis of imaging data (Zeng et al., 2014; Wang et al., 2017), and have been shown to be a reliable means of diagnosing different cognitive stages of AD using neuroimaging data. Jiang et al. (2014) achieved an accuracy of around 80% for 56 EMCI versus 44 LMCI by combining sparse learning with the SVM classifier. Prasad et al. (2015) reported an accuracy of 63.4% for 74 EMCI vs. 38 LMCI using the SVM classifier with a feature set of fiber network measures (FIN) and flow network measures (FLN). Mahjoub et al. (2018) combined their proposed deep similarity network architectures with a single SVM classifier, using cross-validation, to classify 41 AD from 36 LMCI with a classification accuracy peaking at 77.92%.

The majority of ML methods had somewhat lower classification performance, especially in classifying EMCI from LMCI, because of image noise and the small sample size of the data. In addition, many studies have focused on classification but rarely explored the disease-related brain regions underlying AD evolution. To address these issues, a new framework of EWRSVMC using the FCs as the raw features was presented in this paper. The output of the EWRSVMC is a weighted average of the outputs of SVMs, which could further reduce the classification error rate compared to some previous methodologies. Additionally, the high dimensionality of the feature space is likely to increase the complexity of the algorithm and degrade the performance of model estimation. Accordingly, a method of evolution is employed to dynamically eliminate the redundant features, and the features in the optimal EWRSVMC are regarded as the optimal features. Moreover, disease-related brain regions can be identified from these features with high discriminative ability, which provides new insights into the pathology of AD.

The issue of overfitting is a major concern in the training process of our EWRSVMC algorithm, and it is discussed in more detail here. To build an individual SVM classifier in EWRSVMC, the training set was randomly chosen from the whole experimental dataset and 62 FCs were randomly chosen from the total of 4005 FCs as input features. Because of the randomness of samples and features, each SVM base classifier differs greatly from the others, which could reduce the effects of overfitting. Furthermore, the EWRSVMC shows good classification performance on the test set, suggesting a low risk of overfitting.
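The random feature sampling and weighted output described above can be illustrated with a minimal sketch. It only shows the bookkeeping: 90 regions give 4005 unique FCs, 62 of which are drawn per base classifier, and base outputs are combined by a weighted average. The function names, the example weights, and the tie rule are assumptions made for illustration and are not taken from the EWRSVMC implementation.

```python
import random


def n_connections(n_regions):
    """Number of unique region pairs (undirected, no self-connections)."""
    return n_regions * (n_regions - 1) // 2


def draw_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices uniformly at random."""
    return sorted(rng.sample(range(n_features), k))


def weighted_vote(outputs, weights):
    """Weighted average of base-classifier outputs in {-1, +1}.

    Returns +1 or -1; a score of exactly 0 is mapped to +1 (assumed tie rule).
    """
    score = sum(w * o for w, o in zip(weights, outputs)) / sum(weights)
    return 1 if score >= 0 else -1


rng = random.Random(0)                       # fixed seed for reproducibility
n_fc = n_connections(90)                     # 90 regions -> 4005 FCs
subset = draw_feature_subset(n_fc, 62, rng)  # features for one base SVM

# Hypothetical outputs and weights of three base classifiers.
label = weighted_vote([+1, -1, +1], [0.8, 0.6, 0.7])
```

Each base SVM would be trained on its own `subset` and its own random sample of subjects; how the weights are derived is left open here.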

In our proposed EWRSVMC, two hyperparameters, namely the penalty parameter C and the kernel bandwidth σ, need to be determined. Initially, we set C to Inf and σ to 3 to train the individual RBF-SVM classifiers. For comparison, we tested different values of C and σ and found no considerable changes in the classification performance of the EWRSVMC, suggesting that the proposed EWRSVMC is robust to the choice of these parameters.

### Analysis of Higher-Frequency Brain Regions

In this part, we mainly discuss four abnormal brain regions, i.e., the temporal lobe, insula, superior frontal gyrus, and parahippocampal gyrus.

### The Temporal Lobe

Some subregions of the temporal lobe had relatively greater frequencies in both EMCI/LMCI and LMCI/AD classifications, indicating that the temporal lobe is likely to play a crucial role in AD progression. The temporal lobe is situated beneath the lateral sulcus on both hemispheres of the human cerebrum (Kiernan, 2012), which is known to be associated with visual memory, language comprehension, emotion association and executive function (Riley et al., 2010; Bell et al., 2011).

Several previous studies have reported temporal lobe abnormalities in AD progression. Younes et al. (2014) found that the volumes of medial temporal lobe structures were related to the time of progression from MCI to AD. Davatzikos et al. (2011) observed a positive baseline Spatial Pattern of Abnormalities for Recognition of Early AD in the temporal lobe in patients with MCI who progressed to AD dementia. Stein et al. (2010) observed temporal lobe volume differences in brain MRI scans of AD patients, MCI patients and healthy elderly participants. Douaud et al. (2013) found that the medial temporal lobe was vulnerable to cerebral atrophy in AD progression. Blasko et al. (2008) reported changes in medial temporal lobe atrophy (MTA) through the evolution from cognitive health to MCI and to AD in a prospective cohort of subjects aged 75 years. The discovery of temporal lobe abnormality may help to improve the understanding of AD progression.

### The Insula

The insula also had a relatively higher frequency than other brain regions in both EMCI/LMCI and LMCI/AD classifications, indicating that the insula may contribute substantially to the progression of AD. The insula is a crucial hub of the human brain networks and is folded deep in the floor of the lateral sulcus (Cauda et al., 2011). It has been reported that the human insula is involved in perception, motor control, general cognition and self-awareness (Kang et al., 2011; Chang et al., 2013).

Insula abnormality has been reported in numerous previous studies of AD pathology. Xie et al. (2012) found altered functional integration of the insula networks in AD development. Zhu et al. (2014) observed significantly greater gray matter volume loss in the bilateral insula, with a linear trend, in the progression from HC to MCI to AD. Sojkova et al. (2008) reported longitudinal alterations in regional cerebral blood flow involving the insula and superior temporal regions in AD progression. Hafkemeijer et al. (2012) mentioned that patients diagnosed with AD exhibited extensive decreases in gray matter volume in the insula and temporal lobe. Patel et al. (2013) reported that default mode network (DMN) regions, e.g., the insula and superior temporal gyrus, were significantly affected by AD pathology. The discovery of insula abnormality may help to illuminate the underlying neuromechanism of AD.

### The Superior Frontal Gyrus

The superior frontal gyrus had a relatively higher frequency than other brain regions in the EMCI/LMCI classification, suggesting that the superior frontal gyrus made an important contribution to the evolution from EMCI to LMCI. The superior frontal gyrus (SFG) is situated at the superior part of the frontal lobe and makes up about one third of the prefrontal cortex of the human brain (Li et al., 2013). It has been reported that the superior frontal gyrus is associated with motor functions and cognitive control, especially execution within working memory (Chiao et al., 2009; Van den Stock et al., 2011).

We reviewed a great deal of previous literature on EMCI and LMCI and found relatively few studies that made inferences about the brain dynamic differences in the cognitive process from EMCI to LMCI. Accordingly, the discovery of an abnormal superior frontal gyrus could be clinically helpful for early detection of AD evolution at the MCI stage. Lee E.S. et al. (2016) showed decreased FC in the right superior frontal gyrus in patients with LMCI compared with EMCI, which was in agreement with our finding.

### The Parahippocampal Gyrus

The parahippocampal gyrus had a higher frequency among the 90 brain regions in the LMCI/AD classification, indicating that the parahippocampal gyrus played a crucial part in the evolution from LMCI to AD. The parahippocampal gyrus is a part of the limbic system (Enatsu et al., 2015; Arnone et al., 2016) and is involved in memory encoding and retrieval (Puri et al., 2012; Monti et al., 2018).

Several previous studies have reported parahippocampal gyrus abnormality in AD pathology. Liang et al. (2014) found altered amplitude of low-frequency fluctuations in the right parahippocampal gyrus in LMCI and AD. Xiang et al. (2013) reported that AD patients showed less activity than MCI patients in the right parahippocampal gyrus during a visual memory task. Yetkin et al. (2006) mentioned that the AD group had less activation in the bilateral parahippocampal gyri than the MCI group in a memory-encoding task. Echávarri et al. (2011) found significant differences in parahippocampal gyrus volume between the groups, in the following order: AD < aMCI < healthy. The discovery of parahippocampal gyrus abnormality may assist the clinical diagnosis of early AD.

### Limitations

The current study is limited by the following two factors. Firstly, we utilized only one modality, i.e., RS-fMRI, for multiple binary classifications. Nevertheless, there exist other modalities [e.g., cerebrospinal fluid (CSF) and positron emission tomography (PET)] which may also contain complementary information for better classification performance. Secondly, it is crucial to visualize the learned decision process in order to better understand the classification approach and gain clinical insights. However, as with most previous AD classification algorithms, the visualization of the learned decision process in our proposed EWRSVMC is not informative; this remains a limitation that we expect to address in the future.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of National Institute of Aging-Alzheimer's Association (NIA-AA) workgroup guidelines, Institutional Review Board (IRB). The study was approved by the IRB of each participating site, including the Banner Alzheimer's Institute, and was conducted in accordance with Federal Regulations, the International Conference on Harmonization (ICH), and Good Clinical Practices (GCP).

### AUTHOR CONTRIBUTIONS

X-aB proposed the design of the work and revised it critically for important intellectual content. QX and QS carried out the experiment for the work and drafted part of the work. XL and ZW collected and interpreted the data, and drafted part of the work. All the authors approved the final version to be published and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

### FUNDING

This work was supported by the National Natural Science Foundation of China (No. 61502167).

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Bi, Xu, Luo, Sun and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Topological Properties of Resting-State fMRI Functional Networks Improve Machine Learning-Based Autism Classification

#### Amirali Kazeminejad1,2 \* and Roberto C. Sotero1,2,3

<sup>1</sup> Hotchkiss Brain Institute, University of Calgary, Calgary, AB, Canada, <sup>2</sup> Biomedical Engineering Graduate Program, University of Calgary, Calgary, AB, Canada, <sup>3</sup> Department of Radiology, University of Calgary, Calgary, AB, Canada

#### Edited by:

Yangming Ou, Harvard Medical School, United States

#### Reviewed by:

Lili Jiang, Institute of Psychology (CAS), China Matthew Toews, École de Technologie Supérieure, Canada

\*Correspondence:

Amirali Kazeminejad amirali.kazeminejad@ucalgary.ca; roberto.soterodiaz@ucalgary.ca

#### Specialty section:

This article was submitted to Brain Imaging Methods, a section of the journal Frontiers in Neuroscience

Received: 18 July 2018 Accepted: 18 December 2018 Published: 10 January 2019

#### Citation:

Kazeminejad A and Sotero RC (2019) Topological Properties of Resting-State fMRI Functional Networks Improve Machine Learning-Based Autism Classification. Front. Neurosci. 12:1018. doi: 10.3389/fnins.2018.01018

Automatic algorithms for disease diagnosis are being thoroughly researched for use in clinical settings. They usually rely on pre-identified biomarkers to highlight the existence of certain problems. However, finding such biomarkers for neurodevelopmental disorders such as Autism Spectrum Disorder (ASD) has challenged researchers for many years. With enough data and computational power, machine learning (ML) algorithms can be used to interpret the data and extract the best biomarkers from thousands of candidates. In this study, we used the fMRI data of 816 individuals enrolled in the Autism Brain Imaging Data Exchange (ABIDE) to introduce a new biomarker extraction pipeline for ASD that relies on the use of graph theoretical metrics of fMRI-based functional connectivity to inform a support vector machine (SVM). Furthermore, we split the dataset into 5 age groups to account for the effect of aging on functional connectivity. Our methodology achieved better results than most state-of-the-art investigations on this dataset with the best model for the >30 years age group achieving an accuracy, sensitivity, and specificity of 95, 97, and 95%, respectively. Our results suggest that measures of centrality provide the highest contribution to the classification power of the models.

Keywords: graph theory, SVM–support vector machine, machine learning, fMRI, ABIDE, brain connectivity

### INTRODUCTION

Autism Spectrum Disorder (ASD) is a neurodevelopmental disease which manifests in early childhood and persists into adulthood. Recent studies show that 1 in 45 children is diagnosed with autism (Zablotsky et al., 2015). While there is no cure for ASD (Brentani et al., 2013), early diagnosis of autistic individuals has been shown to improve quality of life (Fernell et al., 2013). To better detect ASD, biomarkers characterizing the disorder need to be identified. It has been shown that by using topological biomarkers extracted from the brain functional network, machine learning (ML) algorithms can be trained to aid in ASD diagnosis (Plitt et al., 2015). However, there are many variables, such as the method used to construct the functional network and the choice of topological measurements, that can affect the extraction of these biomarkers. One goal of this study was to find the best combination of these variables for the task of ASD classification. For this goal, we used 5 different network extraction pipelines with 12 graph theoretical topological measurements and performed a statistical analysis to compare the classification results between the pipelines. The second goal was to identify the top topological measures in each pipeline and investigate their relation to ASD in order to further understand the disorder.

Our brains can be viewed as a network of functionally interconnected regions. To measure the strength of these connections, the temporal dynamics of brain activity are needed. Modalities such as electroencephalography (EEG) and magnetoencephalography (MEG) provide this information; however, they suffer from poor spatial resolution when compared to functional magnetic resonance imaging (fMRI). In fMRI, brain activity is usually monitored through changes in blood oxygenation, which alter the magnetic properties of blood. The resulting signal is called the blood-oxygen-level dependent (BOLD) signal. At the turn of the century, researchers provided evidence that fMRI can be used to identify functional connections of the brain while the subject is in a "resting-state" and not doing any specific task (Lowe et al., 2000). Later studies found that many different functional networks can be identified using the resting-state connectivity derived from fMRI (van den Heuvel and Hulshoff Pol, 2010). Information from these networks can be extracted and used as an input to ML algorithms to automatically identify the best biomarkers distinguishing between healthy and diseased networks (Nielsen et al., 2013; Plitt et al., 2015; Hazlett et al., 2017; Heinsfeld et al., 2018).

ML has proven to be a powerful tool for automatic disease diagnosis in neurodegenerative disorders such as Alzheimer's Disease (AD) (Chen et al., 2011) and Parkinson's Disease (Kazeminejad et al., 2017; Talai et al., 2017). In recent years, researchers began investigating how the same principles can be used for automatic ASD diagnosis. Promising results with accuracies over 90% were observed using invasive methods and blood analysis (Howsmon et al., 2017). However, classification studies using non-invasive data acquisition such as brain imaging, while above chance levels, generally report lower accuracies. Using fMRI data acquired in the Autism Brain Imaging Data Exchange (ABIDE), Nielsen et al. (2013) extracted the pairwise functional connectivity of 7,266 regions of interest (ROIs) using Pearson correlation and used a leave-one-out general linear model classifier to achieve an ASD vs. Healthy Controls (HC) classification accuracy of 60%. More recently, by applying and comparing different ML algorithms on the same dataset, the accuracy has reached 70%. Heinsfeld et al. used the Pearson correlation of fMRI activity of region pairs in the CC200 atlas (Craddock et al., 2012) as the inputs to a multi-layer perceptron to achieve this result (Heinsfeld et al., 2018). Other research groups using their own datasets have reported higher accuracies. One study using cortical thickness, total brain volume, and surface area of different brain regions was able to achieve an accuracy of 81% using a neural network as the classifier (Hazlett et al., 2017).

Another emerging methodology in understanding different neurological disorders is graph theory, a mathematical tool used to explain network characteristics that can also be applied to the human brain network (Iturria-Medina et al., 2008; Bullmore and Sporns, 2009; Rubinov and Sporns, 2010; Sotero, 2016; Sanchez-Rodriguez et al., 2018). Graph theory can be used to measure the brain network segregation (clustering coefficient and transitivity), integration (characteristic path length and efficiency), and centrality (betweenness centrality, eigenvector centrality, participation coefficient and within module z-score). Recent brain imaging studies have found topological differences between ASD and normal brains which can be quantified using graph theory, such as global alterations of characteristic path length and efficiency in ASD (Rudie et al., 2013; Itahashi et al., 2014; Zeng et al., 2017; Qin et al., 2018) as well as alterations to segregation measures (Barttfeld et al., 2011; Rudie et al., 2013; Leung et al., 2014; Keown et al., 2017; Zeng et al., 2017) and centrality measures (Di Martino et al., 2013; Leung et al., 2014; Balardin et al., 2015).

Previous studies in AD patients have used topological properties of brain networks as features for an ML algorithm, achieving classification accuracies of 85% (Dyrba et al., 2015). However, this methodology has not been tested in ASD. With the emergence of the ABIDE dataset, large amounts of imaging and clinical data have become available to researchers (Di Martino et al., 2014). More than 1,000 datasets are available for individuals with ASD and HC each. These data are collected from multiple sites with slightly varying machinery and imaging parameters. Therefore, a well-developed preprocessing pipeline is essential to minimize the effects of site and imaging parameter changes, but further data manipulations may be needed to standardize the data from different sites.

One explanation for the lower accuracies of studies using the ABIDE dataset is that it covers a large age range (5–65). Age has been proposed as a factor contributing to the different results reported on resting-state fMRI analysis of ASD (Hull et al., 2016). Another study, which used multi-scale image textures to study neuroanatomical texture features in autism, found correlations between age and texture features (Chaddad et al., 2017). Therefore, any study that uses all of these data will have to take aging effects into consideration. If these issues are correctly addressed, the ABIDE initiative will provide a suitable database for ML-centered research on ASD. Another limitation of the previously mentioned studies is that they use a simple connectivity matrix, such as one computed by Pearson correlation, as the features for the classification algorithms. The connectivity matrix is interpreted as the strengths of the connections between ROIs, and the changes in these connection strengths are used to classify between ASD and HCs. We hypothesize that by applying graph theoretical measurements of network segregation (clustering coefficient and transitivity), integration (characteristic path length and efficiency), and centrality (betweenness centrality, eigenvector centrality, participation coefficient and within module z-score) to extract features from the connectivity matrix, the performance of ML algorithms on this dataset will be improved.

In this study, we use fMRI BOLD signals to estimate functional connectivity matrices using different network extraction methods. Using these matrices, we construct a brain network modeling the functional connectivity of a subject's brain. Topological properties such as integration, segregation, and centrality of the obtained networks are then used as features (for a total of 817 features for each network extraction method) fed to a Gaussian kernel Support Vector Machine (SVM) to classify whether a subject is suffering from ASD or not. We then use a sequential feature selection technique to choose the top 10 features that contribute to this classification. To control for the effects of aging, we separated our data into 5 age groups. Our best model, for the >30 age range, achieved a classification accuracy, sensitivity, and specificity of ∼95, 97, and 95%, respectively. Most regions that the features were extracted from had been previously shown to undergo structural and/or functional changes in ASD.

### MATERIALS AND METHODS

### Dataset and Preprocessing

In order to ensure replicability, we used the preprocessed version of ABIDE I (Di Martino et al., 2014) data publicly available via the Preprocessed Connectome Project (Cameron et al., 2013b). The preprocessing pipeline we used for this study is the Configurable Pipeline for the Analysis of Connectomes (CPAC) (Cameron et al., 2013a). Regions of interests (ROIs) were defined as the 116 regions in the automatic anatomical labeling (AAL) atlas (Tzourio-Mazoyer et al., 2002).

The preprocessing included the following steps. AFNI was used for removing the skull from the images. The brain was segmented into three tissues using FSL. The images were then normalized to the MNI 152 stereotactic space using ANTs. Functional preprocessing included motion and slice-timing correction as well as the normalization of voxel intensity. Nuisance signal regression included 24 parameters for head motion, CompCor with 5 principal components for tissue signal in CSF and white matter, linear and quadratic trends for low-frequency drifts, and a global bandpass filter (0.01–0.1 Hz). These images were then co-registered to their anatomical counterpart by FSL. They were then normalized to the MNI 152 space using ANTs. The average voxel activity in each ROI was then extracted as the time-series for that region. Any subject with a time-series that was consistently 0 was omitted from the dataset. To minimize the effects of age on the results, the dataset was split into 5 age ranges, with 5-year increments for the first three steps and a 10-year and an unlimited increment for the final two. This was done in order to ensure that no age range would have a very small number of subjects. The distribution of the subjects in each age range can be seen in **Table 1**. Further breakdown of the subjects' demographics is shown in **Supplementary Table A**.
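The age split can be sketched as a simple binning function. The range labels follow those used in the results (5–10, 10–15, 15–20, 20–30, and >30 years); how subjects exactly on a boundary were assigned is not stated, so the sketch assumes that the lower bound of each range is inclusive. The function name `age_group` is hypothetical.

```python
def age_group(age):
    """Map an age in years to one of the five age ranges.

    Assumption: each range includes its lower bound and excludes its upper
    bound, so a subject aged exactly 30 falls into the last group.
    """
    if age < 5:
        raise ValueError("age is below the youngest range")
    if age < 10:
        return "5-10"
    if age < 15:
        return "10-15"
    if age < 20:
        return "15-20"
    if age < 30:
        return "20-30"
    return ">30"


# Group a few hypothetical subject ages.
groups = [age_group(a) for a in (7.5, 10.0, 19.9, 29.9, 45.0)]
```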

TABLE 1 | Number of participants from each site for each age group as well as the overall number of participants in a site that were used for this study. Last row shows the total number of subjects in each age-range. The number of MRI samples per fMRI time-series is annotated in brackets in the first column. The Stanford time-series did not have a consistent number of samples thus the number is presented as a range.

### Creating the Functional Connectivity Network

To extract the whole-brain functional connectivity network of each subject, each ROI is seen as a network node and a measure of connectivity is used to connect these nodes (Bullmore and Sporns, 2009). This connectivity measure wij must be able to quantify the relationship between the time-series of ROI i and j. Correlation and mutual information metrics have been extensively used for this purpose (Rubinov and Sporns, 2010). We have used Spearman's rank correlation coefficient, the percentage-bend correlation (Wilcox, 1994; Pernet et al., 2012) and partial correlation (Marrelec et al., 2006) as our correlation-based measures of connectivity. We also used Sparse Inverse Covariance Estimation (SICE) (Huang et al., 2010) and mutual information as alternative measures of connectivity. More details on each method can be found in the **Supplementary Material**. The implementations in the open source GraphVar Matlab toolbox (Kruschwitz et al., 2015) were used to compute these connectivity measures.
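As an illustration of one of these measures, the following sketch builds a connectivity matrix from ROI time-series using Spearman's rank correlation (Pearson correlation of the ranks, with tied values sharing their mean rank). It is a plain-Python stand-in and not the GraphVar implementation; the toy time-series and the choice to leave the diagonal at 0 are assumptions.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation between two time-series."""
    return pearson(average_ranks(x), average_ranks(y))


def connectivity_matrix(series):
    """Symmetric matrix of pairwise Spearman correlations (diagonal left at 0)."""
    n = len(series)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = spearman(series[i], series[j])
    return w


# Toy time-series for three hypothetical ROIs.
roi_series = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 5.0, 9.0, 20.0],  # increases monotonically with ROI 0
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
w = connectivity_matrix(roi_series)
```

A constant time-series would make the correlation undefined, which is consistent with subjects having a consistently 0 time-series being excluded beforehand.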

### Graph Extraction

Once the whole-brain network is available, numerous methods can be used to express it in terms of a graph. The easiest way is to treat each ROI as a node and the connectivity matrix as connection weights. Another approach is to define a threshold T and disregard any edges with values wij < T by changing them to 0. One can then either keep the edge weights for wij > T or change them to 1 to construct a binary graph. It has been shown that binary graphs are easier to characterize using graph theoretical metrics and usually have better defined null models for statistical analysis (Rubinov and Sporns, 2010). As there is no proven way to calculate the value of T for a specific application, a proportional approach is usually used in its place. In this paper, the highest 20% of the weights were changed to 1 and the rest were set to 0.
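A minimal sketch of proportional thresholding is given below. It assumes a symmetric connectivity matrix, ranks edges by their raw weight (not the absolute value), ignores the diagonal, and rounds the number of retained edges; these details, along with the toy matrix, are assumptions and may differ from the toolbox behaviour.

```python
def binarize_top_fraction(w, fraction=0.2):
    """Set the strongest `fraction` of off-diagonal edges to 1, the rest to 0."""
    n = len(w)
    # Each undirected edge is listed once (upper triangle only).
    edges = [(w[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    edges.sort(key=lambda e: e[0], reverse=True)
    n_keep = round(fraction * len(edges))
    adj = [[0] * n for _ in range(n)]
    for _, i, j in edges[:n_keep]:
        adj[i][j] = adj[j][i] = 1
    return adj


# Toy 5-node connectivity matrix: 10 edges, so 20% keeps the 2 strongest.
w = [
    [0.0, 0.9, 0.1, 0.3, 0.2],
    [0.9, 0.0, 0.8, 0.4, 0.1],
    [0.1, 0.8, 0.0, 0.5, 0.3],
    [0.3, 0.4, 0.5, 0.0, 0.6],
    [0.2, 0.1, 0.3, 0.6, 0.0],
]
adj = binarize_top_fraction(w, fraction=0.2)
```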

### Graph Metrics

Graph theoretical analysis was performed on the extracted brain graph for each subject. The calculated graph properties consisted of measures of segregation (clustering coefficient, transitivity), integration (characteristic path length, efficiency), and centrality (betweenness centrality, within module degree z-score, participation coefficient) of the brain network. Formulas for each metric are presented in **Table A1** (Rubinov and Sporns, 2010). This resulted in a feature space of 817 variables for each subject. More information on this step is available in the **Supplementary Material**.

All steps from Graph extraction to this point were done using the openly available MATLAB toolbox GraphVar (Kruschwitz et al., 2015).
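To make the segregation measures concrete, the sketch below computes the local clustering coefficient and the transitivity of a binary undirected graph, following the standard definitions for such graphs. It is a plain-Python illustration and not the GraphVar code; the toy graph (a triangle with one extra node attached) is an assumption.

```python
def neighbour_links(adj, i):
    """Return (degree of node i, number of edges among its neighbours)."""
    n = len(adj)
    nbrs = [j for j in range(n) if j != i and adj[i][j]]
    links = sum(adj[a][b] for idx, a in enumerate(nbrs) for b in nbrs[idx + 1:])
    return len(nbrs), links


def local_clustering(adj):
    """Clustering coefficient of every node: 2 * links / (k * (k - 1))."""
    coeffs = []
    for i in range(len(adj)):
        k, links = neighbour_links(adj, i)
        coeffs.append(2.0 * links / (k * (k - 1)) if k >= 2 else 0.0)
    return coeffs


def transitivity(adj):
    """Closed triplets divided by all connected triplets, over the whole graph."""
    closed = 0
    triplets = 0
    for i in range(len(adj)):
        k, links = neighbour_links(adj, i)
        closed += 2 * links
        triplets += k * (k - 1)
    return closed / triplets if triplets else 0.0


# Triangle 0-1-2 with node 3 attached to node 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
clustering = local_clustering(adj)
trans = transitivity(adj)
```

Nodal measures such as the clustering coefficient yield one feature per ROI, whereas global measures such as transitivity yield a single feature per subject.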

### Classification, Validation, and Comparison

In this study, we used the Python Scikit-learn implementation of the Gaussian SVM as our classifier. Features were selected using a sequential forward floating algorithm (Pudil et al., 1994). This was done over 10 successive iterations. In the first iteration, all features in the feature space were individually used for classification and the best performing feature was added to a feature subset while being removed from the feature space. In each subsequent iteration, each remaining feature is added in turn to the feature subset, and the feature that performs best in combination with the previously selected features is kept for future use. This resulted in 10 features being chosen as the best graph characteristics that distinguish between ASD and HC.
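The iteration described above can be sketched as a greedy forward selection loop. The sketch takes the scoring function as an argument; in practice this would be the cross-validated accuracy of the classifier on the candidate subset, but here a toy additive score is used as a stand-in. The conditional removal step of the floating variant (Pudil et al., 1994) is omitted, and all names are hypothetical.

```python
def forward_select(n_features, score, n_select=10):
    """Greedy forward selection.

    In each iteration, every remaining feature is tried together with the
    already selected ones, and the feature giving the highest score is kept.
    Ties are broken in favour of the lowest feature index.
    """
    remaining = set(range(n_features))
    selected = []
    for _ in range(min(n_select, n_features)):
        best = max(sorted(remaining), key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy score standing in for cross-validated accuracy of a feature subset.
GAINS = [0.10, 0.50, 0.30, 0.05]


def toy_score(subset):
    return sum(GAINS[f] for f in subset)


chosen = forward_select(n_features=4, score=toy_score, n_select=2)
```

With 817 candidate features and 10 iterations, this loop evaluates on the order of 8,000 candidate subsets, each requiring a full cross-validation run.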

All classification metrics were acquired using a 10-fold stratified cross validation test with the data folds being the same for all algorithms. To further validate our results, the confusion matrix of each model was evaluated to determine model sensitivity and specificity.
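Stratified fold assignment with a fixed seed, so that every algorithm sees the same folds, can be sketched as follows. This is a plain-Python illustration of the idea rather than the library routine that was used; the function name, the seed, and the toy class counts are assumptions.

```python
import random


def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample a fold index, spreading every class evenly over folds.

    A fixed seed gives the same assignment on every call, so different models
    can be evaluated on identical folds.
    """
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_folds
    return fold_of


# Hypothetical group with 20 ASD and 30 HC subjects, split into 10 folds.
labels = ["ASD"] * 20 + ["HC"] * 30
folds = stratified_folds(labels, n_folds=10)
```

Each fold then serves once as the held-out set while the remaining nine are used for training.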

FIGURE 1 | Graphical framework of the experiment. (A) Raw fMRI images of subjects; (B) After preprocessing the brain is divided into 116 regions of interest (ROI); (C) By averaging the BOLD activity in each ROI, a time-series is extracted representing brain activity in that region; (D) Using different measures of connectivity, a connectivity matrix is generated from the ROI time-series quantifying the connectivity level between individual ROIs; (E) By treating the ROIs as graph nodes and the connectivity matrix as graph weights the brain network is expressed in graph form; (F) A threshold is applied to keep only the strongest connections; (G) Graph theoretical analysis is applied to the resulting graph from part F to obtain a feature vector for each subject; (H) A wrapper method called sequential feature selection is applied to choose a handful of features that contribute to the highest classification accuracy; (I,J) The resulting feature subset is passed to a linear SVM which trains a model to distinguish between ASD and HC.

FIGURE 2 | Comparison of Model Performance; Left Column: Accuracy of the models trained using features extracted from the pipeline specified on the X axis for the age range specified on the far left (in years). Y axis labels specify the chance level for the classification task. Top performing model is highlighted in dark blue; Middle Column: p-values of the Welch's t-test performed on the models trained on different pipelines. Statistical significance (p < 0.05) is highlighted in dark blue; Right Panel: FDR corrected p-values based on the Benjamini-Hochberg method (Benjamini and Hochberg, 1995). The corrected p-values were capped at 1; therefore, any value over that threshold was set to 1.

We minimized the risk of overfitting by using three limiting approaches. First, the simplest kernel (linear) was used for the SVM. Second, only 10 features were used to learn to classify between 104 subjects. Finally, using 10-fold cross validation ensured the model is only evaluated on data points that it has not experienced before.

As cross validation is inevitably dependent on how the data was randomly separated, we used 10 × 10 Welch's t-tests to compare our models. The null hypothesis for these tests was that the two models have equal accuracies. To address the issue of multiple comparisons, we also reported the false discovery rate (FDR) corrected p-values for these tests.
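The two statistical steps can be sketched as follows: the Welch's t statistic with its Welch-Satterthwaite degrees of freedom for two sets of accuracies, and the Benjamini-Hochberg adjustment of a list of p-values with the result capped at 1. Converting the t statistic into a p-value requires the t distribution, which is left out here. The accuracy values and p-values are made up for illustration.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, capped at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0           # the cap at 1 comes from this starting value
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical accuracies of two pipelines over repeated cross-validation.
acc_a = [0.80, 0.82, 0.78, 0.84]
acc_b = [0.70, 0.72, 0.69, 0.73]
t_stat, dof = welch_t(acc_a, acc_b)

# Hypothetical raw p-values from several pairwise comparisons.
p_adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```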

**Figure 1** presents a graphical depiction of the methodology proposed here.

### RESULTS

### Performance of the Classifiers

Our models were able to consistently perform better than the chance level calculated for their respective age ranges. Chance level was evaluated by assuming the model always chooses the most populous group. The left panel of **Figure 2** compares the performance of the different pipelines in each age range. The best performing model for each age-range is highlighted.
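The chance level defined above is simply the proportion of the larger class. A minimal sketch, with made-up group sizes:

```python
from collections import Counter


def chance_level(labels):
    """Accuracy obtained by always predicting the most populous class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical age range with 40 ASD and 60 HC subjects.
level = chance_level(["ASD"] * 40 + ["HC"] * 60)
```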

The top performing pipeline model was generally shown to have a statistically higher (p < 0.05) mean than most of the other pipelines. The only exception occurs in the case of the 10–15 age range, in which the concatenation pipeline's accuracy fails to achieve a statistically significant difference from three other pipelines: mutual information, covariance, and bend correlation. The details of this statistical analysis are illustrated in the middle and right panels of **Figure 2**.

To further analyze the performance of the best models, we calculated their respective sensitivity and specificity (**Table 2**). All models exhibited a specificity of ≥80%. The 10–15 age range showed relatively low sensitivity. Specificity shows the percentage of times that a Negative prediction (in this case HC) is correct, while sensitivity shows the percentage of times that a Positive prediction (ASD) is correct.
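For reference, the sketch below derives the usual rates from a confusion matrix, taking ASD as the positive class. Under the conventional definitions, sensitivity and specificity are computed over the actual ASD and actual HC subjects, respectively, while the fraction of positive or negative predictions that are correct is usually called the positive or negative predictive value. The sketch reports all four; which of them the reported figures correspond to is not something it can settle, and the labels used are made up.

```python
def confusion_counts(y_true, y_pred, positive="ASD"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Conventional rates; assumes none of the denominators is zero."""
    return {
        "sensitivity": tp / (tp + fn),  # actual positives correctly found
        "specificity": tn / (tn + fp),  # actual negatives correctly found
        "ppv": tp / (tp + fp),          # positive predictions that are right
        "npv": tn / (tn + fn),          # negative predictions that are right
    }


# Hypothetical labels: 4 ASD and 6 HC subjects.
y_true = ["ASD"] * 4 + ["HC"] * 6
y_pred = ["ASD", "ASD", "ASD", "HC", "HC", "HC", "HC", "HC", "HC", "ASD"]
result = rates(*confusion_counts(y_true, y_pred))
```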

### Analysis of Selected Features

To further understand the results, we plotted the regions from which the selected features were derived (**Figure 3**). The results for the top performing pipeline for each age range are presented in the main body of this article. More details about the performance of all other pipelines for a given age range are given in the **Supplementary Material**. The size of the nodes in **Figure 3** corresponds to the rank at which that feature was selected. The abbreviations of the node labels can be found in **Supplementary Table B**. **Supplementary Table B** also tabulates the exact features for each age range as well as the p-value corresponding to the between-group difference of that feature. The top group of measures, as well as the top measure based on repetition, is as follows: measures of segregation, specifically clustering coefficient, for the 5–10 years range; and measures of centrality for all other age ranges, with the most repeated measure being betweenness centrality for the 10–15 years range, eigenvector centrality for the 15–20 years range, within module degree z-score for the 20–30 years range and betweenness centrality for the >30 years range.

TABLE 2 | Classification performance of the best models.

### DISCUSSION

### Comparison With Previous Literature

In this study, we examined several different pipelines for ASD classification. These included 6 different network extraction techniques over 5 age ranges. Furthermore, we used 10-fold cross validation to examine the accuracy of the algorithm for each pipeline, which has been shown to be better than the leave-one-out cross validation used in previous studies (Kohavi, 1995). In addition, 10-fold cross validation may be used as a substitute for having a separate testing set because the model is evaluated on datapoints it has not seen before. Because we do not have access to the exact models trained in previous studies, we compare our findings with them using only the reported accuracy, specificity, and sensitivity. All models trained in this study were statistically compared with each other using a 10 by 10 cross validation t-test.

Previous studies were not able to report high prediction accuracies for the ABIDE dataset relative to similar studies on other neurological diseases such as AD. This can be related to the fact that this dataset consists of recordings conducted over multiple sites, some with slightly different image acquisition parameters. Moreover, the whole dataset covers a wide age range (5–64 years). To minimize the effects of age, we separated the dataset into 5 age ranges and trained separate models on each range. To allow for easier reproducibility and thus more meaningful comparisons, we chose to use a publicly available preprocessed version of the data through the Preprocessed Connectomes Project (http://preprocessedconnectomes-project.org/).

**Table 3** shows a detailed comparison with previously reported ASD classification models. It should be noted that all of the mentioned papers other than Chen et al. (2015) used the complete dataset to train their model, while in this study separate models were trained for different age ranges. The cross-validation results in this study provide an estimate of how the models would perform if data from their respective age ranges were fed to them. Therefore, it can be hypothesized that the performance over the entire dataset would not be worse than that of the worst-performing age range if, based on the subject's age, the correct model is used for a previously unseen dataset. Additional data is needed to confirm this hypothesis. Our worst-performing model, the model for the 10–15 age range, outperformed almost all the previous models in specificity while having an accuracy comparable to that of the other SVM models. All other age ranges showed higher accuracy than all previous models except the Chen et al. random forest. This could be attributed to the fact that the performance metrics for the random forest model were assessed using a different scheme, the out-of-bag prediction error, as opposed to the cross validation used in our models and all other previously reported studies mentioned here.

FIGURE 3 | Visualization of the top 10 selected features for each age range. Two age ranges show only 9 features. This is because in the 5–10 range PreCG.L was selected two times, and in the >30 group the last selected feature was the global characteristic path length. The full region names along with the abbreviations can be found in Supplementary Table B.

### Comparison Between Pipelines

While in all age ranges except the 10–15 range the top model performed statistically significantly better than most of the other models, our results do not reach a consensus about which network creation pipeline performs best in all cases. However, the bend correlation pipeline's model was the second-best model in all age ranges except the >30 range. Furthermore, it did not show any statistically significant difference in model performance from the top-performing model for the 10–15, 15–20, and 20–30 age ranges. Based on this, we would suggest bend correlation as the first network construction pipeline to try for graph theoretical analysis of the ABIDE dataset if computational time is limited.

A possible explanation for the relatively lower performance of the 10–15 range compared to other age ranges is that the larger number of subjects in this group translated into higher between site variability in the data. Therefore, even though our model achieved higher specificity than most previous studies, further steps are needed to address the inherent heterogeneity of the ABIDE dataset.

### Analysis of the Selected Features

Centrality measures were shown to be most operative in providing features for the classification tasks in the top 10 selected features. This also held true when selecting the top 5 features. Centrality measures have been shown to undergo changes in ASD. A previous study on the structural network of the brain found that autism is accompanied by centrality alterations in regions relevant for social and sensorimotor processing (Balardin et al., 2015). Another study found changes in hubness of ASD brain networks using resting-state fMRI (Itahashi et al., 2014). Our results suggest that the changes in centrality measures play a key role in being able to differentiate between ASD and HC. The only exception was observed for the 5–10 years age range where clustering coefficient, a measure of segregation, was chosen more times than the rest. This also held true when only looking at the top 5 features. This suggests that at a young age, there may not be many changes to the hubs of the brain network but the organization of the network into sub-networks is altered.

### LIMITATIONS

There are several limitations in the current study. First, ABIDE I data was used in different age ranges to investigate the prediction accuracy of our pipelines while minimizing the effects of aging on the resting-state networks. Furthermore, although to the best of our knowledge ABIDE is the most comprehensive database for ASD functional imaging, further analyses are needed to confirm that it is representative of the whole ASD population. Second, we relied on a single preprocessing pipeline for the sake of easier comparison between our work and previous studies. It is entirely possible that another preprocessing pipeline is better suited to this graph theoretical approach. Future studies will need to investigate this limitation. Additionally, the comparison between our models and previous studies only used three metrics (accuracy, sensitivity, and specificity). A statistical test may be needed to further analyze the significance of our findings. However, this is not possible without access to the exact cross validation folds or out-of-bag sample errors of those studies. Nevertheless, due to the observed improvement, we suspect that our algorithm achieves a statistically significant improvement over previous results.

Another shortcoming that is not limited to this study is related to how the classification task is formulated. To the best of our knowledge, all research in this field, including the present study, has focused on distinguishing HCs from ASDs. However, as the name suggests, ASD is a spectrum and individual cases can vary greatly in how the disorder affects them. To address this issue, databases such as ABIDE will play a vital role. Extensive detailed clinical analysis data will be needed to correctly approximate the position of an individual on the spectrum.

Finally, variability present in the ABIDE dataset, such as different imaging parameters and devices, due to it being a multisite initiative may lead to uncontrolled variations in the data or model being biased toward better represented sites. While the normalization steps in the preprocessing help reduce the variations, further investigations will be needed to confirm if they have been eliminated to a sufficient degree. Our results show better overall performance over previous investigations which suggests these limitations may have been addressed in a satisfactory manner.

### CONCLUSION

In this study we utilized graph theory and ML to propose a novel pipeline for automatic diagnosis of ASD which significantly improved performance over previously proposed models. The relative strength of our method suggests graph theoretical analysis paired with the right preprocessing pipeline can nullify the effects of multi-site and multi-device image acquisition to a good degree and is more robust than previous methods. Our pipeline automatically selected 10 biomarkers for each age range being investigated. Measures of centrality were shown to be most operative in distinguishing between ASD and HC.

### DATA AVAILABILITY STATEMENT

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

### ETHICS STATEMENT

We used the data collected as part of the ABIDE database and complied with everything that they have asked to be included in any manuscript using that data. The original ethics statement from the Di Martino et al. (2014) paper is as follows: All contributions were based on studies approved by local IRBs, and data were fully anonymized (removing all 18 HIPAA protected health information identifiers, and face information from structural images). All data distributed were visually inspected prior to release.


### AUTHOR CONTRIBUTIONS

The work presented here was carried out in collaboration between all authors. The research was designed by both authors. AK acquired and analyzed the data and carried out the experiment with RS providing supervision and guidance. The manuscript was written by AK and revised by RS. All authors have read and approved the submission of the manuscript.

### ACKNOWLEDGMENTS

This work was partially supported by the Biomedical engineering research scholarship of the University of Calgary. The database used in this study was part of the openly available ABIDE I. The funding source for that project is as follows. Primary support for the work by Adriana Di Martino was provided by the NIMH (K23MH087770) and the Leon Levy Foundation. Primary support for the work by Michael P. Milham and the INDI team was provided by gifts from Joseph P. Healy and the Stavros Niarchos Foundation to the Child Mind Institute, as well as by a NIMH award to MPM (R03MH096321).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2018.01018/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Kazeminejad and Sotero. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Edited by: Guoyan Zheng, Universität Bern, Switzerland

#### Reviewed by:

Qi Dou, The Chinese University of Hong Kong, China Chunliang Wang, KTH Royal Institute of Technology, Sweden

#### \*Correspondence:

Min Du dm\_dj90@163.com Xiaobo Qu quxiaobo@xmu.edu.cn

†Data used in the preparation of this paper were obtained from the ADNI database (http://adni.loni.usc.edu/). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at the website (http://adni.loni.usc.edu/wpcontent/uploads/how\_to\_apply/ADNI\_Acknowledgement\_List.pdf).

#### Specialty section:

This article was submitted to Neurodegeneration, a section of the journal Frontiers in Neuroscience

Received: 04 July 2018 Accepted: 05 October 2018 Published: 05 November 2018

#### Citation:

Lin W, Tong T, Gao Q, Guo D, Du X, Yang Y, Guo G, Xiao M, Du M, Qu X and The Alzheimer's Disease Neuroimaging Initiative (2018) Convolutional Neural Networks-Based MRI Image Analysis for the Alzheimer's Disease Prediction From Mild Cognitive Impairment. Front. Neurosci. 12:777. doi: 10.3389/fnins.2018.00777

## Convolutional Neural Networks-Based MRI Image Analysis for the Alzheimer's Disease Prediction From Mild Cognitive Impairment

Weiming Lin1,2,3, Tong Tong3,4, Qinquan Gao1,3,4, Di Guo<sup>5</sup> , Xiaofeng Du<sup>5</sup> , Yonggui Yang<sup>6</sup> , Gang Guo<sup>6</sup> , Min Xiao<sup>2</sup> , Min Du1,7 \*, Xiaobo Qu<sup>8</sup> \* and The Alzheimer's Disease Neuroimaging Initiative†

<sup>1</sup> College of Physics and Information Engineering, Fuzhou University, Fuzhou, China, <sup>2</sup> School of Opto-Electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China, <sup>3</sup> Fujian Key Lab of Medical Instrumentation & Pharmaceutical Technology, Fuzhou, China, <sup>4</sup> Imperial Vision Technology, Fuzhou, China, <sup>5</sup> School of Computer & Information Engineering, Xiamen University of Technology, Xiamen, China, <sup>6</sup> Department of Radiology, Xiamen 2nd Hospital, Xiamen, China, <sup>7</sup> Fujian Provincial Key Laboratory of Eco-Industrial Green Technology, Nanping, China, <sup>8</sup> Department of Electronic Science, Xiamen University, Xiamen, China

Mild cognitive impairment (MCI) is the prodromal stage of Alzheimer's disease (AD). Identifying MCI subjects who are at high risk of converting to AD is crucial for effective treatments. In this study, a deep learning approach based on convolutional neural networks (CNN) is designed to accurately predict MCI-to-AD conversion with magnetic resonance imaging (MRI) data. First, MRI images are prepared with age correction and other processing. Second, local patches, which are assembled into 2.5 dimensions, are extracted from these images. Then, the patches from AD and normal controls (NC) are used to train a CNN to identify deep learning features of MCI subjects. After that, structural brain image features are mined with FreeSurfer to assist the CNN. Finally, both types of features are fed into an extreme learning machine classifier to predict the AD conversion. The proposed approach is validated on the standardized MRI datasets from the Alzheimer's Disease Neuroimaging Initiative (ADNI) project. This approach achieves an accuracy of 79.9% and an area under the receiver operating characteristic curve (AUC) of 86.1% in leave-one-out cross validations. Compared with other state-of-the-art methods, the proposed approach achieves higher accuracy and AUC while keeping a good balance between sensitivity and specificity. The results demonstrate the great potential of the proposed CNN-based approach for the prediction of MCI-to-AD conversion with solely MRI data. Age correction and assisted structural brain image features can boost the prediction performance of the CNN.

Keywords: Alzheimer's disease, deep learning, convolutional neural networks, mild cognitive impairment, magnetic resonance imaging

## INTRODUCTION

fnins-12-00777 November 5, 2018 Time: 14:30 # 2

Alzheimer's disease (AD) is the cause of over 60% of dementia cases (Burns and Iliffe, 2009), in which patients usually have a progressive loss of memory, language disorders, and disorientation. The disease ultimately leads to the death of patients. Until now, the cause of AD is still unknown, and no effective drugs or treatments have been reported to stop or reverse AD progression. Early diagnosis of AD is essential for making treatment plans to slow down its progression. Mild cognitive impairment (MCI) is known as the transitional stage between normal cognition and dementia (Markesbery, 2010), and about 10–15% of individuals with MCI progress to AD per year (Grundman et al., 2004). It was reported that MCI and AD are accompanied by a loss of gray matter in the brain (Karas et al., 2004); thus, neuropathological changes can be found several years before AD is diagnosed. Many previous studies used neuroimaging biomarkers to classify AD patients at different disease stages or to predict the MCI-to-AD conversion (Cuingnet et al., 2011; Zhang et al., 2011; Tong et al., 2013, 2017; Guerrero et al., 2014; Suk et al., 2014; Cheng et al., 2015; Eskildsen et al., 2015; Li et al., 2015; Liu et al., 2015; Moradi et al., 2015). In these studies, structural magnetic resonance imaging (MRI) is one of the most extensively utilized imaging modalities due to its non-invasiveness, high resolution, and moderate cost.

To predict MCI-to-AD conversion, we separate MCI patients into two groups according to whether they convert to AD within 3 years or not (Moradi et al., 2015; Tong et al., 2017). These two groups are referred to as MCI converters and MCI non-converters. The converters generally have more severe neuropathological deterioration than the non-converters. The pathological differences between converters and non-converters are similar to those between AD and NC, but much milder. Therefore, it is much more difficult to classify converters/non-converters than AD/NC. This prediction with MRI is challenging because the pathological changes related to AD progression that distinguish MCI non-converters from MCI converters are subtle and vary between subjects. For example, in a comparison of ten MRI-based methods for predicting MCI-to-AD conversion, six performed no better than a random classifier (Cuingnet et al., 2011). To reduce the interference of inter-subject variability, MRI images are usually spatially registered to a common space (Coupe et al., 2012; Young et al., 2013; Moradi et al., 2015; Tong et al., 2017). However, the registration might change the AD-related pathology and lose some useful information. The accuracy of prediction is also influenced by the brain atrophy of normal aging; removing the age-related effect has been shown to improve classification performance (Dukart et al., 2011; Moradi et al., 2015; Tong et al., 2017).

Machine learning algorithms perform well in computer-aided predictions of MCI-to-AD conversion (Dukart et al., 2011; Coupe et al., 2012; Wee et al., 2013; Young et al., 2013; Moradi et al., 2015; Beheshti et al., 2017; Cao et al., 2017; Tong et al., 2017). In recent years, deep learning, as a promising machine learning methodology, has made a big leap in identifying and classifying patterns of images (Li et al., 2015; Zeng et al., 2016, 2018). As the most widely used architecture of deep learning, convolutional neural networks (CNN) has attracted a lot of attention due to its great success in image classification and analysis (Gulshan et al., 2016; Nie et al., 2016; Shin et al., 2016; Rajkomar et al., 2017; Du et al., 2018). The strong ability of CNN motivates us to develop a CNN-based prediction method of AD conversion.

In this work, we propose a CNN-based prediction approach of AD conversion using MRI images. A CNN-based architecture is built to extract high-level features of registered and age-corrected hippocampus images for classification. To further improve the prediction, more morphological information is added by including FreeSurfer-based features (FreeSurfer, RRID:SCR\_001847) (Fischl and Dale, 2000; Fischl et al., 2004; Desikan et al., 2006; Han et al., 2006). Both CNN and FreeSurfer features are fed into an extreme learning machine as the classifier, which makes the final decision on MCI-to-AD conversion. Our main contributions to boost the prediction performance include: (1) multiple 2.5D patches are extracted for data augmentation in the CNN; (2) both AD and NC are used to train the CNN, digging out important MCI features; (3) CNN-based features and FreeSurfer-based features are combined to provide complementary information to improve prediction. The performance of the proposed approach was validated on the standardized MRI datasets from the Alzheimer's Disease Neuroimaging Initiative (ADNI – Alzheimer's Disease Neuroimaging Initiative, RRID:SCR\_003007) (Wyman et al., 2013) and compared with other state-of-the-art methods (Moradi et al., 2015; Tong et al., 2017) on the same datasets.

### MATERIALS AND METHODS

The proposed framework is illustrated in **Figure 1**. The MRI data were processed through two paths, which extract the CNN-based and FreeSurfer-based image features, respectively. In the left path, a CNN is trained on the AD/NC image patches and is then employed to extract CNN-based features from MCI images. In the right path, FreeSurfer-based features are calculated with the FreeSurfer software. These features, which were further mined with dimension reduction and sparse feature selection via PCA and LASSO, respectively, were concatenated into a feature vector and fed to an extreme learning machine as the classifier. Finally, leave-one-out cross validation is used to evaluate the performance of the proposed approach.

### ADNI Data

Data used in this work were downloaded from the ADNI database. The ADNI is an ongoing, longitudinal study designed to develop clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of AD. The ADNI study began in 2004 and its first 6-year study is called ADNI1. Standard analysis sets of MRI data from ADNI1 were used in this work, including 188 AD, 229 NC, and 401 MCI subjects (Wyman et al., 2013). These MCI subjects were grouped as: (1) MCI converters, who were diagnosed as MCI at the first visit but converted to AD during the longitudinal visits within 3 years (n = 169); (2) MCI non-converters, who did not convert to AD within 3 years (n = 139), including subjects who were diagnosed as MCI at least twice but eventually reverted to NC; (3) unknown MCI subjects, who missed some diagnoses so that their final state was unknown (n = 93). The demographic information of the dataset is presented in **Table 1**. The age ranges of the different groups are similar. The proportions of males and females are close in the AD/NC groups, while the proportion of males is higher than that of females in the MCI groups.

### Image Preprocessing

MRI images were preprocessed following the steps in Tong et al. (2017). All images were first skull-stripped according to Leung et al. (2011), and then aligned to the MNI151 template using a B-spline free-form deformation registration (Rueckert et al., 1999). In the implementation, we follow the registration approach of Tong et al. (2017), who showed that deformable registration with a control point spacing between 10 and 5 mm gives the best performance in classifying AD/NC and converters/non-converters. After that, the image intensities of the subjects were normalized by deforming the histogram of each subject's image to match the histogram of the MNI151 template (Nyul and Udupa, 1999). Finally, all MRI images were in the same template space and had the same intensity range.

TABLE 1 | The demographic information of the dataset used in this work.

MCIc means MCI converters, MCInc means MCI non-converters, and MCIun means MCI unknown.

### Age Correction


Normal aging has atrophy effects similar to those of AD (Giorgio et al., 2010). To reduce the confounding effect of age-related atrophy, age correction is necessary. The age-related effect is estimated by fitting a voxel-wise regression model (Dukart et al., 2011) to the subjects' ages. We assume there are $N$ healthy subjects and $M$ voxels in each preprocessed MRI image, and denote $\mathbf{y}_m \in \mathbb{R}^{1 \times N}$ as the vector of the intensity values of the $N$ healthy subjects at the $m$th voxel, and $\boldsymbol{\alpha} \in \mathbb{R}^{1 \times N}$ as the vector of the ages of the $N$ healthy subjects. The age-related effect is estimated by fitting the linear regression model $\mathbf{y}_m = \omega_m \boldsymbol{\alpha} + b_m$ at the $m$th voxel. For the $n$th subject, the new intensity of the $m$th voxel can be calculated as $y'_{mn} = \omega_m (C - \alpha_n) + y_{mn}$, where $y_{mn}$ is the original intensity and $\alpha_n$ is the age of the $n$th subject. In this study, $C$ is 75, which is the mean age of all subjects.
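The age correction above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names, the subjects-by-voxels array layout, and the synthetic data are assumptions, and the slope is obtained with the closed-form ordinary least-squares solution.

```python
import numpy as np


def fit_age_slopes(healthy_intensities, healthy_ages):
    """Estimate the per-voxel slope omega_m of y_m = omega_m * alpha + b_m.

    healthy_intensities: array of shape (N, M), one row per healthy subject
                         (assumed layout).
    healthy_ages:        array of shape (N,).
    Returns omega with shape (M,).
    """
    ages = np.asarray(healthy_ages, dtype=float)
    y = np.asarray(healthy_intensities, dtype=float)
    ages_c = ages - ages.mean()
    # Closed-form least-squares slope, computed for all voxels at once.
    return ages_c @ (y - y.mean(axis=0)) / (ages_c @ ages_c)


def age_correct(intensities, ages, omega, reference_age=75.0):
    """Apply y'_mn = omega_m * (C - alpha_n) + y_mn to every subject."""
    ages = np.asarray(ages, dtype=float)
    return np.asarray(intensities, dtype=float) + np.outer(reference_age - ages, omega)


# Synthetic example: two voxels whose intensity depends linearly on age.
ages = np.array([60.0, 70.0, 80.0, 90.0])
images = np.column_stack([2.0 * ages + 1.0, -0.5 * ages + 3.0])
omega = fit_age_slopes(images, ages)
corrected = age_correct(images, ages, omega)
```

In this synthetic case every subject ends up with the intensity expected at the reference age of 75, so the age trend is removed while between-subject differences unrelated to age would be preserved.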

### CNN-Based Features

A CNN was adopted to extract features from MRI images of NC and AD subjects. Then, the trained CNN was used to extract image features of MCI subjects. To exploit the multiple image planes in MRI, a 2.5D patch was formed by extracting three 32 × 32 patches from the transverse, coronal, and sagittal planes centered at the same point (Shin et al., 2016). The three patches were then combined into a 2D RGB patch. **Figure 2** shows an example of constructing a 2.5D patch. For a given voxel, three patches are extracted from the three planes and then concatenated into a three-channel cube, in the same way that a color patch is composed of the red/green/blue channels commonly used in computer vision. This process allows us to mine rich information from 3D views of MRI by feeding the 2.5D patch into a typical color image processing CNN. Data augmentation (Shin et al., 2016) was used to increase the number of training samples by extracting multiple patches at different locations from the MRI images. The choice of locations has three constraints: (1) the patches must originate in either the left or right hippocampus region, which is highly correlated with AD (van de Pol et al., 2006); (2) there must be a distance of at least two voxels between locations; (3) all locations were randomly chosen. With these constraints, 151 patches were extracted from each image, and the sampling positions were fixed during the experiments. The number of samples was thus expanded by a factor of 151, which could reduce over-fitting.
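The construction of a single 2.5D patch can be sketched as follows. This is a hedged illustration: the function name is hypothetical, and the axis convention (fixing the third index gives the transverse plane, the second the coronal plane, and the first the sagittal plane) is an assumption that depends on how the volume is stored.

```python
import numpy as np


def extract_25d_patch(volume, center, size=32):
    """Stack three orthogonal size x size slices through `center` as channels.

    volume: 3D array indexed as (x, y, z) (assumed axis convention).
    center: (x, y, z) voxel coordinates of the patch center.
    Returns an array of shape (size, size, 3).
    """
    half = size // 2
    x, y, z = center
    for c, dim in zip(center, volume.shape):
        if c - half < 0 or c + half > dim:
            raise ValueError("patch would extend beyond the volume")
    sx = slice(x - half, x + half)
    sy = slice(y - half, y + half)
    sz = slice(z - half, z + half)
    transverse = volume[sx, sy, z]  # z fixed
    coronal = volume[sx, y, sz]     # y fixed
    sagittal = volume[x, sy, sz]    # x fixed
    return np.stack([transverse, coronal, sagittal], axis=-1)


# Toy volume with unique voxel values so slices can be checked.
volume = np.arange(64 ** 3).reshape(64, 64, 64)
patch = extract_25d_patch(volume, (32, 32, 32))
print(patch.shape)
```

All three channels share the center voxel, which is why the result can be treated like an RGB image by a standard 2D CNN.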

Typical extracted patches are presented in **Figure 3**. **Figure 3A** shows four 2.5D patches obtained from one subject. These patches are extracted from different positions and show different portions of the hippocampus, which means that they contain different information about the morphology of the hippocampus. When trained with these patches, which are spread over the whole hippocampus, the CNN learns the morphology of the whole hippocampus. **Figure 3B** shows patches extracted at the same position from four subjects of different groups, demonstrating that the AD subject has the most severe atrophy of the hippocampus and expansion of the ventricle. This implies that obvious differences exist between AD and NC. The MCI subjects, however, have moderate atrophy of the hippocampus: the non-converter is more similar to NC than to AD, and the converter is more similar to AD. The difference between converter and non-converter is smaller than the difference between AD and NC.

The architecture of the CNN is summarized in **Figure 4**. The input is a 32 × 32 patch with 3 RGB channels. There are three convolutional layers and three pooling layers. The kernel size of each convolutional layer is 5 × 5 with 2 pixels of padding, and the kernel size and stride of the pooling layers are 3 × 3 and 2, respectively. The first convolutional layer generates 32 feature maps with a size of 32 × 32. After max pooling, these 32 feature maps are down-sampled to 16 × 16. The next two convolutional layers and average pooling layers finally generate 64 feature maps with a size of 4 × 4. These features are concatenated into a feature vector and then fed to a fully connected layer and a softmax layer for classification. The CNN also contains rectified linear unit layers and local response normalization layers, which are not shown for simplicity.
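The feature-map sizes quoted above can be checked with a short calculation. The sketch below assumes a convolution stride of 1 (implied by the 32 × 32 output of the first layer) and assumes that the pooling output size is rounded up, as Caffe does; with round-down pooling the first pooled map would be 15 × 15 rather than 16 × 16. The function names are illustrative.

```python
import math


def conv_output_size(size, kernel=5, pad=2, stride=1):
    """Spatial size after a convolution (stride 1 is an assumption)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_output_size(size, kernel=3, stride=2):
    """Spatial size after pooling with ceil rounding (Caffe-style, assumed)."""
    return math.ceil((size - kernel) / stride) + 1


def trace_feature_map_sizes(input_size=32, n_stages=3):
    """Return the map size after each conv + pool stage."""
    sizes = []
    size = input_size
    for _ in range(n_stages):
        size = conv_output_size(size)
        size = pool_output_size(size)
        sizes.append(size)
    return sizes


sizes = trace_feature_map_sizes()
n_features = 64 * sizes[-1] ** 2  # 64 maps in the last stage
print(sizes, n_features)  # [16, 8, 4] 1024
```

The final count of 64 × 4 × 4 = 1024 matches the number of features taken from the last pooling layer.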

The CNN was trained with patches from NC and AD subjects. There are 62967 patches (417 subjects × 151 patches), which are randomly split into 417 mini-batches. Mini-batch stochastic gradient descent was used to update the coefficients of the CNN. In each step, a mini-batch was fed into the CNN, and the error back-propagation algorithm was then carried out to compute the gradient $g_j$ of the $j$th coefficient $\theta_j$ and to update the coefficient as $\theta'_j = \theta_j + \Delta\theta_j^n$, in which $\Delta\theta_j^n = m\,\Delta\theta_j^{n-1} - \eta(g_j + \lambda\theta_j)$ is the increment of $\theta_j$ at the $n$th step. The momentum $m$, learning rate $\eta$, and weight decay $\lambda$ are set to 0.9, 0.001, and 0.0001, respectively, in this work. One epoch is completed when all mini-batches have been used to train the CNN once. The CNN was trained for 30 epochs. Once the network was trained, it was used to extract high-level features from the MCI subjects' images. The 1024 features output by the last pooling layer were taken as CNN-based features. Thus, the CNN generates 154624 (1024 × 151) features for each image.
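The update rule above, stochastic gradient descent with momentum and weight decay, can be written out for a list of scalar coefficients. This is a minimal sketch with a hypothetical function name; in practice the deep learning framework performs this update internally.

```python
def sgd_momentum_step(theta, grad, increment,
                      momentum=0.9, lr=0.001, weight_decay=0.0001):
    """One update: delta_n = m * delta_{n-1} - lr * (g + lambda * theta).

    theta, grad, increment are equal-length lists of floats; `increment`
    holds the previous step's delta (all zeros before the first step).
    Returns (new_theta, new_increment).
    """
    new_increment = [
        momentum * d - lr * (g + weight_decay * t)
        for t, g, d in zip(theta, grad, increment)
    ]
    new_theta = [t + d for t, d in zip(theta, new_increment)]
    return new_theta, new_increment


# Single coefficient, first step (previous increment is zero).
theta, delta = sgd_momentum_step([1.0], [2.0], [0.0])
print(theta, delta)
```

The momentum term reuses a fraction of the previous increment, while the weight-decay term adds a small pull of each coefficient toward zero.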

### FreeSurfer-Based Features

FreeSurfer (version 4.3) (Fischl and Dale, 2000; Fischl et al., 2004; Desikan et al., 2006; Han et al., 2006) was used to mine more morphological information from MRI images, such as cortical volume, surface area, cortical thickness average, and standard deviation of thickness in each region of interest. These features can be downloaded directly from the ADNI website, and 325 features are used to predict MCI-to-AD conversion after age correction. The age correction for FreeSurfer-based features is similar to that described above, but is applied to these 325 features instead of to the intensity values of MRI images.

### Features Selection

Redundant features may exist among the CNN-based features; thus, we introduced principal component analysis (PCA) (Avci and Turkoglu, 2009; Babaoğlu et al., 2010; Wu et al., 2013) and the least absolute shrinkage and selection operator (LASSO) (Kukreja et al., 2006; Usai et al., 2009; Yamada et al., 2014) to reduce the final number of features.

PCA is an unsupervised learning method that uses an orthogonal transformation to convert a set of samples consisting of possibly correlated features into samples consisting of linearly uncorrelated new features. It has been extensively used in data analysis (Avci and Turkoglu, 2009; Babaoğlu et al., 2010; Wu et al., 2013). In this work, PCA is adopted to reduce the dimensions of features. Parameters of PCA are: (1) For CNN-based features, there are 1024 features for each patch. After PCA, $P_C$ features were left for each patch; since there are 151 patches for one subject, there are still $P_C \times 151$ features for each subject; (2) For FreeSurfer-based features, $P_F$ features were left for each MCI subject.
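The per-patch reduction and per-subject concatenation can be sketched with a plain SVD-based PCA. Several details here are assumptions made for illustration: the PCA is fitted on the patch features of all subjects pooled together, the toy feature dimension is 64 rather than 1024, and the value chosen for the number of retained components is arbitrary.

```python
import numpy as np


def fit_pca(samples, n_components):
    """Return (mean, components) from an SVD of the centered samples."""
    mean = samples.mean(axis=0)
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(samples, mean, components):
    """Project samples onto the retained principal directions."""
    return (samples - mean) @ components.T


rng = np.random.default_rng(0)
n_subjects, n_patches, n_features, p_c = 6, 151, 64, 5  # toy sizes
patch_features = rng.normal(size=(n_subjects, n_patches, n_features))

# Pool all patches, reduce each one to p_c features, then concatenate the
# 151 reduced patches of each subject into one row.
pooled = patch_features.reshape(-1, n_features)
mean, components = fit_pca(pooled, p_c)
subject_features = apply_pca(pooled, mean, components).reshape(
    n_subjects, n_patches * p_c
)
print(subject_features.shape)
```

Each subject is then described by 151 times the number of retained components, which is the input to the subsequent sparse selection step.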

LASSO is a supervised learning method that uses L<sup>1</sup> norm in sparse regression (Kukreja et al., 2006; Usai et al., 2009; Yamada et al., 2014) as follows:

$$\min_{\boldsymbol{\alpha}} \; 0.5\,\|\mathbf{y} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 + \lambda\|\boldsymbol{\alpha}\|_1 \tag{1}$$

where $\mathbf{y} \in \mathbb{R}^{1 \times N}$ is the vector consisting of the $N$ labels of the training samples, $\mathbf{D} \in \mathbb{R}^{N \times M}$ is the feature matrix of the $N$ training samples consisting of $M$ features, $\lambda$ is the penalty coefficient, which was set to 0.1, and $\boldsymbol{\alpha} \in \mathbb{R}^{1 \times M}$ is the vector of target sparse coefficients, which can be used for selecting features with large coefficients. The LASSO was solved with least angle regression (Efron et al., 2004), and $L$ features are selected after $L$ iterations. Parameters of LASSO are: (1) For CNN-based features, $L_C$ features were selected from the $P_C \times 151$ features for each MCI subject; (2) For FreeSurfer-based features, $L_F$ features were selected from the $P_F$ features. After PCA and LASSO, there were $L_C + L_F$ features.
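To make the objective in formula (1) concrete, the sketch below minimizes it with iterative soft-thresholding (ISTA) and then keeps the features with the largest coefficient magnitudes. Note that this is a swapped-in solver chosen for brevity, not the least angle regression used in this work, and the function names and toy data are hypothetical.

```python
import numpy as np


def soft_threshold(values, threshold):
    """Proximal operator of the L1 norm."""
    return np.sign(values) * np.maximum(np.abs(values) - threshold, 0.0)


def lasso_ista(D, y, lam=0.1, n_iter=500):
    """Minimize 0.5 * ||y - D a||_2^2 + lam * ||a||_1 by ISTA."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant
    for _ in range(n_iter):
        gradient = D.T @ (D @ alpha - y)
        alpha = soft_threshold(alpha - step * gradient, step * lam)
    return alpha


def select_features(alpha, n_select):
    """Indices of the n_select coefficients with largest magnitude."""
    return np.argsort(-np.abs(alpha))[:n_select]


# With an identity feature matrix the solution is soft_threshold(y, lam).
D = np.eye(3)
y = np.array([3.0, 0.05, -2.0])
alpha = lasso_ista(D, y, lam=0.1)
selected = select_features(alpha, 2)
print(alpha, selected)
```

The small second coefficient is shrunk exactly to zero, which is the sparsity property that makes LASSO usable as a feature selector.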

**Figure 5** shows more details of the CNN-based features. 151 patches are extracted from all MRI images, including AD, NC, and MCI. First, the CNN is trained with patches of all AD and NC subjects. After that, the trained CNN is used to output 1024 features from each MCI patch. The 1024 features of each patch are reduced to $P_C$ features by PCA, the features of all 151 patches from one subject are then concatenated, and LASSO is used to select the $L_C$ most informative features from them.

### Extreme Learning Machine

The extreme learning machine, a feed-forward neural network with a single layer of hidden nodes, learns much faster than common networks trained with the back-propagation algorithm (Huang et al., 2012; Zeng et al., 2017). A special extreme learning machine that adopts a kernel (Huang et al., 2012) to calculate the outputs as in formula (2), and thus avoids the random generation of the input weight matrix, is chosen to classify converters/non-converters with both CNN-based features and FreeSurfer-based features. In formula (2), Ω is a matrix with elements Ω<sub>i,j</sub> = **K**(**x**<sub>i</sub>, **x**<sub>j</sub>), where **K**(**a**, **b**) is a radial basis function kernel in this study, [**x**<sub>1</sub>, ..., **x**<sub>N</sub>] are the N training samples, **y** is the label vector of the training samples, and **x** is a testing sample. C is a regularization coefficient and was set to 1 in this study.

$$f\left(\mathbf{x}\right) = \begin{bmatrix} K\left(\mathbf{x}, \mathbf{x}_1\right) \\ \vdots \\ K\left(\mathbf{x}, \mathbf{x}_N\right) \end{bmatrix}^{T} \left(\Omega + \frac{\mathbf{I}}{C}\right)^{-1} \mathbf{y} \tag{2}$$
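A minimal pure-Python sketch of the kernel extreme learning machine output is given below. It follows the closed-form solution of Huang et al. (2012), in which the regularized kernel matrix Ω + **I**/C is inverted (here by solving a linear system) and the result is combined with the kernel values between the test sample and the training samples. The kernel width `gamma`, the ±1 label encoding, the function names, and the toy data are assumptions; this is not the shared online code used in the paper.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def kernel_elm_fit(X, y, C=1.0, gamma=1.0):
    """Return beta = (Omega + I / C)^{-1} y for the training samples X."""
    n = len(X)
    A = [[rbf_kernel(X[i], X[j], gamma) + (1.0 / C if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear_system(A, y)


def kernel_elm_predict(x, X, beta, gamma=1.0):
    """Evaluate f(x) = [K(x, x_1), ..., K(x, x_N)] . beta."""
    return sum(rbf_kernel(x, xi, gamma) * bi for xi, bi in zip(X, beta))


# Toy one-dimensional problem with labels encoded as -1 / +1 (an assumption).
X_train = [[0.0], [0.2], [1.0], [1.2]]
y_train = [-1.0, -1.0, 1.0, 1.0]
beta = kernel_elm_fit(X_train, y_train, C=1.0)
score_low = kernel_elm_predict([0.1], X_train, beta)
score_high = kernel_elm_predict([1.1], X_train, beta)
print(score_low, score_high)
```

The sign of the score would then serve as the predicted class. Because no input weights are drawn at random, repeated fits on the same data give the same result.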

### Implementation

In our implementation, the CNN was implemented with Caffe<sup>1</sup>, LASSO was carried out with SPAMS<sup>2</sup>, and the extreme learning machine was run with shared online code<sup>3</sup>. The hippocampus segmentation was performed with MALPEM<sup>4</sup> (Ledig et al., 2015) for all MRI images. All hippocampus masks were then registered in the same way as their corresponding MRI images and overlapped to create a single mask containing the hippocampus regions. All image features were normalized to have zero mean and unit variance before training or selection. To evaluate the performance, leave-one-out cross validation was used, as in Coupé et al. (2012), Ye et al. (2012), and Zhang et al. (2012).

<sup>1</sup>http://caffe.berkeleyvision.org/

<sup>2</sup>http://spams-devel.gforge.inria.fr/

<sup>3</sup>http://www.ntu.edu.sg/home/egbhuang/
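The feature normalization and the leave-one-out protocol mentioned in the Implementation paragraph can be sketched as follows. The text does not specify whether the population or the sample standard deviation was used, nor whether the statistics were computed on all subjects or only on each training fold; the sketch assumes the population standard deviation over whatever rows are passed in, and the helper names are hypothetical.

```python
import statistics


def zscore_columns(rows):
    """Scale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Guard constant columns, whose standard deviation is zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_index) pairs, one per sample."""
    for test_index in range(n_samples):
        train = [i for i in range(n_samples) if i != test_index]
        yield train, test_index


features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = zscore_columns(features)
splits = list(leave_one_out_splits(len(features)))
print(normalized[0], splits[0])
```

With N subjects, the classifier is trained N times, each time on N - 1 subjects, and evaluated on the single held-out subject.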

### RESULTS

### Validation of the Robustness of 2.5D CNN

To validate the robustness of the CNN, several experiments were performed. In these experiments, the binary decisions of the CNN for the 151 patches were combined to make the final diagnosis of the testing subject. We compared the performance under four different conditions: (1) the CNN was trained with AD/NC patches and used to classify AD/NC subjects; (2) the CNN was trained with converters/non-converters patches and used to classify converters/non-converters; (3) the CNN was trained with AD/NC patches and used to classify converters/non-converters; (4) the same as condition (3), but with different sampling patches in each validation run.

The results are shown in **Table 2**. The CNN has a poor accuracy of 68.49% in classifying converters/non-converters when trained with converters/non-converters patches, but it obtains a much higher accuracy of 73.04% when trained with AD/NC patches. This means that the CNN learned more useful information from AD/NC data than from converters/non-converters data. The prediction performance of the CNN is also similar when different sampling patches are used.

<sup>4</sup>http://www.christianledig.com/

#### TABLE 2 | The performance of the 2.5D CNN.

MCIc means MCI converters. MCInc means MCI non-converters. The results were obtained with 10-fold cross validations, and averaged over 50 runs.

### Effect of Combining Two Types of Features

In this section, we present the performance of CNN-based features, FreeSurfer-based features, and their combination. The P<sup>C</sup>, P<sup>F</sup>, L<sup>C</sup>, and L<sup>F</sup> parameters were set to 29, 150, 35, and 40, respectively, which were optimized in experiments. Finally, 75 features were selected and fed to the extreme learning machine.

#### TABLE 3 | The performance of different features used, and the performance without age correction.

Bold values indicate the best performance in each column.

Performance was evaluated by calculating accuracy (the number of correctly classified subjects divided by the total number of subjects), sensitivity (the number of correctly classified MCI converters divided by the total number of MCI converters), specificity (the number of correctly classified MCI non-converters divided by the total number of MCI non-converters), and AUC (area under the receiver operating characteristic curve). The performances of the proposed method and of the approaches with only one type of features are summarized in **Table 3**. These results indicate that the approaches with only CNN-based features or only FreeSurfer-based features have similar performances, and that the proposed method combining both types of features achieves the best accuracy, sensitivity, specificity, and AUC. Thus, it is meaningful to combine the two types of features in the prediction of MCI-to-AD conversion. The AUC of the proposed method reached 86.1%, indicating the promising performance of this method. The receiver operating characteristic (ROC) curves of these approaches are shown in **Figure 6**.
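The four evaluation criteria defined above can be written down directly. The sketch below is a plain-Python illustration, with 1 denoting an MCI converter and 0 a non-converter; the function names, the 0.5 decision threshold, and the toy scores are assumptions, and the AUC is computed through its pairwise-ranking interpretation rather than by integrating a ROC curve.

```python
def classification_metrics(labels, predictions):
    """Accuracy, sensitivity, and specificity for binary labels (1 = converter)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # correctly classified converters
        "specificity": tn / (tn + fp),   # correctly classified non-converters
    }


def auc(labels, scores):
    """Probability that a random converter scores above a random non-converter."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.6]
predictions = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(labels, predictions)
print(metrics, auc(labels, scores))
```

In this toy example two of three converters and one of two non-converters are classified correctly, so the sensitivity is 2/3, the specificity is 1/2, and the accuracy is 3/5.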

### Impact of Age Correction

We investigated the impact of age correction on the prediction of conversion. The prediction accuracy in **Table 3** and the ROC curves in **Figure 6** indicate that age correction can substantially improve the accuracy and AUC. Thus, age correction is an important step in the proposed method.

### Comparisons to Other Methods

In this section, we first compared the extreme learning machine with the support vector machine and the random forest. The performances of the three classifiers are shown in **Table 4**, indicating that the extreme learning machine achieves the best accuracy and AUC among the three classifiers.

Then we compared the proposed method with other state-of-the-art methods that use the same data (Moradi et al., 2015; Tong et al., 2017), which consist of 100 MCI non-converters and 164 MCI converters. In both methods, MRI images were first preprocessed and registered, but in different ways. After that, feature selection was performed to select the most informative voxels among all MRI voxels. Moradi et al. used a regularized logistic regression algorithm to select a subset of MRI voxels, whereas Tong et al. used an elastic net algorithm. Both methods trained the feature selection algorithms with AD/NC data to learn the most discriminative voxels and then used them to select voxels from MCI data. Finally, Moradi et al. used low density separation to calculate MRI biomarkers and to predict MCI converters/non-converters. Tong et al. used elastic net regression to calculate grading biomarkers from MCI features, and an SVM was used to classify MCI converters/non-converters with the grading biomarker.

For fair comparisons, both 10-fold cross validation and leave-one-out cross validation were performed on the proposed method and the method of Tong et al. (2017), with only MRI data used. The parameters of the compared approaches were optimized to achieve the best performance. **Table 5** shows the performances of the three methods in 10-fold cross validation and **Table 6** summarizes the performances in leave-one-out cross validation. These two tables demonstrate that the proposed method achieves the best accuracy and AUC among the three methods, which means that the proposed method is more accurate in predicting MCI-to-AD conversion than the other methods. The sensitivity of the proposed method is slightly lower than that of the method of Moradi et al. (2015) but much higher than that of the method of Tong et al. (2017), and the specificity of the proposed method lies between those of the other two methods. Higher sensitivity means a lower rate of missed diagnosis of converters, and higher specificity means a lower rate of misdiagnosing non-converters as converters. Overall, the proposed method has a good balance between sensitivity and specificity.

### DISCUSSION

The CNN performs better when trained with AD/NC patches rather than MCI patches. We think the reason is that the pathological differences between MCI converters and non-converters are more subtle than those between AD and NC. Thus, it is more difficult for the CNN to learn useful information about AD-related pathological changes directly from MCI data than from AD/NC data. In MCI data, the pathological changes are also obscured by inter-subject variations. Inspired by the work in Moradi et al. (2015) and Tong et al. (2017), which use information of AD and NC to help classify MCI, we trained the CNN with the patches from AD and NC subjects and improved the performance.

#### TABLE 4 | Comparison of extreme learning machine with the other two classifiers.

SVM was implemented using the third-party library LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/), and the random forest was implemented with a third-party library (http://code.google.com/p/randomforest-matlab). Both classifiers used the default settings.

#### TABLE 5 | Comparison with other methods on the same dataset in 10-fold cross validation.

The performances of the MRI biomarker and the global grading biomarker are described in Moradi et al. (2015) and Tong et al. (2017). The results are averages over 100 runs, and the standard deviation/confidence intervals of accuracy and AUC of the proposed method are 1.19%/[0.7922, 0.7968] and 0.83%/[0.8358, 0.8391]. Bold values indicate the best performance in each column.

#### TABLE 6 | Comparison with other methods on the same dataset in leave-one-out cross validation.

The global grading biomarkers were downloaded from the website described in Tong et al. (2017), and the experiment was performed with the same method as in Tong et al. (2017). Bold values indicate the best performance in each column.

#### TABLE 7 | The 15 most informative FreeSurfer-based features for predicting MCI-to-AD conversion.

After non-rigid registration, the differences between the subjects' MRI brain images are mainly in the hippocampus (Tong et al., 2017). We therefore extracted 2.5D patches only from hippocampus regions, which discards the information of other regions. For this reason, we included the whole-brain features calculated by FreeSurfer as complementary information. With the help of FreeSurfer-based features, the accuracy and AUC of classification increased from 76.9% and 82.9% to 79.9% and 86.1%, respectively. To explore which FreeSurfer-based features contribute most to predicting MCI-to-AD conversion, we used LASSO to select the most informative features. The top 15 features are listed in **Table 7**; almost all of them are volumes or average thicknesses of regions related to AD. The average thickness of the frontal pole is the most discriminative feature. The quantitative features of the hippocampus are not listed, indicating that they contribute less than the listed features when predicting conversion. The CNN extracts the deep features of hippocampus morphology, which are discriminative for AD diagnosis, rather than the quantitative features of the hippocampus. Therefore, the CNN-based features and FreeSurfer-based features contain different useful information for the classification of converters/non-converters, and they complement each other to improve the performance of the classifier.

Unlike the two methods of Moradi et al. (2015) and Tong et al. (2017), which directly used voxels as features, the proposed method employs a CNN to learn deep features from the morphology of the hippocampus and combines the CNN-based features with the global morphological features computed by FreeSurfer. We believe that the learnt CNN features might be more meaningful and more discriminative than voxels. In the comparison with these two methods, only MRI data were used, but the performance of both methods improved when MRI data were combined with age and cognitive measures. Investigating the combination of the proposed approach with data from other modalities to improve performance is therefore one of our future works.

We have also listed several recent deep learning-based studies for comparison in **Table 8**. Most of them predict conversion with an accuracy above 70%, and the last three approaches (including the proposed one) have an accuracy above 80%. The best accuracy was achieved by Lu et al. (2018a), which uses both MRI and PET data. However, when only MRI data were used, the accuracy of Lu's method declined to 75.44%. Although an accuracy of 82.51% was also obtained with PET data (Lu et al., 2018b), PET scanning usually involves contrast agents and is more expensive than routine MRI. In summary, our approach achieved the best performance when only MRI images were used, and it is expected to improve further by incorporating data from other modalities, e.g., PET, in the future.

#### TABLE 8 | Results of previous deep learning based approaches for predicting MCI-to-AD conversion.

MCIc means MCI converters. MCInc means MCI non-converters. Different subjects and modalities of data are used in these approaches. All the criteria are copied from the original literature. Bold values indicate the best performance in each column.

In this work, the period for predicting conversion was set to 3 years, which separates MCI subjects into MCI non-converters and MCI converters according to whether they convert to AD within 3 years. However, no matter what the prediction period is, there is a disadvantage: even if the classifier correctly predicts that an MCI non-converter will not convert to AD within the specified period, the conversion might still happen half a year or even 1 month later. Modeling the progression of AD and predicting the time of conversion with longitudinal data are more meaningful (Guerrero et al., 2016; Xie et al., 2016). Our future work will investigate the use of CNNs in modeling the progression of AD.
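The limitation of a fixed prediction period can be made concrete with a small labelling sketch. The 3-year (36-month) period comes from the text; the function name, the use of `None` for subjects with no observed conversion, and the inclusive boundary at 36 months are assumptions made for illustration.

```python
def label_mci_subject(months_to_conversion, horizon_months=36):
    """Return 'MCIc' if conversion is observed within the horizon, else 'MCInc'.

    months_to_conversion is None when no conversion was observed
    (a hypothetical encoding, not taken from the paper).
    """
    if months_to_conversion is not None and months_to_conversion <= horizon_months:
        return "MCIc"
    return "MCInc"


# A subject converting one month after the horizon is still labelled MCInc.
for months in (12, 36, 37, None):
    print(months, label_mci_subject(months))
```

A subject who converts at month 37 receives the same label as one who never converts, which is the loss of information that time-to-conversion modeling would avoid.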

### CONCLUSION


In this study, we have developed a framework that uses only MRI data to predict MCI-to-AD conversion by applying a CNN and other machine learning algorithms. The results show that the CNN can extract discriminative features of the hippocampus for prediction by learning the morphological changes of the hippocampus between AD and NC. FreeSurfer provides extra structural brain image features that improve the prediction performance as complementary information. Compared with other state-of-the-art methods, the proposed one achieves higher accuracy and AUC while keeping a good balance between sensitivity and specificity.

### AUTHOR CONTRIBUTIONS

WL and XQ conceived the study, designed the experiments, analyzed the data, and wrote the whole manuscript. TT and QG provided the preprocessed data. WL, XQ, DG, XD, and MX carried out experiments. YY and GG helped to analyze the data and experimental results. MD and XQ revised the manuscript.

### FUNDING

This work was partially supported by National Key R&D Program of China under Grants (No. 2017YFC0108703), National Natural Science Foundation of China under Grants (Nos. 61871341, 61571380, 61811530021, 61672335, 61773124, 61802065, and 61601276), Project of Chinese Ministry of Science and Technology under Grants (No. 2016YFE0122700), Natural Science Foundation of Fujian Province of China under Grants (Nos. 2018J06018, 2018J01565, 2016J05205, and 2016J05157), Science and Technology Program of Xiamen under Grants (No. 3502Z20183053), Fundamental Research Funds for the Central Universities under Grants (No. 20720180056), and the Foundation of Educational and Scientific Research Projects for Young and Middle-aged Teachers of Fujian Province under Grants (Nos. JAT160074 and JAT170406).






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Lin, Tong, Gao, Guo, Du, Yang, Guo, Xiao, Du, Qu and The Alzheimer's Disease Neuroimaging Initiative. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
