An Automated Method for Segmenting White Matter Lesions through Multi-Level Morphometric Feature Classification with Application to Lupus

We demonstrate an automated, multi-level method to segment white matter brain lesions and apply it to lupus. The method makes use of local morphometric features based on multiple MR sequences, including T1-weighted, T2-weighted, and fluid attenuated inversion recovery. After preprocessing, including co-registration, brain extraction, bias correction, and intensity standardization, 49 features are calculated for each brain voxel based on local morphometry. At each level of segmentation a supervised classifier takes advantage of a different subset of the features to conservatively segment lesion voxels, passing on more difficult voxels to the next classifier. This multi-level approach allows for a fast lesion classification method with tunable trade-offs between sensitivity and specificity producing accuracy comparable to a human rater.


INTRODUCTION
Systemic lupus erythematosus (SLE) is an autoimmune disease affecting multiple tissues, including the brain. Neurologic and psychiatric complications in SLE, subsumed by the term neuropsychiatric SLE (NPSLE), occurs in up to 95% of SLE patients. While MRI often reveals distinct white matter abnormalities in active NPSLE, the pathologic processes underlying these lesions, whether involving autoimmune mechanisms or purely vascular in nature (e.g., hemostasis), are unknown. Nonetheless, accurately measuring brain lesions can contribute substantially to understanding the pathology causing NPSLE. The traditional method of lesion identifi cation in NPSLE involves the expensive and timeintensive manual rating of each image slice by a radiologist. An accurate automated lesion analysis method would reduce the time and costs associated with SLE brain lesion measurement and, hence, facilitate both research studies and the diagnosis of this prevalent disease.
There exists a large body of work focused on segmenting white matter lesions in various disorders such as multiple sclerosis (Van Leemput et al., 2001;Lao et al., 2006;Khayati et al., 2008;Morra et al., 2008), vascular dementia (Yamashita et al., 2008), and leukoaraiosis (Kruggel et al., 2008;Schwarz et al., 2009), but seldom have these methods been applied to NPSLE. Though white matter lesions appear in the above disorders as hyperintensities in T2-weighted (T2w) MRI images, their etiology differs from NPSLE substantially (Rovaris et al., 2000;Brey, 2007). Particularly, MS lesions arise from Figure 1 outlines the general lesion segmentation protocol from raw data through segmentation.

DATA ACQUISITION
T1w, T2w, and fl uid attenuated inversion recovery (FLAIR) images were acquired for 27 subjects with a diagnosis of SLE on a Siemens Sonata 1.5T scanner at The Mind Research Network with sequence parameters as shown in Table 1. This protocol was selected for compatibility with a standard clinical sequence in use at the University of New Mexico Hospital (UNMH) and clinics where it was originally used for the qualitative analysis of lesion load. The method in this paper was developed to be applied on a standard clinical sequence such as the one employed at UNMH.
Of the 27 subjects collected, 10 subjects were used for training and 17 subjects were reserved for testing. Each subject was evaluated by a neuroradiologist and found to have white matter lesions consistent with SLE with no other obvious neurological issues. All subject data was collected with human subject consent and appropriate approval from the Internal Review Board at the University of New Mexico.
The gold standard lesion segmentation was created via manual tracing of white matter lesions performed by an experienced rater. Raters were requested to review the axial FLAIR images from inferior to superior for visual hyperintensities in the white matter regions. All FLAIR hyperintensities that are regionally bright relative to the surrounding tissue and distant from the ventricles were considered lesions. Lesions that were greater in size than three contiguous voxels were traced by the expert rater. During the identification of the lesion, the rater had the opportunity to refer to the co-registered T1w and T2w images. This was helpful because of the decreased partial volume artifact in the T1w images. For FLAIR hyperintensites that were adjacent to the ventricles, hyperintensities that were significantly hemispherically asymmetric were considered lesion,   while hemispherically symmetric hyperintensities were considered to be periventricular artifact. Symmetry was evaluated in a qualitative manner.

PREPROCESSING
Consistent voxel location and size is required when combining multiple image intensities into a single feature vector. To achieve this all images were resampled to 1mm isotropic resolution using trilinear interpolation. Each subject's T2w and FLAIR were then registered to the T1w using Matte's Mutual registration (Ibanez et al., 2005).
Multiple studies have shown that voxel-based lesion classification is sensitive to B1-fi eld inhomogeneity (Madabhushi and Udupa, 2005;Jäger et al., 2007;Jäger and Hornegger, 2009) within each series acquired on a single subject and between individuals. Multiple solutions have been examined in the literature, but some form of bias correction and intensity standardization is normally employed (Madabhushi and Udupa, 2005). Bias correction was performed using an implementation of Van Leemput et al.'s (1999) method due to its ability to incorporate information from multiple MR sequences. After performing bias correction, brain extraction was performed using (Pierson et al., 2009). Intensity standardization through linear histogram matching, as presented by Nyul et al. (2000), was used to separately match each sequence to a reference image to reduce inter-subject intensity variability.

FEATURE CREATION/RANKING
Human raters do not judge lesions solely on intensities, on a voxelby-voxel basis, independent of their neighbors. The obvious reason is that much of the information that differentiates lesion from non-lesion tissue is contained in the spatial position of the voxel in question, the intensities of its neighbors, the proximity of gray matter, white matter, or CSF tissue, and often the hemispheric symmetry/asymmetry. In order to quantify and incorporate this type of information, 49 different morphometric features were calculated for each voxel. Figure 2 shows a single axial slice, from a single subject, of a subset of the features used.
To incorporate some higher level spatial information the fi rst feature calculated was an estimate of the tissue type (gray, white, or CSF), using k-means seeded with appropriate initial means ( Figure 2D). Next, distance maps were calculated for each tissue type, giving a distance from a voxel to the nearest voxel for each tissue type (Figures 2E-G). The distance maps are meant to incorporate information about tissue type boundaries, such as the transition from white matter to gray matter, as these can affect classifi cation. Since hemispheres are generally sagittally symmetric and lesions are generally hemispherically asymmetric, each image was left-right fl ipped and subtracted from itself (Combès and Prima, 2008) (Figures 2K-M). The purpose of the fl ipped difference is not to indicate the symmetry of the brain, but to enhance lesion locations which are known to be hemispherically asymmetric. This approach runs the risk of noise at tissue boundaries or near physical abnormalities. Neighborhood means and medians were calculated for neighborhoods with radius 1, 2, and 3 voxels 1 in order to capture some of the surrounding tissues characteristics (not shown). For the fi nal features, dilation and erosion with radius 1, 2, and 3 were performed on each image to quantify information about voxels that may be on the boundaries of lesions (Figures 2H-P). Finally, to capture some idea of global location the normalized x, y, and z location of each voxel was calculated (not shown). The spatial normalization involved independently calculating the mean and standard deviation of the x, y, and z values for all voxels within the brain, then subtracting the mean and dividing by the standard deviation. Since all brains were AC-PC aligned this form of spatial normalization provides a means of usefully comparing voxel indexes between subjects without registering to a standard space.
All features were quantized and then ranked using maximum relevance, minimum redundancy (mRMR) (Peng et al., 2005). This method computes the mutual information between each feature and its target, then adds features that are maximally related to the target classifi cation while being minimally related to the features already present. Thus each additional feature should include the greatest amount of new information. Since SVMs are being used to perform the segmentation step, it is not the features themselves that are important but the areas of the feature space defi ned by the combination of features. The results of ranking the features using mRMR are presented in Table 2.
According to the mRMR ranking shown in Table 2, the FLAIR after grayscale dilation of radius two (Figure 2J), and the T1w fl ipped difference (Figure 2K) are the two most relevant features. The dilation of the FLAIR image is important because it essentially makes the entire lesion cluster match the intensity of the center of the lesion and pushes boundary voxels outside the original lesion dimensions. The T1w fl ipped difference is relevant because lesions are generally hypo-intense on T1w; so, after subtracting the lesion from the healthy tissue on the opposite side of the brain, the lesion becomes hyper-intense. Since most radiologists use the T1w, the FLAIR, or both when performing their manual segmentations, it is encouraging that features from these two sequences provide the most mutually independent information according to mRMR. The following fi ve most relevant features include the normalized x, y, and z location and the distance to white and gray matter. The presence of the x, y, and z values so high in the rankings demonstrates that the spatial location of lesions is quite informative, and the distance to white and gray matter shows that the location within tissue types is very relevant. Another very interesting property of the ranked features is that the T1w, T2w, and FLAIR intensities provide much less relevant, non-redundant information than many of the calculated morphometric features, implying that the morphometric features enhance the local features providing additional information that improves the ability of the automated algorithm to detect the lesions.

DATA REDUCTION
After brain extraction, the entire training data set comprised of 10 subjects contained over 12 million voxels. Since most nonlinear classifi ers have time complexity between O(n log n) and O(n 2 ), reducing the number of samples is required to make the problem computationally tractable. The data is also extremely unbalanced with over 1000 non-lesion voxels to 1 lesion voxel. The problem is thus one of selecting the non-lesion samples that are the most difficult to classify, and then training using those and the lesion samples. To achieve this we created a multi-level elimination process.

Distance threshold
The fi rst reduction step is something of a single class clustering approach. The mean of each feature was calculated for all lesion voxels, and then the maximum distance from any lesion feature vector to the lesion center was found. All non-lesion voxels that were farther from the lesion centroid than this max distance were labeled non-lesion and removed from consideration.
Leave-one-out (LOO) error was calculated for all sets of features numbering 1:49, in mRMR order. The optimum performance was achieved with fi ve features with no false negatives and the highest number of true negatives based on the manual tracing as the gold standard. The reduction at this step was between 2 and 5% which, while not substantial, was shown empirically to improve the performance of the Naive Bayes classifi er.

Naive Bayesian classifi er
After distance thresholding the data, a Naive Bayesian classifi er was employed. The number of features used was selected by calculating the LOO rate for an increasing number of features in mRMR order. The  Figure 3 is the receiver operating characteristic (ROC) curve (Obuchowski, 2003) which represents the trade-off between sensitivity 2 and specifi city 3 that occurs by varying the threshold at which a voxel should be labeled as lesion. The fi rst line in red shows the results based on LOO error, generated by training on 9 subjects and testing on the 10th, repeated for all 10 subjects. The line in green shows the results for the entire hold-out set of 17 subjects. The lowest threshold, 0.001, on the right side of the graph, has a sensitivity of 94.3% and a specifi city of 93.1% in the training set and a sensitivity of 94.3% and a specifi city of 93.9% on the hold-out set. The highest threshold of 0.99 is on the left side of the graph with a sensitivity of 4.2% and a specifi city of 100% on the training set and a sensitivity of 2.6% and a specifi city of 100% in the hold-out set. Figure 4A shows an axial slice of the probability mask for a single subject, overlaid on that subjects FLAIR. Figure 4B shows the probability mask after thresholding, and Figure 4C shows the expert tracings on the same slice.

DISCUSSION
The majority of the false positives present in the predictions are on the boundaries of lesion clusters. Disagreement over boundaries is common between human raters often because precise boundaries do not matter to radiologists. Since this approach was based on the work of a single human expert, it is diffi cult to determine how much the boundary differences are due to misclassifi cation and how much the differences are due to valid disagreement. An additional diffi culty comes from the possibility that an automated approach may be more sensitive to lesion tissue in general. Determining where the automated method truly fails will require the ratings of more radiologists on the same subjects, more back-and-forth with the current radiologist, or both. While overall the results are very good, the unbalanced nature of the data makes the specifi city more important than it may seem since a specifi city of 99.9% could be achieved simply by labeling all tissue non-lesion. In order to improve the specifi city it may be worthwhile to implement some form of false positive reduction similar to that used by Lao et al. (2006). This would allow for a lower threshold, resulting in a higher sensitivity while improving the specifi city over the current performance.
Additionally, the current results could likely be improved by reducing noise present in the features. One source of noise is the inter-subject intensity variation, which could be reduced by using a better intensity standardization method. Since the classifi cation method presented is highly dependent on the relationships between images, a method that preserves these relationships, such as that presented by Jäger and Hornegger (2009), would likely improve accuracy.
Another source of noise in a subset of the features is the initial k-means tissue segmentation which is known to be sensitive to the intensity variations caused by lesions. A more robust tissue seg- best performance was achieved with 34 features. After eliminating any non-lesion voxels that had been correctly classifi ed, only 5% of the data remained. The training set for the SVMs was then composed of all nonlesion voxels incorrectly classifi ed as lesion and all true lesion voxels.

MODEL SELECTION
The performance of SVMs is heavily dependent on the kernel selected and the parameters used (Hsu et al., 2003). In order to capture non-linearities in the data, the radial basis function kernel was selected. This kernel has two parameters affecting its performance, the Gaussian radius, γ, and the cost, C. Often a grid search is used to exhaustively search the parameter space (Hsu et al., 2003). Due to the size of our data set and the diffi culty of the decision boundary, an exhaustive search proved too computationally taxing. Instead, a two level, uniform design (Huang et al., 2007) based search was performed with 13 points in the fi rst level and 9 in the second. The best values for C and γ were selected from the second search level. While the data reduction stage did bring the lesion and non-lesion classes closer to being balanced, there were still approximately 40 nonlesion voxels for every lesion voxel. To overcome this imbalance, the lesion class was given higher weight while training. This makes misclassifying a lesion voxel more costly and essentially decreases false negatives at the expense of increased false positives (Yang et al., 2005). After the best parameters were identifi ed the SVMs were trained on the data and confi gured to provide probability estimates. All SVM tasks were performed using libSVM (Chang and Lin, 2001).  menter, using expectation maximization for instance, would help provide more accurate tissue types. Multiple features rely on the tissue segmentation, some of which were well ranked by mRMR, meaning any improvement should result in improved lesion segmentation. Very little of the SVM kernel parameter space was searched which strongly implies the parameters used were non-optimal (Chang and Lin, 2001;Huang et al., 2007). Thus the current method could be improved by implementing a more thorough parameter search, although such a search would be computationally prohibitive.
While the model trained here is specifi c to NPSLE, the framework of features, ranking methods, and classifi ers could be applied to any brain lesion type. Since there exists a number of disorders that produce brain lesions, there are many opportunities for further applications. Performance on more common disorders would also allow for better confi dence due to larger sample sizes.